That is, for a dataset of size $n$, MSE is the average of the squared prediction errors.
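In symbols, writing $y_i$ for the true label of the $i$-th example and $\hat{y}_i$ for the model's prediction:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$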
### Training Process
After setting up our model, there are several major steps to go through to train it:
1. Initialize the parameters, including the weights $\vec{\omega}$ and the bias $b$. For example, we can draw their initial values from a distribution with mean $0$ and standard deviation $1$.
...
...
4. Repeat steps 2 and 3 until the loss falls below a predefined threshold or the maximum number of epochs is reached (a minimal sketch of this loop is given after this list).
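To make these steps concrete, here is a minimal NumPy sketch of the loop, assuming steps 2 and 3 are the forward/loss computation and the gradient-descent update. It is illustrative only: the data, learning rate, and threshold are placeholders, and the actual training below is done with PaddlePaddle.

```python
import numpy

numpy.random.seed(0)
X = numpy.random.rand(100, 13)                 # placeholder features: 100 houses, 13 attributes
y = X.dot(numpy.random.rand(13, 1)) + 0.5      # placeholder prices

# Step 1: initialize weights and bias from a normal distribution (mean 0, std 1).
w = numpy.random.normal(0.0, 1.0, size=(13, 1))
b = numpy.random.normal(0.0, 1.0)

learning_rate, threshold, max_epochs = 0.1, 1e-4, 1000
for epoch in range(max_epochs):
    # Step 2: forward pass and MSE loss.
    pred = X.dot(w) + b
    loss = numpy.mean((pred - y) ** 2)
    # Step 3: gradient-descent update of the parameters.
    grad_w = 2.0 / len(X) * X.T.dot(pred - y)
    grad_b = 2.0 / len(X) * numpy.sum(pred - y)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    # Step 4: stop once the loss is below the threshold.
    if loss < threshold:
        break
```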
## Dataset
### An Introduction to the Dataset
The UCI housing dataset has 506 instances. Each instance describes the attributes of a house in suburban Boston. The attributes are explained below:
...
...
We split the dataset in two: one part for adjusting the model parameters, namely for training the model, and the other part for testing, i.e., evaluating the model.
When training complex models, we usually have one more split: the validation set. Complex models usually have [Hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_optimization) that need to be set before the training process, such as the number of layers in the network. Because hyperparameters are not part of the model parameters, they cannot be trained using the same loss function. Thus we will try several sets of hyperparameters to train several models and cross-validate them on the validation set to pick the best one; finally, the selected trained model is tested on the test set. Because our model is relatively simple, we will omit this validation process.
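As a schematic illustration of this idea (not part of the housing example, which skips validation; the split ratios and helper names are hypothetical):

```python
def split_dataset(samples, train_ratio=0.6, val_ratio=0.2):
    # Split a list of samples into train / validation / test parts.
    n_train = int(len(samples) * train_ratio)
    n_val = int(len(samples) * val_ratio)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

# Hypothetical usage: pick the hyperparameter whose model does best on the validation set.
# best = min(candidates, key=lambda h: evaluate(train_model(h, train_set), val_set))
```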
## Training
`fit_a_line/trainer.py` demonstrates the training using [PaddlePaddle](http://paddlepaddle.org).
### Datafeeder Configuration
Our program starts with importing necessary packages:
```python
import paddle
import paddle.fluid as fluid
import numpy
```
We encapsulated the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) in our Python module `uci_housing`. This module can
1. download the dataset to `~/.cache/paddle/dataset/uci_housing/housing.data`, if you haven't yet, and
2. [preprocess](#preprocessing) the dataset.
We define data feeders for training and testing. The feeder reads `BATCH_SIZE` data instances each time and feeds them to the training/testing process. If the user wants some randomness in the data order, she can define both a `BATCH_SIZE` and a `buf_size`. That way the data feeder will yield `BATCH_SIZE` data instances drawn from a shuffle of the first `buf_size` instances.
```python
BATCH_SIZE = 20
...
...
test_reader = paddle.batch(
batch_size=BATCH_SIZE)
```
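For reference, a typical way to build both readers under the assumptions above (shuffle a buffer of samples from the `uci_housing` module, then batch them) is sketched below; `buf_size=500` is just an example value, and the elided lines in the snippet above may differ.

```python
import paddle

BATCH_SIZE = 20

train_reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.uci_housing.train(), buf_size=500),
    batch_size=BATCH_SIZE)

test_reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.uci_housing.test(), buf_size=500),
    batch_size=BATCH_SIZE)
```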
### Train Program Configuration
`train_program` sets up the network structure of the model being trained. For linear regression, it is simply a fully connected layer from the input to the output. More complex structures, such as CNNs and RNNs, will be introduced in later chapters. `train_program` must return `avg_loss` as its first return value because it is needed for backpropagation.
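A minimal sketch of such a `train_program` for the housing data (13 input features, one predicted price; the layer names and sizes here are illustrative, and the `fluid` module is the one imported above):

```python
def train_program():
    # Input features (13 attributes) and the true house price.
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    y = fluid.layers.data(name='y', shape=[1], dtype='float32')

    # Linear regression is a single fully connected layer with no activation.
    y_predict = fluid.layers.fc(input=x, size=1, act=None)

    # Mean squared error; returned first because it drives backpropagation.
    loss = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_loss = fluid.layers.mean(loss)
    return avg_loss
```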
Unlike the previous PaddlePaddle v2 API, in the new API (Fluid) we do not need to compute word embeddings ourselves: PaddlePaddle provides a built-in method `fluid.layers.embedding`, which we can use directly to build our N-gram neural network model.
- We define our N-gram neural network structure as below; this structure is used both in `train` and in `infer`.
- The $n-1$ word embedding vectors are concatenated into a single feature vector.
- The hidden feature vector goes through another fully connected layer to become a $|V|$-dimensional vector, and a softmax is applied to obtain the probability of each word being generated.
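A rough sketch of such a structure is shown below. The vocabulary size `dict_size`, `EMBED_SIZE`, and `HIDDEN_SIZE` are illustrative placeholders, and `words` stands for the list of the $n-1$ input word tensors; the real model in this chapter may differ in details.

```python
import paddle.fluid as fluid

EMBED_SIZE = 32     # illustrative embedding width
HIDDEN_SIZE = 256   # illustrative hidden layer size

def ngram_net(words, dict_size):
    # One embedding lookup per context word; param_attr shares a single table.
    embeds = [
        fluid.layers.embedding(
            input=w, size=[dict_size, EMBED_SIZE],
            dtype='float32', param_attr='shared_w')
        for w in words]

    # Concatenate the n-1 word embeddings into one feature vector.
    concat_embed = fluid.layers.concat(input=embeds, axis=1)

    # Hidden layer, then a |V|-way softmax over the next word.
    hidden = fluid.layers.fc(input=concat_embed, size=HIDDEN_SIZE, act='sigmoid')
    predict_word = fluid.layers.fc(input=hidden, size=dict_size, act='softmax')
    return predict_word
```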
Now we begin the training process. It is relatively simple compared to the previous version. `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()` are our training and test sets. Both functions return a **reader**: in PaddlePaddle, a reader is a Python function that returns a Python iterator, which yields a single data instance at a time.
`paddle.batch` takes a reader as input and outputs a **batched reader**: a reader yields a single data instance at a time, while a batched reader yields a minibatch of data instances.
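To illustrate the difference with a toy reader (not the `imikolov` one):

```python
import paddle

# A reader: a function returning an iterator that yields one sample at a time.
def toy_reader():
    for i in range(8):
        yield [i], [i + 1]   # toy (current word, next word) pairs

# A batched reader yields lists of `batch_size` samples instead.
batched_reader = paddle.batch(toy_reader, batch_size=4)

for minibatch in batched_reader():
    print(minibatch)   # a list of 4 (current word, next word) samples
```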
`event_handler` can be passed into `trainer.train` so that we can perform tasks after each step or epoch, such as recording the current metrics or terminating the training process.
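As an illustration only, a minimal handler might just log the loss every few steps. The exact event classes vary across PaddlePaddle versions; `fluid.EndStepEvent`, `event.step`, and `event.metrics` below are assumptions based on the Fluid high-level trainer API.

```python
def event_handler(event):
    # Assumed Fluid trainer event type; adjust to the installed version.
    if isinstance(event, fluid.EndStepEvent):
        if event.step % 10 == 0:
            print("Step %d, metrics %s" % (event.step, event.metrics))
```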
After the model is trained, we can load the saved model parameters and use them for inference, or reuse them in other models and applications.
### Viewing Word Vector
Parameters trained by PaddlePaddle can be viewed with `parameters.get()`. For example, we can check the word vector for the word `apple`.
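A rough sketch of doing so, assuming (as in the snippet further below) that the embedding table is the parameter named `_proj`, and that `word_dict` and the embedding size `embsize` come from the earlier data and model setup:

```python
embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
print(embeddings[word_dict['apple']])   # the learned vector for the word `apple`
```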
### Predicting the next word
We can use our trained model to predict the next word given its preceding N-gram. For example, after about 30 minutes of training the predicted next word for `among a group of` is `unknown`; after several hours of training, the model gives the meaningful prediction `workers`.
### Modifying Word Vector
The word vectors (`embeddings`) we get are a numpy array. We can modify this array and set it back into `parameters`.
```python
def modify_embedding(emb):
# Add your modification here.
pass
modify_embedding(embeddings)
parameters.set("_proj", embeddings)
```
The main entry point of the program is fairly simple:
```python
def main(use_cuda, is_sparse):
    if use_cuda and not fluid.core.is_compiled_with_cuda():
        return
```
### Calculating Cosine Similarity
Cosine similarity is one way of quantifying the similarity between two vectors. The result lies in the range $[-1, 1]$; the larger the value, the more similar the two vectors are.
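A quick NumPy illustration: the vectors compared here reuse the `embeddings` array and `word_dict` from above, and the word `orange` is just a hypothetical example assumed to be in the dictionary.

```python
import numpy

def cosine_similarity(a, b):
    # cos(a, b) = a . b / (||a|| * ||b||), result lies in [-1, 1]
    return numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))

print(cosine_similarity(embeddings[word_dict['apple']],
                        embeddings[word_dict['orange']]))
```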
This chapter introduces word embeddings, the relationship between language models and word embeddings, and how to train neural networks to learn word embeddings.
In information retrieval, the relevance between the query and document keyword can be computed through the cosine similarity of their word embeddings. In grammar analysis and semantic analysis, a previously trained word embedding can initialize models for better performance. In document classification, clustering the word embedding can group synonyms in the documents. We hope that readers can use word embedding models in their work after reading this chapter.