Unverified commit b97602f9, authored by Hao Wang, committed by GitHub

Cherry-pick all user guides to 1.2 (#615)

* Quickstart (#544)

* upload enhowto

* upload cluster quick start en

* review commit

* recover original multi_node.rst index

* fix index of cluster_howto

* fix index

* fix bugs

* upload save_load_variables_en (#549)

* upload save_load_variables_en

* Textual Review

* upload train_on_baidu_cloud_en (#550)

* upload train_on_baidu_cloud_en

* First round review check

* review on webpage, fix minors

* upload indexen (#551)

* upload indexen

* Textual Review

* upload metricsen (#552)

* upload metricsen

* First round review

* minor fix to round 1 review

* Update metricsen.rst

* Change file name and re-review

* lod_tensor_en (#554)

* lod_tensor_en

* Textual Content Check

* Review and fix on webpage

* numpt_array_en (#555)

* numpt_array_en

* Textual Content Check

* fix wrong file location and name

* review on webpage

* simple_model (#557)

* simple_model

* Contextual Review

* review on webpage

* single_node (#558)

* single_node

* Textual Fix

* review on webpage

* test_while_training (#559)

* test_while_training

* Textual Review

Textual Fix

* py_reader_en (#564)

* py_reader_en

* Textual Check

* user guides index of all (#595)

* visualdl en +cn  official readme (#598)

* FluidModel Fix (#611)

- delete the Models index in root index
- update models/index_en.rst
Parent 1fa68c75
@@ -5,9 +5,9 @@
     :maxdepth: 1
 
     beginners_guide/index_en.rst
+    user_guides/index_en.rst
     design/index_en.rst
     howto/index_en.rst
     dev/index_en.rst
     api/index_en.rst
-    user_guides/models/index_en.rst
     advanced_usage/deploy/index_mobile.rst
###############
Basic Concepts
###############
This section will introduce basic concepts in Fluid:
- `LoD-Tensor User Guide <lod_tensor_en.html>`_ : LoD-Tensor is a term unique to Fluid. It appends sequence information to Tensor and supports data of variable lengths.
.. toctree::
:hidden:
lod_tensor_en.rst
#####################
LoD-Tensor User Guide
#####################
LoD (Level-of-Detail) Tensor is a concept unique to Fluid. It is constructed by appending sequence information to a Tensor. The data transferred in Fluid, including the input, output and learnable parameters of the network, are all represented by LoD-Tensor.
This user guide explains the design of LoD-Tensor in Fluid so that you can use this data type more flexibly.
Challenge of variable-length sequences
======================================
In most deep learning frameworks, a mini-batch is represented by Tensor.
For example, if there are 10 pictures in a mini-batch and the size of each picture is 32*32, the mini-batch will be a 10*32*32 Tensor.
Or in the NLP task, there are N sentences in a mini-batch and the length of each sentence is L. Every word is represented by a one-hot vector with D dimensions. Then the mini-batch can be represented by an N*L*D Tensor.
In the two examples above, the size of each sequence element remains the same. However, in many cases the training data are variable-length sequences. For this scenario, most frameworks set a fixed length and pad the shorter sequences with 0 to reach it.
Owing to LoD-Tensor, Fluid does not require the sequences in a mini-batch to have the same length. Therefore tasks sensitive to sequence formats, such as NLP, can be handled without padding.
An index data structure called LoD is introduced in Fluid to split a Tensor into sequences.
Index Structure - LoD
======================
To have a better understanding of the concept of LoD, you can refer to the examples in this section.
**mini-batch consisting of sentences**
Suppose a mini-batch contains three sentences, and each contains 3, 1, 2 words respectively. Then the mini-batch can be represented by a (3+1+2)*D Tensor with some index information appended:
.. code-block:: text

3       1   2
| | |   |   | |

In the text above, each :code:`|` represents a word vector with D dimensions, and the 1-level LoD is made up of the digits 3, 1, 2.
**recursive sequence**
Take a 2-level LoD-Tensor as an example: a mini-batch contains 3 articles, which consist of 3, 1 and 2 sentences respectively, and each sentence contains a different number of words. The mini-batch is formed as follows:
.. code-block:: text
3              1    2
3   2   4      1    2   3
||| ||  ||||   |    ||  |||

The LoD expressing this format is:

.. code-block:: text

[[3,1,2] /*level=0*/ , [3,2,4,1,2,3] /*level=1*/]
**mini-batch consisting of video data**
In computer vision tasks, it is often necessary to deal with high-dimensional objects such as videos and pictures. Suppose a mini-batch contains 3 videos, which are composed of 3 frames, 1 frame and 2 frames respectively. The size of each frame is 640*480. Then the mini-batch can be described as:
.. code-block:: text
3        1   2
口口口   口   口口

The size of the tensor at the bottom is (3+1+2)*640*480. Every :code:`口` represents a 640*480 picture.
**mini-batch consisting of pictures**
Traditionally, for a mini-batch of N pictures with fixed size, LoD-Tensor is described as:
.. code-block:: text
1 1 1 1 1
口口口口 ...
In this case, the index of every element is 1, so no information is lost by regarding the LoD-Tensor as an ordinary tensor:
.. code-block:: text
口口口口 ...
**model parameter**
A model parameter is just a common tensor, which is described as a 0-level LoD-Tensor in Fluid.
LoDTensor expressed by offset
=============================
For quick access to the original sequences, you can use the offset representation: store the start and end offsets of each sequence instead of its length.
In the example above, you can compute the lengths of the fundamental elements:
.. code-block:: text
3 2 4 1 2 3
It is expressed by offset as follows:
.. code-block:: text
0  3    5    9    10    12    15
   =    =    =    =     =     =
   3    2+3  4+5  1+9   2+10  3+12
Therefore we infer that the first sentence starts from word 0 to word 3 and the second sentence starts from word 3 to word 5.
Similarly, for the lengths of the top level of the LoD
.. code-block:: text
3 1 2
It can be expressed by offset:
.. code-block:: text
0  3    4    6
   =    =    =
   3    3+1  4+2
Therefore the LoD-Tensor is expressed by offset:
.. code-block:: text
0        3     4       6
  3 5 9    10     12 15
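The conversion from primitive lengths to offsets is just a cumulative sum. The following minimal sketch is plain Python rather than a Fluid API; it simply reproduces the numbers above:

.. code-block:: python

def lengths_to_offsets(lengths):
    # prepend 0 and accumulate the lengths to obtain the offset form
    offsets = [0]
    for length in lengths:
        offsets.append(offsets[-1] + length)
    return offsets

print(lengths_to_offsets([3, 1, 2]))           # [0, 3, 4, 6]
print(lengths_to_offsets([3, 2, 4, 1, 2, 3]))  # [0, 3, 5, 9, 10, 12, 15]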
LoD-Tensor
=============
A LoD-Tensor can be regarded as a tree, in which the leaves are the fundamental sequence elements and the branches are the indices of those elements.
There are two ways to express the sequence information of a LoD-Tensor in Fluid: offset and primitive length. LoD-Tensor is expressed by offset inside Paddle to offer quicker access to sequences; it is expressed by primitive length in the Python API so that users can understand and compute it more easily. The primitive lengths are named :code:`recursive_sequence_lengths` .
Take a 2-level LoD-Tensor mentioned above as an example:
.. code-block:: text
3 1 2
3 2 4 1 2 3
||| || |||| | || |||
- LoD-Tensor expressed by offset: [ [0,3,4,6] , [0,3,5,9,10,12,15] ]
- LoD-Tensor expressed by primitive length: recursive_sequence_lengths=[ [3-0 , 4-3 , 6-4] , [3-0 , 5-3 , 9-5 , 10-9 , 12-10 , 15-12] ]
Take a text sequence as an example: [3, 1, 2] indicates that there are 3 articles in the mini-batch, which contain 3, 1 and 2 sentences respectively. [3, 2, 4, 1, 2, 3] indicates the numbers of words in those sentences.
recursive_seq_lens is a doubly nested list, in other words a list of lists. The size of the outermost list is the number of nested levels, namely the lod-level; each inner list contains the sizes of the elements at that level.
.. code-block:: python
#Create lod-tensor
import paddle.fluid as fluid
import numpy as np
a = fluid.create_lod_tensor(np.array([[1],[1],[1],
[1],[1],
[1],[1],[1],[1],
[1],
[1],[1],
[1],[1],[1]]).astype('int64') ,
[[3,1,2] , [3,2,4,1,2,3]],
fluid.CPUPlace())
#Check lod-tensor nested layers
print(len(a.recursive_sequence_lengths()))
# output: 2

#Check the number of the most fundamental elements
print(sum(a.recursive_sequence_lengths()[-1]))
# output: 15 (3+2+4+1+2+3=15)
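Since :code:`a` is the same 2-level example discussed above, you can also inspect its offset-based LoD with the :code:`lod()` method used at the end of the complete code example below; the output should match the offset representation derived earlier:

.. code-block:: python

#Check the offset-based LoD of the tensor created above
print(a.lod())
# expected output: [[0, 3, 4, 6], [0, 3, 5, 9, 10, 12, 15]]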
Code examples
==============
In the code example of this section, the input variable x is expanded according to the specified LoD level of y. The example below covers some fundamental concepts of LoD-Tensor. By following the code, you will
- Have a direct understanding of the implementation of :code:`fluid.layers.sequence_expand` in Fluid
- Know how to create LoD-Tensor in Fluid
- Learn how to print the content of LoDTensor
**Define the Process of Computing**
layers.sequence_expand expands x according to the LoD of y. For more explanation of :code:`fluid.layers.sequence_expand` , please read :ref:`api_fluid_layers_sequence_expand` first.

Code for sequence expansion:
.. code-block:: python
x = fluid.layers.data(name='x', shape=[1], dtype='float32', lod_level=0)
y = fluid.layers.data(name='y', shape=[1], dtype='float32', lod_level=1)
out = fluid.layers.sequence_expand(x=x, y=y, ref_level=0)
*Note*: The dimension of the input LoD-Tensor is only related to the dimension of the data actually fed in. The shape values set for x and y when defining the network structure are just placeholders and have little influence on the result.
**Create Executor**
.. code-block:: python
place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
**Prepare Data**
Here we use :code:`fluid.create_lod_tensor` to create the input data of :code:`sequence_expand` and expand x_d by defining the LoD of y_d. The output value is only related to the LoD of y_d, and the data of y_d is not involved in the computation; its dimension must be consistent with its LoD[-1] .
About the user guide of :code:`fluid.create_lod_tensor()` , please refer to :ref:`api_fluid_create_lod_tensor` .
Code
.. code-block:: python
x_d = fluid.create_lod_tensor(np.array([[1.1],[2.2],[3.3],[4.4]]).astype('float32'), [[1,3]], place)
y_d = fluid.create_lod_tensor(np.array([[1.1],[1.1],[1.1],[1.1],[1.1],[1.1]]).astype('float32'), [[1,3], [2,1,2,1]],place)
**Execute Computing**
For a tensor whose LoD > 1 in Fluid, as with data of other types, the order of feeding data is defined by :code:`feed` . In addition, the parameter :code:`return_numpy=False` needs to be added to exe.run() to get the LoD-Tensor output, because the results are Tensors with LoD information.
.. code-block:: python
results = exe.run(fluid.default_main_program(),
feed={'x':x_d, 'y': y_d },
fetch_list=[out],return_numpy=False)
**Check the result of LoDTensor**

Because of the special attributes of LoDTensor, you cannot print its content directly. The usual solution is to fetch the LoDTensor as an output of the network and then convert it into a numpy array with numpy.array(lod_tensor):
.. code-block:: python
np.array(results[0])
Output:
.. code-block:: text
array([[1.1],[2.2],[3.3],[4.4],[2.2],[3.3],[4.4],[2.2],[3.3],[4.4]])
**Check the length of sequence**
You can check the sequence lengths of the output LoDTensor through :code:`recursive_sequence_lengths()` :
.. code-block:: python
results[0].recursive_sequence_lengths()
Output
.. code-block:: text
[[1L, 3L, 3L, 3L]]
**Complete Code**
You can check the output by executing the following complete code:
.. code-block:: python
#Load
import paddle
import paddle.fluid as fluid
import numpy as np
#Define forward computation
x = fluid.layers.data(name='x', shape=[1], dtype='float32', lod_level=0)
y = fluid.layers.data(name='y', shape=[1], dtype='float32', lod_level=1)
out = fluid.layers.sequence_expand(x=x, y=y, ref_level=0)
#Define place for computation
place = fluid.CPUPlace()
#Create executor
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
#Create LoDTensor
x_d = fluid.create_lod_tensor(np.array([[1.1], [2.2],[3.3],[4.4]]).astype('float32'), [[1,3]], place)
y_d = fluid.create_lod_tensor(np.array([[1.1],[1.1],[1.1],[1.1],[1.1],[1.1]]).astype('float32'), [[1,3], [1,2,1,2]], place)
#Start computing
results = exe.run(fluid.default_main_program(),
feed={'x':x_d, 'y': y_d },
fetch_list=[out],return_numpy=False)
#Output result
print("The data of the result: {}.".format(np.array(results[0])))
#print the length of sequence of result
print("The recursive sequence lengths of the result: {}.".format(results[0].recursive_sequence_lengths()))
#print the LoD of result
print("The LoD of the result: {}.".format(results[0].lod()))
Summary
========
By now you should have a good understanding of the concept of LoD-Tensor. Trying to change x_d and y_d in the code above and checking the output may help you get a better grasp of this flexible structure.
For more model applications of LoDTensor, you can refer to `Word2vec <../../../beginners_guide/basics/word2vec/index.html>`_ , `Individual Recommendation <../../../beginners_guide/basics/recommender_system/index.html>`_ and `Sentiment Analysis <../../../beginners_guide/basics/understand_sentiment/index.html>`_ in the beginner's guide.
For more difficult and complex application examples, please refer to the related information about `models <../../../user_guides/models/model12_en.html>`_ .
.. _user_guide_configure_simple_model_en:
#######################
Set up Simple Model
#######################
When solving a practical problem, at the beginning you can model the problem logically and get a clear picture of the **input data type** , **computing logic** , **target solution** and **optimization algorithm** of the model.
PaddlePaddle provides abundant operators to implement the logic of a model. In this article, we take a simple regression task as an example to clarify how to build a model with PaddlePaddle.
For the complete code of the example, please refer to `fit_a_line <https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book/test_fit_a_line.py>`_ .
Description and Definition of Problem
######################################
Description: given a pair of data :math:`<X, Y>`, figure out a function :math:`f` such that :math:`y=f(x)` . :math:`x \in X` represents the feature of a sample, which is a 13-dimensional real number vector; :math:`y \in Y` is a real number representing the corresponding value of the given sample.
We can try to model the problem with a regression model. Though lots of loss functions are available for regression problems, here we choose the commonly used mean squared error. To simplify the problem, we assume :math:`f` is a simple linear transformation function and choose the stochastic gradient descent (SGD) algorithm to solve the problem.
+--------------------------+-------------------------------------------------------------------------------------+
| input data type | sample feature: 13-dimension real number |
+ +-------------------------------------------------------------------------------------+
| | sample label: 1-dimension real number |
+--------------------------+-------------------------------------------------------------------------------------+
| computing logic | use linear model to generate 1-dimensional real number as predicted output of model |
+--------------------------+-------------------------------------------------------------------------------------+
| target solution          | minimize mean squared error between predicted output of model and sample label      |
+--------------------------+-------------------------------------------------------------------------------------+
| optimization algorithm   | stochastic gradient descent (SGD)                                                   |
+--------------------------+-------------------------------------------------------------------------------------+
Model with PaddlePaddle
#######################
After the input data format, model structure, loss function and optimization algorithm are logically determined, you need to use PaddlePaddle APIs and operators to implement the logic of the model. A typical model consists of four parts: the format of the input data, the forward computing logic, the loss function and the optimization algorithm.
Data Layer
-----------
PaddlePaddle provides :code:`fluid.layers.data()` to describe format of input data.
The output of :code:`fluid.layers.data()` is a Variable, which is in fact a Tensor. Tensor can represent multi-dimensional data with great expressive power. In order to accurately describe the data structure, it is usually necessary to indicate the shape and type of the data. The shape is an integer vector and the type can be a string. For the currently supported data types, please refer to :ref:`user_guide_paddle_support_data_types_en` . Data is usually read in batches to train the model. Since the batch size may vary and the data operator infers the batch size from the actual data, the batch size is omitted from the shape; it is enough to specify the shape of a single sample. For more advanced usage, please refer to :ref:`user_guide_customize_batch_size_rank_en` . :math:`x` is a real number vector of :math:`13` dimensions while :math:`y` is a real number. The data layers can be defined as follows:
.. code-block:: python
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
Data in this example model are relatively simple. In fact, the data operator can describe variable-length and nested sequence data. You can also use :code:`open_files` to open files for training. For more detailed documentation, please refer to :ref:`user_guide_prepare_data_en` .
Logic of Forward Computing
---------------------------
The most important part of a model is to implement the computing logic. PaddlePaddle provides lots of operators encapsulated at different granularities, and these operators usually correspond to one kind or one group of transformation logic. The output of an operator is the result of the transformation applied to its input. Users can flexibly combine operators to implement models with complex logic. For example, many convolutional operators are used in image tasks, and LSTM/GRU operators are used in sequence tasks. Various operators are usually combined in complex models to implement complex transformations. PaddlePaddle provides a natural way to combine operators; the following example shows the typical method:
.. code-block:: python
op_1_out = fluid.layers.op_1(input=op_1_in, ...)
op_2_out = fluid.layers.op_2(input=op_1_out, ...)
...
In the example above, op_1 and op_2 represent types of operators, such as fc, which performs a linear transformation (fully connected layer), or conv, which performs a convolutional transformation. The computing order of the operators and the direction of data flow are defined by how their inputs and outputs are connected. In the example above, the output of op_1 is the input of op_2, so op_1 is computed first and then op_2. For more complex models, control flow operators may be needed to let the model behave dynamically according to the input data; for this, IfElseOp, WhileOp and other operators are provided in PaddlePaddle. For documentation of these operators, please refer to :code:`fluid.layers` . For this task, we use an fc operator:
.. code-block:: python
y_predict = fluid.layers.fc(input=x, size=1, act=None)
Loss Function
--------------
The loss function corresponds to the target solution, and the model is solved by minimizing the loss value. The outputs of the loss functions of most models are real numbers, but a loss operator in PaddlePaddle works on a single sample: when a batch is fed, the loss operator produces one output per sample, each corresponding to the loss of that sample. Therefore we usually append operators like ``mean`` after the loss function to reduce the losses. After each forward iteration, a loss value is returned, and then PaddlePaddle automatically applies the chain rule to compute the gradients of every parameter and variable in the model. Here we use mean squared error cost:
.. code-block:: python
cost = fluid.layers.square_error_cost(input=y_predict, label=y)
avg_cost = fluid.layers.mean(cost)
Optimization Method
---------------------
After the loss function is defined, we can get the loss value by forward computing and then get the gradients of the parameters with the chain rule. Having obtained the gradients, we need to update the parameters. The simplest algorithm is stochastic gradient descent (SGD): :math:`w=w - \eta \cdot g` . However, plain SGD has some disadvantages, such as unstable convergence. To improve the training speed and the quality of the model, researchers have come up with many optimization algorithms, including :code:`Momentum` , :code:`RMSProp` , :code:`Adam` and so on. Each optimization algorithm uses a different strategy to update the model parameters, and usually we choose an appropriate algorithm according to the specific task and model. No matter which optimization algorithm is adopted, the learning rate is usually an important hyperparameter that needs to be specified and carefully tuned by trial. Take stochastic gradient descent as an example here:
.. code-block:: python
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
For more optimization operators, please refer to :code:`fluid.optimizer()` .
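Putting the pieces above together, a minimal end-to-end sketch might look as follows. The call to :code:`sgd_optimizer.minimize` and the randomly generated batch are illustrative assumptions added here to make the snippet self-contained; they are not part of the original fit_a_line example.

.. code-block:: python

import numpy
import paddle.fluid as fluid

# data layer
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
# forward computing logic
y_predict = fluid.layers.fc(input=x, size=1, act=None)
# loss function
cost = fluid.layers.square_error_cost(input=y_predict, label=y)
avg_cost = fluid.layers.mean(cost)
# optimization method
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_cost)

# run one iteration on a randomly generated batch to check the program
exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
loss, = exe.run(fluid.default_main_program(),
                feed={'x': numpy.random.random((32, 13)).astype('float32'),
                      'y': numpy.random.random((32, 1)).astype('float32')},
                fetch_list=[avg_cost])
print(loss)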
What to do next?
#################
Attention needs to be paid to the **Data Layer**, **Forward Computing Logic**, **Loss Function** and **Optimization Method** when you implement models with PaddlePaddle.
The data format, computing logic, loss function and optimization method all differ among tasks. A rich set of model examples is provided in PaddlePaddle, and you can build your own model structure by referring to them. You can visit the `Model Repository <https://github.com/PaddlePaddle/models/tree/develop/fluid>`_ for the examples in the official documentation.
@@ -13,25 +13,28 @@ VisualDL is a visualization tool designed for deep learning tasks, including sca...
 achieving native performance and customized effects.
 
 ## Components
 
-VisualDL currently supports 4 kinds of components:
+VisualDL currently supports the following components:
 
-- graph
 - scalar
-- image
 - histogram
+- image
+- audio
+- graph
+- high dimensional
 
-### Graph
+### Scalar
 
-Compatible with ONNX (Open Neural Network Exchange)[https://github.com/onnx/onnx]; combined with the Python SDK, VisualDL is compatible with most mainstream DNN platforms, including PaddlePaddle, PyTorch and MXNet.
+Scalar can be used to show the error trends of training and testing.
 
 <p align="center">
-<img src="https://raw.githubusercontent.com/daming-lu/large_files/master/graph_demo.gif" width="60%" />
+<img src="https://raw.githubusercontent.com/daming-lu/large_files/master/loss_scalar.gif" width="60%"/>
 </p>
 
-### Scalar
+### Histogram
 
-Scalar can be used to show the error trends of training and testing.
+Histogram can be used to visualize the distribution trends of the elements in any tensor.
 
 <p align="center">
-<img src="https://raw.githubusercontent.com/daming-lu/large_files/master/loss_scalar.gif" width="60%"/>
+<img src="https://raw.githubusercontent.com/daming-lu/large_files/master/histogram.gif" width="60%"/>
 </p>
 
 ### Image
@@ -41,12 +44,21 @@ VisualDL currently supports 4 kinds of components:
 <img src="https://raw.githubusercontent.com/daming-lu/large_files/master/loss_image.gif" width="60%"/>
 </p>
 
-### Histogram
+### Audio
 
-Histogram can be used to visualize the distribution trends of the elements in any tensor.
+Audio can be used to play input or generated audio samples.
+
+### Graph
+
+Compatible with ONNX (Open Neural Network Exchange)[https://github.com/onnx/onnx]; combined with the Python SDK, VisualDL is compatible with most mainstream DNN platforms, including PaddlePaddle, PyTorch and MXNet.
 
 <p align="center">
-<img src="https://raw.githubusercontent.com/daming-lu/large_files/master/histogram.gif" width="60%"/>
+<img src="https://raw.githubusercontent.com/daming-lu/large_files/master/graph_demo.gif" width="60%" />
+</p>
+
+### High Dimensional
+
+Visualizes embeddings by mapping high-dimensional data into 2D/3D.
+
+<p align="center">
+<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/docs/getting_started/high_dimensional_3d.png" width="60%"/>
 </p>
 
 ## Quick Start
@@ -58,12 +70,14 @@ pip install --upgrade visualdl
 # run an example; vdl_create_scratch_log will create a test log
 vdl_create_scratch_log
-visualDL --logdir=scratch_log --port=8080
+visualdl --logdir=scratch_log --port=8080
 # visit http://127.0.0.1:8080
 ```
 
-If the steps above run into problems, it is most likely because of different python or pip versions or locations; the installation methods below can solve it.
+If you see `TypeError: __init__() got an unexpected keyword argument 'file'`, it is because your protobuf version is lower than 3.5; running `pip install --upgrade protobuf` will fix it.
+If the steps above run into other problems, it is most likely because of different python or pip versions or locations; the installation methods below can solve it.
 
 ## Install with virtualenv
@@ -100,13 +114,11 @@ pip install --upgrade visualdl
 # run an example; vdl_create_scratch_log will create a test log
 vdl_create_scratch_log
-visualDL --logdir=scratch_log --port=8080
+visualdl --logdir=scratch_log --port=8080
 # visit http://127.0.0.1:8080
 ```
 
-If you see `TypeError: __init__() got an unexpected keyword argument 'file'`, it is because your protobuf version is lower than 3.5; running `pip install --upgrade protobuf` will fix it.
 If you still encounter installation problems in the virtual environment, try the following method.
@@ -134,7 +146,7 @@ pip install --upgrade visualdl
 # run an example; vdl_create_scratch_log will create a test log
 vdl_create_scratch_log
-visualDL --logdir=scratch_log --port=8080
+visualdl --logdir=scratch_log --port=8080
 # visit http://127.0.0.1:8080
 ```
@@ -151,7 +163,7 @@ python setup.py bdist_wheel
 pip install --upgrade dist/visualdl-*.whl
 ```
 
-If packaging or installation runs into other problems and you just want to run Visual DL without installing it, see [here](https://github.com/PaddlePaddle/VisualDL/blob/develop/docs/develop/how_to_dev_frontend_cn.md)
+If packaging or installation runs into other problems and you just want to run Visual DL without installing it, see [here](https://github.com/PaddlePaddle/VisualDL/blob/develop/docs/how_to_dev_frontend_en.md)
 
 ## SDK
@@ -210,11 +222,16 @@ int main() {
 Once log data has been generated during training, the board can be launched to preview the visualizations in real time:
 ```
-visualDL --logdir <some log dir>
+visualdl --logdir <some log dir>
 ```
 The board also supports the following parameters for remote access:
 - `--host` set the IP
 - `--port` set the port
-- `--model_pb` specify a model file in ONNX format
+- `-m / --model_pb` specify a model file in ONNX format
+
+### Contributing
+
+VisualDL is an open-source project jointly launched by [PaddlePaddle](http://www.paddlepaddle.org/) and
+[ECharts](http://echarts.baidu.com/). We welcome everyone to use it, give feedback and contribute code.
# Visual DL Toolset
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/docs/images/vs-logo.png" width="60%" />
</p>
## Introduction
VisualDL is a deep learning visualization tool that can help design deep learning jobs.
It includes features such as scalar, parameter distribution, model structure and image visualization.
Currently it is being developed at a high pace.
New features will be continuously added.
At present, most DNN frameworks use Python as their primary language. VisualDL supports Python by nature.
Users can get plentiful visualization results by simply adding a few lines of Python code into their model before training.
Besides the Python SDK, VisualDL is written in C++ at the low level. It also provides a C++ SDK that
can be integrated into other platforms.
## Component
VisualDL provides the following components:
- scalar
- histogram
- image
- audio
- graph
- high dimensional
### Scalar
Scalar can be used to show the trends of error during training.
<p align="center">
<img src="https://raw.githubusercontent.com/daming-lu/large_files/master/loss_scalar.gif" width="60%"/>
</p>
### Histogram
Histogram can be used to visualize parameter distribution and trends for any tensor.
<p align="center">
<img src="https://raw.githubusercontent.com/daming-lu/large_files/master/histogram.gif" width="60%"/>
</p>
### Image
Image can be used to visualize any tensor or intermediate generated image.
<p align="center">
<img src="https://raw.githubusercontent.com/daming-lu/large_files/master/loss_image.gif" width="60%"/>
</p>
### Audio
Audio can be used to play input audio samples or generated audio samples.
### Graph
Graph is compatible with ONNX ([Open Neural Network Exchange](https://github.com/onnx/onnx)).
Combined with the Python SDK, VisualDL is compatible with most major DNN frameworks, including
PaddlePaddle, PyTorch and MXNet.
<p align="center">
<img src="https://raw.githubusercontent.com/daming-lu/large_files/master/graph_demo.gif" width="60%" />
</p>
### High Dimensional
High Dimensional can be used to visualize data embeddings by projecting high-dimensional data into 2D / 3D.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/VisualDL/develop/docs/getting_started/high_dimensional_3d.png" width="60%"/>
</p>
## Quick Start
To give VisualDL a quick test, please use the following commands.
```
# Install the VisualDL. Preferably under a virtual environment or anaconda.
pip install --upgrade visualdl
# run a demo, vdl_create_scratch_log will create logs for testing.
vdl_create_scratch_log
visualdl --logdir=scratch_log --port=8080
# visit http://127.0.0.1:8080
```
If you encounter the error `TypeError: __init__() got an unexpected keyword argument 'file'`, that is because your protobuf version is lower than 3.5; simply running `pip install --upgrade protobuf` will fix the issue.
If you run into any other issues with the steps above, they are most likely caused by environmental problems such as different python or pip versions or locations.
The following installation methods might fix the issues.
## Install with Virtualenv
[Virtualenv](https://virtualenv.pypa.io/en/stable/) creates isolated Python environments that prevent interference
from other Python programs on the same machine and make sure Python and pip are located properly.
On macOS, install pip and virtualenv by:
```
sudo easy_install pip
pip install --upgrade virtualenv
```
On Linux, install pip and virtualenv by:
```
sudo apt-get install python3-pip python3-dev python-virtualenv
```
Then create a Virtualenv environment by one of the following commands:
```
virtualenv ~/vdl             # for Python 2.7
virtualenv -p python3 ~/vdl  # for Python 3.x
```
```~/vdl``` will be your Virtualenv directory; you may choose to install it anywhere.
Activate your Virtualenv environment by:
```
source ~/vdl/bin/activate
```
Now you should be able to install VisualDL and run our demo:
```
pip install --upgrade visualdl
# run a demo, vdl_create_scratch_log will create logs for testing.
vdl_create_scratch_log
visualdl --logdir=scratch_log --port=8080
# visit http://127.0.0.1:8080
```
If you still have issues installing VisualDL from Virtualenv, try following installation method.
## Install with Anaconda
Anaconda is a python distribution, with installation and package management tools. Also it is an environment manager,
which provides the facility to create different python environments, each with their own settings.
Follow the instructions on the [Anaconda download site](https://www.anaconda.com/download) to download and install Anaconda.
Download the Python 3.6 command-line installer.
Create a conda environment named ```vdl``` or anything you want by:
```
conda create -n vdl pip python=2.7 # or python=3.3, etc.
```
Activate the conda environment by:
```
source activate vdl
```
Now you should be able to install VisualDL and run our demo:
```
pip install --upgrade visualdl
# run a demo, vdl_create_scratch_log will create logs for testing.
vdl_create_scratch_log
visualdl --logdir=scratch_log --port=8080
# visit http://127.0.0.1:8080
```
If you still have issues installing VisualDL, try installing from sources as in following section.
### Install from source
```
#Preferably under a virtualenv or anaconda.
git clone https://github.com/PaddlePaddle/VisualDL.git
cd VisualDL
python setup.py bdist_wheel
pip install --upgrade dist/visualdl-*.whl
```
If there are still issues regarding the ```pip install```, you can still start Visual DL by starting the dev server
[here](https://github.com/PaddlePaddle/VisualDL/blob/develop/docs/how_to_dev_frontend_en.md)
## SDK
VisualDL provides both Python SDK and C++ SDK in order to fit more use cases.
### Python SDK
VisualDL now supports both Python 2 and Python 3.
Below is an example of creating a simple Scalar component and inserting data from different timestamps:
```python
import random
from visualdl import LogWriter
logdir = "./tmp"
logger = LogWriter(logdir, sync_cycle=10000)
# mark the components with 'train' label.
with logger.mode("train"):
# create a scalar component called 'scalars/scalar0'
scalar0 = logger.scalar("scalars/scalar0")
# add some records during DL model running.
for step in range(100):
scalar0.add_record(step, random.random())
```
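As a small extension of the example above, and using only the calls already shown (`LogWriter`, `mode`, `scalar`, `add_record`), you could record a second run under a `test` mode in the same log directory. The tags, directory and step counts below are arbitrary choices for illustration:

```python
import random
from visualdl import LogWriter

logger = LogWriter("./tmp", sync_cycle=10000)

# scalars recorded under different modes show up as separate curves in the board
with logger.mode("train"):
    train_loss = logger.scalar("scalars/loss")
with logger.mode("test"):
    test_loss = logger.scalar("scalars/loss")

for step in range(100):
    train_loss.add_record(step, random.random())
    if step % 10 == 0:
        test_loss.add_record(step, random.random())
```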
### C++ SDK
Here is the C++ SDK identical to the Python SDK example above:
```c++
#include <cstdlib>
#include <string>
#include "visualdl/logic/sdk.h"
namespace vs = visualdl;
namespace cp = visualdl::components;
int main() {
const std::string dir = "./tmp";
vs::LogWriter logger(dir, 10000);
logger.SetMode("train");
auto tablet = logger.AddTablet("scalars/scalar0");
cp::Scalar<float> scalar0(tablet);
for (int step = 0; step < 1000; step++) {
float v = (float)std::rand() / RAND_MAX;
scalar0.AddRecord(step, v);
}
return 0;
}
```
## Launch Visual DL
After some logs have been generated during training, users can launch Visual DL application to see real-time data visualization by:
```
visualdl --logdir <some log dir>
```
VisualDL also supports the following optional parameters for remote access:
- `--host` set the IP
- `--port` set the port
- `-m / --model_pb` specify an ONNX-format model file to view its graph
### Contribute
VisualDL was initially created by [PaddlePaddle](http://www.paddlepaddle.org/) and
[ECharts](http://echarts.baidu.com/).
We welcome everyone to use, comment on and contribute to Visual DL :)
################
Model Evaluation
################
Model evaluation uses metrics to reflect how well the model achieves the expected target, and the metrics are determined by the model task. Model evaluation is an important basis for adjusting hyperparameters during training and for assessing the effect of the model. The inputs of a metric function are the predictions (preds) and labels of the current model, and its output is customized. A metric function is very similar to a loss function, but a metric is not a component of the training network.
Users can get the current preds and labels from the training network and customize the metric function on the Python side, or accelerate the metric computation on the GPU by customizing a C++ Operator.
The ``paddle.fluid.metrics`` module contains this feature.
Common metrics
##################
The metric function varies with different model tasks, and so does the metric construction.
The labels in a regression task are real numbers; you can refer to the MSE (Mean Squared Error) method.
The commonly used metrics for classification tasks are classification metrics. The metric functions mentioned in this section are generally metrics for binary classification; for details of the metrics for multi-class and multi-label tasks, please read the corresponding API documents. For example, the ranking metric auc also works for multi-class tasks because they can be treated as 0-1 classification tasks.
Fluid contains common classification metrics such as Precision, Recall and Accuracy. Please read the API documentation for more. Taking ``Precision`` as an example, the usage is as follows:
.. code-block:: python
>>> import paddle.fluid as fluid
>>> labels = fluid.layers.data(name="label", shape=[1], dtype="int32")
>>> data = fluid.layers.data(name="data", shape=[32, 32], dtype="int32")
>>> pred = fluid.layers.fc(input=data, size=1000, act="tanh")
>>> acc = fluid.metrics.Precision()
>>> for pass_id in range(PASSES):
>>> acc.reset()
>>> for data in train_reader():
>>> loss, preds, labels = exe.run(fetch_list=[cost, preds, labels])
>>> acc.update(preds=preds, labels=labels)
>>> numpy_acc = acc.eval()
As for other tasks such as MultiTask Learning, Metric Learning, and Learning To Rank, please refer to the API documentation for their various metric construction methods.
Custom metrics
################
Fluid supports custom metrics and is flexible enough to support a wide range of computing tasks. Below, model evaluation is implemented with a metric function composed of a simple counter, where ``preds`` are the prediction values and ``labels`` are the given labels.
.. code-block:: python
>>> # MetricBase and the _is_numpy_ helper used below live in paddle.fluid.metrics
>>> from paddle.fluid.metrics import MetricBase, _is_numpy_
>>> class MyMetric(MetricBase):
>>> def __init__(self, name=None):
>>> super(MyMetric, self).__init__(name)
>>> self.counter = 0 # simple counter
>>> def reset(self):
>>> self.counter = 0
>>> def update(self, preds, labels):
>>> if not _is_numpy_(preds):
>>> raise ValueError("The 'preds' must be a numpy ndarray.")
>>> if not _is_numpy_(labels):
>>> raise ValueError("The 'labels' must be a numpy ndarray.")
>>> self.counter += sum(preds == labels)
>>> def eval(self):
>>> return self.counter
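A minimal usage sketch of the metric above follows the same reset/update/eval pattern as the ``Precision`` example; the numpy arrays here are made-up values standing in for the preds and labels fetched from the training network:

.. code-block:: python

>>> import numpy as np
>>> metric = MyMetric()
>>> metric.reset()
>>> # preds and labels would normally come from exe.run(...) during training
>>> preds = np.array([0, 1, 1, 0])
>>> labels = np.array([0, 1, 0, 0])
>>> metric.update(preds=preds, labels=labels)
>>> print(metric.eval())  # 3 samples were predicted correctly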
###############################
Model Evaluation and Debugging
###############################
There are two articles in this section:
- `Model Evaluation <../evaluation_and_debugging/evaluation/metrics_en.html>`_ : introduces the construction of common metrics.
- `Visual DL <../evaluation_and_debugging/debug/visualdl_en.html>`_ : how to use Visual DL to visualize the training process.
.. toctree::
:hidden:
evaluation/metrics_en.rst
debug/visualdl_en.md
.. _user_guide_use_numpy_array_as_train_data_en:
#################################
Take Numpy Array as Training Data
#################################
PaddlePaddle Fluid supports configuring the data layer with :code:`fluid.layers.data()` .
Then you can feed the training data either as a Numpy Array or as a C++-side
:code:`fluid.LoDTensor` created directly from Python, passing it to :code:`fluid.Executor` or :code:`fluid.ParallelExecutor`
through :code:`Executor.run(feed=...)` .
Configure Data Layer
############################
With :code:`fluid.layers.data()` , you can configure data layer in neural network. Details are as follows:
.. code-block:: python
import paddle.fluid as fluid
image = fluid.layers.data(name="image", shape=[3, 224, 224])
label = fluid.layers.data(name="label", shape=[1], dtype="int64")
# use image/label as layer input
prediction = fluid.layers.fc(input=image, size=1000, act="softmax")
loss = fluid.layers.cross_entropy(input=prediction, label=label)
...
In the code above, :code:`image` and :code:`label` are two input data layers created by :code:`fluid.layers.data` . :code:`image` is float data of shape :code:`[3, 224, 224]` ; :code:`label` is int data of shape :code:`[1]` . Note that:
1. :code:`-1` represents the batch size dimension by default in Fluid, and it is prepended to the first dimension of :code:`shape` by default. Therefore in the code above, a numpy array of shape :code:`[32, 3, 224, 224]` can be fed to :code:`image` . If you want to customize the position of the batch size dimension, please set :code:`fluid.layers.data(append_batch_size=False)` . Please refer to the tutorial in the advanced user guide: :ref:`user_guide_customize_batch_size_rank_en` .
2. The data type of category labels in Fluid is :code:`int64` and the labels start from 0. For the supported data types, please refer to :ref:`user_guide_paddle_support_data_types_en` .
.. _user_guide_feed_data_to_executor_en:
Transfer Train Data to Executor
################################
Both :code:`Executor.run` and :code:`ParallelExecutor.run` receive a parameter :code:`feed` .
The parameter is a Python dict. Its key is the name of a data layer, such as :code:`image` in the code above, and its value is the corresponding numpy array.
For example:
.. code-block:: python
exe = fluid.Executor(fluid.CPUPlace())
exe.run(feed={
"image": numpy.random.random(size=(32, 3, 224, 224)).astype('float32'),
"label": numpy.random.random(size=(32, 1)).astype('int64')
})
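Putting the data layer and the :code:`feed` call together, a minimal end-to-end sketch could look like the code below. The classifier, the loss and the random batch are illustrative assumptions used only to make the snippet complete:

.. code-block:: python

import numpy
import paddle.fluid as fluid

image = fluid.layers.data(name="image", shape=[3, 224, 224])
label = fluid.layers.data(name="label", shape=[1], dtype="int64")
prediction = fluid.layers.fc(input=image, size=1000, act="softmax")
loss = fluid.layers.mean(fluid.layers.cross_entropy(input=prediction, label=label))

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
# feed one batch of 32 random samples and fetch the loss
loss_value, = exe.run(fluid.default_main_program(),
                      feed={
                          "image": numpy.random.random(size=(32, 3, 224, 224)).astype('float32'),
                          "label": numpy.random.randint(0, 1000, size=(32, 1)).astype('int64')
                      },
                      fetch_list=[loss])
print(loss_value)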
Advanced Usage
###############
How to feed Sequence Data
--------------------------
Sequence data is a unique data type supported by PaddlePaddle Fluid. You can take :code:`LoDTensor` as input data type.
You need to:
1. Feed all data to be trained in a mini-batch.
2. Get the length of each sequence.
You can use :code:`fluid.create_lod_tensor` to create :code:`LoDTensor` .
To feed sequence information, it is necessary to set the sequence nested depth :code:`lod_level` .
For instance, if the training data are sentences consisting of words, :code:`lod_level=1`; if train data are paragraphs which consists of sentences that consists of words, :code:`lod_level=2` .
For example:
.. code-block:: python
sentence = fluid.layers.data(name="sentence", dtype="int64", shape=[1], lod_level=1)
...
exe.run(feed={
"sentence": create_lod_tensor(
data=numpy.array([1, 3, 4, 5, 3, 6, 8], dtype='int64').reshape(-1, 1),
lod=[4, 1, 2],
place=fluid.CPUPlace()
)
})
The training data :code:`sentence` contains three sequences whose lengths are :code:`4, 1, 2` respectively.
They are :code:`data[0:4]`, :code:`data[4:5]` and :code:`data[5:7]` respectively.
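To double-check how such a sequence batch is laid out, you can build the same :code:`LoDTensor` outside of :code:`feed` and inspect it with the calls introduced in the LoD-Tensor user guide; the comments indicate the expected content:

.. code-block:: python

import numpy
import paddle.fluid as fluid

t = fluid.create_lod_tensor(
    numpy.array([1, 3, 4, 5, 3, 6, 8], dtype='int64').reshape(-1, 1),
    [[4, 1, 2]],
    fluid.CPUPlace())
print(t.recursive_sequence_lengths())  # [[4, 1, 2]]
print(t.lod())                         # offset form: [[0, 4, 5, 7]]
print(numpy.array(t))                  # the raw data, shape (7, 1)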
How to prepare training data for every device in ParallelExecutor
-------------------------------------------------------------------
When you feed data to :code:`ParallelExecutor.run(feed=...)` ,
you can explicitly assign data for every training device (such as GPU).
You need to feed a list to :code:`feed` . Each element of the list is a dict
whose key is the name of a data layer and whose value is the value of that data layer.
For example:
.. code-block:: python
parallel_executor = fluid.ParallelExecutor(use_cuda=True)  # use_cuda is required; True because the example below uses GPUs
parallel_executor.run(
feed=[
{
"image": numpy.random.random(size=(32, 3, 224, 224)).astype('float32'),
"label": numpy.random.random(size=(32, 1)).astype('int64')
},
{
"image": numpy.random.random(size=(16, 3, 224, 224)).astype('float32'),
"label": numpy.random.random(size=(16, 1)).astype('int64')
},
]
)
In the code above, GPU0 will train 32 samples and GPU1 will train 16 samples.
.. _user_guide_customize_batch_size_rank_en:
Customize the BatchSize dimension
------------------------------------
Batch size is the first dimension of data by default in PaddlePaddle Fluid, indicated by :code:`-1` . But in advanced usage, the batch size could be fixed or represented by another dimension or multiple dimensions, which can be implemented by setting :code:`fluid.layers.data(append_batch_size=False)` .
1. fixed BatchSize dimension
.. code-block:: python
image = fluid.layers.data(name="image", shape=[32, 784], append_batch_size=False)
Here :code:`image` is always a matrix with size of :code:`[32, 784]` .
2. batch size expressed by other dimension
.. code-block:: python
sentence = fluid.layers.data(name="sentence",
shape=[80, -1, 1],
append_batch_size=False,
dtype="int64")
Here the middle dimension of :code:`sentence` is batch size. This type of data layout is applied in fixed-length recurrent neural networks.
.. _user_guide_paddle_support_data_types_en:
Data types supported by Fluid
-------------------------------
The data types supported by PaddlePaddle Fluid include:

* float16: supported by some operations
* float32: major data type for real numbers
* float64: minor data type for real numbers, supported by most operations
* int32: minor data type for labels
* int64: major data type for labels
* uint64: minor data type for labels
* bool: data type for control flow
* int16: minor data type for labels
* uint8: input data type, used for picture pixels
.. _user_guide_prepare_data_en:
#############
Prepare Data
#############
PaddlePaddle Fluid supports two methods to feed data into networks:
1. Synchronous method - Python Reader: firstly, use :code:`fluid.layers.data` to set up the data input layer. Then, feed in the training data through :code:`executor.run(feed=...)` in :code:`fluid.Executor` or :code:`fluid.ParallelExecutor` .
2. Asynchronous method - py_reader: firstly, use :code:`fluid.layers.py_reader` to set up the data input layer. Then configure the data source with the :code:`decorate_paddle_reader` or :code:`decorate_tensor_provider` function of :code:`py_reader` . After that, call :code:`fluid.layers.read_file` to read the data.
Comparisons of the two methods:
=========================  =============================================  =============================================
Aspects                     Synchronous Python Reader                      Asynchronous py_reader
=========================  =============================================  =============================================
API interface               :code:`executor.run(feed=...)`                 :code:`fluid.layers.py_reader`
data type                   Numpy Array                                    Numpy Array or LoDTensor
data augmentation           carried out by other libraries on Python end   carried out by other libraries on Python end
speed                       slow                                           fast
recommended applications    model debugging                                industrial training
=========================  =============================================  =============================================
Synchronous Python Reader
##########################
Fluid provides Python Reader to feed in data.
Python Reader is a pure Python-side interface, and data feeding is synchronized with the model training/prediction process. Users can pass in data through Numpy Array. For specific operations, please refer to:
.. toctree::
:maxdepth: 1
feeding_data_en.rst
Python Reader supports advanced functions such as batching and shuffling. For specific operations, please refer to:
.. toctree::
:maxdepth: 1
reader.md
Asynchronous py_reader
########################
Fluid provides asynchronous data feeding method PyReader. It is more efficient as data feeding is not synchronized with the model training/prediction process. For specific operations, please refer to:
.. toctree::
:maxdepth: 1
use_py_reader_en.rst
.. _user_guide_use_py_reader_en:
############################################
Use PyReader to read training and test data
############################################
Paddle Fluid supports PyReader, which implements feeding data from Python to C++. Different from :ref:`user_guide_use_numpy_array_as_train_data_en` , the process of loading data to Python is asynchronous with the process of :code:`Executor::Run()` reading data when PyReader is in use.
Moreover, PyReader is able to work with :code:`double_buffer_reader` to upgrade the performance of reading data.
Create PyReader Object
################################
You can create PyReader object as follows:
.. code-block:: python
import paddle.fluid as fluid
py_reader = fluid.layers.py_reader(capacity=64,
shapes=[(-1,3,224,224), (-1,1)],
dtypes=['float32', 'int64'],
name='py_reader',
use_double_buffer=True)
In the code, ``capacity`` is the buffer size of the PyReader;
``shapes`` are the shapes of the inputs in a batch (such as image and label in an image classification task);
``dtypes`` are the data types of those inputs;
``name`` is the name of the PyReader instance;
``use_double_buffer`` is True by default, which means :code:`double_buffer_reader` is used.
When you need several different PyReader objects (usually two, one for the training phase and one for the testing phase), their names must be different. For example, in the same task, the PyReader objects for training and testing are created as follows:
.. code-block:: python
import paddle.fluid as fluid
train_py_reader = fluid.layers.py_reader(capacity=64,
shapes=[(-1,3,224,224), (-1,1)],
dtypes=['float32', 'int64'],
name='train',
use_double_buffer=True)
test_py_reader = fluid.layers.py_reader(capacity=64,
shapes=[(-1,3,224,224), (-1,1)],
dtypes=['float32', 'int64'],
name='test',
use_double_buffer=True)
Note: PyReader objects cannot be copied with :code:`Program.clone()` , so you have to create separate PyReader objects for the training and testing phases as shown above.
Since the testing program cannot be cloned from the training program, the parameters have to be shared between the two phases through :code:`fluid.unique_name.guard()` .
Details are as follows:
.. code-block:: python
import paddle.fluid as fluid
import paddle.dataset.mnist as mnist
import paddle.v2
import numpy
def network(is_train):
reader = fluid.layers.py_reader(
capacity=10,
shapes=((-1, 784), (-1, 1)),
dtypes=('float32', 'int64'),
name="train_reader" if is_train else "test_reader",
use_double_buffer=True)
img, label = fluid.layers.read_file(reader)
...
# Here, we omitted the definition of loss of the model
return loss, reader
train_prog = fluid.Program()
train_startup = fluid.Program()
with fluid.program_guard(train_prog, train_startup):
with fluid.unique_name.guard():
train_loss, train_reader = network(True)
adam = fluid.optimizer.Adam(learning_rate=0.01)
adam.minimize(train_loss)
test_prog = fluid.Program()
test_startup = fluid.Program()
with fluid.program_guard(test_prog, test_startup):
with fluid.unique_name.guard():
test_loss, test_reader = network(False)
Configure data source of PyReader objects
##########################################
PyReader provides :code:`decorate_tensor_provider` and :code:`decorate_paddle_reader` , both of which receive a Python :code:`generator` as the data source. The difference is:
1. :code:`decorate_tensor_provider` : the :code:`generator` yields a :code:`list` or :code:`tuple` each time, each element of which is a :code:`LoDTensor` or a Numpy array whose shape must match the :code:`shapes` specified when the PyReader was created (a minimal sketch follows this list).
2. :code:`decorate_paddle_reader` : the :code:`generator` also yields a :code:`list` or :code:`tuple` of Numpy arrays each time, but their :code:`shape` does not have to match the :code:`shapes` specified when the PyReader was created; :code:`decorate_paddle_reader` will :code:`reshape` them internally.
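A minimal sketch of a batch generator for :code:`decorate_tensor_provider` is given below. It assumes the py_reader created at the beginning of this document (shapes :code:`(-1,3,224,224)` and :code:`(-1,1)` ), assumes that, like :code:`decorate_paddle_reader` , the method receives a generator function that is called to produce the batches, and uses a made-up batch size and random data:

.. code-block:: python

import numpy as np

BATCH_SIZE = 32

def tensor_provider():
    # each yield produces one batch: a tuple with one entry per input slot
    for _ in range(100):
        images = np.random.random(size=(BATCH_SIZE, 3, 224, 224)).astype('float32')
        labels = np.random.randint(0, 10, size=(BATCH_SIZE, 1)).astype('int64')
        yield images, labels

py_reader.decorate_tensor_provider(tensor_provider)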
Train and test model with PyReader
##################################
Details are as follows (the remaining part of the code above):
.. code-block:: python
place = fluid.CUDAPlace(0)
startup_exe = fluid.Executor(place)
startup_exe.run(train_startup)
startup_exe.run(test_startup)
trainer = fluid.ParallelExecutor(
use_cuda=True, loss_name=train_loss.name, main_program=train_prog)
tester = fluid.ParallelExecutor(
use_cuda=True, share_vars_from=trainer, main_program=test_prog)
train_reader.decorate_paddle_reader(
paddle.v2.reader.shuffle(paddle.batch(mnist.train(), 512), buf_size=8192))
test_reader.decorate_paddle_reader(paddle.batch(mnist.test(), 512))
for epoch_id in range(10):
train_reader.start()
try:
while True:
print('train_loss', numpy.array(
trainer.run(fetch_list=[train_loss.name])))
except fluid.core.EOFException:
print('End of epoch', epoch_id)
train_reader.reset()
test_reader.start()
try:
while True:
print('test loss', numpy.array(
tester.run(fetch_list=[test_loss.name])))
except fluid.core.EOFException:
print('End of testing')
test_reader.reset()
Specific steps are as follows:
1. Before the start of every epoch, call :code:`start()` to start the PyReader;
2. At the end of every epoch, :code:`read_file` throws a :code:`fluid.core.EOFException` . Call :code:`reset()` after catching the exception to reset the state of the PyReader and start the next epoch.
.. _cluster_howto_en:
Manual for Distributed Training with Fluid
==========================================
Basic Idea Of Distributed Training
-------------------------------------
Distributed deep learning training is usually divided into two parallelization methods: data parallelism and model parallelism. Refer to the following figure:
.. image:: src/parallelism.png
In the model parallelism mode, the layers and parameters of the model are distributed over multiple nodes. The model goes through several cross-node communications during the forward and backward passes of one mini-batch, and each node only saves a part of the entire model.
In the data parallelism mode, each node holds the complete layers and parameters of the model, performs the forward and backward computation on its own share of the data, and then aggregates the gradients and updates the parameters on all nodes synchronously.
The current version of Fluid only provides the data parallelism mode. In addition, implementations of special cases of model parallelism (e.g. large sparse model training) will be explained in subsequent documents.
In data-parallel training, Fluid uses two communication modes to meet the requirements of different distributed training tasks, namely RPC communication and Collective communication. The RPC communication method uses `gRPC <https://github.com/grpc/grpc/>`_ , while the Collective communication method uses `NCCL2 <https://developer.nvidia.com/nccl>`_ .
.. csv-table:: Comparison of RPC communication and Collective communication
:header: "Feature", "Collective", "RPC"
"Ring-Based Communication", "Yes", "No"
"Asynchronous Training", "Yes", "Yes"
"Distributed Model", "No", "Yes"
"Fault-tolerant Training", "No", "Yes"
"Performance", "Faster", "Fast"
- Structure of RPC Communication Method:
.. image:: src/dist_train_pserver.png
Data-parallelised distributed training in RPC communication mode will start multiple pserver processes and multiple trainer processes, each pserver process will save a part of the model parameters and be responsible for receiving the gradients sent from the trainers and updating these model parameters; Each trainer process will save a copy of the complete model, and use a part of the data to train, then send the gradients to the pservers, finally pull the updated parameters from the pserver.
The pserver processes can run on compute nodes completely different from the trainers, or share nodes with the trainers. The number of pserver processes required for a distributed task usually needs to be adjusted according to the actual situation to achieve the best performance, but it is usually no larger than the number of trainer processes.
When using GPU training, the pserver can choose to use the GPU or only use the CPU. If the pserver also uses the GPU, it will result in the extra overhead of copying the gradient data received from the CPU to the GPU. In some cases, the overall training performance will be degraded.
- Structure of NCCL2 communication method:
.. image:: src/dist_train_nccl2.png
NCCL2 (Collective communication method) for distributed training avoids the need of pserver processes. Each trainer process holds a complete set of model parameters. After the calculation of the gradient, the trainer, through mutual communications, "Reduce" the gradient data to all devices of all nodes and then each node completes parameter updates of its own.
Training in the Parameter Server Manner
----------------------------------------------
Use the :code:`transpiler` API to quickly convert a program that can be executed on a single machine into a program that can be executed in a distributed manner. On each server node, pass the corresponding values to the arguments of :code:`transpiler` to get the :code:`Program` that the current node is to execute:
.. csv-table:: required configuration parameters
:header: "parameter", "description"
"role", "\ **required**\ distinguishes whether to start as pserver or trainer, this arugument is not passed into ``transpile`` , you can also use other variable names or environment variables"
"trainer_id", "\ **required**\ If it is a trainer process, it is used to specify the unique id of the current trainer in the task, starting from 0, and must be guaranteed not to be repeated in one task"
"pservers", "\ **required**\ ip:port list string of all pservers in current task, for example: 127.0.0.1:6170,127.0.0.1:6171"
"trainers", "\ **required**\ the number of trainer nodes"
"sync_mode", "\ **optional**\ True for synchronous mode, False for asynchronous mode"
"startup_program", "\ **optional**\ If startup_program is not the default fluid.default_startup_program(), this parameter needs to be passed in"
"current_endpoint", "\ **optional**\ This parameter is only required for NCCL2 mode"
For example, suppose there are two nodes, namely :code:`192.168.1.1` and :code:`192.168.1.2` , that port 6170 is used for the pservers, and that 4 trainers are started.
Then the code can be written as:
.. code-block:: python
role = "PSERVER"
trainer_id = 0 # get actual trainer id from cluster
pserver_endpoints = "192.168.1.1:6170,192.168.1.2:6170"
current_endpoint = "192.168.1.1:6170" # get actual current endpoint
trainers = 4
t = fluid.DistributeTranspiler()
t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers)
if role == "PSERVER":
pserver_prog = t.get_pserver_program(current_endpoint)
pserver_startup = t.get_startup_program(current_endpoint, pserver_prog)
exe.run(pserver_startup)
exe.run(pserver_prog)
elif role == "TRAINER":
train_loop(t.get_trainer_program())
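In a real cluster job, the values hard-coded above usually come from environment variables set by the job launcher. The variable names below are common conventions rather than something required by Fluid, so treat them as assumptions to adapt to your own scheduler:

.. code-block:: python

import os

# assumed environment variable names; adapt them to your cluster launcher
role = os.getenv("PADDLE_TRAINING_ROLE", "TRAINER")        # "PSERVER" or "TRAINER"
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
pserver_endpoints = os.getenv("PADDLE_PSERVER_ENDPOINTS", "192.168.1.1:6170,192.168.1.2:6170")
current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT", "192.168.1.1:6170")
trainers = int(os.getenv("PADDLE_TRAINERS_NUM", "4"))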
Choose Synchronous Or Asynchronous Training
+++++++++++++++++++++++++++++++++++++++++++++
Fluid distributed tasks support synchronous training or asynchronous training.
In the synchronous training mode, all trainer nodes will merge the gradient data of all nodes synchronously per mini-batch and send them to the parameter server to complete the update.
In the asynchronous mode, the trainers do not wait for each other and independently update the parameters on the parameter server.
In general, asynchronous training achieves a higher overall throughput than synchronous training when there are many trainer nodes.
When the :code:`transpile` function is called, the distributed training program is generated by default. The asynchronous training program can be generated by specifying the :code:`sync_mode=False` parameter:
.. code-block:: python
t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers, sync_mode=False)
Whether To Use The Distributed Embedding Table For Training
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Embedding is widely used in various network structures, especially text processing related models.
In some scenarios, such as recommendation systems or search engines, the number of feature ids of embedding may be very large. When it reaches a certain number, the embedding parameter will become very large.
On the one hand, the memory of a single machine may not be able to hold such parameters, making training impossible.
On the other hand, the normal training mode needs to synchronize the complete set of parameters in each iteration. If the parameters are too large, communication becomes very slow, which hurts the training speed.
Fluid supports training embeddings of very large-scale sparse features, up to the hundred-billion-id level. The embedding parameters are saved only on the parameter servers. Parameter prefetch and sparse gradient update greatly reduce the traffic and improve the communication speed.
This feature is only valid for distributed training and cannot be used on a single machine. It must be used together with sparse updates.
Usage: When configuring embedding, add the parameters :code:`is_distributed=True` and :code:`is_sparse=True`.
The parameter :code:`dict_size` defines the total number of ids in the data. The id can be any value in the int64 range, as long as the total number of ids is less than or equal to dict_size.
So before you configure, you need to estimate the total number of feature ids in the data.
.. code-block:: python
emb = fluid.layers.embedding(
is_distributed=True,
input=input,
size=[dict_size, embedding_width],
is_sparse=True)
Select Parameter Distribution Method
++++++++++++++++++++++++++++++++++++++
Parameter :code:`split_method` can specify how the parameters are distributed on the parameter servers.
Fluid uses `RoundRobin <https://en.wikipedia.org/wiki/Round-robin_scheduling>`_ by default to scatter parameters to multiple parameter servers.
In this case, as long as parameter slicing is not turned off (it is on by default), the parameters are distributed evenly across all parameter servers.
If you need a different behavior, you can pass in another method. The currently available methods are :code:`RoundRobin` and :code:`HashName` . You can also use a customized distribution method; refer to the code `here <https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/transpiler/ps_dispatcher.py#L44>`_
to write your own distribution function.
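For illustration, the following is a minimal sketch of switching the distribution method to :code:`HashName` . It assumes that :code:`split_method` is configured through :code:`fluid.DistributeTranspilerConfig` , and it reuses :code:`trainer_id` , :code:`pserver_endpoints` and :code:`trainers` from the earlier example:

.. code-block:: python

    import paddle.fluid as fluid
    # HashName and RoundRobin are defined in the ps_dispatcher module linked above
    from paddle.fluid.transpiler.ps_dispatcher import HashName

    config = fluid.DistributeTranspilerConfig()
    config.split_method = HashName  # distribute each parameter by the hash of its name

    t = fluid.DistributeTranspiler(config=config)
    t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers)
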
Turn Off the slice-up of Parameters
++++++++++++++++++++++++++++++++++++++
Parameter :code:`slice_var_up` specifies whether to slice large parameters (more than 8192 elements) across multiple parameter servers to balance the computational load. It is turned on by default.
When the sizes of the trainable parameters in the model are relatively uniform, or when a customized parameter distribution method already spreads the parameters evenly over the parameter servers, you can turn slicing off to reduce the overhead of slicing, copying and reassembling the parameters:
.. code-block:: python
t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers, slice_var_up=False)
Turn On Memory Optimization
++++++++++++++++++++++++++++++
In the parameter server distributed training mode, to enable memory optimization :code:`memory_optimize` , compared with a single machine, you need to pay attention to the following rules:
- On the pserver side, **don't** execute :code:`memory_optimize`
- On the trainer side, execute :code:`fluid.memory_optimize` and then execute :code:`t.transpile()`
- On the trainer side, calling :code:`memory_optimize` needs to add :code:`skip_grads=True` to ensure the gradient sent is not renamed : :code:`fluid.memory_optimize(input_program, skip_grads=True)`
Example:
.. code-block:: python
if role == "TRAINER":
fluid.memory_optimize(fluid.default_main_program(), skip_grads=True)
t = fluid.DistributeTranspiler()
t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers)
if role == "PSERVER":
# start pserver here
elif role == "TRAINER":
# start trainer here
Training Using NCCL2 Communication
------------------------------------
In NCCL2-mode distributed training there is no parameter server role; the trainers communicate with each other directly. Pay attention to the following tips:
* Configure :code:`mode="nccl2"` in :code:`fluid.DistributeTranspilerConfig` .
* When calling :code:`transpile` , pass the endpoints of all trainer nodes through :code:`trainers` , and pass the endpoint of the current node through :code:`current_endpoint` .
* Initialize :code:`ParallelExecutor` with :code:`num_trainers` and :code:`trainer_id` .
For example:
.. code-block:: python
    import paddle.fluid as fluid

    trainer_id = 0  # get actual trainer id here
    trainers = "192.168.1.1:6170,192.168.1.2:6170"
    current_endpoint = "192.168.1.1:6170"  # get actual current endpoint
    config = fluid.DistributeTranspilerConfig()
    config.mode = "nccl2"
    t = fluid.DistributeTranspiler(config=config)
    t.transpile(trainer_id, trainers=trainers, current_endpoint=current_endpoint)
    exe = fluid.ParallelExecutor(
        use_cuda=True,
        loss_name=loss_name,  # name of the loss variable defined in the network
        num_trainers=len(trainers.split(",")),
        trainer_id=trainer_id)
    ...
.. csv-table:: Description of the necessary parameters for NCCL2 mode
:header: "parameter", "description"
"trainer_id", "The unique ID of each trainer node in the task, starting at 0, there cannot be any duplication"
"trainers", "endpoints of all trainer nodes in the task, used to broadcast NCCL IDs when NCCL2 is initialized"
"current_endpoint", "endpoint of current node"
Currently, distributed training with NCCL2 only supports synchronous training. NCCL2-mode distributed training is more suitable for relatively large models that are trained synchronously on GPUs. If the hardware supports RDMA and GPU Direct, it can achieve high distributed training performance.
Important Notes on NCCL2 Distributed Training
++++++++++++++++++++++++++++++++++++++++++++++
**Note** : In NCCL2-mode distributed training, please ensure that every node trains the same amount of data, which prevents the program from hanging or exiting at the final iteration. There are two common ways:
- Randomly sample some data to complement the nodes that were assigned less data. (We recommend this method so that the complete dataset is trained.)
- Each node trains only a fixed number of batches per pass, controlled by the Python code. If a node has more data than this fixed amount, the extra data will not be trained.
**Note** : If there are multiple network devices in the system, you need to manually specify the devices used by NCCL2.
Assuming you need to use :code:`eth2` as the communication device, you need to set the following environment variables:
.. code-block:: bash
export NCCL_SOCKET_IFNAME=eth2
In addition, NCCL2 provides other switch environment variables, such as whether to enable GPU Direct, whether to use RDMA, etc. For details, please refer to
`ncclknobs <https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/index.html#ncclknobs>`_ .
.. _cluster_quick_start_en:
Quick Start with Distributed Training
=========================================
Preparation
--------------------
In this article, we'll show you how to quickly start a PaddlePaddle distributed training task in a cluster. Before you start, do some preparatory work as follows:
1. Prepare a connected training cluster. Here we use 4 nodes and host names of the form ``*.paddlepaddle.com`` to refer to them. You can modify them according to your actual situation.
2. Make sure you have read :ref:`install_steps` before you start and can run PaddlePaddle on all nodes of the cluster.
Example code
-------------
Let's use a very simple linear regression model as an example to explain how to start a distributed training task with 2 pserver server nodes and 2 trainer nodes. You can save this code as ``dist_train.py`` .
.. code:: python
import os
import paddle
import paddle.fluid as fluid
    # train reader
    EPOCH_NUM = 30
    BATCH_SIZE = 8
train_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.uci_housing.train(), buf_size=500),
batch_size=BATCH_SIZE)
def train():
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y_predict = fluid.layers.fc(input=x, size=1, act=None)
loss = fluid.layers.square_error_cost(input=y_predict, label=y)
avg_loss = fluid.layers.mean(loss)
opt = fluid.optimizer.SGD(learning_rate=0.001)
opt.minimize(avg_loss)
place = fluid.CPUPlace()
feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
exe = fluid.Executor(place)
# fetch distributed training environment setting
training_role = os.getenv("PADDLE_TRAINING_ROLE", None)
port = os.getenv("PADDLE_PSERVER_PORT", "6174")
pserver_ips = os.getenv("PADDLE_PSERVER_IPS", "")
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
eplist = []
for ip in pserver_ips.split(","):
eplist.append(':'.join([ip, port]))
pserver_endpoints = ",".join(eplist)
trainers = int(os.getenv("PADDLE_TRAINERS"))
current_endpoint = os.getenv("PADDLE_CURRENT_IP", "") + ":" + port
t = fluid.DistributeTranspiler()
t.transpile(
trainer_id = trainer_id,
pservers = pserver_endpoints,
trainers = trainers)
if training_role == "PSERVER":
pserver_prog = t.get_pserver_program(current_endpoint)
startup_prog = t.get_startup_program(current_endpoint, pserver_prog)
exe.run(startup_prog)
exe.run(pserver_prog)
elif training_role == "TRAINER":
trainer_prog = t.get_trainer_program()
exe.run(fluid.default_startup_program())
for epoch in range(EPOCH_NUM):
for batch_id, batch_data in enumerate(train_reader()):
avg_loss_value, = exe.run(trainer_prog,
feed=feeder.feed(batch_data),
fetch_list=[avg_loss])
if (batch_id + 1) % 10 == 0:
print("Epoch: {0}, Batch: {1}, loss: {2}".format(
epoch, batch_id, avg_loss_value[0]))
            # destroy the resources of the current trainer node on the pserver nodes
exe.close()
else:
raise AssertionError("PADDLE_TRAINING_ROLE should be one of [TRAINER, PSERVER]")
train()
Environment Variables
------------------------------------
When starting a distributed training task, different environment variables are used to represent different node roles, details as follows:
.. list-table::
:header-rows: 1
* - Environment Variable
- Data Type
- Example
- Description
* - :code:`PADDLE_TRAINING_ROLE`
- str
     - :code:`PSERVER,TRAINER`
- role of current training node
* - :code:`PADDLE_PSERVER_IPS`
- str
     - :code:`ps0.paddlepaddle.com,ps1.paddlepaddle.com`
- The IP addresses or hostnames of all pserver nodes in the distributed training task, separated by ","
* - :code:`PADDLE_PSERVER_PORT`
- int
- 6174
- port that the pserver process listens to
* - :code:`PADDLE_TRAINERS`
- int
- 2
- Number of trainer nodes in a distributed training task
* - :code:`PADDLE_CURRENT_IP`
- str
- :code:`ps0.paddlepaddle.com`
- IP address or hostname of the current pserver node
* - :code:`PADDLE_TRAINER_ID`
- str
- 0
- ID of the current trainer node (unique), in the range of [0, PADDLE_TRAINERS)
**Note:** Environment variables are just one way to pass runtime information; in practical tasks you can also obtain the same information from command-line arguments.
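For example, the following sketch (a hypothetical illustration, not part of the training script above) reads the same information from command-line arguments with :code:`argparse` instead of environment variables:

.. code:: python

    import argparse

    parser = argparse.ArgumentParser(description="distributed training arguments")
    parser.add_argument("--training_role", type=str, default="TRAINER",
                        help="PSERVER or TRAINER")
    parser.add_argument("--pserver_ips", type=str, default="",
                        help="comma-separated pserver IPs or host names")
    parser.add_argument("--pserver_port", type=str, default="6174")
    parser.add_argument("--trainers", type=int, default=1)
    parser.add_argument("--trainer_id", type=int, default=0)
    args = parser.parse_args()

    # build the endpoint list in the same way as the environment-variable version
    pserver_endpoints = ",".join(
        ["%s:%s" % (ip, args.pserver_port) for ip in args.pserver_ips.split(",") if ip])
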
API related to Distributed Training
-------------------------------------
DistributeTranspiler
~~~~~~~~~~~~~~~~~~~~~~
The machines in distributed training tasks based on the pserver-trainer architecture play two roles: parameter server (pserver) and trainer. In Fluid, users only need to write the network configuration required for single-node training. The ``DistributeTranspiler`` module then automatically converts this single-node configuration into the configuration that the pserver or the trainer needs to run, based on the role of the current training node:
.. code:: python
t = fluid.DistributeTranspiler()
t.transpile(
trainer_id = trainer_id,
pservers = pserver_endpoints,
trainers = trainers)
if PADDLE_TRAINING_ROLE == "TRAINER":
# fetch the trainer program and execute it
trainer_prog = t.get_trainer_program()
...
elif PADDLE_TRAINER_ROLE == "PSERVER":
# fetch the pserver program and execute it
pserver_prog = t.get_pserver_program(current_endpoint)
...
exe.close()
~~~~~~~~~~~~~~
The status information of all trainer nodes is kept on the pserver nodes. When a trainer finishes training, ``exe.close()`` should be called to notify all pserver nodes to release the resources of that trainer node:
.. code:: python
exe = fluid.Executor(fluid.CPUPlace())
# training process ...
    exe.close() # notify PServer to destroy the resource
Start a Distributed Training Task
----------------------------------
.. list-table::
:header-rows: 1
* - Start Node
- Start Command
- Description
* - ps0.paddlepaddle.com
     - :code:`PADDLE_TRAINING_ROLE=PSERVER PADDLE_CURRENT_IP=ps0.paddlepaddle.com PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_PSERVER_PORT=6174 python dist_train.py`
- Start pserver node
* - ps1.paddlepaddle.com
     - :code:`PADDLE_TRAINING_ROLE=PSERVER PADDLE_CURRENT_IP=ps1.paddlepaddle.com PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_PSERVER_PORT=6174 python dist_train.py`
- Start pserver node
* - trainer0.paddlepaddle.com
     - :code:`PADDLE_TRAINING_ROLE=TRAINER PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_TRAINER_ID=0 PADDLE_PSERVER_PORT=6174 python dist_train.py`
     - Start trainer node 0
* - trainer1.paddlepaddle.com
     - :code:`PADDLE_TRAINING_ROLE=TRAINER PADDLE_PSERVER_IPS=ps0.paddlepaddle.com,ps1.paddlepaddle.com PADDLE_TRAINERS=2 PADDLE_TRAINER_ID=1 PADDLE_PSERVER_PORT=6174 python dist_train.py`
     - Start trainer node 1
######################
Train Neural Networks
######################
PaddlePaddle Fluid supports both single-node training and multi-node training. A variety of training methods are supported for each training mode. This section contains the following sections:
.. toctree::
:maxdepth: 1
single_node_en.rst
multi_node_en.rst
save_load_variables_en.rst
####################
Multi-node Training
####################
.. toctree::
:maxdepth: 1
cluster_quick_start_en.rst
cluster_howto_en.rst
train_on_baidu_cloud_en.rst
.. _user_guide_save_load_vars_en:
######################################################
Save, Load Models or Variables & Incremental Learning
######################################################
Model variable classification
##############################
In PaddlePaddle Fluid, all model variables are represented by :code:`fluid.Variable()` as the base class. Under this base class, model variables can be divided into the following categories:
1. Model parameter
The model parameters are the variables trained and learned in a deep learning model. During training, the framework computes the current gradient of each model parameter with the back-propagation algorithm, and the optimizer updates the parameters according to these gradients. The training process of a model can essentially be seen as a process of continuously and iteratively updating the model parameters. In PaddlePaddle Fluid, model parameters are represented by :code:`fluid.framework.Parameter` , which is a derived class of :code:`fluid.Variable()` . Besides the various properties of :code:`fluid.Variable()` , :code:`fluid.framework.Parameter` can also be configured with its own initialization method, update rate and other properties.
2. Persistable variable
Persistable variables refer to variables that persist throughout the training process and are not destroyed by the end of an iteration, such as the global learning rate which is dynamically adjusted. In PaddlePaddle Fluid, persistable variables are represented by setting the :code:`persistable` property of :code:`fluid.Variable()` to :code:`True`. All model parameters are persistable variables, but not all persistable variables are model parameters.
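For example, the following is a minimal sketch of creating a persistable variable by hand (here a hypothetical global counter named ``global_step`` ) and of inspecting the :code:`persistable` property of existing variables:

.. code-block:: python

    import paddle.fluid as fluid

    # a persistable global variable that survives across iterations
    global_step = fluid.layers.create_global_var(
        shape=[1], value=0, dtype='int64', persistable=True, name='global_step')

    # model parameters are persistable by default
    x = fluid.layers.data(name='x', shape=[13], dtype='float32')
    hidden = fluid.layers.fc(input=x, size=10)

    for var in fluid.default_main_program().list_vars():
        if var.persistable:
            print(var.name)
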
3. Temporary variables
All model variables that do not belong to the above two categories are temporary variables. This type of variable exists only within one training iteration: after each iteration all temporary variables are destroyed, and before the next iteration a new set of temporary variables is constructed for that iteration. In general, most of the variables in a model belong to this category, such as the input training data, the output of a normal layer, and so on.
How to save model variables
############################
The model variables we need to save are different depending on the application. For example, if we just want to save the model for future predictions, just saving the model parameters will be enough. But if we need to save a checkpoint for future recovery of current training, then we should save all the persistable variables, and even record the current epoch and step id. It is because even though some model variables are not parameters, they are still essential for model training.
Save the model to make prediction for new samples
===================================================
If we save the model to make prediction for new samples, just saving the model parameters will be sufficient. We can use the :code:`fluid.io.save_params()` interface to save model parameters.
For example:
.. code-block:: python
import paddle.fluid as fluid
exe = fluid.Executor(fluid.CPUPlace())
param_path = "./my_paddle_model"
prog = fluid.default_main_program()
fluid.io.save_params(executor=exe, dirname=param_path, main_program=None)
In the example above, by calling the :code:`fluid.io.save_params` function, PaddlePaddle Fluid scans all model variables in the default :code:`fluid.Program` , i.e. :code:`prog` and picks out all model parameters. All these model parameters are saved to the specified :code:`param_path` .
How to load model variables
#############################
Corresponding to saving of model variables, we provide two sets of APIs to load the model parameters and the persistable variables of model.
Load model to make predictions for new samples
================================================
For models saved with :code:`fluid.io.save_params` , you can load them with :code:`fluid.io.load_params`.
For example:
.. code-block:: python
import paddle.fluid as fluid
exe = fluid.Executor(fluid.CPUPlace())
param_path = "./my_paddle_model"
prog = fluid.default_main_program()
fluid.io.load_params(executor=exe, dirname=param_path,
main_program=prog)
In the above example, by calling the :code:`fluid.io.load_params` function, PaddlePaddle Fluid will scan all the model variables in :code:`prog`, filter out all the model parameters, and try to load them from :code:`param_path` .
It is important to note that the :code:`prog` here must be exactly the same as the forward part of the :code:`prog` used when calling :code:`fluid.io.save_params` and cannot contain any operations of parameter updates. If there is an inconsistency between the two, it may cause some variables not to be loaded correctly; if the parameter update operation is incorrectly included, it may cause the parameters to be changed during normal prediction. The relationship between these two :code:`fluid.Program` is similar to the relationship between training :code:`fluid.Program` and test :code:`fluid.Program`, see: :ref:`user_guide_test_while_training_en` .
In addition, special care must be taken that :code:`fluid.default_startup_program()` **must** be run before calling :code:`fluid.io.load_params` . If you run it later, it may overwrite the loaded model parameters and cause an error.
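Putting these notes together, a minimal sketch of the correct order (run the startup program first, then load the saved parameters) is:

.. code-block:: python

    import paddle.fluid as fluid

    exe = fluid.Executor(fluid.CPUPlace())
    param_path = "./my_paddle_model"
    prog = fluid.default_main_program()

    # run the startup program first ...
    exe.run(fluid.default_startup_program())
    # ... and only then load the saved parameters, so they are not overwritten
    fluid.io.load_params(executor=exe, dirname=param_path, main_program=prog)
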
Save and load models and parameters for inference
#######################################################
The inference engine provides two interfaces: :code:`fluid.io.save_inference_model` for saving an inference model and :code:`fluid.io.load_inference_model` for loading one.
- :code:`fluid.io.save_inference_model`: Please refer to :ref:`api_guide_inference` .
- :code:`fluid.io.load_inference_model`: Please refer to :ref:`api_guide_inference` .
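As a rough sketch (please see :ref:`api_guide_inference` for the authoritative usage), saving and then loading an inference model could look like the following, where :code:`img` and :code:`prediction` are assumed to be the input variable and the output variable of an already-built network:

.. code-block:: python

    import paddle.fluid as fluid

    exe = fluid.Executor(fluid.CPUPlace())
    model_path = "./my_inference_model"

    # save only the part of the network needed for inference
    fluid.io.save_inference_model(dirname=model_path,
                                  feeded_var_names=[img.name],
                                  target_vars=[prediction],
                                  executor=exe)

    # load returns the inference program, the names of the feed variables
    # and the fetch targets
    infer_prog, feed_names, fetch_targets = fluid.io.load_inference_model(
        dirname=model_path, executor=exe)
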
Incremental training
#####################
Incremental training means that a learning system can continuously learn new knowledge from new samples while preserving most of the previously learned knowledge. It therefore involves two points: saving the parameters that need to be persisted at the end of the last training, and loading these saved persistable parameters at the beginning of the next training. Incremental training in Fluid uses the following APIs:
:code:`fluid.io.save_persistables`, :code:`fluid.io.load_persistables` .
Single-node incremental training
=================================
The general steps of incremental training on a single node are as follows:
1. At the end of the training, call :code:`fluid.io.save_persistables` to save the persistable parameter to the specified location.
2. After the training startup_program is executed successfully by the executor :code:`Executor`, call :code:`fluid.io.load_persistables` to load the previously saved persistable parameters.
3. Continue training with the executor :code:`Executor` or :code:`ParallelExecutor`.
Example:
.. code-block:: python
import paddle.fluid as fluid
exe = fluid.Executor(fluid.CPUPlace())
path = "./models"
prog = fluid.default_main_program()
fluid.io.save_persistables(exe, path, prog)
In the above example, by calling the :code:`fluid.io.save_persistables` function, PaddlePaddle Fluid finds all persistable variables among the model variables in the default :code:`fluid.Program` , i.e. :code:`prog` , and saves them to the specified :code:`path` directory.
.. code-block:: python
import paddle.fluid as fluid
exe = fluid.Executor(fluid.CPUPlace())
path = "./models"
startup_prog = fluid.default_startup_program()
exe.run(startup_prog)
fluid.io.load_persistables(exe, path, startup_prog)
main_prog = fluid.default_main_program()
exe.run(main_prog)
In the above example, by calling the :code:`fluid.io.load_persistables` function, PaddlePaddle Fluid finds the persistable variables among the model variables in the given :code:`fluid.Program` , i.e. :code:`startup_prog` , and loads them one by one from the specified :code:`path` directory to continue training.
The general steps for multi-node incremental training (without distributed large-scale sparse matrices)
=========================================================================================================
There are several differences between multi-node incremental training and single-node incremental training:
1. At the end of the training, when :code:`fluid.io.save_persistables` is called to save the persistable parameters, it is not necessary for all trainers to call this method; usually it is called only on trainer 0.
2. The parameters of multi-node incremental training are loaded on the PServer side, and the trainer side does not need to load parameters. After the PServers are fully started, the trainer will synchronize the parameters from the PServer.
The general steps for multi-node incremental training (do not enable distributed large-scale sparse matrices) are:
1. At the end of the training, Trainer 0 will call :code:`fluid.io.save_persistables` to save the persistable parameters to the specified :code:`path`.
2. Share all the parameters saved by trainer 0 to all PServers through HDFS or other methods. (each PServer needs to have complete parameters).
3. After the training startup_program is successfully executed by the executor ( :code:`Executor` ), the PServer calls :code:`fluid.io.load_persistables` to load the persistable parameters saved by the 0th trainer.
4. The PServer continues to start PServer_program via the executor :code:`Executor`.
5. All training node trainers conduct training process normally through the executor :code:`Executor` or :code:`ParallelExecutor` .
For the trainer that saves the parameters during training (trainer 0), for example:
.. code-block:: python
import paddle.fluid as fluid
exe = fluid.Executor(fluid.CPUPlace())
path = "./models"
trainer_id = 0
if trainer_id == 0:
prog = fluid.default_main_program()
fluid.io.save_persistables(exe, path, prog)
.. code-block:: bash
hadoop fs -mkdir /remote/$path
hadoop fs -put $path /remote/$path
In the above example, trainer 0 calls the :code:`fluid.io.save_persistables` function. By calling this function, PaddlePaddle Fluid finds all persistable variables among the model variables in the default :code:`fluid.Program` , i.e. :code:`prog` , and saves them to the specified :code:`path` directory. The stored model is then uploaded to a location accessible to all PServers through a third-party file system (such as HDFS).
For the PServers that load the parameters during training, for example:
.. code-block:: python
    import paddle.fluid as fluid

    exe = fluid.Executor(fluid.CPUPlace())
    path = "./models"

    pserver_endpoints = "127.0.0.1:1001,127.0.0.1:1002"
    trainers = 4
    trainer_id = 0  # id of the current node among the trainers
    training_role = "PSERVER"

    config = fluid.DistributeTranspilerConfig()
    t = fluid.DistributeTranspiler(config=config)
    t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers, sync_mode=True)

    if training_role == "PSERVER":
        current_endpoint = "127.0.0.1:1001"
        pserver_prog = t.get_pserver_program(current_endpoint)
        pserver_startup = t.get_startup_program(current_endpoint, pserver_prog)

        exe.run(pserver_startup)
        fluid.io.load_persistables(exe, path, pserver_startup)
        exe.run(pserver_prog)
    elif training_role == "TRAINER":
        main_program = t.get_trainer_program()
        exe.run(main_program)
In the above example, each PServer obtains the parameters saved by trainer 0 via the HDFS commands and obtains its own :code:`fluid.Program` through the transpiler configuration. PaddlePaddle Fluid then finds all persistable variables among the model variables in this :code:`fluid.Program` , i.e. :code:`pserver_startup` , and loads them from the specified :code:`path` directory.
#####################
Single-node training
#####################
Preparation
############
To perform single-node training in PaddlePaddle Fluid, you need to read :ref:`user_guide_prepare_data_en` and :ref:`user_guide_configure_simple_model_en` . When you have finished reading :ref:`user_guide_configure_simple_model_en` , you can get two :code:`fluid.Program`, namely :code:`startup_program` and :code:`main_program` . By default, you can use :code:`fluid.default_startup_program()` and :code:`fluid.default_main_program()` to get global :code:`fluid.Program` .
For example:
.. code-block:: python
import paddle.fluid as fluid
image = fluid.layers.data(name="image", shape=[784])
label = fluid.layers.data(name="label", shape=[1])
hidden = fluid.layers.fc(input=image, size=100, act='relu')
prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
loss = fluid.layers.mean(
fluid.layers.cross_entropy(
input=prediction,
label=label
)
)
sgd = fluid.optimizer.SGD(learning_rate=0.001)
sgd.minimize(loss)
# Here the fluid.default_startup_program() and fluid.default_main_program()
# has been constructed.
After the configuration of model, the configurations of :code:`fluid.default_startup_program()` and :code:`fluid.default_main_program()` have been finished.
Initialize Parameters
#######################
Random Initialization of Parameters
====================================
After the model is configured, the parameter initialization operators are written into :code:`fluid.default_startup_program()` . By running this program with :code:`fluid.Executor()` , the random initialization of parameters is completed in the global :code:`fluid.global_scope()` . For example:
.. code-block:: python
exe = fluid.Executor(fluid.CUDAPlace(0))
exe.run(program=fluid.default_startup_program())
Note that in multi-GPU training, the parameters should be initialized on GPU0 and then will be distributed to multiple graphic cards through :code:`fluid.ParallelExecutor` .
Load Predefined Parameters
===========================
In the neural network training, predefined models are usually loaded to continue training. For how to load predefined parameters, please refer to :ref:`user_guide_save_load_vars_en`.
Single-card Training
#####################
Single-card training can be performed through calling :code:`run()` of :code:`fluid.Executor()` to run training :code:`fluid.Program` .
At runtime, feed data with :code:`run(feed=...)` and fetch persistable data with :code:`run(fetch_list=...)` . For example:
.. code-block:: python
...
loss = fluid.layers.mean(...)
exe = fluid.Executor(...)
    # the result is a numpy array
result = exe.run(feed={"image": ..., "label": ...}, fetch_list=[loss])
Notes:
1. About data type supported by feed, please refer to the article :ref:`user_guide_feed_data_to_executor_en`.
2. The return value of :code:`Executor.run` contains the values of the variables in :code:`fetch_list=[...]` . The fetched variables must be persistable. :code:`fetch_list` can be fed with either a list of Variables or a list of variable names. :code:`Executor.run` returns the fetch results as a list.
3. If the fetched data contain sequence information, you can set :code:`exe.run(return_numpy=False, ...)` to get :code:`fluid.LoDTensor` results directly and access the sequence information in them (see the sketch below).
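A minimal sketch of note 3, assuming :code:`exe` and :code:`loss` come from the earlier example:

.. code-block:: python

    import numpy as np

    # return_numpy=False makes run() return fluid.LoDTensor objects instead of
    # numpy arrays, so the LoD (sequence) information is preserved
    tensor_result, = exe.run(feed={"image": ..., "label": ...},
                             fetch_list=[loss],
                             return_numpy=False)
    print(tensor_result.lod())      # sequence information, if any
    print(np.array(tensor_result))  # convert the values to a numpy array
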
Multi-card Training
#######################
In multi-card training, you can use :code:`fluid.ParallelExecutor` to run training :code:`fluid.Program`. For example:
.. code-block:: python
train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=loss.name,
main_program=fluid.default_main_program())
train_exe.run(fetch_list=[loss.name], feed={...})
Notes:
1. The constructor of :code:`ParallelExecutor` needs the :code:`fluid.Program` to be run, which cannot be modified at runtime. The default value is :code:`fluid.default_main_program()` .
2. :code:`ParallelExecutor` must be told whether to use CUDA for training. When training on GPUs, all visible GPUs will be occupied. Users can configure `CUDA_VISIBLE_DEVICES <http://www.acceleware.com/blog/cudavisibledevices-masking-gpus>`_ to change which GPUs are used.
Advanced Usage
###############
.. toctree::
:maxdepth: 2
test_while_training_en.rst
.. _user_guide_test_while_training_en:
##############################
Evaluate model while training
##############################
:code:`fluid.Program` for model test and evaluation is different from the one for training. In the test and evaluation phase:
1. There is no back propagation and no process of optimizing and updating parameters in evaluation and test.
2. Operations in model evaluation can be different.
   * Take the BatchNorm operator for example: its algorithm differs between training and testing.
   * The evaluation model and the training model can be totally different.
Generate :code:`fluid.Program` for test
#######################################
Generate test :code:`fluid.Program` by cloning training :code:`fluid.Program`
============================================================================
:code:`Program.clone()` generates a new copy of a :code:`fluid.Program` . Setting :code:`Program.clone(for_test=True)` generates a copy whose operators are configured for testing. Simple usage is as follows:
.. code-block:: python
import paddle.fluid as fluid
img = fluid.layers.data(name="image", shape=[784])
prediction = fluid.layers.fc(
input=fluid.layers.fc(input=img, size=100, act='relu'),
size=10,
act='softmax'
)
label = fluid.layers.data(name="label", shape=[1], dtype="int64")
loss = fluid.layers.mean(fluid.layers.cross_entropy(input=prediction, label=label))
acc = fluid.layers.accuracy(input=prediction, label=label)
test_program = fluid.default_main_program().clone(for_test=True)
adam = fluid.optimizer.Adam(learning_rate=0.001)
adam.minimize(loss)
Please clone :code:`fluid.default_main_program()` into :code:`test_program` before adding the :code:`Optimizer` . Test data can then be passed to :code:`test_program` , so running the test program does not influence the training result.
Configure training :code:`fluid.Program` and test :code:`fluid.Program` individually
=====================================================================================
If the training program is largely different from test program, you can define two totally different :code:`fluid.Program` , and perform training and test individually. In PaddlePaddle Fluid, all parameters are named. If two different operations or even two different networks use parameters with the same name, the value and memory space of these parameters are shared.
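For example, the following sketch (using the hypothetical parameter name ``shared_fc_w`` ) shows two layers sharing one parameter by giving it the same name through :code:`fluid.ParamAttr` :

.. code-block:: python

    import paddle.fluid as fluid

    x = fluid.layers.data(name="x", shape=[8], dtype="float32")

    # both fc layers refer to a weight parameter named "shared_fc_w",
    # so they share the same value and memory space
    shared_w = fluid.ParamAttr(name="shared_fc_w")
    h1 = fluid.layers.fc(input=x, size=8, param_attr=shared_w)
    h2 = fluid.layers.fc(input=x, size=8, param_attr=shared_w)
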
Fluid uses the :code:`fluid.unique_name` package to generate names for unnamed parameters. :code:`fluid.unique_name.guard` keeps the generated names consistent across multiple calls of the same function.
For example:
.. code-block:: python
import paddle.fluid as fluid
def network(is_test):
file_obj = fluid.layers.open_files(filenames=["test.recordio"] if is_test else ["train.recordio"], ...)
img, label = fluid.layers.read_file(file_obj)
hidden = fluid.layers.fc(input=img, size=100, act="relu")
hidden = fluid.layers.batch_norm(input=hidden, is_test=is_test)
...
return loss
with fluid.unique_name.guard():
train_loss = network(is_test=False)
sgd = fluid.optimizer.SGD(0.001)
sgd.minimize(train_loss)
    test_program = fluid.Program()
    with fluid.unique_name.guard():
        with fluid.program_guard(test_program, fluid.Program()):
            test_loss = network(is_test=True)

    # fluid.default_main_program() is the train program
    # test_program is the test program
Perform test :code:`fluid.Program`
###################################
Run test :code:`fluid.Program` with :code:`Executor`
=======================================================
You can run test :code:`fluid.Program` with :code:`Executor.run(program=...)` .
For example:
.. code-block:: python
exe = fluid.Executor(fluid.CPUPlace())
test_acc = exe.run(program=test_program, feed=test_data_batch, fetch_list=[acc])
    print('Test accuracy is ', test_acc)
Run test :code:`fluid.Program` with :code:`ParallelExecutor`
=====================================================================
You can create a new test :code:`ParallelExecutor` from the training :code:`ParallelExecutor` (through :code:`share_vars_from` ) and the test :code:`fluid.Program` , and then run the test with the test :code:`ParallelExecutor.run` .
For example:
.. code-block:: python
train_exec = fluid.ParallelExecutor(use_cuda=True, loss_name=loss.name)
test_exec = fluid.ParallelExecutor(use_cuda=True, share_vars_from=train_exec,
main_program=test_program)
test_acc = test_exec.run(fetch_list=[acc], ...)
.. _train_on_baidu_cloud_en:
Distributed Training on Baidu Cloud
=====================================
PaddlePaddle Fluid distributed training allows you to start distributed training without relying on cluster systems (such as MPI, Kubernetes).
This chapter will use `Baidu Cloud <https://cloud.baidu.com/>`_ as an example to show you how to perform large-scale distributed tasks in a cloud environment or even a cloud GPU environment.
Create a cluster template
---------------------------
Log in to Baidu Cloud Console, select BCC Service, and click "Create Instance". Select the region, and note that only some regions have GPU servers available.
After selecting an appropriate region, select the corresponding model and create an empty server, as shown below:
.. image:: src/create_gpu_machine.png
* In the operating system options, you can select the corresponding version according to your needs. Note that the CUDA version is selected based on the actual situation. Here we choose CUDA-9.2.
* In the example, the payment method is selected as post-paid, which means that as the machine is released, the charge will stop correspondingly, which is more cost-effective for running a one-time task.
After the machine is created successfully, execute the following command to install the paddlepaddle GPU version and related dependencies.
.. code-block:: bash
apt-get update && apt-get install -y python python-pip python-opencv
# Note: Baidu cloud cuda-9.2 image does not have cudnn and nccl2 installed by default. It needs to be installed manually. If you intend to install it by yourself, you need to download it from the official website.
wget -q "http://paddle-train-on-cloud.cdn.bcebos.com/libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb"
wget -q "http://paddle-train-on-cloud.cdn.bcebos.com/nccl_2.2.13-1+cuda9.0_x86_64.txz"
dpkg -i libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb
ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/lib/libcudnn.so
unxz nccl_2.2.13-1+cuda9.0_x86_64.txz
tar xf nccl_2.2.13-1+cuda9.0_x86_64.tar
cp -r nccl_2.2.13-1+cuda9.0_x86_64/lib/* /usr/lib
    # Note: you can optionally use the following pip mirror to speed up the download (for users within China).
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple matplotlib==2.2.3
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple paddlepaddle-gpu==0.15.0.post97
After the installation is completed, use the following test program to test whether the current machine can run the GPU training program correctly. If an error is encountered, please fix the running environment problem according to the error message. In order to facilitate the startup of the GPU cluster, after the test program is successfully executed, select the current server and select "Create Customized Image" . You can select the configured image when you create the GPU cluster later.
.. image:: src/create_image.png
* test program:
.. code-block:: python
from __future__ import print_function
import paddle.fluid.core as core
import math
import os
import sys
import numpy
import paddle
import paddle.fluid as fluid
BATCH_SIZE = 64
PASS_NUM = 1
    def loss_net(hidden, label):
        prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
        loss = fluid.layers.cross_entropy(input=prediction, label=label)
        avg_loss = fluid.layers.mean(loss)
        acc = fluid.layers.accuracy(input=prediction, label=label)
        return prediction, avg_loss, acc
def conv_net(img, label):
conv_pool_1 = fluid.nets.simple_img_conv_pool(
input=img,
filter_size=5,
num_filters=20,
pool_size=2,
pool_stride=2,
act="relu")
conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
conv_pool_2 = fluid.nets.simple_img_conv_pool(
input=conv_pool_1,
filter_size=5,
num_filters=50,
pool_size=2,
pool_stride=2,
act="relu")
return loss_net(conv_pool_2, label)
def train(use_cuda):
if use_cuda and not fluid.core.is_compiled_with_cuda():
return
img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
prediction, avg_loss, acc = conv_net(img, label)
test_program = fluid.default_main_program().clone(for_test=True)
optimizer = fluid.optimizer.Adam(learning_rate=0.001)
optimizer.minimize(avg_loss)
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
train_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.mnist.train(), buf_size=500),
batch_size=BATCH_SIZE)
test_reader = paddle.batch(
paddle.dataset.mnist.test(), batch_size=BATCH_SIZE)
feeder = fluid.DataFeeder(feed_list=[img, label], place=place)
exe.run(fluid.default_startup_program())
for pass_id in range(PASS_NUM):
for batch_id, data in enumerate(train_reader()):
acc_np, avg_loss_np = exe.run(fluid.default_main_program(),
feed=feeder.feed(data),
fetch_list=[acc, avg_loss])
if (batch_id + 1) % 10 == 0:
print(
'PassID {0:1}, BatchID {1:04}, Loss {2:2.2}, Acc {3:2.2}'.
format(pass_id, batch_id + 1,
float(avg_loss_np.mean()), float(acc_np.mean())))
if __name__ == '__main__':
train(True)
Create a cluster
------------------
After creating the image, you can use it to create a GPU cluster with as many GPU servers as you actually need. As an example, two GPU servers are started here: the one created in the previous step and one new server.
Click "Create Instance" to select GPU servers with the same settings in the same region. Especially, the image you just created should be selected as the operating system.
.. image:: src/create_more_nodes.png
Write cluster task startup scripts
------------------------------------
In order to conveniently launch distributed training tasks on multiple GPU servers, we use
`fabric <http://www.fabfile.org/>`_
as the tool to launch and manage the cluster tasks. You can choose other cluster frameworks that you are familiar with, such as MPI or Kubernetes.
The approach demonstrated in this example is only intended for simple cluster environments in which the servers can log in to each other through SSH.
To install the fabric, you need to execute:
.. code-block:: bash
pip install fabric
Suppose we have created two GPU servers with the IP addresses :code:`172.16.0.5` and :code:`172.16.0.6` . On the first server,
create the training program file :code:`dist_train_demo.py` ; you can download the code from
`here <https://raw.githubusercontent.com/PaddlePaddle/FluidDoc/develop/doc/fluid/user_guides/howto/training/src/dist_train_demo.py>`_ .
Then write the :code:`fabfile.py` script below to control the parameter servers and trainers that start the training task on the different servers:
.. code-block:: python
from fabric import Group, task
endpoints = "172.16.0.5:6173,172.16.0.6:6173"
port = "6173"
pservers = 2
trainers = 2
hosts = []
eps = []
for ep in endpoints.split(","):
eps.append(ep)
hosts.append(ep.split(":")[0])
def start_server(c):
current_endpoint = "%s:%s" % (c.host, port)
trainer_id = hosts.index(c.host)
cmd = "python /root/work/dist_train_demo.py pserver %s %s %d %d &> /root/work/server.log.%s &" % (
endpoints, current_endpoint, trainer_id, trainers, c.host)
c.run(cmd)
def start_trainer(c):
current_endpoint = "%s:%s" % (c.host, port)
trainer_id = hosts.index(c.host)
cmd = "python /root/work/dist_train_demo.py trainer %s %s %d %d &> /root/work/trainer.log.%s &" % (
endpoints, current_endpoint, trainer_id, trainers, c.host)
c.run(cmd)
@task
def start(c):
c.connect_kwargs.password = "work@paddle123"
c.run("mkdir -p /root/work")
c.put("dist_train_demo.py", "/root/work")
start_server(c)
start_trainer(c)
@task
def tail_log(c):
c.connect_kwargs.password = "work@paddle123"
c.run("tail /root/work/trainer.log.%s" % c.host)
Save the above code to :code:`fabfile.py` and execute
.. code-block:: bash
fab -H 172.16.0.5,172.16.0.6 start
This starts a distributed training task on the two GPU servers by launching two pserver processes and two trainer processes respectively.
Get distributed training results
---------------------------------
The logs of the example task are written under :code:`/root/work` on each server, as :code:`server.log.[IP]` and :code:`trainer.log.[IP]` respectively. You can view these log files manually on each server, or use fabric to fetch the log information of all nodes, for example:
.. code-block:: bash
fab -H 172.16.0.5,172.16.0.6 tail-log
Terminate the cluster
------------------------
After the task is executed, don't forget to release the GPU cluster resources. To do this, firstly select the servers you want to release, and then select "Release" to shut down the machine and release the resources.
If you need to perform a new task, you can use the previously saved image directly, start a new cluster, and start the training by following the previous steps.
.. image:: src/release.png
###########
User Guides
###########
.. todo::
If you have got the hang of Beginner's Guide, and wish to model practical problems and build your original networks, this section will provide
you with some detailed operations:
- `Basic Concepts <../user_guides/howto/basic_concept/index_en.html>`_ : It explains the basic concepts of Fluid.
- `Prepare Data <../user_guides/howto/prepare_data/index_en.html>`_ : This section introduces the data types supported and the data transmission methods available when you train your networks with Fluid.
- `Set up Simple Model <../user_guides/howto/configure_simple_model/index_en.html>`_ : This section illustrates how to model practical problems and build networks with the related operators of Fluid.
- `Train Neural Networks <../user_guides/howto/training/index_en.html>`_ : This section will guide you through single-node training, multi-node training, and saving or loading model variables.
- `Model Evaluation and Debugging <../user_guides/howto/evaluation_and_debugging/index_en.html>`_ : It introduces the model evaluation and debugging methods in Fluid.
Classic models in multiple fields, reproduced in Fluid:
- `Fluid Model Library <../user_guides/models/index_en.html>`_
.. toctree::
:hidden:
howto/basic_concept/index_en.rst
howto/prepare_data/index_en.rst
howto/configure_simple_model/index_en.rst
howto/training/index_en.rst
howto/evaluation_and_debugging/index_en.rst
models/index_en.rst