diff --git a/doc/fluid/advanced_usage/index_en.rst b/doc/fluid/advanced_usage/index_en.rst
index eef0527ca6f69c46f59f75eb8e063fe285d7c341..e2ec4e2f7f99865eb0265800a29af66c1c80cfd7 100644
--- a/doc/fluid/advanced_usage/index_en.rst
+++ b/doc/fluid/advanced_usage/index_en.rst
@@ -6,7 +6,7 @@ Advanced User Guides

By now you are familiar with Fluid, and your next step may be building a more efficient model or inventing your own Operator. If so, read more on:

- - `Fluid Design Principles <../advanced_usage/design_idea/fluid_design_idea_en.html>`_ : Design principles underlying Fluid to help you understand how the framework runs.
+ - `Design Principles of Fluid <../advanced_usage/design_idea/fluid_design_idea_en.html>`_ : Design principles underlying Fluid to help you understand how the framework runs.

 - `Deploy Inference Model <../advanced_usage/deploy/index_en.html>`_ : How to deploy the trained network to perform practical inference

diff --git a/doc/fluid/api_cn/fluid_cn/Executor_cn.rst b/doc/fluid/api_cn/fluid_cn/Executor_cn.rst
index e7ae1ee0ffbad53e2db011d83764d116e28411de..76e26ad45d8aa2a6346578f02630a9aa8598a1d5 100644
--- a/doc/fluid/api_cn/fluid_cn/Executor_cn.rst
+++ b/doc/fluid/api_cn/fluid_cn/Executor_cn.rst
@@ -136,7 +136,7 @@ feed map为该program提供输入数据。fetch_list提供program训练结束后
  - **fetch_var_name** (str) – 结果获取算子(fetch operator)的输出变量名称
  - **scope** (Scope) – 执行这个program的域,用户可以指定不同的域。缺省为全局域
  - **return_numpy** (bool) – 如果为True,则将结果张量(fetched tensor)转化为numpy
- - **use_program_cache** (bool) – 是否跨批使用缓存程序设置。设置为True时,只有当(1)程序没有用数据并行编译,并且(2)program、 feed变量名和fetch_list变量名与上一步相比没有更改时,运行速度才会更快。
+ - **use_program_cache** (bool) – 是否在不同的批次间使用相同的缓存程序设置。设置为True时,只有当(1)程序没有用数据并行编译,并且(2)program、 feed变量名和fetch_list变量名与上一步相比没有更改时,运行速度才会更快。

返回: 根据fetch_list来获取结果

diff --git a/doc/fluid/api_cn/io_cn/save_persistables_cn.rst b/doc/fluid/api_cn/io_cn/save_persistables_cn.rst
index 773ccb7e0cf4d0eb2c12a907c7f3f9ebdeb8ae01..d4d80f623c9e848671b9a767664790f7b87f2ee0 100644
--- a/doc/fluid/api_cn/io_cn/save_persistables_cn.rst
+++ b/doc/fluid/api_cn/io_cn/save_persistables_cn.rst
@@ -13,8 +13,6 @@ save_persistables
  - **executor** (Executor) – 保存变量的 executor
  - **dirname** (str) – 目录路径
  - **main_program** (Program|None) – 需要保存变量的 Program。如果为 None,则使用 default_main_Program 。默认值: None
- - **predicate** (function|None) – 如果不等于None,当指定main_program, 那么只有 predicate(variable)==True 时,main_program中的变量
- - **vars** (list[Variable]|None) – 要保存的所有变量的列表。 优先级高于main_program。默认值: None
 - **filename** (str|None) – 保存变量的文件。如果想分开保存变量,设置 filename=None. 
默认值: None 返回: None diff --git a/doc/fluid/api_cn/layers_cn.rst b/doc/fluid/api_cn/layers_cn.rst index f812e6ad505cb9d0762f7e9e52d80cb3bf3d687c..286e7a5e9520ad7768bd713a73d00f1a381c7c5a 100644 --- a/doc/fluid/api_cn/layers_cn.rst +++ b/doc/fluid/api_cn/layers_cn.rst @@ -28,7 +28,6 @@ fluid.layers layers_cn/atan_cn.rst layers_cn/auc_cn.rst layers_cn/autoincreased_step_counter_cn.rst - layers_cn/batch_cn.rst layers_cn/batch_norm_cn.rst layers_cn/beam_search_cn.rst layers_cn/beam_search_decode_cn.rst @@ -41,6 +40,7 @@ fluid.layers layers_cn/brelu_cn.rst layers_cn/cast_cn.rst layers_cn/ceil_cn.rst + layers_cn/center_loss_cn.rst layers_cn/chunk_eval_cn.rst layers_cn/clip_by_norm_cn.rst layers_cn/clip_cn.rst @@ -96,9 +96,11 @@ fluid.layers layers_cn/exp_cn.rst layers_cn/expand_cn.rst layers_cn/exponential_decay_cn.rst + layers_cn/eye_cn.rst layers_cn/fc_cn.rst layers_cn/fill_constant_batch_size_like_cn.rst layers_cn/fill_constant_cn.rst + layers_cn/filter_by_instag_cn.rst layers_cn/flatten_cn.rst layers_cn/floor_cn.rst layers_cn/fsp_matrix_cn.rst @@ -116,6 +118,7 @@ fluid.layers layers_cn/gru_unit_cn.rst layers_cn/hard_shrink_cn.rst layers_cn/hard_sigmoid_cn.rst + layers_cn/hard_swish_cn.rst layers_cn/has_inf_cn.rst layers_cn/has_nan_cn.rst layers_cn/hash_cn.rst @@ -141,6 +144,7 @@ fluid.layers layers_cn/linear_lr_warmup_cn.rst layers_cn/linspace_cn.rst layers_cn/load_cn.rst + layers_cn/lod_append_cn.rst layers_cn/lod_reset_cn.rst layers_cn/log_cn.rst layers_cn/log_loss_cn.rst @@ -153,6 +157,7 @@ fluid.layers layers_cn/lstm_cn.rst layers_cn/lstm_unit_cn.rst layers_cn/margin_rank_loss_cn.rst + layers_cn/match_matrix_tensor_cn.rst layers_cn/matmul_cn.rst layers_cn/maxout_cn.rst layers_cn/mean_cn.rst @@ -165,11 +170,12 @@ fluid.layers layers_cn/natural_exp_decay_cn.rst layers_cn/nce_cn.rst layers_cn/noam_decay_cn.rst + layers_cn/Normal_cn.rst layers_cn/not_equal_cn.rst layers_cn/npair_loss_cn.rst layers_cn/one_hot_cn.rst layers_cn/ones_cn.rst - layers_cn/open_files_cn.rst + layers_cn/ones_like_cn.rst layers_cn/pad_cn.rst layers_cn/pad_constant_like_cn.rst layers_cn/pad2d_cn.rst @@ -181,14 +187,12 @@ fluid.layers layers_cn/pool3d_cn.rst layers_cn/pow_cn.rst layers_cn/prelu_cn.rst - layers_cn/Preprocessor_cn.rst layers_cn/Print_cn.rst layers_cn/prior_box_cn.rst layers_cn/psroi_pool_cn.rst layers_cn/py_func_cn.rst layers_cn/py_reader_cn.rst layers_cn/random_crop_cn.rst - layers_cn/random_data_generator_cn.rst layers_cn/range_cn.rst layers_cn/rank_cn.rst layers_cn/rank_loss_cn.rst @@ -207,6 +211,7 @@ fluid.layers layers_cn/reshape_cn.rst layers_cn/resize_bilinear_cn.rst layers_cn/resize_nearest_cn.rst + layers_cn/resize_trilinear_cn.rst layers_cn/retinanet_detection_output_cn.rst layers_cn/retinanet_target_assign_cn.rst layers_cn/reverse_cn.rst @@ -239,6 +244,7 @@ fluid.layers layers_cn/sequence_softmax_cn.rst layers_cn/sequence_unpad_cn.rst layers_cn/shape_cn.rst + layers_cn/shard_index_cn.rst layers_cn/shuffle_channel_cn.rst layers_cn/shuffle_cn.rst layers_cn/sigmoid_cn.rst @@ -280,10 +286,15 @@ fluid.layers layers_cn/topk_cn.rst layers_cn/transpose_cn.rst layers_cn/tree_conv_cn.rst + layers_cn/unfold_cn.rst + layers_cn/Uniform_cn.rst layers_cn/uniform_random_cn.rst layers_cn/uniform_random_batch_size_like_cn.rst + layers_cn/unique_cn.rst + layers_cn/unique_with_counts_cn.rst layers_cn/unsqueeze_cn.rst layers_cn/unstack_cn.rst + layers_cn/var_conv_2d_cn.rst layers_cn/warpctc_cn.rst layers_cn/where_cn.rst layers_cn/While_cn.rst diff --git a/doc/fluid/api_cn/layers_cn/Preprocessor_cn.rst 
b/doc/fluid/api_cn/layers_cn/Preprocessor_cn.rst
deleted file mode 100644
index 95df07783ac96291cb89d9c0f0e7015a1a0f4521..0000000000000000000000000000000000000000
--- a/doc/fluid/api_cn/layers_cn/Preprocessor_cn.rst
+++ /dev/null
@@ -1,40 +0,0 @@
-.. _cn_api_fluid_layers_Preprocessor:
-
-Preprocessor
--------------------------------
-
-.. py:class:: paddle.fluid.layers.Preprocessor(reader, name=None)
-
-reader变量中数据预处理块。
-
-参数:
-    - **reader** (Variable)-reader变量
-    - **name** (str,默认None)-reader的名称
-
-**代码示例**:
-
-.. code-block:: python
-
-    import paddle.fluid as fluid
-    reader = fluid.layers.io.open_files(
-        filenames=['./data1.recordio', './data2.recordio'],
-        shapes=[(3, 224, 224), (1, )],
-        lod_levels=[0, 0],
-        dtypes=['float32', 'int64'])
-
-    preprocessor = fluid.layers.io.Preprocessor(reader=reader)
-    with preprocessor.block():
-        img, lbl = preprocessor.inputs()
-        img_out = img / 2
-        lbl_out = lbl + 1
-        preprocessor.outputs(img_out, lbl_out)
-    data_file = fluid.layers.io.double_buffer(preprocessor())
-
-
-
-
-
-
-
-
-
diff --git a/doc/fluid/api_cn/layers_cn/Print_cn.rst b/doc/fluid/api_cn/layers_cn/Print_cn.rst
index fd9835a079170a88eca05760c6e8b1b487458396..9aaf89225bc7600c2f6806fa136f15c8c2d3faf5 100644
--- a/doc/fluid/api_cn/layers_cn/Print_cn.rst
+++ b/doc/fluid/api_cn/layers_cn/Print_cn.rst
@@ -34,16 +34,28 @@ Print
 .. code-block:: python

    import paddle.fluid as fluid
-
-   input = fluid.layers.data(name="input", shape=[4, 32, 32], dtype="float32")
-   input = fluid.layers.Print(input, message = "The content of input layer:")
-   # value = some_layer(...)
-   # Print(value, summarize=10,
-   #       message="The content of some_layer: ")
-
-
-
+   input = fluid.layers.fill_constant(shape=[10,2], value=3, dtype='int64')
+   input = fluid.layers.Print(input, message="The content of input layer:")
+
+   main_program = fluid.default_main_program()
+   exe = fluid.Executor(fluid.CPUPlace())
+   exe.run(main_program)
+
+**运行输出**:
+
+.. code-block:: bash
+
+   1564546375 输出层内容: place:CPUPlace
+   Tensor[fill_constant_0.tmp_0]
+       shape: [10,2,]
+       dtype: x
+       data: 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
+
+   # 不同的环境中运行时信息的类型可能不相同。
+   # 比如:
+   # 如果Tensor y 的dtype='int64',相应的 C++ 类型为 int64_t。
+   # 在 MacOS 和 gcc4.8.2 的环境中输出的dtype为 "x"("x" 即 typeid(int64_t).name())。
diff --git a/doc/fluid/api_cn/layers_cn/anchor_generator_cn.rst b/doc/fluid/api_cn/layers_cn/anchor_generator_cn.rst
index 655f6140371031ba2dbdf9f3cef5e012ca65f069..b6bd703a474d4f0edf4046ebc3723fb43453e941 100644
--- a/doc/fluid/api_cn/layers_cn/anchor_generator_cn.rst
+++ b/doc/fluid/api_cn/layers_cn/anchor_generator_cn.rst
@@ -19,7 +19,7 @@ anchor_generator
  - **name** (str) - 先验框操作符名称。默认:None

 返回:
-      - Anchors(Varibale): 输出anchor,布局[H,W,num_anchors,4] , ``H`` 是输入的高度, ``W`` 是输入的宽度, ``num_priors`` 是输入每位的框数,每个anchor格式(未归一化)为(xmin,ymin,xmax,ymax)
+      - Anchors(Variable): 输出anchor,布局[H,W,num_anchors,4] , ``H`` 是输入的高度, ``W`` 是输入的宽度, ``num_anchors`` 是输入每个位置的框数,每个anchor格式(未归一化)为(xmin,ymin,xmax,ymax)

      - Variances(Variable): anchor的扩展变量布局为 [H,W,num_priors,4]。 ``H`` 是输入的高度, ``W`` 是输入的宽度, ``num_priors`` 是输入每个位置的框数,每个变量的格式为(xcenter,ycenter,w,h)。

diff --git a/doc/fluid/api_cn/layers_cn/batch_cn.rst b/doc/fluid/api_cn/layers_cn/batch_cn.rst
deleted file mode 100644
index ebf09ca5bf3714340e0fe4946d6b42172916b051..0000000000000000000000000000000000000000
--- a/doc/fluid/api_cn/layers_cn/batch_cn.rst
+++ /dev/null
@@ -1,47 +0,0 @@
-.. _cn_api_fluid_layers_batch:
-
-batch
--------------------------------
-
-.. 
py:function:: paddle.fluid.layers.batch(reader, batch_size) - -该层是一个reader装饰器。接受一个reader变量并添加``batching``装饰。读取装饰的reader,输出数据自动组织成batch的形式。 - -参数: - - **reader** (Variable)-装饰有“batching”的reader变量 - - **batch_size** (int)-批尺寸 - -返回:装饰有``batching``的reader变量 - -返回类型:变量(Variable) - -**代码示例**: - -.. code-block:: python - - import paddle.fluid as fluid - raw_reader = fluid.layers.io.open_files(filenames=['./data1.recordio', - './data2.recordio'], - shapes=[(3,224,224), (1,)], - lod_levels=[0, 0], - dtypes=['float32', 'int64'], - thread_num=2, - buffer_size=2) - batch_reader = fluid.layers.batch(reader=raw_reader, batch_size=5) - - # 如果用raw_reader读取数据: - # data = fluid.layers.read_file(raw_reader) - # 只能得到数据实例。 - # - # 但如果用batch_reader读取数据: - # data = fluid.layers.read_file(batch_reader) - # 每5个相邻的实例自动连接成一个batch。因此get('data')得到的是一个batch数据而不是一个实例。 - - - - - - - - - diff --git a/doc/fluid/api_cn/layers_cn/beam_search_cn.rst b/doc/fluid/api_cn/layers_cn/beam_search_cn.rst index 7bb2e241be96dc5638463d601732c53dfe6a7986..d053ab386388719dd35231992044d72fff4a17df 100644 --- a/doc/fluid/api_cn/layers_cn/beam_search_cn.rst +++ b/doc/fluid/api_cn/layers_cn/beam_search_cn.rst @@ -22,7 +22,7 @@ beam_search 参数: - **pre_ids** (Variable) - LodTensor变量,它是上一步 ``beam_search`` 的输出。在第一步中。它应该是LodTensor,shape为 :math:`(batch\_size,1)` , :math:`lod [[0,1,...,batch\_size],[0,1,...,batch\_size]]` - **pre_scores** (Variable) - LodTensor变量,它是上一步中beam_search的输出 - - **ids** (Variable) - 包含候选ID的LodTensor变量。shape为 :math:`(batch\_size×beam\_ize,K)` ,其中 ``K`` 应该是 ``beam_size`` + - **ids** (Variable) - 包含候选ID的LodTensor变量。shape为 :math:`(batch\_size×beam\_size,K)` ,其中 ``K`` 应该是 ``beam_size`` - **scores** (Variable) - 与 ``ids`` 及其shape对应的累积分数的LodTensor变量, 与 ``ids`` 的shape相同。 - **beam_size** (int) - 束搜索中的束宽度。 - **end_id** (int) - 结束标记的id。 diff --git a/doc/fluid/api_cn/layers_cn/conv2d_transpose_cn.rst b/doc/fluid/api_cn/layers_cn/conv2d_transpose_cn.rst index 0bfe3e20bc08918fc1a1918b0e1e284d26d137e0..7cc64f2be768e69d6d46e28726518b211f554244 100644 --- a/doc/fluid/api_cn/layers_cn/conv2d_transpose_cn.rst +++ b/doc/fluid/api_cn/layers_cn/conv2d_transpose_cn.rst @@ -65,7 +65,7 @@ conv2d_transpose - **filter_size** (int|tuple|None) - 滤波器大小。如果filter_size是一个tuple,则形式为(filter_size_H, filter_size_W)。否则,滤波器将是一个方阵。如果filter_size=None,则内部会计算输出大小。 - **padding** (int|tuple) - 填充大小。如果padding是一个元组,它必须包含两个整数(padding_H、padding_W)。否则,padding_H = padding_W = padding。默认:padding = 0。 - **stride** (int|tuple) - 步长大小。如果stride是一个元组,那么元组的形式为(stride_H、stride_W)。否则,stride_H = stride_W = stride。默认:stride = 1。 - - **dilation** (int|元组) - 膨胀(dilation)大小。如果dilation是一个元组,那么元组的形式为(dilation_H, dilation_W)。否则,dilation_H = dilation_W = dilation_W。默认:dilation= 1。 + - **dilation** (int|元组) - 膨胀(dilation)大小。如果dilation是一个元组,那么元组的形式为(dilation_H, dilation_W)。否则,dilation_H = dilation_W = dilation。默认:dilation= 1。 - **groups** (int) - Conv2d转置层的groups个数。从Alex Krizhevsky的CNN Deep论文中的群卷积中受到启发,当group=2时,前半部分滤波器只连接到输入通道的前半部分,而后半部分滤波器只连接到输入通道的后半部分。默认值:group = 1。 - **param_attr** (ParamAttr|None) - conv2d_transfer中可学习参数/权重的属性。如果param_attr值为None或ParamAttr的一个属性,conv2d_transfer使用ParamAttrs作为param_attr的值。如果没有设置的param_attr初始化器,那么使用Xavier初始化。默认值:None。 - **bias_attr** (ParamAttr|bool|None) - conv2d_tran_bias中的bias属性。如果设置为False,则不会向输出单元添加偏置。如果param_attr值为None或ParamAttr的一个属性,将conv2d_transfer使用ParamAttrs作为,bias_attr。如果没有设置bias_attr的初始化器,bias将初始化为零。默认值:None。 diff --git a/doc/fluid/api_cn/layers_cn/elu_cn.rst b/doc/fluid/api_cn/layers_cn/elu_cn.rst index 
370e393a9bba0744016b5798dff6bcd0567e4eb9..8fe2b3a80c5fea073519d29f623077910f076d93 100644 --- a/doc/fluid/api_cn/layers_cn/elu_cn.rst +++ b/doc/fluid/api_cn/layers_cn/elu_cn.rst @@ -10,11 +10,11 @@ ELU激活层(ELU Activation Operator) 根据 https://arxiv.org/abs/1511.07289 对输入张量中每个元素应用以下计算。 .. math:: - \\out=max(0,x)+min(0,α∗(ex−1))\\ + \\out=max(0,x)+min(0,α∗(e^{x}−1))\\ 参数: - x(Variable)- ELU operator的输入 - - alpha(FAOAT|1.0)- ELU的alpha值 + - alpha(float|1.0)- ELU的alpha值 - name (str|None) -这个层的名称(可选)。如果设置为None,该层将被自动命名。 返回: ELU操作符的输出 diff --git a/doc/fluid/api_cn/layers_cn/lstm_unit_cn.rst b/doc/fluid/api_cn/layers_cn/lstm_unit_cn.rst index 7e919a7d8482f08a9e12ec1454d8023381e2c676..fc7180629784c7561016126ec7b6e2edd035d626 100644 --- a/doc/fluid/api_cn/layers_cn/lstm_unit_cn.rst +++ b/doc/fluid/api_cn/layers_cn/lstm_unit_cn.rst @@ -33,8 +33,8 @@ lstm单元的输入包括 :math:`x_{t}` , :math:`h_{t-1}` 和 :math:`c_{t-1}` 参数: - **x_t** (Variable) - 当前步的输入值,二维张量,shape为 M x N ,M是批尺寸,N是输入尺寸 - - **hidden_t_prev** (Variable) - lstm单元的隐藏状态值,二维张量,shape为 M x S,M是批尺寸,N是lstm单元的大小 - - **cell_t_prev** (Variable) - lstm单元的cell值,二维张量,shape为 M x S ,M是批尺寸,N是lstm单元的大小 + - **hidden_t_prev** (Variable) - lstm单元的隐藏状态值,二维张量,shape为 M x S,M是批尺寸,S是lstm单元的大小 + - **cell_t_prev** (Variable) - lstm单元的cell值,二维张量,shape为 M x S ,M是批尺寸,S是lstm单元的大小 - **forget_bias** (Variable) - lstm单元的遗忘bias - **param_attr** (ParamAttr|None) - 可学习hidden-hidden权重的擦参数属性。如果设为None或者 ``ParamAttr`` 的一个属性,lstm_unit创建 ``ParamAttr`` 为param_attr。如果param_attr的初始化函数未设置,参数初始化为Xavier。默认:None - **bias_attr** (ParamAttr|None) - 可学习bias权重的bias属性。如果设为False,输出单元中则不添加bias。如果设为None或者 ``ParamAttr`` 的一个属性,lstm_unit创建 ``ParamAttr`` 为bias_attr。如果bias_attr的初始化函数未设置,bias初始化为0.默认:None diff --git a/doc/fluid/api_cn/layers_cn/open_files_cn.rst b/doc/fluid/api_cn/layers_cn/open_files_cn.rst deleted file mode 100644 index 6cd0a4a6c8c2782df166605e334eceb91d3c3467..0000000000000000000000000000000000000000 --- a/doc/fluid/api_cn/layers_cn/open_files_cn.rst +++ /dev/null @@ -1,47 +0,0 @@ -.. _cn_api_fluid_layers_open_files: - -open_files -------------------------------- - -.. py:function:: paddle.fluid.layers.open_files(filenames, shapes, lod_levels, dtypes, thread_num=None, buffer_size=None, pass_num=1, is_test=None) - -打开文件(Open files) - -该函数获取需要读取的文件列表,并返回Reader变量。通过Reader变量,我们可以从给定的文件中获取数据。所有文件必须有名称后缀来表示它们的格式,例如,``*.recordio``。 - -参数: - - **filenames** (list)-文件名列表 - - **shape** (list)-元组类型值列表,声明数据维度 - - **lod_levels** (list)-整形值列表,声明数据的lod层级 - - **dtypes** (list)-字符串类型值列表,声明数据类型 - - **thread_num** (None)-用于读文件的线程数。默认:min(len(filenames),cpu_number) - - **buffer_size** (None)-reader的缓冲区大小。默认:3*thread_num - - **pass_num** (int)-用于运行的传递数量 - - **is_test** (bool|None)-open_files是否用于测试。如果用于测试,生成的数据顺序和文件顺序一致。反之,无法保证每一epoch之间的数据顺序是一致的 - -返回:一个Reader变量,通过该变量获取文件数据 - -返回类型:变量(Variable) - -**代码示例**: - -.. code-block:: python - - import paddle.fluid as fluid - reader = fluid.layers.io.open_files(filenames=['./data1.recordio', - './data2.recordio'], - shapes=[(3,224,224), (1,)], - lod_levels=[0, 0], - dtypes=['float32', 'int64']) - - # 通过reader, 可使用''read_file''层获取数据: - image, label = fluid.layers.io.read_file(reader) - - - - - - - - - diff --git a/doc/fluid/api_cn/layers_cn/random_data_generator_cn.rst b/doc/fluid/api_cn/layers_cn/random_data_generator_cn.rst deleted file mode 100644 index ffd91283d6841bce43656b8830d05ff87d2baace..0000000000000000000000000000000000000000 --- a/doc/fluid/api_cn/layers_cn/random_data_generator_cn.rst +++ /dev/null @@ -1,43 +0,0 @@ -.. 
_cn_api_fluid_layers_random_data_generator: - -random_data_generator -------------------------------- - -.. py:function:: paddle.fluid.layers.random_data_generator(low, high, shapes, lod_levels, for_parallel=True) - -创建一个均匀分布随机数据生成器. - -该层返回一个Reader变量。该Reader变量不是用于打开文件读取数据,而是自生成float类型的均匀分布随机数。该变量可作为一个虚拟reader来测试网络,而不需要打开一个真实的文件。 - -参数: - - **low** (float)--数据均匀分布的下界 - - **high** (float)-数据均匀分布的上界 - - **shapes** (list)-元组数列表,声明数据维度 - - **lod_levels** (list)-整形数列表,声明数据 - - **for_parallel** (Bool)-若要运行一系列操作命令则将其设置为True - -返回:Reader变量,可从中获取随机数据 - -返回类型:变量(Variable) - -**代码示例**: - -.. code-block:: python - - import paddle.fluid as fluid - reader = fluid.layers.random_data_generator( - low=0.0, - high=1.0, - shapes=[[3,224,224], [1]], - lod_levels=[0, 0]) - # 通过reader, 可以用'read_file'层获取数据: - image, label = fluid.layers.read_file(reader) - - - - - - - - - diff --git a/doc/fluid/api_cn/layers_cn/shard_index_cn.rst b/doc/fluid/api_cn/layers_cn/shard_index_cn.rst new file mode 100644 index 0000000000000000000000000000000000000000..26b7b1bedfc051ea577c42649438e0b39720a285 --- /dev/null +++ b/doc/fluid/api_cn/layers_cn/shard_index_cn.rst @@ -0,0 +1,67 @@ +.. _cn_api_fluid_layers_shard_index: + +shard_index +------------------------------- + +.. py:function:: paddle.fluid.layers.shard_index(input, index_num, nshards, shard_id, ignore_value=-1) + +该层为输入创建碎片化索引,通常在模型和数据并行混合训练时使用,索引数据(通常是标签)应该在每一个trainer里面被计算,通过 +:: + + assert index_num % nshards == 0 + + shard_size = index_num / nshards + + 如果 x / shard_size == shard_id + + y = x % shard_size + + 否则 + + y = ignore_value + +我们使用分布式 ``one-hot`` 表示来展示该层如何使用, 分布式的 ``one-hot`` 表示被分割为多个碎片, 碎片索引里不为1的都使用0来填充。为了在每一个trainer里面创建碎片化的表示,原始的索引应该先进行计算(i.e. sharded)。我们来看个例子: + +.. code-block:: text + + X 是一个整形张量 + X.shape = [4, 1] + X.data = [[1], [6], [12], [19]] + + 假设 index_num = 20 并且 nshards = 2, 我们可以得到 shard_size = 10 + + 如果 shard_id == 0, 我们得到输出: + Out.shape = [4, 1] + Out.data = [[1], [6], [-1], [-1]] + 如果 shard_id == 1, 我们得到输出: + Out.shape = [4, 1] + Out.data = [[-1], [-1], [2], [9]] + + 上面的例子中默认 ignore_value = -1 + +参数: + - **input** (Variable)- 输入的索引,最后的维度应该为1 + - **index_num** (scalar) - 定义索引长度的整形参数 + - **nshards** (scalar) - shards数量 + - **shard_id** (scalar) - 当前碎片的索引 + - **ignore_value** (scalar) - 超出碎片索引范围的整型值 + +返回: 输入的碎片索引 + +返回类型: Variable + +**代码示例:** + +.. 
code-block:: python + + import paddle.fluid as fluid + label = fluid.layers.data(name="label", shape=[1], dtype="int64") + shard_label = fluid.layers.shard_index(input=label, + index_num=20, + nshards=2, + shard_id=0) + + + + + diff --git a/doc/fluid/api_guides/low_level/layers/math.rst b/doc/fluid/api_guides/low_level/layers/math.rst index 1379077d24d89d4dbce267ee5a864dae3deb6b97..e00df723d79a58f8a9b2451a2d0cd5ad4d3ef415 100644 --- a/doc/fluid/api_guides/low_level/layers/math.rst +++ b/doc/fluid/api_guides/low_level/layers/math.rst @@ -55,14 +55,14 @@ API Reference 请参考 :ref:`cn_api_fluid_layers_floor` sin ------------------ -对输入 :code:`Tensor` 逐元素取正玄。 +对输入 :code:`Tensor` 逐元素取正弦。 API Reference 请参考 :ref:`cn_api_fluid_layers_sin` cos ------------------ -对输入 :code:`Tensor` 逐元素取余玄。 +对输入 :code:`Tensor` 逐元素取余弦。 API Reference 请参考 :ref:`cn_api_fluid_layers_cos` diff --git a/doc/fluid/beginners_guide/install/compile/compile_Windows.md b/doc/fluid/beginners_guide/install/compile/compile_Windows.md index 7c3f4b228be463b3b622e262b7d2e079db562a8a..bf680a7d9a92c073f29aaaa0473a9ffdb3ae336b 100644 --- a/doc/fluid/beginners_guide/install/compile/compile_Windows.md +++ b/doc/fluid/beginners_guide/install/compile/compile_Windows.md @@ -2,7 +2,7 @@ ## 环境准备 -* *Windows 7/8/10 专业版/企业版 (64bit) (GPU版本支持CUDA 8/9.2, 且仅支持单卡)* +* *Windows 7/8/10 专业版/企业版 (64bit) (GPU版本支持CUDA 8.0/9.0/10.0, 且仅支持单卡)* * *Python 版本 2.7/3.5.1+/3.6/3.7 (64 bit)* * *pip 或 pip3 版本 9.0.1+ (64 bit)* * *Visual Studio 2015 Update3* @@ -12,7 +12,7 @@ * 如果您的计算机没有 NVIDIA® GPU,请编译CPU版的PaddlePaddle * 如果您的计算机有NVIDIA® GPU,并且满足以下条件,推荐编译GPU版的PaddlePaddle - * *CUDA 工具包8.0 配合cuDNN v7.1+, 9.0配合cuDNN v7.3+* + * *CUDA 工具包8.0配合cuDNN v7.1+, 9.0/10.0配合cuDNN v7.3+* * *GPU运算能力超过1.0的硬件设备* ## 安装步骤 diff --git a/doc/fluid/beginners_guide/install/compile/compile_Windows_en.md b/doc/fluid/beginners_guide/install/compile/compile_Windows_en.md index 4255ceb2a1a274b024a92e07a17da9d00b285808..c6a6408a8bd81adc56d626b04a7ecf1a98e9b639 100644 --- a/doc/fluid/beginners_guide/install/compile/compile_Windows_en.md +++ b/doc/fluid/beginners_guide/install/compile/compile_Windows_en.md @@ -1,4 +1,3 @@ -*** # **Compile on Windows from Source Code** This instruction will show you how to compile PaddlePaddle on a *64-bit desktop or laptop* and Windows 10. The Windows systems we support must meet the following requirements: @@ -59,8 +58,7 @@ Please note: The current version does not support NCCL and distributed related f 6. Execute cmake: - > For details on the compilation options, see [the compilation options list](../Tables.html/#Compile). - + > For details on the compilation options, see [the compilation options list](../Tables.html/#Compile). * For users who need to compile **the CPU version PaddlePaddle**: For Python2:`cmake .. 
-G "Visual Studio 14 2015 Win64" -DPYTHON_INCLUDE_DIR = $ {PYTHON_INCLUDE_DIRS} diff --git a/doc/fluid/beginners_guide/install/install_Windows.md b/doc/fluid/beginners_guide/install/install_Windows.md index 1ffd6b1ebc915aaa0e1dd03ecb8537a049ccee7b..8256b7ce7819fd35177af0f193a7f114cedf70bb 100644 --- a/doc/fluid/beginners_guide/install/install_Windows.md +++ b/doc/fluid/beginners_guide/install/install_Windows.md @@ -2,7 +2,7 @@ ## 环境准备 -* *Windows 7/8/10 专业版/企业版 (64bit) (GPU版本支持CUDA 8.0/9.0,且仅支持单卡)* +* *Windows 7/8/10 专业版/企业版 (64bit) (GPU版本支持CUDA 8.0/9.0/10.0,且仅支持单卡)* * *Python 版本 2.7.15+/3.5.1+/3.6/3.7 (64 bit)* * *pip 或 pip3 版本 9.0.1+ (64 bit)* @@ -60,10 +60,10 @@ * 如果您的计算机没有 NVIDIA® GPU,请安装CPU版的PaddlePaddle * 如果您的计算机有 NVIDIA® GPU,并且满足以下条件,推荐安装GPU版的PaddlePaddle - * *CUDA 工具包8.0配合cuDNN v7.1+, 9.0配合cuDNN v7.3+* + * *CUDA 工具包8.0配合cuDNN v7.1+, 9.0/10.0配合cuDNN v7.3+* * *GPU运算能力超过1.0的硬件设备* -注: 目前官方发布的windows安装包仅包含 CUDA 8.0/9.0 的单卡模式,不包含 CUDA 9.1/9.2/10.0/10.1,如需使用,请通过源码自行编译。 +注: 目前官方发布的windows安装包仅包含 CUDA 8.0/9.0/10.0 的单卡模式,不包含 CUDA 9.1/9.2/10.1,如需使用,请通过源码自行编译。 您可参考NVIDIA官方文档了解CUDA和CUDNN的安装流程和配置方法,请见[CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/),[cuDNN](https://docs.nvidia.com/deeplearning/sdk/cudnn-install/) @@ -91,7 +91,7 @@ Windows系统下有3种安装方式: * 如果是python2.7, 建议使用`python`命令; 如果是python3.x, 则建议使用`python3`命令 -* `python -m pip install paddlepaddle-gpu -i https://pypi.tuna.tsinghua.edu.cn/simple` 此命令将安装支持CUDA 8.0/9.0 cuDNN v7.3+的PaddlePaddle,如您对CUDA或cuDNN版本有不同要求,可用`python -m pip install paddlepaddle-gpu==[版本号] -i https://pypi.tuna.tsinghua.edu.cn/simple`或 `python3 -m pip install paddlepaddle-gpu==[版本号] -i https://pypi.tuna.tsinghua.edu.cn/simple`命令来安装,版本号请见[这里](https://pypi.org/project/paddlepaddle-gpu/#history), 关于paddlepaddle与CUDA, cuDNN版本的对应关系请见[安装包列表](./Tables.html/#whls) +* `python -m pip install paddlepaddle-gpu -i https://pypi.tuna.tsinghua.edu.cn/simple` 此命令将安装支持CUDA 8.0(配合cuDNN v7.1+)或者CUDA 9.0/10.0(配合cuDNN v7.3+)的PaddlePaddle,如您对CUDA或cuDNN版本有不同要求,可用`python -m pip install paddlepaddle-gpu==[版本号] -i https://pypi.tuna.tsinghua.edu.cn/simple`或 `python3 -m pip install paddlepaddle-gpu==[版本号] -i https://pypi.tuna.tsinghua.edu.cn/simple`命令来安装,版本号请见[这里](https://pypi.org/project/paddlepaddle-gpu/#history), 关于paddlepaddle与CUDA, cuDNN版本的对应关系请见[安装包列表](./Tables.html/#whls) diff --git a/doc/fluid/beginners_guide/install/install_Windows_en.md b/doc/fluid/beginners_guide/install/install_Windows_en.md index 5dbd5bb95de63ed1aa034689d40f4b7fbedd0bae..f06bea82c198dc967fe5e2dcdfc9fc3be543f8c6 100644 --- a/doc/fluid/beginners_guide/install/install_Windows_en.md +++ b/doc/fluid/beginners_guide/install/install_Windows_en.md @@ -2,7 +2,7 @@ ## Operating Environment -* *Windows 7/8/10 Pro/Enterprise(64bit)(CUDA 8.0/9.0 are supported, and only single GPU is supported)* +* *Windows 7/8/10 Pro/Enterprise(64bit)(CUDA 8.0/9.0/10.0 are supported, and only single GPU is supported)* * *Python 2.7.15+/3.5.1+/3.6/3.7(64bit)* * *pip or pip3 9.0.1+(64bit)* @@ -16,10 +16,10 @@ * If your computer doesn’t have NVIDIA® GPU, please install the CPU version of PaddlePaddle * If your computer has NVIDIA® GPU, and it satisfies the following requirements, we recommend you to install the GPU version of PaddlePaddle - * *CUDA Toolkit 8.0/9.0 with cuDNN v7.3+* + * *CUDA Toolkit 8.0 with cuDNN v7.1+, or 9.0/10.0 with cuDNN v7.3+* * *GPU's computing capability exceeds 1.0* -Note: currently, the official Windows installation package only support CUDA 8.0/9.0 with single GPU, and don't support CUDA 9.1/9.2/10.0/10.1. 
if you need to use, please compile by yourself through the source code.
+Note: currently, the official Windows installation package only supports CUDA 8.0/9.0/10.0 with a single GPU, and does not support CUDA 9.1/9.2/10.1. If you need these versions, please compile PaddlePaddle from the source code yourself.

 Please refer to the NVIDIA official documents for the installation process and the configuration methods of [CUDA](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/) and [cuDNN](https://docs.nvidia.com/deeplearning/sdk/cudnn-install/).

@@ -43,7 +43,7 @@ There is a checking function below for [verifying whether the installation is suc
 Notice:

* The version of pip and the version of python should be corresponding: python2.7 corresponds to `pip`; python3.x corresponds to `pip3`.
-`pip install paddlepaddle-gpu` This command will install PaddlePaddle that supports CUDA 8.0/9.0 cuDNN v7.3+, Currently, PaddlePaddle doesn't support any other version of CUDA or cuDNN on Windows.
+`pip install paddlepaddle-gpu` This command will install PaddlePaddle that supports CUDA 8.0 (with cuDNN v7.1+) or CUDA 9.0/10.0 (with cuDNN v7.3+).

 ## Installation Verification
diff --git a/doc/fluid/beginners_guide/programming_guide/programming_guide.md b/doc/fluid/beginners_guide/programming_guide/programming_guide.md
index fffbb15c6a15ec4ca22fbaac6c6e4a01e5579f16..df1421363c9cae61bb6453574729294056788d2c 100644
--- a/doc/fluid/beginners_guide/programming_guide/programming_guide.md
+++ b/doc/fluid/beginners_guide/programming_guide/programming_guide.md
@@ -43,7 +43,6 @@ import paddle.fluid as fluid
 y = fluid.layers.fc(input=x, size=128, bias_attr=True)
 ```

-
 **2. 输入输出Tensor**

 整个神经网络的输入数据也是一个特殊的 Tensor,在这个 Tensor 中,一些维度的大小在定义模型时无法确定(通常包括:batch size,如果 mini-batch 之间数据可变,也会包括图片的宽度和高度等),在定义模型时需要占位。

@@ -97,7 +96,30 @@ type {
 persistable: false
 ```

-具体输出数值将在Executor运行时得到,详细过程会在后文展开描述。
+具体输出数值将在Executor运行时得到。获取运行时的Variable数值有两种方式:方式一是利用 `paddle.fluid.layers.Print` 创建一个打印操作,打印正在访问的张量;方式二是将Variable添加到fetch_list中。
+
+方式一的代码实现如下所示:
+
+```python
+import paddle.fluid as fluid
+data = fluid.layers.fill_constant(shape=[1], value=0, dtype='int64')
+data = fluid.layers.Print(data, message="Print data:")
+```
+
+运行时的输出结果:
+
+```
+1563874307 Print data: The place is:CPUPlace
+Tensor[fill_constant_0.tmp_0]
+    shape: [1,]
+    dtype: x
+    data: 0,
+```
+
+更多 Print API 的使用方式请查看:[Print操作命令](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/api_cn/layers_cn/control_flow_cn.html#print)。
+
+方式二fetch_list的详细过程会在后文展开描述。
+
 ## 数据传入
diff --git a/doc/fluid/beginners_guide/programming_guide/programming_guide_en.md b/doc/fluid/beginners_guide/programming_guide/programming_guide_en.md
index c7ee85d4f2c82c12f40cd79f4ed937dddfc03c1a..0790b8815b83a32c59a24cc09203da85619a14c0 100644
--- a/doc/fluid/beginners_guide/programming_guide/programming_guide_en.md
+++ b/doc/fluid/beginners_guide/programming_guide/programming_guide_en.md
@@ -101,7 +101,29 @@ type {
 persistable: false
 ```

-Specific output value will be shown at the runtime of Executor. Detailed process will be explained later.
+The specific output value will be shown when the Executor runs. There are two ways to get the runtime value of a Variable. The first way is to use `paddle.fluid.layers.Print` to create a print op that prints the tensor being accessed. The second way is to add the Variable to the fetch_list. 
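+
+A minimal sketch of the second way is shown below (an illustration only, assuming the fluid 1.5 Executor API used throughout this guide); the first way is walked through in detail next:
+
+```python
+import paddle.fluid as fluid
+
+data = fluid.layers.fill_constant(shape=[1], value=0, dtype='int64')
+exe = fluid.Executor(fluid.CPUPlace())
+# fetch_list tells the Executor which Variables to return after the run.
+result = exe.run(fluid.default_main_program(), fetch_list=[data])
+print(result[0])  # a numpy array containing 0
+```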
+
+Code of the first way is as follows:
+
+```python
+import paddle.fluid as fluid
+data = fluid.layers.fill_constant(shape=[1], value=0, dtype='int64')
+data = fluid.layers.Print(data, message="Print data: ")
+```
+
+Output at the runtime of the Executor:
+
+```
+1563874307 Print data: The place is:CPUPlace
+Tensor[fill_constant_0.tmp_0]
+    shape: [1,]
+    dtype: x
+    data: 0,
+```
+
+For more information on how to use the Print API, please refer to [Print operator](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/api/layers/control_flow.html#print).
+
+The detailed process of the second way, fetch_list, will be explained later.

 ## Feed data
diff --git a/doc/fluid/flags/cudnn_cn.rst b/doc/fluid/flags/cudnn_cn.rst
index e09f29f4981e962d1c4299e4611e0caf10ee4a86..dda86afd4daab94ae6bf1c37470a7d018d42e4ff 100755
--- a/doc/fluid/flags/cudnn_cn.rst
+++ b/doc/fluid/flags/cudnn_cn.rst
@@ -3,7 +3,7 @@ cudnn
 ==================

-conv_workspace_size_limit
+FLAGS_conv_workspace_size_limit
*******************************************
(始于0.13.0)

@@ -18,7 +18,7 @@ Uint64型,缺省值为4096。即4G内存工作区。

 FLAGS_conv_workspace_size_limit=1024 - 将用于选择cuDNN卷积算法的工作区限制大小设置为1024MB。

-cudnn_batchnorm_spatial_persistent
+FLAGS_cudnn_batchnorm_spatial_persistent
*******************************************
(始于1.4.0)

@@ -37,7 +37,7 @@ FLAGS_cudnn_batchnorm_spatial_persistent=True - 开启CUDNN_BATCHNORM_SPATIAL_PE

 此模式在某些任务中可以更快,因为将为CUDNN_DATA_FLOAT和CUDNN_DATA_HALF数据类型选择优化路径。我们默认将其设置为False的原因是此模式可能使用原子整数缩减(scaled atomic integer reduction)而导致某些输入数据范围的数字溢出。

-cudnn_deterministic
+FLAGS_cudnn_deterministic
*******************************************
(始于0.13.0)

@@ -56,7 +56,7 @@ FLAGS_cudnn_deterministic=True - 选择cuDNN中的确定性函数。

 现在,在cuDNN卷积和池化Operator中启用此flag。确定性算法速度可能较慢,因此该flag通常用于调试。

-cudnn_exhaustive_search
+FLAGS_cudnn_exhaustive_search
*******************************************
(始于1.2.0)
diff --git a/doc/fluid/flags/cudnn_en.rst b/doc/fluid/flags/cudnn_en.rst
index 97acf43f97c76294c5c57ed9b8a747142faaec6e..1c29e3de0157beae135b6a42267856b4e4a33c99 100755
--- a/doc/fluid/flags/cudnn_en.rst
+++ b/doc/fluid/flags/cudnn_en.rst
@@ -3,7 +3,7 @@ cudnn
 ==================

-conv_workspace_size_limit
+FLAGS_conv_workspace_size_limit
*******************************************
(since 0.13.0)

@@ -18,7 +18,7 @@ Example

 FLAGS_conv_workspace_size_limit=1024 sets the workspace limit size for choosing cuDNN convolution algorithms to 1024MB.

-cudnn_batchnorm_spatial_persistent
+FLAGS_cudnn_batchnorm_spatial_persistent
*******************************************
(since 1.4.0)

@@ -37,7 +37,7 @@ Note

 This mode can be faster in some tasks because an optimized path will be selected for CUDNN_DATA_FLOAT and CUDNN_DATA_HALF data types. The reason we set it to False by default is that this mode may use scaled atomic integer reduction which may cause a numerical overflow for some input data range.

-cudnn_deterministic
+FLAGS_cudnn_deterministic
*******************************************
(since 0.13.0)

@@ -56,7 +56,7 @@ Note

 Now this flag is enabled in cuDNN convolution and pooling operator. The deterministic algorithms may be slower, so this flag is generally used for debugging. 
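These FLAGS are ordinary gflags. As an illustration (the values below are arbitrary examples, and it is assumed that FLAGS_* are read from the environment when the framework initializes), they can be set from Python before importing paddle:

.. code-block:: python

    import os
    # Illustrative values only; any FLAGS_* documented on this page can be set the same way.
    os.environ['FLAGS_cudnn_deterministic'] = 'True'
    os.environ['FLAGS_conv_workspace_size_limit'] = '1024'
    import paddle.fluid as fluid  # the flags take effect as the framework initializes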
-cudnn_exhaustive_search +FLAGS_cudnn_exhaustive_search ******************************************* (since 1.2.0) diff --git a/doc/fluid/flags/data_cn.rst b/doc/fluid/flags/data_cn.rst index 061207dcb75ea4fdab65d56948326d7f4d65b158..db4bd5e3cd30298d10ad9c89d0f38e2407e037b5 100755 --- a/doc/fluid/flags/data_cn.rst +++ b/doc/fluid/flags/data_cn.rst @@ -3,7 +3,7 @@ ================== -enable_cublas_tensor_op_math +FLAGS_enable_cublas_tensor_op_math ******************************************* (始于1.2.0) @@ -15,10 +15,10 @@ Bool型,缺省值为False。 示例 ------- -enable_cublas_tensor_op_math=True - 使用Tensor Core。 +FLAGS_enable_cublas_tensor_op_math=True - 使用Tensor Core。 -use_mkldnn +FLAGS_use_mkldnn ******************************************* (始于0.13.0) diff --git a/doc/fluid/flags/data_en.rst b/doc/fluid/flags/data_en.rst index 96c29933cf6e2326008183f51ff8d4e30e5200df..c156a37dd044b644078b838e0705b977cd86d43b 100755 --- a/doc/fluid/flags/data_en.rst +++ b/doc/fluid/flags/data_en.rst @@ -2,7 +2,7 @@ data processing ================== -enable_cublas_tensor_op_math +FLAGS_enable_cublas_tensor_op_math ******************************************* (since 1.2.0) @@ -14,10 +14,10 @@ Bool. The default value is False. Example ------- -enable_cublas_tensor_op_math=True will use Tensor Core. +FLAGS_enable_cublas_tensor_op_math=True will use Tensor Core. -use_mkldnn +FLAGS_use_mkldnn ******************************************* (since 0.13.0) diff --git a/doc/fluid/flags/debug_cn.rst b/doc/fluid/flags/debug_cn.rst index 2927ae483c3f60afe0e500e484f97a42e58894ed..f414de88cb908a31752bb1fa32144697a21b2375 100755 --- a/doc/fluid/flags/debug_cn.rst +++ b/doc/fluid/flags/debug_cn.rst @@ -3,7 +3,7 @@ ================== -check_nan_inf +FLAGS_check_nan_inf ******************** (始于0.13.0) @@ -18,7 +18,7 @@ Bool型,缺省值为False。 FLAGS_check_nan_inf=True - 检查Operator的结果是否含有Nan或Inf。 -cpu_deterministic +FLAGS_cpu_deterministic ******************************************* (始于0.15.0) @@ -33,7 +33,7 @@ Bool型,缺省值为False。 FLAGS_cpu_deterministic=True - 在CPU侧确定计算结果。 -enable_rpc_profiler +FLAGS_enable_rpc_profiler ******************************************* (始于1.0.0) @@ -48,7 +48,7 @@ Bool型,缺省值为False。 FLAGS_enable_rpc_profiler=True - 启用RPC分析器并在分析器文件中记录时间线。 -multiple_of_cupti_buffer_size +FLAGS_multiple_of_cupti_buffer_size ******************************************* (始于1.4.0) @@ -63,7 +63,7 @@ Int32型,缺省值为1。 FLAGS_multiple_of_cupti_buffer_size=1 - 将CUPTI设备缓冲区大小的倍数设为1。 -reader_queue_speed_test_mode +FLAGS_reader_queue_speed_test_mode ******************************************* (始于1.1.0) diff --git a/doc/fluid/flags/debug_en.rst b/doc/fluid/flags/debug_en.rst index cc62d76fbcaeb6993700e5bee944842c3a58d3a7..39b93240d423d30fafad45803bffb523927d34ab 100755 --- a/doc/fluid/flags/debug_en.rst +++ b/doc/fluid/flags/debug_en.rst @@ -2,7 +2,7 @@ debug ================== -check_nan_inf +FLAGS_check_nan_inf ************************************** (since 0.13.0) @@ -17,7 +17,7 @@ Example FLAGS_check_nan_inf=True will check the result of Operator whether the result has Nan or Inf. -cpu_deterministic +FLAGS_cpu_deterministic ******************************************* (since 0.15.0) @@ -32,7 +32,7 @@ Example FLAGS_cpu_deterministic=True will make the result of computation deterministic in CPU side. -enable_rpc_profiler +FLAGS_enable_rpc_profiler ******************************************* (Since 1.0.0) @@ -47,7 +47,7 @@ Example FLAGS_enable_rpc_profiler=True will enable rpc profiler and record the timeline to profiler file. 
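As a concrete illustration of the debugging flags above, the following sketch assumes FLAGS_check_nan_inf is honored from the environment and uses log(0), which produces -inf, to trigger the check:

.. code-block:: python

    import os
    os.environ['FLAGS_check_nan_inf'] = 'True'  # must be set before paddle.fluid is imported
    import numpy as np
    import paddle.fluid as fluid

    x = fluid.layers.data(name='x', shape=[1], dtype='float32')
    y = fluid.layers.log(x)  # log(0) yields -inf
    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(fluid.default_startup_program())
    # With the flag on, this run reports the op whose output contains inf.
    exe.run(feed={'x': np.zeros((1, 1), dtype='float32')}, fetch_list=[y])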
-multiple_of_cupti_buffer_size +FLAGS_multiple_of_cupti_buffer_size ******************************************* (since 1.4.0) @@ -62,7 +62,7 @@ Example FLAGS_multiple_of_cupti_buffer_size=1 set the multiple of the CUPTI device buffer size to 1. -reader_queue_speed_test_mode +FLAGS_reader_queue_speed_test_mode ******************************************* (since 1.1.0) diff --git a/doc/fluid/flags/device_cn.rst b/doc/fluid/flags/device_cn.rst index 7fda8d6922f02368ed50cf7b5c6261f504acac0b..0bed575e98c8fead62ecdc0b71f5ec9922a4fbff 100755 --- a/doc/fluid/flags/device_cn.rst +++ b/doc/fluid/flags/device_cn.rst @@ -3,7 +3,7 @@ ================== -paddle_num_threads +FLAGS_paddle_num_threads ******************************************* (始于0.15.0) @@ -18,7 +18,7 @@ Int32型,缺省值为1。 FLAGS_paddle_num_threads=2 - 将每个实例的最大线程数设为2。 -selected_gpus +FLAGS_selected_gpus ******************************************* (始于1.3) diff --git a/doc/fluid/flags/device_en.rst b/doc/fluid/flags/device_en.rst index eeae4d68d2de894dbd83844b2a251abf099855db..5397ee9fc9c4bd6baba3682eaffc8bf3cd66e37c 100755 --- a/doc/fluid/flags/device_en.rst +++ b/doc/fluid/flags/device_en.rst @@ -3,7 +3,7 @@ device management ================== -paddle_num_threads +FLAGS_paddle_num_threads ******************************************* (since 0.15.0) @@ -18,7 +18,7 @@ Example FLAGS_paddle_num_threads=2 will enable 2 threads as max number of threads for each instance. -selected_gpus +FLAGS_selected_gpus ******************************************* (since 1.3) diff --git a/doc/fluid/flags/distributed_cn.rst b/doc/fluid/flags/distributed_cn.rst index 5bf67884c519e4a01a362f1b4c15b1eac257a53b..8c869ab46eb7dc79c9e6963bad8dda4c313914d8 100755 --- a/doc/fluid/flags/distributed_cn.rst +++ b/doc/fluid/flags/distributed_cn.rst @@ -3,7 +3,7 @@ ================== -communicator_fake_rpc +FLAGS_communicator_fake_rpc ********************** (始于1.5.0) @@ -22,7 +22,7 @@ FLAGS_communicator_fake_rpc=True - 启用通信器fake模式。 该flag仅用于paddlepaddle的开发者,普通用户不应对其设置。 -communicator_independent_recv_thread +FLAGS_communicator_independent_recv_thread ************************************** (始于1.5.0) @@ -41,7 +41,7 @@ FLAGS_communicator_independent_recv_thread=True - 使用独立线程以从参数 开发者使用该flag进行框架的调试与优化,普通用户不应对其设置。 -communicator_max_merge_var_num +FLAGS_communicator_max_merge_var_num ************************************** (始于1.5.0) @@ -60,7 +60,7 @@ FLAGS_communicator_max_merge_var_num=16 - 将要通过通信器合并为一个 该flag和训练器线程数有着密切关联,缺省值应和线程数一致。 -communicator_merge_sparse_grad +FLAGS_communicator_merge_sparse_grad ******************************************* (始于1.5.0) @@ -79,11 +79,11 @@ FLAGS_communicator_merge_sparse_grad=true - 设置合并稀疏梯度。 合并稀疏梯度会耗费时间。如果重复ID较多,内存占用会变少,通信会变快;如果重复ID较少,则并不会节约内存。 -communicator_min_send_grad_num_before_recv +FLAGS_communicator_min_send_grad_num_before_recv ******************************************* (始于1.5.0) -在通信器中,有一个发送线程向参数服务器发送梯度,一个接收线程从参数服务器接收参数,且它们之间彼此独立。该flag用于控制接收线程的频率。 仅当发送线程至少发送communicator_min_send_grad_num_before_recv数量的梯度时,接收线程才会从参数服务器接收参数。 +在通信器中,有一个发送线程向参数服务器发送梯度,一个接收线程从参数服务器接收参数,且它们之间彼此独立。该flag用于控制接收线程的频率。 仅当发送线程至少发送FLAGS_communicator_min_send_grad_num_before_recv数量的梯度时,接收线程才会从参数服务器接收参数。 取值范围 --------------- @@ -98,7 +98,7 @@ FLAGS_communicator_min_send_grad_num_before_recv=10 - 在接收线程从参数 由于该flag和训练器的训练线程数强相关,而每个训练线程都会发送其梯度,所以缺省值应和线程数一致。 -communicator_send_queue_size +FLAGS_communicator_send_queue_size ******************************************* (始于1.5.0) @@ -117,7 +117,7 @@ FLAGS_communicator_send_queue_size=10 - 设置每个梯度的队列大小为10 
该flag会影响训练速度,若队列大小过大,速度会变快但结果可能会变差。 -communicator_send_wait_times +FLAGS_communicator_send_wait_times ******************************************* (始于1.5.0) @@ -132,7 +132,7 @@ Int32型,缺省值为5。 FLAGS_communicator_send_wait_times=5 - 将合并数没有达到max_merge_var_num的情况下发送线程等待的次数设为5。 -communicator_thread_pool_size +FLAGS_communicator_thread_pool_size ******************************************* (始于1.5.0) @@ -151,7 +151,7 @@ FLAGS_communicator_thread_pool_size=10 - 设置线程池大小为10。 大部分情况下,用户不需要设置该flag。 -dist_threadpool_size +FLAGS_dist_threadpool_size ******************************************* (始于1.0.0) @@ -166,7 +166,7 @@ Int32型,缺省值为0。 FLAGS_dist_threadpool_size=10 - 将用于分布式模块的最大线程数设为10。 -rpc_deadline +FLAGS_rpc_deadline ******************************************* (始于1.0.0) @@ -181,11 +181,11 @@ Int32型,缺省值为180000,单位为ms。 FLAGS_rpc_deadline=180000 - 将deadline超时设为3分钟。 -rpc_disable_reuse_port +FLAGS_rpc_disable_reuse_port ******************************************* (始于1.2.0) -rpc_disable_reuse_port为True时,grpc的 GRPC_ARG_ALLOW_REUSEPORT会被设置为False以禁用SO_REUSEPORT。 +FLAGS_rpc_disable_reuse_port为True时,grpc的 GRPC_ARG_ALLOW_REUSEPORT会被设置为False以禁用SO_REUSEPORT。 取值范围 --------------- @@ -196,7 +196,7 @@ Bool型,缺省值为False。 FLAGS_rpc_disable_reuse_port=True - 禁用SO_REUSEPORT。 -rpc_get_thread_num +FLAGS_rpc_get_thread_num ******************************************* (始于1.0.0) @@ -211,7 +211,7 @@ Int32型,缺省值为12。 FLAGS_rpc_get_thread_num=6 - 将从参数服务器获取参数的线程数设为6。 -rpc_send_thread_num +FLAGS_rpc_send_thread_num ******************************************* (始于1.0.0) @@ -226,11 +226,11 @@ Int32型,缺省值为12。 FLAGS_rpc_send_thread_num=6 - 将用于发送的线程数设为6。 -rpc_server_profile_path +FLAGS_rpc_server_profile_path ******************************************* since(v0.15.0) -设置分析器输出日志文件路径前缀。完整路径为rpc_server_profile_path_listener_id,其中listener_id为随机数。 +设置分析器输出日志文件路径前缀。完整路径为FLAGS_rpc_server_profile_path_listener_id,其中listener_id为随机数。 取值范围 --------------- diff --git a/doc/fluid/flags/distributed_en.rst b/doc/fluid/flags/distributed_en.rst index adc7fa181e17e26867a2f699323f145e2b9d816f..d71803cc62e02b8695e2baceb64eb147011716f1 100755 --- a/doc/fluid/flags/distributed_en.rst +++ b/doc/fluid/flags/distributed_en.rst @@ -2,7 +2,7 @@ distributed ================== -communicator_fake_rpc +FLAGS_communicator_fake_rpc ************************************** (since 1.5.0) @@ -21,7 +21,7 @@ Note This flag is only for developer of paddlepaddle, user should not set it. -communicator_independent_recv_thread +FLAGS_communicator_independent_recv_thread ************************************** (since 1.5.0) @@ -40,7 +40,7 @@ Note This flag is for developer to debug and optimize the framework. User should not set it. -communicator_max_merge_var_num +FLAGS_communicator_max_merge_var_num ************************************** (since 1.5.0) @@ -59,7 +59,7 @@ Note This flag has strong relationship with trainer thread num. The default value should be the same with thread num. -communicator_merge_sparse_grad +FLAGS_communicator_merge_sparse_grad ******************************* (since 1.5.0) @@ -78,11 +78,11 @@ Note Merging sparse gradient would be time-consuming. If the sparse gradient has many duplicated ids, it will save memory and communication could be much faster. Otherwise it will not save memory. 
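Several communicator flags on this page are documented as defaulting to the trainer thread count; a hedged sketch of keeping them consistent (values are purely illustrative, and reading FLAGS_* from the environment is assumed):

.. code-block:: python

    import os
    # Illustrative: the notes on this page suggest matching these to the number of training threads.
    trainer_threads = 4
    os.environ['FLAGS_communicator_max_merge_var_num'] = str(trainer_threads)
    os.environ['FLAGS_communicator_min_send_grad_num_before_recv'] = str(trainer_threads)
    import paddle.fluid as fluid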
-communicator_min_send_grad_num_before_recv
+FLAGS_communicator_min_send_grad_num_before_recv
*******************************************
(since 1.5.0)

-In communicator, there is one send thread that send gradient to parameter server and one receive thread that receive parameter from parameter server. They work independently. This flag is used to control the frequency of receive thread. Only when the send thread send at least communicator_min_send_grad_num_before_recv gradients will the receive thread receive parameter from parameter server.
+In communicator, there is one send thread that sends gradients to the parameter server and one receive thread that receives parameters from the parameter server. They work independently. This flag is used to control the frequency of the receive thread. Only when the send thread sends at least FLAGS_communicator_min_send_grad_num_before_recv gradients will the receive thread receive parameters from the parameter server.

Values accepted
---------------
@@ -97,7 +97,7 @@ Note
 This flag has a strong relation with the training threads of the trainer, because each training thread will send its grad. So the default value should be the training thread num.

-communicator_send_queue_size
+FLAGS_communicator_send_queue_size
*******************************************
(since 1.5.0)

@@ -116,7 +116,7 @@ Note
 This flag will affect the training speed: if the queue size is larger, the speed may be faster, but the result may be worse.

-communicator_send_wait_times
+FLAGS_communicator_send_wait_times
*******************************************
(since 1.5.0)

@@ -131,7 +131,7 @@ Example
 FLAGS_communicator_send_wait_times=5 sets to 5 the number of times the send thread will wait if the merge number does not reach max_merge_var_num.

-communicator_thread_pool_size
+FLAGS_communicator_thread_pool_size
*******************************************
(since 1.5.0)

@@ -150,7 +150,7 @@ Note
 Most of the time users do not need to set this flag.

-dist_threadpool_size
+FLAGS_dist_threadpool_size
*******************************************
(Since 1.0.0)

@@ -165,7 +165,7 @@ Example
 FLAGS_dist_threadpool_size=10 will enable 10 threads as the max number of threads used for the distributed module.

-rpc_deadline
+FLAGS_rpc_deadline
*******************************************
(Since 1.0.0)

@@ -180,11 +180,11 @@ Example
 FLAGS_rpc_deadline=180000 will set the deadline timeout to 3 minutes.

-rpc_disable_reuse_port
+FLAGS_rpc_disable_reuse_port
*******************************************
(since 1.2.0)

-When rpc_disable_reuse_port is true, the flag of grpc GRPC_ARG_ALLOW_REUSEPORT will be set to false to
+When FLAGS_rpc_disable_reuse_port is true, the flag of grpc GRPC_ARG_ALLOW_REUSEPORT will be set to false to
 disable the use of SO_REUSEPORT if it's available.

Values accepted
---------------
@@ -196,7 +196,7 @@ Example
 FLAGS_rpc_disable_reuse_port=True will disable the use of SO_REUSEPORT.

-rpc_get_thread_num
+FLAGS_rpc_get_thread_num
*******************************************
(Since 1.0.0)

@@ -211,7 +211,7 @@ Example
 FLAGS_rpc_get_thread_num=6 will use 6 threads to get parameters from the parameter server.

-rpc_send_thread_num
+FLAGS_rpc_send_thread_num
*******************************************
(Since 1.0.0)

@@ -226,11 +226,11 @@ Example
 FLAGS_rpc_send_thread_num=6 will set the number of threads used for sending to 6.

-rpc_server_profile_path
+FLAGS_rpc_server_profile_path
*******************************************
since(v0.15.0)

-Set the profiler output log file path prefix. The complete path will be rpc_server_profile_path_listener_id, listener_id is a random number. 
+Set the profiler output log file path prefix. The complete path will be FLAGS_rpc_server_profile_path_listener_id, listener_id is a random number. Values accepted --------------- diff --git a/doc/fluid/flags/executor_cn.rst b/doc/fluid/flags/executor_cn.rst index e94b4671cd7c22858f49732079a63eb010d0e198..56c3c7b04dc71029d8d4b42a04f07bf752de3181 100755 --- a/doc/fluid/flags/executor_cn.rst +++ b/doc/fluid/flags/executor_cn.rst @@ -3,7 +3,7 @@ ================== -enable_parallel_graph +FLAGS_enable_parallel_graph ******************************************* (始于1.2.0) @@ -18,7 +18,7 @@ Bool型,缺省值为False。 FLAGS_enable_parallel_graph=False - 通过ParallelExecutor强制禁用并行图执行模式。 -pe_profile_fname +FLAGS_pe_profile_fname ******************************************* (始于1.3.0) @@ -33,7 +33,7 @@ String型,缺省值为empty ("")。 FLAGS_pe_profile_fname="./parallel_executor.perf" - 将配置文件结果存储在parallel_executor.perf中。 -print_sub_graph_dir +FLAGS_print_sub_graph_dir ******************************************* (始于1.2.0) @@ -48,7 +48,7 @@ String型,缺省值为empty ("")。 FLAGS_print_sub_graph_dir="./sub_graphs.txt" - 将断开连接的子图打印到"./sub_graphs.txt"。 -use_ngraph +FLAGS_use_ngraph ******************************************* (始于1.4.0) diff --git a/doc/fluid/flags/executor_en.rst b/doc/fluid/flags/executor_en.rst index ccc5c5f92f9ba51f02657e481f9f5af27b719f49..7a262c001639a90b86595cd2d4a5607d75f80d59 100755 --- a/doc/fluid/flags/executor_en.rst +++ b/doc/fluid/flags/executor_en.rst @@ -3,7 +3,7 @@ executor ================== -enable_parallel_graph +FLAGS_enable_parallel_graph ******************************************* (since 1.2.0) @@ -18,7 +18,7 @@ Example FLAGS_enable_parallel_graph=False will force disable parallel graph execution mode by ParallelExecutor. -pe_profile_fname +FLAGS_pe_profile_fname ******************************************* (since 1.3.0) @@ -33,7 +33,7 @@ Example FLAGS_pe_profile_fname="./parallel_executor.perf" will store the profile result to parallel_executor.perf. -print_sub_graph_dir +FLAGS_print_sub_graph_dir ******************************************* (since 1.2.0) @@ -48,7 +48,7 @@ Example FLAGS_print_sub_graph_dir="./sub_graphs.txt" will print the disconnected subgraphs to "./sub_graphs.txt". 
-use_ngraph +FLAGS_use_ngraph ******************************************* (since 1.4.0) diff --git a/doc/fluid/flags/memory_cn.rst b/doc/fluid/flags/memory_cn.rst index 6c09e750a03a18033f2933326884bc9e1af54c9e..8198b57b42cc662b689a5339d6714ea10fec2855 100755 --- a/doc/fluid/flags/memory_cn.rst +++ b/doc/fluid/flags/memory_cn.rst @@ -3,7 +3,7 @@ ================== -allocator_strategy +FLAGS_allocator_strategy ******************** (始于1.2) @@ -20,7 +20,7 @@ FLAGS_allocator_strategy=legacy - 使用legacy分配器。 FLAGS_allocator_strategy=naive_best_fit - 使用新设计的分配器。 -eager_delete_scope +FLAGS_eager_delete_scope ******************************************* (始于0.12.0) @@ -35,7 +35,7 @@ Bool型,缺省值为True。 FLAGS_eager_delete_scope=True - 同步局域删除。 -eager_delete_tensor_gb +FLAGS_eager_delete_tensor_gb ******************************************* (始于1.0.0) @@ -58,7 +58,7 @@ FLAGS_eager_delete_tensor_gb=-1.0 - 禁用垃圾回收策略。 建议用户在训练大型网络时设置FLAGS_eager_delete_tensor_gb=0.0以启用垃圾回收策略。 -enable_inplace_whitelist +FLAGS_enable_inplace_whitelist ******************************************* (始于1.4) @@ -73,7 +73,7 @@ Bool型,缺省值为False。 FLAGS_enable_inplace_whitelist=True - 在特定op上禁止内存原位复用优化。 -fast_eager_deletion_mode +FLAGS_fast_eager_deletion_mode ******************************************* (始于1.3) @@ -90,7 +90,7 @@ FLAGS_fast_eager_deletion_mode=True - 启用快速垃圾回收策略。 FLAGS_fast_eager_deletion_mode=False - 禁用快速垃圾回收策略。 -fraction_of_gpu_memory_to_use +FLAGS_fraction_of_gpu_memory_to_use ******************************************* (始于1.2.0) @@ -109,7 +109,7 @@ FLAGS_fraction_of_gpu_memory_to_use=0.1 - 分配总GPU内存大小的10%作为 Windows系列平台会将FLAGS_fraction_of_gpu_memory_to_use默认设为0.5,Linux则会默认设为0.92。 -free_idle_memory +FLAGS_free_idle_memory ******************************************* (始于0.15.0) @@ -126,7 +126,7 @@ FLAGS_free_idle_memory=True - 空闲内存太多时释放。 FLAGS_free_idle_memory=False - 不释放空闲内存。 -fuse_parameter_groups_size +FLAGS_fuse_parameter_groups_size ******************************************* (始于1.4.0) @@ -141,7 +141,7 @@ Int32型,缺省值为3。 FLAGS_fuse_parameter_groups_size=3 - 将单组参数的梯度大小设为3。 -fuse_parameter_memory_size +FLAGS_fuse_parameter_memory_size ******************************************* (始于1.5.0) @@ -156,7 +156,7 @@ Double型,缺省值为-1.0。 FLAGS_fuse_parameter_memory_size=16 - 将单组参数梯度的上限大小设为16MB。 -init_allocated_mem +FLAGS_init_allocated_mem ******************************************* (始于0.15.0) @@ -173,7 +173,7 @@ FLAGS_init_allocated_mem=True - 对分配的内存进行非零初始化。 FLAGS_init_allocated_mem=False - 不会对分配的内存进行非零初始化。 -initial_cpu_memory_in_mb +FLAGS_initial_cpu_memory_in_mb ******************************************* (始于0.14.0) @@ -188,7 +188,7 @@ Uint64型,缺省值为500,单位为MB。 FLAGS_initial_cpu_memory_in_mb=100 - 在FLAGS_fraction_of_cpu_memory_to_use*(总物理内存)大于100MB的情况下,首次提出分配请求时,分配器预先分配100MB内存,并在预分配的内存耗尽时再次分配100MB。 -initial_gpu_memory_in_mb +FLAGS_initial_gpu_memory_in_mb ******************************************* (始于1.4.0) @@ -207,7 +207,7 @@ FLAGS_initial_gpu_memory_in_mb=4096 - 分配4GB作为初始GPU内存块大小。 如果设置该flag,则FLAGS_fraction_of_gpu_memory_to_use设置的内存大小将被该flag覆盖。如果未设置该flag,PaddlePaddle将使用FLAGS_fraction_of_gpu_memory_to_use分配GPU内存。 -limit_of_tmp_allocation +FLAGS_limit_of_tmp_allocation ******************************************* (始于1.3) @@ -222,7 +222,7 @@ Int64型,缺省值为-1。 FLAGS_limit_of_tmp_allocation=1024 - 将temporary_allocation大小的上限设为1024字节。 -memory_fraction_of_eager_deletion +FLAGS_memory_fraction_of_eager_deletion ******************************************* (始于1.4) @@ -242,7 +242,7 @@ FLAGS_memory_fraction_of_eager_deletion=1 - 释放所有临时变量。 
FLAGS_memory_fraction_of_eager_deletion=0.5 - 仅释放50%比例的占用内存最多的变量。 -reallocate_gpu_memory_in_mb +FLAGS_reallocate_gpu_memory_in_mb ******************************************* (始于1.4.0) @@ -261,7 +261,7 @@ FLAGS_reallocate_gpu_memory_in_mb=1024 - 如果耗尽了分配的GPU内存块, 如果设置了该flag,PaddlePaddle将重新分配该flag指定大小的gpu内存。否则分配FLAGS_fraction_of_gpu_memory_to_use指定比例的gpu内存。 -times_excess_than_required_tmp_allocation +FLAGS_times_excess_than_required_tmp_allocation ******************************************* (始于1.3) @@ -276,7 +276,7 @@ Int64型,缺省值为2。 FLAGS_times_excess_than_required_tmp_allocation=1024 - 设置TemporaryAllocator可以返回的最大大小为1024*N。 -use_pinned_memory +FLAGS_use_pinned_memory ******************************************* (始于0.12.0) diff --git a/doc/fluid/flags/memory_en.rst b/doc/fluid/flags/memory_en.rst index 1e407ac4a6a053f616aeb107786817db01c28783..0411801e61dd96b972d439e2d0f37644b7fb3eb4 100755 --- a/doc/fluid/flags/memory_en.rst +++ b/doc/fluid/flags/memory_en.rst @@ -3,7 +3,7 @@ memory management ================== -allocator_strategy +FLAGS_allocator_strategy ************************************** (since 1.2) @@ -21,7 +21,7 @@ FLAGS_allocator_strategy=naive_best_fit would use the new-designed allocator. -eager_delete_scope +FLAGS_eager_delete_scope ******************************************* (since 0.12.0) @@ -36,7 +36,7 @@ Example FLAGS_eager_delete_scope=True will make scope delete synchronously. -eager_delete_tensor_gb +FLAGS_eager_delete_tensor_gb ******************************************* (since 1.0.0) @@ -60,7 +60,7 @@ It is recommended that users enable garbage collection strategy by setting FLAGS -enable_inplace_whitelist +FLAGS_enable_inplace_whitelist ******************************************* (since 1.4) @@ -76,7 +76,7 @@ FLAGS_enable_inplace_whitelist=True would disable memory in-place optimization o -fast_eager_deletion_mode +FLAGS_fast_eager_deletion_mode ******************************************* (since 1.3) @@ -93,7 +93,7 @@ FLAGS_fast_eager_deletion_mode=True would turn on fast garbage collection strate FLAGS_fast_eager_deletion_mode=False would turn off fast garbage collection strategy. -fraction_of_gpu_memory_to_use +FLAGS_fraction_of_gpu_memory_to_use ******************************************* (since 1.2.0) @@ -113,7 +113,7 @@ Windows series platform will set FLAGS_fraction_of_gpu_memory_to_use to 0.5 by d Linux will set FLAGS_fraction_of_gpu_memory_to_use to 0.92 by default. -free_idle_memory +FLAGS_free_idle_memory ******************************************* (since 0.15.0) @@ -130,7 +130,7 @@ FLAGS_free_idle_memory=True will free idle memory when there is too much of it. FLAGS_free_idle_memory=False will not free idle memory. -fuse_parameter_groups_size +FLAGS_fuse_parameter_groups_size ******************************************* (since 1.4.0) @@ -146,7 +146,7 @@ FLAGS_fuse_parameter_groups_size=3 will set the size of one group parameters' gr -fuse_parameter_memory_size +FLAGS_fuse_parameter_memory_size ******************************************* (since 1.5.0) @@ -161,7 +161,7 @@ Example FLAGS_fuse_parameter_memory_size=16 set the up limited memory size of one group parameters' gradient to 16 Megabytes. -init_allocated_mem +FLAGS_init_allocated_mem ******************************************* (since 0.15.0) @@ -178,7 +178,7 @@ FLAGS_init_allocated_mem=True will make the allocated memory initialize as a non FLAGS_init_allocated_mem=False will not initialize the allocated memory. 
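Tying the garbage-collection flags on this page together, a hedged sketch of the recommended configuration (eager deletion on, fast path enabled, all temporary variables eligible; it is assumed FLAGS_* are read from the environment at initialization):

.. code-block:: python

    import os
    # Illustrative combination following the recommendation for FLAGS_eager_delete_tensor_gb above.
    os.environ['FLAGS_eager_delete_tensor_gb'] = '0.0'
    os.environ['FLAGS_fast_eager_deletion_mode'] = 'True'
    os.environ['FLAGS_memory_fraction_of_eager_deletion'] = '1.0'
    import paddle.fluid as fluid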
-initial_cpu_memory_in_mb +FLAGS_initial_cpu_memory_in_mb ******************************************* (since 0.14.0) @@ -193,7 +193,7 @@ Example FLAGS_initial_cpu_memory_in_mb=100, if FLAGS_fraction_of_cpu_memory_to_use*(total physical memory) > 100MB, then allocator will pre-allocate 100MB when first allocation request raises, and re-allocate 100MB again when the pre-allocated memory is exhaustive. -initial_gpu_memory_in_mb +FLAGS_initial_gpu_memory_in_mb ******************************************* (since 1.4.0) @@ -213,7 +213,7 @@ If you set this flag, the memory size set by FLAGS_fraction_of_gpu_memory_to_use If you don't set this flag, PaddlePaddle will use FLAGS_fraction_of_gpu_memory_to_use to allocate gpu memory. -limit_of_tmp_allocation +FLAGS_limit_of_tmp_allocation ******************************************* (since 1.3) @@ -228,7 +228,7 @@ Example FLAGS_limit_of_tmp_allocation=1024 will set the up limit of temporary_allocation size to 1024 bytes. -memory_fraction_of_eager_deletion +FLAGS_memory_fraction_of_eager_deletion ******************************************* (since 1.4) @@ -248,7 +248,7 @@ FLAGS_memory_fraction_of_eager_deletion=1 would release all temporary variables. FLAGS_memory_fraction_of_eager_deletion=0.5 would only release 50% of variables with largest memory size. -reallocate_gpu_memory_in_mb +FLAGS_reallocate_gpu_memory_in_mb ******************************************* (since 1.4.0) @@ -268,12 +268,12 @@ If this flag is set, PaddlePaddle will reallocate the gpu memory with size speci Else PaddlePaddle will reallocate with size set by FLAGS_fraction_of_gpu_memory_to_use. -times_excess_than_required_tmp_allocation +FLAGS_times_excess_than_required_tmp_allocation ******************************************* (since 1.3) The FLAGS_times_excess_than_required_tmp_allocation indicates the max size the TemporaryAllocator can return. For Example -, if the required memory size is N, and times_excess_than_required_tmp_allocation is 2.0, the TemporaryAllocator will return the available allocation that the range of size is N ~ 2*N. +, if the required memory size is N, and FLAGS_times_excess_than_required_tmp_allocation is 2.0, the TemporaryAllocator will return the available allocation that the range of size is N ~ 2*N. Values accepted --------------- @@ -284,7 +284,7 @@ Example FLAGS_times_excess_than_required_tmp_allocation=1024 will set the max size of the TemporaryAllocator can return to 1024*N. 
-use_pinned_memory +FLAGS_use_pinned_memory ******************************************* (始于0.12.0) diff --git a/doc/fluid/flags/others_cn.rst b/doc/fluid/flags/others_cn.rst index 3e8f2ed7924214aff25ab339aecef4316b083700..3cd6f6a6a5af98f94fc2c415b3daa92580eeea00 100755 --- a/doc/fluid/flags/others_cn.rst +++ b/doc/fluid/flags/others_cn.rst @@ -4,7 +4,7 @@ -benchmark +FLAGS_benchmark ******************** (始于0.12.0) @@ -19,7 +19,7 @@ Bool型,缺省值为False。 FLAGS_benchmark=True - 同步以测试基准。 -inner_op_parallelism +FLAGS_inner_op_parallelism ******************************************* (始于1.3.0) @@ -38,7 +38,7 @@ FLAGS_inner_op_parallelism=5 - 将operator内的线程数设为5。 目前只有稀疏的adam op支持inner_op_parallelism。 -max_body_size +FLAGS_max_body_size ******************************************* (始于1.0.0) @@ -53,7 +53,7 @@ Int32型,缺省值为2147483647。 FLAGS_max_body_size=2147483647 - 将BRPC消息大小设为2147483647。 -sync_nccl_allreduce +FLAGS_sync_nccl_allreduce ******************************************* (始于1.3) @@ -68,7 +68,7 @@ Bool型,缺省值为True。 FLAGS_sync_nccl_allreduce=True - 在allreduce_op_handle中调用 `cudaStreamSynchronize(nccl_stream)` 。 -tracer_profile_fname +FLAGS_tracer_profile_fname ******************************************* (始于1.4.0) diff --git a/doc/fluid/flags/others_en.rst b/doc/fluid/flags/others_en.rst index 9b03141e13e4e48b61d1ffbf08b1c8cf46c747b2..91fdfdf021df171a7527adc4f105e40c41b87777 100755 --- a/doc/fluid/flags/others_en.rst +++ b/doc/fluid/flags/others_en.rst @@ -4,7 +4,7 @@ others -benchmark +FLAGS_benchmark ************************************** (since 0.12.0) @@ -19,7 +19,7 @@ Example FLAGS_benchmark=True will do some synchronizations for benchmarking. -inner_op_parallelism +FLAGS_inner_op_parallelism ******************************************* (since 1.3.0) @@ -38,7 +38,7 @@ Note: currently only the sparse adam op supports inner_op_parallelism. -max_body_size +FLAGS_max_body_size ******************************************* (since 1.0.0) @@ -53,7 +53,7 @@ Example FLAGS_max_body_size=2147483647 will set the BRPC message size to 2147483647. -sync_nccl_allreduce +FLAGS_sync_nccl_allreduce ******************************************* (since 1.3) @@ -68,7 +68,7 @@ Example FLAGS_sync_nccl_allreduce=True will call `cudaStreamSynchronize(nccl_stream)` in allreduce_op_handle. -tracer_profile_fname +FLAGS_tracer_profile_fname ******************************************* (since 1.4.0) diff --git a/doc/fluid/user_guides/howto/basic_concept/index_cn.rst b/doc/fluid/user_guides/howto/basic_concept/index_cn.rst index 010b4bfaa29398cf66b7742a5818f887f760e463..9faf80228f08d6fb04b827d71865183d265ed5dc 100644 --- a/doc/fluid/user_guides/howto/basic_concept/index_cn.rst +++ b/doc/fluid/user_guides/howto/basic_concept/index_cn.rst @@ -1,12 +1,374 @@ -############ -基本概念 -############ +.. _cn_user_guide_lod_tensor: -本文介绍Fluid版本基本使用概念: +################## +LoD-Tensor使用说明 +################## -- `LoD-Tensor使用说明 `_ : LoD-Tensor是Fluid中特有的概念,它在Tensor基础上附加了序列信息,支持处理变长数据。 +LoD(Level-of-Detail) Tensor是Fluid中特有的概念,它在Tensor基础上附加了序列信息。Fluid中可传输的数据包括:输入、输出、网络中的可学习参数,全部统一使用LoD-Tensor表示。 -..
toctree:: - :hidden: +阅读本文档将帮助您了解 Fluid 中的 LoD-Tensor 设计思想,以便您更灵活地使用这一数据类型。 + +变长序列的挑战 +================ + +大多数的深度学习框架使用Tensor表示一个mini-batch。 + +例如一个mini-batch中有10张图片,每幅图片大小为32x32,则这个mini-batch是一个10x32x32的 Tensor。 + +或者在处理NLP任务中,一个mini-batch包含N个句子,每个字都用一个D维的one-hot向量表示,假设所有句子的长度都为L,那这个mini-batch可以被表示为NxLxD的Tensor。 + +上述两个例子中序列元素都具有相同大小,但是在许多情况下,训练数据是变长序列。基于这一场景,大部分框架采取的方法是确定一个固定长度,对小于这一长度的序列数据以0填充。 + +在Fluid中,由于LoD-Tensor的存在,我们不要求每个mini-batch中的序列数据必须保持长度一致,因此您不需要执行填充操作,也可以满足处理NLP等具有序列要求的任务需求。 + +Fluid引入了一个索引数据结构(LoD)来将张量分割成序列。 + + +LoD 索引 +=========== + +为了更好地理解LoD的概念,本节提供了几个例子供您参考: + +**句子组成的 mini-batch** + +假设一个mini-batch中有3个句子,每个句子中分别包含3个、1个和2个单词。我们可以用(3+1+2)xD维Tensor 加上一些索引信息来表示这个mini-batch: + +.. code-block :: text + + 3 1 2 + | | | | | | + +上述表示中,每一个 :code:`|` 代表一个D维的词向量,数字3,1,2构成了 1-level LoD。 + +**递归序列** + +让我们来看另一个2-level LoD-Tensor的例子:假设一个mini-batch中包含3篇文章,分别由3个、1个和2个句子构成,每个句子都由不同数量的单词组成,则这个mini-batch的样式可以看作: + +.. code-block:: text + + + 3 1 2 + 3 2 4 1 2 3 + ||| || |||| | || ||| + + +表示的LoD信息为: + +.. code-block:: text + + [[3,1,2]/*level=0*/,[3,2,4,1,2,3]/*level=1*/] + + +**视频的mini-batch** + +在视觉任务中,时常需要处理视频和图像这类高维对象,假设现存的一个mini-batch包含3个视频,分别有3个、1个和2个帧,每个帧都具有相同大小:640x480,则这个mini-batch可以被表示为: + +.. code-block:: text + + 3 1 2 + 口口口 口 口口 + + +最底层tensor大小为(3+1+2)x640x480,每一个 :code:`口` 表示一个640x480的图像 + +**图像的mini-batch** + +在传统的情况下,比如有N个固定大小的图像的mini-batch,LoD-Tensor表示为: + +.. code-block:: text + + 1 1 1 1 1 + 口口口口 ... 口 + +在这种情况下,我们不会因为索引值都为1而忽略信息,仅仅把LoD-Tensor看作是一个普通的张量: + +.. code-block:: text + + 口口口口 ... 口 + +**模型参数** + +模型参数只是一个普通的张量,在Fluid中它们被表示为一个0-level LoD-Tensor。 + +LoDTensor的偏移表示 +===================== + +为了快速访问基本序列,Fluid提供了一种偏移表示的方法——保存序列的开始和结束元素,而不是保存长度。 + +在上述例子中,您可以计算基本元素的长度: + +.. code-block:: text + + 3 2 4 1 2 3 + +将其转换为偏移表示: + +.. code-block:: text + + 0 3 5 9 10 12 15 + = = = = = = + 3 2+3 4+5 1+9 2+10 3+12 + +所以我们知道第一个句子是从单词0到单词3,第二个句子是从单词3到单词5。 + +类似地,LoD的顶层长度 + +.. code-block:: text + + 3 1 2 + +可以被转化成偏移形式: + +.. code-block:: text + + 0 3 4 6 + = = = + 3 3+1 4+2 + +因此该LoD-Tensor的偏移表示为: + +.. code-block:: text + + 0 3 4 6 + 3 5 9 10 12 15 + + +LoD-Tensor +============= +一个LoD-Tensor可以被看作是一个树的结构,树叶是基本的序列元素,树枝作为基本元素的标识。 + +在 Fluid 中 LoD-Tensor 的序列信息有两种表述形式:原始长度和偏移量。在 Paddle 内部采用偏移量的形式表述 LoD-Tensor,以获得更快的序列访问速度;在 python API中采用原始长度的形式表述 LoD-Tensor 方便用户理解和计算,并将原始长度称为: :code:`recursive_sequence_lengths` 。 + +以上文提到的一个2-level LoD-Tensor为例: + +.. code-block:: text + + 3 1 2 + 3 2 4 1 2 3 + ||| || |||| | || ||| + +- 以偏移量表示此 LoD-Tensor:[ [0,3,4,6] , [0,3,5,9,10,12,15] ], +- 以原始长度表达此 LoD-Tensor:recursive_sequence_lengths=[ [3-0 , 4-3 , 6-4] , [3-0 , 5-3 , 9-5 , 10-9 , 12-10 , 15-12] ]。 + + +以文字序列为例: [3,1,2] 可以表示这个mini-batch中有3篇文章,每篇文章分别有3、1、2个句子,[3,2,4,1,2,3] 表示每个句子中分别含有3、2、4、1、2、3个字。 + +recursive_seq_lens 是一个双层嵌套列表,也就是列表的列表,最外层列表的size表示嵌套的层数,也就是lod-level的大小;内部的每个列表,对应表示每个lod-level下,每个元素的大小。 + +下面三段代码分别介绍如何创建一个LoD-Tensor,如何将LoD-Tensor转换成Tensor,如何将Tensor转换成LoD-Tensor: + +* 创建 LoD-Tensor + +.. code-block:: python + + #创建lod-tensor + import paddle.fluid as fluid + import numpy as np + + a = fluid.create_lod_tensor(np.array([[1],[1],[1], + [1],[1], + [1],[1],[1],[1], + [1], + [1],[1], + [1],[1],[1]]).astype('int64') , + [[3,1,2] , [3,2,4,1,2,3]], + fluid.CPUPlace()) + + #查看lod-tensor嵌套层数 + print (len(a.recursive_sequence_lengths())) + # output:2 + + #查看最基础元素个数 + print (sum(a.recursive_sequence_lengths()[-1])) + # output:15 (3+2+4+1+2+3=15)
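作为补充,同一个 LoDTensor 可以分别以原始长度和偏移量两种形式查询序列信息。下面的示意接上例中的 a,输出值依据上文的 LoD 推断,仅供参考:

.. code-block:: python

    #两种序列信息表述方式的对应关系
    print (a.recursive_sequence_lengths())
    # [[3, 1, 2], [3, 2, 4, 1, 2, 3]] :原始长度
    print (a.lod())
    # [[0, 3, 4, 6], [0, 3, 5, 9, 10, 12, 15]] :偏移量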
+ +* LoD-Tensor 转 Tensor + +.. code-block:: python + + import paddle.fluid as fluid + import numpy as np + + # 创建一个 LoD-Tensor + a = fluid.create_lod_tensor(np.array([[1.1], [2.2],[3.3],[4.4]]).astype('float32'), [[1,3]], fluid.CPUPlace()) + + def LodTensor_to_Tensor(lod_tensor): + # 获取 LoD-Tensor 的 lod 信息 + lod = lod_tensor.lod() + # 转换成 array + array = np.array(lod_tensor) + new_array = [] + # 依照原LoD-Tensor的层级信息,转换成Tensor + for i in range(len(lod[0]) - 1): + new_array.append(array[lod[0][i]:lod[0][i + 1]]) + return new_array + + new_array = LodTensor_to_Tensor(a) + + # 输出结果 + print(new_array) + +* Tensor 转 LoD-Tensor + +.. code-block:: python + + import paddle.fluid as fluid + import numpy as np + + def to_lodtensor(data, place): + # 存储Tensor的长度作为LoD信息 + seq_lens = [len(seq) for seq in data] + cur_len = 0 + lod = [cur_len] + for l in seq_lens: + cur_len += l + lod.append(cur_len) + # 对待转换的 Tensor 降维 + flattened_data = np.concatenate(data, axis=0).astype("int64") + flattened_data = flattened_data.reshape([len(flattened_data), 1]) + # 为 Tensor 数据添加lod信息 + res = fluid.LoDTensor() + res.set(flattened_data, place) + res.set_lod([lod]) + return res + + # new_array 为上段代码中转换的Tensor + lod_tensor = to_lodtensor(new_array,fluid.CPUPlace()) + + # 输出 LoD 信息 + print("The LoD of the result: {}.".format(lod_tensor.lod())) + + # 检验与原Tensor数据是否一致 + print("The array : {}.".format(np.array(lod_tensor))) + + + + +代码示例 +=========== + +本节代码将根据指定的级别y-lod,扩充输入变量x。本例综合了LoD-Tensor的多个重要概念,跟随代码实现,您将: + +- 直观理解Fluid中 :code:`fluid.layers.sequence_expand` 的实现过程 +- 掌握如何在Fluid中创建LoD-Tensor +- 学习如何打印LoDTensor内容 + + + +**定义计算过程** + +layers.sequence_expand通过获取 y 的 lod 值对 x 的数据进行扩充,关于 :code:`fluid.layers.sequence_expand` 的功能说明,请先阅读 :ref:`cn_api_fluid_layers_sequence_expand` 。 + +序列扩充代码实现: + +.. code-block:: python + + x = fluid.layers.data(name='x', shape=[1], dtype='float32', lod_level=1) + y = fluid.layers.data(name='y', shape=[1], dtype='float32', lod_level=2) + out = fluid.layers.sequence_expand(x=x, y=y, ref_level=0) + +*说明*:输出LoD-Tensor的维度仅与传入的真实数据维度有关,在定义网络结构阶段为x、y设置的shape值,仅作为占位,并不影响结果。 + +**创建Executor** + +.. code-block:: python + + place = fluid.CPUPlace() + exe = fluid.Executor(place) + exe.run(fluid.default_startup_program()) + +**准备数据** + +这里我们调用 :code:`fluid.create_lod_tensor` 创建 :code:`sequence_expand` 的输入数据,通过定义 y_d 的 LoD 值,对 x_d 进行扩充。其中,输出值只与 y_d 的 LoD 值有关,y_d 的 data 值在这里并不参与计算,维度上与LoD[-1]一致即可。 + +:code:`fluid.create_lod_tensor()` 的使用说明请参考 :ref:`cn_api_fluid_create_lod_tensor` 。 + +实现代码如下: + +.. code-block:: python + + x_d = fluid.create_lod_tensor(np.array([[1.1],[2.2],[3.3],[4.4]]).astype('float32'), [[1,3]], place) + y_d = fluid.create_lod_tensor(np.array([[1.1],[1.1],[1.1],[1.1],[1.1],[1.1]]).astype('float32'), [[1,3], [2,1,2,1]],place) + + +**执行运算** + +在Fluid中,LoD>1的Tensor与其他类型的数据一样,使用 :code:`feed` 定义数据传入顺序。此外,由于输出results是带有LoD信息的Tensor,需在exe.run()中添加 :code:`return_numpy=False` 参数,获得LoD-Tensor的输出结果。 + +.. code-block:: python + + results = exe.run(fluid.default_main_program(), + feed={'x':x_d, 'y': y_d }, + fetch_list=[out],return_numpy=False) + +**查看LodTensor结果** + +由于LoDTensor的特殊属性,无法直接print查看内容,常用的操作是将LoD-Tensor作为网络的输出fetch出来,然后执行 numpy.array(lod_tensor), 就能转成numpy array: + +.. code-block:: python + + np.array(results[0]) + +输出结果为: + +.. code-block:: text + + array([[1.1],[2.2],[3.3],[4.4],[2.2],[3.3],[4.4],[2.2],[3.3],[4.4]]) + +**查看序列长度** + +可以通过 :code:`recursive_sequence_lengths()` 接口得到 LoDTensor 的递归序列长度: + +.. code-block:: python + + results[0].recursive_sequence_lengths() + +输出结果为: + +..
code-block:: text + + [[1L, 3L, 3L, 3L]] + +**完整代码** + +您可以运行下列完整代码,观察输出结果: + +.. code-block:: python + + #加载库 + import paddle + import paddle.fluid as fluid + import numpy as np + #定义前向计算 + x = fluid.layers.data(name='x', shape=[1], dtype='float32', lod_level=1) + y = fluid.layers.data(name='y', shape=[1], dtype='float32', lod_level=2) + out = fluid.layers.sequence_expand(x=x, y=y, ref_level=0) + #定义运算场所 + place = fluid.CPUPlace() + #创建执行器 + exe = fluid.Executor(place) + exe.run(fluid.default_startup_program()) + #创建LoDTensor + x_d = fluid.create_lod_tensor(np.array([[1.1], [2.2],[3.3],[4.4]]).astype('float32'), [[1,3]], place) + y_d = fluid.create_lod_tensor(np.array([[1.1],[1.1],[1.1],[1.1],[1.1],[1.1]]).astype('float32'), [[1,3], [1,2,1,2]], place) + #开始计算 + results = exe.run(fluid.default_main_program(), + feed={'x':x_d, 'y': y_d }, + fetch_list=[out],return_numpy=False) + #输出执行结果 + print("The data of the result: {}.".format(np.array(results[0]))) + #输出 result 的序列长度 + print("The recursive sequence lengths of the result: {}.".format(results[0].recursive_sequence_lengths())) + #输出 result 的 LoD + print("The LoD of the result: {}.".format(results[0].lod())) + + +总结 +======== + +至此,相信您已经基本掌握了LoD-Tensor的概念,尝试修改上述代码中的 x_d 与 y_d,观察输出结果,有助于您更好地理解这一灵活的结构。 + +更多LoDTensor的模型应用,可以参考新手入门中的 `词向量 <../../../beginners_guide/basics/word2vec/index.html>`_ 、`个性化推荐 <../../../beginners_guide/basics/recommender_system/index.html>`_、`情感分析 <../../../beginners_guide/basics/understand_sentiment/index.html>`_ 等指导教程。 + +更高阶的应用案例,请参考 `模型库 <../../../user_guides/models/index_cn.html>`_ 中的相关内容。 - lod_tensor.rst diff --git a/doc/fluid/user_guides/howto/basic_concept/index_en.rst b/doc/fluid/user_guides/howto/basic_concept/index_en.rst index 5591d65b12c7ba3c1975587e604ac0872b82b5d2..1c0584f63565bfaa4e4a6c04ee7982f302930d55 100644 --- a/doc/fluid/user_guides/howto/basic_concept/index_en.rst +++ b/doc/fluid/user_guides/howto/basic_concept/index_en.rst @@ -1,12 +1,372 @@ -############### -Basic Concepts -############### +##################### +LoD-Tensor User Guide +##################### + +LoD (Level-of-Detail) Tensor is a concept unique to Fluid: it is constructed by appending sequence information to a Tensor. The data transferred in Fluid, including the inputs, outputs and learnable parameters of the network, are all represented by LoD-Tensor. + +With the help of this user guide, you will learn the design idea of LoD-Tensor in Fluid so that you can use this data type more flexibly. + +Challenge of variable-length sequences +====================================== + +In most deep learning frameworks, a mini-batch is represented by Tensor. + +For example, if there are 10 pictures in a mini-batch and the size of each picture is 32*32, the mini-batch will be a 10*32*32 Tensor. + +Or, in an NLP task, there are N sentences in a mini-batch and the length of each sentence is L. Every word is represented by a one-hot vector with D dimensions. Then the mini-batch can be represented by an N*L*D Tensor. + +In the two examples above, the size of each sequence element remains the same. However, in many cases the training data are variable-length sequences. For this scenario, the method taken by most frameworks is to set a fixed length: sequence data shorter than the fixed length are padded with 0 to reach it.
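For instance, a minimal numpy sketch of this zero-padding approach (the batch contents here are illustrative only):

.. code-block:: python

    import numpy as np

    # Three variable-length sequences of word ids (illustrative data).
    batch = [[1, 2, 3], [4], [5, 6]]
    max_len = max(len(seq) for seq in batch)

    # Pad every sequence with 0 up to the fixed length, then stack them into one Tensor.
    padded = np.array([seq + [0] * (max_len - len(seq)) for seq in batch])
    print(padded)
    # [[1 2 3]
    #  [4 0 0]
    #  [5 6 0]]

LoD-Tensor avoids exactly this kind of wasted storage and computation.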
+ +Owing to the LoD-Tensor in Fluid, it is not necessary to keep the lengths of sequence data in every mini-batch constant. Therefore, sequence-sensitive tasks such as NLP can be handled without padding. + +An index data structure (LoD) is introduced in Fluid to split a Tensor into sequences. + +Index Structure - LoD +====================== + +To have a better understanding of the concept of LoD, you can refer to the examples in this section. + +**mini-batch consisting of sentences** + +Suppose a mini-batch contains three sentences, which contain 3, 1 and 2 words respectively. Then the mini-batch can be represented by a (3+1+2)*D Tensor with some index information appended: + +.. code-block :: text + + 3 1 2 + | | | | | | + +In the text above, each :code:`|` represents a D-dimensional word vector, and the digits 3, 1, 2 make up a 1-level LoD. + +**recursive sequence** + +Take a 2-level LoD-Tensor as an example: a mini-batch contains three articles of 3, 1 and 2 sentences respectively, and every sentence consists of a different number of words. Then the mini-batch is formed as follows: + +.. code-block:: text + + + 3 1 2 + 3 2 4 1 2 3 + ||| || |||| | || ||| + + +The LoD that expresses this format is: + +.. code-block:: text + + [[3,1,2]/*level=0*/,[3,2,4,1,2,3]/*level=1*/] + + +**mini-batch consisting of video data** + +In computer vision tasks, we often need to deal with high-dimensional objects like videos and pictures. Suppose a mini-batch contains 3 videos, which consist of 3 frames, 1 frame and 2 frames respectively. The size of each frame is 640*480. Then the mini-batch can be described as: + +.. code-block:: text + + 3 1 2 + 口口口 口 口口 + + +The size of the tensor at the bottom is (3+1+2)*640*480. Every :code:`口` represents a 640*480 picture. + +**mini-batch consisting of pictures** + +Traditionally, for a mini-batch of N pictures with fixed size, the LoD-Tensor is described as: + +.. code-block:: text + + 1 1 1 1 1 + 口口口口 ... 口 + +In this case, the information is not ignored just because all the indices are 1; the LoD-Tensor is simply treated as an ordinary tensor: + +.. code-block:: text + + 口口口口 ... 口 + +**model parameter** + +A model parameter is just an ordinary tensor, which is represented as a 0-level LoD-Tensor in Fluid. + +LoDTensor expressed by offset +============================= + +To provide quick access to the basic sequences, Fluid adopts an offset representation: store the start and end offsets of each sequence instead of its length. + +In the example above, the lengths of the fundamental elements are: + +.. code-block:: text + + 3 2 4 1 2 3 + +Expressed by offsets, this becomes: + +.. code-block:: text + + 0 3 5 9 10 12 15 + = = = = = = + 3 2+3 4+5 1+9 2+10 3+12 + +Therefore, the first sentence spans words 0 to 3, and the second sentence spans words 3 to 5. + +Similarly, the lengths of the top layer of the LoD + +.. code-block:: text + + 3 1 2 + +can be expressed by offsets: + +.. code-block:: text + + 0 3 4 6 + = = = + 3 3+1 4+2 + +Therefore the LoD-Tensor is expressed by offset: + +.. code-block:: text + + 0 3 4 6 + 3 5 9 10 12 15
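As a quick check, here is a hedged numpy sketch of this lengths-to-offsets conversion (the helper below is illustrative, not a Fluid API):

.. code-block:: python

    import numpy as np

    def lengths_to_offsets(lengths):
        # Prepend 0, then accumulate: [3, 2, 4, 1, 2, 3] -> [0, 3, 5, 9, 10, 12, 15]
        return np.cumsum([0] + list(lengths)).tolist()

    print(lengths_to_offsets([3, 1, 2]))           # [0, 3, 4, 6]
    print(lengths_to_offsets([3, 2, 4, 1, 2, 3]))  # [0, 3, 5, 9, 10, 12, 15]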
+ + +LoD-Tensor +============= +A LoD-Tensor can be regarded as a tree, in which the leaves are the basic sequence elements and the branches mark how those elements are grouped. + +There are two ways to express the sequence information of a LoD-Tensor in Fluid: primitive lengths and offsets. Paddle uses offsets internally to offer quicker access to the sequences, while the Python API uses primitive lengths to make them easier to understand and compute with; the primitive lengths are named :code:`recursive_sequence_lengths` . + +Take the 2-level LoD-Tensor mentioned above as an example: + +.. code-block:: text + + 3 1 2 + 3 2 4 1 2 3 + ||| || |||| | || ||| + +- The LoD-Tensor expressed by offsets: [ [0,3,4,6] , [0,3,5,9,10,12,15] ] +- The LoD-Tensor expressed by primitive lengths: recursive_sequence_lengths=[ [3-0 , 4-3 , 6-4] , [3-0 , 5-3 , 9-5 , 10-9 , 12-10 , 15-12] ] + + +Take a text sequence as an example: [3,1,2] indicates that there are 3 articles in the mini-batch, which contain 3, 1 and 2 sentences respectively; [3,2,4,1,2,3] indicates that the sentences contain 3, 2, 4, 1, 2 and 3 words respectively. + +recursive_seq_lens is a doubly nested list, in other words, a list of lists. The size of the outermost list is the number of nested levels, namely the lod-level; each inner list holds the sizes of the elements at that lod-level. + +The following three pieces of code show how to create a LoD-Tensor, how to transform a LoD-Tensor to Tensor, and how to transform a Tensor to LoD-Tensor respectively: + +* Create LoD-Tensor + +.. code-block:: python + + #Create lod-tensor + import paddle.fluid as fluid + import numpy as np + + a = fluid.create_lod_tensor(np.array([[1],[1],[1], + [1],[1], + [1],[1],[1],[1], + [1], + [1],[1], + [1],[1],[1]]).astype('int64') , + [[3,1,2] , [3,2,4,1,2,3]], + fluid.CPUPlace()) + + #Check lod-tensor nested layers + print (len(a.recursive_sequence_lengths())) + # output:2 + + #Check the number of the most fundamental elements + print (sum(a.recursive_sequence_lengths()[-1])) + # output:15 (3+2+4+1+2+3=15) + +* Transform LoD-Tensor to Tensor + +.. code-block:: python + + import paddle.fluid as fluid + import numpy as np + + # create LoD-Tensor + a = fluid.create_lod_tensor(np.array([[1.1], [2.2],[3.3],[4.4]]).astype('float32'), [[1,3]], fluid.CPUPlace()) + + def LodTensor_to_Tensor(lod_tensor): + # get lod information of LoD-Tensor + lod = lod_tensor.lod() + # transform into array + array = np.array(lod_tensor) + new_array = [] + # transform to Tensor according to the layer information of the original LoD-Tensor + for i in range(len(lod[0]) - 1): + new_array.append(array[lod[0][i]:lod[0][i + 1]]) + return new_array + + new_array = LodTensor_to_Tensor(a) + + # output the result + print(new_array)
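For the two sequences in this example, the printed result should contain the original pieces split back out. The output below is inferred from the offset lod [[0, 1, 4]] and shown for reference only; the exact numpy formatting may differ:

.. code-block:: text

    [array([[1.1]], dtype=float32), array([[2.2],
           [3.3],
           [4.4]], dtype=float32)]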
+ +* Transform Tensor to LoD-Tensor + +.. code-block:: python + + import paddle.fluid as fluid + import numpy as np + + def to_lodtensor(data, place): + # save the length of each Tensor as the LoD information + seq_lens = [len(seq) for seq in data] + cur_len = 0 + lod = [cur_len] + for l in seq_lens: + cur_len += l + lod.append(cur_len) + # decrease the dimension of the transformed Tensor + flattened_data = np.concatenate(data, axis=0).astype("int64") + flattened_data = flattened_data.reshape([len(flattened_data), 1]) + # add the lod information to the Tensor data + res = fluid.LoDTensor() + res.set(flattened_data, place) + res.set_lod([lod]) + return res + + # new_array is the transformed Tensor above + lod_tensor = to_lodtensor(new_array,fluid.CPUPlace()) + + # output the LoD information + print("The LoD of the result: {}.".format(lod_tensor.lod())) + + # examine the consistency with the Tensor data + print("The array : {}.".format(np.array(lod_tensor))) + + + + + +Code examples +============== + +In the code example of this section, the input variable x is expanded according to the specified LoD level of y. The example covers several fundamental concepts of LoD-Tensor. By following the code, you will: + +- Have a direct understanding of the implementation of :code:`fluid.layers.sequence_expand` in Fluid +- Know how to create a LoD-Tensor in Fluid +- Learn how to print the content of a LoDTensor + + + +**Define the Process of Computing** + +layers.sequence_expand expands x by obtaining the lod value of y. For more explanation of :code:`fluid.layers.sequence_expand` , please read :ref:`api_fluid_layers_sequence_expand` first. + +Code of sequence expanding: + +.. code-block:: python + + x = fluid.layers.data(name='x', shape=[1], dtype='float32', lod_level=1) + y = fluid.layers.data(name='y', shape=[1], dtype='float32', lod_level=2) + out = fluid.layers.sequence_expand(x=x, y=y, ref_level=0) + +*Note*: The dimension of the output LoD-Tensor is only associated with the dimension of the real data fed in. The shape values set for x and y in the definition of the network structure are just placeholders with little influence on the result. + +**Create Executor** + +.. code-block:: python + + place = fluid.CPUPlace() + exe = fluid.Executor(place) + exe.run(fluid.default_startup_program()) + +**Prepare Data** + +Here we use :code:`fluid.create_lod_tensor` to create the input data of :code:`sequence_expand` and expand x_d by defining the LoD of y_d. The output value is only associated with the LoD of y_d; the data of y_d is not involved in the computation, and its dimension only needs to be consistent with its LoD[-1] . + +For the user guide of :code:`fluid.create_lod_tensor()` , please refer to :ref:`api_fluid_create_lod_tensor` . + +Code: + +.. code-block:: python + + x_d = fluid.create_lod_tensor(np.array([[1.1],[2.2],[3.3],[4.4]]).astype('float32'), [[1,3]], place) + y_d = fluid.create_lod_tensor(np.array([[1.1],[1.1],[1.1],[1.1],[1.1],[1.1]]).astype('float32'), [[1,3], [2,1,2,1]],place) + + +**Execute Computing** + +For a tensor whose LoD > 1 in Fluid, like data of other types, the order of transferring data is defined by :code:`feed` . In addition, the parameter :code:`return_numpy=False` needs to be added to exe.run() to get the output of the LoD-Tensor, because the results are Tensors with LoD information. + +.. code-block:: python + + results = exe.run(fluid.default_main_program(), + feed={'x':x_d, 'y': y_d }, + fetch_list=[out],return_numpy=False)
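Before inspecting the output, it may help to trace the computation by hand (a sketch of the expansion logic, consistent with the result shown below):

.. code-block:: text

    x sequences   : [1.1] , [2.2, 3.3, 4.4]
    y level-0 LoD : [1, 3]  -> repeat the 1st sequence once, the 2nd three times
    result        : [1.1] + [2.2, 3.3, 4.4] * 3  -> 10 rows in total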
+ +**Check the result of LodTensor** + +Because of the special attributes of a LoDTensor, you cannot directly print its content to inspect it. The usual solution is to fetch the LoD-Tensor as an output of the network and then execute numpy.array(lod_tensor) to convert it into a numpy array: + +.. code-block:: python + + np.array(results[0]) + +Output: + +.. code-block:: text + + array([[1.1],[2.2],[3.3],[4.4],[2.2],[3.3],[4.4],[2.2],[3.3],[4.4]]) + +**Check the length of sequence** + +You can inspect the recursive sequence lengths of the LoDTensor as follows: + +.. code-block:: python + + results[0].recursive_sequence_lengths() + +Output: + +.. code-block:: text + + [[1L, 3L, 3L, 3L]] + +**Complete Code** + +You can check the output by executing the following complete code: + +.. code-block:: python + + #Load the libraries + import paddle + import paddle.fluid as fluid + import numpy as np + #Define forward computation + x = fluid.layers.data(name='x', shape=[1], dtype='float32', lod_level=1) + y = fluid.layers.data(name='y', shape=[1], dtype='float32', lod_level=2) + out = fluid.layers.sequence_expand(x=x, y=y, ref_level=0) + #Define place for computation + place = fluid.CPUPlace() + #Create the executor + exe = fluid.Executor(place) + exe.run(fluid.default_startup_program()) + #Create LoDTensor + x_d = fluid.create_lod_tensor(np.array([[1.1], [2.2],[3.3],[4.4]]).astype('float32'), [[1,3]], place) + y_d = fluid.create_lod_tensor(np.array([[1.1],[1.1],[1.1],[1.1],[1.1],[1.1]]).astype('float32'), [[1,3], [1,2,1,2]], place) + #Start computing + results = exe.run(fluid.default_main_program(), + feed={'x':x_d, 'y': y_d }, + fetch_list=[out],return_numpy=False) + #Output the result + print("The data of the result: {}.".format(np.array(results[0]))) + #Print the sequence lengths of the result + print("The recursive sequence lengths of the result: {}.".format(results[0].recursive_sequence_lengths())) + #Print the LoD of the result + print("The LoD of the result: {}.".format(results[0].lod())) + + +Summary +======== + +By now you should have a good grasp of the concept of LoD-Tensor. Trying to change x_d and y_d in the code above and checking the output may help you get a better understanding of this flexible structure. + +For more model applications of LoDTensor, you can refer to `Word2vec <../../../beginners_guide/basics/word2vec/index_en.html>`_ , `Personalized Recommendation <../../../beginners_guide/basics/recommender_system/index_en.html>`_ , `Sentiment Analysis <../../../beginners_guide/basics/understand_sentiment/index_en.html>`_ in the Beginner's Guide. + +For more difficult and complex application examples, please refer to the related information in `models <../../../user_guides/models/index_en.html>`_ . -This section will introduce basic concepts in Fluid: - -- `LoD-Tensor User Guide `_ : LoD-Tensor is a unique term of Fluid. It appends sequence information to Tensor,and supports data of variable lengths. - -..
toctree:: - :hidden: - - lod_tensor_en.rst diff --git a/doc/fluid/user_guides/howto/prepare_data/index_cn.rst b/doc/fluid/user_guides/howto/prepare_data/index_cn.rst index ff349ad1c80812d9f6bb3ae3cc7f4e1b256c2f08..6cff321413c2e826ae6d04f98ebf0cf03e6ecf7b 100644 --- a/doc/fluid/user_guides/howto/prepare_data/index_cn.rst +++ b/doc/fluid/user_guides/howto/prepare_data/index_cn.rst @@ -4,127 +4,12 @@ 准备数据 ######## -使用PaddlePaddle Fluid准备数据分为三个步骤: - -Step1: 自定义Reader生成训练/预测数据 -################################### - -生成的数据类型可以为Numpy Array或LoDTensor。根据Reader返回的数据形式的不同,可分为Batch级的Reader和Sample(样本)级的Reader。 - -Batch级的Reader每次返回一个Batch的数据,Sample级的Reader每次返回单个样本的数据 - -如果您的数据是Sample级的数据,我们提供了一个可以数据预处理和组建batch的工具::code:`Python Reader` 。 - - -Step2: 在网络配置中定义数据层变量 -################################### -用户需使用 :code:`fluid.layers.data` 在网络中定义数据层变量。定义数据层变量时需指明数据层的名称name、数据类型dtype和维度shape。例如: - -.. code-block:: python - - import paddle.fluid as fluid - - image = fluid.layers.data(name='image', dtype='float32', shape=[28, 28]) - label = fluid.layers.data(name='label', dtype='int64', shape=[1]) - - -需要注意的是,此处的shape是单个样本的维度,PaddlePaddle Fluid会在shape第0维位置添加-1,表示batch_size的维度,即此例中image.shape为[-1, 28, 28], -label.shape为[-1, 1]。 - -若用户不希望框架在第0维位置添加-1,则可通过append_batch_size=False参数控制,即: - -.. code-block:: python - - import paddle.fluid as fluid - - image = fluid.layers.data(name='image', dtype='float32', shape=[28, 28], append_batch_size=False) - label = fluid.layers.data(name='label', dtype='int64', shape=[1], append_batch_size=False) - -此时,image.shape为[28, 28],label.shape为[1]。 - -Step3: 将数据送入网络进行训练/预测 -################################### - -Fluid提供两种方式,分别是异步PyReader接口方式或同步Feed方式,具体介绍如下: - -- 异步PyReader接口方式 - -用户需要先使用 :code:`fluid.io.PyReader` 定义PyReader对象,然后通过PyReader对象的decorate方法设置数据源。 -使用PyReader接口时,数据传入与模型训练/预测过程是异步进行的,效率较高,推荐使用。 - -- 同步Feed方式 - -用户自行构造输入数据,并在 :code:`fluid.Executor` 或 :code:`fluid.ParallelExecutor` -中使用 :code:`executor.run(feed=...)` 传入训练数据。数据准备和模型训练/预测的过程是同步进行的, -效率较低。 - - -这两种准备数据方法的比较如下: - -======== ================================= ===================================== -对比项 同步Feed方式 异步PyReader接口方式 -======== ================================= ===================================== -API接口 :code:`executor.run(feed=...)` :code:`fluid.io.PyReader` -数据格式 Numpy Array或LoDTensor Numpy Array或LoDTensor -数据增强 Python端使用其他库完成 Python端使用其他库完成 -速度 慢 快 -推荐用途 调试模型 工业训练 -======== ================================= ===================================== - -Reader数据类型对使用方式的影响 -############################### - -根据Reader数据类型的不同,上述步骤的具体操作将有所不同,具体介绍如下: - -读取Sample级Reader数据 -+++++++++++++++++++++ - -若自定义的Reader每次返回单个样本的数据,用户需通过以下步骤完成数据送入: - -Step1. 组建数据 -============================= - -调用Fluid提供的Reader相关接口完成组batch和部分的数据预处理功能,具体请参见: +本章详细介绍了如何为神经网络提供数据,包括数据的前期处理与后期的同步、异步读取。 .. toctree:: :maxdepth: 1 + prepare_steps.rst reader_cn.md - -Step2. 送入数据 -================================= - -若使用异步PyReader接口方式送入数据,请调用 :code:`decorate_sample_generator` 或 :code:`decorate_sample_list_generator` 接口完成,具体请参见: - -- :ref:`user_guides_use_py_reader` - -若使用同步Feed方式送入数据,请使用DataFeeder接口将Reader数据转换为LoDTensor格式后送入网络,具体请参见 :ref:`cn_api_fluid_DataFeeder` - -读取Batch级Reader数据 -+++++++++++++++++++++++ - -Step1. 组建数据 -================= - -由于Batch已经组好,已经满足了Step1的条件,可以直接进行Step2 - -Step2. 送入数据 -================================= - -若使用异步PyReader接口方式送入数据,请调用PyReader的 :code:`decorate_batch_generator` 接口完成,具体方式请参见: - -.. toctree:: - :maxdepth: 1 - use_py_reader.rst - -若使用同步Feed方式送入数据,具体请参见: - -.. 
toctree:: - :maxdepth: 1 - - feeding_data.rst - - - - + feeding_data.rst \ No newline at end of file diff --git a/doc/fluid/user_guides/howto/prepare_data/index_en.rst b/doc/fluid/user_guides/howto/prepare_data/index_en.rst index 4bc3776a6a0c8d4625d235546b0b9804331de8cd..b4d7a337949047ddde5848fa94194ad19174d4b3 100644 --- a/doc/fluid/user_guides/howto/prepare_data/index_en.rst +++ b/doc/fluid/user_guides/howto/prepare_data/index_en.rst @@ -4,52 +4,12 @@ Prepare Data ############# -PaddlePaddle Fluid supports two methods to feed data into networks: - -1. Synchronous method - Python Reader:Firstly, use :code:`fluid.layers.data` to set up data input layer. Then, feed in the training data through :code:`executor.run(feed=...)` in :code:`fluid.Executor` or :code:`fluid.ParallelExecutor` . - -2. Asynchronous method - py_reader:Firstly, use :code:`fluid.layers.py_reader` to set up data input layer. Then configure the data source with functions :code:`decorate_paddle_reader` or :code:`decorate_tensor_provider` of :code:`py_reader` . After that, call :code:`fluid.layers.read_file` to read data. - - - -Comparisons of the two methods: - -========================= ==================================================== =============================================== -Aspects Synchronous Python Reader Asynchronous py_reader -========================= ==================================================== =============================================== -API interface :code:`executor.run(feed=...)` :code:`fluid.layers.py_reader` -data type Numpy Array Numpy Array or LoDTensor -data augmentation carried out by other libraries on Python end carried out by other libraries on Python end -velocity slow rapid -recommended applications model debugging industrial training -========================= ==================================================== =============================================== - -Synchronous Python Reader -########################## - -Fluid provides Python Reader to feed in data. - -Python Reader is a pure Python-side interface, and data feeding is synchronized with the model training/prediction process. Users can pass in data through Numpy Array. For specific operations, please refer to: - - -.. toctree:: - :maxdepth: 1 - - feeding_data_en.rst - -Python Reader supports advanced functions like group batch, shuffle. For specific operations, please refer to: +This document mainly introduces how to provide data for the network, covering both the synchronous and the asynchronous data reading methods. .. toctree:: :maxdepth: 1 + prepare_steps_en.rst reader.md - -Asynchronous py_reader -######################## - -Fluid provides asynchronous data feeding method PyReader. It is more efficient as data feeding is not synchronized with the model training/prediction process. For specific operations, please refer to: - -.. toctree:: - :maxdepth: 1 - use_py_reader_en.rst + feeding_data_en.rst \ No newline at end of file diff --git a/doc/fluid/user_guides/howto/prepare_data/prepare_steps.rst b/doc/fluid/user_guides/howto/prepare_data/prepare_steps.rst new file mode 100644 index 0000000000000000000000000000000000000000..69b39ee8dccd8477a4da4cb070787cc5db8be7a9 --- /dev/null +++ b/doc/fluid/user_guides/howto/prepare_data/prepare_steps.rst @@ -0,0 +1,130 @@ +..
_user_guide_prepare_steps: + +######## +准备步骤 +######## + +使用PaddlePaddle Fluid准备数据分为三个步骤: + +Step1: 自定义Reader生成训练/预测数据 +################################### + +生成的数据类型可以为Numpy Array或LoDTensor。根据Reader返回的数据形式的不同,可分为Batch级的Reader和Sample(样本)级的Reader。 + +Batch级的Reader每次返回一个Batch的数据,Sample级的Reader每次返回单个样本的数据。 + +如果您的数据是Sample级的数据,我们提供了一个可以进行数据预处理和组建batch的工具::code:`Python Reader` 。 + + +Step2: 在网络配置中定义数据层变量 +################################### +用户需使用 :code:`fluid.layers.data` 在网络中定义数据层变量。定义数据层变量时需指明数据层的名称name、数据类型dtype和维度shape。例如: + +.. code-block:: python + + import paddle.fluid as fluid + + image = fluid.layers.data(name='image', dtype='float32', shape=[28, 28]) + label = fluid.layers.data(name='label', dtype='int64', shape=[1]) + + +需要注意的是,此处的shape是单个样本的维度,PaddlePaddle Fluid会在shape第0维位置添加-1,表示batch_size的维度,即此例中image.shape为[-1, 28, 28], +label.shape为[-1, 1]。 + +若用户不希望框架在第0维位置添加-1,则可通过append_batch_size=False参数控制,即: + +.. code-block:: python + + import paddle.fluid as fluid + + image = fluid.layers.data(name='image', dtype='float32', shape=[28, 28], append_batch_size=False) + label = fluid.layers.data(name='label', dtype='int64', shape=[1], append_batch_size=False) + +此时,image.shape为[28, 28],label.shape为[1]。 + +Step3: 将数据送入网络进行训练/预测 +################################### + +Fluid提供两种方式,分别是异步PyReader接口方式或同步Feed方式,具体介绍如下: + +- 异步PyReader接口方式 + +用户需要先使用 :code:`fluid.io.PyReader` 定义PyReader对象,然后通过PyReader对象的decorate方法设置数据源。 +使用PyReader接口时,数据传入与模型训练/预测过程是异步进行的,效率较高,推荐使用。 + +- 同步Feed方式 + +用户自行构造输入数据,并在 :code:`fluid.Executor` 或 :code:`fluid.ParallelExecutor` +中使用 :code:`executor.run(feed=...)` 传入训练数据。数据准备和模型训练/预测的过程是同步进行的, +效率较低。 + + +这两种准备数据方法的比较如下: + +======== ================================= ===================================== +对比项 同步Feed方式 异步PyReader接口方式 +======== ================================= ===================================== +API接口 :code:`executor.run(feed=...)` :code:`fluid.io.PyReader` +数据格式 Numpy Array或LoDTensor Numpy Array或LoDTensor +数据增强 Python端使用其他库完成 Python端使用其他库完成 +速度 慢 快 +推荐用途 调试模型 工业训练 +======== ================================= ===================================== + +Reader数据类型对使用方式的影响 +############################### + +根据Reader数据类型的不同,上述步骤的具体操作将有所不同,具体介绍如下: + +读取Sample级Reader数据 ++++++++++++++++++++++ + +若自定义的Reader每次返回单个样本的数据,用户需通过以下步骤完成数据送入: + +Step1. 组建数据 +============================= + +调用Fluid提供的Reader相关接口完成组batch和部分的数据预处理功能,具体请参见: + +.. toctree:: + :maxdepth: 1 + + reader_cn.md + +Step2. 送入数据 +================================= + +若使用异步PyReader接口方式送入数据,请调用 :code:`decorate_sample_generator` 或 :code:`decorate_sample_list_generator` 接口完成,具体请参见: + +- :ref:`user_guides_use_py_reader` + +若使用同步Feed方式送入数据,请使用DataFeeder接口将Reader数据转换为LoDTensor格式后送入网络,具体请参见 :ref:`cn_api_fluid_DataFeeder` + +读取Batch级Reader数据 ++++++++++++++++++++++++ + +Step1. 组建数据 +================= + +由于Batch已经组好,已满足Step1的条件,可以直接进行Step2。 + +Step2. 送入数据 +================================= + +若使用异步PyReader接口方式送入数据,请调用PyReader的 :code:`decorate_batch_generator` 接口完成,具体方式请参见: + +.. toctree:: + :maxdepth: 1 + + use_py_reader.rst + +若使用同步Feed方式送入数据,具体请参见: + +.. toctree:: + :maxdepth: 1 + + feeding_data.rst + + + + diff --git a/doc/fluid/user_guides/howto/prepare_data/prepare_steps_en.rst b/doc/fluid/user_guides/howto/prepare_data/prepare_steps_en.rst new file mode 100644 index 0000000000000000000000000000000000000000..b8c0a9948a624ef7e4bb0268b8780c490138794b --- /dev/null +++ b/doc/fluid/user_guides/howto/prepare_data/prepare_steps_en.rst @@ -0,0 +1,55 @@ +..
_user_guide_prepare_steps_en: + +############# +Prepare Steps +############# + +PaddlePaddle Fluid supports two methods to feed data into networks: + +1. Synchronous method - Python Reader: First, use :code:`fluid.layers.data` to set up the data input layer. Then, feed in the training data through :code:`executor.run(feed=...)` in :code:`fluid.Executor` or :code:`fluid.ParallelExecutor` . + +2. Asynchronous method - py_reader: First, use :code:`fluid.layers.py_reader` to set up the data input layer. Then configure the data source with the functions :code:`decorate_paddle_reader` or :code:`decorate_tensor_provider` of :code:`py_reader` . After that, call :code:`fluid.layers.read_file` to read data. + + + +Comparisons of the two methods: + +========================= ==================================================== =============================================== +Aspects Synchronous Python Reader Asynchronous py_reader +========================= ==================================================== =============================================== +API interface :code:`executor.run(feed=...)` :code:`fluid.layers.py_reader` +data type Numpy Array Numpy Array or LoDTensor -data augmentation carried out by other libraries on Python end carried out by other libraries on Python end +speed slow rapid +recommended applications model debugging industrial training +========================= ==================================================== =============================================== + +Synchronous Python Reader +########################## + +Fluid provides Python Reader to feed in data. + +Python Reader is a pure Python-side interface, and data feeding is synchronized with the model training/prediction process. Users can pass in data through Numpy Array. For specific operations, please refer to: + + +.. toctree:: + :maxdepth: 1 + + feeding_data_en.rst + +Python Reader supports advanced functions like batching and shuffling. For specific operations, please refer to: + +.. toctree:: + :maxdepth: 1 + + reader.md + +Asynchronous py_reader +######################## + +Fluid provides PyReader, an asynchronous data feeding method. It is more efficient, as data feeding is not synchronized with the model training/prediction process. For specific operations, please refer to: + +.. toctree:: + :maxdepth: 1 + + use_py_reader_en.rst
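To make the synchronous path concrete before moving on, here is a minimal hedged sketch of feeding data through executor.run(); the network, sizes and data below are illustrative placeholders, not a canonical example:

.. code-block:: python

    import numpy as np
    import paddle.fluid as fluid

    # Set up the data input layer and a trivial network.
    x = fluid.layers.data(name='x', shape=[1], dtype='float32')
    y = fluid.layers.fc(input=x, size=1)

    place = fluid.CPUPlace()
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())

    # Synchronous feeding: the data is prepared on the Python side
    # and passed in at every executor.run() call.
    x_data = np.random.random(size=(8, 1)).astype('float32')
    out, = exe.run(fluid.default_main_program(),
                   feed={'x': x_data},
                   fetch_list=[y])
    print(out.shape)  # (8, 1)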
diff --git a/doc/fluid/user_guides/howto/prepare_data/use_py_reader_en.rst b/doc/fluid/user_guides/howto/prepare_data/use_py_reader_en.rst index 0911230ab921d3fb6d89532cefbe39ce68063204..8e218d3e8c3c521c04789d9da06eb5afd7de2a71 100644 --- a/doc/fluid/user_guides/howto/prepare_data/use_py_reader_en.rst +++ b/doc/fluid/user_guides/howto/prepare_data/use_py_reader_en.rst @@ -4,7 +4,7 @@ Use PyReader to read training and test data ############################################ -Besides Python Reader, we provide PyReader. The performance of PyReader is better than :ref:`user_guide_use_numpy_array_as_train_data` , because the process of loading data is asynchronous with the process of training model when PyReader is in use. And PyReader can coordinate with :code:`double_buffer_reader` to improve the performance of reading data. What's more, :code:`double_buffer_reader` can achieve the transformation from CPU Tensor to GPU Tensor, which improve the efficiency of reading data to some extent. +Besides Python Reader, we provide PyReader. The performance of PyReader is better than :ref:`user_guide_use_numpy_array_as_train_data_en` , because the process of loading data is asynchronous with the model training process when PyReader is in use. Moreover, PyReader can coordinate with :code:`double_buffer_reader` to improve the performance of reading data, and :code:`double_buffer_reader` can carry out the transformation from CPU Tensor to GPU Tensor, which improves the efficiency of reading data to some extent. Create PyReader Object ################################ diff --git a/doc/fluid/user_guides/index_cn.rst b/doc/fluid/user_guides/index_cn.rst index 109fb5b818707739ec5379c6d4298ddccbcd9521..99f68bec3c9da797dd43af2c4cc61b19413b1321 100644 --- a/doc/fluid/user_guides/index_cn.rst +++ b/doc/fluid/user_guides/index_cn.rst @@ -6,7 +6,7 @@ 如果您已经掌握了新手入门阶段的内容,期望可以针对实际问题建模、搭建自己网络,本模块提供了一些 Fluid 的使用细节供您参考: - - `基本概念 <../user_guides/howto/basic_concept/index_cn.html>`_ :介绍了Fluid的基本使用概念 + - `LoD-Tensor概念 <../user_guides/howto/basic_concept/index_cn.html>`_ :介绍了Fluid LoD-Tensor的基本概念 - `准备数据 <../user_guides/howto/prepare_data/index_cn.html>`_ :介绍使用 Fluid 训练网络时,数据的支持类型及传输方法 diff --git a/doc/fluid/user_guides/index_en.rst b/doc/fluid/user_guides/index_en.rst index fee5c7e642e86ae7c705e22ee4925a2fa0cf6400..e90e11347bdf6fa757f3ec06e8859e7e5f7d76f3 100644 --- a/doc/fluid/user_guides/index_en.rst +++ b/doc/fluid/user_guides/index_en.rst @@ -8,7 +8,7 @@ If you have got the hang of Beginner's Guide, and wish to model practical proble you with some detailed operations: - - `Basic Concepts <../user_guides/howto/basic_concept/index_en.html>`_ :It explains basic concepts of Fluid. + - `LoD-Tensor Concepts <../user_guides/howto/basic_concept/index_en.html>`_ : It explains the basic concepts of LoD-Tensor in Fluid. - `Prepare Data <../user_guides/howto/prepare_data/index_en.html>`_ :This section introduces data types supported and data transmission methods when you are training your networks with Fluid. diff --git a/external/Paddle b/external/Paddle index 2e3ec66be0fa425c389def2db5db7494f77f905c..103d09169d2d023b0f477cc9362d724ab2531453 160000 --- a/external/Paddle +++ b/external/Paddle @@ -1 +1 @@ -Subproject commit 2e3ec66be0fa425c389def2db5db7494f77f905c +Subproject commit 103d09169d2d023b0f477cc9362d724ab2531453 diff --git a/external/book b/external/book index 0aa844b15af5f09d1f7c9effa60ea0da1cd0c84f..4d3d1663c2dd28241ab7ee32396fb0d92793f9fc 160000 --- a/external/book +++ b/external/book @@ -1 +1 @@ -Subproject commit 0aa844b15af5f09d1f7c9effa60ea0da1cd0c84f +Subproject commit 4d3d1663c2dd28241ab7ee32396fb0d92793f9fc