merge develop (#1552)

* Update save_load_variables.rst (#1520)
* update 1_6 release (#1522)
* New tree (#1525)
* update 1_6 release
* fix readme test=document_preview
* add new tree index
* Add save load save_dygraph load_dygraph (#1519)
* add save load save_dygraph load_dygraph, test=develop
* add .idea in gitignore, test=develop
* minor fix, test=develop
* doc refines, test=develop
* doc refines, test=develop
* Update custom_op.md (#1528)
* refine_design_idea_doc, test=develop (#1533)
* Add basic concepts (#1530)
* add cn docs, test=develop
* refine docs, test=develop
* add program doc, test=develop
* fix links, test=develop
* fix links, test=develop
* fix links, test=develop
* fix links, test=develop
* Fluid -> Paddle, test=develop
* Fluid -> Paddle, test=develop
* minor fix, test=develop
* minor fix, test=develop
* add Categorical and MultivariateNormalDiag doc (#1517)
* fix typos and links, test=develop, test=document_preview (#1514)
* Update programming_guide.md, test=document_fix (#1534)
  Reorganized the Chinese "Getting Started > Programming Guide" document: removed the detailed introductions to some basic concepts such as Tensor and Program, so that the document focuses on how to use Paddle's static graph.
* Add C-API inference document and provide demo in the doc (#1526)
* add C-API inference document and provide demo in the doc
* fix Layer, label_smooth and Pool2D api, test=develop (#1506)
* fix Layer and Pool2D api, test=develop
* update label_smooth_cn.rst, test=develop
* modified cn doc for attr(shape) supporting -1, test=develop (#1531)
* Add gather_tree_cn.rst (#1537)
* Polish cn doc for gru_unit, lstm_unit and scaled_dot_product_attention.
* Add gather_tree_cn.rst and update the data type of assign/fill_constant_batch_size_like/gather_nd.
* Fix _cn_api_fluid_layers_gather_tree
* Update tensor_array_to_tensor and add mse_loss.
* Update index_cn.rst
* Add GetExpectedKernelType method override rule to op_notes (#1527)
* add GetExpectedKernelType func override rule to op_notes
* add examples url
* add examples to whether override method
* polish description
* polish doc based review comments
* Update deploy_ctr_on_baidu_cloud_cn.rst (#1518)
* Add picture for deploy_ctr_on_baidu_cloud_cn.rst
* Baidu yun K8S + Volcano CTR Training Document 1.0
* Add ctr_pserver_log.png for deploy_ctr_on_baidu_cloud_cn.rst
* Update deploy_ctr_on_baidu_cloud_cn.rst
* Update deploy_ctr_on_baidu_cloud_cn.rst
* Add pserver-log
* Update deploy_ctr_on_baidu_cloud_cn.rst
* change image of kubectl download version from 1.13.3 to 1.13.4
* Update deploy_ctr_on_baidu_cloud_cn.rst
* Update deploy_ctr_on_baidu_cloud_cn.rst
  Update on Sept 18. remove description of helm, tiller, Go install. replace new yaml provided by Jinghui Zhang
* add some images
* Update deploy_ctr_on_baidu_cloud_cn.rst
  add model output manipulation part
* Update deploy_ctr_on_baidu_cloud_cn.rst
* update deploy ctr on baidu cloud cn.rst
* auto convert from md
* Update deploy_ctr_on_baidu_cloud_cn.rst
* replace the overview.png
* Update deploy_ctr_on_baidu_cloud_cn.rst
* Update deploy_ctr_on_baidu_cloud_cn.rst
* Delete train_on_baidu_cloud_en.rst
  it is outdated and cannot match with Chinese Version for months.
* Delete train_on_baidu_cloud_cn.rst
  Has been replaced by ELASTIC CTR
* Update deploy_ctr_on_baidu_cloud_cn.rst
* Update deploy_ctr_on_baidu_cloud_cn.rst
* Update deploy_ctr_on_baidu_cloud_cn.rst
* Update deploy_ctr_on_baidu_cloud_cn.rst
* [Paddle-Lite] add Paddle-Lite doc test=document_preview (#1545)
* change the version number of windows inference (#1549)
* Delete duplicate file api_tree
* fix inference api (#1546) test=develop
* 1026 develop (#1551)
* add dataset
* add back dataset and data feeder

6a432636 · Dong Daxiang · GitHub · 87ab6686 · 6a432636 · 6a432636
17 changed files
--- a/doc/fluid/advanced_usage/deploy/inference/python_infer_cn.md
+++ b/doc/fluid/advanced_usage/deploy/inference/python_infer_cn.md
@@ -83,7 +83,7 @@ config = AnalysisConfig("./model/model", "./model/params")

 其他预测引擎配置选项示例如下
 ``` python
-config.enable_use_gpu(100, 0) # 初始化200M显存，使用gpu id为0
+config.enable_use_gpu(100, 0) # 初始化100M显存，使用gpu id为0
 config.gpu_device_id()        # 返回正在使用的gpu id
 config.disable_gpu()		  # 禁用gpu
 config.switch_ir_optim(True)  # 开启IR优化 

--- a/doc/fluid/advanced_usage/development/new_op/op_notes.md
+++ b/doc/fluid/advanced_usage/development/new_op/op_notes.md
@@ -110,16 +110,53 @@ Fluid的Op的输入输出都是`Variable`，从设计上讲，`Variable`中可
 ### 3.OpKernel需要注册的数据类型
 目前要求所有OpKernel都要注册double和float数据类型。

-### 4.Op兼容性问题
+### 4.GetExpectedKernelType方法重写
+GetExpectedKernelType方法是OperatorWithKernel类中用于获取指定设备（例如CPU，GPU）上指定数据类型（例如double，float）的OpKernel的方法。该方法通过获取输入变量内部的Tensor数据类型得知需要的Kernel数据类型，但是由于Tensor在此处可能尚未被初始化，所以在该方法内使用输入变量时需要进行必要的初始化检查。在新增含Kernel的Op的时候，关于该方法的重写需要注意以下两点。
+
+#### 4.1 仅在必要时重写此方法
+
+基类OperatorWithKernel中的GetExpectedKernelType方法对于派生类Op的所有输入变量进行了完备的初始化检查，建议在新增的Op中直接使用基类的此方法，例如：
+
+- [MeanOp](https://github.com/PaddlePaddle/Paddle/blob/3556514e971bdbb98fdf0f556371c527f4dfa98c/paddle/fluid/operators/mean_op.cc#L39)：该Op的所有输入变量在Run之前应该全部被初始化，初始化检查是必要且合理的
+
+但是在一些情况下，直接使用基类的GetExpectedKernelType方法无法满足需求，则需要对该方法进行重写，具体情况及示例如下：
+
+1. OP的输入有多个，且数据类型不同，例如 [AccuracyOp](https://github.com/PaddlePaddle/Paddle/blob/370f0345b6d35a513c8e64d519a0edfc96b9276c/paddle/fluid/operators/metrics/accuracy_op.cc#L80)，需要重写GetExpectedKernelType方法，指定用某一输入变量获取kernel类型
+
+2. Op包含Dispensable的输入变量，该类输入变量是可选的，当用户未输入时，该类变量未被初始化属于合理情况，例如 [ConvOp](https://github.com/PaddlePaddle/Paddle/blob/250e72d254ccbe3521c29aa2801a1cb15b75ea73/paddle/fluid/operators/conv_op.cc#L206)，存在Bias等可选的输入变量，需要重写GetExpectedKernelType方法，指定用必须提供的输入变量获取kernel类型
+
+3. Op的部分输入变量即使未被初始化也属于合理情况，例如 [ConcatOp](https://github.com/PaddlePaddle/Paddle/blob/250e72d254ccbe3521c29aa2801a1cb15b75ea73/paddle/fluid/operators/concat_op.cc#L90)，输入变量X中有多个Tensor需要连接，其中可能包含未被初始化的Tensor，需要重写GetExpectedKernelType方法，使用输入变量X获取kernel的过程中，合理忽略掉部分Tensor为空的情况
+
+4. OP的Kernel类型与输入变量无关（可能由其他参数指定），例如 [FillOp](https://github.com/PaddlePaddle/Paddle/blob/efbdad059634bef022d4a3f5b00aef6ef8e88ed6/paddle/fluid/operators/one_hot_op.cc#L72)，该Op没有输入，Kernel类型通过Op的dtype参数指定，因此需要重写GetExpectedKernelType方法，用参数指定的数据类型获取kernel类型
+
+5. Op Kernel的部分参数在使用某些库时，需要指定为相应的值，因此需要重写GetExpectedKernelType方法，覆盖默认参数
+    - 使用CUDNN库：需要指定OpKernel的LibraryType为kCUDNN，例如 [AffineGridOp](https://github.com/PaddlePaddle/Paddle/blob/370f0345b6d35a513c8e64d519a0edfc96b9276c/paddle/fluid/operators/affine_grid_op.cc#L78)
+    - 使用MKLDNN库：需要指定OpKernel的LibraryType和DataLayout为kMKLDNN，例如 [MulOp](https://github.com/PaddlePaddle/Paddle/blob/250e72d254ccbe3521c29aa2801a1cb15b75ea73/paddle/fluid/operators/mul_op.cc#L89)
+
+#### 4.2 重写此方法时需要对输入变量进行初始化检查
+
+在需要重写GetExpectedKernelType方法时，一般会根据某一输入变量获取Kernel的数据类型，此时请使用`OperatorWithKernel::IndicateVarDataType`接口获取变量的dtype，该方法对指定的输入变量进行了必要的初始化检查，详见[Paddle PR #20044](https://github.com/PaddlePaddle/Paddle/pull/20044)，实现示例如下：
+
+```
+  framework::OpKernelType GetExpectedKernelType(
+      const framework::ExecutionContext& ctx) const override {
+    return framework::OpKernelType(
+        OperatorWithKernel::IndicateVarDataType(ctx, "X"), ctx.GetPlace());
+  }
+```
+
+如果未使用带有初始化检查的方法，直接使用了`Tensor->type()`，可能会导致报出`holder_ should not be null. Tensor not initialized yet when Tensor::type()`的错误，例如[Paddle issue #19522](https://github.com/PaddlePaddle/Paddle/issues/19522) ，用户仅凭该错误信息将无法得知具体出错的Op，不利于调试。
+
+### 5.Op兼容性问题
 对Op的修改需要考虑兼容性问题，要保证Op修改之后，之前的模型都能够正常加载及运行。<font color="#FF0000">**所以现在不允许对已有的Op新增输入或者输出，不允许减去Op的已有属性及修改默认值**</font> 。

-### 5.ShareDataWith的调用
+### 6.ShareDataWith的调用
 ShareDataWith的功能是使两个Tensor共享底层buffer，在调用这个操作的时候需要特别注意，在Op内部不能将ShareDataWith作用在Op的输出上，即Op输出的Tensor必须是Malloc出来的。

-### 6.稀疏梯度参数更新方法
+### 7.稀疏梯度参数更新方法
 目前稀疏梯度在做更新的时候会先对梯度做merge，即对相同参数的梯度做累加，然后做参数以及附加参数（如velocity）的更新。

-### 7.显存优化
+### 8.显存优化
 通常反向Op会依赖于前向Op的某些输入(Input)、输出(Output)，以供反向Op计算使用。但有些情况下，反向Op不需要前向Op的所有输入和输出；有些情况下，反向Op只需要前向Op的部分输入和输出；有些情况下，反向Op只需要使用前向Op中输入和输出变量的Shape和LoD信息。若Op开发者在注册反向Op时，将不必要的前向Op输入和输出作为反向Op的输入，会导致这部分显存无法被框架现有的显存优化策略优化，从而导致模型显存占用过高。

 所以在写注册反向Op时需要注意以下几点：
@@ -175,7 +212,7 @@ REGISTER_OPERATOR(slice_grad, ops::SliceOpGrad,
                  ops::SliceOpGradNoNeedBufferVarsInference);
 ```

-### 8.混合设备调用
+### 9.混合设备调用
 由于GPU是异步执行的，当CPU调用返回之后，GPU端可能还没有真正的执行，所以如果在Op中创建了GPU运行时需要用到的临时变量，当GPU开始运行的时候，该临时变量可能在CPU端已经被释放，这样可能会导致GPU计算出错。

 关于GPU中的一些同步和异步操作：
@@ -195,7 +232,7 @@ The following device operations are asynchronous with respect to the host:

 更多内容可参考：[Asynchronous Concurrent Execution](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#asynchronous-concurrent-execution)，[API synchronization behavior](https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior)

-### 9. LoD 在 Op 内部的传导规范
+### 10. LoD 在 Op 内部的传导规范

 [LoD](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/concepts/lod_tensor.md) 是 Paddle Fluid 框架用来表示变长序列数据的属性，除了仅支持输入是 padding  data 的 Op 外，所有 Op 的实现都要考虑 LoD 的传导问题。


--- a/doc/fluid/api_cn/index_cn.rst
+++ b/doc/fluid/api_cn/index_cn.rst
@@ -10,6 +10,8 @@ API Reference
    average_cn.rst
    backward_cn.rst
    clip_cn.rst
+    data/data_feeder_cn.rst
+    data/dataset_cn.rst
    data_feeder_cn.rst
    dataset_cn.rst
    dygraph_cn.rst

--- a/doc/fluid/api_cn/layers_cn.rst
+++ b/doc/fluid/api_cn/layers_cn.rst
@@ -113,6 +113,7 @@ fluid.layers
    layers_cn/fsp_matrix_cn.rst
    layers_cn/gather_cn.rst
    layers_cn/gather_nd_cn.rst
+    layers_cn/gather_tree_cn.rst
    layers_cn/gaussian_random_batch_size_like_cn.rst
    layers_cn/gaussian_random_cn.rst
    layers_cn/generate_mask_labels_cn.rst
@@ -173,6 +174,7 @@ fluid.layers
    layers_cn/mean_cn.rst
    layers_cn/mean_iou_cn.rst
    layers_cn/merge_selected_rows_cn.rst
+    layers_cn/mse_loss_cn.rst
    layers_cn/mul_cn.rst
    layers_cn/multi_box_head_cn.rst
    layers_cn/multiclass_nms_cn.rst

--- a/doc/fluid/api_cn/layers_cn/assign_cn.rst
+++ b/doc/fluid/api_cn/layers_cn/assign_cn.rst
@@ -8,7 +8,7 @@ assign
 该OP将输入Tensor或numpy数组拷贝至输出Tensor。

 参数：
-    - **input** (Variable|np.ndarray) - 输入Tensor或numpy数组，支持数据类型为float32, float64, int32和int64。
+    - **input** (Variable|np.ndarray) - 输入Tensor或numpy数组，支持数据类型为float32, float64, int32, int64和bool。
    - **output** (Variable，可选) - 输出Tensor。如果为None，则创建一个新的Tensor作为输出Tensor，默认值为None。

 返回：输出Tensor，形状、数据类型、数据值和 ``input`` 一致。

--- a/doc/fluid/api_cn/layers_cn/crop_tensor_cn.rst
+++ b/doc/fluid/api_cn/layers_cn/crop_tensor_cn.rst
@@ -41,7 +41,7 @@ crop_tensor
                        [0, 0, 0, 0]]]

        参数：
-            shape = [2, 2, 3]
+            shape = [2, 2, -1]
            offsets = [0, 0, 1]

        输出：
@@ -52,8 +52,8 @@ crop_tensor
                         [6, 7, 8]]]

 参数:
-  - **x** (Variable): 1-D到6-D Tensor，数据类型为float32或float64。
-  - **shape** (list|tuple|Variable) - 输出Tensor的形状，数据类型为int32。如果是列表或元组，则其长度必须与x的维度大小相同，如果是Variable，则其应该是1-D Tensor。当它是列表时，每一个元素可以是整数或者形状为[1]的Tensor。含有Variable的方式适用于每次迭代时需要改变输出形状的情况。列表和元组中只有第一个元素可以被设置为-1，这意味着输出的第一维大小与输入相同。
+  - **x** (Variable): 1-D到6-D Tensor，数据类型为float32、float64、int32或者int64。
+  - **shape** (list|tuple|Variable) - 输出Tensor的形状，数据类型为int32。如果是列表或元组，则其长度必须与x的维度大小相同，如果是Variable，则其应该是1-D Tensor。当它是列表时，每一个元素可以是整数或者形状为[1]的Tensor。含有Variable的方式适用于每次迭代时需要改变输出形状的情况。
  - **offsets** (list|tuple|Variable，可选) - 每个维度上裁剪的偏移量，数据类型为int32。如果是列表或元组，则其长度必须与x的维度大小相同，如果是Variable，则其应是1-D Tensor。当它是列表时，每一个元素可以是整数或者形状为[1]的Variable。含有Variable的方式适用于每次迭代的偏移量（offset）都可能改变的情况。默认值：None，每个维度的偏移量为0。
  - **name** (str，可选) - 具体用法请参见 :ref:`api_guide_Name` ，一般无需设置，默认值为None。

@@ -62,39 +62,43 @@ crop_tensor
 返回类型: Variable

 抛出异常：
-    - :code:`ValueError` - shape 应该是列表、元组或Variable。
-    - :code:`ValueError` - offsets 应该是列表、元组、Variable或None。
+    - :code:`TypeError` - x 的数据类型应该是float32、float64、int32或者int64。
+    - :code:`TypeError` - shape 应该是列表、元组或Variable。
+    - :code:`TypeError` - shape 的数据类型应该是int32。
+    - :code:`TypeError` - offsets 应该是列表、元组、Variable或None。
+    - :code:`TypeError` - offsets 的数据类型应该是int32。
+    - :code:`TypeError` - offsets 的元素应该大于等于0。

 **代码示例**:

 ..  code-block:: python
    
    import paddle.fluid as fluid
-    x = fluid.layers.data(name="x", shape=[3, 5], dtype="float32")
+    x = fluid.data(name="x", shape=[None, 3, 5], dtype="float32")
    # x.shape = [-1, 3, 5], where -1 indicates batch size, and it will get the exact value in runtime.

-    # shape is a 1-D tensor variable
-    crop_shape = fluid.layers.data(name="crop_shape", shape=[3], dtype="int32", append_batch_size=False)
+    # shape is a 1-D Tensor
+    crop_shape = fluid.data(name="crop_shape", shape=[3], dtype="int32")
    crop0 = fluid.layers.crop_tensor(x, shape=crop_shape)
    # crop0.shape = [-1, -1, -1], it means crop0.shape[0] = x.shape[0] in runtime.

    # or shape is a list in which each element is a constant
-    crop1 = fluid.layers.crop_tensor(x, shape=[-1, 2, 3])
+    crop1 = fluid.layers.crop_tensor(x, shape=[-1, -1, 3], offsets=[0, 1, 0])
    # crop1.shape = [-1, 2, 3]

-    # or shape is a list in which each element is a constant or variable
-    y = fluid.layers.data(name="y", shape=[3, 8, 8], dtype="float32")
-    dim1 = fluid.layers.data(name="dim1", shape=[1], dtype="int32", append_batch_size=False)
-    crop2 = fluid.layers.crop_tensor(y, shape=[-1, 3, dim1, 4])
-    # crop2.shape = [-1, 3, -1, 4]
+    # or shape is a list in which each element is a constant or Tensor
+    y = fluid.data(name="y", shape=[3, 8, 8], dtype="float32")
+    dim1 = fluid.data(name="dim1", shape=[1], dtype="int32")
+    crop2 = fluid.layers.crop_tensor(y, shape=[3, dim1, 4])
+    # crop2.shape = [3, -1, 4]

-    # offsets is a 1-D tensor variable
-    crop_offsets = fluid.layers.data(name="crop_offsets", shape=[3], dtype="int32", append_batch_size=False)
+    # offsets is a 1-D Tensor
+    crop_offsets = fluid.data(name="crop_offsets", shape=[3], dtype="int32")
    crop3 = fluid.layers.crop_tensor(x, shape=[-1, 2, 3], offsets=crop_offsets)
    # crop3.shape = [-1, 2, 3]

-    # offsets is a list in which each element is a constant or variable
-    offsets_var =  fluid.layers.data(name="dim1", shape=[1], dtype="int32", append_batch_size=False)
+    # offsets is a list in which each element is a constant or Tensor
+    offsets_var =  fluid.data(name="dim1", shape=[1], dtype="int32")
    crop4 = fluid.layers.crop_tensor(x, shape=[-1, 2, 3], offsets=[0, 1, offsets_var])
    # crop4.shape = [-1, 2, 3]
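
Reviewer note: the updated `crop_tensor` examples rely on `-1` entries in `shape`. Judging only from the examples in this hunk (`shape = [2, 2, -1]` with `offsets = [0, 0, 1]`, and `crop1.shape = [-1, 2, 3]`), a `-1` appears to mean "keep everything from the offset to the end of that dimension". The sketch below spells out that reading in plain Python. `resolve_crop_shape` is a hypothetical helper, not a Paddle API, and the input shape `[2, 3, 4]` is an assumption, since the full input tensor is not visible in this diff.

```python
def resolve_crop_shape(x_shape, shape, offsets=None):
    """Resolve -1 entries in a crop shape (illustrative helper, not a Paddle API)."""
    if offsets is None:
        offsets = [0] * len(x_shape)
    if not (len(x_shape) == len(shape) == len(offsets)):
        raise ValueError("x_shape, shape and offsets must have the same length")
    resolved = []
    for dim, size, offset in zip(x_shape, shape, offsets):
        if offset < 0:
            raise ValueError("offsets must be non-negative")
        # Assumption: -1 keeps everything from the offset to the end of the dimension.
        out = dim - offset if size == -1 else size
        if out <= 0 or offset + out > dim:
            raise ValueError("crop window exceeds the input dimension")
        resolved.append(out)
    return resolved


# Assumed input shape [2, 3, 4] with the parameters from the illustration above.
static_case = resolve_crop_shape([2, 3, 4], [2, 2, -1], [0, 0, 1])
# The crop1 example, with a concrete batch size of 8 standing in for the runtime -1.
runtime_case = resolve_crop_shape([8, 3, 5], [-1, -1, 3], [0, 1, 0])
print(static_case)   # [2, 2, 3]
print(runtime_case)  # [8, 2, 3]
```

The second call matches the `# crop1.shape = [-1, 2, 3]` comment once the batch dimension is known at run time.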

--- a/doc/fluid/api_cn/layers_cn/fill_constant_batch_size_like_cn.rst
+++ b/doc/fluid/api_cn/layers_cn/fill_constant_batch_size_like_cn.rst
@@ -3,18 +3,19 @@
 fill_constant_batch_size_like
 -------------------------------

-.. py:function:: paddle.fluid.layers.fill_constant_batch_size_like(input,shape,dtype,value,input_dim_idx=0,output_dim_idx=0)
+.. py:function:: paddle.fluid.layers.fill_constant_batch_size_like(input,shape,dtype,value,input_dim_idx=0,output_dim_idx=0,force_cpu=False)

 该OP创建一个形状为shape并且数据类型为dtype的Tensor，同时用 ``value`` 中提供的常量初始化该Tensor。在输入为LoDTensor并且input_dim_idx为0的
 时候将输出output_dim_idx维度的大小设置为input输入的batch_size的值，创建的Tensor的stop_gradient属性默认为False。

 参数：
-    - **input** (Variable)- 输入的Tensor或者LoDTensor，支持数据类型为 float32， float64， int32， int64。
+    - **input** (Variable)- 输入的Tensor或者LoDTensor，支持数据类型为 float32， float64， int32， int64，bool。
    - **shape** (list)- 创建Tensor的shape，最后创建的LoDTensor的shape可能会依据input发生变动。
-    - **dtype** (np.dtype|core.VarDesc.VarType|str)- 创建Tensor的数据类型，支持数据类型为 float32， float64， int32， int64。
+    - **dtype** (np.dtype|core.VarDesc.VarType|str)- 创建Tensor的数据类型，支持数据类型为 float32， float64， int32， int64，bool。
    - **value** (float|int)-  用于初始化输出Tensor的常量数据的值。
    - **input_dim_idx** (int)- 当值为0并且输入为LoDTensor的时候，创建Tensor的output_dim_idx维度会设置为input的batch_size值，默认值为0。
    - **output_dim_idx** (int) -用于指定创建的Tensor哪个维度设置为输入batch_size的值，默认值为0。
+    - **force_cpu** (bool)- 指定返回的Tensor是否创建在CPU上，默认值为False，若设为True，则数据在CPU上。

 返回：创建的Tensor, 数据类型为dtype。
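
Reviewer note: the description above says that the `output_dim_idx` dimension of the result is set to the batch size of `input`. A minimal plain-Python sketch of that shape rule follows. `batch_size_like` is a hypothetical helper that takes the batch size directly; it is not how Paddle implements the operator, and it ignores `dtype`, `force_cpu`, and LoD handling.

```python
def batch_size_like(shape, batch_size, value, output_dim_idx=0):
    """Build a constant nested list whose output_dim_idx dimension equals batch_size."""
    out_shape = list(shape)
    # Overwrite the chosen dimension with the batch size taken from the input.
    out_shape[output_dim_idx] = batch_size

    def build(dims):
        # Recursively build a nested list filled with the constant value.
        if not dims:
            return value
        return [build(dims[1:]) for _ in range(dims[0])]

    return out_shape, build(out_shape)


out_shape, out = batch_size_like(shape=[1, 3], batch_size=4, value=0.5)
print(out_shape)  # [4, 3]
print(out[0])     # [0.5, 0.5, 0.5]
```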


--- a/doc/fluid/api_cn/layers_cn/gather_nd_cn.rst
+++ b/doc/fluid/api_cn/layers_cn/gather_nd_cn.rst
@@ -50,7 +50,7 @@ gather_nd


 参数：
-    - **input** (Variable) - 输入张量，数据类型可以是int32，int64，float32，float64。
+    - **input** (Variable) - 输入张量，数据类型可以是int32，int64，float32，float64, bool。
    - **index** (int) - 输入的索引张量，数据类型为非负int32或非负int64。它的维度 :code:`index.rank` 必须大于1，并且 :code:`index.shape[-1] <= input.rank` 。
    - **name** (string) - 该层的名字，默认值为None，表示会自动命名。
    

--- a/doc/fluid/api_cn/layers_cn/gather_tree_cn.rst
+++ b/doc/fluid/api_cn/layers_cn/gather_tree_cn.rst
+.. _cn_api_fluid_layers_gather_tree:
+
+gather_tree
+-------------------------------
+
+.. py:function:: paddle.fluid.layers.gather_tree(ids, parents)
+
+该OP在整个束搜索(Beam Search)结束后使用。在搜索结束后，可以获得每个时间步选择的候选词id及其对应的在搜索树中的parent节点， ``ids`` 和 ``parents`` 的形状布局均为 :math:`[max\_time, batch\_size, beam\_size]` ，该OP从最后一个时间步回溯产生完整的id序列。
+
+
+示例：
+
+::
+
+        给定:
+            ids = [[[2 2]
+                    [6 1]]
+                   [[3 9]
+                    [6 1]]
+                   [[0 1]
+                    [9 0]]]
+            parents = [[[0 0]
+                        [1 1]]
+                       [[1 0]
+                        [1 0]]
+                       [[0 0]
+                        [0 1]]]
+
+        结果:                
+            gather_tree(ids, parents)  
+                        = [[[2 2]
+                            [1 6]]
+                           [[3 3]
+                            [6 1]]
+                           [[0 1]
+                            [9 0]]]
+
+
+
+参数：
+    - **ids** (Variable) - 形状为 :math:`[length, batch\_size, beam\_size]` 的三维Tensor，数据类型是int32或int64。包含了所有时间步选择的id。
+    - **parents** (Variable) - 形状和数据类型均与 ``ids`` 相同的Tensor。包含了束搜索中每一时间步所选id对应的parent。
+    
+返回：和 ``ids`` 具有相同形状和数据类型的Tensor。包含了根据parent回溯而收集产生的完整id序列。
+
+返回类型：Variable
+
+**代码示例**：
+
+.. code-block:: python
+
+    import paddle.fluid as fluid
+
+    ids = fluid.data(name='ids',
+                     shape=[5, 2, 2],
+                     dtype='int64')
+    parents = fluid.data(name='parents',
+                         shape=[5, 2, 2],
+                         dtype='int64')
+    final_sequences = fluid.layers.gather_tree(ids, parents)
+
+
+
+
+
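
Reviewer note: the backtracking that the new `gather_tree` page describes can be sketched in plain Python. This is a reading aid rather than Paddle's implementation; the nested-list layout `[max_time][batch_size][beam_size]` follows the page, and everything else is an assumption made for illustration. The data is the example from the page.

```python
def gather_tree(ids, parents):
    """Backtrace beam-search results stored as [max_time][batch][beam] nested lists."""
    max_time = len(ids)
    batch_size = len(ids[0])
    beam_size = len(ids[0][0])
    out = [[[0] * beam_size for _ in range(batch_size)] for _ in range(max_time)]
    for b in range(batch_size):
        for k in range(beam_size):
            # Start from the last time step and follow the parent pointers backwards.
            out[max_time - 1][b][k] = ids[max_time - 1][b][k]
            parent = parents[max_time - 1][b][k]
            for t in range(max_time - 2, -1, -1):
                out[t][b][k] = ids[t][b][parent]
                parent = parents[t][b][parent]
    return out


ids = [[[2, 2], [6, 1]],
       [[3, 9], [6, 1]],
       [[0, 1], [9, 0]]]
parents = [[[0, 0], [1, 1]],
           [[1, 0], [1, 0]],
           [[0, 0], [0, 1]]]
result = gather_tree(ids, parents)
print(result)
# [[[2, 2], [1, 6]], [[3, 3], [6, 1]], [[0, 1], [9, 0]]]
```

The printed value agrees with the result listed in the page's example.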
--- a/doc/fluid/api_cn/layers_cn/gru_unit_cn.rst
+++ b/doc/fluid/api_cn/layers_cn/gru_unit_cn.rst
@@ -29,7 +29,7 @@ Gated Recurrent Unit（GRU）循环神经网络计算单元。该OP用于完成
    h_t & = (1-u_t) \odot h_{t-1} + u_t \odot \tilde{h_t}


-其中， :math:`x_t` 为当前时间步的输入，这个输入并非 ``input``，该OP不包含 :math:`W_{ux}x_{t}, W_{rx}x_{t}, W_{cx}x_{t}` 的计算，**注意** 要在该OP前使用大小为 ``size`` 的3倍的全连接层并将其输出作为 ``input``；
+其中， :math:`x_t` 为当前时间步的输入，这个输入并非 ``input``，该OP不包含 :math:`W_{ux}x_{t}, W_{rx}x_{t}, W_{cx}x_{t}` 的计算，**注意** 要在该OP前使用大小为GRU隐单元数目的3倍的全连接层并将其输出作为 ``input``；
 :math:`h_{t-1}` 为前一时间步的隐状态 ``hidden``； :math:`u_t` 、 :math:`r_t` 、 :math:`\tilde{h_t}` 和 :math:`h_t` 分别代表了GRU单元中update gate（更新门）、reset gate（重置门）、candidate hidden（候选隐状态）和隐状态输出; :math:`\odot` 为逐个元素相乘；
 :math:`W_{uh}, b_u` 、 :math:`W_{rh}, b_r` 和 :math:`W_{ch}, b_c` 分别代表更新门、重置门和候选隐状态在计算时使用的权重矩阵和偏置。在实现上，三个权重矩阵合并为一个 :math:`[D, D \times 3]` 形状的Tensor存放，三个偏置拼接为一个 :math:`[1, D \times 3]` 形状的Tensor存放，其中 :math:`D` 为隐单元的数目；权重Tensor存放布局为： :math:`W_{uh}` 和 :math:`W_{rh}` 拼接为 :math:`[D, D  \times 2]` 形状位于前半部分，:math:`W_{ch}` 以 :math:`[D, D]` 形状位于后半部分。


--- a/doc/fluid/api_cn/layers_cn/lstm_unit_cn.rst
+++ b/doc/fluid/api_cn/layers_cn/lstm_unit_cn.rst
@@ -31,7 +31,7 @@ Long-Short Term Memory（LSTM）循环神经网络计算单元。该OP用于完
    - **bias_attr** (ParamAttr，可选) - 指定偏置参数属性的对象。默认值为None，表示使用默认的偏置参数属性。具体用法请参见 :ref:`cn_api_fluid_ParamAttr` 。
    - **name**  (str，可选) – 具体用法请参见 :ref:`api_guide_Name` ，一般无需设置，默认值为None。

-返回：Variable的二元组，包含了两个形状均为 :math:`[N, D]` 的Tensor，分别表示hiddel和cell输出，即公式中的 :math:`h_{t}` 和 :math:`c_{t}` 。数据类型与输入 ``x_t`` 相同。
+返回：Variable的二元组，包含了两个形状和数据类型均与 ``hidden_t_prev`` 相同的Tensor，分别表示hidden和cell输出，即公式中的 :math:`h_{t}` 和 :math:`c_{t}` 。

 返回类型：tuple


--- a/doc/fluid/api_cn/layers_cn/mse_loss_cn.rst
+++ b/doc/fluid/api_cn/layers_cn/mse_loss_cn.rst
+.. _cn_api_fluid_layers_mse_loss:
+
+mse_loss
+-------------------------------
+
+.. py:function:: paddle.fluid.layers.mse_loss(input,label)
+
+该OP用于计算预测值和目标值的均方差误差。
+
+对于预测值input和目标值label，公式为：
+
+.. math::
+
+    Out = MEAN((input-label)^{2})
+
+参数：
+    - **input** (Variable) - 预测值，维度为 :math:`[N_1, N_2, ..., N_k, D]` 的多维Tensor，其中最后一维D是类别数目。数据类型为float32或float64。
+    - **label** (Variable) - 目标值，维度为 :math:`[N_1, N_2, ..., N_k, D]` 的多维Tensor，其中最后一维D是类别数目。数据类型为float32或float64。
+
+返回：预测值和目标值的均方差
+
+返回类型：变量（Variable）
+
+**代码示例**：
+
+.. code-block:: python
+
+    import paddle.fluid as fluid
+    y = fluid.data(name='y', shape=[1], dtype='float32')
+    y_predict = fluid.data(name='y_predict', shape=[1], dtype='float32')
+    cost = fluid.layers.mse_loss(input=y_predict, label=y)
+
+
+
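
Reviewer note: the formula `Out = MEAN((input-label)^2)` on the new `mse_loss` page can be checked by hand with a few lines of plain Python. The helper below works on flat lists and is only an illustration of the formula; it is not the Paddle operator, and the sample values are made up.

```python
def mse_loss(inputs, labels):
    """Mean of the squared element-wise differences between two flat lists."""
    if len(inputs) != len(labels) or not inputs:
        raise ValueError("inputs and labels must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(inputs, labels)) / len(inputs)


# Squared errors are 0, 4, 0 and 4, so the mean is 8 / 4.
loss = mse_loss([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 3.0, 2.0])
print(loss)  # 2.0
```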
--- a/doc/fluid/api_cn/layers_cn/tensor_array_to_tensor_cn.rst
+++ b/doc/fluid/api_cn/layers_cn/tensor_array_to_tensor_cn.rst
@@ -3,44 +3,78 @@
 tensor_array_to_tensor
 -------------------------------

-.. py:function:: paddle.fluid.layers.tensor_array_to_tensor(input, axis=1, name=None)
+.. py:function:: paddle.fluid.layers.tensor_array_to_tensor(input, axis=1, name=None, use_stack=False)

-该OP在指定轴上连接LoDTensorArray中的元素。
+该OP将 ``input`` 这个LoDTensorArray中的所有Tensor沿 ``axis`` 指定的轴进行拼接（concat）或堆叠（stack）。
+
+示例：
+
+::
+    
+    - 案例 1：
+
+        给定:
+            
+            input.data = {[[0.6, 0.1, 0.3],
+                           [0.5, 0.3, 0.2]],
+                          [[1.3],
+                           [1.8]],
+                          [[2.3, 2.1],
+                           [2.5, 2.4]]}
+
+            axis = 1, use_stack = False
+
+        结果:                
+
+            output.data = [[0.6, 0.1, 0.3, 1.3, 2.3, 2.1],
+                           [0.5, 0.3, 0.2, 1.8, 2.5, 2.4]]
+
+            output_index.data = [3, 1, 2]
+
+    - 案例 2：
+
+        给定:
+            
+            input.data = {[[0.6, 0.1],
+                           [0.5, 0.3]],
+                          [[0.3, 1.3],
+                           [0.2, 1.8]],
+                          [[2.3, 2.1],
+                           [2.5, 2.4]]}
+
+            axis = 1, use_stack = True
+
+        结果:
+
+            output.data = [[[0.6, 0.1],
+                            [0.3, 1.3],
+                            [2.3, 2.1]],
+                           [[0.5, 0.3],
+                            [0.2, 1.8],
+                            [2.5, 2.4]]]
+
+            output_index.data = [2, 2, 2]

 参数：
  - **input** (Variable) - 输入的LoDTensorArray。支持的数据类型为：float32、float64、int32、int64。
  - **axis** (int，可选) - 指定对输入Tensor进行运算的轴， ``axis`` 的有效范围是[-R, R)，R是输入 ``input`` 中Tensor的Rank，``axis`` 为负时与 ``axis`` +R 等价。默认值为1。
  - **name** (str，可选) – 具体用法请参见 :ref:`api_guide_Name` ，一般无需设置，默认值为None。
+  - **use_stack** (bool，可选) – 指明使用stack或concat进行运算，若为stack模式，要求LoDTensorArray中的所有Tensor具有相同的形状。默认值为False。

-返回：LoDTensor
+返回：Variable的二元组，包含了两个Tensor。第一个Tensor表示对数组内的元素进行stack或concat的输出结果，数据类型与数组中的Tensor相同；第二个Tensor包含了数组中各Tensor在 ``axis`` 维度的大小，数据类型为int32。

-返回类型： Variable
+返回类型： tuple

 **代码示例：**

 .. code-block:: python

-  import paddle.fluid as fluid
-  import numpy as np
-
-  place = fluid.CPUPlace()
-
-  x1 = fluid.layers.data(name="x", shape=[2,2], lod_level=0)
-  tmp = fluid.layers.fill_constant(shape=[2,3], dtype="float32", value=1)
-  x_arr = fluid.layers.create_array(dtype="float32")
-  c0 = fluid.layers.fill_constant(shape=[1], dtype='int64', value=0)
-  fluid.layers.array_write(x=tmp, i=c0, array=x_arr)
-  c1 = fluid.layers.fill_constant(shape=[1], dtype='int64', value=1)
-  fluid.layers.array_write(x=x1, i=c1, array=x_arr)
-  output, output_index = fluid.layers.tensor_array_to_tensor(input=x_arr, axis=1)
-
-  exe = fluid.Executor(place)
-  exe.run(fluid.default_startup_program())
-
-  feedx = fluid.LoDTensor()
-  feedx.set(np.array([[1.3,-2.4],[0,4]]).astype("float32"), place)
-  res = exe.run(fluid.default_main_program(), feed={'x':feedx}, fetch_list=[output], return_numpy=False)
-
-  print(np.array(res[0]))
-  # [[ 1.   1.   1.   1.3 -2.4]
-  #  [ 1.   1.   1.   0.   4. ]]
+    import paddle.fluid as fluid
+    import numpy as np
+    x0 = fluid.layers.assign(np.random.rand(2, 2).astype("float32"))
+    x1 = fluid.layers.assign(np.random.rand(2, 2).astype("float32"))
+    i = fluid.layers.fill_constant(shape=[1], dtype="int64", value=0)
+    array = fluid.layers.create_array(dtype='float32')
+    fluid.layers.array_write(x0, i, array)
+    fluid.layers.array_write(x1, i + 1, array)
+    output, output_index = fluid.layers.tensor_array_to_tensor(input=array)
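
Reviewer note: the difference between the concat mode and the stack mode (`use_stack=True`) of `tensor_array_to_tensor` is easier to see with concrete data. The sketch below reproduces the two illustrated cases with nested lists in plain Python. It is fixed to 2-D tensors and `axis = 1`, and the function is a hypothetical stand-in, not the Paddle operator.

```python
def tensor_array_to_tensor(tensors, use_stack=False):
    """Join 2-D tensors (nested lists) along axis 1; illustrative only."""
    rows = len(tensors[0])
    if any(len(t) != rows for t in tensors):
        raise ValueError("all tensors must agree on dimension 0")
    # Size of every tensor along axis 1, returned as the second output.
    index = [len(t[0]) for t in tensors]
    if use_stack:
        if len(set(index)) != 1:
            raise ValueError("stack mode requires identical shapes")
        # Stack: a new axis 1 enumerates the tensors, giving shape [rows, n, cols].
        output = [[list(t[r]) for t in tensors] for r in range(rows)]
    else:
        # Concat: the rows of all tensors are joined end to end.
        output = [[v for t in tensors for v in t[r]] for r in range(rows)]
    return output, index


case1 = [[[0.6, 0.1, 0.3], [0.5, 0.3, 0.2]],
         [[1.3], [1.8]],
         [[2.3, 2.1], [2.5, 2.4]]]
concat_out, concat_index = tensor_array_to_tensor(case1)

case2 = [[[0.6, 0.1], [0.5, 0.3]],
         [[0.3, 1.3], [0.2, 1.8]],
         [[2.3, 2.1], [2.5, 2.4]]]
stack_out, stack_index = tensor_array_to_tensor(case2, use_stack=True)

print(concat_out, concat_index)
print(stack_out, stack_index)
```

Concat accepts tensors whose sizes differ along `axis`, while stack requires identical shapes, which is why only the second case can be stacked.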
--- a/doc/fluid/api_cn/nets_cn/scaled_dot_product_attention_cn.rst
+++ b/doc/fluid/api_cn/nets_cn/scaled_dot_product_attention_cn.rst
@@ -21,11 +21,11 @@ scaled_dot_product_attention
 参数：
    - **queries** （Variable） - 形状为 :math:`[N, L_q, d_k \times h]` 的三维Tensor，其中 :math:`N` 为batch_size， :math:`L_q` 为查询序列长度， :math:`d_k \times h` 为查询的特征维度大小，:math:`h` 为head数。数据类型为float32或float64。
    - **keys** （Variable） - 形状为 :math:`[N, L_k, d_k \times h]` 的三维Tensor，其中 :math:`N` 为batch_size， :math:`L_k` 为键值序列长度， :math:`d_k \times h` 为键的特征维度大小，:math:`h` 为head数。数据类型与 ``queries`` 相同。
-    - **values** （Variable） - 形状为 :math:`[N, L_k, d_v \times h]` 的三维Tensor，其中 :math:`N` 为batch_size， :math:`L_v` 为键值序列长度， :math:`d_v \times h` 为值的特征维度大小，:math:`h` 为head数。数据类型与 ``queries`` 相同。
+    - **values** （Variable） - 形状为 :math:`[N, L_k, d_v \times h]` 的三维Tensor，其中 :math:`N` 为batch_size， :math:`L_k` 为键值序列长度， :math:`d_v \times h` 为值的特征维度大小，:math:`h` 为head数。数据类型与 ``queries`` 相同。
    - **num_heads** （int） - 指明所使用的head数。head数为1时不对输入进行线性变换。默认值为1。
    - **dropout_rate** （float） - 以指定的概率对要attention到的内容进行dropout。默认值为0，即不使用dropout。

-返回： 形状为 :math:`[N, L_q, d_v * h]` 的三维Tensor，其中 :math:`N` 为batch_size， :math:`L_q` 为查询序列长度， :math:`d_v * h` 为值的特征维度大小。与输入具有相同的数据类型。表示。
+返回： 形状为 :math:`[N, L_q, d_v * h]` 的三维Tensor，其中 :math:`N` 为batch_size， :math:`L_q` 为查询序列长度， :math:`d_v * h` 为值的特征维度大小。与输入具有相同的数据类型。表示Multi-Head Attention的输出。

 返回类型： Variable


--- a/doc/fluid/user_guides/howto/training/deploy_ctr_on_baidu_cloud_cn.rst
+++ b/doc/fluid/user_guides/howto/training/deploy_ctr_on_baidu_cloud_cn.rst
@@ -10,7 +10,7 @@ ELASTIC CTR

 * `1. 总体概览 <#head1>`_
 * `2. 前置需求 <#head2>`_
-* `3. 分布式训练+serving方案一键部署 <#head3>`_
+* `3. 分布式训练+Serving方案一键部署 <#head3>`_
 * `4. 查看结果 <#head4>`_
 * `5. 二次开发指南 <#head5>`_

@@ -20,10 +20,10 @@ ELASTIC CTR
 本项目提供了端到端的CTR训练和二次开发的解决方案，主要特点：


-* 使用K8S集群解决原来在物理集群上训练时，会出现类似于配置参数冗杂，环境搭建繁复等问题。
-* 使用基于Kube-batch开发的Volcano框架来进行任务提交和弹性调度。
-* 使用Paddle Serving来进行模型的上线和预测。
-* 使用Cube作为稀疏参数的分布式存储，在预测端与Paddle Serving对接。
+* 整体方案在k8s环境一键部署，可快速搭建与验证效果
+* 基于Paddle transpiler模式的大规模分布式高速训练
+* 训练资源弹性伸缩
+* 工业级稀疏参数Serving组件，高并发条件下单位时间吞吐总量是Redis的13倍 [\ `注1 <#annotation_1>`_\ ]

 本方案整体流程如下图所示：

@@ -37,11 +37,13 @@ ELASTIC CTR


 * trainer/pserver: 训练环节采用PaddlePaddle parameter server模式，对应trainer和pserver角色。分布式训练使用\ `volcano <https://volcano.sh/>`_\ 做批量任务管理工具
-* file server: 训练产出的模型文件，托管到File Server，供下游模块下载；训练产出的文件包括：ProgramDesc和模型参数，模型参数中最大的embedding由工具转换为seqfile格式，经过一系列流程配送到cube分布式稀疏参数服务，其余模型参数保持不变，配送到Paddle Serving模块
-* cube-transfer: 负责监控上游训练作业产出的模型文件（hadoop sequence file）变化，拉取到本地，并调用cube-builder构建cube字典文件；通知cube-agent节点拉取最新的字典文件，并维护各个cube-server上版本一致性
+* file server: 训练产出的模型文件，托管到File Server，供下游模块下载；训练产出的文件包括：ProgramDesc和模型参数，模型参数中最大的embedding由工具转换为seqfile格式，经过一系列流程配送到Cube分布式稀疏参数服务，其余模型参数保持不变，配送到Paddle Serving模块
+* cube-transfer: 负责监控上游训练作业产出的模型文件（hadoop sequence file）变化，拉取到本地，并调用cube-builder构建Cube字典文件；通知cube-agent节点拉取最新的字典文件，并维护各个cube-server上版本一致性
 * cube-builder: 负责将训练作业产出的模型文件（hadoop sequence file格式）转换成可以被cube-server加载的字典文件。字典文件具有特定的数据结构，针对尺寸和内存中访问做了高度优化
-* Cube-Server: 提供分片kv读写能力的服务节点
-* Cube-agent: 与cube-server同机部署，接收cube-transfer下发的字典文件更新命令，拉取数据到本地，通知cube-server进行更新
+* cube-server: 提供分片kv读写能力的服务节点
+* cube-agent: 与cube-server同机部署，接收cube-transfer下发的字典文件更新命令，拉取数据到本地，通知cube-server进行更新
+* Paddle Serving: 加载CTR预估任务模型ProgramDesc和dense参数，提供预测服务
+* Client: CTR预估任务的demo客户端

 以上组件串联完成从训练到预测部署的所有流程。本文档所提供的一键部署脚本\ `paddle-suite.sh <https://github.com/PaddlePaddle/Serving/blob/master/doc/resource/paddle-suite.sh>`_\ 可一键部署上述所有组件。

@@ -58,7 +60,7 @@ ELASTIC CTR

 **第2节 前置需求** 指导用户从零开始，在百度云上申请BCE集群，并部署volcano工具。本方案需使用\ `volcano <https://volcano.sh/>`_\ 做训练环节批量任务管理工具，目前在百度云上验证通过

-**第3节 分布式训练+serving方案部署** 使用paddle-suite.sh，一键部署分布式训练+serving完整流程；并详细解释脚本每一步的工作和含义
+**第3节 分布式训练+Serving方案部署** 使用paddle-suite.sh，一键部署分布式训练+serving完整流程；并详细解释脚本每一步的工作和含义

 **第4节 查看结果** 根据各个pod输出，验证一键安装状态

@@ -76,7 +78,7 @@ ELASTIC CTR
 `百度智能云CCE容器引擎帮助文档-创建集群 <https://cloud.baidu.com/doc/CCE/GettingStarted/24.5C.E5.88.9B.E5.BB.BA.E9.9B.86.E7.BE.A4.html#.E6.93.8D.E4.BD.9C.E6.AD.A5.E9.AA.A4>`_\ ，在百度智能云上建立一个集群，节点配置需要满足如下要求


-* Number of CPU cores &gt; 4
+* Number of CPU cores > 4

 Example of applying for a container engine:

@@ -146,7 +148,7 @@ ELASTIC CTR
   :alt: image


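-3. :raw-html-m2r:`<span id='head3'>One-click deployment of the distributed training + serving solution</span>`
+3. :raw-html-m2r:`<span id='head3'>One-click deployment of the distributed training + Serving solution</span>`
 ================================================================================================================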

 3.1 下载部署方案脚本文件
@@ -397,6 +399,8 @@ pserver日志示例：

   $ docker build -t ${DOCKER_IMAGE_NAME} .
   $ docker push  ${DOCKER_IMAGE_NAME}
+
+We recommend using the image registry provided by Baidu Cloud; see the documentation `Push an image to the image registry <https://cloud.baidu.com/doc/CCE/s/Yjxppt74z/#%E6%8E%A8%E9%80%81%E9%95%9C%E5%83%8F%E5%88%B0%E9%95%9C%E5%83%8F%E4%BB%93%E5%BA%93>`_

 5.2 Specify the training scale
 ------------------------------
@@ -427,10 +431,10 @@ pserver日志示例：

 As shown in the figure above

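-5.3 Specify the number of shards and replicas of the cube parameter server
+5.3 Specify the number of shards and replicas of the Cube parameter server
 --------------------------------------------------------------------------------

-In the cube.yaml file, we can see the definition of each cube node, which has a ``cubeserver pod`` and a ``cube serverservice``. To increase the number of cube replicas and shards, simply copy the relevant definitions and environment variables in the yaml file.
+In the cube.yaml file, we can see the definition of each Cube node, which has a ``cube server pod`` and a ``cube server service``. To increase the number of Cube replicas and shards, simply copy the relevant definitions and environment variables in the yaml file.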


 .. image:: src/cube_config1.png
@@ -444,7 +448,7 @@ pserver日志示例：
   :alt: image


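-Of the two images above, one shows the definition of the cube POD and the other the definition of the cubeSERVICE. To add Cube shards, copy the POD and SERVICE definitions and rename them. The sample program defines 2 shards, so the third, added by copying, can be named cube-2.
+Of the two images above, one shows the definition of the Cube POD and the other the definition of the Cube SERVICE. To add Cube shards, copy the POD and SERVICE definitions and rename them. The sample program defines 2 shards, so the third, added by copying, can be named cube-2.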

 5.4 Adapting Serving to a new model
 ------------------------------------
@@ -460,3 +464,125 @@ pserver日志示例：
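 Users can make their own modifications on this basis.

 For the complete Paddle Serving development workflow, see `Creating a prediction service from scratch with Serving <https://github.com/PaddlePaddle/Serving/blob/develop/doc/CREATING.md>`_ and the `other Paddle Serving documentation <https://github.com/PaddlePaddle/Serving/tree/develop/doc>`_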
+
+
+Notes
+=====
+
+Note 1. :raw-html-m2r:`<span id='annotation_1'>Test environment for the Cube and Redis performance comparison</span>`
+------------------------------------------------------------------------------------------------------------------------
+
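+Both Cube and Redis are deployed on Baidu Cloud. The test measures the performance of a single Cube server node and a single Redis server node only.
+
+The client and the server run on 2 separate cloud hosts; the ping latency between the machines is 0.3 ms to 0.5 ms.
+
+Machine configuration: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, 32 cores
+
+Cube test environment
+^^^^^^^^^^^^^^^^^^^^^
+
+Test keys are 64-bit integers; each value is 10 floats (40 bytes).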
+
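As a quick check of the record size stated above, the sketch below packs one key/value pair with Python's ``struct`` module. The little-endian layout and the 32-bit float width are assumptions made for illustration; the actual Cube storage format is not described in this document, and ``pack_record`` is a hypothetical helper.

```python
import struct

# Assumed layout (NOT Cube's real on-disk format):
# one signed 64-bit key followed by ten 32-bit floats.
KEY_FORMAT = "<q"
VALUE_FORMAT = "<10f"


def pack_record(key, values):
    """Pack a 64-bit integer key and ten float32 values into bytes."""
    if len(values) != 10:
        raise ValueError("expected exactly 10 floats")
    return struct.pack(KEY_FORMAT, key) + struct.pack(VALUE_FORMAT, *values)


key_size = struct.calcsize(KEY_FORMAT)      # size of the key in bytes
value_size = struct.calcsize(VALUE_FORMAT)  # size of the value in bytes
record = pack_record(12345, [0.1 * i for i in range(10)])
print(key_size, value_size, len(record))
```

Under these assumptions the value occupies 40 bytes, matching the figure quoted for the test data.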
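+First, complete the deployment with the one-click deployment script of this solution.
+
+Then write the test code with the Cube client SDK of Paddle Serving.
+
+How it works: start k threads; each thread accesses the Cube server M times and fetches N keys in one batch each time. The request times are summed and averaged.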
+
+.. list-table::
+   :header-rows: 1
+
+   * - Concurrency (load-test threads)
+     - batch size
+     - Average response time (us)
+     - total qps
+   * - 1
+     - 1000
+     - 1312
+     - 762
+   * - 4
+     - 1000
+     - 1496
+     - 2674
+   * - 8
+     - 1000
+     - 1585
+     - 5047
+   * - 16
+     - 1000
+     - 1866
+     - 8574
+   * - 24
+     - 1000
+     - 2236
+     - 10733
+   * - 32
+     - 1000
+     - 2602
+     - 12298
+     
+
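The qps column in the Cube table above appears to follow from the other two columns if every thread issues its requests back to back, so that total qps is roughly threads × 1,000,000 / average latency in microseconds. That closed-loop reading is an assumption, not something the text states; the sketch below only checks it against the reported numbers. ``total_qps`` is a hypothetical helper.

```python
def total_qps(threads, avg_latency_us):
    """Closed-loop estimate: each thread completes one request per avg_latency_us."""
    return threads * 1_000_000 / avg_latency_us


# (threads, average response time in us, reported total qps), copied from the Cube table.
cube_rows = [
    (1, 1312, 762),
    (4, 1496, 2674),
    (8, 1585, 5047),
    (16, 1866, 8574),
    (24, 2236, 10733),
    (32, 2602, 12298),
]

for threads, latency_us, reported in cube_rows:
    estimate = round(total_qps(threads, latency_us))
    print("threads=%d estimated=%d reported=%d" % (threads, estimate, reported))
```

The same formula is only approximate for the Redis table, so it should be read as a sanity check rather than as the way the figures were produced.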
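+Redis test environment
+^^^^^^^^^^^^^^^^^^^^^^
+
+Test keys are random integers between 1 and 1000000; each value is a 40-byte string.
+
+The server runs Redis-server (latest stable 5.0.6).
+
+The client is `get_values.cpp <https://github.com/PaddlePaddle/Serving/blob/master/doc/resource/get_value.cpp>`_, written on top of `redisplusplus <https://github.com/sewenew/redis-plus-plus>`_
+
+How it works: start k threads; each thread accesses the Redis server M times and uses mget to fetch N keys in one batch each time. The request times are summed and averaged.
+
+Usage: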
+
+.. code-block:: bash
+
+   $ ./get_values -h 192.168.1.1 -t 3 -r 10000 -b 1000
+
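+where:
+
+* ``-h``: host name of the server
+* ``-t``: number of concurrent threads
+* ``-r``: number of requests per thread
+* ``-b``: number of keys per mget request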
+
+.. list-table::
+   :header-rows: 1
+
+   * - Concurrency (load-test threads)
+     - batch size
+     - Average response time (us)
+     - total qps
+   * - 1
+     - 1000
+     - 1159
+     - 862
+   * - 4
+     - 1000
+     - 3537
+     - 1079
+   * - 8
+     - 1000
+     - 7726
+     - 1073
+   * - 16
+     - 1000
+     - 15440
+     - 1034
+   * - 24
+     - 1000
+     - 24279
+     - 1004
+   * - 32
+     - 1000
+     - 32570
+     - 996
+
+
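Both tests describe the same procedure: k threads, M requests per thread, N keys per request, with the request times averaged. The sketch below shows one way such a harness could be written. It is not the actual test code: ``fake_batch_get`` is a hypothetical stand-in for the Cube SDK call or the Redis mget call, and ``run_benchmark`` and its parameter names are made up for illustration.

```python
import threading
import time


def fake_batch_get(keys):
    """Hypothetical stand-in for a Cube or Redis batch lookup: one 40-byte value per key."""
    return [b"x" * 40 for _ in keys]


def run_benchmark(batch_get, num_threads, requests_per_thread, batch_size):
    """Run the load test and return (average latency in us, total qps, request count)."""
    latencies_us = []
    lock = threading.Lock()

    def worker():
        local = []
        keys = list(range(1, batch_size + 1))
        for _ in range(requests_per_thread):
            start = time.perf_counter()
            batch_get(keys)
            local.append((time.perf_counter() - start) * 1e6)
        # Merge per-thread measurements once, to keep locking out of the timed loop.
        with lock:
            latencies_us.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    wall_start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall_seconds = time.perf_counter() - wall_start

    avg_us = sum(latencies_us) / len(latencies_us)
    qps = len(latencies_us) / wall_seconds
    return avg_us, qps, len(latencies_us)


# Mirrors "-t 3 -b 1000" from the usage example, with fewer requests per thread.
avg_us, qps, count = run_benchmark(fake_batch_get, num_threads=3,
                                   requests_per_thread=100, batch_size=1000)
print("requests=%d avg_us=%.1f qps=%.1f" % (count, avg_us, qps))
```

Because the stand-in does no network I/O, the printed numbers say nothing about Cube or Redis; only the structure of the measurement is illustrated.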
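+Test conclusions
+^^^^^^^^^^^^^^^^
+
+Thanks to Redis's efficient event-driven model and fully in-memory operation, at a concurrency of 1 the average response time of Redis is about 12% lower than that of Cube (1159 us vs. 1312 us).
+
+In terms of scalability, Redis is limited by its single-threaded model: as concurrency increases, its response time grows roughly in proportion, while total throughput levels off at about 1000 qps. Cube's total qps, by contrast, keeps rising as the number of load-test threads increases, which shows that Cube handles concurrent requests well and scales well.
+
+With few threads, RocksDB's average response time and qps are worse than those of Redis, but in the tests with 16 or more threads RocksDB delivers faster response times and higher qps.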
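To make the scalability comparison concrete, the following sketch computes the Cube-to-Redis throughput ratio at each concurrency level. The numbers are copied from the two tables above; the variable names are illustrative only.

```python
concurrency = [1, 4, 8, 16, 24, 32]
cube_qps = [762, 2674, 5047, 8574, 10733, 12298]   # "total qps" column, Cube table
redis_qps = [862, 1079, 1073, 1034, 1004, 996]     # "total qps" column, Redis table

# Ratio of Cube throughput to Redis throughput per concurrency level.
ratios = {
    threads: cube / redis
    for threads, cube, redis in zip(concurrency, cube_qps, redis_qps)
}

for threads in concurrency:
    print("threads=%d  Cube/Redis qps ratio=%.2f" % (threads, ratios[threads]))
```

The ratio is below 1 at a single thread and largest at 32 threads, which is the trend the conclusions describe.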
--- a/doc/fluid/user_guides/howto/training/train_on_baidu_cloud_cn.rst
+++ b/doc/fluid/user_guides/howto/training/train_on_baidu_cloud_cn.rst
-.. _train_on_baidu_cloud_cn:
-
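-Distributed Training on Baidu Cloud
-=====================================
-
-PaddlePaddle Fluid distributed training can be started without relying on a cluster system (such as MPI or Kubernetes).
-This chapter uses `Baidu Cloud <https://cloud.baidu.com/>`_ as an example to show how to start
-large-scale distributed tasks in a cloud environment, or even a cloud GPU environment.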
-
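-Create a cluster template
----------------------------
-
-Log in to the Baidu Cloud console, select the BCC service, and click "Create Instance". Select a region; note that only some regions offer GPU servers.
-After selecting a suitable region, select the corresponding model and create an empty server, as shown below: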
-
-.. image:: src/create_gpu_machine.png
-
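-* In the operating system options, choose the version you need, and choose the CUDA version that fits your situation. Here we choose CUDA-9.2.
-* The example uses the post-paid billing method, so charges stop when the machine is released, which is more cost-effective for one-off tasks.
-
-After the machine is created, run the following commands to install the paddlepaddle GPU version and its dependencies.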
-
-.. code-block:: bash
-
-  apt-get update && apt-get install -y python python-pip python-opencv
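-  # Note: the Baidu Cloud cuda-9.2 image does not ship with cudnn and nccl2, so they must be installed manually. To install them yourself, download them from the official website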
-  wget -q "http://paddle-train-on-cloud.cdn.bcebos.com/libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb"
-  wget -q "http://paddle-train-on-cloud.cdn.bcebos.com/nccl_2.2.13-1+cuda9.0_x86_64.txz"
-  dpkg -i libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb
-  ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/lib/libcudnn.so
-  unxz nccl_2.2.13-1+cuda9.0_x86_64.txz
-  tar xf nccl_2.2.13-1+cuda9.0_x86_64.tar
-  cp -r nccl_2.2.13-1+cuda9.0_x86_64/lib/* /usr/lib
-  # Note: using the pip mirror below to speed up downloads is optional
-  pip install -i https://pypi.tuna.tsinghua.edu.cn/simple matplotlib==2.2.3
-  pip install -i https://pypi.tuna.tsinghua.edu.cn/simple paddlepaddle-gpu==0.15.0.post97
-
-
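-After installation, use the test program below to check whether this machine can run a GPU training program correctly. If an error occurs, fix the runtime environment
-according to the error message. To make it easier to start a GPU cluster, after the test program succeeds, select the current server and then select "Create Custom Image", so that
-the configured image can be selected when you create the GPU cluster later.
-
-.. image:: src/create_image.png
-
-* Test program: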
-
-.. code-block:: python
-
-  from __future__ import print_function
-
-  import paddle.fluid.core as core
-  import math
-  import os
-  import sys
-
-  import numpy
-
-  import paddle
-  import paddle.fluid as fluid
-
-  BATCH_SIZE = 64
-  PASS_NUM = 1
-
-  def loss_net(hidden, label):
-      prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
-      loss = fluid.layers.cross_entropy(input=prediction, label=label)
-      avg_loss = fluid.layers.mean(loss)
-      acc = fluid.layers.accuracy(input=prediction, label=label)
-      return prediction, avg_loss, acc
-
-  def conv_net(img, label):
-      conv_pool_1 = fluid.nets.simple_img_conv_pool(
-          input=img,
-          filter_size=5,
-          num_filters=20,
-          pool_size=2,
-          pool_stride=2,
-          act="relu")
-      conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
-      conv_pool_2 = fluid.nets.simple_img_conv_pool(
-          input=conv_pool_1,
-          filter_size=5,
-          num_filters=50,
-          pool_size=2,
-          pool_stride=2,
-          act="relu")
-      return loss_net(conv_pool_2, label)
-
-
-  def train(use_cuda):
-      if use_cuda and not fluid.core.is_compiled_with_cuda():
-          return
-      img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
-      label = fluid.layers.data(name='label', shape=[1], dtype='int64')
-      prediction, avg_loss, acc = conv_net(img, label)
-
-      test_program = fluid.default_main_program().clone(for_test=True)
-
-      optimizer = fluid.optimizer.Adam(learning_rate=0.001)
-      optimizer.minimize(avg_loss)
-
-      place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
-      exe = fluid.Executor(place)
-
-      train_reader = paddle.batch(
-          paddle.reader.shuffle(
-              paddle.dataset.mnist.train(), buf_size=500),
-          batch_size=BATCH_SIZE)
-      test_reader = paddle.batch(
-          paddle.dataset.mnist.test(), batch_size=BATCH_SIZE)
-      feeder = fluid.DataFeeder(feed_list=[img, label], place=place)
-
-
-      exe.run(fluid.default_startup_program())
-
-
-      for pass_id in range(PASS_NUM):
-          for batch_id, data in enumerate(train_reader()):
-              acc_np, avg_loss_np = exe.run(fluid.default_main_program(),
-                                            feed=feeder.feed(data),
-                                            fetch_list=[acc, avg_loss])
-              if (batch_id + 1) % 10 == 0:
-                  print(
-                      'PassID {0:1}, BatchID {1:04}, Loss {2:2.2}, Acc {3:2.2}'.
-                      format(pass_id, batch_id + 1,
-                              float(avg_loss_np.mean()), float(acc_np.mean())))
-
-  if __name__ == '__main__':
-      train(True)
-
-
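-Create a cluster
-------------------
-
-After creating the image, you can use it to create a GPU cluster with as many GPU servers as you need.
-As an example, 2 GPU servers are used here, including the server created in the previous step, so only one new server is started.
-
-Click "Create Instance" and select a GPU server with the same configuration in the same region. Be sure to select the image you just created as the operating system.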
-
-.. image:: src/create_more_nodes.png
-
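-Write the cluster job launch script
--------------------------------------
-
-To make it easier to launch distributed training jobs on more GPU servers, we use
-`fabric <http://www.fabfile.org/>`_
-as the cluster job launch and management tool. You can choose another cluster framework you are familiar with, such as MPI or Kubernetes. The method shown in this example
-applies only to simple cluster environments in which the servers can log in to each other over ssh.
-
-To install fabric, run: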
-
-.. code-block:: bash
-
-  pip install fabric
-
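-Suppose we have created 2 GPU servers with the IP addresses :code:`172.16.0.5,172.16.0.6` . On the first server,
-first create the training program file :code:`dist_train_demo.py` by downloading the code from
-`here <https://raw.githubusercontent.com/PaddlePaddle/FluidDoc/develop/doc/fluid/user_guides/howto/training/src/dist_train_demo.py>`_ .
-Then write the :code:`fabfile.py` script, which starts the parameter server and trainer of the training job on the different servers: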
-
-.. code-block:: python
-
-  from fabric import Group, task
-
-  endpoints = "172.16.0.5:6173,172.16.0.6:6173"
-  port = "6173"
-  pservers = 2
-  trainers = 2
-
-  hosts = []
-  eps = []
-  for ep in endpoints.split(","):
-      eps.append(ep)
-      hosts.append(ep.split(":")[0])
-
-  def start_server(c):
-      current_endpoint = "%s:%s" % (c.host, port)
-      trainer_id = hosts.index(c.host)
-      cmd = "python /root/work/dist_train_demo.py pserver %s %s %d %d &> /root/work/server.log.%s &" % (
-          endpoints, current_endpoint, trainer_id, trainers, c.host)
-      c.run(cmd)
-
-  def start_trainer(c):
-      current_endpoint = "%s:%s" % (c.host, port)
-      trainer_id = hosts.index(c.host)
-      cmd = "python /root/work/dist_train_demo.py trainer %s %s %d %d &> /root/work/trainer.log.%s &" % (
-          endpoints, current_endpoint, trainer_id, trainers, c.host)
-      c.run(cmd)
-
-  @task
-  def start(c):
-      c.connect_kwargs.password = "work@paddle123"
-      c.run("mkdir -p /root/work")
-      c.put("dist_train_demo.py", "/root/work")
-      start_server(c)
-      start_trainer(c)
-
-  @task
-  def tail_log(c):
-      c.connect_kwargs.password = "work@paddle123"
-      c.run("tail /root/work/trainer.log.%s" % c.host)
-
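-After saving the code above to :code:`fabfile.py` , run
-
-.. code-block:: bash
-
-  fab -H 172.16.0.5,172.16.0.6 start
-
-to start a distributed training job. The job starts 2 pserver processes and 2 trainer processes across the two GPU servers and begins training.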
-
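-Get the distributed training results
---------------------------------------
-
-The example job writes its logs under :code:`/root/work` , in files named
-:code:`pserver.log.[IP]` and :code:`trainer.log.[IP]` . You can view these log files manually on the
-servers, or use fabric to fetch the logs from all nodes, for example: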
-
-.. code-block:: bash
-
-  fab -H 172.16.0.5,172.16.0.6 tail-log
-
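-Shut down the cluster
------------------------
-
-After the job finishes, do not forget to release the GPU cluster resources: tick the servers to release and select "Release" to shut down the machines and release the resources.
-To run a new job, you can start a new cluster directly from the previously saved image and follow the steps above to begin training.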
-
-.. image:: src/release.png
--- a/doc/fluid/user_guides/howto/training/train_on_baidu_cloud_en.rst
+++ b/doc/fluid/user_guides/howto/training/train_on_baidu_cloud_en.rst
-.. _train_on_baidu_cloud_en:
-
-Distributed Training on Baidu Cloud
-=====================================
-
-PaddlePaddle Fluid distributed training allows you to start distributed training without relying on cluster systems (such as MPI, Kubernetes).
-This chapter will use `Baidu Cloud <https://cloud.baidu.com/>`_ as an example to show you how to perform large-scale distributed tasks in a cloud environment or even a cloud GPU environment.
-
-Create a cluster template
---------------------------
-
-Log in to Baidu Cloud Console, select BCC Service, and click "Create Instance". Select the region, and note that only some regions have GPU servers available.
-After selecting an appropriate region, select the corresponding model and create an empty server, as shown below:
-
-.. image:: src/create_gpu_machine.png
-
-* In the operating system options, you can select the corresponding version according to your needs. Note that the CUDA version is selected based on the actual situation. Here we choose CUDA-9.2.
-* In the example, the payment method is selected as post-paid, which means that as the machine is released, the charge will stop correspondingly, which is more cost-effective for running a one-time task.
-
-After the machine is created successfully, execute the following command to install the paddlepaddle GPU version and related dependencies.
-
-.. code-block:: bash
-
-  apt-get update && apt-get install -y python python-pip python-opencv
-  # Note: Baidu cloud cuda-9.2 image does not have cudnn and nccl2 installed by default. It needs to be installed manually. If you intend to install it by yourself, you need to download it from the official website.
-  wget -q "http://paddle-train-on-cloud.cdn.bcebos.com/libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb"
-  wget -q "http://paddle-train-on-cloud.cdn.bcebos.com/nccl_2.2.13-1+cuda9.0_x86_64.txz"
-  dpkg -i libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb
-  ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/lib/libcudnn.so
-  unxz nccl_2.2.13-1+cuda9.0_x86_64.txz
-  tar xf nccl_2.2.13-1+cuda9.0_x86_64.tar
-  cp -r nccl_2.2.13-1+cuda9.0_x86_64/lib/* /usr/lib
-  # Note: Using the following pip mirror to speed up the download is optional (useful for users within China).
-  pip install -i https://pypi.tuna.tsinghua.edu.cn/simple matplotlib==2.2.3
-  pip install -i https://pypi.tuna.tsinghua.edu.cn/simple paddlepaddle-gpu==0.15.0.post97
-
-
-After the installation is completed, use the following test program to test whether the current machine can run the GPU training program correctly. If an error is encountered, please fix the running environment problem according to the error message. In order to facilitate the startup of the GPU cluster, after the test program is successfully executed, select the current server and select "Create Customized Image" . You can select the configured image when you create the GPU cluster later.
-
-.. image:: src/create_image.png
-
-* test program:
-
-.. code-block:: python
-
-  from __future__ import print_function
-
-  import paddle.fluid.core as core
-  import math
-  import os
-  import sys
-
-  import numpy
-
-  import paddle
-  import paddle.fluid as fluid
-
-  BATCH_SIZE = 64
-  PASS_NUM = 1
-
-  def loss_net(hidden, label):
-      prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
-      loss = fluid.layers.cross_entropy(input=prediction, label=label)
-      avg_loss = fluid.layers.mean(loss)
-      acc = fluid.layers.accuracy(input=prediction, label=label)
-      return prediction, avg_loss, acc
-
-  def conv_net(img, label):
-      conv_pool_1 = fluid.nets.simple_img_conv_pool(
-          input=img,
-          filter_size=5,
-          num_filters=20,
-          pool_size=2,
-          pool_stride=2,
-          act="relu")
-      conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
-      conv_pool_2 = fluid.nets.simple_img_conv_pool(
-          input=conv_pool_1,
-          filter_size=5,
-          num_filters=50,
-          pool_size=2,
-          pool_stride=2,
-          act="relu")
-      return loss_net(conv_pool_2, label)
-
-
-  def train(use_cuda):
-      if use_cuda and not fluid.core.is_compiled_with_cuda():
-          return
-      img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
-      label = fluid.layers.data(name='label', shape=[1], dtype='int64')
-      prediction, avg_loss, acc = conv_net(img, label)
-
-      test_program = fluid.default_main_program().clone(for_test=True)
-
-      optimizer = fluid.optimizer.Adam(learning_rate=0.001)
-      optimizer.minimize(avg_loss)
-
-      place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
-      exe = fluid.Executor(place)
-
-      train_reader = paddle.batch(
-          paddle.reader.shuffle(
-              paddle.dataset.mnist.train(), buf_size=500),
-          batch_size=BATCH_SIZE)
-      test_reader = paddle.batch(
-          paddle.dataset.mnist.test(), batch_size=BATCH_SIZE)
-      feeder = fluid.DataFeeder(feed_list=[img, label], place=place)
-
-
-      exe.run(fluid.default_startup_program())
-
-
-      for pass_id in range(PASS_NUM):
-          for batch_id, data in enumerate(train_reader()):
-              acc_np, avg_loss_np = exe.run(fluid.default_main_program(),
-                                            feed=feeder.feed(data),
-                                            fetch_list=[acc, avg_loss])
-              if (batch_id + 1) % 10 == 0:
-                  print(
-                       'PassID {0:1}, BatchID {1:04}, Loss {2:2.2}, Acc {3:2.2}'.
-                      format(pass_id, batch_id + 1,
-                              float(avg_loss_np.mean()), float(acc_np.mean())))
-
-  if __name__ == '__main__':
-      train(True)
-
-
-Create a cluster
------------------
-
-After creating the image, you can use it to create a GPU cluster with as many GPU servers as you need. As an example, two GPU servers are used here, including the one created in the previous step, so only one new server needs to be started.
-
-Click "Create Instance" to select GPU servers with the same settings in the same region. Especially, the image you just created should be selected as the operating system.
-
-.. image:: src/create_more_nodes.png
-
-Write cluster task startup scripts
------------------------------------
-
-In order to facilitate the launch of distributed training tasks on more GPU servers, we will use
-`fabric <http://www.fabfile.org/>`_
-as a cluster task launch management tool. You can choose other familiar cluster frameworks, such as MPI, Kubernetes. 
-
-The methods demonstrated in this example are only proposed for simple cluster environments, and servers can log in to each other through SSH.
-
-To install the fabric, you need to execute:
-
-.. code-block:: bash
-
-  pip install fabric
-
-Suppose we have created two GPU servers whose IP addresses are :code:`172.16.0.5, 172.16.0.6` . On the first server,
-create the training program file :code:`dist_train_demo.py` by downloading the code from
-`here <https://raw.githubusercontent.com/PaddlePaddle/FluidDoc/develop/doc/fluid/user_guides/howto/training/src/dist_train_demo.py>`_ .
-Then write the :code:`fabfile.py` script, which starts the parameter servers and trainers of the training task on the different servers:
-
-.. code-block:: python
-
-  from fabric import Group, task
-
-  endpoints = "172.16.0.5:6173,172.16.0.6:6173"
-  port = "6173"
-  pservers = 2
-  trainers = 2
-
-  hosts = []
-  eps = []
-  for ep in endpoints.split(","):
-      eps.append(ep)
-      hosts.append(ep.split(":")[0])
-
-  def start_server(c):
-      current_endpoint = "%s:%s" % (c.host, port)
-      trainer_id = hosts.index(c.host)
-      cmd = "python /root/work/dist_train_demo.py pserver %s %s %d %d &> /root/work/server.log.%s &" % (
-          endpoints, current_endpoint, trainer_id, trainers, c.host)
-      c.run(cmd)
-
-  def start_trainer(c):
-      current_endpoint = "%s:%s" % (c.host, port)
-      trainer_id = hosts.index(c.host)
-      cmd = "python /root/work/dist_train_demo.py trainer %s %s %d %d &> /root/work/trainer.log.%s &" % (
-          endpoints, current_endpoint, trainer_id, trainers, c.host)
-      c.run(cmd)
-
-  @task
-  def start(c):
-      c.connect_kwargs.password = "work@paddle123"
-      c.run("mkdir -p /root/work")
-      c.put("dist_train_demo.py", "/root/work")
-      start_server(c)
-      start_trainer(c)
-
-  @task
-  def tail_log(c):
-      c.connect_kwargs.password = "work@paddle123"
-      c.run("tail /root/work/trainer.log.%s" % c.host)
-
-Save the above code to :code:`fabfile.py` and execute
-
-.. code-block:: bash
-
-  fab -H 172.16.0.5,172.16.0.6 start
-
-This starts a distributed training task. The task launches two pserver processes and two trainer processes across the two GPU servers and begins training.
-
-Get distributed training results
---------------------------------
-
-The example task writes its logs under :code:`/root/work`, in files named
-:code:`pserver.log.[IP]` and :code:`trainer.log.[IP]`. You can view these log files manually
-on the servers, or use fabric to obtain the log information of all nodes, for example:
-
-.. code-block:: bash
-
-  fab -H 172.16.0.5,172.16.0.6 tail-log
-
-Terminate the cluster
------------------------
-
-After the task is executed, don't forget to release the GPU cluster resources. To do this, firstly select the servers you want to release, and then select "Release" to shut down the machine and release the resources.
-If you need to perform a new task, you can use the previously saved image directly, start a new cluster, and start the training by following the previous steps.
-
-.. image:: src/release.png