Commit 04bde96e authored by typhoonzero

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into gen_nccl_id_op

@@ -9,7 +9,7 @@ import subprocess
import platform
COPYRIGHT = '''
Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
......
@@ -53,7 +53,7 @@ ExternalProject_Add(
${EXTERNAL_PROJECT_LOG_ARGS}
DEPENDS ${MKLDNN_DEPENDS}
GIT_REPOSITORY "https://github.com/01org/mkl-dnn.git"
GIT_TAG "v0.14"
PREFIX ${MKLDNN_SOURCES_DIR}
UPDATE_COMMAND ""
CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${MKLDNN_INSTALL_DIR}
......
# Float16 Inference in PaddlePaddle Fluid
Kexin Zhao <zhaokexin01@baidu.com>
## Introduction
Deep learning is usually a two-stage process: training and inference. The training stage estimates model parameters (weights) from data. The inference stage loads the weights and uses them to interpret inputs. Typically, weights are 32-bit float values (float32). Some new devices, including NVIDIA Volta GPUs, support higher speed computation using 16-bit float values (float16).
This article explains our efforts with PaddlePaddle to train using float32 and to run inference in float16. We describe a [*transpiler*](https://github.com/PaddlePaddle/Paddle/blob/a4d3de0071e1f3912230c3ab3f9ac74cf06b093a/doc/fluid/design/motivation/fluid_compiler.md), which converts a PaddlePaddle Fluid model, which, to be precise, should be called a [Fluid *program*](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/concepts/program.md), into the inference program, and converts the weights from float32 into float16.
## What is float16?
float16 (or FP16) is a half-precision floating-point format that uses 16 bits in memory to represent a value. The advantage over 32-bit single-precision floating-point format (commonly known as float or float32 data type) is that it requires half the storage and bandwidth at the expense of precision and range. Fortunately, DNN inference has a high tolerance for the loss of precision and range when using float16 to represent the weights, and the inference accuracy will only be minimally affected in most cases, which gives us the opportunity to use float16 data type to speed up the inference.
Interested readers can refer to our [design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/data_type/float16.md) and [code](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/platform/float16.h) for more details on how we implement the float16 data type.
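As a quick, self-contained illustration (a NumPy sketch for exposition only, not part of the Fluid implementation), the following shows the halved storage as well as the reduced precision and range of float16:
```python
import numpy as np

# Same number of elements, half the bytes.
w32 = np.random.rand(1024).astype(np.float32)
w16 = w32.astype(np.float16)
print(w32.nbytes, w16.nbytes)  # 4096 vs 2048 bytes

# Reduced precision: float16 keeps roughly 3 decimal digits.
print(np.float16(0.1234567))   # ~0.1234

# Reduced range: values above ~65504 overflow to inf.
print(np.float16(70000.0))     # inf
```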
## Why float16?
The trend in today's deep learning community is to use bigger and deeper models, which translates to a larger memory footprint, higher computation demands, and as a result higher energy consumption on computing devices. The advantages of float16 over float32 are correspondingly three-fold:
1. We only need half the memory size to load the same model using float16 representations. Moreover, most of the intermediate results generated during float16 inference are also of the float16 data type. As a result, the whole memory footprint of float16 inference is roughly half of its float counterpart, which is especially useful when deploying inference on mobile devices with limited available memory. Also, given the same available memory, the maximum batch size for float16 inference is about twice that for float inference.
2. Because float16 occupies less memory than float, in theory, hardware devices can achieve much higher floating point operations per second (FLOPS) for float16 data than float data. Right now, NVIDIA's latest Volta GPUs, including Tesla V100 and Titan V, can deliver significantly higher FLOPS for float16 using Tensor Cores. Moreover, float16 takes less time to read from or write to memory, and hence float16 can make inference more efficient especially in memory-bound applications where the performance is mostly affected by how fast it is to read and write data.
3. From the energy efficiency perspective, the energy needed to read, write, and compute float16 data is much less than its float counterpart, which can significantly reduce the battery power consumption on mobile devices or the total cost of ownership (TCO) of data centers.
## Fluid implementation of float16 inference
### Overview
Fluid uses [Program](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/modules/python_api.md#program) instead of a computation graph to describe a neural network model and the optimization procedure. A Fluid program is a Python wrapper around a protobuf message called [ProgramDesc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/concepts/program.md). Similar to programming languages, the basic structure of a Fluid program is some nested [blocks](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/modules/python_api.md#block), where each block consists of some [variable](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/modules/python_api.md#variable) definitions and a sequence of [operators](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/modules/python_api.md#operator). An [executor](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/concepts/executor.md) will run a given program by sequentially executing the operators in the entrance block.
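For readers unfamiliar with Fluid, a minimal sketch of such a program is shown below (the layer calls are illustrative; see the linked design docs for the authoritative API):
```python
import numpy as np
import paddle.fluid as fluid

# Each layer call appends variables and operators to the entrance block
# of the default program.
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y = fluid.layers.fc(input=x, size=1)

# The executor runs a program by executing its operators in order.
exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
out, = exe.run(fluid.default_main_program(),
               feed={'x': np.random.rand(8, 13).astype('float32')},
               fetch_list=[y])
print(out.shape)  # (8, 1)
```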
### Basic requirement
When an executor runs an operator, it uses a kernel to perform computations on tensors contained in the input variables, and then writes the results to the tensors in the output variables. Each operator has multiple kernels for different combinations of data types, devices, and library types. The operator will select the appropriate kernel to run based on, among other things, the data type of the input tensors. By default, every Fluid operator has a kernel for the float data type that takes float inputs and generates float outputs.
If we provide float input to the first operator in a program, then each operator will use its float kernel to compute a float output and send it as input to the next operator to trigger its float kernel. This chain effect will make the program run in float mode and give us a final output of float data type.
The same principle applies if we want a program to run in float16 mode. We provide an input variable of the float16 data type to the first operator, and every subsequent operator will invoke the float16 kernel until we get the final output in float16. So the preliminary requirement for float16 inference is to add float16 kernels to the operators that are needed in a specific kind of neural network. Our current focus is on Convolutional Neural Networks (CNN), and hence we have added float16 kernels to the following operators: convolution, pooling, GEMM, elementwise addition, batch norm, dropout, various activations including relu and tanh, and softmax.
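For example, assuming a float16-ready inference program has already been saved under a hypothetical `fp16_model_dir`, feeding a float16 tensor is enough to make every operator in the chain pick its float16 kernel (a sketch only; float16 kernels require a CUDA device):
```python
import numpy as np
import paddle.fluid as fluid

place = fluid.CUDAPlace(0)  # float16 kernels are GPU-only
exe = fluid.Executor(place)

# `fp16_model_dir` is a hypothetical directory holding a float16 model.
[program, feed_names, fetch_targets] = fluid.io.load_inference_model(
    'fp16_model_dir', exe)

# A float16 feed tensor triggers the float16 kernel of the first operator;
# every downstream operator then runs in float16 as well.
img = np.random.rand(1, 3, 32, 32).astype(np.float16)
results = exe.run(program,
                  feed={feed_names[0]: img},
                  fetch_list=fetch_targets)
```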
### float16 transpiler
Furthermore, we need a transpiler to write float16 inference code similar to the following:
```python
# Get the float32 inference program and load the associated float32 weights
@@ -64,14 +71,15 @@ fluid.io.save_inference_model(fp16_save_dirname, feed_target_names,
float16_inference_program)
```
In this scenario, we already have a float32 inference program and some associated float32 weights. We can simply use the `transpile` method of the `Float16Transpiler` class to make certain modifications to the existing program and weights so that we have a new float16 program and the associated float16 weights.
We can then run various inference experiments in float16 mode and save the float16 program and weights on disk for future deployment. To enhance code usability, we maintain a consistent API so that users can use the same float32 input data to run the inference program in either float32 or float16 mode and obtain output data of the float32 data type in both cases. Consequently, we need to add cast operators in the float16 inference program for conversions between the float16 tensor and float32 tensor.
The float16 transpiler is implemented to fulfill the requirements mentioned above. The details of the float16 transpiler can be found [here](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/data_type/float16.md#float16-inference).
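Putting the pieces together, a hedged end-to-end sketch of this workflow might look like the following. The `Float16Transpiler` import path and the `transpile` signature are assumptions based on the design doc above, and `fp32_save_dirname` / `fp16_save_dirname` are placeholder directories:
```python
import numpy as np
import paddle.fluid as fluid
# Assumed import path; consult the transpiler code for the exact location.
from paddle.fluid.transpiler import Float16Transpiler

fp32_save_dirname = 'fp32_model_dir'  # placeholder paths
fp16_save_dirname = 'fp16_model_dir'

place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
scope = fluid.core.Scope()

with fluid.scope_guard(scope):
    # Load the float32 inference program and its float32 weights.
    [program, feed_names, fetch_targets] = fluid.io.load_inference_model(
        fp32_save_dirname, exe)

    # Rewrite the program in place: convert weights to float16 and insert
    # cast operators so the API stays float32-in / float32-out.
    Float16Transpiler().transpile(program, place, scope)  # assumed signature

    img = np.random.rand(1, 3, 32, 32).astype(np.float32)
    results = exe.run(program,
                      feed={feed_names[0]: img},
                      fetch_list=fetch_targets)

    # Save the float16 program and weights for deployment.
    fluid.io.save_inference_model(fp16_save_dirname, feed_names,
                                  fetch_targets, exe, program)
```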
### Experiment results
Simply run the following commands to reproduce the experiment results presented in this section:
```bash
git clone https://github.com/PaddlePaddle/Paddle.git
cd Paddle
@@ -84,8 +92,8 @@ nvidia-docker build -t paddle:float16 .
nvidia-docker run -it -v $PWD:/paddle paddle:float16 /paddle/contrib/float16/run_float16_demo.sh
```
#### Accuracy
As mentioned before, DNN inference has been found to be tolerant of the loss of precision and range incurred by float16, and we want to see how good this tolerance is.
We train a resnet32 model on the cifar10 dataset, save it when the test set accuracy is above 60%, and then test the inference accuracy on the 10000 examples of the cifar10 test set in float16 and float32 mode, respectively.
@@ -105,18 +113,18 @@ We repeat the test ten times and get the following results:
| #10 | 62.53% | 62.48% |
| average| 62.63% | 62.62% |
We can see that the accuracy of float16 inference is very close to that of float32 inference in every experiment (within 0.05% difference) and is overall 0.01% better than its float32 counterpart averaged over ten tests.
#### Performance benchmark
Currently, Fluid only supports float16 inference on NVIDIA GPUs. There is no motivation to support float16 inference on non-ARM CPUs where float16 is not natively supported, and float16 calculation will only be slower than its float32 counterpart.
NVIDIA started to support its native float16 data type (which has the same internal memory representation as Fluid's float16 class) on CUDA 7.5. Moreover, float16 speedups on computationally intensive tasks including GEMM (general matrix-matrix multiplication) and convolution are supported since cuBLAS 7.5 and cuDNN 5.0.
Recently, the introduction of [Tensor Cores](https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/) in Volta architecture GPUs and the support of Tensor Core computation in CUDA 9.0 and cuDNN 7 make float16 genuinely superior to float in some deep learning applications.
We thus benchmark the float16 inference performance on a single NVIDIA Tesla V100 GPU (Volta architecture, with Tensor Cores) and compare it with its float32 counterpart. All the following results are in ms (millisecond) averaged over 1000 mini-batches with respect to different mini-batch (mb) sizes.
Average inference time for one mini-batch on the Vgg16 model tested on the ImageNet dataset:
| total | mb=1 | mb=2 | mb=4 | mb=8 | mb=16 | mb=32 | mb=64 |
|-------|-----: |-----: |-----: |-----: |------: |------:|-------:|
@@ -124,7 +132,7 @@ Average inference time for one mini-batch on Vgg16 model tested on imagenet data
|float16| 3.32 | 4.11 | 5.88 | 9.41 | 16.54 | 30.47 | 60.23 |
|Speedup| 4.22 | 2.36 | 3.91 | 3.00 | 3.26 | 2.77 | 2.97 |
We can see that float16 inference provides **2x ~ 4x** speedup on different batch sizes.
The convolution operation is usually the computational bottleneck of a CNN, so we also check the average time spent on the Fluid convolution operators for one mini-batch as follows:
@@ -134,9 +142,9 @@ Convolution operation is ususally the computational bottleneck of CNN, so we als
|float16| 1.78 | 2.10 | 2.93 | 4.55 | 7.99 | 14.63 | 28.67 |
|Speedup| 6.71 | 3.31 | 6.37 | 4.71 | 5.18 | 4.14 | 4.54 |
The Fluid convolution operator uses cuDNN 7 to implement the kernel, and we can see that with the help of Tensor Cores, float16 convolution is significantly faster than its float32 counterpart, which makes the overall float16 inference performance much better.
Similarly, we also list the benchmark results of the Resnet50 model tested on the ImageNet dataset:
| total | mb=1 | mb=2 | mb=4 | mb=8 | mb=16 | mb=32 | mb=64 | mb=128 |
|-------|-----: |-----: |-----: |-----: |------: |------:|-------:|-------:|
@@ -150,14 +158,14 @@ Similarly, we also list the benchmark results of Resnet50 model tested on imagen
|float16| 4.19 | 4.30 | 3.96 | 4.21 | 5.63 | 8.77 | 15.24 | 28.40 |
|Speedup| 1.30 | 1.27 | 1.64 | 1.99 | 2.45 | 2.79 | 2.70 | 2.59 |
We find that the speedup provided by float16 inference starts relatively small at 1.15x for batch size 1 and gradually increases to about 2x for larger batch sizes. A similar trend can be found for the time spent on the convolution operator. Note that right now Tensor Cores will only be utilized in the convolution operation when the input data and filter meet specific dimensional requirements. The speedup by float16 inference for Resnet50 is smaller than its Vgg16 counterpart partially because the convolution operations in Resnet are much simpler than those in Vgg, and this makes Tensor Cores less utilized in Resnet than in Vgg.
We also did the same benchmark on a single NVIDIA GeForce GTX 1080 Ti GPU, which does not support Tensor Cores. The results show that for Vgg16, float16 inference provides a consistent small speedup (around 1.15x) for all mini-batch sizes, while for Resnet50, float16 inference is slower than its float32 counterpart for small batch sizes (mb = 1 and 2) and then delivers around 1.15x speedup for all larger batch sizes. By comparing the benchmarks on the 1080 Ti and the V100, we find that Tensor Cores, which are specialized for float16 computations, are a critical component of high performance float16 inference.
Please refer to [here](https://github.com/PaddlePaddle/Paddle/blob/develop/contrib/float16/float16_benchmark.md) for complete benchmark results.
### Summary
1. Fluid is now able to run inference in float16 mode via a float16 transpiler. We currently support CNN programs, including Vgg and Resnet, to run in float16 inference mode.
2. The accuracy of float16 inference is verified to be almost identical to its float32 counterpart, at least on CNN models.
3. float16 inference provides a significant speedup on the large and computationally intensive Vgg16 model on the ImageNet dataset. For the much smaller and simpler Resnet50 model, the speedup provided by float16 inference is less significant than for the Vgg16 model but still favorable, especially for large batch sizes.
4. We cannot achieve the superior float16 inference performance without the help of the newly introduced Tensor Cores on NVIDIA Volta architecture GPUs.
# Embed Paddle Inference in Your Application
Paddle inference offers APIs in the `C` and `C++` languages.
One can easily deploy a model trained by Paddle by following the steps below:
1. Optimize the native model;
2. Write some code for deployment.
Let's explain the steps in detail.
## Optimize the native Fluid Model
The native model obtained from the training phase needs to be optimized for inference:
- Clean out operators that are not needed for inference, such as the cost operators;
- Prune unnecessary computation branches that have nothing to do with the output;
- Remove extraneous variables;
- Reuse memory in the native Fluid executor;
- Translate the model storage format to a third-party engine's format, so that the inference API can utilize the engine for acceleration;
We have an official tool for this optimization; call `paddle_inference_optimize --help` for more information.
## Write some code
Read `paddle_inference_api.h` for more information.
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <string>
#include <vector>
namespace paddle {
class Predictor {
public:
struct Attr;
Predictor() = default;
// Build the network before inference.
bool Init(const Attr& attr);
// Predict a record.
// Arguments:
// inputs: the name of the input variables.
// outputs: the name of the output variables.
// input_shapes: the shape of the input variables.
// output_shapes: the shape of the output variables.
// input_data: the data of the input variables.
// output_data: the data of the output variables.
bool Run(const std::vector<std::string>& inputs,
const std::vector<std::string>& outputs,
const std::vector<std::vector<int>>& input_shapes,
const std::vector<std::vector<int>>& output_shapes,
const std::vector<std::vector<float>>& input_data,
std::vector<std::vector<float>>* output_data);
// Clone a predictor that shares the model weights.
Predictor* Clone();
// Destroy the Predictor.
~Predictor();
struct Attr {
enum class EngineKind;
std::string model_dir; // path to the model directory.
bool enable_engine{false}; // Enable to execute (part of) the model on
// third-party engines.
EngineKind engine_kind{Attr::EngineKind::kNone};
enum class EngineKind {
kNone = -1, // Use the native Fluid facility.
kAnakin, // Use Anakin for inference.
kTensorRT, // Use TensorRT for inference.
kAutoMixedAnakin, // Automatically mix Fluid with Anakin.
kAutoMixedTensorRT, // Automatically mix Fluid with TensorRT.
};
};
};
} // namespace paddle
../../v2/build_and_install/paddleci.png
\ No newline at end of file
@@ -77,8 +77,7 @@ print "The sematic-vector of testA: ", paddle.infer(fA, parameters, testA)
### Example 2. Sharing Parameters between "Models"
We use GAN in this example. In the following example program, `d0` and `d1`
correspond to the two networks in the following figure:
<img src="https://github.com/wangyang59/book/raw/00036f4b0da5225041a6824587c1a01cf20159b1/gan/image/gan_ig.png" width=400 />
......
@@ -125,12 +125,12 @@ Compile Time -> IR -> Runtime
## Operator/OpWithKernel/OpKernel
![class_diagram](https://raw.githubusercontent.com/PaddlePaddle/Paddle/develop/doc/fluid/images/op_op_with_kern_class_diagram.dot)
---
## Operator
![class_diagram](https://raw.githubusercontent.com/PaddlePaddle/Paddle/develop/doc/fluid/images/op.dot)
* `Operator` is the fundamental building block of the user interface.
* Operator stores input/output variable names and attributes.
@@ -141,7 +141,7 @@ Compile Time -> IR -> Runtime
## OpWithKernel/Kernel
![class_diagram](https://raw.githubusercontent.com/PaddlePaddle/Paddle/develop/doc/fluid/images/op_with_kernel.dot)
* `OpWithKernel` inherits `Operator`.
* `OpWithKernel` contains a Kernel map.
......
@@ -75,7 +75,7 @@ Different layout leads to different implementation of the operator kernel. There
- The inference of Layout is at run-time, not at compile-time.
- Every operator has to implement different kernels for different layouts. Let's take MKLDNN as an example. If we want to implement an MKLDNN convolution operator, we have to implement all the kernels for different layouts, which are listed [here](http://intel.github.io/mkl-dnn/structmkldnn_1_1memory.html). And we will have a special macro to register kernels for MKLDNN operators.
`Layout` is also defined as an enum variable:
......
# Distributed Training with NCCL2 and RDMA
When doing distributed multi-GPU training, network bandwidth often becomes the
bottleneck. We introduce a way to use NCCL2 for such training jobs to
achieve the best performance.
## Prepare Hardware with RDMA and Multiple GPUs
I'm using two Linux servers, each installed with 8 GPUs and
one 100Gb RDMA card.
Base environment is:
* OS: CentOS 7.4
* RDMA device: "Mellanox Technologies MT27700 Family [ConnectX-4]"
* Kernel version: `4.4.88-1.el7.elrepo.x86_64`
* Docker version: `1.12.6`
* Docker storage driver: `overlay2`
* IP addresses: 192.168.16.30,192.168.16.34
In general, the steps include:
1. Install GPU drivers
1. Install RDMA drivers
1. Install "InfiniBand Support"
1. Use Docker to run tests and make sure GPUs and RDMA can work inside
the container.
I'll omit the section "Install GPU drivers" because instructions can easily
be found elsewhere.
### Install RDMA drivers
In my case, I've got two machines with the device
"Mellanox Technologies MT27700 Family [ConnectX-4]" installed. The OS was
"CentOS 7.4" and I updated the kernel to version 4.4 so that Docker can
work with the latest overlay2 filesystem.
***NOTE: before you start, make sure you have a way to get a console
of the server other than ssh because we may need to re-configure the
network device.***
1. Go to http://www.mellanox.com/page/products_dyn?product_family=26,
download the `MLNX_OFED` software at the bottom of the page, and upload it
onto the server.
1. Run `./mlnxofedinstall --add-kernel-support` in the software package.
1. Run `/etc/init.d/openibd restart` to make everything work; note that
this operation may cause the network to go down if you are using this
RDMA device as the default network device and use ssh to log in to the server.
1. Re-configure the network interface, for example:
`ifconfig eth2 192.168.16.30/20 up`, then add routes if needed:
`ip route add default via 192.168.16.1 dev eth2`.
1. Do the same thing on the other node.
1. Use `ping` to test if the two nodes have a typical ICMP connection.
1. Use either `udaddy` or `ib_write_bw` to test that the network connection is
ready and has the desired bandwidth.
### Prepare Docker Image to Run RDMA Programs
1. Build a Docker image using a CUDA base image like `nvidia/cuda:8.0-cudnn5-devel-ubuntu16.04` and install the paddlepaddle whl
package in it.
1. Start a docker container and mount GPU driver libs into it (you can
skip this step if you are using nvidia-docker).
1. Mount RDMA drivers and libs into the Docker image (see the section below),
and also `udaddy` and `ib_write_bw` if needed.
1. Mount GPU devices and RDMA devices into the container using `--device`
or just use privileged mode `--privileged`.
1. Start the container using host network mode: `--net=host`
### RDMA Library Files Needed
Usually, `MLNX_OFED` installs the latest supported libs under
`/usr/lib64/mlnx_ofed/valgrind`. Other libs needed to run RDMA programs
are listed below. These libs must be mounted into the Docker container.
* Libs under `/usr/lib64/mlnx_ofed/valgrind`
* libibcm.so
* libibverbs.so
* libmlx4.so
* libmlx5.so
* libmlx5-rdmav2.so
* librdmacm.so
* Other libs:
* libnl-3.so.200
* libnl-route-3.so.200
* libnuma.so.1
## Start to Run the Training Job
Set the following NCCL environment variables to turn NCCL switches on and off:
| Env Name | Description |
| --- | --- |
| NCCL_SOCKET_IFNAME | The RDMA device, e.g. eth2 |
| NCCL_P2P_DISABLE | Set to 1 to disable P2P transfer between GPUs |
| NCCL_IB_DISABLE | Set to 1 to disable using RDMA |
| NCCL_IB_CUDA_SUPPORT | Set to 1 to enable GPU Direct if supported |
| NCCL_DEBUG | Set debug level: VERSION, WARN, INFO |
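For example, a minimal launcher sketch (an assumption about one way to wire these switches up from Python; the values are only illustrative and must match your RDMA device):
```python
import os

# Illustrative NCCL switches for an RDMA setup.
os.environ["NCCL_SOCKET_IFNAME"] = "eth2"   # the RDMA-capable interface
os.environ["NCCL_IB_DISABLE"] = "0"         # keep RDMA (IB/RoCE) enabled
os.environ["NCCL_IB_CUDA_SUPPORT"] = "1"    # enable GPU Direct if supported
os.environ["NCCL_DEBUG"] = "INFO"           # print NCCL setup information

# The environment must be in place before NCCL is initialized, e.g. before
# launching the trainer script shown below.
os.execvp("python", ["python", "vgg16.py"])
```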
My two servers are `192.168.16.30` and `192.168.16.34`. On node 1, run:
```bash
PADDLE_TRAINER_ID=0 PADDLE_PORT=48372 PADDLE_WORKERS=192.168.16.30,192.168.16.34 POD_IP=192.168.16.30 stdbuf -oL python vgg16.py
```
On node 2, run:
```bash
PADDLE_TRAINER_ID=1 PADDLE_PORT=48372 PADDLE_WORKERS=192.168.16.30,192.168.16.34 POD_IP=192.168.16.34 stdbuf -oL python vgg16.py
```
digraph sample {
graph [rankdir=TD]; node [shape=record];
op [label="{Operator| InferShape()=0\lRun()=0\l | map&#60;string, string[]&#62; inputs_\lmap&#60;string, string[]&#62; outputs_ \l AttributeMap attrs_\l}"];
}
\ No newline at end of file
digraph sample {
graph [rankdir=TD]; node [shape=record];
op [label="{Operator| InferShape()=0\lRun()=0\l | map&#60;string, string[]&#62; inputs_\lmap&#60;string, string[]&#62; outputs_ \l AttributeMap attrs_\l}"];
op_with_kern [label="{OpWithKernel | InferShape()=0\lRun()\l | map&#60;OpKernelKey,OpKernel&#62;kernels_ }"]
op_kernel [label="{OpKernel | Compute()=0}"]
op_kernel_key [label="{OpKernelKey| Place place\n...}"]
op -> op_with_kern [dir=back, arrowtail=onormal]
op_with_kern -> op_kernel [arrowhead=vee, label="contains many"]
{
rank=same;
op_with_kern
op_kernel
}
op_kernel -> op_kernel_key [style=invis]
{
rank=same;
op_kernel
op_kernel_key
}
op_with_kern -> op_kernel_key [arrowhead=vee, label ="\nas map key"]
mul_op [label="MulOp"]
op_with_kern -> mul_op [dir=back, arrowtail=onormal]
mul_kernel [label="template &#60;typename Place&#62;\lclass MulOpKernel\l"]
op_kernel -> mul_kernel [dir=back, arrowtail=onormal]
mul_op -> mul_kernel [arrowhead=vee, label="register many"]
{
rank=same;
mul_op;
mul_kernel;
}
}
\ No newline at end of file
digraph sample {
graph [rankdir=TD]; node [shape=record];
op [label="{Operator}"];
op_with_kern [label="{OpWithKernel | InferShape()=0\lRun()\l | map&#60;OpKernelKey,OpKernel&#62;kernels_ }"]
op_kernel [label="{OpKernel | Compute()=0}"]
op_kernel_key [label="{OpKernelKey| Place place\n...}"]
op -> op_with_kern [dir=back, arrowtail=onormal]
op_with_kern -> op_kernel [arrowhead=vee, label="contains many"]
{
rank=same;
op_with_kern
op_kernel
}
op_kernel -> op_kernel_key [style=invis]
{
rank=same;
op_kernel
op_kernel_key
}
op_with_kern -> op_kernel_key [arrowhead=vee, label ="\nas map key"]
}
\ No newline at end of file
@@ -142,7 +142,7 @@ gated_unit
-----------
.. autoclass:: paddle.v2.layer.gated_unit
:noindex:
Recurrent Layer Group
=====================
@@ -354,7 +354,7 @@ dropout
--------
.. autoclass:: paddle.v2.layer.dropout
:noindex:
dot_prod
---------
.. autoclass:: paddle.v2.layer.dot_prod
@@ -460,6 +460,11 @@ multi_binary_label_cross_entropy_cost
.. autoclass:: paddle.v2.layer.multi_binary_label_cross_entropy_cost
:noindex:
classification_cost
-------------------
.. autoclass:: paddle.v2.layer.classification_cost
:noindex:
huber_regression_cost
-------------------------
.. autoclass:: paddle.v2.layer.huber_regression_cost
@@ -534,7 +539,7 @@ detection_output
----------------
.. autoclass:: paddle.v2.layer.detection_output
:noindex:
Check Layer
============
......
@@ -5,7 +5,7 @@
fully present the advantages of the Intel platform and effectively improve PaddlePaddle's performance on the Intel architecture.
<div align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Paddle/develop/doc/v2/images/overview.png"><br/>
Figure 1. PaddlePaddle on IA
</div>
@@ -42,16 +42,43 @@ Figure 1. PaddlePaddle on IA
The relationship among MKL, MKLML, and MKL-DNN is shown in the following table:
<table>
<thead>
<tr>
<th>Name</th>
<th>Open Source</th>
<th>License</th>
<th>Descriptions</th>
</tr>
</thead>
<tbody>
<tr>
<td>MKL</td>
<td>No</td>
<td>Proprietary</td>
<td>Accelerate math processing routines</td>
</tr>
<tr>
<td>MKLML</td>
<td>No</td>
<td>Proprietary</td>
<td>Small package of MKL, especially for Machine Learning</td>
</tr>
<tr>
<td>MKL-DNN</td>
<td>Yes</td>
<td>Apache 2.0</td>
<td>Accelerate primitives processing routines especially for Deep Neural Networks</td>
</tr>
</tbody>
</table>
MKLML can be used together with MKL-DNN to achieve the best performance.
<div align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Paddle/develop/doc/v2/images/engine.png"><br/>
Figure 2. PaddlePaddle with MKL Engines
</div>
@@ -103,7 +130,7 @@ The MKL-DNN library currently only provides the dynamic library `libmkldnn.so`.
So we define an `MKLDNNMatrix` to manage the different formats of MKL-DNN data and the conversions between them.
<div align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Paddle/develop/doc/v2/images/matrix.png"><br/>
Figure 3. MKLDNNMatrix
</div>
@@ -113,7 +140,7 @@ Figure 3. MKLDNNMatrix
Subclasses only need to use the predefined interfaces and implement the concrete functionality.
<div align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Paddle/develop/doc/v2/images/layers.png"><br/>
Figure 4. MKLDNNLayer
</div>
@@ -150,7 +177,7 @@ Figure 4. MKLDNNLayer
So overall, when implementing each subclass there is no need to worry about the branching logic.
<div align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/Paddle/develop/doc/v2/images/gradients.png"><br/>
Figure 5. Merge Gradients
</div>
......
@@ -41,7 +41,7 @@ Training docker image needs to package the paddle pserver and paddle trainer run
- Generating the initialization arguments for `Paddle PServer` and `Paddle Training` processes.
Since the paddlepaddle official docker image already has the runtimes we need, we'll take it as the base image and pack some additional scripts for the processes mentioned above to build our training image. For more details, please refer to the following link:
- https://github.com/PaddlePaddle/Paddle/tree/develop/doc/v2/howto/cluster/multi_cluster/src/k8s_train/Dockerfile
```bash
@@ -62,7 +62,7 @@ represent the Docker Image which built in this step.
### Prepare Training Data
We can download and split the training data by creating a Kubernetes Job, or customize your image
by editing [k8s_train](https://github.com/PaddlePaddle/Paddle/tree/develop/doc/v2/howto/cluster/multi_cluster/src/k8s_train).
Before creating a Job, we need to bind a [persistentVolumeClaim](https://kubernetes.io/docs/user-guide/persistent-volumes) for the specific type of
file system; the generated dataset would be saved on this volume.
......
digraph G{
subgraph cluster_timestep0 {
label="recurrent timestep i-1"
bgcolor=lightgray
node [style=filled,color=white]
fc0_0 [label="fc 0"]
fc0_1 [label="fc 1"]
fc0_2 [label="fc 2"]
fc0_0 -> fc0_1
fc0_1 -> fc0_2
}
subgraph cluster_timestep1 {
label="recurrent timestep i"
node [style=filled];
fc1_0 [label="fc 0"]
fc1_1 [label="fc 1"]
fc1_2 [label="fc 2"]
color=blue
fc1_0 -> fc1_1
fc1_1 -> fc1_2
}
subgraph cluster_timestep2 {
label="recurrent timestep i+1"
bgcolor=lightgray
node [style=filled,color=white]
fc2_0 [label="fc 0"]
fc2_1 [label="fc 1"]
fc2_2 [label="fc 2"]
fc2_0 -> fc2_1
fc2_1 -> fc2_2
}
fc0_1 -> fc1_1 [style="dotted" constraint=false]
fc1_1 -> fc2_1 [style="dotted" constraint=false]
}
\ No newline at end of file
digraph G{
subgraph cluster_timestep0 {
label="recurrent timestep i-1"
bgcolor=lightgray
node [style=filled,color=white]
fc0_0 [label="fc 0"]
fc0_1 [label="fc 1"]
fc0_2 [label="fc 2"]
m0 [label="memory"]
fc0_0 -> fc0_1
fc0_1 -> fc0_2
fc0_1 -> m0
m0 -> fc0_1
}
subgraph cluster_timestep1 {
label="recurrent timestep i"
node [style=filled];
fc1_0 [label="fc 0"]
fc1_1 [label="fc 1"]
fc1_2 [label="fc 2"]
m1 [label="memory"]
color=blue
fc1_0 -> fc1_1
fc1_1 -> fc1_2
fc1_1 -> m1
m1 -> fc1_1
}
subgraph cluster_timestep2 {
label="recurrent timestep i+1"
bgcolor=lightgray
node [style=filled,color=white]
fc2_0 [label="fc 0"]
fc2_1 [label="fc 1"]
fc2_2 [label="fc 2"]
m2 [label="memory"]
fc2_0 -> fc2_1
fc2_1 -> fc2_2
fc2_1 -> m2
m2 -> fc2_1
}
m0 -> m1 [style="dotted" constraint=false]
m1 -> m2 [style="dotted" constraint=false]
}
\ No newline at end of file
digraph G {
rankdir=LR;
subgraph cluster_t0 {
a [label="4"]
b [label="5"]
c [label="2"]
}
subgraph cluster_t1 {
d [label="0"]
e [label="9"]
}
subgraph cluster_t2 {
f [label="8"]
g [label="1"]
h [label="4"]
}
a -> b;
b -> c;
c -> d [constraint=false];
d -> e;
e -> f [constraint=false];
f -> g;
g -> h;
}
\ No newline at end of file
digraph G {
rankdir=LR;
a [label="4"]
b [label="5"]
c [label="2"]
d [label="0"]
e [label="9"]
f [label="8"]
g [label="1"]
h [label="4"]
a -> b;
b -> c;
c -> d;
d -> e;
e -> f;
f -> g;
g -> h;
}
\ No newline at end of file
@@ -134,6 +134,11 @@ OpDesc *BlockDesc::PrependOp() {
return ops_.front().get();
}
void BlockDesc::PrependAllocatedOp(std::unique_ptr<OpDesc> &&op_desc) {
need_update_ = true;
ops_.emplace_front(std::move(op_desc));
}
OpDesc *BlockDesc::InsertOp(size_t index) {
need_update_ = true;
auto it = ops_.begin() + index;
......
@@ -88,6 +88,8 @@ class BlockDesc {
OpDesc *PrependOp();
void PrependAllocatedOp(std::unique_ptr<OpDesc> &&op_desc);
OpDesc *InsertOp(size_t index);
/*
......
@@ -38,9 +38,7 @@ void BroadcastOpHandle::RunImpl() {
out_var_handles.size(), places_.size(),
"The number of output should equal to the number of places.");
WaitInputVarGenerated();
std::vector<const Scope *> var_scopes;
for (auto *s : local_scopes_) {
@@ -50,29 +48,9 @@ void BroadcastOpHandle::RunImpl() {
auto *in_var =
var_scopes.at(in_var_handle->scope_idx_)->FindVar(in_var_handle->name_);
PADDLE_ENFORCE_NOT_NULL(in_var);
Tensor &in_tensor = VariableVisitor::GetMutableTensor(in_var);
InitOutputValue(*in_var_handle, out_var_handles);
if (platform::is_cpu_place(in_tensor.place())) {
for (auto *out_var_handle : out_var_handles) {
@@ -147,11 +125,37 @@ void BroadcastOpHandle::RunImpl() {
}
}
void BroadcastOpHandle::InitOutputValue(
const VarHandle &in_var_handle,
const std::vector<VarHandle *> &out_var_handles) const {
std::vector<const Scope *> var_scopes;
for (auto *s : local_scopes_) {
var_scopes.emplace_back(s->FindVar(kLocalExecScopeName)->Get<Scope *>());
}
auto *in_var =
var_scopes.at(in_var_handle.scope_idx_)->FindVar(in_var_handle.name_);
Tensor &in_tensor = VariableVisitor::GetMutableTensor(in_var);
// NOTE: The tensors' Place of input and output must be all on GPU or all on
// CPU.
for (auto *out_var_handle : out_var_handles) {
if (out_var_handle->IsTheSameVar(in_var_handle)) {
continue;
}
auto t_out_p = out_var_handle->place_;
auto *out_var = var_scopes.at(out_var_handle->scope_idx_)
->FindVar(out_var_handle->name_);
PADDLE_ENFORCE_NOT_NULL(out_var);
if (is_gpu_place(in_tensor.place())) {
PADDLE_ENFORCE(platform::is_gpu_place(t_out_p),
"Places of input and output must be all on GPU.");
} else {
t_out_p = platform::CPUPlace();
}
VariableVisitor::ShareDimsAndLoD(*in_var, out_var);
VariableVisitor::GetMutableTensor(out_var).mutable_data(t_out_p,
in_tensor.type());
}
}
......
@@ -57,7 +57,6 @@ struct BroadcastOpHandle : public OpHandleBase {
protected:
void RunImpl() override;
private:
const std::vector<Scope *> &local_scopes_;
@@ -65,6 +64,9 @@ struct BroadcastOpHandle : public OpHandleBase {
#ifdef PADDLE_WITH_CUDA
const platform::NCCLContextMap *nccl_ctxs_;
#endif
void InitOutputValue(const VarHandle &in_var_handle,
const std::vector<VarHandle *> &out_var_handles) const;
};
} // namespace details
} // namespace framework
......
@@ -26,20 +26,20 @@ ComputationOpHandle::ComputationOpHandle(const OpDesc &op_desc, Scope *scope,
place_(place) {}
void ComputationOpHandle::RunImpl() {
WaitInputVarGenerated(place_);
this->RunAndRecordEvent([this] {
op_->Run(*scope_->FindVar(kLocalExecScopeName)->Get<Scope *>(), place_);
});
}
bool ComputationOpHandle::NeedWait(VarHandleBase *in_var) {
bool need_wait =
in_var && in_var->generated_op_ &&
in_var->generated_op_->DeviceContext(place_) != dev_ctxes_[place_];
return need_wait;
}
std::string ComputationOpHandle::Name() const { return op_->Type(); }
} // namespace details
} // namespace framework
......
@@ -36,6 +36,8 @@ struct ComputationOpHandle : public OpHandleBase {
protected:
void RunImpl() override;
virtual bool NeedWait(VarHandleBase *in_var);
private:
std::unique_ptr<OperatorBase> op_;
Scope *scope_;
......
...@@ -31,7 +31,7 @@ FetchOpHandle::~FetchOpHandle() { ...@@ -31,7 +31,7 @@ FetchOpHandle::~FetchOpHandle() {
} }
} }
void FetchOpHandle::Wait(platform::DeviceContext *waited_dev) { void FetchOpHandle::RecordWaitEventOnCtx(platform::DeviceContext *waited_ctx) {
PADDLE_THROW("Nobody should wait FetchOp. Unexpceted Error"); PADDLE_THROW("Nobody should wait FetchOp. Unexpceted Error");
} }
...@@ -45,12 +45,8 @@ void FetchOpHandle::WaitAndMergeCPUTensors() const { ...@@ -45,12 +45,8 @@ void FetchOpHandle::WaitAndMergeCPUTensors() const {
} }
void FetchOpHandle::RunImpl() { void FetchOpHandle::RunImpl() {
auto cpu_ctx = WaitInputVarGenerated(platform::CPUPlace());
platform::DeviceContextPool::Instance().Get(platform::CPUPlace());
for (auto *input : inputs_) {
auto *var = static_cast<VarHandle *>(input);
var->generated_op_->Wait(cpu_ctx);
}
tensors_.resize(inputs_.size()); tensors_.resize(inputs_.size());
auto *var_handle = static_cast<VarHandle *>(inputs_[0]); auto *var_handle = static_cast<VarHandle *>(inputs_[0]);
auto &var_name = var_handle->name_; auto &var_name = var_handle->name_;
...@@ -77,6 +73,15 @@ void FetchOpHandle::RunImpl() { ...@@ -77,6 +73,15 @@ void FetchOpHandle::RunImpl() {
this->WaitAndMergeCPUTensors(); this->WaitAndMergeCPUTensors();
} }
void FetchOpHandle::WaitInputVarGenerated(const platform::Place &place) {
auto cpu_ctx = platform::DeviceContextPool::Instance().Get(place);
for (auto *input : inputs_) {
if (input->generated_op_) {
input->generated_op_->RecordWaitEventOnCtx(cpu_ctx);
}
}
}
std::string FetchOpHandle::Name() const { return "Fetch"; } std::string FetchOpHandle::Name() const { return "Fetch"; }
} // namespace details } // namespace details
......
...@@ -33,7 +33,7 @@ struct FetchOpHandle : public OpHandleBase { ...@@ -33,7 +33,7 @@ struct FetchOpHandle : public OpHandleBase {
~FetchOpHandle(); ~FetchOpHandle();
void Wait(platform::DeviceContext *waited_dev) override; void RecordWaitEventOnCtx(platform::DeviceContext *waited_ctx) override;
void WaitAndMergeCPUTensors() const; void WaitAndMergeCPUTensors() const;
...@@ -42,6 +42,8 @@ struct FetchOpHandle : public OpHandleBase { ...@@ -42,6 +42,8 @@ struct FetchOpHandle : public OpHandleBase {
protected: protected:
void RunImpl() override; void RunImpl() override;
virtual void WaitInputVarGenerated(const platform::Place &place);
private: private:
FeedFetchList *data_; FeedFetchList *data_;
size_t offset_; size_t offset_;
......
...@@ -55,7 +55,7 @@ void GatherOpHandle::RunImpl() { ...@@ -55,7 +55,7 @@ void GatherOpHandle::RunImpl() {
"Currently, gather_op only can gather SelectedRows."); "Currently, gather_op only can gather SelectedRows.");
// Wait input done, this Wait is an asynchronous operation // Wait input done, this Wait is an asynchronous operation
WaitInputVarGenerated(in_var_handles); WaitInputVarGenerated();
auto &pre_in_value = pre_in_var->Get<framework::SelectedRows>(); auto &pre_in_value = pre_in_var->Get<framework::SelectedRows>();
std::vector<int64_t> out_rows; std::vector<int64_t> out_rows;
...@@ -111,17 +111,6 @@ void GatherOpHandle::RunImpl() { ...@@ -111,17 +111,6 @@ void GatherOpHandle::RunImpl() {
}); });
} }
void GatherOpHandle::WaitInputVarGenerated(
const std::vector<VarHandle *> &in_var_handles) {
for (auto *in : in_var_handles) {
if (in->generated_op_) {
for (auto pair : dev_ctxes_) {
in->generated_op_->Wait(pair.second);
}
}
}
}
std::string GatherOpHandle::Name() const { return "gather"; } std::string GatherOpHandle::Name() const { return "gather"; }
} // namespace details } // namespace details
} // namespace framework } // namespace framework
......
...@@ -39,7 +39,6 @@ struct GatherOpHandle : public OpHandleBase { ...@@ -39,7 +39,6 @@ struct GatherOpHandle : public OpHandleBase {
protected: protected:
void RunImpl() override; void RunImpl() override;
void WaitInputVarGenerated(const std::vector<VarHandle *> &in_var_handles);
private: private:
const std::vector<Scope *> &local_scopes_; const std::vector<Scope *> &local_scopes_;
......
...@@ -34,10 +34,7 @@ void NCCLAllReduceOpHandle::RunImpl() { ...@@ -34,10 +34,7 @@ void NCCLAllReduceOpHandle::RunImpl() {
return; // No need to all reduce when GPU count = 1; return; // No need to all reduce when GPU count = 1;
} else { } else {
// Wait input done // Wait input done
for (auto *in : inputs_) { WaitInputVarGenerated();
auto &p = static_cast<VarHandle *>(in)->place_;
in->generated_op_->Wait(dev_ctxes_[p]);
}
auto &var_name = static_cast<VarHandle *>(this->inputs_[0])->name_; auto &var_name = static_cast<VarHandle *>(this->inputs_[0])->name_;
int dtype = -1; int dtype = -1;
......
...@@ -56,15 +56,15 @@ void OpHandleBase::Run(bool use_event) { ...@@ -56,15 +56,15 @@ void OpHandleBase::Run(bool use_event) {
RunImpl(); RunImpl();
} }
void OpHandleBase::Wait(platform::DeviceContext *waited_dev) { void OpHandleBase::RecordWaitEventOnCtx(platform::DeviceContext *waited_ctx) {
#ifdef PADDLE_WITH_CUDA #ifdef PADDLE_WITH_CUDA
if (platform::is_cpu_place(waited_dev->GetPlace()) || events_.empty()) { if (platform::is_cpu_place(waited_ctx->GetPlace()) || events_.empty()) {
for (auto &dev_ctx : dev_ctxes_) { for (auto &dev_ctx : dev_ctxes_) {
dev_ctx.second->Wait(); dev_ctx.second->Wait();
} }
} else { } else {
auto stream = auto stream =
static_cast<platform::CUDADeviceContext *>(waited_dev)->stream(); static_cast<platform::CUDADeviceContext *>(waited_ctx)->stream();
for (auto &ev : events_) { for (auto &ev : events_) {
PADDLE_ENFORCE(cudaStreamWaitEvent(stream, ev.second, 0)); PADDLE_ENFORCE(cudaStreamWaitEvent(stream, ev.second, 0));
} }
...@@ -86,6 +86,28 @@ void OpHandleBase::AddOutput(VarHandleBase *out) { ...@@ -86,6 +86,28 @@ void OpHandleBase::AddOutput(VarHandleBase *out) {
out->generated_op_ = this; out->generated_op_ = this;
} }
void OpHandleBase::WaitInputVarGenerated() {
for (auto in_var : inputs_) {
if (NeedWait(in_var)) {
for (auto &pair : dev_ctxes_) {
in_var->generated_op_->RecordWaitEventOnCtx(pair.second);
}
}
}
}
void OpHandleBase::WaitInputVarGenerated(const platform::Place &place) {
for (auto *in : inputs_) {
if (NeedWait(in)) {
in->generated_op_->RecordWaitEventOnCtx(dev_ctxes_[place]);
}
}
}
bool OpHandleBase::NeedWait(VarHandleBase *in_var) {
return in_var && in_var->generated_op_;
}
void OpHandleBase::RunAndRecordEvent(const std::function<void()> &callback) { void OpHandleBase::RunAndRecordEvent(const std::function<void()> &callback) {
#ifdef PADDLE_WITH_CUDA #ifdef PADDLE_WITH_CUDA
if (!events_.empty()) { // Use event if (!events_.empty()) { // Use event
......
...@@ -38,12 +38,24 @@ class OpHandleBase { ...@@ -38,12 +38,24 @@ class OpHandleBase {
void Run(bool use_event); void Run(bool use_event);
virtual void Wait(platform::DeviceContext *waited_dev); virtual void RecordWaitEventOnCtx(platform::DeviceContext *waited_ctx);
void AddInput(VarHandleBase *in); void AddInput(VarHandleBase *in);
void AddOutput(VarHandleBase *out); void AddOutput(VarHandleBase *out);
// This method adds the wait events of all the input on all the device
// context.
// NOTE: This Wait is an asynchronous operation.
virtual void WaitInputVarGenerated();
// This method adds the wait events of all the input on the specified device
// context.
// NOTE: This Wait is an asynchronous operation.
virtual void WaitInputVarGenerated(const platform::Place &place);
virtual bool NeedWait(VarHandleBase *in_var);
// If the Op involves data transfer of multiple devices that // If the Op involves data transfer of multiple devices that
// will likely block other computations. // will likely block other computations.
virtual bool IsMultiDeviceTransfer() { return false; } virtual bool IsMultiDeviceTransfer() { return false; }
......
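A minimal sketch of how a derived op handle is expected to use the new helpers above (the handle name is a placeholder; the pattern mirrors ComputationOpHandle in this change):

```cpp
// Illustrative sketch only; MyOpHandle stands in for any OpHandleBase
// subclass in paddle/fluid/framework/details.
void MyOpHandle::RunImpl() {
  // Enqueue wait events for every input that NeedWait() reports as not yet
  // ready on this op's device context. Per the comments above, the wait is
  // asynchronous on GPU (cudaStreamWaitEvent) and blocking on CPU.
  WaitInputVarGenerated(place_);  // or WaitInputVarGenerated() for all places

  this->RunAndRecordEvent([this] {
    // ... the actual computation on the now-synchronized inputs ...
  });
}
```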
...@@ -51,7 +51,7 @@ void ReduceOpHandle::RunImpl() { ...@@ -51,7 +51,7 @@ void ReduceOpHandle::RunImpl() {
PADDLE_ENFORCE_NOT_NULL(pre_in_var); PADDLE_ENFORCE_NOT_NULL(pre_in_var);
// Wait input done, this Wait is an asynchronous operation // Wait input done, this Wait is an asynchronous operation
WaitInputVarGenerated(in_var_handles); WaitInputVarGenerated();
// NOTE: The Places of all input tensors must be all on CPU or all on GPU. // NOTE: The Places of all input tensors must be all on CPU or all on GPU.
std::vector<platform::Place> in_places; // used to get dev_ctx std::vector<platform::Place> in_places; // used to get dev_ctx
...@@ -80,19 +80,21 @@ void ReduceOpHandle::RunImpl() { ...@@ -80,19 +80,21 @@ void ReduceOpHandle::RunImpl() {
} }
if (pre_in_var->IsType<framework::SelectedRows>()) { if (pre_in_var->IsType<framework::SelectedRows>()) {
std::vector<const SelectedRows *> in_selected_rows = this->RunAndRecordEvent([&] {
GetInputValues<SelectedRows>(in_var_handles, var_scopes); std::vector<const SelectedRows *> in_selected_rows =
GetInputValues<SelectedRows>(in_var_handles, var_scopes);
GatherSelectedRows(in_selected_rows, in_places, dev_ctxes_, t_out_p, GatherSelectedRows(in_selected_rows, in_places, dev_ctxes_, t_out_p,
out_var->GetMutable<framework::SelectedRows>()); out_var->GetMutable<framework::SelectedRows>());
});
} else { } else {
std::vector<const LoDTensor *> lod_tensors = std::vector<const LoDTensor *> lod_tensors =
GetInputValues<LoDTensor>(in_var_handles, var_scopes); GetInputValues<LoDTensor>(in_var_handles, var_scopes);
if (paddle::platform::is_cpu_place(lod_tensors[0]->place())) { if (paddle::platform::is_cpu_place(lod_tensors[0]->place())) {
ReduceLoDTensor func(lod_tensors, this->RunAndRecordEvent([&] {
out_var->GetMutable<framework::LoDTensor>()); ReduceLoDTensor func(lod_tensors,
VisitDataType(ToDataType(lod_tensors[0]->type()), func); out_var->GetMutable<framework::LoDTensor>());
VisitDataType(ToDataType(lod_tensors[0]->type()), func);
});
} else if (paddle::platform::is_gpu_place(lod_tensors[0]->place())) { } else if (paddle::platform::is_gpu_place(lod_tensors[0]->place())) {
#ifdef PADDLE_WITH_CUDA #ifdef PADDLE_WITH_CUDA
auto pre_in = pre_in_var->Get<framework::LoDTensor>(); auto pre_in = pre_in_var->Get<framework::LoDTensor>();
...@@ -157,17 +159,6 @@ std::vector<const T *> ReduceOpHandle::GetInputValues( ...@@ -157,17 +159,6 @@ std::vector<const T *> ReduceOpHandle::GetInputValues(
return in_selected_rows; return in_selected_rows;
} }
void ReduceOpHandle::WaitInputVarGenerated(
const std::vector<VarHandle *> &in_var_handles) {
for (auto *in : in_var_handles) {
if (in->generated_op_) {
for (auto pair : dev_ctxes_) {
in->generated_op_->Wait(pair.second);
}
}
}
}
std::string ReduceOpHandle::Name() const { return "reduce"; } std::string ReduceOpHandle::Name() const { return "reduce"; }
} // namespace details } // namespace details
} // namespace framework } // namespace framework
......
...@@ -60,8 +60,6 @@ struct ReduceOpHandle : public OpHandleBase { ...@@ -60,8 +60,6 @@ struct ReduceOpHandle : public OpHandleBase {
protected: protected:
void RunImpl() override; void RunImpl() override;
void WaitInputVarGenerated(const std::vector<VarHandle *> &in_var_handles);
template <typename T> template <typename T>
std::vector<const T *> GetInputValues( std::vector<const T *> GetInputValues(
const std::vector<VarHandle *> &in_var_handles, const std::vector<VarHandle *> &in_var_handles,
......
...@@ -29,6 +29,7 @@ ScaleLossGradOpHandle::ScaleLossGradOpHandle(size_t num_dev, Scope *scope, ...@@ -29,6 +29,7 @@ ScaleLossGradOpHandle::ScaleLossGradOpHandle(size_t num_dev, Scope *scope,
ScaleLossGradOpHandle::~ScaleLossGradOpHandle() {} ScaleLossGradOpHandle::~ScaleLossGradOpHandle() {}
void ScaleLossGradOpHandle::RunImpl() { void ScaleLossGradOpHandle::RunImpl() {
// Doesn't wait any event
std::string var_name = static_cast<VarHandle *>(this->outputs_[0])->name_; std::string var_name = static_cast<VarHandle *>(this->outputs_[0])->name_;
auto &local_scope = *scope_->FindVar(kLocalExecScopeName)->Get<Scope *>(); auto &local_scope = *scope_->FindVar(kLocalExecScopeName)->Get<Scope *>();
......
...@@ -26,13 +26,16 @@ SendOpHandle::SendOpHandle(const framework::OpDesc &op_desc, ...@@ -26,13 +26,16 @@ SendOpHandle::SendOpHandle(const framework::OpDesc &op_desc,
place_(place) {} place_(place) {}
void SendOpHandle::RunImpl() { void SendOpHandle::RunImpl() {
// TODO(wuyi): need further analysis on whether to wait for VarDummyHandle.
// Wait input done // Wait input done
for (auto *in : inputs_) { for (auto *in : inputs_) {
auto &p = static_cast<VarHandle *>(in)->place_; auto &p = static_cast<VarHandle *>(in)->place_;
if (in->DebugString() == "dummy") { // HACK if (in->DebugString() == "dummy") { // HACK
continue; continue;
} }
in->generated_op_->Wait(dev_ctxes_[p]); if (in->generated_op_) {
in->generated_op_->RecordWaitEventOnCtx(dev_ctxes_[p]);
}
} }
auto &tmp_scope = local_scope_->FindVar(kLocalExecScopeName)->Get<Scope *>(); auto &tmp_scope = local_scope_->FindVar(kLocalExecScopeName)->Get<Scope *>();
// FIXME(wuyi): can not use RunAndRecordEvent here, for it will cause dead // FIXME(wuyi): can not use RunAndRecordEvent here, for it will cause dead
......
...@@ -14,8 +14,6 @@ ...@@ -14,8 +14,6 @@
#include "paddle/fluid/framework/details/threaded_ssa_graph_executor.h" #include "paddle/fluid/framework/details/threaded_ssa_graph_executor.h"
#include "paddle/fluid/framework/details/fetch_op_handle.h"
namespace paddle { namespace paddle {
namespace framework { namespace framework {
namespace details { namespace details {
...@@ -45,73 +43,33 @@ FeedFetchList ThreadedSSAGraphExecutor::Run( ...@@ -45,73 +43,33 @@ FeedFetchList ThreadedSSAGraphExecutor::Run(
// Should revisit it if overlapping is available. // Should revisit it if overlapping is available.
std::unordered_set<OpHandleBase *> delayed_ops; std::unordered_set<OpHandleBase *> delayed_ops;
auto InsertPendingVar = [&pending_vars, &ready_vars](VarHandleBase &var) {
pending_vars.insert(&var);
if (var.generated_op_ == nullptr) {
ready_vars.Push(&var);
}
};
auto InsertPendingOp = [&pending_ops](OpHandleBase &op_instance) {
pending_ops.insert({&op_instance, op_instance.Inputs().size()});
};
// Transform SSAGraph to pending_ops & pending_vars // Transform SSAGraph to pending_ops & pending_vars
for (auto &var_map : graph_->vars_) { for (auto &var_map : graph_->vars_) {
for (auto &name_pair : var_map) { for (auto &name_pair : var_map) {
for (auto &version_pair : name_pair.second) { for (auto &version_pair : name_pair.second) {
InsertPendingVar(*version_pair); InsertPendingVar(&pending_vars, &ready_vars, version_pair.get());
} }
} }
} }
for (auto &var : graph_->dep_vars_) { for (auto &var : graph_->dep_vars_) {
InsertPendingVar(*var); InsertPendingVar(&pending_vars, &ready_vars, var.get());
} }
for (auto &op : graph_->ops_) { for (auto &op : graph_->ops_) {
if (op->Inputs().empty()) { // Special case, Op has no input. if (op->Inputs().empty()) { // Special case, Op has no input.
ready_ops.insert(op.get()); ready_ops.insert(op.get());
} else { } else {
InsertPendingOp(*op); InsertPendingOp(&pending_ops, op.get());
} }
} }
// Step 2. Insert FetchOps // Step 2. Insert FetchOps
std::vector<std::unique_ptr<FetchOpHandle>> fetch_ops; std::vector<std::unique_ptr<FetchOpHandle>> fetch_ops;
FeedFetchList fetch_data(fetch_tensors.size());
std::unordered_map<std::string, std::vector<VarHandleBase *>> fetched_vars;
for (auto &fetch_var_name : fetch_tensors) {
for (auto &var_map : graph_->vars_) {
auto it = var_map.find(fetch_var_name);
if (it != var_map.end()) {
fetched_vars[fetch_var_name].push_back(it->second.rbegin()->get());
}
}
}
std::unordered_set<std::unique_ptr<VarHandleBase>> fetch_dependencies; std::unordered_set<std::unique_ptr<VarHandleBase>> fetch_dependencies;
for (size_t i = 0; i < fetch_tensors.size(); ++i) { FeedFetchList fetch_data(fetch_tensors.size());
auto &var_name = fetch_tensors[i];
auto &vars = fetched_vars.at(var_name);
auto *op = new FetchOpHandle(&fetch_data, i, &local_scopes_);
fetch_ops.emplace_back(op);
for (auto &p : places_) {
op->SetDeviceContext(p, fetch_ctxs_.Get(p));
}
for (auto *var : vars) {
op->AddInput(var);
}
auto *fetch_dummy = new DummyVarHandle(); InsertFetchOps(fetch_tensors, &fetch_ops, &fetch_dependencies, &pending_ops,
op->AddOutput(fetch_dummy); &pending_vars, &ready_vars, &fetch_data);
fetch_dependencies.emplace(fetch_dummy);
InsertPendingVar(*fetch_dummy);
InsertPendingOp(*op);
}
auto run_all_ops = [&](std::unordered_set<OpHandleBase *> &set) { auto run_all_ops = [&](std::unordered_set<OpHandleBase *> &set) {
for (auto *op : set) { for (auto *op : set) {
...@@ -174,6 +132,60 @@ FeedFetchList ThreadedSSAGraphExecutor::Run( ...@@ -174,6 +132,60 @@ FeedFetchList ThreadedSSAGraphExecutor::Run(
return fetch_data; return fetch_data;
} }
void ThreadedSSAGraphExecutor::InsertFetchOps(
const std::vector<std::string> &fetch_tensors,
std::vector<std::unique_ptr<FetchOpHandle>> *fetch_ops,
std::unordered_set<std::unique_ptr<VarHandleBase>> *fetch_dependencies,
std::unordered_map<OpHandleBase *, size_t> *pending_ops,
std::unordered_set<VarHandleBase *> *pending_vars,
BlockingQueue<VarHandleBase *> *ready_vars, FeedFetchList *fetch_data) {
std::unordered_map<std::string, std::vector<VarHandleBase *>> fetched_vars;
for (auto &fetch_var_name : fetch_tensors) {
for (auto &var_map : graph_->vars_) {
auto it = var_map.find(fetch_var_name);
if (it != var_map.end()) {
fetched_vars[fetch_var_name].push_back(it->second.rbegin()->get());
}
}
}
for (size_t i = 0; i < fetch_tensors.size(); ++i) {
auto &var_name = fetch_tensors[i];
auto &vars = fetched_vars.at(var_name);
auto *op = new FetchOpHandle(fetch_data, i, &local_scopes_);
fetch_ops->emplace_back(op);
for (auto &p : places_) {
op->SetDeviceContext(p, fetch_ctxs_.Get(p));
}
for (auto *var : vars) {
op->AddInput(var);
}
auto *fetch_dummy = new DummyVarHandle();
op->AddOutput(fetch_dummy);
fetch_dependencies->emplace(fetch_dummy);
this->InsertPendingVar(pending_vars, ready_vars, fetch_dummy);
this->InsertPendingOp(pending_ops, op);
}
}
void ThreadedSSAGraphExecutor::InsertPendingOp(
std::unordered_map<OpHandleBase *, size_t> *pending_ops,
OpHandleBase *op_instance) const {
pending_ops->insert({op_instance, op_instance->Inputs().size()});
}
void ThreadedSSAGraphExecutor::InsertPendingVar(
std::unordered_set<VarHandleBase *> *pending_vars,
BlockingQueue<VarHandleBase *> *ready_vars, VarHandleBase *var) const {
pending_vars->insert(var);
if (var->generated_op_ == nullptr) {
ready_vars->Push(var);
}
}
void ThreadedSSAGraphExecutor::RunOp( void ThreadedSSAGraphExecutor::RunOp(
BlockingQueue<VarHandleBase *> *ready_var_q, details::OpHandleBase *op) { BlockingQueue<VarHandleBase *> *ready_var_q, details::OpHandleBase *op) {
auto op_run = [ready_var_q, op, this] { auto op_run = [ready_var_q, op, this] {
......
...@@ -23,6 +23,7 @@ ...@@ -23,6 +23,7 @@
#include <functional> #include <functional>
#include "ThreadPool.h" // ThreadPool in thrird party #include "ThreadPool.h" // ThreadPool in thrird party
#include "paddle/fluid/framework/blocking_queue.h" #include "paddle/fluid/framework/blocking_queue.h"
#include "paddle/fluid/framework/details/fetch_op_handle.h"
#include "paddle/fluid/framework/details/ssa_graph_executor.h" #include "paddle/fluid/framework/details/ssa_graph_executor.h"
namespace paddle { namespace paddle {
...@@ -58,6 +59,21 @@ class ThreadedSSAGraphExecutor : public SSAGraphExecutor { ...@@ -58,6 +59,21 @@ class ThreadedSSAGraphExecutor : public SSAGraphExecutor {
std::unique_ptr<platform::EnforceNotMet> exception_; std::unique_ptr<platform::EnforceNotMet> exception_;
std::atomic<int> running_ops_; std::atomic<int> running_ops_;
bool allow_op_delay_; bool allow_op_delay_;
void InsertPendingOp(std::unordered_map<OpHandleBase *, size_t> *pending_ops,
OpHandleBase *op_instance) const;
void InsertPendingVar(std::unordered_set<VarHandleBase *> *pending_vars,
BlockingQueue<VarHandleBase *> *ready_vars,
VarHandleBase *var) const;
void InsertFetchOps(
const std::vector<std::string> &fetch_tensors,
std::vector<std::unique_ptr<FetchOpHandle>> *fetch_ops,
std::unordered_set<std::unique_ptr<VarHandleBase>> *fetch_dependencies,
std::unordered_map<OpHandleBase *, size_t> *pending_ops,
std::unordered_set<VarHandleBase *> *pending_vars,
BlockingQueue<VarHandleBase *> *ready_vars, FeedFetchList *fetch_data);
}; };
} // namespace details } // namespace details
......
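The InsertPendingOp/InsertPendingVar helpers factored out above feed the executor's dependency-count bookkeeping. The following is a self-contained toy model of that pattern; the real ThreadedSSAGraphExecutor additionally handles fetch ops, delayed ops, exceptions and a thread pool.

```cpp
// Toy model of the pending-op / ready-var bookkeeping that
// InsertPendingOp/InsertPendingVar set up. Not the real executor.
#include <cstdio>
#include <queue>
#include <unordered_map>
#include <vector>

struct Op;
struct Var {
  const char* name;
  Op* generated_op = nullptr;   // producer; nullptr for graph inputs
  std::vector<Op*> consumers;   // ops that read this var
};
struct Op {
  const char* name;
  std::vector<Var*> inputs, outputs;
};

void RunGraph(const std::vector<Op*>& ops, const std::vector<Var*>& vars) {
  std::unordered_map<Op*, size_t> pending_ops;  // op -> inputs not ready yet
  std::queue<Var*> ready_vars;

  for (Var* v : vars) {                 // InsertPendingVar analogue
    if (v->generated_op == nullptr) ready_vars.push(v);
  }
  for (Op* op : ops) {                  // InsertPendingOp analogue
    if (op->inputs.empty()) {           // the ready_ops special case in Run()
      std::printf("run %s\n", op->name);
      for (Var* out : op->outputs) ready_vars.push(out);
    } else {
      pending_ops[op] = op->inputs.size();
    }
  }

  while (!ready_vars.empty()) {
    Var* v = ready_vars.front();
    ready_vars.pop();
    for (Op* consumer : v->consumers) {
      if (--pending_ops[consumer] == 0) {  // all inputs ready: run it
        std::printf("run %s\n", consumer->name);
        for (Var* out : consumer->outputs) ready_vars.push(out);
      }
    }
  }
}
```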
...@@ -20,7 +20,9 @@ if(NOT APPLE) ...@@ -20,7 +20,9 @@ if(NOT APPLE)
endif() endif()
if(WITH_TESTING) if(WITH_TESTING)
# both tests/book and analysis depend on the models generated by python/paddle/fluid/tests/book
add_subdirectory(tests/book) add_subdirectory(tests/book)
add_subdirectory(analysis)
endif() endif()
if (TENSORRT_FOUND) if (TENSORRT_FOUND)
......
// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "paddle/fluid/inference/analysis/dot.h"
namespace paddle {
namespace inference {
namespace analysis {
size_t Dot::counter = 0;
} // namespace analysis
} // namespace inference
} // namespace paddle
// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
/*
* This file implements some helper classes and methods for DOT programming
 * support. It gives a visualization of the graph, which helps to debug the
 * logic of each Pass.
*/
#pragma once
#include <glog/logging.h>
#include <sstream>
#include <unordered_map>
#include <vector>
namespace paddle {
namespace inference {
namespace analysis {
/*
* A Dot template that helps to build a DOT graph definition.
*/
class Dot {
public:
static size_t counter;
struct Attr {
std::string key;
std::string value;
Attr(const std::string& key, const std::string& value)
: key(key), value(value) {}
std::string repr() const {
std::stringstream ss;
ss << key << "=" << '"' << value << '"';
return ss.str();
}
};
struct Node {
std::string name;
std::vector<Attr> attrs;
Node(const std::string& name, const std::vector<Attr>& attrs)
: name(name),
attrs(attrs),
id_("node_" + std::to_string(Dot::counter++)) {}
std::string id() const { return id_; }
std::string repr() const {
std::stringstream ss;
CHECK(!name.empty());
ss << id_;
for (size_t i = 0; i < attrs.size(); i++) {
if (i == 0) {
ss << "[label=" << '"' << name << '"' << " ";
}
ss << attrs[i].repr();
ss << ((i < attrs.size() - 1) ? " " : "]");
}
return ss.str();
}
private:
std::string id_;
};
struct Edge {
std::string source;
std::string target;
std::vector<Attr> attrs;
Edge(const std::string& source, const std::string& target,
const std::vector<Attr>& attrs)
: source(source), target(target), attrs(attrs) {}
std::string repr() const {
std::stringstream ss;
CHECK(!source.empty());
CHECK(!target.empty());
ss << source << "->" << target;
for (size_t i = 0; i < attrs.size(); i++) {
if (i == 0) {
ss << "[";
}
ss << attrs[i].repr();
ss << ((i < attrs.size() - 1) ? " " : "]");
}
return ss.str();
}
};
Dot() = default;
explicit Dot(const std::vector<Attr>& attrs) : attrs_(attrs) {}
void AddNode(const std::string& name, const std::vector<Attr>& attrs) {
CHECK(!nodes_.count(name)) << "duplicate Node '" << name << "'";
nodes_.emplace(name, Node{name, attrs});
}
void AddEdge(const std::string& source, const std::string& target,
const std::vector<Attr>& attrs) {
CHECK(!source.empty());
CHECK(!target.empty());
auto sid = nodes_.at(source).id();
auto tid = nodes_.at(target).id();
edges_.emplace_back(sid, tid, attrs);
}
// Compile to DOT language code.
std::string Build() const {
std::stringstream ss;
const std::string indent = " ";
ss << "digraph G {" << '\n';
// Add graph attrs
for (const auto& attr : attrs_) {
ss << indent << attr.repr() << '\n';
}
// add nodes
for (auto& item : nodes_) {
ss << indent << item.second.repr() << '\n';
}
// add edges
for (auto& edge : edges_) {
ss << indent << edge.repr() << '\n';
}
ss << "} // end G";
return ss.str();
}
private:
std::unordered_map<std::string, Node> nodes_;
std::vector<Edge> edges_;
std::vector<Attr> attrs_;
};
} // namespace analysis
} // namespace inference
} // namespace paddle
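A short usage example for the Dot helper defined above (assuming dot.h and the counter definition in dot.cc are available, plus glog for CHECK):

```cpp
#include <iostream>

#include "paddle/fluid/inference/analysis/dot.h"

int main() {
  using paddle::inference::analysis::Dot;

  Dot dot({Dot::Attr("rankdir", "LR")});               // graph-level attribute
  dot.AddNode("conv2d", {Dot::Attr("shape", "box")});
  dot.AddNode("relu", {Dot::Attr("shape", "box")});
  dot.AddEdge("conv2d", "relu", {});                   // edge without attributes

  // Prints a "digraph G { ... }" definition that graphviz can render, which
  // is how a Pass can dump its graph for debugging.
  std::cout << dot.Build() << std::endl;
  return 0;
}
```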
...@@ -19,6 +19,9 @@ limitations under the License. */ ...@@ -19,6 +19,9 @@ limitations under the License. */
namespace paddle { namespace paddle {
namespace inference { namespace inference {
struct Buffer;
enum class DeviceType { UNK = -1, CPU, GPU };
/* /*
* EngineBase is the base class of all inference engines. An inference engine * EngineBase is the base class of all inference engines. An inference engine
* takes a paddle program as input, and outputs the result in fluid Tensor * takes a paddle program as input, and outputs the result in fluid Tensor
...@@ -45,8 +48,20 @@ class EngineBase { ...@@ -45,8 +48,20 @@ class EngineBase {
// Execute the engine, that will run the inference network. // Execute the engine, that will run the inference network.
virtual void Execute(int batch_size) = 0; virtual void Execute(int batch_size) = 0;
// Return the IO buffer allocated in the engine. One can read/write directly
// on the buffer. If the buffer's underlying pointer is nullptr, one can also
// allocate the memory and maintain it outside the engine.
virtual Buffer& buffer(const std::string& name) = 0;
virtual ~EngineBase() {} virtual ~EngineBase() {}
}; // class EngineBase }; // class EngineBase
struct Buffer {
void* buffer{nullptr}; // buffer should be allocated only once.
int max_size; // buffer allocated space.
int size; // data size.
DeviceType device{DeviceType::UNK}; // tells which device this buffer is on.
};
} // namespace inference } // namespace inference
} // namespace paddle } // namespace paddle
nv_library(tensorrt_engine SRCS engine.cc DEPS framework_proto)
nv_test(test_tensorrt SRCS test_tensorrt.cc DEPS dynload_cuda device_context dynamic_loader) nv_test(test_tensorrt SRCS test_tensorrt.cc DEPS dynload_cuda device_context dynamic_loader)
nv_test(test_tensorrt_engine SRCS test_engine.cc engine.cc DEPS dynload_cuda) nv_test(test_tensorrt_engine SRCS test_engine.cc DEPS dynload_cuda tensorrt_engine)
nv_test(test_io_converter SRCS test_io_converter.cc io_converter.cc DEPS dynload_cuda dynamic_loader lod_tensor)
set(ENGINE_FILE ${CMAKE_CURRENT_SOURCE_DIR}/engine.cc)
add_subdirectory(convert) add_subdirectory(convert)
nv_test(test_tensorrt_op_converter SRCS test_op_converter.cc mul_op.cc conv2d_op.cc DEPS ${FLUID_CORE_MODULES}) nv_test(test_op_converter SRCS test_op_converter.cc mul_op.cc conv2d_op.cc DEPS ${FLUID_CORE_MODULES})
nv_test(test_tensorrt_activation_op SRCS test_activation_op.cc ${ENGINE_FILE} activation_op.cc nv_test(test_trt_activation_op SRCS test_activation_op.cc activation_op.cc
DEPS ${FLUID_CORE_MODULES} activation_op) DEPS ${FLUID_CORE_MODULES} activation_op tensorrt_engine)
nv_test(test_io_converter SRCS test_io_converter.cc io_converter.cc DEPS dynload_cuda dynamic_loader lod_tensor)
...@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. ...@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and See the License for the specific language governing permissions and
limitations under the License. */ limitations under the License. */
#include "paddle/fluid/inference/tensorrt/io_converter.h" #include "paddle/fluid/inference/tensorrt/convert/io_converter.h"
#include <cuda.h> #include <cuda.h>
#include "paddle/fluid/platform/enforce.h" #include "paddle/fluid/platform/enforce.h"
...@@ -50,7 +50,7 @@ class DefaultInputConverter : public EngineInputConverter { ...@@ -50,7 +50,7 @@ class DefaultInputConverter : public EngineInputConverter {
} }
}; };
REGISTER_TENSORRT_INPUT_CONVERTER(mul, DefaultInputConverter); REGISTER_TENSORRT_INPUT_CONVERTER(default, DefaultInputConverter);
} // namespace tensorrt } // namespace tensorrt
} // namespace inference } // namespace inference
......
...@@ -40,7 +40,8 @@ class EngineInputConverter { ...@@ -40,7 +40,8 @@ class EngineInputConverter {
static void Run(const std::string& in_op_type, const LoDTensor& in, void* out, static void Run(const std::string& in_op_type, const LoDTensor& in, void* out,
size_t max_size, cudaStream_t* stream) { size_t max_size, cudaStream_t* stream) {
PADDLE_ENFORCE(stream != nullptr); PADDLE_ENFORCE(stream != nullptr);
auto* converter = Registry<EngineInputConverter>::Lookup(in_op_type); auto* converter = Registry<EngineInputConverter>::Lookup(
in_op_type, "default" /* default_type */);
PADDLE_ENFORCE_NOT_NULL(converter); PADDLE_ENFORCE_NOT_NULL(converter);
converter->SetStream(stream); converter->SetStream(stream);
(*converter)(in, out, max_size); (*converter)(in, out, max_size);
......
...@@ -19,6 +19,7 @@ limitations under the License. */ ...@@ -19,6 +19,7 @@ limitations under the License. */
#include "paddle/fluid/framework/block_desc.h" #include "paddle/fluid/framework/block_desc.h"
#include "paddle/fluid/framework/scope.h" #include "paddle/fluid/framework/scope.h"
#include "paddle/fluid/inference/tensorrt/engine.h" #include "paddle/fluid/inference/tensorrt/engine.h"
#include "paddle/fluid/inference/utils/singleton.h"
namespace paddle { namespace paddle {
namespace inference { namespace inference {
...@@ -32,34 +33,23 @@ class OpConverter { ...@@ -32,34 +33,23 @@ class OpConverter {
OpConverter() {} OpConverter() {}
virtual void operator()(const framework::OpDesc& op) {} virtual void operator()(const framework::OpDesc& op) {}
void Execute(const framework::OpDesc& op, TensorRTEngine* engine) { void Run(const framework::OpDesc& op, TensorRTEngine* engine) {
std::string type = op.Type(); std::string type = op.Type();
auto it = converters_.find(type); auto* it = Registry<OpConverter>::Lookup(type);
PADDLE_ENFORCE(it != converters_.end(), "no OpConverter for optype [%s]", PADDLE_ENFORCE_NOT_NULL(it, "no OpConverter for optype [%s]", type);
type); it->SetEngine(engine);
it->second->SetEngine(engine); (*it)(op);
(*it->second)(op);
}
static OpConverter& Global() {
static auto* x = new OpConverter;
return *x;
}
template <typename T>
void Register(const std::string& key) {
converters_[key] = new T;
} }
// convert fluid op to tensorrt layer // convert fluid op to tensorrt layer
void ConvertOp(const framework::OpDesc& op, TensorRTEngine* engine) { void ConvertOp(const framework::OpDesc& op, TensorRTEngine* engine) {
OpConverter::Global().Execute(op, engine); OpConverter::Run(op, engine);
} }
// convert fluid block to tensorrt network // convert fluid block to tensorrt network
void ConvertBlock(const framework::BlockDesc& block, TensorRTEngine* engine) { void ConvertBlock(const framework::BlockDesc& block, TensorRTEngine* engine) {
for (auto op : block.AllOps()) { for (auto op : block.AllOps()) {
OpConverter::Global().Execute(*op, engine); OpConverter::Run(*op, engine);
} }
} }
...@@ -78,12 +68,12 @@ class OpConverter { ...@@ -78,12 +68,12 @@ class OpConverter {
framework::Scope* scope_{nullptr}; framework::Scope* scope_{nullptr};
}; };
#define REGISTER_TRT_OP_CONVERTER(op_type__, Converter__) \ #define REGISTER_TRT_OP_CONVERTER(op_type__, Converter__) \
struct trt_##op_type__##_converter { \ struct trt_##op_type__##_converter { \
trt_##op_type__##_converter() { \ trt_##op_type__##_converter() { \
OpConverter::Global().Register<Converter__>(#op_type__); \ Registry<OpConverter>::Register<Converter__>(#op_type__); \
} \ } \
}; \ }; \
trt_##op_type__##_converter trt_##op_type__##_converter__; trt_##op_type__##_converter trt_##op_type__##_converter__;
} // namespace tensorrt } // namespace tensorrt
......
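A sketch of how a converter plugs into the shared registry via the macro above; "my_op" and MyOpConverter are illustrative stand-ins for the real converters in this change (for example the activation converter):

```cpp
// Illustrative only: would live next to the other converters under
// paddle/fluid/inference/tensorrt/convert/.
class MyOpConverter : public OpConverter {
 public:
  void operator()(const framework::OpDesc& op) override {
    // Translate the fluid op into TensorRT layers here, using the engine
    // that OpConverter::Run supplies through SetEngine().
  }
};

// Registers MyOpConverter in Registry<OpConverter> under the key "my_op",
// so ConvertOp/ConvertBlock can dispatch on op.Type() == "my_op".
REGISTER_TRT_OP_CONVERTER(my_op, MyOpConverter);
```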
...@@ -26,7 +26,7 @@ namespace paddle { ...@@ -26,7 +26,7 @@ namespace paddle {
namespace inference { namespace inference {
namespace tensorrt { namespace tensorrt {
void compare(float input, float expect) { void Compare(float input, float expect) {
framework::Scope scope; framework::Scope scope;
platform::CUDAPlace place; platform::CUDAPlace place;
platform::CUDADeviceContext ctx(place); platform::CUDADeviceContext ctx(place);
...@@ -85,8 +85,8 @@ void compare(float input, float expect) { ...@@ -85,8 +85,8 @@ void compare(float input, float expect) {
} }
TEST(OpConverter, ConvertRelu) { TEST(OpConverter, ConvertRelu) {
compare(1, 1); // relu(1) = 1 Compare(1, 1); // relu(1) = 1
compare(-5, 0); // relu(-5) = 0 Compare(-5, 0); // relu(-5) = 0
} }
} // namespace tensorrt } // namespace tensorrt
......
...@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and ...@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
limitations under the License. */ limitations under the License. */
#include "paddle/fluid/framework/lod_tensor.h" #include "paddle/fluid/framework/lod_tensor.h"
#include "paddle/fluid/inference/tensorrt/io_converter.h" #include "paddle/fluid/inference/tensorrt/convert/io_converter.h"
#include <gtest/gtest.h> #include <gtest/gtest.h>
...@@ -34,7 +34,7 @@ TEST_F(EngineInputConverterTester, DefaultCPU) { ...@@ -34,7 +34,7 @@ TEST_F(EngineInputConverterTester, DefaultCPU) {
ASSERT_EQ(cudaMalloc(&buffer, tensor.memory_size()), 0); ASSERT_EQ(cudaMalloc(&buffer, tensor.memory_size()), 0);
cudaStream_t stream; cudaStream_t stream;
EngineInputConverter::Run("mul", tensor, buffer, tensor.memory_size(), EngineInputConverter::Run("test", tensor, buffer, tensor.memory_size(),
&stream); &stream);
} }
...@@ -44,7 +44,7 @@ TEST_F(EngineInputConverterTester, DefaultGPU) { ...@@ -44,7 +44,7 @@ TEST_F(EngineInputConverterTester, DefaultGPU) {
ASSERT_EQ(cudaMalloc(&buffer, tensor.memory_size()), 0); ASSERT_EQ(cudaMalloc(&buffer, tensor.memory_size()), 0);
cudaStream_t stream; cudaStream_t stream;
EngineInputConverter::Run("mul", tensor, buffer, tensor.memory_size(), EngineInputConverter::Run("test", tensor, buffer, tensor.memory_size(),
&stream); &stream);
} }
......
...@@ -20,7 +20,7 @@ namespace paddle { ...@@ -20,7 +20,7 @@ namespace paddle {
namespace inference { namespace inference {
namespace tensorrt { namespace tensorrt {
TEST(BlockConverter, ConvertBlock) { TEST(OpConverter, ConvertBlock) {
framework::ProgramDesc prog; framework::ProgramDesc prog;
auto* block = prog.MutableBlock(0); auto* block = prog.MutableBlock(0);
auto* mul_op = block->AppendOp(); auto* mul_op = block->AppendOp();
......
...@@ -30,16 +30,24 @@ void TensorRTEngine::Build(const DescType& paddle_model) { ...@@ -30,16 +30,24 @@ void TensorRTEngine::Build(const DescType& paddle_model) {
} }
void TensorRTEngine::Execute(int batch_size) { void TensorRTEngine::Execute(int batch_size) {
infer_context_->enqueue(batch_size, buffers_.data(), *stream_, nullptr); std::vector<void*> buffers;
for (auto& buf : buffers_) {
PADDLE_ENFORCE_NOT_NULL(buf.buffer, "buffer should be allocated");
PADDLE_ENFORCE_GT(buf.max_size, 0);
PADDLE_ENFORCE(buf.device == DeviceType::GPU);
buffers.push_back(buf.buffer);
}
infer_context_->enqueue(batch_size, buffers.data(), *stream_, nullptr);
cudaStreamSynchronize(*stream_); cudaStreamSynchronize(*stream_);
} }
TensorRTEngine::~TensorRTEngine() { TensorRTEngine::~TensorRTEngine() {
// clean buffer // clean buffer
for (auto& buffer : buffers_) { for (auto& buf : buffers_) {
if (buffer != nullptr) { if (buf.buffer != nullptr) {
PADDLE_ENFORCE_EQ(0, cudaFree(buffer)); PADDLE_ENFORCE_EQ(0, cudaFree(buf.buffer));
buffer = nullptr; buf.buffer = nullptr;
buf.max_size = 0;
} }
} }
} }
...@@ -59,7 +67,7 @@ void TensorRTEngine::FreezeNetwork() { ...@@ -59,7 +67,7 @@ void TensorRTEngine::FreezeNetwork() {
infer_context_.reset(infer_engine_->createExecutionContext()); infer_context_.reset(infer_engine_->createExecutionContext());
// allocate GPU buffers. // allocate GPU buffers.
buffers_.resize(buffer_sizes_.size(), nullptr); buffers_.resize(buffer_sizes_.size());
for (auto& item : buffer_sizes_) { for (auto& item : buffer_sizes_) {
if (item.second == 0) { if (item.second == 0) {
auto slot_offset = infer_engine_->getBindingIndex(item.first.c_str()); auto slot_offset = infer_engine_->getBindingIndex(item.first.c_str());
...@@ -67,7 +75,11 @@ void TensorRTEngine::FreezeNetwork() { ...@@ -67,7 +75,11 @@ void TensorRTEngine::FreezeNetwork() {
infer_engine_->getBindingDataType(slot_offset))] * infer_engine_->getBindingDataType(slot_offset))] *
AccumDims(infer_engine_->getBindingDimensions(slot_offset)); AccumDims(infer_engine_->getBindingDimensions(slot_offset));
} }
PADDLE_ENFORCE_EQ(0, cudaMalloc(&buffer(item.first), item.second)); auto& buf = buffer(item.first);
CHECK(buf.buffer == nullptr); // buffer should be allocated only once.
PADDLE_ENFORCE_EQ(0, cudaMalloc(&buf.buffer, item.second));
buf.size = buf.max_size = item.second;
buf.device = DeviceType::GPU;
} }
} }
...@@ -113,7 +125,7 @@ void TensorRTEngine::DeclareOutput(const std::string& name) { ...@@ -113,7 +125,7 @@ void TensorRTEngine::DeclareOutput(const std::string& name) {
} }
void* TensorRTEngine::GetOutputInGPU(const std::string& name) { void* TensorRTEngine::GetOutputInGPU(const std::string& name) {
return buffer(name); return buffer(name).buffer;
} }
void TensorRTEngine::GetOutputInCPU(const std::string& name, void* dst, void TensorRTEngine::GetOutputInCPU(const std::string& name, void* dst,
...@@ -123,11 +135,13 @@ void TensorRTEngine::GetOutputInCPU(const std::string& name, void* dst, ...@@ -123,11 +135,13 @@ void TensorRTEngine::GetOutputInCPU(const std::string& name, void* dst,
PADDLE_ENFORCE(it != buffer_sizes_.end()); PADDLE_ENFORCE(it != buffer_sizes_.end());
PADDLE_ENFORCE_GT(it->second, 0); PADDLE_ENFORCE_GT(it->second, 0);
PADDLE_ENFORCE_GE(max_size, it->second); PADDLE_ENFORCE_GE(max_size, it->second);
PADDLE_ENFORCE_EQ(0, cudaMemcpyAsync(dst, buffer(name), it->second, auto& buf = buffer(name);
PADDLE_ENFORCE_NOT_NULL(buf.buffer, "buffer should be allocated before");
PADDLE_ENFORCE_EQ(0, cudaMemcpyAsync(dst, buf.buffer, it->second,
cudaMemcpyDeviceToHost, *stream_)); cudaMemcpyDeviceToHost, *stream_));
} }
void*& TensorRTEngine::buffer(const std::string& name) { Buffer& TensorRTEngine::buffer(const std::string& name) {
PADDLE_ENFORCE(infer_engine_ != nullptr, "call FreezeNetwork first."); PADDLE_ENFORCE(infer_engine_ != nullptr, "call FreezeNetwork first.");
auto it = buffer_sizes_.find(name); auto it = buffer_sizes_.find(name);
PADDLE_ENFORCE(it != buffer_sizes_.end()); PADDLE_ENFORCE(it != buffer_sizes_.end());
...@@ -137,10 +151,12 @@ void*& TensorRTEngine::buffer(const std::string& name) { ...@@ -137,10 +151,12 @@ void*& TensorRTEngine::buffer(const std::string& name) {
void TensorRTEngine::SetInputFromCPU(const std::string& name, void* data, void TensorRTEngine::SetInputFromCPU(const std::string& name, void* data,
size_t size) { size_t size) {
void* buf = buffer(name); auto& buf = buffer(name);
cudaMemcpyAsync(buf, data, size, cudaMemcpyHostToDevice, *stream_); PADDLE_ENFORCE_NOT_NULL(buf.buffer);
PADDLE_ENFORCE_EQ( PADDLE_ENFORCE_LE(size, buf.max_size, "buffer is too small");
0, cudaMemcpyAsync(buf, data, size, cudaMemcpyHostToDevice, *stream_)); PADDLE_ENFORCE(buf.device == DeviceType::GPU);
PADDLE_ENFORCE_EQ(0, cudaMemcpyAsync(buf.buffer, data, size,
cudaMemcpyHostToDevice, *stream_));
} }
void TensorRTEngine::SetITensor(const std::string& name, void TensorRTEngine::SetITensor(const std::string& name,
......
...@@ -87,7 +87,9 @@ class TensorRTEngine : public EngineBase { ...@@ -87,7 +87,9 @@ class TensorRTEngine : public EngineBase {
// these memory directly for acceleration, for example, output the converted // these memory directly for acceleration, for example, output the converted
// data directly to the buffer to save data copy overhead. // data directly to the buffer to save data copy overhead.
// NOTE this should be used after calling `FreezeNetwork`. // NOTE this should be used after calling `FreezeNetwork`.
void*& buffer(const std::string& name); Buffer& buffer(const std::string& name) override;
cudaStream_t* stream() { return stream_; }
// Fill an input from CPU memory with name and size. // Fill an input from CPU memory with name and size.
void SetInputFromCPU(const std::string& name, void* data, size_t size); void SetInputFromCPU(const std::string& name, void* data, size_t size);
...@@ -116,7 +118,7 @@ class TensorRTEngine : public EngineBase { ...@@ -116,7 +118,7 @@ class TensorRTEngine : public EngineBase {
cudaStream_t* stream_; cudaStream_t* stream_;
nvinfer1::ILogger& logger_; nvinfer1::ILogger& logger_;
std::vector<void*> buffers_; std::vector<Buffer> buffers_;
// max data size for the buffers. // max data size for the buffers.
std::unordered_map<std::string /*name*/, size_t /*max size*/> buffer_sizes_; std::unordered_map<std::string /*name*/, size_t /*max size*/> buffer_sizes_;
std::unordered_map<std::string /*name*/, nvinfer1::ITensor* /*ITensor*/> std::unordered_map<std::string /*name*/, nvinfer1::ITensor* /*ITensor*/>
......
...@@ -77,6 +77,37 @@ TEST_F(TensorRTEngineTest, add_layer) { ...@@ -77,6 +77,37 @@ TEST_F(TensorRTEngineTest, add_layer) {
ASSERT_EQ(y_cpu, x_v * 2 + 3); ASSERT_EQ(y_cpu, x_v * 2 + 3);
} }
TEST_F(TensorRTEngineTest, add_layer_multi_dim) {
// Weight in CPU memory.
// It seems that TensorRT FC uses col-major: [[1.0, 3.3], [1.1, 4.4]]
// instead of row-major, which is [[1.0, 1.1], [3.3, 4.4]]
float raw_weight[4] = {1.0, 1.1, 3.3, 4.4};
float raw_bias[2] = {1.3, 2.4};
TensorRTEngine::Weight weight(nvinfer1::DataType::kFLOAT, raw_weight, 4);
TensorRTEngine::Weight bias(nvinfer1::DataType::kFLOAT, raw_bias, 2);
auto* x = engine_->DeclareInput("x", nvinfer1::DataType::kFLOAT,
nvinfer1::DimsCHW{1, 2, 1});
auto* fc_layer = TRT_ENGINE_ADD_LAYER(engine_, FullyConnected, *x, 2,
weight.get(), bias.get());
PADDLE_ENFORCE(fc_layer != nullptr);
engine_->DeclareOutput(fc_layer, 0, "y");
engine_->FreezeNetwork();
ASSERT_EQ(engine_->engine()->getNbBindings(), 2);
float x_v[2] = {1.0, 2.0};
engine_->SetInputFromCPU("x", reinterpret_cast<void*>(&x_v),
2 * sizeof(float));
engine_->Execute(1);
LOG(INFO) << "to get output";
float y_cpu[2] = {-1., -1.};
engine_->GetOutputInCPU("y", &y_cpu[0], sizeof(float) * 2);
ASSERT_EQ(y_cpu[0], 4.5);
ASSERT_EQ(y_cpu[1], 14.5);
}
} // namespace tensorrt } // namespace tensorrt
} // namespace inference } // namespace inference
} // namespace paddle } // namespace paddle
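As a sanity check on the asserted values in the new multi-dimension test: reading the raw weight array {1.0, 1.1, 3.3, 4.4} as the 2x2 matrix [[1.0, 1.1], [3.3, 4.4]] and applying the fully connected layer to the input x = (1.0, 2.0) with the given bias reproduces the expected outputs:

$$
y = \begin{bmatrix} 1.0 & 1.1 \\ 3.3 & 4.4 \end{bmatrix}
    \begin{bmatrix} 1.0 \\ 2.0 \end{bmatrix}
  + \begin{bmatrix} 1.3 \\ 2.4 \end{bmatrix}
  = \begin{bmatrix} 3.2 \\ 12.1 \end{bmatrix}
  + \begin{bmatrix} 1.3 \\ 2.4 \end{bmatrix}
  = \begin{bmatrix} 4.5 \\ 14.5 \end{bmatrix}
$$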
...@@ -36,5 +36,5 @@ inference_test(label_semantic_roles) ...@@ -36,5 +36,5 @@ inference_test(label_semantic_roles)
inference_test(recognize_digits ARGS mlp conv) inference_test(recognize_digits ARGS mlp conv)
inference_test(recommender_system) inference_test(recommender_system)
#inference_test(rnn_encoder_decoder) #inference_test(rnn_encoder_decoder)
inference_test(understand_sentiment ARGS conv) #inference_test(understand_sentiment ARGS conv)
inference_test(word2vec) inference_test(word2vec)
...@@ -14,6 +14,7 @@ limitations under the License. */ ...@@ -14,6 +14,7 @@ limitations under the License. */
#pragma once #pragma once
#include <string>
#include <unordered_map> #include <unordered_map>
#include "paddle/fluid/platform/enforce.h" #include "paddle/fluid/platform/enforce.h"
...@@ -49,9 +50,15 @@ struct Registry { ...@@ -49,9 +50,15 @@ struct Registry {
items_[name] = new ItemChild; items_[name] = new ItemChild;
} }
static ItemParent* Lookup(const std::string& name) { static ItemParent* Lookup(const std::string& name,
const std::string& default_name = "") {
auto it = items_.find(name); auto it = items_.find(name);
if (it == items_.end()) return nullptr; if (it == items_.end()) {
if (default_name == "")
return nullptr;
else
return items_.find(default_name)->second;
}
return it->second; return it->second;
} }
......
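A small sketch of the default-fallback Lookup added above. BaseConverter, DefaultConverter and the key "my_op" are illustrative, and the Registry is assumed to be the one in paddle/fluid/inference/utils/singleton.h (namespace paddle::inference):

```cpp
#include <iostream>

#include "paddle/fluid/inference/utils/singleton.h"

struct BaseConverter {
  virtual ~BaseConverter() = default;
};
struct DefaultConverter : public BaseConverter {};

int main() {
  using paddle::inference::Registry;

  // The "default" entry plays the same role as
  // REGISTER_TENSORRT_INPUT_CONVERTER(default, DefaultInputConverter).
  Registry<BaseConverter>::Register<DefaultConverter>("default");

  // Unknown key, no default name: Lookup returns nullptr.
  std::cout << (Registry<BaseConverter>::Lookup("my_op") == nullptr) << "\n";

  // Unknown key with a default name: Lookup falls back to the "default" entry.
  std::cout << (Registry<BaseConverter>::Lookup("my_op", "default") != nullptr)
            << "\n";
  return 0;
}
```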
...@@ -187,7 +187,8 @@ class GemmConvKernel : public framework::OpKernel<T> { ...@@ -187,7 +187,8 @@ class GemmConvKernel : public framework::OpKernel<T> {
// gemm // gemm
Tensor out_slice = out_batch.Slice(g * out_step, (g + 1) * out_step); Tensor out_slice = out_batch.Slice(g * out_step, (g + 1) * out_step);
Tensor filter_slice = filter.Slice(g * out_step, (g + 1) * out_step); Tensor filter_slice = filter.Slice(g * out_step, (g + 1) * out_step);
blas.MatMul(filter_slice, col_matrix, &out_slice); blas.MatMul(filter_slice, false, col_matrix, false, T(1.0), &out_slice,
T(0.0));
} }
} }
} }
...@@ -304,7 +305,8 @@ class GemmConvGradKernel : public framework::OpKernel<T> { ...@@ -304,7 +305,8 @@ class GemmConvGradKernel : public framework::OpKernel<T> {
col_matrix.ShareDataWith(in_grad_slice); col_matrix.ShareDataWith(in_grad_slice);
col_matrix.Resize(col_matrix_shape); col_matrix.Resize(col_matrix_shape);
} }
blas.MatMul(filter_slice, true, out_grad_slice, false, &col_matrix); blas.MatMul(filter_slice, true, out_grad_slice, false, T(1.0),
&col_matrix, T(0.0));
if (is_expand && data_dim == 2U) { if (is_expand && data_dim == 2U) {
col2im(dev_ctx, col, dilations, strides, col2im(dev_ctx, col, dilations, strides,
...@@ -351,8 +353,8 @@ class GemmConvGradKernel : public framework::OpKernel<T> { ...@@ -351,8 +353,8 @@ class GemmConvGradKernel : public framework::OpKernel<T> {
// gemm // gemm
Tensor filter_grad_slice = Tensor filter_grad_slice =
filter_grad_.Slice(g * out_step, (g + 1) * out_step); filter_grad_.Slice(g * out_step, (g + 1) * out_step);
blas.MatMul(out_grad_slice, false, col_matrix, true, blas.MatMul(out_grad_slice, false, col_matrix, true, T(1.0),
&filter_grad_slice); &filter_grad_slice, T(1.0));
} }
} }
} }
......
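The extended MatMul overload used here makes the GEMM scaling factors explicit. Assuming it follows the usual GEMM convention with the arguments ordered as (A, trans_a, B, trans_b, alpha, C, beta), each call computes

$$
C \leftarrow \alpha \,\mathrm{op}(A)\,\mathrm{op}(B) + \beta\, C,
\qquad \mathrm{op}(X) \in \{X,\ X^{T}\},
$$

so beta = 0.0 keeps the old overwrite behaviour for the forward output and the input gradient, while beta = 1.0 on the filter-gradient slice accumulates contributions across the loop iterations instead of overwriting them.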
...@@ -135,7 +135,8 @@ class GemmConvTransposeKernel : public framework::OpKernel<T> { ...@@ -135,7 +135,8 @@ class GemmConvTransposeKernel : public framework::OpKernel<T> {
// col_matrix = filter * input_batch // col_matrix = filter * input_batch
// of shape (c * k_h * k_w, h * w) or (c * k_d * k_h * k_w, d * h * w) // of shape (c * k_h * k_w, h * w) or (c * k_d * k_h * k_w, d * h * w)
blas.MatMul(filter, true, input_batch, false, &col_matrix); blas.MatMul(filter, true, input_batch, false, static_cast<T>(1.0),
&col_matrix, static_cast<T>(0.0));
if (data_dim == 2U) { if (data_dim == 2U) {
// col2im: col_matrix -> dy // col2im: col_matrix -> dy
...@@ -267,7 +268,8 @@ class GemmConvTransposeGradKernel : public framework::OpKernel<T> { ...@@ -267,7 +268,8 @@ class GemmConvTransposeGradKernel : public framework::OpKernel<T> {
// or // or
// (m, c * k_d * k_h * k_w) * (c * k_d * k_h * k_w, d * h * w) -> (m, // (m, c * k_d * k_h * k_w) * (c * k_d * k_h * k_w, d * h * w) -> (m,
// d, h, w) // d, h, w)
blas.MatMul(filter, false, col_matrix, false, &input_grad_batch); blas.MatMul(filter, false, col_matrix, false, static_cast<T>(1.0),
&input_grad_batch, static_cast<T>(0.0));
} }
if (filter_grad) { if (filter_grad) {
// input batch // input batch
...@@ -277,7 +279,8 @@ class GemmConvTransposeGradKernel : public framework::OpKernel<T> { ...@@ -277,7 +279,8 @@ class GemmConvTransposeGradKernel : public framework::OpKernel<T> {
// or // or
// (m, d * h * w) * (d * h * w, c * k_d * k_h * k_w) -> (m, c * k_d * // (m, d * h * w) * (d * h * w, c * k_d * k_h * k_w) -> (m, c * k_d *
// k_h * k_w) // k_h * k_w)
blas.MatMul(in_batch, false, col_matrix, true, &filter_grad_); blas.MatMul(in_batch, false, col_matrix, true, static_cast<T>(1.0),
&filter_grad_, static_cast<T>(1.0));
} }
} }
} }
......
...@@ -32,141 +32,139 @@ namespace paddle { ...@@ -32,141 +32,139 @@ namespace paddle {
namespace operators { namespace operators {
namespace detail { namespace detail {
using VarMsg = sendrecv::VariableMessage;
void GetTensorPayload(framework::Variable* var,
const platform::DeviceContext& ctx, VarMsg* request,
void** payload, size_t* payload_size) {
auto tensor = var->Get<framework::LoDTensor>();
// FIXME(wuyi): data types in send_recv.proto is copied from
// framework.proto
request->set_data_type(
static_cast<VarMsg::Type>(framework::ToDataType(tensor.type())));
for (auto& dim : framework::vectorize(tensor.dims())) {
request->add_dims(dim);
}
const framework::LoD lod = tensor.lod();
if (lod.size() > 0) {
request->set_lod_level(lod.size());
for (auto& each : lod) {
VarMsg::LodData* lod_inner = request->add_lod();
for (auto& d : each) {
lod_inner->add_lod_data(d);
}
}
}
if (platform::is_gpu_place(ctx.GetPlace())) {
#ifdef PADDLE_WITH_CUDA
PADDLE_ENFORCE(platform::is_gpu_place(tensor.place()));
platform::CPUPlace cpu;
auto& gpu_dev_ctx = static_cast<const platform::CUDADeviceContext&>(ctx);
auto copy_size = tensor.numel() * framework::SizeOfType(tensor.type());
*payload = memory::Alloc(cpu, copy_size);
memory::Copy(cpu, *payload, boost::get<platform::CUDAPlace>(tensor.place()),
reinterpret_cast<const void*>(tensor.data<void>()), copy_size,
gpu_dev_ctx.stream());
ctx.Wait();
#endif
} else {
*payload = tensor.data<void>();
}
*payload_size = tensor.numel() * framework::SizeOfType(tensor.type());
}
void GetSelectedRowsPayload(framework::Variable* var,
const platform::DeviceContext& ctx, VarMsg* request,
void** payload, size_t* payload_size) {
auto* slr = var->GetMutable<framework::SelectedRows>();
request->set_data_type(
static_cast<VarMsg::Type>(framework::ToDataType(slr->value().type())));
request->set_lod_level(0);
request->set_slr_height(slr->height());
for (auto& dim : framework::vectorize(slr->value().dims())) {
request->add_dims(dim);
}
auto* tensor = slr->mutable_value();
if (platform::is_gpu_place(ctx.GetPlace())) {
#ifdef PADDLE_WITH_CUDA
platform::CPUPlace cpu;
auto& gpu_dev_ctx = static_cast<const platform::CUDADeviceContext&>(ctx);
auto copy_size = tensor->numel() * framework::SizeOfType(tensor->type());
*payload = memory::Alloc(cpu, copy_size);
memory::Copy(cpu, *payload,
boost::get<platform::CUDAPlace>(tensor->place()),
reinterpret_cast<const void*>(tensor->data<void>()), copy_size,
gpu_dev_ctx.stream());
ctx.Wait();
#endif
} else {
*payload = slr->mutable_value()->data<void>();
}
*payload_size = tensor->numel() * framework::SizeOfType(tensor->type());
}
void SerializeToByteBuffer(const std::string& name, framework::Variable* var, void SerializeToByteBuffer(const std::string& name, framework::Variable* var,
const platform::DeviceContext& ctx, const platform::DeviceContext& ctx,
::grpc::ByteBuffer* msg, ::grpc::ByteBuffer* msg,
const std::string& out_name) { const std::string& out_name) {
using VarMsg = sendrecv::VariableMessage; // Default DestroyCallback does nothing. When using GPU,
// When using GPU, need to free the copied CPU buffer // the CPU buffer needs to be freed.
// when the ByteBuffer destroies
// TODO(typhoonzero): add unref here, if we have dependent
// parallelism execution, need to know when to free the tensor.
DestroyCallback destroy_callback = [](void* backing) {}; DestroyCallback destroy_callback = [](void* backing) {};
VarMsg request;
auto buffer = std::unique_ptr<char[]>(new char[1024]);
void* buf = buffer.get();
void* payload = nullptr; void* payload = nullptr;
size_t payload_size = 0; size_t payload_size;
ProtoEncodeHelper e(static_cast<char*>(buf), 1024);
request.set_varname(name);
// Note: normally the profiler is enabled in 1 trainer, hence only // Note: normally the profiler is enabled in 1 trainer, hence only
// 1 trainer returns true for ShouldSendProfileState(). It tells PS // 1 trainer returns true for ShouldSendProfileState(). It tells PS
// servers the trainer's profiling state so that PS can follow the // servers the trainer's profiling state so that PS can follow the
// trainer. // trainer.
if (platform::ShouldSendProfileState()) { request.set_profile(platform::IsProfileEnabled());
e.WriteBool(VarMsg::kProfileFieldNumber, platform::IsProfileEnabled()); if (!out_name.empty()) {
request.set_out_varname(out_name);
} }
e.WriteString(VarMsg::kVarnameFieldNumber, name);
if (var->IsType<framework::LoDTensor>()) { if (var->IsType<framework::LoDTensor>()) {
e.WriteUint64(VarMsg::kTypeFieldNumber, 0); request.set_type(::sendrecv::LOD_TENSOR);
GetTensorPayload(var, ctx, &request, &payload, &payload_size);
} else if (var->IsType<framework::SelectedRows>()) { } else if (var->IsType<framework::SelectedRows>()) {
e.WriteUint64(VarMsg::kTypeFieldNumber, 1); request.set_type(::sendrecv::SELECTED_ROWS);
GetSelectedRowsPayload(var, ctx, &request, &payload, &payload_size);
#ifdef PADDLE_WITH_CUDA #ifdef PADDLE_WITH_CUDA
} else if (var->IsType<ncclUniqueId>()) { } else if (var->IsType<ncclUniqueId>()) {
// NOTE: sendrecv only support RAW type for NCCL_ID request.set_type(::sendrecv::NCCL_ID);
VLOG(3) << "serilizing: setting var type nccl id";
e.WriteUint64(VarMsg::kTypeFieldNumber, 2);
#endif #endif
} else {
PADDLE_THROW("Serialize does not support type: %s",
typeid(var->Type()).name());
} }
if (!out_name.empty()) { if (platform::is_gpu_place(ctx.GetPlace())) {
e.WriteString(VarMsg::kOutVarnameFieldNumber, out_name); // GPU data is copied to CPU buffer when sending,
} // free the buffer when possible.
if (var->IsType<framework::LoDTensor>()) { destroy_callback = [](void* backing) {
// ===========================Tensor==================================
auto tensor = var->Get<framework::LoDTensor>();
e.WriteUint64(VarMsg::kDataTypeFieldNumber,
framework::ToDataType(tensor.type()));
for (auto& dim : framework::vectorize(tensor.dims())) {
e.WriteUint64(VarMsg::kDimsFieldNumber, dim);
}
auto lod = tensor.lod(); // std::vector<Vector<size_t>>
if (lod.size() > 0) {
e.WriteUint64(VarMsg::kLodLevelFieldNumber, lod.size());
for (auto& each : lod) {
e.WriteVarlengthBeginning(VarMsg::kLodFieldNumber,
2 + // tag + varintlength of submessage
1 + // kLodDataFieldNumber
each.size());
// auto copied from GPU
for (auto& d : each) {
e.WriteUint64(VarMsg::LodData::kLodDataFieldNumber, d);
}
}
}
if (platform::is_gpu_place(ctx.GetPlace())) {
#ifdef PADDLE_WITH_CUDA
PADDLE_ENFORCE(platform::is_gpu_place(tensor.place()));
platform::CPUPlace cpu; platform::CPUPlace cpu;
auto& gpu_dev_ctx = static_cast<const platform::CUDADeviceContext&>(ctx); memory::Free(cpu, backing);
auto copy_size = tensor.numel() * framework::SizeOfType(tensor.type()); };
payload = memory::Alloc(cpu, copy_size); }
memory::Copy(cpu, payload,
boost::get<platform::CUDAPlace>(tensor.place()),
reinterpret_cast<const void*>(tensor.data<void>()),
copy_size, gpu_dev_ctx.stream());
ctx.Wait();
destroy_callback = [](void* backing) {
platform::CPUPlace cpu;
memory::Free(cpu, backing);
};
#endif std::string header;
} else { request.AppendToString(&header);
payload = tensor.data<void>(); auto buffer = std::unique_ptr<char[]>(new char[1024]);
} void* buf = buffer.get();
payload_size = tensor.numel() * framework::SizeOfType(tensor.type()); ProtoEncodeHelper e(static_cast<char*>(buf), 1024);
e.WriteVarlengthBeginning(VarMsg::kSerializedFieldNumber, payload_size); e.WriteRawBytes(std::string(header.data(), header.size()));
} else if (var->IsType<framework::SelectedRows>()) { // NCCLID is copied directly to the message, return bytebuffer
// ===========================SELECTED // with only one slice if serializing NCCLID.
// ROWS==================================
// TODO(typhoonzero): selectedrows implement should not use unique_ptr
auto* slr = var->GetMutable<framework::SelectedRows>();
e.WriteUint64(VarMsg::kDataTypeFieldNumber,
framework::ToDataType(slr->value().type()));
for (auto& dim : framework::vectorize(slr->value().dims())) {
e.WriteUint64(VarMsg::kDimsFieldNumber, dim);
}
e.WriteUint64(VarMsg::kLodLevelFieldNumber, 0);
e.WriteUint64(VarMsg::kSlrHeightFieldNumber, slr->height());
auto* tensor = slr->mutable_value();
if (platform::is_gpu_place(ctx.GetPlace())) {
#ifdef PADDLE_WITH_CUDA
platform::CPUPlace cpu;
auto& gpu_dev_ctx = static_cast<const platform::CUDADeviceContext&>(ctx);
auto copy_size = tensor->numel() * framework::SizeOfType(tensor->type());
payload = memory::Alloc(cpu, copy_size);
memory::Copy(cpu, payload,
boost::get<platform::CUDAPlace>(tensor->place()),
reinterpret_cast<const void*>(tensor->data<void>()),
copy_size, gpu_dev_ctx.stream());
ctx.Wait();
destroy_callback = [](void* backing) {
platform::CPUPlace cpu;
memory::Free(cpu, backing);
};
#endif
} else {
payload = slr->mutable_value()->data<void>();
}
payload_size = tensor->numel() * framework::SizeOfType(tensor->type());
e.WriteVarlengthBeginning(VarMsg::kSerializedFieldNumber, payload_size);
#ifdef PADDLE_WITH_CUDA #ifdef PADDLE_WITH_CUDA
} else if (var->IsType<ncclUniqueId>()) { if (var->IsType<ncclUniqueId>()) {
// ===========================NCCL ID==================================
e.WriteVarlengthBeginning(VarMsg::kSerializedFieldNumber, e.WriteVarlengthBeginning(VarMsg::kSerializedFieldNumber,
NCCL_UNIQUE_ID_BYTES); NCCL_UNIQUE_ID_BYTES);
ncclUniqueId* uid = var->GetMutable<ncclUniqueId>(); ncclUniqueId* uid = var->GetMutable<ncclUniqueId>();
e.WriteRawBytes(std::string(uid->internal, NCCL_UNIQUE_ID_BYTES)); e.WriteRawBytes(std::string(uid->internal, NCCL_UNIQUE_ID_BYTES));
#endif
} else {
PADDLE_THROW("Serialize does not support type: %s",
typeid(var->Type()).name());
}
#ifdef PADDLE_WITH_CUDA
if (var->IsType<ncclUniqueId>()) {
// for serialize NCCL_ID // for serialize NCCL_ID
::grpc::Slice slices(e.size()); ::grpc::Slice slices(e.size());
memcpy(const_cast<uint8_t*>(slices.begin()), e.data(), e.size()); memcpy(const_cast<uint8_t*>(slices.begin()), e.data(), e.size());
...@@ -175,6 +173,8 @@ void SerializeToByteBuffer(const std::string& name, framework::Variable* var, ...@@ -175,6 +173,8 @@ void SerializeToByteBuffer(const std::string& name, framework::Variable* var,
return; return;
} }
#endif #endif
e.WriteVarlengthBeginning(VarMsg::kSerializedFieldNumber, payload_size);
// steal reference of tensor data // steal reference of tensor data
::grpc::Slice slices[4]; // metadata, tensor, rows meta, rows ::grpc::Slice slices[4]; // metadata, tensor, rows meta, rows
int num_slices = 2; // only SelectedRows have rows buffer int num_slices = 2; // only SelectedRows have rows buffer
...@@ -185,12 +185,9 @@ void SerializeToByteBuffer(const std::string& name, framework::Variable* var, ...@@ -185,12 +185,9 @@ void SerializeToByteBuffer(const std::string& name, framework::Variable* var,
static_cast<char*>(payload)), static_cast<char*>(payload)),
::grpc::Slice::STEAL_REF); ::grpc::Slice::STEAL_REF);
if (framework::ToVarType(var->Type()) == if (var->IsType<framework::SelectedRows>()) {
framework::proto::VarType_Type_SELECTED_ROWS) {
auto* slr = var->GetMutable<framework::SelectedRows>(); auto* slr = var->GetMutable<framework::SelectedRows>();
ProtoEncodeHelper e2(static_cast<char*>(buf), 128); ProtoEncodeHelper e2(static_cast<char*>(buf), 128);
// NOTE: rows is of type int64_t
size_t rows_memory_size = size_t rows_memory_size =
slr->rows().size() * framework::SizeOfType(typeid(int64_t)); slr->rows().size() * framework::SizeOfType(typeid(int64_t));
e2.WriteVarlengthBeginning(VarMsg::kRowsFieldNumber, rows_memory_size); e2.WriteVarlengthBeginning(VarMsg::kRowsFieldNumber, rows_memory_size);
...@@ -201,10 +198,7 @@ void SerializeToByteBuffer(const std::string& name, framework::Variable* var, ...@@ -201,10 +198,7 @@ void SerializeToByteBuffer(const std::string& name, framework::Variable* var,
grpc_slice_new_with_user_data( grpc_slice_new_with_user_data(
const_cast<void*>( const_cast<void*>(
reinterpret_cast<const void*>(slr->rows().data())), reinterpret_cast<const void*>(slr->rows().data())),
rows_memory_size, rows_memory_size, [](void* backing) {},
[](void* backing) {
// TODO(typhoonzero): add unref here, same as above.
},
const_cast<char*>( const_cast<char*>(
reinterpret_cast<const char*>(slr->rows().data()))), reinterpret_cast<const char*>(slr->rows().data()))),
::grpc::Slice::STEAL_REF); ::grpc::Slice::STEAL_REF);
......
@@ -117,11 +117,11 @@ void RunTestLodTensor(platform::Place place, int from_type = 0) {
   // serialize var to ByteBuffer
   framework::Variable var;
   auto* tensor = var.GetMutable<framework::LoDTensor>();
-  tensor->Resize(framework::make_ddim({4, 8, 4, 2}));
+  tensor->Resize(framework::make_ddim({512, 8, 4, 2}));
   framework::LoD lod;
   lod.push_back(framework::Vector<size_t>({1, 3, 8}));
   tensor->set_lod(lod);
-  int tensor_numel = 4 * 8 * 4 * 2;
+  int tensor_numel = 512 * 8 * 4 * 2;
   platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance();
   auto& ctx = *pool.Get(place);
   tensor->mutable_data<float>(place);
@@ -142,7 +142,7 @@ void RunTestLodTensor(platform::Place place, int from_type = 0) {
   EXPECT_TRUE(varmsg.ParseFromString(tmp));
   EXPECT_EQ(varmsg.varname(), "myvar");
   EXPECT_EQ(varmsg.type(), 0);
-  EXPECT_EQ(varmsg.dims()[0], 4);
+  EXPECT_EQ(varmsg.dims()[0], 512);
   EXPECT_EQ(varmsg.dims()[1], 8);
   EXPECT_EQ(varmsg.dims()[2], 4);
   EXPECT_EQ(varmsg.dims()[3], 2);
......
@@ -213,15 +213,15 @@ bool ParseLodData(::google::protobuf::io::CodedInputStream* input,
   }

   if (wt == WIRETYPE_LENGTH_DELIMITED) {
-    int length = 0;
-    if (!input->ReadVarintSizeAsInt(&length)) {
+    int num_bytes = 0;
+    if (!input->ReadVarintSizeAsInt(&num_bytes)) {
       return tag;
     }

-    for (int i = 0; i < length; i++) {
+    int start_pos = input->CurrentPosition();
+    while (input->CurrentPosition() - start_pos < num_bytes) {
       uint64_t v;
       if (!input->ReadVarint64(&v)) {
-        return false;
+        return tag;
       }
       lod->push_back(v);
     }
@@ -278,8 +278,8 @@ int VariableResponse::Parse(Source* source) {
       break;
     }
     case sendrecv::VariableMessage::kTypeFieldNumber: {
-      uint64_t v;
-      if ((wt != WIRETYPE_VARINT) || !input.ReadVarint64(&v)) {
+      uint32_t v;
+      if ((wt != WIRETYPE_VARINT) || !input.ReadVarint32(&v)) {
        return tag;
      }
@@ -287,8 +287,8 @@ int VariableResponse::Parse(Source* source) {
      break;
    }
    case sendrecv::VariableMessage::kDataTypeFieldNumber: {
-      uint64_t v = 0;
-      if ((wt != WIRETYPE_VARINT) || !input.ReadVarint64(&v)) {
+      uint32_t v = 0;
+      if ((wt != WIRETYPE_VARINT) || !input.ReadVarint32(&v)) {
        return tag;
      }
@@ -308,11 +308,12 @@ int VariableResponse::Parse(Source* source) {
        // packed
        if (wt == WIRETYPE_LENGTH_DELIMITED) {
-          int length = 0;
-          if (!input.ReadVarintSizeAsInt(&length)) {
+          int num_bytes = 0;
+          if (!input.ReadVarintSizeAsInt(&num_bytes)) {
            return tag;
          }
-          for (int i = 0; i < length; i++) {
+          int start_pos = input.CurrentPosition();
+          while (input.CurrentPosition() - start_pos < num_bytes) {
            uint64_t v;
            if (!input.ReadVarint64(&v)) {
              return tag;
@@ -321,7 +322,6 @@ int VariableResponse::Parse(Source* source) {
          }
          break;
        }
        return tag;
      }
      case sendrecv::VariableMessage::kLodLevelFieldNumber: {
@@ -375,9 +375,10 @@ int VariableResponse::Parse(Source* source) {
                         meta_.type() == sendrecv::NCCL_ID) &&
                        meta_.varname() != "",
                    "meta info should be got first!");
-      int length = 0;
+      int num_bytes = 0;
      if (wt != WIRETYPE_LENGTH_DELIMITED ||
-          !ReadVarintSizeAsInt(&input, &length)) {
+          !ReadVarintSizeAsInt(&input, &num_bytes)) {
        return tag;
      }
@@ -387,7 +388,7 @@ int VariableResponse::Parse(Source* source) {
      if (var != nullptr) {
        ncclUniqueId* id = var->GetMutable<ncclUniqueId>();
        if (!ReadRaw(&input, *dev_ctx_, platform::CPUPlace(), id->internal,
-                     length)) {
+                     num_bytes)) {
          return tag;
        }
      }
@@ -401,14 +402,14 @@ int VariableResponse::Parse(Source* source) {
      if (meta_.type() == sendrecv::LOD_TENSOR) {
        PADDLE_ENFORCE(meta_.lod_size() >= 0,
                       "lod info should be got first!");
-        if (!CopyLodTensorData(&input, *dev_ctx_, dims, length)) {
+        if (!CopyLodTensorData(&input, *dev_ctx_, dims, num_bytes)) {
          return tag;
        }
        break;
      }

      if (meta_.type() == sendrecv::SELECTED_ROWS) {
-        if (!CopySelectRowsTensorData(&input, *dev_ctx_, dims, length)) {
+        if (!CopySelectRowsTensorData(&input, *dev_ctx_, dims, num_bytes)) {
          return tag;
        }
        break;
@@ -422,13 +423,13 @@ int VariableResponse::Parse(Source* source) {
                        meta_.varname() != "",
                    "meta info should be got first!");
-      int length = 0;
+      int num_bytes = 0;
      if (wt != WIRETYPE_LENGTH_DELIMITED ||
-          !ReadVarintSizeAsInt(&input, &length)) {
+          !ReadVarintSizeAsInt(&input, &num_bytes)) {
        return tag;
      }

-      if (!CopySelectRowsData(&input, *dev_ctx_, length)) {
+      if (!CopySelectRowsData(&input, *dev_ctx_, num_bytes)) {
        return tag;
      }
      break;
......
@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License. */

 #include <fstream>

+#include "paddle/fluid/framework/data_type_transform.h"
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/platform/device_context.h"
 #include "paddle/fluid/platform/profiler.h"
@@ -47,17 +48,25 @@ class LoadOp : public framework::OperatorBase {
     DeserializeFromStream(fin, tensor, *dev_ctx);

-    if (platform::is_gpu_place(place)) {
-      // copy CPU to GPU
-      framework::LoDTensor cpu_tensor;
-      cpu_tensor.ShareDataWith(*tensor);
-      cpu_tensor.set_lod(tensor->lod());
+    auto load_as_fp16 = Attr<bool>("load_as_fp16");
+    auto in_dtype = framework::ToDataType(tensor->type());
+    auto out_dtype = load_as_fp16 ? framework::proto::VarType::FP16 : in_dtype;

-      // reset tensor
+    if (in_dtype != out_dtype) {
+      // convert to float16 tensor
+      auto in_kernel_type = framework::OpKernelType(in_dtype, place);
+      auto out_kernel_type = framework::OpKernelType(out_dtype, place);
+      framework::LoDTensor fp16_tensor;
+      // copy LoD info to the new tensor
+      fp16_tensor.set_lod(tensor->lod());
+      framework::TransDataType(in_kernel_type, out_kernel_type, *tensor,
+                               &fp16_tensor);
+
+      // reset output tensor
       out_var->Clear();
       tensor = out_var->GetMutable<framework::LoDTensor>();
-      tensor->set_lod(cpu_tensor.lod());
-      TensorCopy(cpu_tensor, place, *dev_ctx, tensor);
+      tensor->set_lod(fp16_tensor.lod());
+      tensor->ShareDataWith(fp16_tensor);
     }
   }
 };
@@ -67,6 +76,13 @@ class LoadOpProtoMaker : public framework::OpProtoAndCheckerMaker {
   LoadOpProtoMaker(OpProto *proto, OpAttrChecker *op_checker)
       : OpProtoAndCheckerMaker(proto, op_checker) {
     AddOutput("Out", "(Tensor) The tensor need to be loaded");
+    AddAttr<bool>(
+        "load_as_fp16",
+        "(boolean, default false)"
+        "If true, the tensor will be first loaded and then "
+        "converted to float16 data type. Otherwise, the tensor will be "
+        "directly loaded without data type conversion.")
+        .SetDefault(false);
     AddAttr<std::string>("file_path",
                          "(string) "
                          "Variable will be loaded from \"file_path\".")
......
@@ -13,10 +13,40 @@
 // limitations under the License.

 #include "paddle/fluid/operators/math/blas.h"
+
+#include <utility>
 namespace paddle {
 namespace operators {
 namespace math {
-// Do nothing. Blas is a header only library.
+MatDescriptor CreateMatrixDescriptor(const framework::DDim &tensor_dim,
+                                     int num_flatten_cols, bool trans) {
PADDLE_ENFORCE_GT(tensor_dim.size(), 1);
MatDescriptor retv;
if (num_flatten_cols > 1) {
auto flatten_dim = framework::flatten_to_2d(tensor_dim, num_flatten_cols);
retv.height_ = flatten_dim[0];
retv.width_ = flatten_dim[1];
} else {
if (tensor_dim.size() == 2) {
retv.height_ = tensor_dim[0];
retv.width_ = tensor_dim[1];
} else {
auto dim_vec = framework::vectorize(tensor_dim);
retv.batch_size_ = 1;
for (size_t i = 0; i < dim_vec.size() - 2; ++i) {
retv.batch_size_ *= dim_vec[i];
}
retv.height_ = dim_vec[dim_vec.size() - 2];
retv.width_ = dim_vec[dim_vec.size() - 1];
retv.stride_ = retv.height_ * retv.width_;
}
}
if (trans) {
std::swap(retv.width_, retv.height_);
}
retv.trans_ = trans;
return retv;
}
 }  // namespace math
 }  // namespace operators
 }  // namespace paddle
@@ -46,6 +46,50 @@ namespace paddle {
 namespace operators {
 namespace math {
/**
* Matrix Descriptor of a memory buffer.
*
 * It is used for Blas::MatMul. The MatMul operator can be batched:
 * if Mat A is [BatchSize, H1, W1] and Mat B is [BatchSize, H2, W2], the result
 * is `batch_size` independent GEMMs. The batched GEMM can be faster, depending
 * on the implementation of the BLAS library. The batch size can also be zero.
 * If either matrix of `matmul` has a batch size, the result is a batched GEMM
 * as well. E.g., if Mat A is [BatchSize, H1, W1] and Mat B is [H2, W2] with
 * W1 == H2, the result matrix will be [BatchSize, H1, W2].
 *
 * The boolean flag, `trans`, describes whether the memory holds the transpose
 * of the matrix. If `trans` is true, the last two dims of the matrix are
 * transposed, i.e., the memory layout is [Width, Height] or
 * [BatchSize, Width, Height].
 *
 * The MatDescriptor is not only the dimension or shape of a matrix; it also
 * contains the layout and stride of the matrix. It is clearer to have a
 * dedicated structure than to reuse `DDim`.
*/
struct MatDescriptor {
int64_t height_;
int64_t width_;
int64_t stride_{0};
int64_t batch_size_{0};
bool trans_;
};
/**
* Create Matrix Descriptor from a tensor dim, num_flatten_cols, and transpose
* flag
*
 * @param tensor_dim: The dimension of the tensor. The rank of this dimension
 * must be larger than 1.
 *
 * @param num_flatten_cols: Reshape a tensor to a matrix. The matrix's first
 * dimension (column length) will be the product of the tensor's first
 * `num_flatten_cols` dimensions. If num_flatten_cols is zero, the first N-2
 * dimensions will form the batch_size of the descriptor.
*
* @param trans: True if the matrix is transposed.
*/
extern MatDescriptor CreateMatrixDescriptor(const framework::DDim& tensor_dim,
int num_flatten_cols, bool trans);
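
// Illustrative usage sketch (not part of the original change; the exact values
// follow the implementation above): with num_flatten_cols == 0 and
// trans == false,
//   auto desc = CreateMatrixDescriptor(framework::make_ddim({2, 3, 4, 5}),
//                                      /*num_flatten_cols=*/0, /*trans=*/false);
//   // desc.batch_size_ == 2 * 3 == 6, desc.height_ == 4, desc.width_ == 5,
//   // desc.stride_ == 20
// while num_flatten_cols == 2 flattens the same tensor to a single [6, 20]
// matrix and leaves batch_size_ at 0.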
 template <typename DeviceContext>
 class Blas {
  public:
@@ -90,6 +134,11 @@ class Blas {
                   int K, T alpha, const T* A, const T* B, T beta, T* C,
                   int batchCount, int64_t strideA, int64_t strideB) const;
template <typename T>
void MatMul(const framework::Tensor& mat_a, const MatDescriptor& dim_a,
const framework::Tensor& mat_b, const MatDescriptor& dim_b,
T alpha, framework::Tensor* mat_out, T beta) const;
  private:
   const DeviceContext& context_;
 };
......
@@ -180,6 +180,31 @@ void Blas<platform::CPUDeviceContext>::BatchedGEMM(
 #endif
 }
template <typename DeviceContext>
template <typename T>
void Blas<DeviceContext>::MatMul(const framework::Tensor &mat_a,
const MatDescriptor &dim_a,
const framework::Tensor &mat_b,
const MatDescriptor &dim_b, T alpha,
framework::Tensor *mat_out, T beta) const {
PADDLE_ENFORCE_EQ(dim_a.width_, dim_b.height_);
CBLAS_TRANSPOSE transA = !dim_a.trans_ ? CblasNoTrans : CblasTrans;
CBLAS_TRANSPOSE transB = !dim_b.trans_ ? CblasNoTrans : CblasTrans;
if (dim_a.batch_size_ == 0 && dim_b.batch_size_ == 0) {
this->template GEMM<T>(transA, transB, dim_a.height_, dim_b.width_,
dim_a.width_, alpha, mat_a.data<T>(),
mat_b.data<T>(), beta, mat_out->data<T>());
} else {
PADDLE_ENFORCE(dim_a.batch_size_ == dim_b.batch_size_ ||
dim_a.batch_size_ == 0 || dim_b.batch_size_ == 0);
this->template BatchedGEMM<T>(
transA, transB, dim_a.height_, dim_b.width_, dim_a.width_, alpha,
mat_a.data<T>(), mat_b.data<T>(), beta, mat_out->data<T>(),
dim_a.batch_size_ == 0 ? dim_b.batch_size_ : dim_a.batch_size_,
dim_a.stride_, dim_b.stride_);
}
}
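
// Illustrative sketch (not part of the original change): given a descriptor
// dim_a built from a [6, 4, 5] tensor (batch_size_ = 6, stride_ = 20) and
// dim_b built from a [5, 7] tensor (batch_size_ = 0, stride_ = 0),
//   blas.MatMul(a, dim_a, b, dim_b, T(1), &out, T(0));
// dispatches to BatchedGEMM with batchCount = 6, strideA = 20 and strideB = 0,
// so the single B matrix is reused for every batch of A; only when both batch
// sizes are zero does it fall back to a single GEMM.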
 }  // namespace math
 }  // namespace operators
 }  // namespace paddle
/* Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <algorithm>
#include <vector>
#include "paddle/fluid/operators/math/blas.h"
namespace paddle {
namespace operators {
namespace math {
// Implements the logic of numpy matmul:
// https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.matmul.html
//
// but allowing also for a, b to be transposed
//
// Both a & b can be 1- to 3-dimensional. Higher rank tensors are not supported
// yet.
template <typename DeviceContext, typename T>
class MatMulFunctor {
public:
void operator()(const DeviceContext& context, const framework::Tensor& a,
bool trans_a, const framework::Tensor& b, bool trans_b,
T alpha, framework::Tensor* out, T beta) {
auto dim_a = a.dims();
auto dim_b = b.dims();
PADDLE_ENFORCE(a.place() == b.place() && b.place() == out->place(),
"Tensors must all be in the same place.");
PADDLE_ENFORCE_GE(dim_a.size(), 1,
"Input tensor a must be at least 1-dimensional.");
PADDLE_ENFORCE_GE(dim_b.size(), 1,
"Input tensor b must be at least 1-dimensional.");
std::vector<int64_t> out_dim;
int64_t batch_count = 1;
if (dim_a.size() > 3) {
PADDLE_ENFORCE(dim_b.size() == dim_a.size(),
"The dimensions of X and Y must be the same, and both of "
"them should be %d-dimensional.",
dim_b.size());
// The first rank-2 dimensions are accumulated on the batch_count, and the
// last two dimensions are used for matrix multiplication.
for (int j = 0; j < dim_a.size() - 2; ++j) {
PADDLE_ENFORCE_EQ(dim_b[j], dim_a[j],
"The %d-th dimension of X and Y must be the same.",
j);
out_dim.push_back(dim_a[j]);
batch_count *= dim_a[j];
}
}
int M = 0, N = 0, kA = 0, kB = 0, batchCountA = 0, batchCountB = 0,
strideA = 0, strideB = 0;
switch (dim_a.size()) {
case 1:
// similar to np.matmul:
// prepend dimension 1 (no transpose) or append dimension 1 (transpose)
M = trans_a ? dim_a[0] : 1;
kA = trans_a ? 1 : dim_a[0];
break;
case 2:
M = trans_a ? dim_a[1] : dim_a[0];
kA = trans_a ? dim_a[0] : dim_a[1];
break;
case 3:
batchCountA = dim_a[0];
M = trans_a ? dim_a[2] : dim_a[1];
kA = trans_a ? dim_a[1] : dim_a[2];
strideA = M * kA;
break;
default:
batchCountA = batch_count;
size_t mat_s = dim_a.size() - 2;
M = trans_a ? dim_a[mat_s + 1] : dim_a[mat_s];
kA = trans_a ? dim_a[mat_s] : dim_a[mat_s + 1];
strideA = M * kA;
}
switch (dim_b.size()) {
case 1:
// similar to np.matmul:
// append dimension 1 (no transpose) or prepend dimension 1 (transpose)
kB = trans_b ? 1 : dim_b[0];
N = trans_b ? dim_b[0] : 1;
break;
case 2:
kB = trans_b ? dim_b[1] : dim_b[0];
N = trans_b ? dim_b[0] : dim_b[1];
break;
case 3:
batchCountB = dim_b[0];
kB = trans_b ? dim_b[2] : dim_b[1];
N = trans_b ? dim_b[1] : dim_b[2];
strideB = kB * N;
break;
default:
batchCountB = batch_count;
size_t mat_s = dim_b.size() - 2;
kB = trans_b ? dim_b[mat_s + 1] : dim_b[mat_s];
N = trans_b ? dim_b[mat_s] : dim_b[mat_s + 1];
strideB = kB * N;
}
PADDLE_ENFORCE_EQ(
kA, kB,
"First matrix's width must be equal with second matrix's height.");
if (batchCountA && batchCountB) {
PADDLE_ENFORCE_EQ(
batchCountA, batchCountB,
"When input tensors a and b are both batched, they must have the "
"same batch dimension.");
}
int batchCount = std::max(batchCountA, batchCountB);
CBLAS_TRANSPOSE transA = (trans_a == false) ? CblasNoTrans : CblasTrans;
CBLAS_TRANSPOSE transB = (trans_b == false) ? CblasNoTrans : CblasTrans;
auto blas = GetBlas<DeviceContext, T>(context);
if (!batchCount) {
// regular matrix multiplication
blas.GEMM(transA, transB, M, N, kA, alpha, a.data<T>(), b.data<T>(), beta,
out->data<T>());
} else {
// batched matrix multiplication
blas.BatchedGEMM(transA, transB, M, N, kA, alpha, a.data<T>(),
b.data<T>(), beta, out->data<T>(), batchCount, strideA,
strideB);
}
}
};
} // namespace math
} // namespace operators
} // namespace paddle
@@ -12,14 +12,257 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */

-#include "paddle/fluid/operators/matmul_op.h"
 #include <algorithm>
+#include <utility>
 #include <vector>
+#include "paddle/fluid/framework/op_registry.h"
+#include "paddle/fluid/operators/detail/safe_ref.h"
+#include "paddle/fluid/operators/math/blas.h"

 namespace paddle {
 namespace operators {
/**
* Get row matrix shape from a vector shape. If the rank of x_dim > 1, the
* original x_dim is returned.
*/
static framework::DDim RowMatrixFromVector(const framework::DDim& x_dim) {
if (x_dim.size() > 1) {
return x_dim;
}
return framework::make_ddim({1, x_dim[0]});
}
/**
 * Get column matrix shape from a vector shape. If the rank of y_dim > 1, the
* original y_dim is returned.
*/
static framework::DDim ColumnMatrixFromVector(const framework::DDim& y_dim) {
if (y_dim.size() > 1) {
return y_dim;
}
return framework::make_ddim({y_dim[0], 1});
}
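
// Illustrative note (not part of the original change): these helpers implement
// the numpy.matmul convention for 1-D operands, e.g. RowMatrixFromVector turns
// a shape-[K] X into [1, K] and ColumnMatrixFromVector turns a shape-[K] Y
// into [K, 1], so a vector-vector matmul reduces to a [1, K] x [K, 1] GEMM.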
template <typename DeviceContext, typename T>
class MatMulKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
auto& x =
detail::Ref(context.Input<framework::Tensor>("X"), "Cannot find X");
auto& y =
detail::Ref(context.Input<framework::Tensor>("Y"), "Cannot find Y");
auto* out = context.Output<framework::Tensor>("Out");
out->mutable_data<T>(context.GetPlace());
auto blas = math::GetBlas<DeviceContext, T>(context);
auto mat_dim_a = math::CreateMatrixDescriptor(
RowMatrixFromVector(x.dims()), 0, context.Attr<bool>("transpose_X"));
auto mat_dim_b = math::CreateMatrixDescriptor(
ColumnMatrixFromVector(y.dims()), 0, context.Attr<bool>("transpose_Y"));
blas.MatMul(x, mat_dim_a, y, mat_dim_b, T(1), out, T(0));
}
};
// Reshape a rank-3 tensor from P x M x N to (P * M) x N.
// Identity op if the tensor is not of rank 3.
static framework::Tensor FoldInitDims(const framework::Tensor& input) {
auto output = input;
auto in_dims = input.dims();
if (in_dims.size() == 3) {
output.Resize({in_dims[0] * in_dims[1], in_dims[2]});
}
return output;
}
// Reshape a rank-3 tensor from P x M x N to M x (P * N).
// (Warning: This requires transposing data and writes into new memory.)
// Identity op if the tensor is not of rank 3.
template <typename DeviceContext, typename T>
static framework::Tensor FoldHeadAndLastDims(const DeviceContext& context,
const framework::Tensor& input) {
auto in_dims = input.dims();
if (in_dims.size() != 3) {
return input;
}
framework::Tensor output;
output.Resize({in_dims[1], in_dims[0], in_dims[2]});
output.mutable_data<T>(context.GetPlace());
std::vector<int> axis = {1, 0, 2};
math::Transpose<DeviceContext, T, 3> trans;
trans(context, input, &output, axis);
output.Resize({in_dims[1], in_dims[0] * in_dims[2]});
return output;
}
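
// Illustrative note (not part of the original change): for an input of shape
// [P, M, N] = [2, 3, 4], FoldInitDims above returns a [6, 4] view without
// copying, while FoldHeadAndLastDims transposes the data to [3, 2, 4] and
// reshapes it to [3, 8], which writes into newly allocated memory.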
/**
* Reshape a tensor to 3-D or 2-D tensor by matrix descriptor.
*
* The shape would be [BatchSize, H, W] or [H, W].
* If transposed, `H,W` will be swapped.
*/
static void ReshapeTensorIntoMatrixSequence(
framework::Tensor* x, const math::MatDescriptor& descriptor) {
int64_t h, w;
h = descriptor.height_;
w = descriptor.width_;
if (descriptor.trans_) {
std::swap(w, h);
}
if (descriptor.batch_size_) {
x->Resize({descriptor.batch_size_, h, w});
} else {
x->Resize({h, w});
}
}
/**
* Reshape the x,y,out tensor to 3-D or 2-D tensor by matrix descriptor
* Out = matmul(x, y)
*
 * This method will first reshape X and Y into their matrix form, and then
 * calculate the shape of Out.
 *
 * Assume X = [BatchSize, H1, W1] and Y = [BatchSize, H2, W2], with W1 == H2;
 * then Out = [BatchSize, H1, W2].
 *
 * If there is no batch size in `X` and `Y`, Out will be [H1, W2].
 * If either `X` or `Y` has a batch size BatchSize, Out will carry that
 * BatchSize as well.
*/
static void ReshapeXYOutIntoMatrixSequence(framework::Tensor* x,
framework::Tensor* y,
framework::Tensor* out, bool trans_x,
bool trans_y) {
auto x_dim = RowMatrixFromVector(x->dims());
auto y_dim = ColumnMatrixFromVector(y->dims());
auto mat_dim_x = math::CreateMatrixDescriptor(x_dim, 0, trans_x);
auto mat_dim_y = math::CreateMatrixDescriptor(y_dim, 0, trans_y);
if (mat_dim_x.batch_size_ == 0 && mat_dim_y.batch_size_ == 0) {
out->Resize({mat_dim_x.height_, mat_dim_y.width_});
} else {
out->Resize({std::max(mat_dim_x.batch_size_, mat_dim_y.batch_size_),
mat_dim_x.height_, mat_dim_y.width_});
}
ReshapeTensorIntoMatrixSequence(x, mat_dim_x);
ReshapeTensorIntoMatrixSequence(y, mat_dim_y);
}
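
// Illustrative example (not part of the original change): with X = [2, 3, 4],
// Y = [4, 5] and no transposes,
//   ReshapeXYOutIntoMatrixSequence(&x, &y, &out, false, false);
// leaves X as [2, 3, 4], reshapes Y to [4, 5] and resizes Out to [2, 3, 5],
// since only X carries a batch size.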
// Using dimensional constraints on matrix multiplication, it is
// straight-forward to check the following table for when X and Y
// are both matrices.
//
// transpose_X | False | True | False | True
// transpose_Y | False | False | True | True
// -----------+----------+----------+----------+-----------
// dX = | dOut Y^T | Y dOut^T | dOut Y | Y^T dOut^T
// dY = | X^T dOut | X dOut | dOut^T X | dOut^T X^T
//
// When X is a vector of size K, we treat it instead as a matrix of shape
// (1, K). Similarly, when Y is a vector of size K, we treat it instead as
// a matrix of shape (K, 1).
//
// When X and Y are both 3-dimensional tensors, the first dimension
// (the batch dimension) can be ignored and the exact same formulas apply
// as for two matrices.
//
// Finally, when, e.g., X is a 3-dimensional tensor but Y is a matrix, we end
// up with formulas like
//
// dY_{ij} = \sum_{p, m} X_{pmi} dOut_{pmj}
//
// To handle this sort of scenario, we reshape X : P x M x K, dOut: P x M x N
// to X: (P * M) x K, dOut: (P * M) x N.
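//
// Illustrative shape check of the first column of the table above (not part of
// the original change): with transpose_X = transpose_Y = false, X: [M, K],
// Y: [K, N] and dOut: [M, N],
//   dX = dOut Y^T -> [M, N] x [N, K] = [M, K]
//   dY = X^T dOut -> [K, M] x [M, N] = [K, N]
// which matches the shapes of X and Y, so no reshaping is needed in the
// two-matrix case.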
template <typename DeviceContext, typename T>
class MatMulGradKernel : public framework::OpKernel<T> {
public:
void MatMul(const framework::ExecutionContext& context,
const framework::Tensor& a, bool trans_a,
const framework::Tensor& b, bool trans_b,
framework::Tensor* out) const {
out->mutable_data<T>(context.GetPlace());
auto blas = math::GetBlas<DeviceContext, T>(context);
auto mat_dim_a = math::CreateMatrixDescriptor(a.dims(), 0, trans_a);
auto mat_dim_b = math::CreateMatrixDescriptor(b.dims(), 0, trans_b);
blas.MatMul(a, mat_dim_a, b, mat_dim_b, T(1), out, T(0));
}
void CalcInputGrad(const framework::ExecutionContext& context,
const framework::Tensor& a, bool trans_a,
bool is_fold_init_dims_a, const framework::Tensor& b,
bool trans_b, bool is_fold_init_dims_b,
framework::Tensor* out) const {
if (out == nullptr) return;
bool need_combine = (a.dims().size() == 3 || b.dims().size() == 3) &&
out->dims().size() == 2;
if (!need_combine) {
MatMul(context, a, trans_a, b, trans_b, out);
} else {
auto& ctx = context.template device_context<DeviceContext>();
MatMul(context, is_fold_init_dims_a
? FoldInitDims(a)
: FoldHeadAndLastDims<DeviceContext, T>(ctx, a),
trans_a, is_fold_init_dims_b
? FoldInitDims(b)
: FoldHeadAndLastDims<DeviceContext, T>(ctx, b),
trans_b, out);
}
}
void Compute(const framework::ExecutionContext& context) const override {
auto x = *context.Input<framework::Tensor>("X");
auto y = *context.Input<framework::Tensor>("Y");
auto dout =
*context.Input<framework::Tensor>(framework::GradVarName("Out"));
auto* dx = context.Output<framework::Tensor>(framework::GradVarName("X"));
auto* dy = context.Output<framework::Tensor>(framework::GradVarName("Y"));
bool transpose_x = context.Attr<bool>("transpose_X");
bool transpose_y = context.Attr<bool>("transpose_Y");
ReshapeXYOutIntoMatrixSequence(&x, &y, &dout, transpose_x, transpose_y);
framework::DDim dx_dims;
if (dx) {
dx_dims = dx->dims();
if (dx_dims != x.dims()) {
dx->Resize(x.dims());
}
}
framework::DDim dy_dims;
if (dy) {
dy_dims = dy->dims();
if (dy_dims != y.dims()) {
dy->Resize(y.dims());
}
}
-using framework::Tensor;
+    if (transpose_x && transpose_y) {
CalcInputGrad(context, y, true, true, dout, true, false, dx);
CalcInputGrad(context, dout, true, true, x, true, false, dy);
} else if (transpose_x) {
CalcInputGrad(context, y, false, false, dout, true, false, dx);
CalcInputGrad(context, x, false, false, dout, false, true, dy);
} else if (transpose_y) {
CalcInputGrad(context, dout, false, false, y, false, true, dx);
CalcInputGrad(context, dout, true, true, x, false, true, dy);
} else {
CalcInputGrad(context, dout, false, false, y, true, false, dx);
CalcInputGrad(context, x, true, true, dout, false, true, dy);
}
if (dx) {
if (dx_dims != x.dims()) {
dx->Resize(dx_dims);
}
}
if (dy) {
if (dy_dims != y.dims()) {
dy->Resize(dy_dims);
}
}
}
};
 class MatMulOp : public framework::OperatorWithKernel {
  public:
@@ -36,121 +279,41 @@ class MatMulOp : public framework::OperatorWithKernel {
     auto dim_x = context->GetInputDim("X");
     auto dim_y = context->GetInputDim("Y");
bool transpose_x = context->Attrs().Get<bool>("transpose_X");
bool transpose_y = context->Attrs().Get<bool>("transpose_Y");
PADDLE_ENFORCE_GE(dim_x.size(), 1,
"Input tensor X must be at least 1-dimensional.");
PADDLE_ENFORCE_GE(dim_y.size(), 1,
"Input tensor Y must be at least 1-dimensional.");
std::vector<int64_t> out_dim;
int64_t batch_count = 1;
if (dim_x.size() > 3) {
PADDLE_ENFORCE_EQ(
dim_y.size(), dim_x.size(),
"The dimensions of X and Y must be the same, and both of "
"them should be %d-dimensional.",
dim_x.size());
// The first rank-2 dimensions are accumulated on the batch_count, and the
// last two dimensions are used for matrix multiplication.
for (int j = 0; j < dim_x.size() - 2; ++j) {
PADDLE_ENFORCE_EQ(dim_y[j], dim_x[j],
"The %d-th dimension of X and Y must be the same.",
j);
out_dim.push_back(dim_x[j]);
batch_count *= dim_x[j];
}
}
int M = 0, N = 0, KX = 0, KY = 0, batchCountX = 0, batchCountY = 0; auto mat_dim_x =
bool remove_initial_dim = false, remove_final_dim = false; math::CreateMatrixDescriptor(RowMatrixFromVector(dim_x), 0,
context->Attrs().Get<bool>("transpose_X"));
switch (dim_x.size()) { auto mat_dim_y =
case 1: math::CreateMatrixDescriptor(ColumnMatrixFromVector(dim_y), 0,
if (transpose_x) { context->Attrs().Get<bool>("transpose_Y"));
M = dim_x[0];
KX = 1;
} else {
M = 1;
KX = dim_x[0];
remove_initial_dim = true;
}
break;
case 2:
M = transpose_x ? dim_x[1] : dim_x[0];
KX = transpose_x ? dim_x[0] : dim_x[1];
break;
case 3:
batchCountX = dim_x[0];
M = transpose_x ? dim_x[2] : dim_x[1];
KX = transpose_x ? dim_x[1] : dim_x[2];
break;
default:
batchCountX = batch_count;
size_t mat_s = dim_x.size() - 2;
M = transpose_x ? dim_x[mat_s + 1] : dim_x[mat_s];
KX = transpose_x ? dim_x[mat_s] : dim_x[mat_s + 1];
break;
}
switch (dim_y.size()) { PADDLE_ENFORCE_EQ(mat_dim_x.width_, mat_dim_y.height_);
case 1: PADDLE_ENFORCE(mat_dim_x.batch_size_ == mat_dim_y.batch_size_ ||
if (transpose_y) { mat_dim_x.batch_size_ == 0 || mat_dim_y.batch_size_ == 0);
N = dim_y[0]; std::vector<int64_t> dim_out;
KY = 1; if (mat_dim_x.batch_size_ != 0) {
} else { dim_out = framework::vectorize(dim_x);
N = 1; dim_out[dim_out.size() - 2] = mat_dim_x.height_;
KY = dim_y[0]; dim_out[dim_out.size() - 1] = mat_dim_y.width_;
remove_final_dim = true; } else if (mat_dim_y.batch_size_ != 0) {
} dim_out = framework::vectorize(dim_y);
break; dim_out[dim_out.size() - 2] = mat_dim_x.height_;
case 2: dim_out[dim_out.size() - 1] = mat_dim_y.width_;
KY = transpose_y ? dim_y[1] : dim_y[0]; } else {
N = transpose_y ? dim_y[0] : dim_y[1]; dim_out = {mat_dim_x.height_, mat_dim_y.width_};
break;
case 3:
batchCountY = dim_y[0];
KY = transpose_y ? dim_y[2] : dim_y[1];
N = transpose_y ? dim_y[1] : dim_y[2];
break;
default:
batchCountY = batch_count;
size_t mat_s = dim_y.size() - 2;
KY = transpose_y ? dim_y[mat_s + 1] : dim_y[mat_s];
N = transpose_y ? dim_y[mat_s] : dim_y[mat_s + 1];
} }
PADDLE_ENFORCE_EQ( if (dim_x.size() == 1 && dim_out[dim_out.size() - 2] == 1) {
KX, KY, std::swap(dim_out[dim_out.size() - 2], dim_out[dim_out.size() - 1]);
"First matrix's width must be equal with second matrix's height."); dim_out.resize(dim_out.size() - 1);
if (batchCountX && batchCountY) {
PADDLE_ENFORCE_EQ(
batchCountX, batchCountY,
"When Input(X) and Input(Y) are both three dimensional, they "
"must have the same batch dimension.");
} }
int batchCount = std::max(batchCountX, batchCountY);
std::vector<int64_t> dim_out; if (dim_y.size() == 1 && dim_out[dim_out.size() - 1] == 1) {
if (batchCount) { dim_out.resize(dim_out.size() - 1);
if (dim_x.size() > 3) {
dim_out.insert(dim_out.begin(), out_dim.begin(), out_dim.end());
} else {
dim_out.push_back(batchCount);
}
} }
if (!remove_initial_dim) {
dim_out.push_back(M); if (dim_out.empty()) {
} dim_out = {1};
if (!remove_final_dim) {
dim_out.push_back(N);
}
if (dim_out.size() == 0) {
// We don't support 0-dimensional Tensors (scalars), so instead
// treat the output as a Tensor of shape (1, ) in this case.
dim_out.push_back(1);
} }
     context->SetOutputDim("Out", framework::make_ddim(dim_out));
     context->ShareLoD("X", /*->*/ "Out");
@@ -233,15 +396,40 @@ class MatMulOpGrad : public framework::OperatorWithKernel {
   }
 };
class MatMulOpGradMaker : public framework::SingleGradOpDescMaker {
public:
using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;
protected:
std::unique_ptr<framework::OpDesc> Apply() const override {
auto* retv = new framework::OpDesc();
retv->SetType("matmul_grad");
retv->SetInput("X", Input("X"));
retv->SetInput("Y", Input("Y"));
retv->SetInput(framework::GradVarName("Out"), OutputGrad("Out"));
retv->SetOutput(framework::GradVarName("X"), InputGrad("X"));
retv->SetOutput(framework::GradVarName("Y"), InputGrad("Y"));
retv->SetAttrMap(Attrs());
return std::unique_ptr<framework::OpDesc>(retv);
}
};
 }  // namespace operators
 }  // namespace paddle

 namespace ops = paddle::operators;
 REGISTER_OPERATOR(matmul, ops::MatMulOp, ops::MatMulOpMaker,
-                  paddle::framework::DefaultGradOpDescMaker<true>);
+                  ops::MatMulOpGradMaker);
 REGISTER_OPERATOR(matmul_grad, ops::MatMulOpGrad);
 REGISTER_OP_CPU_KERNEL(
     matmul, ops::MatMulKernel<paddle::platform::CPUDeviceContext, float>);
 REGISTER_OP_CPU_KERNEL(
     matmul_grad,
     ops::MatMulGradKernel<paddle::platform::CPUDeviceContext, float>);
#ifdef PADDLE_WITH_CUDA
REGISTER_OP_CUDA_KERNEL(
matmul, ops::MatMulKernel<paddle::platform::CUDADeviceContext, float>);
REGISTER_OP_CUDA_KERNEL(
matmul_grad,
ops::MatMulGradKernel<paddle::platform::CUDADeviceContext, float>);
#endif
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/fluid/operators/matmul_op.h"
namespace ops = paddle::operators;
REGISTER_OP_CUDA_KERNEL(
matmul, ops::MatMulKernel<paddle::platform::CUDADeviceContext, float>);
REGISTER_OP_CUDA_KERNEL(
matmul_grad,
ops::MatMulGradKernel<paddle::platform::CUDADeviceContext, float>);
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <algorithm>
#include <functional>
#include <vector>
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/operators/math/math_function.h"
#include "paddle/fluid/operators/math/matmul.h"
namespace paddle {
namespace operators {
namespace matmul_detail {
using Tensor = framework::Tensor;
using DDim = framework::DDim;
using framework::make_ddim;
using framework::vectorize;
template <typename DeviceContext, typename T>
class MatMulKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
const Tensor& x = *context.Input<Tensor>("X");
const Tensor& y = *context.Input<Tensor>("Y");
Tensor* out = context.Output<Tensor>("Out");
out->mutable_data<T>(context.GetPlace());
bool transpose_x = context.Attr<bool>("transpose_X");
bool transpose_y = context.Attr<bool>("transpose_Y");
math::MatMulFunctor<DeviceContext, T>()(
context.template device_context<DeviceContext>(), x, transpose_x, y,
transpose_y, T(1), out, T(0));
}
};
template <typename T>
inline Tensor Reshape(const Tensor& input, const DDim& dims) {
Tensor output;
output.ShareDataWith(input);
output.Resize(dims);
return output;
}
// Reshape a rank-3 tensor from P x M x N to (P * M) x N.
// Identity op if the tensor is not of rank 3.
template <typename T>
Tensor CombineBatchAndM(const Tensor& input) {
Tensor output;
output.ShareDataWith(input);
auto in_dims = input.dims();
if (in_dims.size() == 3) {
std::vector<int64_t> out_dims = {in_dims[0] * in_dims[1], in_dims[2]};
output.Resize(make_ddim(out_dims));
}
return output;
}
// Reshape a rank-3 tensor from P x M x N to M x (P * N).
// (Warning: This requires transposing data and writes into new memory.)
// Identity op if the tensor is not of rank 3.
template <typename DeviceContext, typename T>
Tensor CombineBatchAndN(const DeviceContext& context, const Tensor& input) {
Tensor output;
auto in_dims = input.dims();
if (in_dims.size() == 3) {
output.Resize({in_dims[1], in_dims[0], in_dims[2]});
output.mutable_data<T>(context.GetPlace());
std::vector<int> axis = {1, 0, 2};
math::Transpose<DeviceContext, T, 3> trans;
trans(context, input, &output, axis);
std::vector<int64_t> out_dims = {in_dims[1], in_dims[0] * in_dims[2]};
output.Resize({in_dims[1], in_dims[0] * in_dims[2]});
} else {
output.ShareDataWith(input);
}
return output;
}
// Using dimensional constraints on matrix multiplication, it is
// straight-forward to check the following table for when X and Y
// are both matrices.
//
// transpose_X | False | True | False | True
// transpose_Y | False | False | True | True
// -----------+----------+----------+----------+-----------
// dX = | dOut Y^T | Y dOut^T | dOut Y | Y^T dOut^T
// dY = | X^T dOut | X dOut | dOut^T X | dOut^T X^T
//
// When X is a vector of size K, we treat it instead as a matrix of shape
// (1, K). Similarly, when Y is a vector of size K, we treat it instead as
// a matrix of shape (K, 1).
//
// When X and Y are both 3-dimensional tensors, then the first dimension
// the batch dimension can be ignored and the exact same formulas apply
// as for two matrices.
//
// Finally, when, e.g., X is a 3-dimensional tensor but Y is a matrix, we end
// up with formulas like
//
// dY_{ij} = \sum_{p, m} X_{pmi} dOut_{pmj}
//
// To handle this sort of scenario, we reshape X : P x M x K, dOut: P x M x N
// to X: (P * M) x K, dOut: (P * M) x N.
template <typename DeviceContext, typename T>
class MatMulGradKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
const Tensor& x = *context.Input<Tensor>("X");
const Tensor& y = *context.Input<Tensor>("Y");
const Tensor& dout = *context.Input<Tensor>(framework::GradVarName("Out"));
Tensor* dx = context.Output<Tensor>(framework::GradVarName("X"));
Tensor* dy = context.Output<Tensor>(framework::GradVarName("Y"));
bool transpose_x = context.Attr<bool>("transpose_X");
bool transpose_y = context.Attr<bool>("transpose_Y");
std::vector<int64_t> x_dims = vectorize(x.dims());
std::vector<int64_t> y_dims = vectorize(y.dims());
// If X is a vector, reshape it to a matrix.
if (x_dims.size() == 1) {
x_dims.insert(x_dims.begin(), 1);
}
// If Y is a vector, reshape it to a matrix.
if (y_dims.size() == 1) {
y_dims.push_back(1);
}
int batch_count = 0;
// The first rank-2 dimensions are accumulated on the batch_count, and the
// last two dimensions are used for matrix multiplication.
if (x_dims.size() > 3) {
batch_count = accumulate(x_dims.begin(), x_dims.end() - 2, 1,
std::multiplies<int>());
}
// Fix the dOut dimensions.
int M = 0, N = 0, batchCountX = 0, batchCountY = 0;
switch (x_dims.size()) {
case 2:
M = transpose_x ? x_dims[1] : x_dims[0];
break;
case 3:
batchCountX = x_dims[0];
M = transpose_x ? x_dims[2] : x_dims[1];
break;
default:
batchCountX = batch_count;
size_t mat_s = x_dims.size() - 2;
M = transpose_x ? x_dims[mat_s + 1] : x_dims[mat_s];
}
switch (y_dims.size()) {
case 2:
N = transpose_y ? y_dims[0] : y_dims[1];
break;
case 3:
batchCountY = y_dims[0];
N = transpose_y ? y_dims[1] : y_dims[2];
break;
default:
batchCountY = batch_count;
size_t mat_s = y_dims.size() - 2;
N = transpose_y ? y_dims[mat_s] : y_dims[mat_s + 1];
}
if (batchCountX && batchCountY) {
PADDLE_ENFORCE_EQ(
batchCountX, batchCountY,
"When Input(X) and Input(Y) are both three dimensional, they "
"must have the same batch dimension.");
}
int batchCount = std::max(batchCountX, batchCountY);
std::vector<int64_t> dout_dims = {M, N};
if (batchCount) {
if (x_dims.size() > 3) {
dout_dims.insert(dout_dims.begin(), x_dims.begin(), x_dims.end() - 2);
} else {
dout_dims.insert(dout_dims.begin(), batchCount);
}
}
Tensor X = Reshape<T>(x, make_ddim(x_dims));
Tensor Y = Reshape<T>(y, make_ddim(y_dims));
Tensor dOut = Reshape<T>(dout, make_ddim(dout_dims));
auto& dev_ctx = context.template device_context<DeviceContext>();
if (dx) {
dx->mutable_data<T>(context.GetPlace());
const Tensor& dOut_for_dX =
(x_dims.size() == 2 && y_dims.size() == 3)
? CombineBatchAndN<DeviceContext, T>(dev_ctx, dOut)
: dOut;
if (x_dims.size() == 2 && y_dims.size() == 3) {
Y = transpose_y ? CombineBatchAndM<T>(Y)
: CombineBatchAndN<DeviceContext, T>(dev_ctx, Y);
}
if (transpose_x) {
math::MatMulFunctor<DeviceContext, T>()(
dev_ctx, Y, transpose_y, dOut_for_dX, transpose_x, T(1), dx, T(0));
} else {
math::MatMulFunctor<DeviceContext, T>()(
dev_ctx, dOut_for_dX, transpose_x, Y, !transpose_y, T(1), dx, T(0));
}
}
if (dy) {
dy->mutable_data<T>(context.GetPlace());
const Tensor& dOut_for_dY = (y_dims.size() == 2 && x_dims.size() == 3)
? CombineBatchAndM<T>(dOut)
: dOut;
if (y_dims.size() == 2 && x_dims.size() == 3) {
X = transpose_x ? CombineBatchAndN<DeviceContext, T>(dev_ctx, X)
: CombineBatchAndM<T>(X);
dOut = CombineBatchAndM<T>(dOut);
}
if (transpose_y) {
math::MatMulFunctor<DeviceContext, T>()(
dev_ctx, dOut_for_dY, transpose_y, X, transpose_x, T(1), dy, T(0));
} else {
math::MatMulFunctor<DeviceContext, T>()(
dev_ctx, X, !transpose_x, dOut_for_dY, transpose_y, T(1), dy, T(0));
}
}
}
};
} // namespace matmul_detail
using matmul_detail::MatMulKernel;
using matmul_detail::MatMulGradKernel;
} // namespace operators
} // namespace paddle
@@ -18,6 +18,7 @@ limitations under the License. */
 #include <numeric>
 #include <sstream>
 #include "paddle/fluid/framework/data_type.h"
+#include "paddle/fluid/framework/data_type_transform.h"
 #include "paddle/fluid/framework/framework.pb.h"
 #include "paddle/fluid/framework/lod_tensor.h"
 #include "paddle/fluid/framework/op_registry.h"
@@ -69,6 +70,7 @@ class SaveCombineOp : public framework::OperatorBase {
                const platform::Place &place) const override {
     auto filename = Attr<std::string>("file_path");
     auto overwrite = Attr<bool>("overwrite");
+    auto save_as_fp16 = Attr<bool>("save_as_fp16");

     bool is_present = FileExists(filename);
     if (is_present && !overwrite) {
@@ -100,8 +102,24 @@ class SaveCombineOp : public framework::OperatorBase {
                      inp_var_names[i]);

       auto &tensor = var->Get<framework::LoDTensor>();
-      // Serialize tensor
-      framework::SerializeToStream(fout, tensor, dev_ctx);
+      // Serialize tensors one by one
+
+      // Check types to see if a fp16 transformation is required
+      auto in_dtype = framework::ToDataType(tensor.type());
+      auto out_dtype =
+          save_as_fp16 ? framework::proto::VarType::FP16 : in_dtype;
+
+      if (in_dtype != out_dtype) {
+        auto in_kernel_type = framework::OpKernelType(in_dtype, place);
+        auto out_kernel_type = framework::OpKernelType(out_dtype, place);
+        framework::LoDTensor out;
+        // copy LoD info to the new tensor
+        out.set_lod(tensor.lod());
+        framework::TransDataType(in_kernel_type, out_kernel_type, tensor, &out);
+        framework::SerializeToStream(fout, out, dev_ctx);
+      } else {
+        framework::SerializeToStream(fout, tensor, dev_ctx);
+      }
     }
     fout.close();
   }
@@ -125,6 +143,12 @@ to a file on disk.
                   "(boolean, default true)"
                   "Overwrite the output file if it exists.")
         .SetDefault(true);
+    AddAttr<bool>("save_as_fp16",
+                  "(boolean, default false)"
+                  "If true, the tensor will be converted to float16 data "
+                  "type and then saved. Otherwise, the tensor will be "
+                  "directly saved without data type conversion.")
+        .SetDefault(false);
     AddAttr<std::string>(
         "file_path",
         "(string)"
......
@@ -17,15 +17,17 @@ limitations under the License. */
 #include <vector>
 #include "gtest/gtest.h"
 #include "paddle/fluid/framework/op_registry.h"
+#include "paddle/fluid/platform/float16.h"

 USE_NO_KERNEL_OP(save_combine);
 USE_NO_KERNEL_OP(load_combine);

-int* CreateForSaveCombineOp(int x, int y, const std::vector<int>& lod_info,
-                            std::string var_name,
-                            const paddle::platform::CPUPlace& place,
-                            paddle::framework::Scope* scope,
-                            paddle::framework::LoD* expect_lod) {
+template <typename T, typename U>
+T* CreateForSaveCombineOp(int x, int y, const std::vector<int>& lod_info,
+                          std::string var_name,
+                          const paddle::platform::CPUPlace& place,
+                          paddle::framework::Scope* scope,
+                          paddle::framework::LoD* expect_lod) {
   auto var = scope->Var(var_name);
   auto tensor = var->GetMutable<paddle::framework::LoDTensor>();
   tensor->Resize({x, y});
@@ -34,9 +36,10 @@ int* CreateForSaveCombineOp(int x, int y, const std::vector<int>& lod_info,
     (*expect_lod)[0].push_back(lod_info[i]);
   }
   tensor->set_lod(*expect_lod);
-  int* expect = tensor->mutable_data<int>(place);
+  T* expect = tensor->mutable_data<T>(place);
   for (int64_t i = 0; i < tensor->numel(); ++i) {
-    expect[i] = static_cast<int>(i);
+    expect[i] = static_cast<T>(
+        static_cast<U>(i));  // For FP16, we intend to do float(float16(i))
   }
   return expect;
 }
@@ -48,18 +51,20 @@ paddle::framework::LoDTensor* GeneratePlaceholderBeforeLoad(
   return target;
 }

-int* GetValuesAfterLoadCombineOp(paddle::framework::LoDTensor* target,
-                                 const paddle::framework::Scope& scope,
-                                 paddle::framework::LoD* actual_lod) {
-  int* actual = target->data<int>();
+template <typename T>
+T* GetValuesAfterLoadCombineOp(paddle::framework::LoDTensor* target,
+                               const paddle::framework::Scope& scope,
+                               paddle::framework::LoD* actual_lod) {
+  T* actual = target->data<T>();
   *actual_lod = target->lod();
   return actual;
 }

-void CheckValues(int* expect, int* actual, paddle::framework::LoD expect_lod,
-                 paddle::framework::LoD actual_lod, const int& numel) {
-  for (int64_t i = 0; i < numel; ++i) {
-    EXPECT_EQ(expect[i], actual[i]);
+template <typename T, typename U>
+void CheckValues(T* expect, U* actual, const paddle::framework::LoD& expect_lod,
+                 const paddle::framework::LoD& actual_lod, const int& numel) {
+  for (int i = 0; i < numel; ++i) {
+    EXPECT_EQ(expect[i], static_cast<T>(actual[i]));
   }
   EXPECT_EQ(expect_lod.size(), actual_lod.size());
   for (size_t i = 0; i < expect_lod.size(); ++i) {
@@ -78,26 +83,26 @@ TEST(SaveLoadCombineOp, CPU) {
   std::vector<int> lod1 = {0, 1, 2, 3, 10};
   int numel1 = 100;
   paddle::framework::LoD expect_lod1;
-  int* expect1 = CreateForSaveCombineOp(10, 10, lod1, "test_var1", place,
-                                        &scope, &expect_lod1);
+  int* expect1 = CreateForSaveCombineOp<int, int>(10, 10, lod1, "test_var1",
+                                                  place, &scope, &expect_lod1);

   std::vector<int> lod2 = {0, 2, 5, 10};
   int numel2 = 200;
   paddle::framework::LoD expect_lod2;
-  int* expect2 = CreateForSaveCombineOp(10, 20, lod2, "test_var2", place,
-                                        &scope, &expect_lod2);
+  int* expect2 = CreateForSaveCombineOp<int, int>(10, 20, lod2, "test_var2",
+                                                  place, &scope, &expect_lod2);

   std::vector<int> lod3 = {0, 2, 3, 20};
   int numel3 = 4000;
   paddle::framework::LoD expect_lod3;
-  int* expect3 = CreateForSaveCombineOp(20, 200, lod3, "test_var3", place,
-                                        &scope, &expect_lod3);
+  int* expect3 = CreateForSaveCombineOp<int, int>(20, 200, lod3, "test_var3",
+                                                  place, &scope, &expect_lod3);

   std::vector<int> lod4 = {0, 1, 20};
   int numel4 = 1000;
   paddle::framework::LoD expect_lod4;
-  int* expect4 = CreateForSaveCombineOp(20, 50, lod4, "test_var4", place,
-                                        &scope, &expect_lod4);
+  int* expect4 = CreateForSaveCombineOp<int, int>(20, 50, lod4, "test_var4",
+                                                  place, &scope, &expect_lod4);

   // Set attributes
   std::string filename = "check_tensor.ls";
@@ -123,15 +128,92 @@ TEST(SaveLoadCombineOp, CPU) {
   load_combine_op->Run(scope, place);

   paddle::framework::LoD actual_lod1, actual_lod2, actual_lod3, actual_lod4;
-  int* actual1 = GetValuesAfterLoadCombineOp(target1, scope, &actual_lod1);
-  int* actual2 = GetValuesAfterLoadCombineOp(target2, scope, &actual_lod2);
-  int* actual3 = GetValuesAfterLoadCombineOp(target3, scope, &actual_lod3);
-  int* actual4 = GetValuesAfterLoadCombineOp(target4, scope, &actual_lod4);
+  int* actual1 = GetValuesAfterLoadCombineOp<int>(target1, scope, &actual_lod1);
+  int* actual2 = GetValuesAfterLoadCombineOp<int>(target2, scope, &actual_lod2);
+  int* actual3 = GetValuesAfterLoadCombineOp<int>(target3, scope, &actual_lod3);
+  int* actual4 = GetValuesAfterLoadCombineOp<int>(target4, scope, &actual_lod4);

-  CheckValues(expect1, actual1, expect_lod1, actual_lod1, numel1);
-  CheckValues(expect2, actual2, expect_lod2, actual_lod2, numel2);
-  CheckValues(expect3, actual3, expect_lod3, actual_lod3, numel3);
-  CheckValues(expect4, actual4, expect_lod4, actual_lod4, numel4);
+  CheckValues<int, int>(expect1, actual1, expect_lod1, actual_lod1, numel1);
+  CheckValues<int, int>(expect2, actual2, expect_lod2, actual_lod2, numel2);
+  CheckValues<int, int>(expect3, actual3, expect_lod3, actual_lod3, numel3);
+  CheckValues<int, int>(expect4, actual4, expect_lod4, actual_lod4, numel4);
}
// FP16 version of SaveLoadCombineOp Test
TEST(SaveLoadCombineFP16Op, CPU) {
paddle::framework::Scope scope;
paddle::platform::CPUPlace place;
std::vector<int> lod1 = {0, 1, 2, 3, 10};
int numel1 = 100;
paddle::framework::LoD expect_lod1;
float* expect1 = CreateForSaveCombineOp<float, paddle::platform::float16>(
10, 10, lod1, "test_var1", place, &scope, &expect_lod1);
std::vector<int> lod2 = {0, 2, 5, 10};
int numel2 = 200;
paddle::framework::LoD expect_lod2;
float* expect2 = CreateForSaveCombineOp<float, paddle::platform::float16>(
10, 20, lod2, "test_var2", place, &scope, &expect_lod2);
std::vector<int> lod3 = {0, 20};
int numel3 = 4000;
paddle::framework::LoD expect_lod3;
float* expect3 = CreateForSaveCombineOp<float, paddle::platform::float16>(
20, 200, lod3, "test_var3", place, &scope, &expect_lod3);
std::vector<int> lod4 = {0, 1, 20};
int numel4 = 1000;
paddle::framework::LoD expect_lod4;
float* expect4 = CreateForSaveCombineOp<float, paddle::platform::float16>(
20, 50, lod4, "test_var4", place, &scope, &expect_lod4);
// Set attributes
std::string filename = "check_tensor_fp16.ls";
paddle::framework::AttributeMap attrs;
attrs.insert({"file_path", std::string(filename)});
attrs.insert({"save_as_fp16", true});
// Run the save_combine_op
auto save_combine_op = paddle::framework::OpRegistry::CreateOp(
"save_combine",
{{"X", {"test_var1", "test_var2", "test_var3", "test_var4"}}}, {}, attrs);
save_combine_op->Run(scope, place);
// Set up output vars
auto target1 = GeneratePlaceholderBeforeLoad("out_var1", &scope);
auto target2 = GeneratePlaceholderBeforeLoad("out_var2", &scope);
auto target3 = GeneratePlaceholderBeforeLoad("out_var3", &scope);
auto target4 = GeneratePlaceholderBeforeLoad("out_var4", &scope);
// Run the load_combine_op
auto load_combine_op = paddle::framework::OpRegistry::CreateOp(
"load_combine", {},
{{"Out", {"out_var1", "out_var2", "out_var3", "out_var4"}}}, attrs);
load_combine_op->Run(scope, place);
paddle::framework::LoD actual_lod1, actual_lod2, actual_lod3, actual_lod4;
paddle::platform::float16* actual1 =
GetValuesAfterLoadCombineOp<paddle::platform::float16>(target1, scope,
&actual_lod1);
paddle::platform::float16* actual2 =
GetValuesAfterLoadCombineOp<paddle::platform::float16>(target2, scope,
&actual_lod2);
paddle::platform::float16* actual3 =
GetValuesAfterLoadCombineOp<paddle::platform::float16>(target3, scope,
&actual_lod3);
paddle::platform::float16* actual4 =
GetValuesAfterLoadCombineOp<paddle::platform::float16>(target4, scope,
&actual_lod4);
CheckValues<float, paddle::platform::float16>(expect1, actual1, expect_lod1,
actual_lod1, numel1);
CheckValues<float, paddle::platform::float16>(expect2, actual2, expect_lod2,
actual_lod2, numel2);
CheckValues<float, paddle::platform::float16>(expect3, actual3, expect_lod3,
actual_lod3, numel3);
CheckValues<float, paddle::platform::float16>(expect4, actual4, expect_lod4,
actual_lod4, numel4);
} }
// Test with original SaveLoadTest // Test with original SaveLoadTest
...@@ -141,7 +223,7 @@ TEST(SaveLoadTestWithCombineOp, CPU) { ...@@ -141,7 +223,7 @@ TEST(SaveLoadTestWithCombineOp, CPU) {
auto var = scope.Var("test_var"); auto var = scope.Var("test_var");
auto tensor = var->GetMutable<paddle::framework::LoDTensor>(); auto tensor = var->GetMutable<paddle::framework::LoDTensor>();
tensor->Resize({3, 10}); tensor->Resize({3, 4000});
paddle::framework::LoD expect_lod; paddle::framework::LoD expect_lod;
expect_lod.resize(1); expect_lod.resize(1);
expect_lod[0].push_back(0); expect_lod[0].push_back(0);
......
...@@ -63,14 +63,21 @@ TEST(SaveLoadOp, CPU) { ...@@ -63,14 +63,21 @@ TEST(SaveLoadOp, CPU) {
} }
} }
TEST(SaveLoadFP16Op, CPU) { TEST(SaveFP16Op, CPU) {
paddle::framework::Scope scope; paddle::framework::Scope scope;
paddle::platform::CPUPlace place; paddle::platform::CPUPlace place;
auto var = scope.Var("test_var"); auto var = scope.Var("test_var");
auto tensor = var->GetMutable<paddle::framework::LoDTensor>(); auto tensor = var->GetMutable<paddle::framework::LoDTensor>();
tensor->Resize({3, 10}); tensor->Resize({3, 10});
paddle::framework::LoD expect_lod;
expect_lod.resize(1);
expect_lod[0].push_back(0);
expect_lod[0].push_back(1);
expect_lod[0].push_back(2);
expect_lod[0].push_back(3);
tensor->set_lod(expect_lod);
float* expect = tensor->mutable_data<float>(place); float* expect = tensor->mutable_data<float>(place);
for (int64_t i = 0; i < tensor->numel(); ++i) { for (int64_t i = 0; i < tensor->numel(); ++i) {
expect[i] = static_cast<float>(paddle::platform::float16(i)); expect[i] = static_cast<float>(paddle::platform::float16(i));
...@@ -93,4 +100,60 @@ TEST(SaveLoadFP16Op, CPU) { ...@@ -93,4 +100,60 @@ TEST(SaveLoadFP16Op, CPU) {
for (int64_t i = 0; i < tensor->numel(); ++i) { for (int64_t i = 0; i < tensor->numel(); ++i) {
EXPECT_EQ(expect[i], static_cast<float>(actual[i])); EXPECT_EQ(expect[i], static_cast<float>(actual[i]));
} }
auto& actual_lod = target->lod();
EXPECT_EQ(expect_lod.size(), actual_lod.size());
for (size_t i = 0; i < expect_lod.size(); ++i) {
for (size_t j = 0; j < expect_lod[i].size(); ++j) {
EXPECT_EQ(expect_lod[i][j], actual_lod[i][j]);
}
}
}
TEST(LoadFP16Op, CPU) {
paddle::framework::Scope scope;
paddle::platform::CPUPlace place;
auto var = scope.Var("test_var");
auto tensor = var->GetMutable<paddle::framework::LoDTensor>();
tensor->Resize({3, 10});
paddle::framework::LoD expect_lod;
expect_lod.resize(1);
expect_lod[0].push_back(0);
expect_lod[0].push_back(1);
expect_lod[0].push_back(2);
expect_lod[0].push_back(3);
tensor->set_lod(expect_lod);
float* expect = tensor->mutable_data<float>(place);
for (int64_t i = 0; i < tensor->numel(); ++i) {
expect[i] = static_cast<float>(paddle::platform::float16(i));
}
paddle::framework::AttributeMap attrs;
attrs.insert({"file_path", std::string("tensor.save")});
attrs.insert({"load_as_fp16", true});
auto save_op = paddle::framework::OpRegistry::CreateOp(
"save", {{"X", {"test_var"}}}, {}, attrs);
save_op->Run(scope, place);
auto load_var = scope.Var("out_var");
auto load_op = paddle::framework::OpRegistry::CreateOp(
"load", {}, {{"Out", {"out_var"}}}, attrs);
load_op->Run(scope, place);
auto target = load_var->Get<paddle::framework::LoDTensor>();
paddle::platform::float16* actual = target.data<paddle::platform::float16>();
for (int64_t i = 0; i < tensor->numel(); ++i) {
EXPECT_EQ(expect[i], static_cast<float>(actual[i]));
}
auto& actual_lod = target.lod();
EXPECT_EQ(expect_lod.size(), actual_lod.size());
for (size_t i = 0; i < expect_lod.size(); ++i) {
for (size_t j = 0; j < expect_lod[i].size(); ++j) {
EXPECT_EQ(expect_lod[i][j], actual_lod[i][j]);
}
}
} }
...@@ -64,7 +64,7 @@ void StartServer() { ...@@ -64,7 +64,7 @@ void StartServer() {
TEST(SendNcclId, Normal) { TEST(SendNcclId, Normal) {
std::thread server_thread(StartServer); std::thread server_thread(StartServer);
// wait server to start // wait server to start
sleep(2); rpc_service.WaitServerReady();
f::Scope scope; f::Scope scope;
p::CPUPlace place; p::CPUPlace place;
......
...@@ -63,6 +63,7 @@ __device__ T reduceSum(T val, int tid, int len) { ...@@ -63,6 +63,7 @@ __device__ T reduceSum(T val, int tid, int len) {
val += platform::CudaShuffleDownSync(mask, val, offset); val += platform::CudaShuffleDownSync(mask, val, offset);
if (tid < warpSize) shm[tid] = 0; if (tid < warpSize) shm[tid] = 0;
__syncthreads();
if (tid % warpSize == 0) { if (tid % warpSize == 0) {
shm[tid / warpSize] = val; shm[tid / warpSize] = val;
......
...@@ -463,7 +463,7 @@ void SetProfileListener() { ...@@ -463,7 +463,7 @@ void SetProfileListener() {
std::mt19937 rng; std::mt19937 rng;
rng.seed(std::random_device()()); rng.seed(std::random_device()());
std::uniform_int_distribution<std::mt19937::result_type> dist6( std::uniform_int_distribution<std::mt19937::result_type> dist6(
1, std::numeric_limits<int64_t>::max()); 1, std::numeric_limits<std::mt19937::result_type>::max());
profiler_lister_id = dist6(rng); profiler_lister_id = dist6(rng);
} }
int64_t ListenerId() { return profiler_lister_id; } int64_t ListenerId() { return profiler_lister_id; }
......
...@@ -398,7 +398,7 @@ function gen_dockerfile() { ...@@ -398,7 +398,7 @@ function gen_dockerfile() {
cat <<EOF cat <<EOF
======================================== ========================================
Generate /paddle/build/Dockerfile ... Generate ${PADDLE_ROOT}/build/Dockerfile ...
======================================== ========================================
EOF EOF
...@@ -422,7 +422,7 @@ EOF ...@@ -422,7 +422,7 @@ EOF
CMD='"true"' CMD='"true"'
fi fi
cat >> /paddle/build/Dockerfile <<EOF cat >> ${PADDLE_ROOT}/build/Dockerfile <<EOF
ADD python/dist/*.whl / ADD python/dist/*.whl /
# run paddle version to install python packages first # run paddle version to install python packages first
RUN apt-get update &&\ RUN apt-get update &&\
...@@ -436,8 +436,14 @@ EOF ...@@ -436,8 +436,14 @@ EOF
${DOCKERFILE_CUDNN_DSO} ${DOCKERFILE_CUDNN_DSO}
${DOCKERFILE_GPU_ENV} ${DOCKERFILE_GPU_ENV}
ENV NCCL_LAUNCH_MODE PARALLEL ENV NCCL_LAUNCH_MODE PARALLEL
ADD go/cmd/pserver/pserver /usr/bin/ EOF
ADD go/cmd/master/master /usr/bin/ if [[ ${WITH_GOLANG:-OFF} == "ON" ]]; then
cat >> ${PADDLE_ROOT}/build/Dockerfile <<EOF
ADD go/cmd/pserver/pserver /usr/bin/
ADD go/cmd/master/master /usr/bin/
EOF
fi
cat >> ${PADDLE_ROOT}/build/Dockerfile <<EOF
# default command shows the paddle version and exit # default command shows the paddle version and exit
CMD [${CMD}] CMD [${CMD}]
EOF EOF
......
...@@ -32,7 +32,7 @@ function start_build_docker() { ...@@ -32,7 +32,7 @@ function start_build_docker() {
DOCKER_ENV=$(cat <<EOL DOCKER_ENV=$(cat <<EOL
-e FLAGS_fraction_of_gpu_memory_to_use=0.15 \ -e FLAGS_fraction_of_gpu_memory_to_use=0.15 \
-e CTEST_OUTPUT_ON_FAILURE=1 \ -e CTEST_OUTPUT_ON_FAILURE=1 \
-e CTEST_PARALLEL_LEVEL=5 \ -e CTEST_PARALLEL_LEVEL=1 \
-e APT_MIRROR=${apt_mirror} \ -e APT_MIRROR=${apt_mirror} \
-e WITH_GPU=ON \ -e WITH_GPU=ON \
-e CUDA_ARCH_NAME=Auto \ -e CUDA_ARCH_NAME=Auto \
......
...@@ -96,7 +96,7 @@ def __get_dict_size(src_dict_size, trg_dict_size, src_lang): ...@@ -96,7 +96,7 @@ def __get_dict_size(src_dict_size, trg_dict_size, src_lang):
src_dict_size = min(src_dict_size, (TOTAL_EN_WORDS if src_lang == "en" else src_dict_size = min(src_dict_size, (TOTAL_EN_WORDS if src_lang == "en" else
TOTAL_DE_WORDS)) TOTAL_DE_WORDS))
trg_dict_size = min(trg_dict_size, (TOTAL_DE_WORDS if src_lang == "en" else trg_dict_size = min(trg_dict_size, (TOTAL_DE_WORDS if src_lang == "en" else
TOTAL_ENG_WORDS)) TOTAL_EN_WORDS))
return src_dict_size, trg_dict_size return src_dict_size, trg_dict_size
......
...@@ -299,14 +299,18 @@ class Executor(object): ...@@ -299,14 +299,18 @@ class Executor(object):
if feed is None: if feed is None:
feed = {} feed = {}
if not isinstance(feed, dict): if not isinstance(feed, dict):
raise TypeError("feed should be a map") raise TypeError(
"feed requires dict as its Parameter. But you passed in %s" %
(type(feed)))
if fetch_list is None: if fetch_list is None:
fetch_list = [] fetch_list = []
if program is None: if program is None:
program = default_main_program() program = default_main_program()
if not isinstance(program, Program): if not isinstance(program, Program):
raise TypeError() raise TypeError(
"Executor requires Program as its Parameter. But you passed in %s"
% (type(program)))
if scope is None: if scope is None:
scope = global_scope() scope = global_scope()
......
...@@ -47,6 +47,8 @@ class Optimizer(object): ...@@ -47,6 +47,8 @@ class Optimizer(object):
raise TypeError("learning rate should be float or Variable") raise TypeError("learning rate should be float or Variable")
self.regularization = regularization self.regularization = regularization
self._learning_rate = learning_rate self._learning_rate = learning_rate
# the learning rate type should be inferred from the loss
self._dtype = None
# each program should have an independent learning rate # each program should have an independent learning rate
# program -> Variable(learning_rate) # program -> Variable(learning_rate)
self._learning_rate_map = dict() self._learning_rate_map = dict()
...@@ -77,7 +79,7 @@ class Optimizer(object): ...@@ -77,7 +79,7 @@ class Optimizer(object):
name=unique_name.generate("learning_rate"), name=unique_name.generate("learning_rate"),
shape=[1], shape=[1],
value=float(self._learning_rate), value=float(self._learning_rate),
dtype='float32', dtype='float32' if self._dtype == None else self._dtype,
persistable=True) persistable=True)
def global_learning_rate(self, program=None): def global_learning_rate(self, program=None):
...@@ -200,6 +202,7 @@ class Optimizer(object): ...@@ -200,6 +202,7 @@ class Optimizer(object):
# Create any accumulators # Create any accumulators
program = loss.block.program program = loss.block.program
self._dtype = loss.dtype
with program_guard(program, startup_program): with program_guard(program, startup_program):
global_block = framework.default_main_program().global_block() global_block = framework.default_main_program().global_block()
start = len(global_block.ops) start = len(global_block.ops)
...@@ -391,7 +394,7 @@ class AdamOptimizer(Optimizer): ...@@ -391,7 +394,7 @@ class AdamOptimizer(Optimizer):
beta_shape = [1] beta_shape = [1]
self._beta1_pow_acc = self.helper.create_global_variable( self._beta1_pow_acc = self.helper.create_global_variable(
name=unique_name.generate('beta1_pow_acc'), name=unique_name.generate('beta1_pow_acc'),
dtype='float32', dtype='float32' if self._dtype == None else self._dtype,
shape=beta_shape, shape=beta_shape,
lod_level=0, lod_level=0,
persistable=True) persistable=True)
...@@ -400,7 +403,7 @@ class AdamOptimizer(Optimizer): ...@@ -400,7 +403,7 @@ class AdamOptimizer(Optimizer):
self._beta2_pow_acc = self.helper.create_global_variable( self._beta2_pow_acc = self.helper.create_global_variable(
name=unique_name.generate('beta2_pow_acc'), name=unique_name.generate('beta2_pow_acc'),
dtype='float32', dtype='float32' if self._dtype == None else self._dtype,
shape=beta_shape, shape=beta_shape,
lod_level=0, lod_level=0,
persistable=True) persistable=True)
...@@ -493,7 +496,7 @@ class AdamaxOptimizer(Optimizer): ...@@ -493,7 +496,7 @@ class AdamaxOptimizer(Optimizer):
beta_shape = [1] beta_shape = [1]
self._beta1_pow_acc = self.helper.create_global_variable( self._beta1_pow_acc = self.helper.create_global_variable(
name=unique_name.generate('beta1_pow_acc'), name=unique_name.generate('beta1_pow_acc'),
dtype='float32', dtype='float32' if self._dtype == None else self._dtype,
shape=beta_shape, shape=beta_shape,
lod_level=0, lod_level=0,
persistable=True) persistable=True)
...@@ -900,8 +903,10 @@ class ModelAverage(Optimizer): ...@@ -900,8 +903,10 @@ class ModelAverage(Optimizer):
# param = (sum_1 + sum_2 + sum_3) / (num_accumulates + old_num_accumulates) # param = (sum_1 + sum_2 + sum_3) / (num_accumulates + old_num_accumulates)
tmp = layers.sum(x=[num_accumulates, old_num_accumulates]) tmp = layers.sum(x=[num_accumulates, old_num_accumulates])
sum = layers.sum(x=[sum_1, sum_2, sum_3]) sum = layers.sum(x=[sum_1, sum_2, sum_3])
tmp = layers.cast(x=tmp, dtype='float32') tmp = layers.cast(
sum = layers.cast(x=sum, dtype='float32') x=tmp, dtype='float32' if self._dtype == None else self._dtype)
sum = layers.cast(
x=sum, dtype='float32' if self._dtype == None else self._dtype)
layers.elementwise_div(x=sum, y=tmp, out=param) layers.elementwise_div(x=sum, y=tmp, out=param)
def _add_average_restore_op(self, block, param_grad): def _add_average_restore_op(self, block, param_grad):
......
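The recurring expression in the optimizer diff above, dtype='float32' if self._dtype == None else self._dtype, implements one idea: the optimizer does not know the parameter dtype when it is constructed, so it records loss.dtype in the optimization pass and reuses it for every scalar it allocates (learning rate, beta accumulators, the ModelAverage casts). Below is a minimal, generic sketch of that pattern; the class and method names are invented for illustration and are not Fluid APIs.

# Toy illustration of "defer the scalar dtype until the loss is seen".
class ToyOptimizer(object):
    def __init__(self, learning_rate):
        self._learning_rate = learning_rate
        self._dtype = None                      # unknown until minimize()

    def _scalar_dtype(self):
        # Fall back to float32 when no loss has been seen yet.
        return 'float32' if self._dtype is None else self._dtype

    def minimize(self, loss_dtype):
        self._dtype = loss_dtype                # e.g. 'float16' for an fp16 loss
        # Every scalar created from here on follows the loss dtype.
        return {'learning_rate': self._learning_rate,
                'dtype': self._scalar_dtype()}

print(ToyOptimizer(0.001).minimize('float16'))  # {'learning_rate': 0.001, 'dtype': 'float16'}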
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import paddle
import paddle.fluid as fluid
import numpy as np
WORD_DICT, VERB_DICT, LABEL_DICT = paddle.dataset.conll05.get_dict()
WORD_DICT_LEN = len(WORD_DICT)
LABEL_DICT_LEN = len(LABEL_DICT)
PRED_DICT_LEN = len(VERB_DICT)
MARK_DICT_LEN = 2
IS_SPARSE = True  # sparse updates for the predicate/mark embedding tables below
def lstm_net():
WORD_DIM = 32
MARK_DIM = 5
HIDDEN_DIM = 512
DEPTH = 8
EMBEDDING_NAME = 'emb'
# Data definitions
word = fluid.layers.data(
name='word_data', shape=[1], dtype='int64', lod_level=1)
predicate = fluid.layers.data(
name='verb_data', shape=[1], dtype='int64', lod_level=1)
ctx_n2 = fluid.layers.data(
name='ctx_n2_data', shape=[1], dtype='int64', lod_level=1)
ctx_n1 = fluid.layers.data(
name='ctx_n1_data', shape=[1], dtype='int64', lod_level=1)
ctx_0 = fluid.layers.data(
name='ctx_0_data', shape=[1], dtype='int64', lod_level=1)
ctx_p1 = fluid.layers.data(
name='ctx_p1_data', shape=[1], dtype='int64', lod_level=1)
ctx_p2 = fluid.layers.data(
name='ctx_p2_data', shape=[1], dtype='int64', lod_level=1)
mark = fluid.layers.data(
name='mark_data', shape=[1], dtype='int64', lod_level=1)
# 8 features
predicate_embedding = fluid.layers.embedding(
input=predicate,
size=[PRED_DICT_LEN, WORD_DIM],
dtype='float32',
is_sparse=IS_SPARSE,
param_attr='vemb')
mark_embedding = fluid.layers.embedding(
input=mark,
size=[MARK_DICT_LEN, MARK_DIM],
dtype='float32',
is_sparse=IS_SPARSE)
word_input = [word, ctx_n2, ctx_n1, ctx_0, ctx_p1, ctx_p2]
emb_layers = [
fluid.layers.embedding(
size=[WORD_DICT_LEN, WORD_DIM],
input=x,
param_attr=fluid.ParamAttr(
name=EMBEDDING_NAME, trainable=False)) for x in word_input
]
emb_layers.append(predicate_embedding)
emb_layers.append(mark_embedding)
hidden_0_layers = [
fluid.layers.fc(input=emb, size=HIDDEN_DIM, act='tanh')
for emb in emb_layers
]
hidden_0 = fluid.layers.sums(input=hidden_0_layers)
lstm_0 = fluid.layers.dynamic_lstm(
input=hidden_0,
size=HIDDEN_DIM,
candidate_activation='relu',
gate_activation='sigmoid',
cell_activation='sigmoid')
# stack L-LSTM and R-LSTM with direct edges
input_tmp = [hidden_0, lstm_0]
for i in range(1, DEPTH):
mix_hidden = fluid.layers.sums(input=[
fluid.layers.fc(input=input_tmp[0], size=HIDDEN_DIM, act='tanh'),
fluid.layers.fc(input=input_tmp[1], size=HIDDEN_DIM, act='tanh')
])
lstm = fluid.layers.dynamic_lstm(
input=mix_hidden,
size=HIDDEN_DIM,
candidate_activation='relu',
gate_activation='sigmoid',
cell_activation='sigmoid',
is_reverse=((i % 2) == 1))
input_tmp = [mix_hidden, lstm]
feature_out = fluid.layers.sums(input=[
fluid.layers.fc(input=input_tmp[0], size=LABEL_DICT_LEN, act='tanh'),
fluid.layers.fc(input=input_tmp[1], size=LABEL_DICT_LEN, act='tanh')
])
return feature_out
def inference_network():
predict = lstm_net()
crf_decode = fluid.layers.crf_decoding(
input=predict, param_attr=fluid.ParamAttr(name='crfw'))
return crf_decode
def train_network():
MIX_HIDDEN_LR = 1e-3
predict = lstm_net()
target = fluid.layers.data(
name='target', shape=[1], dtype='int64', lod_level=1)
crf_cost = fluid.layers.linear_chain_crf(
input=predict,
label=target,
param_attr=fluid.ParamAttr(
name='crfw', learning_rate=MIX_HIDDEN_LR))
avg_cost = fluid.layers.mean(crf_cost)
return avg_cost
def train(use_cuda, save_path):
BATCH_SIZE = 128
EPOCH_NUM = 1
train_reader = paddle.batch(
paddle.reader.shuffle(
paddle.dataset.conll05.train(), buf_size=8192),
batch_size=BATCH_SIZE)
test_reader = paddle.batch(
paddle.dataset.conll05.test(), batch_size=BATCH_SIZE)
def event_handler(event):
if isinstance(event, fluid.EndIteration):
if (event.batch_id % 10) == 0:
avg_cost = trainer.test(reader=test_reader)
print('BatchID {0:04}, Loss {1:2.2}'.format(event.batch_id + 1,
avg_cost))
if avg_cost > 0.01: # Low threshold for speeding up CI
trainer.save_params(save_path)
return
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
sgd_optimizer = fluid.optimizer.SGD(
learning_rate=fluid.layers.exponential_decay(
learning_rate=0.01,
decay_steps=100000,
decay_rate=0.5,
staircase=True))
trainer = fluid.Trainer(train_network, optimizer=sgd_optimizer, place=place)
trainer.train(train_reader, EPOCH_NUM, event_handler=event_handler)
def infer(use_cuda, save_path):
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
inferencer = fluid.Inferencer(
inference_network, param_path=save_path, place=place)
def create_random_lodtensor(lod, place, low, high):
data = np.random.random_integers(low, high,
[lod[-1], 1]).astype("int64")
res = fluid.LoDTensor()
res.set(data, place)
res.set_lod([lod])
return res
# Create an input example
lod = [0, 4, 10]
word = create_random_lodtensor(lod, place, low=0, high=WORD_DICT_LEN - 1)
pred = create_random_lodtensor(lod, place, low=0, high=PRED_DICT_LEN - 1)
ctx_n2 = create_random_lodtensor(lod, place, low=0, high=WORD_DICT_LEN - 1)
ctx_n1 = create_random_lodtensor(lod, place, low=0, high=WORD_DICT_LEN - 1)
ctx_0 = create_random_lodtensor(lod, place, low=0, high=WORD_DICT_LEN - 1)
ctx_p1 = create_random_lodtensor(lod, place, low=0, high=WORD_DICT_LEN - 1)
ctx_p2 = create_random_lodtensor(lod, place, low=0, high=WORD_DICT_LEN - 1)
mark = create_random_lodtensor(lod, place, low=0, high=MARK_DICT_LEN - 1)
results = inferencer.infer({
'word_data': word,
'verb_data': pred,
'ctx_n2_data': ctx_n2,
'ctx_n1_data': ctx_n1,
'ctx_0_data': ctx_0,
'ctx_p1_data': ctx_p1,
'ctx_p2_data': ctx_p2,
'mark_data': mark
})
print("infer results: ", results)
def main(use_cuda):
if use_cuda and not fluid.core.is_compiled_with_cuda():
return
save_path = "label_semantic_roles.inference.model"
train(use_cuda, save_path)
infer(use_cuda, save_path)
if __name__ == '__main__':
for use_cuda in (False, True):
main(use_cuda=use_cuda)
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import argparse
import paddle.fluid as fluid
import paddle
import sys
import numpy
import unittest
import math
import sys
import os
import paddle.v2.dataset as dataset
BATCH_SIZE = 64
def inference_program():
img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
conv_pool_1 = fluid.nets.simple_img_conv_pool(
input=img,
filter_size=5,
num_filters=20,
pool_size=2,
pool_stride=2,
act="relu")
conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
conv_pool_2 = fluid.nets.simple_img_conv_pool(
input=conv_pool_1,
filter_size=5,
num_filters=50,
pool_size=2,
pool_stride=2,
act="relu")
prediction = fluid.layers.fc(input=conv_pool_2, size=10, act='softmax')
return prediction
def train_program():
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
predict = inference_program()
cost = fluid.layers.cross_entropy(input=predict, label=label)
avg_cost = fluid.layers.mean(cost)
acc = fluid.layers.accuracy(input=predict, label=label)
return avg_cost, acc
def train(use_cuda, save_dirname):
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
optimizer = fluid.optimizer.Adam(learning_rate=0.001)
trainer = fluid.Trainer(train_program, place=place, optimizer=optimizer)
def event_handler(event):
if isinstance(event, fluid.EndIteration):
avg_cost, acc = event.values
print("avg_cost: %s" % avg_cost)
print("acc : %s" % acc)
if (event.batch_id + 1) % 10 == 0:
test_metrics = trainer.test(reader=dataset.mnist.test())
avg_cost_set = test_metrics[0]
acc_set = test_metrics[1]
# get test acc and loss
acc = numpy.array(acc_set).mean()
avg_cost = numpy.array(avg_cost_set).mean()
if float(acc) > 0.2: # Smaller value to increase CI speed
trainer.save_params(save_dirname)
else:
print('BatchID {0}, Test Loss {1:0.2}, Acc {2:0.2}'.format(
event.batch_id + 1, float(avg_cost), float(acc)))
if math.isnan(float(avg_cost)):
sys.exit("got NaN loss, training failed.")
trainer.train(
reader=dataset.mnist.train(), num_pass=100, event_handler=event_handler)
def infer(use_cuda, save_dirname=None):
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
inferencer = fluid.Inferencer(
inference_program, param_path=save_dirname, place=place)
batch_size = 1
tensor_img = numpy.random.uniform(-1.0, 1.0,
[batch_size, 1, 28, 28]).astype("float32")
results = inferencer.infer({'img': tensor_img})
print("infer results: ", results[0])
def main(use_cuda):
save_dirname = "recognize_digits_conv.inference.model"
# call train() with is_local argument to run distributed train
train(use_cuda=use_cuda, save_dirname=save_dirname)
infer(use_cuda=use_cuda, save_dirname=save_dirname)
if __name__ == '__main__':
for use_cuda in (False, True):
main(use_cuda=use_cuda)
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import argparse
import paddle.fluid as fluid
import paddle
import sys
import numpy
import unittest
import math
import sys
import os
import paddle.v2.dataset as dataset
BATCH_SIZE = 64
def inference_program():
img = fluid.layers.data(name='img', shape=[1, 28, 28], dtype='float32')
hidden = fluid.layers.fc(input=img, size=200, act='tanh')
hidden = fluid.layers.fc(input=hidden, size=200, act='tanh')
prediction = fluid.layers.fc(input=hidden, size=10, act='softmax')
return prediction
def train_program():
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
predict = inference_program()
cost = fluid.layers.cross_entropy(input=predict, label=label)
avg_cost = fluid.layers.mean(cost)
acc = fluid.layers.accuracy(input=predict, label=label)
return avg_cost, acc
def train(use_cuda, save_dirname):
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
optimizer = fluid.optimizer.Adam(learning_rate=0.001)
trainer = fluid.Trainer(train_program, place=place, optimizer=optimizer)
def event_handler(event):
if isinstance(event, fluid.EndIteration):
avg_cost, acc = event.values
print("avg_cost: %s" % avg_cost)
print("acc : %s" % acc)
if (event.batch_id + 1) % 10 == 0:
test_metrics = trainer.test(reader=dataset.mnist.test())
avg_cost_set = test_metrics[0]
acc_set = test_metrics[1]
# get test acc and loss
acc = numpy.array(acc_set).mean()
avg_cost = numpy.array(avg_cost_set).mean()
if float(acc) > 0.2: # Smaller value to increase CI speed
trainer.save_params(save_dirname)
else:
print('BatchID {0}, Test Loss {1:0.2}, Acc {2:0.2}'.format(
event.batch_id + 1, float(avg_cost), float(acc)))
if math.isnan(float(avg_cost)):
sys.exit("got NaN loss, training failed.")
trainer.train(
reader=dataset.mnist.train(), num_pass=100, event_handler=event_handler)
def infer(use_cuda, save_dirname=None):
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
inferencer = fluid.Inferencer(
inference_program, param_path=save_dirname, place=place)
batch_size = 1
tensor_img = numpy.random.uniform(-1.0, 1.0,
[batch_size, 1, 28, 28]).astype("float32")
results = inferencer.infer({'img': tensor_img})
print("infer results: ", results[0])
def main(use_cuda):
save_dirname = "recognize_digits_mlp.inference.model"
# call train() with is_local argument to run distributed train
train(use_cuda=use_cuda, save_dirname=save_dirname)
infer(use_cuda=use_cuda, save_dirname=save_dirname)
if __name__ == '__main__':
for use_cuda in (False, True):
main(use_cuda=use_cuda)
...@@ -36,7 +36,7 @@ depth = 8 ...@@ -36,7 +36,7 @@ depth = 8
mix_hidden_lr = 1e-3 mix_hidden_lr = 1e-3
IS_SPARSE = True IS_SPARSE = True
PASS_NUM = 100 PASS_NUM = 10
BATCH_SIZE = 10 BATCH_SIZE = 10
embedding_name = 'emb' embedding_name = 'emb'
......
...@@ -111,21 +111,24 @@ class Generator(object): ...@@ -111,21 +111,24 @@ class Generator(object):
# Generate test cases for all possibilities # Generate test cases for all possibilities
for dim_X in [1, 2, 3]: def inject_test(dim_x, dim_y, trans_x, trans_y):
for dim_Y in [1, 2, 3]: test_name = ('TestMatMulOp_dimX_{}_dim_Y_{}_transX_{}_transY_{}'.format(
for transpose_X in [False, True]: dim_x, dim_y, trans_x, trans_y))
for transpose_Y in [False, True]: shape_x, shape_y = generate_compatible_shapes(dim_x, dim_y, trans_x,
test_name = ( trans_y)
'TestMatMulOp_dimX_{}_dim_Y_{}_transX_{}_transY_{}'.format( globals()[test_name] = type(test_name, (Generator, OpTest), {
dim_X, dim_Y, transpose_X, transpose_Y)) 'shape_X': shape_x,
shape_X, shape_Y = generate_compatible_shapes( 'shape_Y': shape_y,
dim_X, dim_Y, transpose_X, transpose_Y) 'transpose_X': trans_x,
globals()[test_name] = type(test_name, (Generator, OpTest), { 'transpose_Y': trans_y,
'shape_X': shape_X, })
'shape_Y': shape_Y,
'transpose_X': transpose_X,
'transpose_Y': transpose_Y, for dim_X in (1, 2, 3):
}) for dim_Y in (1, 2, 3):
for transpose_x in (False, True):
for transpose_y in (False, True):
inject_test(dim_X, dim_Y, transpose_x, transpose_y)
# Test case n-dim # Test case n-dim
...@@ -149,7 +152,7 @@ def generate_compatible_shapes(dim, transpose_X, transpose_Y): ...@@ -149,7 +152,7 @@ def generate_compatible_shapes(dim, transpose_X, transpose_Y):
return shape_X, shape_Y return shape_X, shape_Y
# Test case n-dim # # Test case n-dim
for dim in [4]: for dim in [4]:
for transpose_X in [False, True]: for transpose_X in [False, True]:
for transpose_Y in [False, True]: for transpose_Y in [False, True]:
......
...@@ -18,7 +18,7 @@ import unittest ...@@ -18,7 +18,7 @@ import unittest
import paddle.fluid.layers as layers import paddle.fluid.layers as layers
import paddle.fluid.optimizer as optimizer import paddle.fluid.optimizer as optimizer
from paddle.fluid.framework import Program, program_guard from paddle.fluid.framework import Program, program_guard
from paddle.fluid.memory_optimization_transpiler import memory_optimize from paddle.fluid.transpiler import memory_optimize
class TestControlFlowGraph(unittest.TestCase): class TestControlFlowGraph(unittest.TestCase):
......
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import numpy as np
import paddle
import paddle.fluid as fluid
import paddle.fluid.core as core
from paddle.fluid.executor import Executor
BATCH_SIZE = 20
class TestNetWithDtype(unittest.TestCase):
def setUp(self):
self.dtype = "float64"
self.init_dtype()
self.x = fluid.layers.data(name='x', shape=[13], dtype=self.dtype)
self.y = fluid.layers.data(name='y', shape=[1], dtype=self.dtype)
y_predict = fluid.layers.fc(input=self.x, size=1, act=None)
cost = fluid.layers.square_error_cost(input=y_predict, label=self.y)
avg_cost = fluid.layers.mean(cost)
self.fetch_list = [avg_cost]
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
sgd_optimizer.minimize(avg_cost)
def run_net_on_place(self, place):
train_reader = paddle.batch(
paddle.dataset.uci_housing.train(), batch_size=BATCH_SIZE)
feeder = fluid.DataFeeder(place=place, feed_list=[self.x, self.y])
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
for data in train_reader():
exe.run(fluid.default_main_program(),
feed=feeder.feed(data),
fetch_list=self.fetch_list)
# the main program is runnable, the datatype is fully supported
break
def init_dtype(self):
pass
def test_cpu(self):
place = fluid.CPUPlace()
self.run_net_on_place(place)
def test_gpu(self):
if not core.is_compiled_with_cuda():
return
place = fluid.CUDAPlace(0)
self.run_net_on_place(place)
# TODO(dzhwinter): make sure the fp16 is runnable
# class TestFloat16(SimpleNet):
# def init_dtype(self):
# self.dtype = "float16"
if __name__ == '__main__':
unittest.main()
...@@ -12,7 +12,7 @@ ...@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import numpy import numpy as np
import unittest import unittest
import paddle.fluid as fluid import paddle.fluid as fluid
...@@ -243,7 +243,7 @@ class TestParallelExecutorBase(unittest.TestCase): ...@@ -243,7 +243,7 @@ class TestParallelExecutorBase(unittest.TestCase):
begin = time.time() begin = time.time()
first_loss, = run_executor( first_loss, = run_executor(
exe=exe, feed=feed_dict, fetch_list=[loss.name]) exe=exe, feed=feed_dict, fetch_list=[loss.name])
first_loss = numpy.array(first_loss) first_loss = np.array(first_loss)
for i in xrange(iter): for i in xrange(iter):
run_executor(exe=exe, feed=feed_dict, fetch_list=[]) run_executor(exe=exe, feed=feed_dict, fetch_list=[])
...@@ -256,7 +256,7 @@ class TestParallelExecutorBase(unittest.TestCase): ...@@ -256,7 +256,7 @@ class TestParallelExecutorBase(unittest.TestCase):
print "%.4f Instance per second" % ( print "%.4f Instance per second" % (
(batch_size * iter + 2) / (end - begin)) (batch_size * iter + 2) / (end - begin))
last_loss = numpy.array(last_loss) last_loss = np.array(last_loss)
print first_loss, last_loss print first_loss, last_loss
# self.assertGreater(first_loss[0], last_loss[0]) # self.assertGreater(first_loss[0], last_loss[0])
...@@ -284,8 +284,8 @@ class TestMNIST(TestParallelExecutorBase): ...@@ -284,8 +284,8 @@ class TestMNIST(TestParallelExecutorBase):
self.check_network_convergence(simple_fc_net) self.check_network_convergence(simple_fc_net)
self.check_network_convergence(simple_fc_net, allow_op_delay=True) self.check_network_convergence(simple_fc_net, allow_op_delay=True)
img = numpy.zeros(shape=[32, 784], dtype='float32') img = np.zeros(shape=[32, 784], dtype='float32')
label = numpy.ones(shape=[32, 1], dtype='int64') label = np.ones(shape=[32, 1], dtype='int64')
self.check_network_convergence( self.check_network_convergence(
simple_fc_net, feed_dict={"image": img, simple_fc_net, feed_dict={"image": img,
"label": label}) "label": label})
...@@ -294,8 +294,8 @@ class TestMNIST(TestParallelExecutorBase): ...@@ -294,8 +294,8 @@ class TestMNIST(TestParallelExecutorBase):
self.check_simple_fc_convergence() self.check_simple_fc_convergence()
def check_simple_fc_parallel_accuracy(self): def check_simple_fc_parallel_accuracy(self):
img = numpy.zeros(shape=[32, 784], dtype='float32') img = np.zeros(shape=[32, 784], dtype='float32')
label = numpy.ones(shape=[32, 1], dtype='int64') label = np.ones(shape=[32, 1], dtype='int64')
single_first_loss, single_last_loss = self.check_network_convergence( single_first_loss, single_last_loss = self.check_network_convergence(
method=simple_fc_net, method=simple_fc_net,
seed=1000, seed=1000,
...@@ -319,8 +319,8 @@ class TestMNIST(TestParallelExecutorBase): ...@@ -319,8 +319,8 @@ class TestMNIST(TestParallelExecutorBase):
def check_batchnorm_fc_convergence(self): def check_batchnorm_fc_convergence(self):
self.check_network_convergence(fc_with_batchnorm) self.check_network_convergence(fc_with_batchnorm)
img = numpy.zeros(shape=[32, 784], dtype='float32') img = np.zeros(shape=[32, 784], dtype='float32')
label = numpy.ones(shape=[32, 1], dtype='int64') label = np.ones(shape=[32, 1], dtype='int64')
self.check_network_convergence( self.check_network_convergence(
fc_with_batchnorm, feed_dict={"image": img, fc_with_batchnorm, feed_dict={"image": img,
"label": label}) "label": label})
...@@ -404,9 +404,6 @@ class ModelHyperParams(object): ...@@ -404,9 +404,6 @@ class ModelHyperParams(object):
dropout = 0.1 dropout = 0.1
import numpy as np
def prepare_batch_input(insts, src_pad_idx, trg_pad_idx, n_head): def prepare_batch_input(insts, src_pad_idx, trg_pad_idx, n_head):
""" """
Pad the instances to the max sequence length in batch, and generate the Pad the instances to the max sequence length in batch, and generate the
...@@ -533,9 +530,8 @@ class ParallelExecutorTestingDuringTraining(unittest.TestCase): ...@@ -533,9 +530,8 @@ class ParallelExecutorTestingDuringTraining(unittest.TestCase):
opt.minimize(loss) opt.minimize(loss)
batch_size = 32 batch_size = 32
image = numpy.random.normal(size=(batch_size, image = np.random.normal(size=(batch_size, 784)).astype('float32')
784)).astype('float32') label = np.random.randint(0, 10, (batch_size, 1), dtype="int64")
label = numpy.random.randint(0, 10, (batch_size, 1), dtype="int64")
place = fluid.CUDAPlace(0) place = fluid.CUDAPlace(0)
exe = fluid.Executor(place) exe = fluid.Executor(place)
...@@ -552,12 +548,12 @@ class ParallelExecutorTestingDuringTraining(unittest.TestCase): ...@@ -552,12 +548,12 @@ class ParallelExecutorTestingDuringTraining(unittest.TestCase):
for i in xrange(5): for i in xrange(5):
test_loss, = test_exe.run([loss.name], feed=feed_dict) test_loss, = test_exe.run([loss.name], feed=feed_dict)
test_loss = numpy.array(test_loss) test_loss = np.array(test_loss)
train_loss, = train_exe.run([loss.name], feed=feed_dict) train_loss, = train_exe.run([loss.name], feed=feed_dict)
train_loss = numpy.array(train_loss) train_loss = np.array(train_loss)
self.assertTrue( self.assertTrue(
numpy.allclose( np.allclose(
train_loss, test_loss, atol=1e-8), train_loss, test_loss, atol=1e-8),
"Train loss: " + str(train_loss) + "\n Test loss:" + "Train loss: " + str(train_loss) + "\n Test loss:" +
str(test_loss)) str(test_loss))
...@@ -712,7 +708,7 @@ class TestCRFModel(unittest.TestCase): ...@@ -712,7 +708,7 @@ class TestCRFModel(unittest.TestCase):
data = train_data() data = train_data()
for i in xrange(10): for i in xrange(10):
cur_batch = next(data) cur_batch = next(data)
print map(numpy.array, print map(np.array,
pe.run(feed=feeder.feed(cur_batch), pe.run(feed=feeder.feed(cur_batch),
fetch_list=[avg_cost.name]))[0] fetch_list=[avg_cost.name]))[0]
...@@ -721,3 +717,84 @@ class TestCRFModel(unittest.TestCase): ...@@ -721,3 +717,84 @@ class TestCRFModel(unittest.TestCase):
def test_update_dense_parameter(self): def test_update_dense_parameter(self):
self.check_network_convergence(is_sparse=False) self.check_network_convergence(is_sparse=False)
# test fetch all the variables of global_block
import paddle.dataset.flowers as flowers
import math
def Lenet(data, class_dim):
conv1 = fluid.layers.conv2d(data, 32, 5, 1, act=None)
bn1 = fluid.layers.batch_norm(conv1, act='relu')
pool1 = fluid.layers.pool2d(bn1, 2, 'max', 2)
conv2 = fluid.layers.conv2d(pool1, 50, 5, 1, act=None)
bn2 = fluid.layers.batch_norm(conv2, act='relu')
pool2 = fluid.layers.pool2d(bn2, 2, 'max', 2)
fc1 = fluid.layers.fc(pool2, size=500, act='relu')
fc2 = fluid.layers.fc(fc1, size=class_dim, act='softmax')
return fc2
class TestFetchOp(unittest.TestCase):
def parallel_exe(self, train_inputs, seed):
main = fluid.Program()
startup = fluid.Program()
startup.random_seed = seed
with fluid.program_guard(main, startup):
data = fluid.layers.data(
name='image', shape=[3, 224, 224], dtype='float32')
label = fluid.layers.data(name='label', shape=[1], dtype='int64')
out = Lenet(data, class_dim=102)
loss = fluid.layers.cross_entropy(input=out, label=label)
loss = fluid.layers.mean(loss)
opt = fluid.optimizer.Momentum(
learning_rate=0.1,
momentum=0.9,
regularization=fluid.regularizer.L2Decay(1e-4))
opt.minimize(loss)
# TODO(zcd): I found that once the memory optimizer is enabled,
# parallel_exe doesn't fetch some variables, such as conv2d_0.b_0@GRAD,
# conv2d_1.b_0@GRAD. Those variables should not be pruned.
# fluid.memory_optimize(main)
place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(startup)
feeder = fluid.DataFeeder(place=place, feed_list=[data, label])
pe = fluid.ParallelExecutor(
use_cuda=True, loss_name=loss.name, main_program=main)
fetch_list = []
all_vars = main.global_block().vars
for k, v in all_vars.iteritems():
if ('tmp' not in k and k[0] != '_') or v.persistable:
fetch_list.append(k)
for data in train_inputs:
ret = pe.run(fetch_list, feed=feeder.feed(data))
for i in range(len(fetch_list)):
assert not math.isnan(np.sum(ret[i])) and \
not math.isinf(np.sum(ret[i]))
def test_update_sparse_parameter(self):
tst_reader = paddle.batch(flowers.test(use_xmap=False), batch_size=16)
tst_reader_iter = tst_reader()
iters = 3
train_inputs = []
for i in range(iters):
train_inputs.append(tst_reader_iter.next())
self.parallel_exe(train_inputs, seed=1)
if __name__ == '__main__':
unittest.main()
...@@ -14,7 +14,7 @@ ...@@ -14,7 +14,7 @@
import math import math
import unittest import unittest
from paddle.fluid.distribute_transpiler import split_dense_variable from paddle.fluid.transpiler.distribute_transpiler import split_dense_variable
import paddle.fluid as fluid import paddle.fluid as fluid
import paddle.fluid.core as core import paddle.fluid.core as core
import random import random
......
...@@ -173,9 +173,9 @@ class Trainer(object): ...@@ -173,9 +173,9 @@ class Trainer(object):
def save_params(self, param_path): def save_params(self, param_path):
# reference: save_persistables in io.py # reference: save_persistables in io.py
exe = executor.Executor(self.place) with self._prog_and_scope_guard():
io.save_persistables( exe = executor.Executor(self.place)
exe, dirname=param_path, main_program=self.startup_program) io.save_persistables(exe, dirname=param_path)
@staticmethod @staticmethod
def _check_and_get_place(place): def _check_and_get_place(place):
......
...@@ -18,7 +18,9 @@ import math ...@@ -18,7 +18,9 @@ import math
import distributed_splitter as splitter import distributed_splitter as splitter
from .. import core from .. import core
from ..framework import Program, default_main_program, Variable, Parameter from ..framework import Program, default_main_program, \
default_startup_program, \
Variable, Parameter, grad_var_name
LOOKUP_TABLE_TYPE = "lookup_table" LOOKUP_TABLE_TYPE = "lookup_table"
LOOKUP_TABLE_GRAD_TYPE = "lookup_table_grad" LOOKUP_TABLE_GRAD_TYPE = "lookup_table_grad"
...@@ -153,43 +155,43 @@ class DistributeTranspiler: ...@@ -153,43 +155,43 @@ class DistributeTranspiler:
split_method=splitter.round_robin, split_method=splitter.round_robin,
sync_mode=True): sync_mode=True):
""" """
Transpile the program to distributed data-parallelism programs. Transpile the program to distributed data-parallelism programs.
The main_program will be transformed to use a remote parameter server The main_program will be transformed to use a remote parameter server
to do parameter optimization. And the optimization graph will be put to do parameter optimization. And the optimization graph will be put
into a parameter server program. into a parameter server program.
Use different methods to split trainable variables to different Use different methods to split trainable variables to different
parameter servers. parameter servers.
Steps to transpile trainer: Steps to transpile trainer:
1. split variable to multiple blocks, aligned by product(dim[1:]) (width). 1. split variable to multiple blocks, aligned by product(dim[1:]) (width).
2. rename split grad variables to add trainer_id suffix ".trainer_%d". 2. rename split grad variables to add trainer_id suffix ".trainer_%d".
3. modify the trainer program to add a split_op to each grad variable. 3. modify the trainer program to add a split_op to each grad variable.
4. append send_op to send split variables to the server and fetch 4. append send_op to send split variables to the server and fetch
params (split blocks or the origin param) from the server. params (split blocks or the origin param) from the server.
5. append concat_op to merge split blocks to update local weights. 5. append concat_op to merge split blocks to update local weights.
Steps to transpile pserver: Steps to transpile pserver:
1. create new program for parameter server. 1. create new program for parameter server.
2. create params and grad variables that assigned to current server instance. 2. create params and grad variables that assigned to current server instance.
3. create a sub-block in the server side program 3. create a sub-block in the server side program
4. append ops that should run on current server instance. 4. append ops that should run on current server instance.
5. add listen_and_serv op 5. add listen_and_serv op
:param trainer_id: one unique id for each trainer in a job. :param trainer_id: one unique id for each trainer in a job.
:type trainer_id: int :type trainer_id: int
:param program: program to transpile, default is default_main_program :param program: program to transpile, default is default_main_program
:type program: Program :type program: Program
:param pservers: parameter server endpoints like "m1:6174,m2:6174" :param pservers: parameter server endpoints like "m1:6174,m2:6174"
:type pservers: string :type pservers: string
:param trainers: total number of workers/trainers in the job :param trainers: total number of workers/trainers in the job
:type trainers: int :type trainers: int
:param split_method: A function to determine how to split variables :param split_method: A function to determine how to split variables
to different servers equally. to different servers equally.
:type split_method: function :type split_method: function
:type sync_mode: boolean default True :type sync_mode: boolean default True
:param sync_mode: if sync_mode is set True, it means that dist transpiler :param sync_mode: if sync_mode is set True, it means that dist transpiler
will transpile the program into sync_mode pserver and trainer program. will transpile the program into sync_mode pserver and trainer program.
""" """
assert (callable(split_method)) assert (callable(split_method))
if program is None: if program is None:
...@@ -244,7 +246,7 @@ class DistributeTranspiler: ...@@ -244,7 +246,7 @@ class DistributeTranspiler:
] ]
grad_list = [ grad_list = [
grad for grad in grad_list grad for grad in grad_list
if grad.name != framework.grad_var_name(self.table_name) if grad.name != grad_var_name(self.table_name)
] ]
self.table_param_grad = [ self.table_param_grad = [
param_grad for param_grad in params_grads param_grad for param_grad in params_grads
...@@ -494,7 +496,7 @@ class DistributeTranspiler: ...@@ -494,7 +496,7 @@ class DistributeTranspiler:
were split to several blocks. were split to several blocks.
""" """
s_prog = Program() s_prog = Program()
orig_s_prog = framework.default_startup_program() orig_s_prog = default_startup_program()
params = self.param_grad_ep_mapping[endpoint]["params"] params = self.param_grad_ep_mapping[endpoint]["params"]
def _get_splited_name_and_shape(varname): def _get_splited_name_and_shape(varname):
...@@ -619,7 +621,7 @@ class DistributeTranspiler: ...@@ -619,7 +621,7 @@ class DistributeTranspiler:
# 2. add split_ids_op and send_vars_op to send gradient to pservers # 2. add split_ids_op and send_vars_op to send gradient to pservers
# there should only be one table_name # there should only be one table_name
all_ops = program.global_block().ops all_ops = program.global_block().ops
table_grad_name = framework.grad_var_name(self.table_name) table_grad_name = grad_var_name(self.table_name)
for op in all_ops: for op in all_ops:
if table_grad_name in op.output_arg_names: if table_grad_name in op.output_arg_names:
op_index = list(all_ops).index(op) op_index = list(all_ops).index(op)
...@@ -692,7 +694,7 @@ class DistributeTranspiler: ...@@ -692,7 +694,7 @@ class DistributeTranspiler:
persistable=True) persistable=True)
grad_var = _clone_var( grad_var = _clone_var(
pserver_program.global_block(), pserver_program.global_block(),
self.origin_program.global_block().vars[framework.grad_var_name( self.origin_program.global_block().vars[grad_var_name(
self.table_name)], self.table_name)],
persistable=False) persistable=False)
......
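The transpile() docstring above lists the trainer-side and pserver-side rewriting steps. The following is a rough usage sketch only: the entry point fluid.DistributeTranspiler and the get_trainer_program / get_pserver_program / get_startup_program helpers are assumed from other Fluid examples of this period rather than shown in this diff, and the endpoints and environment variable names are purely illustrative.

import os
import paddle.fluid as fluid

# Build the usual single-machine program first (network + optimizer), then:
t = fluid.DistributeTranspiler()                       # assumed top-level alias
pservers = "192.168.0.1:6174,192.168.0.2:6174"         # illustrative endpoints
current_endpoint = "192.168.0.1:6174"
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))  # illustrative env var

# Rewrites default_main_program() following steps 1-5 of the docstring.
t.transpile(trainer_id=trainer_id, pservers=pservers, trainers=2, sync_mode=True)

if os.getenv("PADDLE_TRAINING_ROLE", "TRAINER") == "PSERVER":
    pserver_prog = t.get_pserver_program(current_endpoint)
    pserver_startup = t.get_startup_program(current_endpoint, pserver_prog)
    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(pserver_startup)
    exe.run(pserver_prog)            # blocks inside listen_and_serv
else:
    trainer_prog = t.get_trainer_program()
    # run trainer_prog with an Executor / ParallelExecutor as usual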
...@@ -20,6 +20,7 @@ import time ...@@ -20,6 +20,7 @@ import time
import threading import threading
import logging import logging
import copy import copy
import csv
import netaddr import netaddr
import boto3 import boto3
...@@ -136,6 +137,12 @@ parser.add_argument( ...@@ -136,6 +137,12 @@ parser.add_argument(
parser.add_argument( parser.add_argument(
'--master_server_ip', type=str, default="", help="master server private ip") '--master_server_ip', type=str, default="", help="master server private ip")
parser.add_argument(
'--metric_data_identifier',
type=str,
default="**metrics_data: ",
help="key string to identify metrics data")
parser.add_argument( parser.add_argument(
'--no_clean_up', '--no_clean_up',
type=str2bool, type=str2bool,
...@@ -155,6 +162,11 @@ logging.basicConfig( ...@@ -155,6 +162,11 @@ logging.basicConfig(
log_files = ["master.log"] log_files = ["master.log"]
metrics = {}
metrics_csv_file_name = "metrics.csv"
is_metrics_file_created = False
def create_subnet(): def create_subnet():
# if no vpc id provided, list vpcs # if no vpc id provided, list vpcs
...@@ -329,12 +341,42 @@ def create_pservers(): ...@@ -329,12 +341,42 @@ def create_pservers():
cleanup(args.task_name) cleanup(args.task_name)
def save_metrics_data(str_msg):
#parse msg
logging.info("found metrics data, saving it to csv file")
global is_metrics_file_created
metrics_raw = str_msg.split(",")
with open(args.log_path + metrics_csv_file_name, 'a') as csvfile:
csv_fieldnames = []
csv_write_data = {}
for metric in metrics_raw:
metric_data = metric.split("=")
metric_key = metric_data[0].strip()
metric_val = float(metric_data[1].strip())
if not metric_key in metrics:
metrics[metric_key] = []
metric_repo = metrics[metric_key]
metric_repo.append(metric_val)
csv_fieldnames.append(metric_key)
csv_write_data[metric_key] = metric_val
writer = csv.DictWriter(csvfile, fieldnames=csv_fieldnames)
if not is_metrics_file_created:
writer.writeheader()
is_metrics_file_created = True
writer.writerow(csv_write_data)
logging.info("csv file appended")
def log_to_file(source, filename): def log_to_file(source, filename):
if not filename in log_files: if not filename in log_files:
log_files.append(filename) log_files.append(filename)
with open(args.log_path + filename, "a") as log_file: with open(args.log_path + filename, "a") as log_file:
for line in iter(source.readline, ""): for line in iter(source.readline, ""):
log_file.write(line) log_file.write(line)
if (line.startswith(args.metric_data_identifier)):
#found key data, trying to add to csv
line = line.replace(args.metric_data_identifier, "")
save_metrics_data(line)
def parse_command(command_raw, defaults={}): def parse_command(command_raw, defaults={}):
......
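For the log_to_file / save_metrics_data additions above: a training job reports metrics by printing a line that starts with the --metric_data_identifier prefix followed by comma-separated key=value pairs. A small self-contained sketch of that contract follows; the sample line and field names are invented for illustration.

# Parse one metrics log line the way save_metrics_data() above does:
# split on "," then on "=", keeping the values as floats for the CSV row.
identifier = "**metrics_data: "
line = "**metrics_data: pass=3, train_cost=0.412, speed=1423.7\n"  # sample only

if line.startswith(identifier):
    payload = line.replace(identifier, "")
    row = {}
    for item in payload.split(","):
        key, value = item.split("=")
        row[key.strip()] = float(value.strip())
    print(row)   # {'pass': 3.0, 'train_cost': 0.412, 'speed': 1423.7}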