update to the develop branch.

cf2608e3 · dangqingqing · 64fe9bcc · 154e1d04 · cf2608e3 · cf2608e3
131 changed files
--- a/Dockerfile
+++ b/Dockerfile
@@ -22,7 +22,7 @@ COPY ./paddle/scripts/docker/root/ /root/

 RUN apt-get update && \
    apt-get install -y \
-    git python-pip python-dev openssh-server bison  \
+    git python-pip python-dev openssh-server bison libnccl-dev \
    wget unzip unrar tar xz-utils bzip2 gzip coreutils ntp \
    curl sed grep graphviz libjpeg-dev zlib1g-dev  \
    python-matplotlib gcc-4.8 g++-4.8 \

--- a/doc/design/block.md
+++ b/doc/design/block.md
@@ -189,7 +189,7 @@ OpDesc {
  inputs = {0} // the index of x in vars of BlockDesc above
  outputs = {5, 3} // indices of act and hidden_out in vars of BlockDesc above
  attrs {
-    "memories" : {1} // the index of h
+    "states" : {1} // the index of h
    "step_net" : <above step net>
  }
 };

--- a/doc/design/register_grad_op.md
+++ b/doc/design/register_grad_op.md
@@ -3,17 +3,17 @@

 ## The Problem Posed

-Currently, for each C++ operator class definition, there registers a *gradient operator creator* function, which takes a C++ operator instance and returns the corresponding gradient operator instance.
+Currently, for each C++ operator class definition, a *gradient operator creator* function is registered, which takes as input a C++ operator instance and returns the corresponding gradient operator instance.

-However, we noticed two problems with the current deisgn:
+However, we noticed two problems with the current design:

-1. As we decided to separate the *compilation* and *execution* phases, we need to change the creator to take an `OpDesc` protobuf message in a `ProgramDesc` and inserts corresponding `OpDesc` messages into the `ProgramDesc` message.
+1. As we decided to separate the *compilation* and the *execution* phases, we need to change the creator to take an `OpDesc` protobuf message in a `ProgramDesc` and insert the corresponding `OpDesc` messages into the `ProgramDesc` message.

-1. Some operator's gradient computation requires more than one gradient operators.  For example, the gradient of *minus* consists of two operators -- an identity operaotr and a scale operator.  So we need to make the registration mechanism to support the mapping from an operator to a set of operators for gradient computation.
+1. For some operators, the gradient computation can be written in terms of existing operators.  For example, the gradient of the *minus* operator consists of two operators -- an *identity* operator followed by a *scale* operator.  Hence the registration mechanism needs to support mapping from an operator to a set of operators for the gradient computation.

 ## The Current Implementation

-The C++ class `OpInfos` store in a association map which key is the operator type. The `grad_op_type` indicate associated gradient operator type. Operator can create gradient operator by `OpInfo::creator_` of gradient. The pseudo code is
+Instances of the C++ class `OpInfo` are stored in an associative map whose key is the operator type. The `grad_op_type` indicates the associated gradient operator type. An operator can create the gradient operator by invoking `OpInfo::creator_` of the gradient operator. The pseudo code is as follows:

 ```cpp
 struct OpInfo {
@@ -31,16 +31,16 @@ OperatorBase* CreateGradientOperator(const OperatorBase& op) {

 ## Proposed Solution

-The mapping relationship between an operator and its gradient operators is a function. The interface of that function is:
+The mapping relationship between an operator and its gradient operators is a function. The interface of this function is:

 ```cpp
 // (OpDesc) --> vector<OpDesc>
 std::function<std::vector<OpDescBind>(const OpDescBind&)>;
 ```

-The function takes an `OpDescBind` of the forward operator and returns one or many gradient operator descriptions. `OpDescBind` is a C++ wrapper for protobuf message `OpDesc` to manipulate `OpDesc` fast.
+The function takes an `OpDescBind` of the forward operator and returns one or many gradient operator descriptions. `OpDescBind` is a C++ wrapper for the protobuf message `OpDesc` for rapid manipulation of `OpDesc`.

-The `GradOpDescMaker` will be registered in `OpInfo`, to replace `grad_op_type_` field. The `OpInfo` should be
+The `GradOpDescMaker` will be registered in `OpInfo` and will replace the `grad_op_type_` field. The `OpInfo` should look like:

 ```cpp
 struct OpInfo {
@@ -49,7 +49,7 @@ struct OpInfo {
 };
 ```

-The `grad_op_maker_ ` is `nullptr` if the operator does not have associated gradient operators.
+The `grad_op_maker_` is `nullptr` if the operator does not have any associated gradient operators.

 We propose a base class called `GradOpDescMakerBase` to let operator developers generate `Gradient Operators` easily. The public interface of that class is

@@ -74,7 +74,7 @@ func = [] (const OpDescBind& fwd_op) {

 We can write many helper functions since the `GradOpDescMakerBase` is a class now. The basic helper functions get the variables of `Input`, `Output`, `InputGradient` and `OutputGradient` in the forwarding operator.

-We should chagne register macros at the same time. In the current solution, there is no difference between forwarding operators and backward operators. So `REGISTER_OP` just register one operator. If the `REGISTER_OPERATOR ` contains `OpProtoAndCheckerMaker` and `GradOpDescMaker`, we just list them in the same macro. It can be done by a macro contains `__VA_ARGS__`.
+We should change the registration macros at the same time. In the current solution, there is no difference between forwarding operators and backward operators. So `REGISTER_OP` just registers one operator. If `REGISTER_OPERATOR` contains `OpProtoAndCheckerMaker` and `GradOpDescMaker`, we just list them in the same macro. It can be done by a macro that contains `__VA_ARGS__`.

 The user interface should be


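Review note on `doc/design/register_grad_op.md`: the proposed `(OpDesc) --> vector<OpDesc>` mapping can be illustrated with the *minus* example from the text. This is a minimal sketch, not the user interface the document goes on to define. `ToyOpDesc`, `ToyGradOpMaker`, `GradName`, and `MinusGradMaker` are hypothetical stand-ins for `OpDescBind` and the real maker classes, and the `@GRAD` suffix for gradient variable names is an assumption.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for OpDescBind: an operator type plus
// named inputs, named outputs, and float attributes.
struct ToyOpDesc {
  std::string type;
  std::map<std::string, std::string> inputs;
  std::map<std::string, std::string> outputs;
  std::map<std::string, float> attrs;
};

// Same shape as the proposed interface: (OpDesc) --> vector<OpDesc>.
using ToyGradOpMaker = std::function<std::vector<ToyOpDesc>(const ToyOpDesc&)>;

// Assumed naming convention for the gradient of a variable.
inline std::string GradName(const std::string& var) { return var + "@GRAD"; }

// Gradient of Out = X - Y, expressed with two existing operators:
//   dX = dOut        (identity)
//   dY = -1 * dOut   (scale)
inline std::vector<ToyOpDesc> MinusGradMaker(const ToyOpDesc& fwd) {
  assert(fwd.type == "minus");
  const std::string d_out = GradName(fwd.outputs.at("Out"));

  ToyOpDesc dx;
  dx.type = "identity";
  dx.inputs["X"] = d_out;
  dx.outputs["Out"] = GradName(fwd.inputs.at("X"));

  ToyOpDesc dy;
  dy.type = "scale";
  dy.inputs["X"] = d_out;
  dy.outputs["Out"] = GradName(fwd.inputs.at("Y"));
  dy.attrs["scale"] = -1.0f;

  return {dx, dy};
}
```

Because the maker is an ordinary function object, an operator with no gradient can leave it empty, which corresponds to the `nullptr` case described in the design.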
--- a/doc/faq/local/index_cn.rst
+++ b/doc/faq/local/index_cn.rst
@@ -174,7 +174,7 @@ decoder_inputs = paddle.layer.fc(
 1. 两者都是对梯度的截断，但截断时机不同，前者在 :code:`optimzier` 更新网络参数时应用；后者在激活函数反向计算时被调用；
 2. 截断对象不同：前者截断可学习参数的梯度，后者截断回传给前层的梯度;

-除此之外，还可以通过减小学习律或者对数据进行归一化处理来解决这类问题。
+除此之外，还可以通过减小学习率或者对数据进行归一化处理来解决这类问题。

 5.  如何调用 infer 接口输出多个layer的预测结果
 -----------------------------------------------

--- a/paddle/capi/CMakeLists.txt
+++ b/paddle/capi/CMakeLists.txt
@@ -28,23 +28,37 @@ add_style_check_target(paddle_capi ${CAPI_SOURCES} ${CAPI_HEADER}

 add_dependencies(paddle_capi paddle_proto)

-# combine all paddle static libraries together, into libpaddle_capi_whole.a
-# user should use PaddleCAPI as -lpaddle_capi_whole
-set(PADDLE_CAPI_INFER_LIBS
-    paddle_utils
-    paddle_parameter
-    paddle_math
-    paddle_cuda
-    paddle_function
-    paddle_gserver
-    paddle_proto)
-
+# TODO: paddle_capi_whole will be removed.
+if(MOBILE_INFERENCE)
+    set(PADDLE_CAPI_INFER_LIBS
+        paddle_utils
+        paddle_parameter
+        paddle_math
+        paddle_cuda
+        paddle_function
+        paddle_gserver
+        paddle_proto)
+else()
+    set(PADDLE_CAPI_INFER_LIBS
+        paddle_utils
+        paddle_parameter
+        paddle_math
+        paddle_cuda
+        paddle_function
+        paddle_gserver
+        paddle_proto
+        paddle_pserver
+        paddle_network)
+endif()
 cc_library(paddle_capi_whole DEPS paddle_capi ${PADDLE_CAPI_INFER_LIBS})

-# No shared library for iOS
+# Link the static library for inference
+cc_library(paddle_capi_engine DEPS paddle_capi paddle_utils paddle_parameter paddle_math paddle_cuda paddle_proto)
+cc_library(paddle_capi_layers DEPS paddle_function paddle_gserver)
+
+# Link the shared library for inference
 if(NOT IOS)
-  set(LINK_FLAGS " -Wl,--retain-symbols-file ${CMAKE_CURRENT_SOURCE_DIR}/export.sym -Wl,--version-script ${CMAKE_CURRENT_SOURCE_DIR}/export.map")
-  # TODO: merge mkl into paddle_capi_shared
+  set(LINK_FLAGS "-Wl,--version-script ${CMAKE_CURRENT_SOURCE_DIR}/paddle_capi.map")
  add_library(paddle_capi_shared SHARED ${CAPI_SOURCES})
  set_target_properties(paddle_capi_shared	PROPERTIES LINK_FLAGS "${LINK_FLAGS}")
  target_include_directories(paddle_capi_shared PUBLIC ${CMAKE_CURRENT_BINARY_DIR})
@@ -53,9 +67,10 @@ endif()

 # install library & headers.
 install(FILES ${CAPI_HEADERS} DESTINATION include/paddle)
+install(FILES paddle_capi.map DESTINATION include/paddle)
 install(FILES ${CMAKE_CURRENT_BINARY_DIR}/config.h DESTINATION include/paddle)
 if(ANDROID)
-  install(TARGETS paddle_capi_whole paddle_capi_shared
+  install(TARGETS paddle_capi_whole paddle_capi_engine paddle_capi_layers paddle_capi_shared
          ARCHIVE DESTINATION lib/${ANDROID_ABI}
          LIBRARY DESTINATION lib/${ANDROID_ABI})
  execute_process(
@@ -80,7 +95,7 @@ if(ANDROID)
      )"
  )
 else(ANDROID)
-  install(TARGETS paddle_capi_whole ARCHIVE DESTINATION lib)
+  install(TARGETS paddle_capi_whole paddle_capi_engine paddle_capi_layers ARCHIVE DESTINATION lib)
  if(NOT IOS)
    install(TARGETS paddle_capi_shared DESTINATION lib)
  endif()

--- a/paddle/capi/export.sym
+++ b/paddle/capi/export.sym
--- a/paddle/capi/export.map
+++ b/paddle/capi/export.map
--- a/paddle/framework/CMakeLists.txt
+++ b/paddle/framework/CMakeLists.txt
@@ -19,16 +19,15 @@ cc_test(scope_test SRCS scope_test.cc DEPS scope)
 proto_library(framework_proto SRCS framework.proto)

 cc_library(attribute SRCS attribute.cc DEPS framework_proto)
-cc_library(proto_desc SRCS var_desc.cc op_desc.cc block_desc.cc program_desc.cc DEPS attribute ddim op_info)
-cc_test(program_desc_test SRCS program_desc_test.cc DEPS proto_desc
-device_context)
+cc_test(program_desc_test SRCS program_desc_test.cc DEPS proto_desc)
 cc_library(op_proto_maker SRCS op_proto_maker.cc DEPS framework_proto attribute)
 cc_test(op_proto_maker_test SRCS op_proto_maker_test.cc DEPS op_proto_maker)
 cc_library(op_info SRCS op_info.cc DEPS attribute framework_proto)
-cc_library(operator SRCS operator.cc DEPS op_info device_context tensor scope proto_desc glog)
+cc_library(operator SRCS operator.cc DEPS op_info device_context tensor scope glog)
 cc_test(operator_test SRCS operator_test.cc DEPS operator op_registry)
+cc_library(proto_desc SRCS var_desc.cc op_desc.cc block_desc.cc program_desc.cc DEPS attribute ddim op_info operator)

-cc_library(op_registry SRCS op_registry.cc DEPS op_proto_maker op_info operator glog)
+cc_library(op_registry SRCS op_registry.cc DEPS op_proto_maker op_info operator glog proto_desc)
 cc_test(op_registry_test SRCS op_registry_test.cc DEPS op_registry)

 py_proto_compile(framework_py_proto SRCS framework.proto)
@@ -44,7 +43,7 @@ add_custom_command(TARGET framework_py_proto POST_BUILD
 cc_library(backward SRCS backward.cc DEPS net_op)
 cc_test(backward_test SRCS backward_test.cc DEPS backward recurrent_op device_context)

-cc_library(executor SRCS executor.cc DEPS op_registry device_context scope framework_proto backward)
+cc_library(executor SRCS executor.cc DEPS op_registry device_context scope framework_proto backward glog)

 cc_library(prune SRCS prune.cc DEPS framework_proto)
 cc_test(prune_test SRCS prune_test.cc DEPS op_info prune recurrent_op device_context)

--- a/paddle/framework/backward.cc
+++ b/paddle/framework/backward.cc
@@ -21,6 +21,7 @@

 #include "paddle/framework/block_desc.h"
 #include "paddle/framework/op_registry.h"
+#include "paddle/operators/dynamic_recurrent_op.h"
 #include "paddle/operators/net_op.h"
 #include "paddle/operators/recurrent_op.h"

@@ -220,8 +221,7 @@ static std::unique_ptr<OperatorBase> BackwardRecursive(
    // process recurrent gradient op as a special operator.
    if (forwardOp.Type() == "recurrent") {
      // NOTE clean up cycle call somewhere (RNN's stepnet constains itself),
-      // or
-      // this will result in infinite loop.
+      // or this will result in infinite loop.
      const auto& rnnop =
          *static_cast<const operators::RecurrentOp*>(&forwardOp);
      auto rnn_grad_op =
@@ -231,6 +231,18 @@ static std::unique_ptr<OperatorBase> BackwardRecursive(
      // create stepnet's gradient op
      rnn_grad_op->set_stepnet(
          BackwardRecursive(stepnet_op, no_grad_names, grad_to_var, uniq_id));
+    } else if (forwardOp.Type() == "dynamic_recurrent") {
+      // NOTE clean up cycle call somewhere (RNN's stepnet contains itself),
+      // or this will result in infinite loop.
+      const auto& rnnop =
+          *static_cast<const operators::DynamicRecurrentOp*>(&forwardOp);
+      auto rnn_grad_op =
+          static_cast<operators::DynamicRecurrentGradientOp*>(grad_op.get());
+      const auto& stepnet_op =
+          *static_cast<const OperatorBase*>(&rnnop.rnn.GetStepUnit());
+      // create stepnet's gradient op
+      rnn_grad_op->rnn.SetStepUnit(
+          BackwardRecursive(stepnet_op, no_grad_names, grad_to_var, uniq_id));
    }

    if (net->ops_.empty()) {  // Current no aux op is added to network

--- a/paddle/framework/block_desc.cc
+++ b/paddle/framework/block_desc.cc
@@ -41,6 +41,19 @@ bool BlockDescBind::HasVar(const std::string &name) const {
  return vars_.find(name) != vars_.end();
 }

+VarDescBind *BlockDescBind::FindVarRecursive(const std::string &name) const {
+  auto it = vars_.find(name);
+  if (it == vars_.end()) {
+    return Parent() == kNoneBlockIndex ? nullptr
+                                       : ParentBlock()->FindVarRecursive(name);
+  }
+  return it->second.get();
+}
+
+bool BlockDescBind::HasVarRecursive(const std::string &name) const {
+  return FindVarRecursive(name) != nullptr;
+}
+
 std::vector<VarDescBind *> BlockDescBind::AllVars() const {
  std::vector<VarDescBind *> res;
  for (const auto &p : vars_) {
@@ -97,7 +110,7 @@ void BlockDescBind::Flush() {
 }

 BlockDescBind *BlockDescBind::ParentBlock() const {
-  if (this->desc_->parent_idx() == -1) {
+  if (this->desc_->parent_idx() == kNoneBlockIndex) {
    return nullptr;
  }
  return prog_->Block(static_cast<size_t>(this->desc_->parent_idx()));

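Review note on `FindVarRecursive`: the lookup walks from a block to its ancestors until the name is found or the root's parent index is reached. The sketch below shows the same idea iteratively on a toy structure. `ToyBlock`, `kNoParent`, and the `int` payload are hypothetical; the real code recurses through `ParentBlock()` and returns a `VarDescBind*`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

constexpr int kNoParent = -1;  // plays the role of kNoneBlockIndex

// Hypothetical block: local variables plus the index of the parent block.
struct ToyBlock {
  int parent_idx = kNoParent;
  std::map<std::string, int> vars;  // variable name -> toy payload
};

// Looks `name` up in block `idx`, then in each ancestor in turn.
// Returns nullptr when no block on the chain defines the variable.
inline const int* FindVarRecursive(const std::vector<ToyBlock>& blocks,
                                   int idx, const std::string& name) {
  while (idx != kNoParent) {
    assert(idx >= 0 && static_cast<std::size_t>(idx) < blocks.size());
    const ToyBlock& block = blocks[static_cast<std::size_t>(idx)];
    auto it = block.vars.find(name);
    if (it != block.vars.end()) return &it->second;
    idx = block.parent_idx;  // not found locally: continue in the parent
  }
  return nullptr;
}
```

A step block nested in the root block can therefore resolve a parameter declared in the root, while the root cannot see variables local to the step block.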
--- a/paddle/framework/block_desc.h
+++ b/paddle/framework/block_desc.h
@@ -21,6 +21,7 @@ limitations under the License. */
 #include <vector>

 #include "paddle/framework/op_desc.h"
+#include "paddle/framework/proto_desc.h"
 #include "paddle/framework/var_desc.h"
 #include "paddle/platform/macros.h"

@@ -56,6 +57,10 @@ class BlockDescBind {

  bool HasVar(const std::string &var_name) const;

+  VarDescBind *FindVarRecursive(const std::string &name_bytes) const;
+
+  bool HasVarRecursive(const std::string &var_name) const;
+
  std::set<std::string> LocalVarNames() const {
    std::set<std::string> var_names;
    for (auto &var : vars_) {

--- a/paddle/framework/data_type.h
+++ b/paddle/framework/data_type.h
@@ -26,6 +26,8 @@ inline DataType ToDataType(std::type_index type) {
    return DataType::FP64;
  } else if (typeid(int).hash_code() == type.hash_code()) {
    return DataType::INT32;
+  } else if (typeid(int64_t).hash_code() == type.hash_code()) {
+    return DataType::INT64;
  } else {
    PADDLE_THROW("Not supported");
  }

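Review note on `ToDataType`: the function compares `hash_code()` values, and the C++ standard does not guarantee that distinct types have distinct hash codes. A possible alternative, sketched here under assumptions, compares `std::type_index` objects directly. `ToyDataType` and `ToToyDataType` are hypothetical names, and the `FP32` branch is assumed since the hunk only shows `FP64`, `INT32`, and `INT64`.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <typeindex>
#include <typeinfo>

// Hypothetical enum mirroring the DataType values named in the hunk.
enum class ToyDataType { FP32, FP64, INT32, INT64 };

// Maps a C++ type to its enum tag by comparing std::type_index values,
// which compare equal only when they refer to the same type.
inline ToyDataType ToToyDataType(std::type_index type) {
  if (type == std::type_index(typeid(float))) return ToyDataType::FP32;
  if (type == std::type_index(typeid(double))) return ToyDataType::FP64;
  if (type == std::type_index(typeid(int))) return ToyDataType::INT32;
  if (type == std::type_index(typeid(std::int64_t))) return ToyDataType::INT64;
  throw std::invalid_argument("Not supported");
}

// Usage example: the newly added int64_t branch resolves to INT64.
inline void ToyDataTypeExample() {
  assert(ToToyDataType(typeid(std::int64_t)) == ToyDataType::INT64);
  assert(ToToyDataType(typeid(int)) == ToyDataType::INT32);
}
```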
--- a/paddle/framework/executor.cc
+++ b/paddle/framework/executor.cc
@@ -68,9 +68,13 @@ void Executor::Run(const ProgramDesc& pdesc, Scope* scope, int block_id) {

  for (auto& var : block.vars()) {
    if (var.persistable()) {
-      scope->Var(var.name());
+      auto* ptr = scope->Var(var.name());
+      VLOG(3) << "Create Variable " << var.name()
+              << " global, which pointer is " << ptr;
    } else {
-      local_scope.Var(var.name());
+      auto* ptr = local_scope.Var(var.name());
+      VLOG(3) << "Create Variable " << var.name()
+              << " locally, which pointer is " << ptr;
    }
  }

@@ -80,8 +84,7 @@ void Executor::Run(const ProgramDesc& pdesc, Scope* scope, int block_id) {
    op->Run(local_scope, *device);
  }

-  // TODO(tonyyang-svail):
-  //  - Destroy local_scope
+  scope->DeleteScope(&local_scope);
 }

 }  // namespace framework

--- a/paddle/framework/feed_fetch_method.h
+++ b/paddle/framework/feed_fetch_method.h
@@ -13,37 +13,45 @@ See the License for the specific language governing permissions and
 limitations under the License. */

 #pragma once
+#include "glog/logging.h"
+#include "paddle/framework/feed_fetch_type.h"
 #include "paddle/framework/scope.h"
 #include "paddle/framework/variable.h"

 namespace paddle {
 namespace framework {

-template <typename T>
-void SetFeedVariable(const LoDTensor& input, const std::string& var_name,
-                     size_t index) {
+void SetFeedVariable(Scope* scope, const LoDTensor& input,
+                     const std::string& var_name, size_t index) {
  // If var_name Variable is not found in GlobalScope, a new variable will
  // be created.
-  Variable* g_feed_value = GetGlobalScope().Var(var_name);
+  VLOG(3) << "SetFeedVariable name=" << var_name << " index=" << index;
+  Variable* g_feed_value = scope->Var(var_name);
  auto& feed_inputs =
      *(g_feed_value->GetMutable<std::vector<paddle::framework::LoDTensor>>());
  if (index >= feed_inputs.size()) {
    feed_inputs.resize(index + 1);
  }
  // shared data with input tensor
-  feed_inputs[index].ShareDataWith<T>(input);
+  feed_inputs[index].ShareDataWith(input);
  // set lod
  feed_inputs[index].set_lod(input.lod());
 }

-LoDTensor& GetFetchVariable(const std::string& var_name, size_t index) {
+LoDTensor& GetFetchVariable(const Scope& scope, const std::string& var_name,
+                            size_t index) {
  // Since we want to fetch LodTensor from a variable, the variable must
  // be created alreadly.
-  Variable* g_fetch_value = GetGlobalScope().FindVar(var_name);
-  auto& fetch_outputs =
-      *(g_fetch_value->GetMutable<std::vector<paddle::framework::LoDTensor>>());
+  Variable* g_fetch_value = scope.FindVar(var_name);
+  PADDLE_ENFORCE(g_fetch_value->IsType<FeedFetchList>(),
+                 "Only %s can be invoked by GetFetchVariable",
+                 typeid(FeedFetchList).name());
+  auto& fetch_outputs = *g_fetch_value->GetMutable<FeedFetchList>();
+  auto& tensor = fetch_outputs[index];
+  VLOG(3) << "Fetch " << var_name << " with index " << index
+          << " shape= " << tensor.dims();
  PADDLE_ENFORCE_LT(index, fetch_outputs.size());
-  return fetch_outputs[index];
+  return tensor;
 }

 }  // namespace framework

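Review note on `feed_fetch_method.h`: the new signatures take the scope explicitly instead of relying on a global one. The sketch below models that contract with toy types. `ToyTensor`, `ToyFeedFetchList`, `ToyScope`, `SetFeed`, and `GetFetch` are hypothetical; the real code shares the tensor buffer rather than copying. The sketch validates the index before taking a reference to the slot, whereas the hunk indexes `fetch_outputs[index]` before `PADDLE_ENFORCE_LT` runs.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

using ToyTensor = std::vector<float>;
using ToyFeedFetchList = std::vector<ToyTensor>;
using ToyScope = std::map<std::string, ToyFeedFetchList>;

// Stores `input` in slot `index` of the list named `var_name`,
// creating the variable and growing the list when needed.
inline void SetFeed(ToyScope* scope, const ToyTensor& input,
                    const std::string& var_name, std::size_t index) {
  assert(scope != nullptr);
  ToyFeedFetchList& slots = (*scope)[var_name];  // created on first use
  if (index >= slots.size()) slots.resize(index + 1);
  slots[index] = input;  // copy here; the real code shares the data
}

// Returns slot `index` of an existing list. The variable must exist and
// the index is checked before the element is accessed.
inline const ToyTensor& GetFetch(const ToyScope& scope,
                                 const std::string& var_name,
                                 std::size_t index) {
  auto it = scope.find(var_name);
  if (it == scope.end()) {
    throw std::out_of_range("variable not found: " + var_name);
  }
  if (index >= it->second.size()) {
    throw std::out_of_range("index out of range for " + var_name);
  }
  return it->second[index];
}
```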
--- a/paddle/framework/framework.proto
+++ b/paddle/framework/framework.proto
@@ -68,6 +68,7 @@ message OpProto {

    optional bool duplicable = 3 [ default = false ];
    optional bool intermediate = 4 [ default = false ];
+    optional bool dispensable = 5 [ default = false ];
  }

  // AttrProto describes the C++ type Attribute.
@@ -112,6 +113,8 @@ message VarDesc {
  enum VarType {
    LOD_TENSOR = 1;
    SELECTED_ROWS = 2;
+    FEED_MINIBATCH = 3;
+    FETCH_LIST = 4;
  }
  required string name = 1;
  required VarType type = 2;

--- a/paddle/framework/lod_tensor.cc
+++ b/paddle/framework/lod_tensor.cc
@@ -25,31 +25,50 @@ LoD SliceLevels(const LoD& in, size_t level_begin, size_t level_end) {
  for (size_t i = level_begin; i < level_end; i++) {
    new_lod.emplace_back(in.at(i));
  }
+  // transform the lowest level to absolute offset.
+  LoD abs_offset_lod = ToAbsOffset(in);
+  new_lod.back() = abs_offset_lod[level_end - 1];
  return new_lod;
 }

 LoD SliceInLevel(const LoD& in, size_t level, size_t elem_begin,
                 size_t elem_end) {
-  // slice the lod.
-  LoD new_lod;
-  new_lod.reserve(in.size() - level);
-  auto start = in.at(level)[elem_begin];
-  auto end = in.at(level)[elem_end];
-
-  for (auto it = in.begin() + level; it != in.end(); it++) {
-    auto it_begin = std::find(it->begin(), it->end(), start);
-    auto it_end = std::find(it_begin, it->end(), end);
-    PADDLE_ENFORCE(it_begin != it->end(), "error in parsing lod info");
-    PADDLE_ENFORCE(it_end != it->end(), "error in parsing lod info");
-    new_lod.emplace_back(it_begin, it_end + 1);
-    // reset offset if tensor is copyed and sliced.
-    std::transform(new_lod.back().begin(), new_lod.back().end(),
-                   new_lod.back().begin(),
-                   [start](int v) { return v - start; });
-    PADDLE_ENFORCE_EQ(new_lod.back().front(), 0, "error in slice LoD");
+  PADDLE_ENFORCE_LT(level, in.size());
+  PADDLE_ENFORCE_LT(elem_end, in[level].size());
+
+  LoD res;
+  res.resize(in.size() - level);
+  // copy the first level
+  res[0].assign(in[level].begin() + elem_begin,
+                in[level].begin() + elem_end + 1);
+  for (size_t lvl = 1; lvl < res.size(); lvl++) {
+    const auto& in_level = in[level + lvl];
+    const auto& above_level = res[lvl - 1];
+    auto& out_level = res[lvl];
+    out_level.assign(in_level.begin() + above_level.front(),
+                     in_level.begin() + above_level.back() + 1);
  }
-  PADDLE_ENFORCE_LE(new_lod.size(), in.size());
-  return new_lod;
+  for (size_t lvl = 0; lvl < res.size(); lvl++) {
+    // subtract the first element from every element so that the first
+    // offset of each level equals 0
+    size_t front = res[lvl].front();
+    for (auto& ele : res[lvl]) {
+      ele -= front;
+    }
+  }
+  return res;
+}
+
+LoD ToAbsOffset(const LoD& in) {
+  // the lowest level stores relative offsets
+  if (in.empty() || in.size() == 1) return in;
+  LoD result = in;
+  for (int level = result.size() - 2; level >= 0; level--) {
+    for (auto& ele : result[level]) {
+      ele = result[level + 1][ele];
+    }
+  }
+  return result;
 }

 bool operator==(const LoD& a, const LoD& b) {
@@ -75,17 +94,7 @@ bool operator==(const LoD& a, const LoD& b) {
 size_t LoDTensor::NumElements(size_t level, size_t idx) const {
  PADDLE_ENFORCE_LT(level, NumLevels());
  PADDLE_ENFORCE_LT(idx, NumElements(level));
-  // the last level of LoD, just return number of records in Tensor
-  if (level == NumLevels() - 1) {
-    return lod_[level][idx + 1] - lod_[level][idx];
-  }
-  // high level of LoD, and there is another lower level, return number of
-  // lower-level elements
-  auto tmp = SliceInLevel(lod_, level, idx, idx + 1);
-  PADDLE_ENFORCE_GE(tmp.size(), 2);
-  // there is a 0 as a placeholder stored in LoD, so the number of elements
-  // equals lod.size() - 1
-  return tmp[1].size() - 1;
+  return lod_[level][idx + 1] - lod_[level][idx];
 }

 void LoDTensor::ShrinkLevels(size_t level_begin, size_t level_end) {

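Review note on the rewritten `SliceInLevel`: with relative offsets, each level indexes into the level below it, so a slice copies a range from the chosen level, uses that range's first and last values to cut the next level, and finally rebases every level to start at 0. The standalone sketch below follows the hunk on plain `std::vector`. `ToyLoD` is a hypothetical alias, and `assert` replaces `PADDLE_ENFORCE`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using ToyLoD = std::vector<std::vector<std::size_t>>;

// Keeps elements [elem_begin, elem_end) of `level` and everything below
// them, with every resulting level rebased so that it starts at offset 0.
inline ToyLoD SliceInLevel(const ToyLoD& in, std::size_t level,
                           std::size_t elem_begin, std::size_t elem_end) {
  assert(level < in.size());
  assert(elem_begin <= elem_end && elem_end < in[level].size());

  ToyLoD res(in.size() - level);
  // Copy the offsets of the selected elements, including the end offset.
  res[0].assign(in[level].begin() + elem_begin,
                in[level].begin() + elem_end + 1);
  // Each lower level is cut by the first and last offsets of the level above.
  for (std::size_t lvl = 1; lvl < res.size(); ++lvl) {
    const auto& in_level = in[level + lvl];
    const auto& above = res[lvl - 1];
    assert(above.back() < in_level.size());
    res[lvl].assign(in_level.begin() + above.front(),
                    in_level.begin() + above.back() + 1);
  }
  // Rebase after all levels are cut, since the cuts need the original values.
  for (auto& offsets : res) {
    const std::size_t front = offsets.front();
    for (auto& e : offsets) e -= front;
  }
  return res;
}
```

With the LoD from the test fixture, `{0 2 3}`, `{0 2 5 8}`, `{0 2 5 7 10 12 15 17 20}`, slicing level 0 over `[0, 1)` gives `{0 2}`, `{0 2 5}`, `{0 2 5 7 10 12}`. That is 1, 2, and 5 elements per level, in line with the updated `ShrinkInLevel` expectations.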
--- a/paddle/framework/lod_tensor.h
+++ b/paddle/framework/lod_tensor.h
@@ -39,23 +39,36 @@ using Vector = thrust::host_vector<
 #endif

 /*
- * 3-level LoD stores
+ * LoD is short for Level of Details.
 *
- * 0 10 20
- * 0 5 10 15 20
- * 0 2 5 7 10 12 15 20
- *
- * - in a level, each element indicates offset in the underlying Tensor
+ * - in a level, each element is a relative offset into the next lower level
 * - the first element should be 0 and that indicates that this sequence start
 * from 0
 * - each sequence's begin and end(no-inclusive) is level[id, id+1]
+ *
+ * For example:
+ *    3-level LoD stores
+ *
+ *    0 2 3
+ *    0 2 4 7
+ *    0 2 5 7 10 12 15 20
 */
 using LoD = std::vector<Vector<size_t>>;

+/*
+ * Slice levels from a LoD.
+ * NOTE the lowest level should always be the absolute offsets of the underlying
+ * tensor instances. So if higher levels are sliced without the lowest level,
+ * the lowest level of the sliced LoD will be transformed to absolute offsets.
+ */
 LoD SliceLevels(const LoD& in, size_t level_begin, size_t level_end);

 LoD SliceInLevel(const LoD& in, size_t level, size_t elem_begin,
                 size_t elem_end);
+/*
+ * Transform an LoD from relative offsets to absolute offsets.
+ */
+LoD ToAbsOffset(const LoD& in);

 bool operator==(const LoD& a, const LoD& b);


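Review note on `ToAbsOffset`: the conversion works bottom-up, replacing each relative offset with the value it points to in the (already absolute) level below. The sketch below applies it to the 3-level example from the header comment. `ToyLoD` is a hypothetical alias for the real `LoD` type, and the bounds `assert` is an addition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using ToyLoD = std::vector<std::vector<std::size_t>>;

// Converts relative offsets to absolute offsets into the underlying tensor.
// The lowest level is already absolute and is left unchanged.
inline ToyLoD ToAbsOffset(const ToyLoD& in) {
  if (in.size() <= 1) return in;
  ToyLoD result = in;
  // Visit levels from the second lowest up to the top.
  for (std::size_t level = result.size() - 1; level-- > 0;) {
    for (auto& e : result[level]) {
      assert(e < result[level + 1].size());
      e = result[level + 1][e];  // follow the index into the level below
    }
  }
  return result;
}
```

For `{0 2 3}`, `{0 2 4 7}`, `{0 2 5 7 10 12 15 20}` this yields `{0 10 20}`, `{0 5 10 20}`, `{0 2 5 7 10 12 15 20}`: every level now ends at 20, the number of rows in the tensor.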
--- a/paddle/framework/lod_tensor_test.cc
+++ b/paddle/framework/lod_tensor_test.cc
@@ -30,8 +30,8 @@ class LoDTensorTester : public ::testing::Test {
    // 0 5 10 15 20
    // 0 2 5 7 10 12 15 20
    LoD lod;
-    lod.push_back(std::vector<size_t>{0, 10, 20});
-    lod.push_back(std::vector<size_t>{0, 5, 10, 15, 20});
+    lod.push_back(std::vector<size_t>{0, 2, 3});
+    lod.push_back(std::vector<size_t>{0, 2, 5, 8});
    lod.push_back(std::vector<size_t>{0, 2, 5, 7, 10, 12, 15, 17, 20});

    ASSERT_EQ(lod.size(), 3UL);
@@ -52,14 +52,14 @@ TEST_F(LoDTensorTester, NumLevels) { ASSERT_EQ(lod_tensor_.NumLevels(), 3UL); }

 TEST_F(LoDTensorTester, NumElements) {
  ASSERT_EQ(lod_tensor_.NumElements(0), 2UL);
-  ASSERT_EQ(lod_tensor_.NumElements(1), 4UL);
+  ASSERT_EQ(lod_tensor_.NumElements(1), 3UL);
  ASSERT_EQ(lod_tensor_.NumElements(2), 8UL);
 }

 TEST_F(LoDTensorTester, NumElements2) {
  ASSERT_EQ(lod_tensor_.NumElements(0, 0), 2UL);
-  ASSERT_EQ(lod_tensor_.NumElements(0, 1), 2UL);
-  ASSERT_EQ(lod_tensor_.NumElements(1, 1), 2UL);
+  ASSERT_EQ(lod_tensor_.NumElements(0, 1), 1UL);
+  ASSERT_EQ(lod_tensor_.NumElements(1, 1), 3UL);
 }

 TEST_F(LoDTensorTester, ShrinkLevels) {
@@ -68,17 +68,16 @@ TEST_F(LoDTensorTester, ShrinkLevels) {
    LoDTensor new_lod_tensor = lod_tensor_;
    new_lod_tensor.ShrinkLevels(level, level + 1);
    ASSERT_EQ(new_lod_tensor.NumLevels(), 1UL);
-    ASSERT_EQ(new_lod_tensor.NumElements(0), lod_tensor_.NumElements(level));
    ASSERT_EQ(new_lod_tensor.data<float>(), lod_tensor_.data<float>());
  }
  // shrink 2 level
  for (size_t level = 0; level < 2UL; ++level) {
    LoDTensor new_lod_tensor = lod_tensor_;
    new_lod_tensor.ShrinkLevels(level, level + 2);
+    // the lowest level's last element should be the tensor's batch_size.
+    ASSERT_EQ(new_lod_tensor.lod().back().back(),
+              lod_tensor_.lod().back().back());
    ASSERT_EQ(new_lod_tensor.NumLevels(), 2UL);
-    ASSERT_EQ(new_lod_tensor.NumElements(0), lod_tensor_.NumElements(level));
-    ASSERT_EQ(new_lod_tensor.NumElements(1),
-              lod_tensor_.NumElements(level + 1));
    ASSERT_EQ(new_lod_tensor.data<float>(), lod_tensor_.data<float>());
  }
 }
@@ -86,19 +85,19 @@ TEST_F(LoDTensorTester, ShrinkLevels) {
 TEST_F(LoDTensorTester, ShrinkInLevel) {
  size_t level = 0;
  LoDTensor new_lod_tensor = lod_tensor_;
-  new_lod_tensor.ShrinkInLevel(level, 0, 2);
+  new_lod_tensor.ShrinkInLevel(level, 0, 1);
  EXPECT_EQ(new_lod_tensor.NumLevels(), 3UL);
-  EXPECT_EQ(new_lod_tensor.NumElements(0), 2UL);
-  EXPECT_EQ(new_lod_tensor.NumElements(1), 4UL);
-  EXPECT_EQ(new_lod_tensor.NumElements(2), 8UL);
+  EXPECT_EQ(new_lod_tensor.NumElements(0), 1UL);
+  EXPECT_EQ(new_lod_tensor.NumElements(1), 2UL);
+  EXPECT_EQ(new_lod_tensor.NumElements(2), 5UL);
  ASSERT_EQ(new_lod_tensor.data<float>(), lod_tensor_.data<float>());

  level = 1;
  new_lod_tensor = lod_tensor_;
-  new_lod_tensor.ShrinkInLevel(level, 0, 2);
+  new_lod_tensor.ShrinkInLevel(level, 1, 2);
  ASSERT_EQ(new_lod_tensor.NumLevels(), 2UL);
-  ASSERT_EQ(new_lod_tensor.NumElements(0), 2UL);
-  ASSERT_EQ(new_lod_tensor.NumElements(1), 4UL);
+  ASSERT_EQ(new_lod_tensor.NumElements(0), 1UL);
+  ASSERT_EQ(new_lod_tensor.NumElements(1), 3UL);
  ASSERT_EQ(new_lod_tensor.data<float>(), lod_tensor_.data<float>());
 }


--- a/paddle/framework/op_proto_maker.h
+++ b/paddle/framework/op_proto_maker.h
@@ -44,6 +44,11 @@ class OpProtoAndCheckerMaker {
      var_->set_intermediate(true);
      return *this;
    }
+
+    VariableBuilder& AsDispensable() {
+      var_->set_dispensable(true);
+      return *this;
+    }
  };

  VariableBuilder AddInput(const std::string& name, const std::string& comment);

--- a/paddle/framework/operator.cc
+++ b/paddle/framework/operator.cc
@@ -252,5 +252,20 @@ std::ostream& operator<<(std::ostream& os,
  return os;
 }

+bool OpSupportGPU(const std::string& op_type) {
+  auto& all_kernels = OperatorWithKernel::AllOpKernels();
+  auto it = all_kernels.find(op_type);
+  if (it == all_kernels.end()) {
+    // All control operators must support GPU
+    return true;
+  }
+  for (auto& kern_pair : it->second) {
+    if (platform::is_gpu_place(kern_pair.first.place_)) {
+      return true;
+    }
+  }
+  return false;
+}
+
 }  // namespace framework
 }  // namespace paddle
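Review note on `OpSupportGPU`: an operator with no registered kernels is treated as a control operator and reported as GPU-capable; otherwise at least one kernel must be registered for a GPU place. The sketch below captures that rule on a toy registry. `ToyPlace`, `ToyKernelRegistry`, and `OpSupportsGPU` are hypothetical, and the real code iterates over `OperatorWithKernel::AllOpKernels()`.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

enum class ToyPlace { kCPU, kGPU };

// Hypothetical registry: operator type -> places that have a kernel.
using ToyKernelRegistry = std::map<std::string, std::set<ToyPlace>>;

// Returns true for operators without kernels (assumed to be control
// operators) and for operators with at least one GPU kernel.
inline bool OpSupportsGPU(const ToyKernelRegistry& registry,
                          const std::string& op_type) {
  auto it = registry.find(op_type);
  if (it == registry.end()) return true;
  return it->second.count(ToyPlace::kGPU) > 0;
}

// Usage example with one CPU-only and one CPU+GPU operator.
inline void OpSupportsGPUExample() {
  ToyKernelRegistry registry;
  registry["cpu_only_op"] = {ToyPlace::kCPU};
  registry["both_op"] = {ToyPlace::kCPU, ToyPlace::kGPU};
  assert(!OpSupportsGPU(registry, "cpu_only_op"));
  assert(OpSupportsGPU(registry, "both_op"));
  assert(OpSupportsGPU(registry, "recurrent"));  // no kernels registered
}
```

One consequence of this rule is that a misspelled operator type is also reported as GPU-capable, so callers may want to validate the type name separately.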
--- a/paddle/framework/operator.h
+++ b/paddle/framework/operator.h
@@ -327,37 +327,47 @@ class CompileTimeInferShapeContext : public InferShapeContext {
  bool HasInput(const std::string& name) const override {
    const std::vector<std::string>& input_names = op_.Input(name);
    auto length = input_names.size();
+    if (length == 0) {
+      return false;
+    }
    PADDLE_ENFORCE_EQ(length, 1UL,
                      "Input(%s) should have only one value, "
                      "but it have %d now",
                      name, length);
-    return block_.HasVar(input_names[0]);
+    return block_.HasVarRecursive(input_names[0]);
  }

  bool HasOutput(const std::string& name) const override {
    const std::vector<std::string>& output_names = op_.Output(name);
    auto length = output_names.size();
+    if (length == 0) {
+      return false;
+    }
    PADDLE_ENFORCE_EQ(length, 1UL,
                      "Output(%s) should have only one value, "
                      "but it have %d now",
                      name, length);
-    return block_.HasVar(output_names[0]);
+    return block_.HasVarRecursive(output_names[0]);
  }

  bool HasInputs(const std::string& name) const override {
    const std::vector<std::string>& input_names = op_.Input(name);
-    PADDLE_ENFORCE(!input_names.empty(), "Inputs(%s) length is 0", name);
+    if (input_names.empty()) {
+      return false;
+    }
    for (auto& input : input_names) {
-      if (!block_.HasVar(input)) return false;
+      if (!block_.HasVarRecursive(input)) return false;
    }
    return true;
  }

  bool HasOutputs(const std::string& name) const override {
    const std::vector<std::string>& output_names = op_.Output(name);
-    PADDLE_ENFORCE(!output_names.empty(), "Inputs(%s) length is 0", name);
+    if (output_names.empty()) {
+      return false;
+    }
    for (auto& output : output_names) {
-      if (!block_.HasVar(output)) return false;
+      if (!block_.HasVarRecursive(output)) return false;
    }
    return true;
  }
@@ -404,11 +414,11 @@ class CompileTimeInferShapeContext : public InferShapeContext {

 private:
  DDim GetDim(const std::string& name) const override {
-    return framework::make_ddim(block_.FindVar(name)->Shape());
+    return framework::make_ddim(block_.FindVarRecursive(name)->Shape());
  }

  void SetDim(const std::string& name, const DDim& dim) override {
-    block_.FindVar(name)->SetShape(framework::vectorize(dim));
+    block_.FindVarRecursive(name)->SetShape(framework::vectorize(dim));
  }

  const OpDescBind& op_;
@@ -421,13 +431,27 @@ class RuntimeInferShapeContext : public InferShapeContext {
      : op_(op), scope_(scope) {}

  bool HasInput(const std::string& name) const override {
-    auto ipt = op_.Input(name);
+    auto& ins = Inputs(name);
+    size_t length = ins.size();
+    if (length == 0) {
+      return false;
+    }
+    PADDLE_ENFORCE_EQ(length, 1UL, "Input %s should have only one value",
+                      name);
+    auto ipt = ins[0];
    auto* var = ipt == kEmptyVarName ? nullptr : scope_.FindVar(ipt);
    return var != nullptr;
  }

  bool HasOutput(const std::string& name) const override {
-    auto ipt = op_.Output(name);
+    auto& outs = Outputs(name);
+    size_t length = outs.size();
+    if (length == 0) {
+      return false;
+    }
+    PADDLE_ENFORCE_EQ(length, 1UL, "Output %s should have only one value",
+                      name);
+    auto ipt = outs[0];
    auto* var = ipt == kEmptyVarName ? nullptr : scope_.FindVar(ipt);
    return var != nullptr;
  }
@@ -649,5 +673,7 @@ class OperatorWithKernel : public OperatorBase {
 std::ostream& operator<<(std::ostream& os,
                         const OperatorWithKernel::OpKernelKey& kernel_key);

+extern bool OpSupportGPU(const std::string& op_type);
+
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/program_desc.cc
+++ b/paddle/framework/program_desc.cc
@@ -35,8 +35,8 @@ ProgramDesc *ProgramDescBind::Proto() {

 ProgramDescBind::ProgramDescBind() {
  auto *block = prog_.mutable_blocks()->Add();
-  block->set_idx(0);
-  block->set_parent_idx(-1);
+  block->set_idx(kRootBlockIndex);
+  block->set_parent_idx(kNoneBlockIndex);
  blocks_.emplace_back(new BlockDescBind(this, block));
 }


--- a/paddle/framework/program_desc.h
+++ b/paddle/framework/program_desc.h
@@ -17,6 +17,7 @@ limitations under the License. */
 #include <memory>
 #include <vector>
 #include "paddle/framework/framework.pb.h"
+#include "paddle/framework/proto_desc.h"
 #include "paddle/platform/macros.h"

 namespace paddle {

--- a/paddle/framework/program_desc_test.cc
+++ b/paddle/framework/program_desc_test.cc
@@ -80,4 +80,4 @@ TEST(ProgramDesc, copy_ctor) {
  // different and it is correct.
 }
 }  // namespace framework
-}  // namespace paddle
\ No newline at end of file
+}  // namespace paddle
--- a/paddle/framework/proto_desc.h
+++ b/paddle/framework/proto_desc.h
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+
+namespace paddle {
+namespace framework {
+
+// The index of the first block in a Program, also called the root block.
+constexpr int kRootBlockIndex = 0;
+// The parent index of the root block; no block with this index exists.
+constexpr int kNoneBlockIndex = -1;
+
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/framework/scope.cc
+++ b/paddle/framework/scope.cc
@@ -65,12 +65,11 @@ void Scope::DropKids() {
  kids_.clear();
 }

-framework::Scope& GetGlobalScope() {
-  static framework::Scope* g_scope = nullptr;
-  if (g_scope == nullptr) {
-    g_scope = new framework::Scope();
-  }
-  return *g_scope;
+void Scope::DeleteScope(Scope* scope) {
+  auto it = std::find(this->kids_.begin(), this->kids_.end(), scope);
+  PADDLE_ENFORCE(it != this->kids_.end(), "Cannot find %p as kid scope", scope);
+  this->kids_.erase(it);
+  delete scope;
 }

 }  // namespace framework

--- a/paddle/framework/scope.h
+++ b/paddle/framework/scope.h
@@ -59,6 +59,8 @@ class Scope {
  /// Find the scope or an ancestor scope that contains the given variable.
  const Scope* FindScope(const Variable* var) const;

+  void DeleteScope(Scope* scope);
+
  /// Drop all kids scopes belonged to this scope.
  void DropKids();

@@ -72,8 +74,5 @@ class Scope {

  DISABLE_COPY_AND_ASSIGN(Scope);
 };
-
-framework::Scope& GetGlobalScope();
-
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/tensor.h
+++ b/paddle/framework/tensor.h
@@ -60,6 +60,10 @@ class Tensor {
  template <typename T>
  inline T* mutable_data(platform::Place place);

+  inline void* mutable_data(platform::Place place, std::type_index type);
+
+  inline void* mutable_data(platform::Place place);
+
  /**
   * @brief     Return a pointer to mutable memory block.
   *
@@ -81,7 +85,6 @@ class Tensor {
  inline Tensor& Resize(const DDim& dims);

  /*! The internal of two tensors share the same memory block. */
-  template <typename T>
  inline Tensor& ShareDataWith(const Tensor& src);

  /**
@@ -96,26 +99,9 @@ class Tensor {
  // TODO(qijun): https://github.com/PaddlePaddle/Paddle/issues/4647
  // Remove `CopyFrom` and `CopyFromVector` from Tensor interface
  // and make them global functions
-  template <typename T>
  inline void CopyFrom(const Tensor& src, const platform::Place& dst_place,
                       const platform::DeviceContext& ctx);

-  // FIXME(yuyang18): CopyFrom should without template T, use the replace
-  // `CopyFrom` with `CopyFromTensor`
-  inline void CopyFromTensor(const Tensor& src,
-                             const platform::Place& dst_place,
-                             const platform::DeviceContext& ctx) {
-    // NOLINTNEXTLINES_8 cpplint.py will recognize below lines as functions.
-    // That is a bug of cpplint.py. Just ignore lint these lines.
-    if (src.type() == std::type_index(typeid(double))) {
-      CopyFrom<double>(src, dst_place, ctx);
-    } else if (src.type() == std::type_index(typeid(float))) {
-      CopyFrom<float>(src, dst_place, ctx);
-    } else if (src.type() == std::type_index(typeid(int))) {
-      CopyFrom<int>(src, dst_place, ctx);
-    }
-  }
-
  /**
   * @brief   Copy the content of an external vector to a tensor.
   *
@@ -135,7 +121,6 @@ class Tensor {
   * @param[in] begin_idx   The begin index of the slice.
   * @param[in] end_idx     The end index of the slice.
   */
-  template <typename T>
  inline Tensor Slice(const int& begin_idx, const int& end_idx) const;

  platform::Place place() const {
@@ -146,7 +131,6 @@ class Tensor {
  std::type_index type() const { return holder_->type(); }

 private:
-  template <typename T>
  inline void check_memory_size() const;

 private:
@@ -155,20 +139,22 @@ class Tensor {
   *          parameter of Variable.
   */
  struct Placeholder {
-    virtual ~Placeholder() {}
+    virtual ~Placeholder() = default;
    virtual void* ptr() const = 0;
    virtual size_t size() const = 0;
    virtual std::type_index type() const = 0;
    virtual platform::Place place() const = 0;
+    virtual void set_type(std::type_index type) = 0;
  };

-  template <typename T, typename Place>
+  template <typename Place>
  struct PlaceholderImpl : public Placeholder {
-    PlaceholderImpl(Place place, size_t size)
-        : ptr_(static_cast<T*>(memory::Alloc(place, size)),
-               memory::PODDeleter<T, Place>(place)),
+    PlaceholderImpl(Place place, size_t size, std::type_index type)
+        : ptr_(static_cast<uint8_t*>(memory::Alloc(place, size)),
+               memory::PODDeleter<uint8_t, Place>(place)),
          place_(place),
-          size_(size) {
+          size_(size),
+          type_(type) {
      PADDLE_ENFORCE_NOT_NULL(ptr_, "Insufficient %s memory to allocation.",
                              (is_cpu_place(place_) ? "CPU" : "GPU"));
    }
@@ -176,16 +162,20 @@ class Tensor {
    virtual size_t size() const { return size_; }
    virtual platform::Place place() const { return place_; }
    virtual void* ptr() const { return static_cast<void*>(ptr_.get()); }
-    virtual std::type_index type() const { return std::type_index(typeid(T)); }
+    virtual std::type_index type() const { return type_; }
+    virtual void set_type(std::type_index type) { type_ = type; }

    /*! the pointer of memory block. */
-    std::unique_ptr<T, memory::PODDeleter<T, Place>> ptr_;
+    std::unique_ptr<uint8_t, memory::PODDeleter<uint8_t, Place>> ptr_;

    /*! the place of memory block. */
    platform::Place place_;

    /*! the size of memory block. */
    size_t size_;
+
+    /*! the current type of the memory block. */
+    std::type_index type_;
  };

  /*! holds the memory block if allocated. */
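
Note on the `Placeholder` change above: the holder no longer bakes the element type into a template parameter. It stores raw bytes plus a `std::type_index` that can be updated later. A rough sketch of that idea is below, assuming a CPU-only buffer owned through `std::unique_ptr<std::uint8_t[]>`; `TypedBufferSketch` and its members are hypothetical names, not part of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <typeindex>
#include <typeinfo>

// Byte buffer that remembers which element type it currently holds.
class TypedBufferSketch {
 public:
  TypedBufferSketch(std::size_t bytes, std::type_index type)
      : bytes_(new std::uint8_t[bytes]), size_(bytes), type_(type) {
    assert(bytes > 0);
  }

  std::type_index type() const { return type_; }
  std::size_t size() const { return size_; }

  // Re-tag the same memory with another element type, as set_type() does.
  void set_type(std::type_index type) { type_ = type; }

  // Typed access is checked at run time instead of at compile time.
  template <typename T>
  T* data() {
    if (type_ != std::type_index(typeid(T))) {
      throw std::runtime_error("buffer holds a different type");
    }
    return reinterpret_cast<T*>(bytes_.get());
  }

 private:
  std::unique_ptr<std::uint8_t[]> bytes_;
  std::size_t size_;
  std::type_index type_;
};
```

The trade-off is that a type mismatch surfaces as a run-time error rather than a compile error, which is what allows callers such as `CopyFrom` and `Slice` to drop their template argument.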

--- a/paddle/framework/tensor_array.cc
+++ b/paddle/framework/tensor_array.cc
@@ -106,8 +106,8 @@ void TensorArray::Write(size_t index, const LoDTensor& value) {

  values_[index].Resize(value.dims());
  values_[index].mutable_data<value_type>(platform::CPUPlace());
-  values_[index].CopyFrom<value_type>(value, platform::CPUPlace(),
-                                      platform::CPUDeviceContext());
+  values_[index].CopyFrom(value, platform::CPUPlace(),
+                          platform::CPUDeviceContext());
 }

 void TensorArray::WriteShared(size_t index, const LoDTensor& value) {
@@ -116,7 +116,7 @@ void TensorArray::WriteShared(size_t index, const LoDTensor& value) {
    values_.resize(index + 1);
  }

-  values_[index].ShareDataWith<value_type>(value);
+  values_[index].ShareDataWith(value);
 }

 LoDTensor TensorArray::Pack(size_t level, const std::vector<DySeqMeta>& meta,
@@ -163,9 +163,9 @@ LoDTensor TensorArray::Stack() const {
  result.mutable_data<value_type>(platform::CPUPlace());

  for (size_t idx = 0; idx < size(); idx++) {
-    result.Slice<value_type>(idx, idx + 1)
-        .CopyFrom<value_type>(Read(idx), platform::CPUPlace(),
-                              platform::CPUDeviceContext());
+    result.Slice(idx, idx + 1)
+        .CopyFrom(Read(idx), platform::CPUPlace(),
+                  platform::CPUDeviceContext());
  }
  return result;
 }
@@ -191,13 +191,12 @@ void TensorArray::Unstack(const LoDTensor& source, bool data_shared) const {
    auto& value = values_[elem];
    if (data_shared) {
      // share memory
-      value.ShareDataWith<value_type>(source.Slice<value_type>(elem, elem + 1));
+      value.ShareDataWith(source.Slice(elem, elem + 1));
    } else {
      // copy
      value.Resize(value_dims);
-      value.CopyFrom<value_type>(source.Slice<value_type>(elem, elem + 1),
-                                 platform::CPUPlace(),
-                                 platform::CPUDeviceContext());
+      value.CopyFrom(source.Slice(elem, elem + 1), platform::CPUPlace(),
+                     platform::CPUDeviceContext());
    }
  }
 }
@@ -242,11 +241,10 @@ LoDTensor DynamicBatchUnpacker::GetBatch(size_t index) {

  for (size_t i = 0; i < indice.size(); i++) {
    auto index = indice[i];
-    auto target = result.Slice<value_type>(i, i + 1);
-    auto slice = source->Slice<value_type>(index, index + 1);
+    auto target = result.Slice(i, i + 1);
+    auto slice = source->Slice(index, index + 1);

-    target.CopyFrom<value_type>(slice, platform::CPUPlace(),
-                                platform::CPUDeviceContext());
+    target.CopyFrom(slice, platform::CPUPlace(), platform::CPUDeviceContext());
  }

  return result;
@@ -277,10 +275,10 @@ LoDTensor PackDynamicBatch(const std::vector<LoDTensor>& source,
      // target is result[index]
      auto index = seq_meta.begin + batch_id;
      if (index >= seq_meta.end) break;
-      auto source_ = source[batch_id].Slice<float>(seq_id, seq_id + 1);
-      auto target = result.Slice<float>(index, index + 1);
-      target.CopyFrom<float>(source_, platform::CPUPlace(),
-                             platform::CPUDeviceContext());
+      auto source_ = source[batch_id].Slice(seq_id, seq_id + 1);
+      auto target = result.Slice(index, index + 1);
+      target.CopyFrom(source_, platform::CPUPlace(),
+                      platform::CPUDeviceContext());
    }
  }


--- a/paddle/framework/tensor_array_test.cc
+++ b/paddle/framework/tensor_array_test.cc
@@ -91,7 +91,7 @@ class TensorArrayPackTester : public ::testing::Test {
      size_t begin = level[i];
      size_t end = level[i + 1];
      for (size_t j = begin; j < end; j++) {
-        auto record = source.Slice<int>(j, j + 1);
+        auto record = source.Slice(j, j + 1);
        for (int dim = 0; dim < 128; dim++) {
          record.mutable_data<int>(platform::CPUPlace())[dim] = j - begin;
        }

--- a/paddle/framework/tensor_impl.h
+++ b/paddle/framework/tensor_impl.h
@@ -19,12 +19,50 @@ limitations under the License. */
 namespace paddle {
 namespace framework {

+template <typename... T>
+struct SizeOfTypeFunctor;
+
 template <typename T>
+struct SizeOfTypeFunctor<T> {
+  size_t operator()(std::type_index type) const {
+    if (typeid(T).hash_code() == type.hash_code()) {
+      return sizeof(T);
+    } else {
+      return 0UL;
+    }
+  }
+};
+
+template <>
+struct SizeOfTypeFunctor<> {
+  size_t operator()(std::type_index type) const { return 0UL; }
+};
+
+template <typename HEAD, typename... TAIL>
+struct SizeOfTypeFunctor<HEAD, TAIL...> {
+  size_t operator()(std::type_index type) const {
+    SizeOfTypeFunctor<HEAD> head;
+    size_t head_size = head(type);
+    if (head_size != 0) {
+      return head_size;
+    }
+    SizeOfTypeFunctor<TAIL...> tail;
+    return tail(type);
+  }
+};
+
+static inline size_t SizeOfType(std::type_index type) {
+  SizeOfTypeFunctor<int, float, double, int16_t, int64_t> functor;
+  size_t size = functor(type);
+  PADDLE_ENFORCE(size != 0UL, "Cannot get size of type %s", type.name());
+  return size;
+}
+
 inline void Tensor::check_memory_size() const {
  PADDLE_ENFORCE_NOT_NULL(
      holder_, "Tensor holds no memory. Call Tensor::mutable_data first.");
  PADDLE_ENFORCE_GE(
-      holder_->size(), numel() * sizeof(T) + offset_,
+      holder_->size(), numel() * SizeOfType(type()) + offset_,
      "Tensor's dims_ is out of bound. Call Tensor::mutable_data "
      "first to re-allocate memory.\n"
      "or maybe the required data-type mismatches the data already stored.");
@@ -32,14 +70,23 @@ inline void Tensor::check_memory_size() const {

 template <typename T>
 inline const T* Tensor::data() const {
-  check_memory_size<T>();
+  check_memory_size();
+  PADDLE_ENFORCE(std::is_same<T, void>::value ||
+                     holder_->type().hash_code() == typeid(T).hash_code(),
+                 "Tensor holds the wrong type, it holds %s",
+                 this->holder_->type().name());
+
  return reinterpret_cast<const T*>(
      reinterpret_cast<uintptr_t>(holder_->ptr()) + offset_);
 }

 template <typename T>
 inline T* Tensor::data() {
-  check_memory_size<T>();
+  check_memory_size();
+  PADDLE_ENFORCE(std::is_same<T, void>::value ||
+                     holder_->type().hash_code() == typeid(T).hash_code(),
+                 "Tensor holds the wrong type, it holds %s",
+                 this->holder_->type().name());
  return reinterpret_cast<T*>(reinterpret_cast<uintptr_t>(holder_->ptr()) +
                              offset_);
 }
@@ -54,51 +101,62 @@ inline T* Tensor::mutable_data(DDim dims, platform::Place place) {
 template <typename T>
 inline T* Tensor::mutable_data(platform::Place place) {
  static_assert(std::is_pod<T>::value, "T must be POD");
+  return reinterpret_cast<T*>(mutable_data(place, typeid(T)));
+}
+
+inline void* Tensor::mutable_data(platform::Place place, std::type_index type) {
+  if (holder_ != nullptr) {
+    holder_->set_type(type);
+  }
  PADDLE_ENFORCE_GT(numel(), 0,
                    "Tensor's numel must be larger than zero to call "
                    "Tensor::mutable_data. Call Tensor::set_dim first.");
+  int64_t size = numel() * SizeOfType(type);
  /* some versions of boost::variant don't have operator!= */
-  int64_t size = numel() * sizeof(T);
  if (holder_ == nullptr || !(holder_->place() == place) ||
      holder_->size() < size + offset_) {
    if (platform::is_cpu_place(place)) {
-      holder_.reset(new PlaceholderImpl<T, platform::CPUPlace>(
-          boost::get<platform::CPUPlace>(place), size));
+      holder_.reset(new PlaceholderImpl<platform::CPUPlace>(
+          boost::get<platform::CPUPlace>(place), size, type));
    } else if (platform::is_gpu_place(place)) {
 #ifndef PADDLE_WITH_CUDA
      PADDLE_THROW("'GPUPlace' is not supported in CPU only device.");
    }
 #else
-      holder_.reset(new PlaceholderImpl<T, platform::GPUPlace>(
-          boost::get<platform::GPUPlace>(place), size));
+      holder_.reset(new PlaceholderImpl<platform::GPUPlace>(
+          boost::get<platform::GPUPlace>(place), size, type));
    }
 #endif
    offset_ = 0;
  }
-  return reinterpret_cast<T*>(reinterpret_cast<uintptr_t>(holder_->ptr()) +
-                              offset_);
+  return reinterpret_cast<void*>(reinterpret_cast<uintptr_t>(holder_->ptr()) +
+                                 offset_);
+}
+
+inline void* Tensor::mutable_data(platform::Place place) {
+  PADDLE_ENFORCE(this->holder_ != nullptr,
+                 "Cannot invoke mutable_data if the tensor holds no memory");
+  return mutable_data(place, holder_->type());
 }

-template <typename T>
 inline Tensor& Tensor::ShareDataWith(const Tensor& src) {
-  src.check_memory_size<T>();
+  src.check_memory_size();
  *this = src;
  return *this;
 }

-template <typename T>
 inline void Tensor::CopyFrom(const Tensor& src,
                             const platform::Place& dst_place,
                             const platform::DeviceContext& ctx) {
-  src.check_memory_size<T>();
+  src.check_memory_size();
  Resize(src.dims());

  auto src_place = src.holder_->place();
-  auto src_ptr = static_cast<const void*>(src.data<T>());
+  auto src_ptr = src.data<void>();

-  auto dst_ptr = static_cast<void*>(mutable_data<T>(dst_place));
+  auto dst_ptr = mutable_data(dst_place, src.type());

-  auto size = src.numel() * sizeof(T);
+  auto size = src.numel() * SizeOfType(src.type());

  if (platform::is_cpu_place(src_place) && platform::is_cpu_place(dst_place)) {
    memory::Copy(boost::get<platform::CPUPlace>(dst_place), dst_ptr,
@@ -165,9 +223,8 @@ inline void Tensor::CopyFromVector(const std::vector<T>& src,
 #endif
 }

-template <typename T>
 inline Tensor Tensor::Slice(const int& begin_idx, const int& end_idx) const {
-  check_memory_size<T>();
+  check_memory_size();
  PADDLE_ENFORCE_GE(begin_idx, 0, "Slice begin index is less than zero.");
  PADDLE_ENFORCE_LE(end_idx, dims_[0], "Slice end index is out of bound.");
  PADDLE_ENFORCE_LT(begin_idx, end_idx,
@@ -182,7 +239,7 @@ inline Tensor Tensor::Slice(const int& begin_idx, const int& end_idx) const {
    DDim dst_dims = dims_;
    dst_dims[0] = end_idx - begin_idx;
    dst.Resize(dst_dims);
-    dst.offset_ = offset_ + begin_idx * base * sizeof(T);
+    dst.offset_ = offset_ + begin_idx * base * SizeOfType(type());
    return dst;
  }
 }
@@ -196,10 +253,9 @@ inline const DDim& Tensor::dims() const { return dims_; }

 inline int64_t Tensor::numel() const { return product(dims_); }

-template <typename T>
 inline Tensor ReshapeToMatrix(const Tensor& src, int num_col_dims) {
  Tensor res;
-  res.ShareDataWith<T>(src);
+  res.ShareDataWith(src);
  res.Resize(flatten_to_2d(src.dims(), num_col_dims));
  return res;
 }
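
Note on `SizeOfTypeFunctor` above: it walks a compile-time list of types and returns `sizeof` for the first one whose run-time type matches, or 0 if none does. The sketch below shows the same recursion in a compact standalone form. It is an illustration, not the patch's code: the names are hypothetical, it compares `std::type_index` values directly rather than `hash_code()`, and it uses a plain `assert` where the patch uses `PADDLE_ENFORCE`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <typeindex>
#include <typeinfo>

// Primary template: declared only; the specializations below do the work.
template <typename... Ts>
struct SizeOfTypeSketch;

// Base case: no candidate types left, so report "unknown" as 0.
template <>
struct SizeOfTypeSketch<> {
  std::size_t operator()(std::type_index) const { return 0; }
};

// Recursive case: test the head type, then fall through to the tail.
template <typename Head, typename... Tail>
struct SizeOfTypeSketch<Head, Tail...> {
  std::size_t operator()(std::type_index type) const {
    if (type == std::type_index(typeid(Head))) {
      return sizeof(Head);
    }
    return SizeOfTypeSketch<Tail...>()(type);
  }
};

// Same candidate list as the patch; returns 0 for unsupported types.
inline std::size_t SizeOfTypeOrZero(std::type_index type) {
  return SizeOfTypeSketch<int, float, double, std::int16_t, std::int64_t>()(
      type);
}

// Checked variant: an unsupported type is treated as a programming error.
inline std::size_t SizeOfTypeChecked(std::type_index type) {
  std::size_t size = SizeOfTypeOrZero(type);
  assert(size != 0 && "type is not in the supported list");
  return size;
}
```

Supporting another element type means adding it to the list in `SizeOfTypeOrZero`; anything absent from that list is rejected by the checked variant.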

--- a/paddle/framework/tensor_test.cc
+++ b/paddle/framework/tensor_test.cc
@@ -108,7 +108,7 @@ TEST(Tensor, ShareDataWith) {
    // Try to share data form uninitialized tensor
    bool caught = false;
    try {
-      dst_tensor.ShareDataWith<float>(src_tensor);
+      dst_tensor.ShareDataWith(src_tensor);
    } catch (paddle::platform::EnforceNotMet err) {
      caught = true;
      std::string msg =
@@ -122,7 +122,7 @@ TEST(Tensor, ShareDataWith) {
    ASSERT_TRUE(caught);

    src_tensor.mutable_data<int>(make_ddim({2, 3, 4}), CPUPlace());
-    dst_tensor.ShareDataWith<int>(src_tensor);
+    dst_tensor.ShareDataWith(src_tensor);
    ASSERT_EQ(src_tensor.data<int>(), dst_tensor.data<int>());
  }

@@ -131,7 +131,7 @@ TEST(Tensor, ShareDataWith) {
    Tensor src_tensor;
    Tensor dst_tensor;
    src_tensor.mutable_data<int>(make_ddim({2, 3, 4}), GPUPlace());
-    dst_tensor.ShareDataWith<int>(src_tensor);
+    dst_tensor.ShareDataWith(src_tensor);
    ASSERT_EQ(src_tensor.data<int>(), dst_tensor.data<int>());
  }
 #endif
@@ -143,7 +143,7 @@ TEST(Tensor, Slice) {
  {
    Tensor src_tensor;
    src_tensor.mutable_data<int>(make_ddim({5, 3, 4}), CPUPlace());
-    Tensor slice_tensor = src_tensor.Slice<int>(1, 3);
+    Tensor slice_tensor = src_tensor.Slice(1, 3);
    DDim slice_dims = slice_tensor.dims();
    ASSERT_EQ(arity(slice_dims), 3);
    EXPECT_EQ(slice_dims[0], 2);
@@ -167,7 +167,7 @@ TEST(Tensor, Slice) {
  {
    Tensor src_tensor;
    src_tensor.mutable_data<double>(make_ddim({6, 9}), GPUPlace());
-    Tensor slice_tensor = src_tensor.Slice<double>(2, 6);
+    Tensor slice_tensor = src_tensor.Slice(2, 6);
    DDim slice_dims = slice_tensor.dims();
    ASSERT_EQ(arity(slice_dims), 2);
    EXPECT_EQ(slice_dims[0], 4);
@@ -202,7 +202,7 @@ TEST(Tensor, CopyFrom) {
    memcpy(src_ptr, arr, 9 * sizeof(int));

    auto cpu_place = new paddle::platform::CPUPlace();
-    dst_tensor.CopyFrom<int>(src_tensor, *cpu_place, cpu_ctx);
+    dst_tensor.CopyFrom(src_tensor, *cpu_place, cpu_ctx);

    const int* dst_ptr = dst_tensor.data<int>();
    ASSERT_NE(src_ptr, dst_ptr);
@@ -210,8 +210,8 @@ TEST(Tensor, CopyFrom) {
      EXPECT_EQ(src_ptr[i], dst_ptr[i]);
    }

-    Tensor slice_tensor = src_tensor.Slice<int>(1, 2);
-    dst_tensor.CopyFrom<int>(slice_tensor, *cpu_place, cpu_ctx);
+    Tensor slice_tensor = src_tensor.Slice(1, 2);
+    dst_tensor.CopyFrom(slice_tensor, *cpu_place, cpu_ctx);
    const int* slice_ptr = slice_tensor.data<int>();
    dst_ptr = dst_tensor.data<int>();
    ASSERT_NE(dst_ptr, slice_ptr);
@@ -233,11 +233,11 @@ TEST(Tensor, CopyFrom) {
    // CPU Tensor to GPU Tensor
    auto gpu_place = new paddle::platform::GPUPlace(0);
    CUDADeviceContext gpu_ctx(*gpu_place);
-    gpu_tensor.CopyFrom<int>(src_tensor, *gpu_place, gpu_ctx);
+    gpu_tensor.CopyFrom(src_tensor, *gpu_place, gpu_ctx);

    // GPU Tensor to CPU Tensor
    auto cpu_place = new paddle::platform::CPUPlace();
-    dst_tensor.CopyFrom<int>(gpu_tensor, *cpu_place, gpu_ctx);
+    dst_tensor.CopyFrom(gpu_tensor, *cpu_place, gpu_ctx);

    // Sync before Compare Tensors
    gpu_ctx.Wait();
@@ -247,13 +247,13 @@ TEST(Tensor, CopyFrom) {
      EXPECT_EQ(src_ptr[i], dst_ptr[i]);
    }

-    Tensor slice_tensor = src_tensor.Slice<int>(1, 2);
+    Tensor slice_tensor = src_tensor.Slice(1, 2);

    // CPU Slice Tensor to GPU Tensor
-    gpu_tensor.CopyFrom<int>(slice_tensor, *gpu_place, gpu_ctx);
+    gpu_tensor.CopyFrom(slice_tensor, *gpu_place, gpu_ctx);

    // GPU Tensor to CPU Tensor
-    dst_tensor.CopyFrom<int>(gpu_tensor, *cpu_place, gpu_ctx);
+    dst_tensor.CopyFrom(gpu_tensor, *cpu_place, gpu_ctx);

    // Sync before Compare Slice Tensors
    gpu_ctx.Wait();
@@ -320,7 +320,7 @@ TEST(Tensor, CopyFromVector) {
    CUDADeviceContext gpu_ctx(*gpu_place);
    gpu_tensor.CopyFromVector<int>(src_vec, gpu_ctx);
    // Copy from GPU to CPU tensor for comparison
-    dst_tensor.CopyFrom<int>(gpu_tensor, *cpu_place, gpu_ctx);
+    dst_tensor.CopyFrom(gpu_tensor, *cpu_place, gpu_ctx);

    // Sync before Compare Tensors
    gpu_ctx.Wait();
@@ -340,7 +340,7 @@ TEST(Tensor, CopyFromVector) {
    cpu_tensor.CopyFromVector<int>(src_vec, cpu_ctx);
    gpu_tensor.Resize(make_ddim({2, 2}));
    gpu_tensor.CopyFromVector<int>(src_vec, gpu_ctx);
-    dst_tensor.CopyFrom<int>(gpu_tensor, *cpu_place, gpu_ctx);
+    dst_tensor.CopyFrom(gpu_tensor, *cpu_place, gpu_ctx);

    // Sync before Compare Tensors
    gpu_ctx.Wait();
@@ -368,7 +368,7 @@ TEST(Tensor, ReshapeToMatrix) {
  for (int i = 0; i < 2 * 3 * 4 * 9; ++i) {
    src_ptr[i] = i;
  }
-  Tensor res = ReshapeToMatrix<int>(src, 2);
+  Tensor res = ReshapeToMatrix(src, 2);
  ASSERT_EQ(res.dims()[0], 2 * 3);
  ASSERT_EQ(res.dims()[1], 4 * 9);
 }
--- a/paddle/framework/variable.h
+++ b/paddle/framework/variable.h
@@ -25,7 +25,10 @@ class Variable {
 public:
  template <typename T>
  const T& Get() const {
-    PADDLE_ENFORCE(IsType<T>(), "Variable must be type %s", typeid(T).name());
+    PADDLE_ENFORCE(holder_ != nullptr, "Variable must hold something");
+    PADDLE_ENFORCE(IsType<T>(),
+                   "Variable must be type %s, the holding type is %s",
+                   typeid(T).name(), holder_->Type().name());
    return *static_cast<const T*>(holder_->Ptr());
  }


--- a/paddle/gserver/activations/MKLDNNActivation.cpp
+++ b/paddle/gserver/activations/MKLDNNActivation.cpp
@@ -126,7 +126,7 @@ void MKLDNNEltwiseActivation::resetFwd(Argument& act) {
  copyInVal_ = nullptr;
  if (act.grad && algo == algorithm::eltwise_tanh) {
    // tanh need save src input for backward
-    inVal_ = MKLDNNMatrix::create(nullptr, val_->getPrimitiveDesc());
+    inVal_ = MKLDNNMatrix::create(val_->getPrimitiveDesc());
    copyInVal_ = std::make_shared<mkldnn::reorder>(*val_, *inVal_);
    CHECK(copyInVal_) << "should not be emptry";
    pipelineFwd_.push_back(*copyInVal_);
@@ -145,7 +145,7 @@ void MKLDNNEltwiseActivation::resetBwd(Argument& act) {
  algorithm algo = getAlgo(this->getName());
  float alpha = getBwdAlpha();
  float beta = getBeta();
-  grad_ = MKLDNNMatrix::create(act.grad, val_->getPrimitiveDesc());
+  grad_ = MKLDNNMatrix::create(val_->getPrimitiveDesc(), act.grad);
  auto eng = CPUEngine::Instance().getEngine();
  auto bwdDesc = eltwise_bwd::desc(
      algo, grad_->getMemoryDesc(), val_->getMemoryDesc(), alpha, beta);
@@ -230,7 +230,7 @@ void MKLDNNActivation::resetFwd(Argument& act) {
    int ic = cnt_ / bs / ih / iw;
    CHECK_EQ(cnt_, (size_t)bs * ic * ih * iw);
    val_ = MKLDNNMatrix::create(
-        act.value, {bs, ic, ih, iw}, mkldnn::memory::format::nchw, *engine_);
+        {bs, ic, ih, iw}, mkldnn::memory::format::nchw, *engine_, act.value);
    CHECK(val_);
    val_->downSpatial();
  }

--- a/paddle/gserver/layers/MKLDNNBase.h
+++ b/paddle/gserver/layers/MKLDNNBase.h
@@ -21,8 +21,8 @@ namespace paddle {
 typedef enum {
  MKLDNN_BASE = 1,   // basical info of MKLDNN
  MKLDNN_TESTS = 1,  // gtest info of MKLDNN
-  MKLDNN_SIZES = 2,  // size info of MKLDNN
-  MKLDNN_FMTS = 3,   // format info of MKLDNN
+  MKLDNN_FMTS = 2,   // format info of MKLDNN
+  MKLDNN_SIZES = 3,  // size info of MKLDNN
  MKLDNN_ALL = 4,    // show all info of MKLDNN
 } MKLDNN_LOG_LEVEL;


--- a/paddle/gserver/layers/MKLDNNConvLayer.cpp
+++ b/paddle/gserver/layers/MKLDNNConvLayer.cpp
@@ -116,8 +116,6 @@ void MKLDNNConvLayer::resetFwd(std::vector<primitive>& pipeline,
  resetFwdBuffers(fwdPD_, in, wgt, bias, out);

  resetFwdPipeline(pipeline, fwdPD_, in, wgt, bias, out);
-
-  printValueFormatFlow();
 }

 void MKLDNNConvLayer::resetBwd(std::vector<primitive>& pipeline,
@@ -135,12 +133,6 @@ void MKLDNNConvLayer::resetBwd(std::vector<primitive>& pipeline,
  resetBwdBuffers(bwdWgtPD, bwdDataPD, in, wgt, bias, out);

  resetBwdPipeline(pipeline, bwdWgtPD, bwdDataPD, in, wgt, bias, out);
-
-  printGradFormatFlow();
-}
-
-void MKLDNNConvLayer::updateInputData() {
-  cpuInVal_->setData(getInputValue(0, CPU_DEVICE)->getData());
 }

 void MKLDNNConvLayer::updateWeights(const UpdateCallback& callback) {
@@ -211,11 +203,18 @@ void MKLDNNConvLayer::resetFwdBuffers(
    MKLDNNMatrixPtr& bias,
    MKLDNNMatrixPtr& out) {
  CHECK(pd);
-  resetInValue(pd, in);
+  resetInValue(
+      in, std::make_shared<memory::primitive_desc>(pd->src_primitive_desc()));
+
+  resetOutValue(out, pd->dst_primitive_desc());

-  resetWgtBiasValue(pd, wgt, bias);
+  resetWithMatrix(wgt, weight_->getW(), pd->weights_primitive_desc());

-  resetOutValue(pd, out);
+  if (biases_ && biases_->getW()) {
+    resetWithMatrix(bias, biases_->getW(), pd->bias_primitive_desc());
+  } else {
+    bias = nullptr;
+  }
 }

 void MKLDNNConvLayer::resetFwdPipeline(
@@ -225,104 +224,12 @@ void MKLDNNConvLayer::resetFwdPipeline(
    MKLDNNMatrixPtr& wgt,
    MKLDNNMatrixPtr& bias,
    MKLDNNMatrixPtr& out) {
-  if (cvtInVal_) {
-    pipeline.push_back(*cvtInVal_);
-  }
-
  if (bias) {
    fwd_.reset(new conv_fwd(*pd, *in, *wgt, *bias, *out));
  } else {
    fwd_.reset(new conv_fwd(*pd, *in, *wgt, *out));
  }
  pipeline.push_back(*fwd_);
-
-  if (cvtOutVal_) {
-    pipeline.push_back(*cvtOutVal_);
-  }
-}
-
-void MKLDNNConvLayer::resetInValue(
-    std::shared_ptr<conv_fwd::primitive_desc>& pd, MKLDNNMatrixPtr& in) {
-  const MatrixPtr& inMat = inputLayers_[0]->getOutputValue();
-  in = MKLDNNMatrix::create(inMat, pd->src_primitive_desc());
-
-  // create buffer and reorder if input value do not match
-  cpuInVal_ = nullptr;
-  cvtInVal_ = nullptr;
-
-  MKLDNNMatrixPtr dnnIn = std::dynamic_pointer_cast<MKLDNNMatrix>(inMat);
-  CHECK_EQ(inputIsOnlyMKLDNN(), dnnIn != nullptr);
-  if (dnnIn != nullptr && dnnIn->getPrimitiveDesc() == in->getPrimitiveDesc()) {
-    in = dnnIn;
-    return;
-  }
-  if (dnnIn) {
-    if (dnnIn->getFormat() == format::nc) {
-      CHECK(ih_ == 1 && iw_ == 1) << "when input is nc format";
-      // create a new one with nchw format and same data
-      memory::dims inDims = memory::dims{bs_, ic_, 1, 1};
-      dnnIn = MKLDNNMatrix::create(inMat, inDims, format::nchw, engine_);
-    }
-    if (dnnIn->getPrimitiveDesc() == in->getPrimitiveDesc()) {
-      in = dnnIn;
-      return;
-    }
-    cpuInVal_ = dnnIn;
-    in = MKLDNNMatrix::create(nullptr, pd->src_primitive_desc());
-    cvtInVal_ = MKLDNNMatrix::createReorder(cpuInVal_, in);
-    CHECK(cvtInVal_) << "should not be emptry";
-  } else {
-    memory::dims inDims = memory::dims{bs_, ic_, ih_, iw_};
-    cpuInVal_ = MKLDNNMatrix::create(inMat, inDims, format::nchw, engine_);
-    if (cpuInVal_->getPrimitiveDesc() != in->getPrimitiveDesc()) {
-      // create new mkldnn matrix
-      in = MKLDNNMatrix::create(nullptr, pd->src_primitive_desc());
-      cvtInVal_ = MKLDNNMatrix::createReorder(cpuInVal_, in);
-      CHECK(cvtInVal_) << "should not be emptry";
-    } else {
-      in = cpuInVal_;
-    }
-  }
-}
-
-void MKLDNNConvLayer::resetWgtBiasValue(
-    std::shared_ptr<conv_fwd::primitive_desc>& pd,
-    MKLDNNMatrixPtr& wgt,
-    MKLDNNMatrixPtr& bias) {
-  wgt = MKLDNNMatrix::create(weight_->getW(), pd->weights_primitive_desc());
-  VLOG(MKLDNN_FMTS) << "Weight value format: " << wgt->getFormat();
-
-  bias = (biases_ && biases_->getW())
-             ? MKLDNNMatrix::create(biases_->getW(), pd->bias_primitive_desc())
-             : nullptr;
-}
-
-void MKLDNNConvLayer::resetOutValue(
-    std::shared_ptr<conv_fwd::primitive_desc>& pd, MKLDNNMatrixPtr& out) {
-  out = MKLDNNMatrix::create(output_.value, pd->dst_primitive_desc());
-
-  // create reorder if output value has cpu device and pd do not match
-  cpuOutVal_ = nullptr;
-  cvtOutVal_ = nullptr;
-  if (!outputIsOnlyMKLDNN()) {
-    const MatrixPtr& cpuOut = getOutput(CPU_DEVICE).value;
-    memory::dims outDims = memory::dims{bs_, oc_, oh_, ow_};
-    cpuOutVal_ = MKLDNNMatrix::create(cpuOut, outDims, format::nchw, engine_);
-    if (cpuOutVal_->getPrimitiveDesc() != pd->dst_primitive_desc()) {
-      out = MKLDNNMatrix::create(nullptr, pd->dst_primitive_desc());
-      cvtOutVal_ = MKLDNNMatrix::createReorder(out, cpuOutVal_);
-      CHECK(cvtOutVal_) << "should not be empty";
-    } else {
-      cpuOut->setData(output_.value->getData());
-      cpuOutVal_ = out;
-    }
-    // when output is cpu device, change the mkldnn output value and make them
-    // share the same data. Then if next layer use inputlayer->getOuputValue()
-    // to achieve the input value, it will get the right data.
-    output_.value = std::dynamic_pointer_cast<Matrix>(cpuOutVal_);
-    return;
-  }
-  output_.value = std::dynamic_pointer_cast<Matrix>(out);
 }

 void MKLDNNConvLayer::resetBwdWgtPD(
@@ -331,8 +238,8 @@ void MKLDNNConvLayer::resetBwdWgtPD(
  loadConvSettings(wgtDims, biasDims, strides, dilations, padL, padR);

  // create backward weight using input, output and weight value memory desc
-  CHECK(inVal_) << "Should have input value";
-  CHECK(outVal_) << "Should have output value";
+  CHECK(inVal_) << "Should have internal input value";
+  CHECK(outVal_) << "Should have internal output value";
  CHECK(wgtVal_) << "Should have weight value";
  algorithm algo = algorithm::convolution_direct;
  padding_kind padKind = padding_kind::zero;
@@ -372,8 +279,8 @@ void MKLDNNConvLayer::resetBwdDataPD(

  memory::dims wgtDims, biasDims, strides, dilations, padL, padR;
  loadConvSettings(wgtDims, biasDims, strides, dilations, padL, padR);
-  CHECK(inVal_) << "Should have input value";
-  CHECK(outVal_) << "Should have output value";
+  CHECK(inVal_) << "Should have internal input value";
+  CHECK(outVal_) << "Should have internal output value";
  // create backward data using input and output value memory desc
  // but using weight memory desc with any format
  auto bwdDataDesc = conv_bwdData::desc(algorithm::convolution_direct,
@@ -399,12 +306,27 @@ void MKLDNNConvLayer::resetBwdBuffers(
    MKLDNNMatrixPtr& bias,
    MKLDNNMatrixPtr& out) {
  CHECK(wgtPD);
-  resetOutGrad(wgtPD, out);
+  resetOutGrad(out, wgtPD->diff_dst_primitive_desc());

-  resetWgtBiasGrad(wgtPD, wgt, bias);
+  resetWithMatrix(
+      wgt, weight_->getWGrad(), wgtPD->diff_weights_primitive_desc());
+  CHECK(wgtVal_ != nullptr &&
+        wgt->getPrimitiveDesc() == wgtVal_->getPrimitiveDesc())
+      << "primitive desc of weight grad and value should be equal";

-  resetInGrad(dataPD, in);
+  bias = nullptr;
+  if (biases_ && biases_->getWGrad()) {
+    resetWithMatrix(
+        bias, biases_->getWGrad(), wgtPD->diff_bias_primitive_desc());
+    CHECK(bias && biasVal_ &&
+          bias->getPrimitiveDesc() == biasVal_->getPrimitiveDesc())
+        << "primitive desc of bias grad should equal the bias value";
+  }

+  if (dataPD == nullptr) {
+    return;
+  }
+  resetInGrad(in, dataPD->diff_src_primitive_desc());
  resetWgtValBwdData(dataPD, wgtValBwdData_);
 }

@@ -416,10 +338,7 @@ void MKLDNNConvLayer::resetBwdPipeline(
    MKLDNNMatrixPtr& wgt,
    MKLDNNMatrixPtr& bias,
    MKLDNNMatrixPtr& out) {
-  if (cvtOutGrad_) {
-    pipeline.push_back(*cvtOutGrad_);
-  }
-
+  CHECK(inVal_);
  // add bwdWgt handle
  if (bias) {
    bwdWgt_.reset(new conv_bwdWgt(*wgtPD, *inVal_, *out, *wgt, *bias));
@@ -431,99 +350,13 @@ void MKLDNNConvLayer::resetBwdPipeline(
  if (dataPD == nullptr) {
    return;
  }
-
  if (cvtWgtVal_) {
    pipeline.push_back(*cvtWgtVal_);
  }
-
  // add bwdData handle
  CHECK(wgtValBwdData_) << "Should have weight memory";
  bwdData_.reset(new conv_bwdData(*dataPD, *out, *wgtValBwdData_, *in));
  pipeline.push_back(*bwdData_);
-
-  if (cvtInGrad_) {
-    pipeline.push_back(*cvtInGrad_);
-  }
-}
-
-void MKLDNNConvLayer::resetOutGrad(
-    std::shared_ptr<conv_bwdWgt::primitive_desc>& wgtPD, MKLDNNMatrixPtr& out) {
-  cpuOutGrad_ = nullptr;
-  cvtOutGrad_ = nullptr;
-  CHECK(outVal_ != nullptr &&
-        outVal_->getPrimitiveDesc() == wgtPD->diff_dst_primitive_desc())
-      << "primitive desc of out grad and value should be equal";
-  if (outputIsOnlyMKLDNN()) {
-    MKLDNNLayer::resetOutGrad(out, outVal_->getPrimitiveDesc());
-  } else {
-    const MatrixPtr& cpuOut = getOutput(CPU_DEVICE).grad;
-    // always share the same grad data of CPU output
-    // then the activation can get the right grad from output_.grad
-    output_.grad->setData(cpuOut->getData());
-    // same PrimitiveDesc with cpuInVal_
-    CHECK(cpuOutVal_);
-    cpuOutGrad_ = MKLDNNMatrix::create(cpuOut, cpuOutVal_->getPrimitiveDesc());
-    // create reorder if primitive desc does not match
-    if (cpuOutGrad_->getPrimitiveDesc() != outVal_->getPrimitiveDesc()) {
-      out = MKLDNNMatrix::create(nullptr, outVal_->getPrimitiveDesc());
-      cvtOutGrad_ = MKLDNNMatrix::createReorder(cpuOutGrad_, out);
-      CHECK(cvtOutGrad_);
-    } else {
-      out = cpuOutGrad_;
-    }
-  }
-}
-
-void MKLDNNConvLayer::resetWgtBiasGrad(
-    std::shared_ptr<conv_bwdWgt::primitive_desc>& wgtPD,
-    MKLDNNMatrixPtr& wgt,
-    MKLDNNMatrixPtr& bias) {
-  wgt = MKLDNNMatrix::create(weight_->getWGrad(),
-                             wgtPD->diff_weights_primitive_desc());
-  CHECK(nullptr != wgtVal_ &&
-        wgt->getPrimitiveDesc() == wgtVal_->getPrimitiveDesc())
-      << "primitive desc of weight grad and value should be equal";
-  VLOG(MKLDNN_FMTS) << "weight grad format: " << wgt->getFormat();
-
-  bias = nullptr;
-  if (biasVal_ == nullptr) {
-    return;
-  }
-  bias = MKLDNNMatrix::create(biases_->getWGrad(),
-                              wgtPD->diff_bias_primitive_desc());
-  CHECK(bias->getPrimitiveDesc() == biasVal_->getPrimitiveDesc())
-      << "primitive desc of bias grad should equal the bias value";
-}
-
-void MKLDNNConvLayer::resetInGrad(
-    std::shared_ptr<conv_bwdData::primitive_desc>& dataPD,
-    MKLDNNMatrixPtr& in) {
-  in = nullptr;
-  cpuInGrad_ = nullptr;
-  cvtInGrad_ = nullptr;
-  if (dataPD == nullptr) {
-    return;
-  }
-
-  if (inputIsOnlyMKLDNN()) {
-    MKLDNNLayer::resetInGrad(in, dataPD->diff_src_primitive_desc());
-    CHECK(nullptr != inVal_ &&
-          in->getPrimitiveDesc() == inVal_->getPrimitiveDesc())
-        << "primitive desc of input grad and value should be equal";
-  } else {
-    const MatrixPtr& cpuIn = getInputGrad(0, CPU_DEVICE);
-    // same PrimitiveDesc with cpuInVal_
-    CHECK(cpuInVal_);
-    cpuInGrad_ = MKLDNNMatrix::create(cpuIn, cpuInVal_->getPrimitiveDesc());
-    in = cpuInGrad_;
-    // create reorder if PrimitiveDesc does not match
-    if (cpuInGrad_->getPrimitiveDesc() != dataPD->diff_src_primitive_desc()) {
-      in = MKLDNNMatrix::create(getInputGrad(0, MKLDNN_DEVICE),
-                                dataPD->diff_src_primitive_desc());
-      cvtInGrad_ = MKLDNNMatrix::createReorder(in, cpuInGrad_);
-      CHECK(cvtInGrad_);
-    }
-  }
 }

 void MKLDNNConvLayer::resetWgtValBwdData(
@@ -537,8 +370,7 @@ void MKLDNNConvLayer::resetWgtValBwdData(
  // since the primitive_desc would be different with wgtVal_
  CHECK(wgtVal_) << "should have weight value";
  if (dataPD->weights_primitive_desc() != wgtVal_->getPrimitiveDesc()) {
-    wgtValBwdData_ =
-        MKLDNNMatrix::create(nullptr, dataPD->weights_primitive_desc());
+    wgtValBwdData_ = MKLDNNMatrix::create(dataPD->weights_primitive_desc());
    cvtWgtVal_ = MKLDNNMatrix::createReorder(wgtVal_, wgtValBwdData_);
    CHECK(cvtWgtVal_);
  } else {

--- a/paddle/gserver/layers/MKLDNNConvLayer.h
+++ b/paddle/gserver/layers/MKLDNNConvLayer.h
@@ -48,17 +48,6 @@ protected:
  // save forward primitive_desc, which can be used backward
  std::shared_ptr<conv_fwd::primitive_desc> fwdPD_;

-  // MKLDNNMatrixPtr which should be created from CPU Device
-  MKLDNNMatrixPtr cpuInVal_;
-  MKLDNNMatrixPtr cpuInGrad_;
-  MKLDNNMatrixPtr cpuOutVal_;
-  MKLDNNMatrixPtr cpuOutGrad_;
-  // convert handle between CPU device and MKLDNN device
-  std::shared_ptr<mkldnn::reorder> cvtInVal_;
-  std::shared_ptr<mkldnn::reorder> cvtInGrad_;
-  std::shared_ptr<mkldnn::reorder> cvtOutVal_;
-  std::shared_ptr<mkldnn::reorder> cvtOutGrad_;
-
  // whether the weight has been init
  bool hasInitedWgt_;

@@ -94,8 +83,6 @@ public:
                MKLDNNMatrixPtr& bias,
                MKLDNNMatrixPtr& out) override;

-  void updateInputData() override;
-
  void updateWeights(const UpdateCallback& callback) override;

  void convertWeightsFromPaddle() override;
@@ -109,26 +96,6 @@ public:
                       << ", sw: " << sw_ << ", dh: " << dh_ << ", dw: " << dw_;
  }

-  void printValueFormatFlow() override {
-    if (cpuInVal_) {
-      VLOG(MKLDNN_FMTS) << cpuInVal_->getFormat() << " >>>";
-    }
-    MKLDNNLayer::printValueFormatFlow();
-    if (cpuOutVal_) {
-      VLOG(MKLDNN_FMTS) << " >>> " << cpuOutVal_->getFormat();
-    }
-  }
-
-  void printGradFormatFlow() override {
-    if (cpuInGrad_) {
-      VLOG(MKLDNN_FMTS) << cpuInGrad_->getFormat() << " <<<";
-    }
-    MKLDNNLayer::printGradFormatFlow();
-    if (cpuOutGrad_) {
-      VLOG(MKLDNN_FMTS) << " <<< " << cpuOutGrad_->getFormat();
-    }
-  }
-
 protected:
  /**
   * load the dims settings of this conv
@@ -162,23 +129,6 @@ protected:
                        MKLDNNMatrixPtr& bias,
                        MKLDNNMatrixPtr& out);

-  /**
-   * reset MKLDNNMatrix of input value
-   */
-  void resetInValue(std::shared_ptr<conv_fwd::primitive_desc>& pd,
-                    MKLDNNMatrixPtr& in);
-  /**
-   * reset MKLDNNMatrix of weight and bias value
-   */
-  void resetWgtBiasValue(std::shared_ptr<conv_fwd::primitive_desc>& pd,
-                         MKLDNNMatrixPtr& wgt,
-                         MKLDNNMatrixPtr& bias);
-  /**
-   * reset MKLDNNMatrix of output value
-   */
-  void resetOutValue(std::shared_ptr<conv_fwd::primitive_desc>& pd,
-                     MKLDNNMatrixPtr& out);
-
  /**
   * reset the backward weight primitive descriptor.
   */
@@ -207,22 +157,6 @@ protected:
                        MKLDNNMatrixPtr& bias,
                        MKLDNNMatrixPtr& out);

-  /**
-   * reset MKLDNNMatrix of output grad
-   */
-  void resetOutGrad(std::shared_ptr<conv_bwdWgt::primitive_desc>& wgtPD,
-                    MKLDNNMatrixPtr& out);
-  /**
-   * reset MKLDNNMatrix of weight and bias grad
-   */
-  void resetWgtBiasGrad(std::shared_ptr<conv_bwdWgt::primitive_desc>& wgtPD,
-                        MKLDNNMatrixPtr& wgt,
-                        MKLDNNMatrixPtr& bias);
-  /**
-   * reset MKLDNNMatrix of input grad
-   */
-  void resetInGrad(std::shared_ptr<conv_bwdData::primitive_desc>& dataPD,
-                   MKLDNNMatrixPtr& in);
  /**
   * reset MKLDNNMatrix of weight value for backward data
   * since the primitive_desc would be different with wgtVal_

--- a/paddle/gserver/layers/MKLDNNFcLayer.cpp
+++ b/paddle/gserver/layers/MKLDNNFcLayer.cpp
@@ -62,7 +62,7 @@ void MKLDNNFcLayer::convertWeightsFromPaddle() {
  CHECK(wgtVal_) << "should have been initialized";
  bool hasNoSpatial_ = ih_ == 1 && iw_ == 1;
  auto targetDim = wgtVal_->getDims();
-  auto srcFmt = hasNoSpatial_ ? memory::format::io : memory::format::ihwo;
+  auto srcFmt = hasNoSpatial_ ? format::io : format::ihwo;
  wgtVal_->reorderDataFrom(wgtVal_, srcFmt, targetDim);
  hasInitedWgt_ = true;
 }
@@ -71,7 +71,7 @@ void MKLDNNFcLayer::convertWeightsToPaddle() {
  CHECK(wgtVal_) << "should have been initialized";
  bool hasNoSpatial_ = ih_ == 1 && iw_ == 1;
  auto targetDim = wgtVal_->getDims();
-  auto dstFmt = hasNoSpatial_ ? memory::format::io : memory::format::ihwo;
+  auto dstFmt = hasNoSpatial_ ? format::io : format::ihwo;
  wgtVal_->reorderDataTo(wgtVal_, dstFmt, targetDim);
 }

@@ -100,8 +100,6 @@ void MKLDNNFcLayer::resetFwd(std::vector<primitive>& pipeline,
  resetFwdPD(fwdPD_, in, wgt, bias, out);

  resetFwdPipeline(pipeline, fwdPD_, in, wgt, bias, out);
-
-  printValueFormatFlow();
 }

 void MKLDNNFcLayer::resetBwd(std::vector<primitive>& pipeline,
@@ -119,12 +117,6 @@ void MKLDNNFcLayer::resetBwd(std::vector<primitive>& pipeline,
  resetBwdDataPD(bwdDataPD, in, out);

  resetBwdPipeline(pipeline, bwdWgtPD, bwdDataPD, in, wgt, bias, out);
-
-  printGradFormatFlow();
-}
-
-void MKLDNNFcLayer::updateInputData() {
-  inVal_->setData(getInputValue(0, CPU_DEVICE)->getData());
 }

 void MKLDNNFcLayer::updateWeights(const UpdateCallback& callback) {
@@ -139,51 +131,30 @@ void MKLDNNFcLayer::resetFwdBuffers(MKLDNNMatrixPtr& in,
                                    MKLDNNMatrixPtr& bias,
                                    MKLDNNMatrixPtr& out) {
  resetInValue(in);
-
-  resetWgtBiasValue(wgt, bias);
-
-  resetOutValue(out);
-}
-
-void MKLDNNFcLayer::resetInValue(MKLDNNMatrixPtr& in) {
-  if (inputIsOnlyMKLDNN()) {
-    const MatrixPtr& dnnIn = getInputValue(0);
-    in = std::dynamic_pointer_cast<MKLDNNMatrix>(dnnIn);
-    CHECK(in) << "Input should be MKLDNNMatrix";
-  } else {
-    CHECK_EQ(getPrev(0)->getDeviceId(), CPU_DEVICE) << "Only support CPU yet";
-    const MatrixPtr& cpuIn = getInputValue(0, CPU_DEVICE);
-    in = MKLDNNMatrix::create(
-        cpuIn, {bs_, ic_, ih_, iw_}, format::nchw, engine_);
-  }
+  CHECK(in);
  in->downSpatial();
-}

-void MKLDNNFcLayer::resetWgtBiasValue(MKLDNNMatrixPtr& wgt,
-                                      MKLDNNMatrixPtr& bias) {
+  auto outPD =
+      MKLDNNMatrix::createPrimitiveDesc({bs_, oc_}, format::nc, engine_);
+  resetOutValue(out, outPD);
+
  format wgtFmt = format::oihw;
-  if (inVal_->getFormat() == format::nChw8c) {
+  if (in->getFormat() == format::nChw8c) {
    wgtFmt = format::oIhw8i;
-  } else if (inVal_->getFormat() == format::nChw16c) {
+  } else if (in->getFormat() == format::nChw16c) {
    wgtFmt = format::oIhw16i;
  }
-  wgt = MKLDNNMatrix::create(
-      weight_->getW(), {oc_, ic_, ih_, iw_}, wgtFmt, engine_);
+  auto wgtPD =
+      MKLDNNMatrix::createPrimitiveDesc({oc_, ic_, ih_, iw_}, wgtFmt, engine_);
+  resetWithMatrix(wgt, weight_->getW(), wgtPD);
  wgt->downSpatial();
-  VLOG(MKLDNN_FMTS) << "Weight value format: " << wgt->getFormat();
-
-  bias = (biases_ && biases_->getW())
-             ? MKLDNNMatrix::create(biases_->getW(), {oc_}, format::x, engine_)
-             : nullptr;
-}

-void MKLDNNFcLayer::resetOutValue(MKLDNNMatrixPtr& out) {
-  out = MKLDNNMatrix::create(output_.value, {bs_, oc_}, format::nc, engine_);
-  if (!outputIsOnlyMKLDNN()) {
-    // fc cpu output value do not need create convert, just share data
-    getOutput(CPU_DEVICE).value->setData(out->getData());
+  if (biases_ && biases_->getW()) {
+    auto biasPD = MKLDNNMatrix::createPrimitiveDesc({oc_}, format::x, engine_);
+    resetWithMatrix(bias, biases_->getW(), biasPD);
+  } else {
+    bias = nullptr;
  }
-  output_.value = std::dynamic_pointer_cast<Matrix>(out);
 }

 void MKLDNNFcLayer::resetFwdPD(std::shared_ptr<fc_fwd::primitive_desc>& pd,
@@ -219,7 +190,6 @@ void MKLDNNFcLayer::resetFwdPipeline(
  } else {
    fwd_.reset(new fc_fwd(*pd, *in, *wgt, *out));
  }
-
  pipeline.push_back(*fwd_);
 }

@@ -227,44 +197,18 @@ void MKLDNNFcLayer::resetBwdBuffers(MKLDNNMatrixPtr& in,
                                    MKLDNNMatrixPtr& wgt,
                                    MKLDNNMatrixPtr& bias,
                                    MKLDNNMatrixPtr& out) {
-  resetOutGrad(out);
-
-  resetWgtBiasGrad(wgt, bias);
-
-  resetInGrad(in);
-}
-
-void MKLDNNFcLayer::resetOutGrad(MKLDNNMatrixPtr& out) {
-  CHECK(outVal_);
-  if (outputIsOnlyMKLDNN()) {
-    MKLDNNLayer::resetOutGrad(out, outVal_->getPrimitiveDesc());
-  } else {
-    const MatrixPtr& cpuOut = getOutput(CPU_DEVICE).grad;
-    output_.grad->setData(cpuOut->getData());
-    out = MKLDNNMatrix::create(cpuOut, outVal_->getPrimitiveDesc());
-  }
-}
+  CHECK(inVal_ && outVal_);
+  resetOutGrad(out, outVal_->getPrimitiveDesc());
+  resetInGrad(in, inVal_->getPrimitiveDesc());

-void MKLDNNFcLayer::resetWgtBiasGrad(MKLDNNMatrixPtr& wgt,
-                                     MKLDNNMatrixPtr& bias) {
  CHECK(wgtVal_);
-  wgt = MKLDNNMatrix::create(weight_->getWGrad(), wgtVal_->getPrimitiveDesc());
+  resetWithMatrix(wgt, weight_->getWGrad(), wgtVal_->getPrimitiveDesc());

-  bias = nullptr;
-  if (biasVal_ == nullptr) {
-    return;
-  }
-  bias =
-      MKLDNNMatrix::create(biases_->getWGrad(), biasVal_->getPrimitiveDesc());
-}
-
-void MKLDNNFcLayer::resetInGrad(MKLDNNMatrixPtr& in) {
-  in = nullptr;
-  if (inputLayers_[0]->getOutput().grad == nullptr) {
-    return;
+  if (biasVal_) {
+    resetWithMatrix(bias, biases_->getWGrad(), biasVal_->getPrimitiveDesc());
+  } else {
+    bias = nullptr;
  }
-  CHECK(inVal_);
-  MKLDNNLayer::resetInGrad(in, inVal_->getPrimitiveDesc());
 }

 void MKLDNNFcLayer::resetBwdWgtPD(

--- a/paddle/gserver/layers/MKLDNNFcLayer.h
+++ b/paddle/gserver/layers/MKLDNNFcLayer.h
@@ -66,8 +66,6 @@ public:
                MKLDNNMatrixPtr& bias,
                MKLDNNMatrixPtr& out) override;

-  void updateInputData() override;
-
  void updateWeights(const UpdateCallback& callback) override;

  void convertWeightsFromPaddle() override;
@@ -84,9 +82,6 @@ protected:
                       MKLDNNMatrixPtr& wgt,
                       MKLDNNMatrixPtr& bias,
                       MKLDNNMatrixPtr& out);
-  void resetInValue(MKLDNNMatrixPtr& in);
-  void resetWgtBiasValue(MKLDNNMatrixPtr& wgt, MKLDNNMatrixPtr& bias);
-  void resetOutValue(MKLDNNMatrixPtr& out);
  void resetFwdPD(std::shared_ptr<fc_fwd::primitive_desc>& pd,
                  MKLDNNMatrixPtr in,
                  MKLDNNMatrixPtr wgt,
@@ -109,9 +104,6 @@ protected:
                       MKLDNNMatrixPtr& wgt,
                       MKLDNNMatrixPtr& bias,
                       MKLDNNMatrixPtr& out);
-  void resetOutGrad(MKLDNNMatrixPtr& out);
-  void resetWgtBiasGrad(MKLDNNMatrixPtr& wgt, MKLDNNMatrixPtr& bias);
-  void resetInGrad(MKLDNNMatrixPtr& in);
  void resetBwdWgtPD(std::shared_ptr<fc_bwdWgt::primitive_desc>& pd,
                     MKLDNNMatrixPtr& wgt,
                     MKLDNNMatrixPtr& bias,

--- a/paddle/gserver/layers/MKLDNNLayer.cpp
+++ b/paddle/gserver/layers/MKLDNNLayer.cpp
+/* Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "MKLDNNLayer.h"
+
+using namespace mkldnn;  // NOLINT
+typedef memory::format format;
+
+namespace paddle {
+
+bool MKLDNNLayer::init(const LayerMap& layerMap,
+                       const ParameterMap& parameterMap) {
+  CHECK(FLAGS_use_mkldnn) << "MkldnnLayers only support use_mkldnn."
+                          << "Please set WITH_MKLDNN=ON "
+                          << "and set use_mkldnn=True";
+  CHECK(!useGpu_) << "Do not support GPU yet";
+
+  // set device id before Layer::init
+  setDevice(MKLDNN_DEVICE);
+  // change param device to MKLDNN device
+  setParamsDevice(MKLDNN_DEVICE, parameterMap);
+  if (!Layer::init(layerMap, parameterMap)) {
+    return false;
+  }
+  setOutputMap();
+  checkCPUOutputsNumber();
+
+  stream_.reset(new MKLDNNStream());
+  engine_ = CPUEngine::Instance().getEngine();
+  return true;
+}
+
+void MKLDNNLayer::forward(PassType passType) {
+  passType_ = passType;
+
+  {
+    REGISTER_TIMER_INFO("mkldnn_FwdTimer", getName().c_str());
+    CHECK(!inputLayers_.empty());
+    copySeqInfoToOutputs();
+    size_t elemenCnt = inputLayers_[0]->getOutputValue()->getElementCnt();
+    if (inputElemenCnt_ != elemenCnt) {
+      VLOG(MKLDNN_BASE) << getName() << " reset mkldnn forward";
+      // reset when input total sizes changed, not only the batchsize
+      inputElemenCnt_ = elemenCnt;
+      pipelineFwd_.clear();
+      reshape(bs_, ic_, ih_, iw_, oc_, oh_, ow_);
+      // all cpu device output grad or value share output's
+      shareCPUDevice();
+      resetFwd(pipelineFwd_, inVal_, wgtVal_, biasVal_, outVal_);
+      // MKLDNNLayer output value should be MKLDNNMatrix
+      // so external output value is necessary.
+      // Then external input value is not necessary,
+      // since input may be mkldnn internal buffer.
+      CHECK(extOutVal_) << "external output value is necessary";
+      output_.value = std::dynamic_pointer_cast<Matrix>(extOutVal_);
+      CHECK(inVal_ && outVal_) << "internal memories are necessary";
+      if (cvtInVal_) {
+        pipelineFwd_.insert(pipelineFwd_.begin(), *cvtInVal_);
+      }
+      if (cvtOutVal_) {
+        pipelineFwd_.push_back(*cvtOutVal_);
+      }
+      convertWeightsFromPaddle();
+      printSizeInfo();
+      printValueFormat();
+      needResetBwd_ = true;
+    }
+
+    if (inputLayers_[0]->getType() == "data") {
+      // Update input value data when input layer is "data" type,
+      // since the input value data address might be changed.
+      CHECK(extInVal_);
+      extInVal_->setData(getInputValue(0, CPU_DEVICE)->getData());
+    }
+
+    if (!outputOnlyMKLDNN_) {
+      clearGrads();
+    }
+    stream_->submit(pipelineFwd_);
+  }
+  {
+    REGISTER_TIMER_INFO("FwActTimer", getName().c_str());
+    forwardActivation();
+  }
+}
+
+void MKLDNNLayer::backward(const UpdateCallback& callback) {
+  if (needResetBwd_) {
+    VLOG(MKLDNN_BASE) << getName() << " reset mkldnn backward";
+    pipelineBwd_.clear();
+    pipelineMergeGrad_.clear();
+    mergeGrad_ = nullptr;
+    resetBwd(pipelineBwd_, inGrad_, wgtGrad_, biasGrad_, outGrad_);
+    // external output grad is not necessary
+    // since output may be mkldnn internal buffer or merge them directly.
+    CHECK(outGrad_) << "internal output grad is necessary";
+    if (extOutGrad_) {
+      CHECK_EQ(extOutGrad_->getData(), output_.grad->getData())
+          << "the external buffer should share the same data with output_.grad";
+    }
+    if (cvtOutGrad_) {
+      pipelineBwd_.insert(pipelineBwd_.begin(), *cvtOutGrad_);
+    }
+    if (cvtInGrad_) {
+      pipelineBwd_.push_back(*cvtInGrad_);
+    }
+    printGradFormat();
+    needResetBwd_ = false;
+  }
+
+  // merge grad must run before backward activation
+  if (mergeGrad_) {
+    REGISTER_TIMER_INFO("MergeBpGrad", getName().c_str());
+    stream_->submit(pipelineMergeGrad_);
+  }
+  {
+    REGISTER_TIMER_INFO("BpActTimer", getName().c_str());
+    backwardActivation();
+  }
+  {
+    REGISTER_TIMER_INFO("mkldnn_bwdTimer", getName().c_str());
+    stream_->submit(pipelineBwd_);
+  }
+  {
+    REGISTER_TIMER_INFO("WeightUpdate", getName().c_str());
+    updateWeights(callback);
+  }
+}
+
+void MKLDNNLayer::reshapeInput(int& batchsize, int& height, int& width) {
+  const Argument& input = inputLayers_[0]->getOutput();
+  batchsize = input.getBatchSize();
+  int h = input.getFrameHeight();
+  int w = input.getFrameWidth();
+  if (h != 0) {
+    height = h;
+  }
+  if (w != 0) {
+    width = w;
+  }
+}
+
+void MKLDNNLayer::reshapeOutput(size_t height, size_t width) {
+  output_.setFrameHeight(height);
+  output_.setFrameWidth(width);
+  for (size_t i = 0; i < outputOtherDevice_.size(); i++) {
+    outputOtherDevice_[i].setFrameHeight(height);
+    outputOtherDevice_[i].setFrameWidth(width);
+  }
+}
+
+void MKLDNNLayer::resetWithMatrix(MKLDNNMatrixPtr& dnn,
+                                  const MatrixPtr& mat,
+                                  memory::primitive_desc pd) {
+  dnn = nullptr;
+  if (mat == nullptr) {
+    return;
+  }
+  dnn = MKLDNNMatrix::create(pd, mat);
+}
+
+void MKLDNNLayer::resetInValue(
+    MKLDNNMatrixPtr& in, const std::shared_ptr<memory::primitive_desc>& intPD) {
+  cvtInVal_ = nullptr;
+  extInVal_ = nullptr;
+  in = nullptr;
+  CHECK_GT(bs_ * ic_ * ih_ * iw_, 0);
+  auto extPD = MKLDNNMatrix::createPrimitiveDesc(
+      {bs_, ic_, ih_, iw_}, format::nchw, engine_);
+  const MatrixPtr& inMat = inputLayers_[0]->getOutputValue();
+  in = std::dynamic_pointer_cast<MKLDNNMatrix>(inMat);
+  CHECK_EQ(inputIsOnlyMKLDNN(), in != nullptr);
+  if (in == nullptr || in->getFormat() == format::nc) {
+    in = MKLDNNMatrix::create(extPD, inMat);
+  }
+  extInVal_ = isPaddleFormat(in->getFormat()) ? in : nullptr;
+  if (in->getFormat() == format::nc) {
+    CHECK(ih_ == 1 && iw_ == 1);
+  }
+  if (nullptr == intPD || in->getPrimitiveDesc() == *intPD) {
+    return;
+  }
+  // need to create a reorder
+  in = MKLDNNMatrix::create(*intPD);
+  extInVal_ = extInVal_ ? extInVal_ : MKLDNNMatrix::create(extPD, inMat);
+  cvtInVal_ = MKLDNNMatrix::createReorder(extInVal_, in);
+  CHECK(cvtInVal_) << "should not be empty";
+}
+
+void MKLDNNLayer::resetOutValue(MKLDNNMatrixPtr& out,
+                                memory::primitive_desc intPD) {
+  cvtOutVal_ = nullptr;
+  out = MKLDNNMatrix::create(intPD, output_.value);
+  extOutVal_ = out;
+  if (outputIsOnlyMKLDNN() || isPaddleFormat(extOutVal_->getFormat())) {
+    return;
+  }
+  // need to create a reorder
+  CHECK_GT(bs_ * oc_ * oh_ * ow_, 0);
+  extOutVal_ = MKLDNNMatrix::create(
+      memory::dims{bs_, oc_, oh_, ow_}, format::nchw, engine_, output_.value);
+  out = MKLDNNMatrix::create(intPD);
+  cvtOutVal_ = MKLDNNMatrix::createReorder(out, extOutVal_);
+  CHECK(cvtOutVal_) << "should not be empty";
+}
+
+void MKLDNNLayer::resetInGrad(MKLDNNMatrixPtr& in,
+                              memory::primitive_desc intPD) {
+  cvtInGrad_ = nullptr;
+  extInGrad_ = nullptr;
+  in = nullptr;
+  LayerPtr& input = inputLayers_[0];
+  if (input->getOutputGrad() == nullptr) {
+    // input grad is not needed
+    return;
+  }
+  CHECK(inputIsOnlyMKLDNN() || input->getOutputMapSize() <= 1)
+      << "only support input is MKLDNN layer or only have one output layer";
+  // when input is a mkldnn branch node,
+  // this layer will save input grad to an internal buffer,
+  // and the mkldnn input layer will merge them to actual prev->output_.grad
+  const MatrixPtr& inMat =
+      input->getOutputMapSize() <= 1 ? input->getOutputGrad() : nullptr;
+  in = MKLDNNMatrix::create(intPD, inMat);
+  Argument& arg = input->getOutput(this->getName());
+  arg.grad = std::dynamic_pointer_cast<Matrix>(in);
+  CHECK(inVal_);
+  CHECK(inVal_->getPrimitiveDesc() == intPD) << "the primitive desc must equal";
+  if (inputIsOnlyMKLDNN()) {
+    return;
+  }
+
+  extInGrad_ = in;
+  if (isPaddleFormat(extInGrad_->getFormat())) {
+    return;
+  }
+  // need to create a reorder
+  // TODO(TJ): add macro definition to simplify it
+  CHECK(extInVal_ != nullptr && isPaddleFormat(extInVal_->getFormat()))
+      << "should have external input value and the format must be nchw(nc)";
+  extInGrad_ = MKLDNNMatrix::create(extInVal_->getPrimitiveDesc(), inMat);
+  CHECK(inVal_ != nullptr && inVal_->getPrimitiveDesc() == intPD)
+      << "should have internal input value and primitive desc must equal";
+  in = MKLDNNMatrix::create(intPD);
+  cvtInGrad_ = MKLDNNMatrix::createReorder(in, extInGrad_);
+  CHECK(cvtInGrad_);
+}
+
+void MKLDNNLayer::resetOutGrad(MKLDNNMatrixPtr& out,
+                               memory::primitive_desc intPD) {
+  cvtOutGrad_ = nullptr;
+  extOutGrad_ = nullptr;
+  out = nullptr;
+  MatrixPtr& outMat = output_.grad;
+  out = MKLDNNMatrix::create(intPD, outMat);
+  resetMergeGrad(out);
+  if (outputIsOnlyMKLDNN()) {
+    return;
+  }
+  CHECK_LE(outputMap_.size(), 1U) << "do not support mixed with cpu device";
+  extOutGrad_ = out;
+  if (isPaddleFormat(extOutGrad_->getFormat())) {
+    return;
+  }
+  // need to create a reorder
+  CHECK(extOutVal_ != nullptr && isPaddleFormat(extOutVal_->getFormat()))
+      << "should have external output value and the format must be nchw(nc)";
+  extOutGrad_ = MKLDNNMatrix::create(extOutVal_->getPrimitiveDesc(), outMat);
+  CHECK(outVal_ != nullptr && outVal_->getPrimitiveDesc() == intPD)
+      << "should have internal output value and primitive desc must equal";
+  out = MKLDNNMatrix::create(intPD);
+  cvtOutGrad_ = MKLDNNMatrix::createReorder(extOutGrad_, out);
+  CHECK(cvtOutGrad_);
+}
+
+void MKLDNNLayer::resetMergeGrad(MKLDNNMatrixPtr& out) {
+  mergeGrad_ = nullptr;
+  pipelineMergeGrad_.clear();
+  if (outputMap_.size() <= 1 || !outputIsOnlyMKLDNN()) {
+    // do not merge when output is not all MKLDNN or only one output
+    return;
+  }
+  CHECK(out) << "should have reset internal output grad";
+  std::vector<double> scales(outputMap_.size(), 1.0);
+  std::vector<memory::primitive_desc> srcPDs;
+  std::vector<primitive::at> srcs;
+  for (auto it = outputMap_.begin(); it != outputMap_.end(); ++it) {
+    MKLDNNMatrixPtr src =
+        std::dynamic_pointer_cast<MKLDNNMatrix>(it->second->grad);
+    CHECK(src) << "should be MKLDNNMatrix";
+    auto srcDims = src->getDims();
+    auto dstDims = out->getDims();
+    CHECK_EQ(srcDims.size(), dstDims.size());
+    for (size_t i = 0; i < srcDims.size(); ++i) {
+      CHECK_EQ(srcDims[i], dstDims[i]);
+    }
+    VLOG(MKLDNN_BASE) << getName() << " has output grad " << it->first
+                      << ", format " << src->getFormat();
+    srcPDs.push_back(src->getPrimitiveDesc());
+    srcs.push_back(*src);
+  }
+
+  // TODO(TJ): remove me when mkldnn sum support different formats
+  for (size_t i = 1; i < srcPDs.size(); ++i) {
+    CHECK(srcPDs[0] == srcPDs[i]);
+  }
+  tmpOutGrad_ = out;
+  tmpCvt_ = nullptr;
+  if (out->getPrimitiveDesc() != srcPDs[0]) {
+    tmpOutGrad_ = MKLDNNMatrix::create(srcPDs[0]);
+    tmpCvt_ = MKLDNNMatrix::createReorder(tmpOutGrad_, out);
+    CHECK(tmpCvt_);
+    pipelineMergeGrad_.push_back(*tmpCvt_);
+  }
+
+  auto sumPD =
+      sum::primitive_desc(tmpOutGrad_->getMemoryDesc(), scales, srcPDs);
+  mergeGrad_.reset(new sum(sumPD, srcs, *tmpOutGrad_));
+  pipelineMergeGrad_.insert(pipelineMergeGrad_.begin(), *mergeGrad_);
+}
+
+}  // namespace paddle
--- a/paddle/gserver/layers/MKLDNNLayer.h
+++ b/paddle/gserver/layers/MKLDNNLayer.h
@@ -58,11 +58,31 @@ protected:
  std::vector<mkldnn::primitive> pipelineFwd_;
  std::vector<mkldnn::primitive> pipelineBwd_;

-  // MKLDNNMatrixPtr with internal format
+  /* Value and grad are separated into internal and external buffers.
+   * Each MKLDNNLayer must init or reset at least the internal buffers,
+   * and the external buffer format is always nchw or nc (when h==w==1),
+   * which is the same format as paddle.
+   * The output_.value and output_.grad always save the external data
+   * when mixed with cpu device.
+   * When all layers are mkldnn layers, they could save internal data.
+   */
+  // below MKLDNNMatrix buffers are all internal buffers
  MKLDNNMatrixPtr inVal_;
  MKLDNNMatrixPtr inGrad_;
  MKLDNNMatrixPtr outVal_;
  MKLDNNMatrixPtr outGrad_;
+  // below are external value and grad
+  MKLDNNMatrixPtr extInVal_;
+  MKLDNNMatrixPtr extInGrad_;
+  MKLDNNMatrixPtr extOutVal_;
+  MKLDNNMatrixPtr extOutGrad_;
+  // convert handle between external and internal buffers
+  std::shared_ptr<mkldnn::reorder> cvtInVal_;
+  std::shared_ptr<mkldnn::reorder> cvtInGrad_;
+  std::shared_ptr<mkldnn::reorder> cvtOutVal_;
+  std::shared_ptr<mkldnn::reorder> cvtOutGrad_;
+
+  // weight and bias are always internal buffers
  MKLDNNMatrixPtr wgtVal_;
  MKLDNNMatrixPtr wgtGrad_;
  MKLDNNMatrixPtr biasVal_;
@@ -91,6 +111,7 @@ public:
        oh_(0),
        ow_(0),
        needResetBwd_(true),
+        outputOnlyMKLDNN_(false),
        engine_(mkldnn::engine::cpu, 0),
        stream_(nullptr),
        fwd_(nullptr),
@@ -99,92 +120,9 @@ public:

  ~MKLDNNLayer() {}

-  virtual bool init(const LayerMap& layerMap,
-                    const ParameterMap& parameterMap) {
-    CHECK(FLAGS_use_mkldnn) << "MkldnnLayers only support use_mkldnn."
-                            << "Please set WITH_MKLDNN=ON "
-                            << "and set use_mkldnn=True";
-    CHECK(!useGpu_) << "Do not support GPU yet";
-
-    // set device id before Layer::init
-    setDevice(MKLDNN_DEVICE);
-    // change param device to MKLDNN device
-    setParamsDevice(MKLDNN_DEVICE, parameterMap);
-    if (!Layer::init(layerMap, parameterMap)) {
-      return false;
-    }
-    setOutputMap();
-    checkCPUOutputsNumber();
-
-    stream_.reset(new MKLDNNStream());
-    engine_ = CPUEngine::Instance().getEngine();
-    return true;
-  }
-
-  void forward(PassType passType) override {
-    passType_ = passType;
-
-    {
-      REGISTER_TIMER_INFO("mkldnn_FwdTimer", getName().c_str());
-      CHECK(!inputLayers_.empty());
-      copySeqInfoToOutputs();
-      size_t elemenCnt = inputLayers_[0]->getOutput().value->getElementCnt();
-      if (inputElemenCnt_ != elemenCnt) {
-        VLOG(MKLDNN_BASE) << getName() << " reset mkldnn forward";
-        // reset when input total sizes changed, not only the batchsize
-        inputElemenCnt_ = elemenCnt;
-        pipelineFwd_.clear();
-        reshape(bs_, ic_, ih_, iw_, oc_, oh_, ow_);
-        resetFwd(pipelineFwd_, inVal_, wgtVal_, biasVal_, outVal_);
-        convertWeightsFromPaddle();
-        needResetBwd_ = true;
-      }
-
-      if (inputLayers_[0]->getType() == "data") {
-        updateInputData();
-      }
-
-      if (!outputOnlyMKLDNN_) {
-        clearGrads();
-      }
-      stream_->submit(pipelineFwd_);
-    }
-
-    /* activation */ {
-      REGISTER_TIMER_INFO("FwActTimer", getName().c_str());
-      forwardActivation();
-    }
-  }
-
-  void backward(const UpdateCallback& callback) override {
-    if (needResetBwd_) {
-      VLOG(MKLDNN_BASE) << getName() << " reset mkldnn backward";
-      pipelineBwd_.clear();
-      pipelineMergeGrad_.clear();
-      mergeGrad_ = nullptr;
-      resetBwd(pipelineBwd_, inGrad_, wgtGrad_, biasGrad_, outGrad_);
-      needResetBwd_ = false;
-    }
-
-    // merge grad must before backward activation
-    if (mergeGrad_) {
-      REGISTER_TIMER_INFO("MergeBpGrad", getName().c_str());
-      stream_->submit(pipelineMergeGrad_);
-    }
-    {
-      REGISTER_TIMER_INFO("BpActTimer", getName().c_str());
-      backwardActivation();
-    }
-    {
-      REGISTER_TIMER_INFO("mkldnn_bwdTimer", getName().c_str());
-      stream_->submit(pipelineBwd_);
-    }
-
-    {
-      REGISTER_TIMER_INFO("WeightUpdate", getName().c_str());
-      updateWeights(callback);
-    }
-  }
+  virtual bool init(const LayerMap& layerMap, const ParameterMap& parameterMap);
+  virtual void forward(PassType passType);
+  virtual void backward(const UpdateCallback& callback);

  /**
   * reshape the input image sizes
@@ -195,7 +133,7 @@ public:
      int& bs, int& ic, int& ih, int& iw, int oc, int& oh, int& ow) = 0;

  /**
-   * reset the mkldnn forward primitve and memory
+   * reset the mkldnn forward primitive and memories
   * only would be called when input size changes
   */
  virtual void resetFwd(std::vector<mkldnn::primitive>& pipeline,
@@ -205,7 +143,7 @@ public:
                        MKLDNNMatrixPtr& out) = 0;

  /**
-   * reset the mkldnn backward primitve and memory for mkldnn fc
+   * reset the mkldnn backward primitive and memories
   * only would be called when needed
   */
  virtual void resetBwd(std::vector<mkldnn::primitive>& pipeline,
@@ -214,12 +152,6 @@ public:
                        MKLDNNMatrixPtr& bias,
                        MKLDNNMatrixPtr& out) = 0;

-  /**
-   * Update input value data when input layer is "data" type.
-   * Since the input value data address might be changed.
-   */
-  virtual void updateInputData() {}
-
  /**
   * Update weights and biases if necessary.
   */
@@ -246,131 +178,78 @@ protected:
  /**
   * reshape the input image sizes and input batchsize
   */
-  virtual void reshapeInput(int& batchsize, int& height, int& width) {
-    const Argument& input = inputLayers_[0]->getOutput();
-    batchsize = input.getBatchSize();
-    int h = input.getFrameHeight();
-    int w = input.getFrameWidth();
-    if (h != 0) {
-      height = h;
-    }
-    if (w != 0) {
-      width = w;
-    }
-  }
+  void reshapeInput(int& batchsize, int& height, int& width);

  /**
   * reshape output image sizes
   */
-  virtual void reshapeOutput(size_t height, size_t width) {
-    output_.setFrameHeight(height);
-    output_.setFrameWidth(width);
-    for (size_t i = 0; i < outputOtherDevice_.size(); i++) {
-      outputOtherDevice_[i].setFrameHeight(height);
-      outputOtherDevice_[i].setFrameWidth(width);
-    }
-  }
+  void reshapeOutput(size_t height, size_t width);

  /**
-   * reset the output grad matrix from primitive desc.
-   * and reset the merge grad primitive if needed.
-   * note: when this layer has serval outputs,
-   *       it could not be mixed with cpu device,
-   *       since it can not get memory desc from cpu device.
+   * reset MKLDNNMatrix from Matrix and internal primitive desc.
+   * reset to nullptr if matrix or primitive desc is empty
   */
-  virtual void resetOutGrad(MKLDNNMatrixPtr& out,
-                            mkldnn::memory::primitive_desc pd) {
-    CHECK(outputIsOnlyMKLDNN()) << "do not support mixed with other device yet";
-    mergeGrad_ = nullptr;
-    pipelineMergeGrad_.clear();
-    out = MKLDNNMatrix::create(output_.grad, pd);
-    if (outputMap_.size() <= 1) {
-      return;
-    }
-    std::vector<double> scales(outputMap_.size(), 1.0);
-    std::vector<mkldnn::memory::primitive_desc> srcPDs;
-    std::vector<mkldnn::primitive::at> srcs;
-    for (auto it = outputMap_.begin(); it != outputMap_.end(); ++it) {
-      MKLDNNMatrixPtr src =
-          std::dynamic_pointer_cast<MKLDNNMatrix>(it->second->grad);
-      VLOG(MKLDNN_BASE) << getName() << " has output grad " << it->first;
-      CHECK(src) << "should be MKLDNNMatrix";
-      auto srcDims = src->getDims();
-      auto dstDims = out->getDims();
-      CHECK_EQ(srcDims.size(), dstDims.size());
-      for (size_t i = 0; i < srcDims.size(); ++i) {
-        CHECK_EQ(srcDims[i], dstDims[i]);
-      }
-      srcPDs.push_back(src->getPrimitiveDesc());
-      srcs.push_back(*src);
-    }
+  void resetWithMatrix(MKLDNNMatrixPtr& dnn,
+                       const MatrixPtr& mat,
+                       mkldnn::memory::primitive_desc pd);

-    // TODO(TJ): remove me when mkldnn sum support different formats
-    for (size_t i = 1; i < srcPDs.size(); ++i) {
-      CHECK(srcPDs[0] == srcPDs[i]);
-    }
-    tmpOutGrad_ = nullptr;
-    tmpCvt_ = nullptr;
-    if (out->getPrimitiveDesc() != srcPDs[0]) {
-      tmpOutGrad_ = MKLDNNMatrix::create(nullptr, srcPDs[0]);
-      tmpCvt_ = MKLDNNMatrix::createReorder(tmpOutGrad_, out);
-      CHECK(tmpCvt_);
-      pipelineMergeGrad_.push_back(*tmpCvt_);
-    } else {
-      tmpOutGrad_ = out;
-    }
+  /**
+   * reset input value from input MKLDNNMatrix and internal primitive desc.
+   * reset both internal and external buffer and create reorder if necessary.
+   */
+  void resetInValue(
+      MKLDNNMatrixPtr& in,
+      const std::shared_ptr<mkldnn::memory::primitive_desc>& intPD = nullptr);

-    auto sumPD = mkldnn::sum::primitive_desc(
-        tmpOutGrad_->getMemoryDesc(), scales, srcPDs);
-    mergeGrad_.reset(new mkldnn::sum(sumPD, srcs, *tmpOutGrad_));
-    pipelineMergeGrad_.insert(pipelineMergeGrad_.begin(), *mergeGrad_);
-  }
+  /**
+   * reset output value from internal primitive desc.
+   * reset both internal and external buffer and create reorder if necessary.
+   */
+  void resetOutValue(MKLDNNMatrixPtr& out,
+                     mkldnn::memory::primitive_desc intPD);

  /**
-   * reset input grad from primitive desc.
-   * this function is avaiable for input is only mkldnn
-   * or input do not care cpu device
+   * reset input grad from internal primitive desc.
+   * reset both internal and external buffer and create reorder if necessary.
   */
-  virtual void resetInGrad(MKLDNNMatrixPtr& in,
-                           mkldnn::memory::primitive_desc pd) {
-    LayerPtr& input = inputLayers_[0];
-    const MatrixPtr& grad =
-        input->getOutputMapSize() > 1 ? nullptr : input->getOutput().grad;
-    in = MKLDNNMatrix::create(grad, pd);
-    Argument& arg = input->getOutput(this->getName());
-    arg.grad = std::dynamic_pointer_cast<Matrix>(in);
-  }
+  void resetInGrad(MKLDNNMatrixPtr& in, mkldnn::memory::primitive_desc intPD);

  /**
-   * print info about sizes
+   * reset output grad from internal primitive desc.
+   * merge grad if necessary.
+   * reset both internal and external buffer and create reorder if necessary.
+   * note: when this layer has several outputs, the merged grad
+   *       cannot be mixed with the cpu device,
+   *       since no memory desc is available from the cpu device.
   */
-  virtual void printSizeInfo() {
-    VLOG(MKLDNN_SIZES) << getName() << ": bs: " << bs_ << ", ic: " << ic_
-                       << ", ih: " << ih_ << ", iw: " << iw_ << ", oc: " << oc_
-                       << ", oh: " << oh_ << ", ow: " << ow_;
-  }
+  void resetOutGrad(MKLDNNMatrixPtr& out, mkldnn::memory::primitive_desc intPD);

  /**
-   * Print the mkldnn memory format flow of value
+   * reset the merge grad primitive if necessary.
+   * note: grads mixed with the cpu device are not supported,
+   *       since no memory desc is available from the cpu device.
   */
-  virtual void printValueFormatFlow() {
-    if (inVal_ && outVal_) {
-      VLOG(MKLDNN_FMTS) << inVal_->getFormat() << " >>> "
-                        << outVal_->getFormat();
-    }
-  }
+  void resetMergeGrad(MKLDNNMatrixPtr& out);
+
+protected:
+  /**
+   * Set deviceId of this layer.
+   */
+  void setDevice(int id) { deviceId_ = id; }

  /**
-   * Print the mkldnn memory format flow of grad
+   * check whether the format is nchw or nc,
+   * the formats supported by Paddle's default memory layout
   */
-  virtual void printGradFormatFlow() {
-    if (inGrad_ && outGrad_) {
-      VLOG(MKLDNN_FMTS) << inGrad_->getFormat() << " <<< "
-                        << outGrad_->getFormat();
+  bool isPaddleFormat(mkldnn::memory::format fmt) {
+    if (fmt == mkldnn::memory::format::nchw ||
+        fmt == mkldnn::memory::format::nc) {
+      return true;
+    } else {
+      return false;
    }
  }

-protected:
  /**
   * If input only has MKLDNN device.
   * Otherwise, only support the previous layer using CPU device.
@@ -380,7 +259,6 @@ protected:
    if (prevDevice == MKLDNN_DEVICE) {
      return true;
    } else {
-      // do not support GPU yet
      CHECK_EQ(prevDevice, CPU_DEVICE) << "Only support CPU yet";
      return false;
    }
@@ -400,18 +278,74 @@ protected:
  }

  /**
-   * Set deviceId of this layer.
+   * print info about sizes
   */
-  void setDevice(int id) { deviceId_ = id; }
+  virtual void printSizeInfo() {
+    VLOG(MKLDNN_SIZES) << getName() << ": bs: " << bs_ << ", ic: " << ic_
+                       << ", ih: " << ih_ << ", iw: " << iw_ << ", oc: " << oc_
+                       << ", oh: " << oh_ << ", ow: " << ow_;
+  }
+
+  /**
+   * print the mkldnn memory format of value
+   */
+  virtual void printValueFormat() {
+    if (extInVal_) {
+      VLOG(MKLDNN_FMTS) << extInVal_->getFormat() << " >>> ";
+    }
+    if (inVal_) {
+      VLOG(MKLDNN_FMTS) << inVal_->getFormat() << " >>>";
+    }
+    if (outVal_) {
+      VLOG(MKLDNN_FMTS) << outVal_->getFormat() << " >>> ";
+    }
+    if (extOutVal_) {
+      VLOG(MKLDNN_FMTS) << extOutVal_->getFormat();
+    }
+    if (wgtVal_) {
+      VLOG(MKLDNN_FMTS) << "Weight value format: " << wgtVal_->getFormat();
+    }
+    if (biasVal_) {
+      VLOG(MKLDNN_FMTS) << "Bias value format: " << biasVal_->getFormat();
+    }
+  }
+
+  /**
+   * print the mkldnn memory format of grad
+   */
+  virtual void printGradFormat() {
+    if (extOutGrad_) {
+      VLOG(MKLDNN_FMTS) << extOutGrad_->getFormat();
+    }
+    if (outGrad_) {
+      VLOG(MKLDNN_FMTS) << outGrad_->getFormat() << " <<< ";
+    }
+    if (inGrad_) {
+      VLOG(MKLDNN_FMTS) << inGrad_->getFormat() << " <<<";
+    }
+    if (extInGrad_) {
+      VLOG(MKLDNN_FMTS) << extInGrad_->getFormat() << " <<< ";
+    }
+    if (wgtGrad_) {
+      VLOG(MKLDNN_FMTS) << "Weight grad format: " << wgtGrad_->getFormat();
+    }
+    if (biasGrad_) {
+      VLOG(MKLDNN_FMTS) << "Bias grad format: " << biasGrad_->getFormat();
+    }
+  }

 private:
  /**
   * clear all grad
   */
  void clearGrads() {
-    output_.grad->zeroMem();
+    if (output_.grad) {
+      output_.grad->zeroMem();
+    }
    for (size_t i = 0; i < outputOtherDevice_.size(); i++) {
-      outputOtherDevice_[i].grad->zeroMem();
+      if (outputOtherDevice_[i].grad) {
+        outputOtherDevice_[i].grad->zeroMem();
+      }
    }
  }

@@ -449,6 +383,19 @@ private:
    }
  }

+  /**
+   * if any output is on a cpu device, share value and grad data with output_
+   */
+  void shareCPUDevice() {
+    if (outputIsOnlyMKLDNN()) {
+      return;
+    }
+    for (size_t i = 0; i < outputOtherDevice_.size(); i++) {
+      outputOtherDevice_[i].value = output_.value;
+      outputOtherDevice_[i].grad = output_.grad;
+    }
+  }
+
  /**
   * Check the cpu device number of outputOtherDevice_.
   * should have only one at most.

--- a/paddle/gserver/layers/MKLDNNPoolLayer.cpp
+++ b/paddle/gserver/layers/MKLDNNPoolLayer.cpp
@@ -85,8 +85,6 @@ void MKLDNNPoolLayer::resetFwd(std::vector<primitive>& pipeline,
  resetFwdPD(fwdPD_, in, out);

  resetFwdPipeline(pipeline, fwdPD_, in, out);
-
-  printValueFormatFlow();
 }

 void MKLDNNPoolLayer::resetBwd(std::vector<primitive>& pipeline,
@@ -101,65 +99,22 @@ void MKLDNNPoolLayer::resetBwd(std::vector<primitive>& pipeline,
  resetBwdPD(pd, in, out);

  resetBwdPipeline(pipeline, pd, in, out);
-
-  printGradFormatFlow();
-}
-
-void MKLDNNPoolLayer::updateInputData() {
-  inVal_->setData(getInputValue(0, CPU_DEVICE)->getData());
 }

 void MKLDNNPoolLayer::resetFwdBuffers(MKLDNNMatrixPtr& in,
                                      MKLDNNMatrixPtr& out) {
  resetInValue(in);

-  resetOutValue(out);
-}
-
-void MKLDNNPoolLayer::resetInValue(MKLDNNMatrixPtr& in) {
-  if (inputIsOnlyMKLDNN()) {
-    const MatrixPtr& dnnIn = getInputValue(0);
-    in = std::dynamic_pointer_cast<MKLDNNMatrix>(dnnIn);
-    CHECK(in) << "Input should be MKLDNNMatrix";
-  } else {
-    CHECK_EQ(getPrev(0)->getDeviceId(), CPU_DEVICE) << "Only support CPU yet";
-    const MatrixPtr& cpuIn = getInputValue(0, CPU_DEVICE);
-    in = MKLDNNMatrix::create(
-        cpuIn, {bs_, ic_, ih_, iw_}, format::nchw, engine_);
-  }
-}
-
-void MKLDNNPoolLayer::resetOutValue(MKLDNNMatrixPtr& out) {
-  CHECK(inVal_) << "Should reset input value first";
  memory::dims outDims = memory::dims{bs_, oc_, oh_, ow_};
-  out = MKLDNNMatrix::create(
-      output_.value, outDims, inVal_->getFormat(), engine_);
-
-  // create reorder if output value has cpu device and pd do not match
-  cpuOutVal_ = nullptr;
-  cvtOutVal_ = nullptr;
-  if (!outputIsOnlyMKLDNN()) {
-    const MatrixPtr& cpuOut = getOutput(CPU_DEVICE).value;
-    cpuOutVal_ = MKLDNNMatrix::create(cpuOut, outDims, format::nchw, engine_);
-    if (cpuOutVal_->getPrimitiveDesc() != out->getPrimitiveDesc()) {
-      out = MKLDNNMatrix::create(nullptr, out->getPrimitiveDesc());
-      cvtOutVal_ = MKLDNNMatrix::createReorder(out, cpuOutVal_);
-      CHECK(cvtOutVal_) << "should not be emptry";
-    } else {
-      cpuOut->setData(output_.value->getData());
-      cpuOutVal_ = out;
-    }
-    output_.value = std::dynamic_pointer_cast<Matrix>(cpuOutVal_);
-    return;
-  }
-  output_.value = std::dynamic_pointer_cast<Matrix>(outVal_);
+  CHECK(in);
+  auto outPD =
+      MKLDNNMatrix::createPrimitiveDesc(outDims, in->getFormat(), engine_);
+  resetOutValue(out, outPD);
 }

 void MKLDNNPoolLayer::resetFwdPD(std::shared_ptr<pool_fwd::primitive_desc>& pd,
                                 MKLDNNMatrixPtr in,
                                 MKLDNNMatrixPtr out) {
-  memory::dims inDims = memory::dims{bs_, ic_, ih_, iw_};
-  memory::dims outDims = memory::dims{bs_, oc_, oh_, ow_};
  memory::dims kernels = memory::dims{fh_, fw_};
  memory::dims strides = memory::dims{sh_, sw_};
  memory::dims padL = memory::dims{ph_, pw_};
@@ -194,58 +149,26 @@ void MKLDNNPoolLayer::resetFwdPipeline(
             ? std::make_shared<pool_fwd>(pool_fwd(*pd, *in, *out, *workspace_))
             : std::make_shared<pool_fwd>(pool_fwd(*pd, *in, *out));
  pipeline.push_back(*fwd_);
-
-  if (cvtOutVal_) {
-    pipeline.push_back(*cvtOutVal_);
-  }
 }

 void MKLDNNPoolLayer::resetBwdBuffers(MKLDNNMatrixPtr& in,
                                      MKLDNNMatrixPtr& out) {
-  resetOutGrad(out);
-
-  resetInGrad(in);
-}
-void MKLDNNPoolLayer::resetOutGrad(MKLDNNMatrixPtr& out) {
-  cpuOutGrad_ = nullptr;
-  cvtOutGrad_ = nullptr;
-  CHECK(outVal_);
-  if (outputIsOnlyMKLDNN()) {
-    MKLDNNLayer::resetOutGrad(out, outVal_->getPrimitiveDesc());
-  } else {
-    const MatrixPtr& cpuOut = getOutput(CPU_DEVICE).grad;
-    // always share the same grad data of CPU output
-    // then the activation can get the right grad from output_.grad
-    output_.grad->setData(cpuOut->getData());
-    cpuOutGrad_ = MKLDNNMatrix::create(
-        cpuOut, memory::dims{bs_, oc_, oh_, ow_}, format::nchw, engine_);
-    if (cpuOutGrad_->getPrimitiveDesc() != outVal_->getPrimitiveDesc()) {
-      out = MKLDNNMatrix::create(nullptr, outVal_->getPrimitiveDesc());
-      cvtOutGrad_ = MKLDNNMatrix::createReorder(cpuOutGrad_, out);
-      CHECK(cvtOutGrad_) << "should not be emptry";
-    } else {
-      out = cpuOutGrad_;
-    }
-  }
-}
-
-void MKLDNNPoolLayer::resetInGrad(MKLDNNMatrixPtr& in) {
-  in = nullptr;
-  if (inputLayers_[0]->getOutput().grad == nullptr) {
-    return;
-  }
-  CHECK(inVal_);
-  MKLDNNLayer::resetInGrad(in, inVal_->getPrimitiveDesc());
+  CHECK(inVal_ && outVal_);
+  resetOutGrad(out, outVal_->getPrimitiveDesc());
+  resetInGrad(in, inVal_->getPrimitiveDesc());
 }

 void MKLDNNPoolLayer::resetBwdPD(std::shared_ptr<pool_bwd::primitive_desc>& pd,
                                 MKLDNNMatrixPtr& in,
                                 MKLDNNMatrixPtr& out) {
+  pd = nullptr;
+  if (in == nullptr) {
+    return;
+  }
  memory::dims kernels = memory::dims{fh_, fw_};
  memory::dims strides = memory::dims{sh_, sw_};
  memory::dims padL = memory::dims{ph_, pw_};
  memory::dims padR = getPaddingR();
-  CHECK(in);
  CHECK(out);
  auto bwdDesc = pool_bwd::desc(poolAlgo_,
                                in->getMemoryDesc(),
@@ -263,8 +186,8 @@ void MKLDNNPoolLayer::resetBwdPipeline(
    std::shared_ptr<pool_bwd::primitive_desc>& pd,
    MKLDNNMatrixPtr& in,
    MKLDNNMatrixPtr& out) {
-  if (cvtOutGrad_) {
-    pipeline.push_back(*cvtOutGrad_);
+  if (pd == nullptr) {
+    return;
  }

  bwdData_ =

--- a/paddle/gserver/layers/MKLDNNPoolLayer.h
+++ b/paddle/gserver/layers/MKLDNNPoolLayer.h
@@ -38,13 +38,6 @@ protected:
  // pooling_avg or pooling_max
  mkldnn::algorithm poolAlgo_;

-  // MKLDNNMatrixPtr which should be created from CPU Device
-  MKLDNNMatrixPtr cpuOutVal_;
-  MKLDNNMatrixPtr cpuOutGrad_;
-  // convert handle between CPU device and MKLDNN device
-  std::shared_ptr<mkldnn::reorder> cvtOutVal_;
-  std::shared_ptr<mkldnn::reorder> cvtOutGrad_;
-
  // save forward primitive_desc, which can be used backward
  std::shared_ptr<pool_fwd::primitive_desc> fwdPD_;
  // according to https://github.com/01org/mkl-dnn/blob/master/tests/gtests/
@@ -74,8 +67,6 @@ public:
                MKLDNNMatrixPtr& bias,
                MKLDNNMatrixPtr& out) override;

-  void updateInputData() override;
-
  void printSizeInfo() override {
    MKLDNNLayer::printSizeInfo();
    VLOG(MKLDNN_SIZES) << getName() << ": fh: " << fh_ << ", fw: " << fw_
@@ -90,8 +81,6 @@ protected:
   *                    reset pipeline.
   */
  void resetFwdBuffers(MKLDNNMatrixPtr& in, MKLDNNMatrixPtr& out);
-  void resetInValue(MKLDNNMatrixPtr& in);
-  void resetOutValue(MKLDNNMatrixPtr& out);
  void resetFwdPD(std::shared_ptr<pool_fwd::primitive_desc>& pd,
                  MKLDNNMatrixPtr in,
                  MKLDNNMatrixPtr out);
@@ -106,8 +95,6 @@ protected:
   *                     reset pipeline.
   */
  void resetBwdBuffers(MKLDNNMatrixPtr& in, MKLDNNMatrixPtr& out);
-  void resetOutGrad(MKLDNNMatrixPtr& out);
-  void resetInGrad(MKLDNNMatrixPtr& in);
  void resetBwdPD(std::shared_ptr<pool_bwd::primitive_desc>& pd,
                  MKLDNNMatrixPtr& in,
                  MKLDNNMatrixPtr& out);

--- a/paddle/gserver/tests/MKLDNNTester.cpp
+++ b/paddle/gserver/tests/MKLDNNTester.cpp
@@ -97,7 +97,7 @@ void MKLDNNTester::randomWgtDatas() {
    parameters_[REF][i]->randomize();
    dnnValue->copyFrom(*refValue);

-    VLOG(lvl_) << "Random weight data " << parameters_[DNN][i]->getName();
+    VLOG(MKLDNN_TESTS) << "Random weight " << parameters_[DNN][i]->getName();
    printVector(dnnValue);
  }
 }
@@ -109,7 +109,7 @@ void MKLDNNTester::randomBotDatas() {
    dataLayers_[REF][i]->getOutputValue()->randomizeUniform();
    dataLayers_[DNN][i]->getOutputValue()->copyFrom(
        *(dataLayers_[REF][i]->getOutputValue()));
-    VLOG(lvl_) << "Input " << i << " data:";
+    VLOG(MKLDNN_TESTS) << "Random Forward, InputValue " << i;
    printMatrix(dataLayers_[REF][i]->getOutputValue());
  }
 }
@@ -118,12 +118,12 @@ void MKLDNNTester::randomTopDiffs() {
  refLayer_->getOutputGrad()->randomizeUniform();
  dnnLayer_->getOutput(CPU_DEVICE)
      .grad->copyFrom(*(refLayer_->getOutputGrad()));
-  VLOG(lvl_) << "Random Backward Input, TopDiff: ";
+  VLOG(MKLDNN_TESTS) << "Random Backward, OutputGrad";
  printMatrix(refLayer_->getOutputGrad());
 }

 void MKLDNNTester::checkForward() {
-  VLOG(MKLDNN_ALL) << "Check Forward";
+  VLOG(MKLDNN_TESTS) << "Check Forward";
  printTopDatas();
  double delta =
      compareMatrix(dnnLayer_->getOutputValue(), refLayer_->getOutputValue());
@@ -131,15 +131,15 @@ void MKLDNNTester::checkForward() {
 }

 void MKLDNNTester::checkBackwardData() {
-  VLOG(MKLDNN_ALL) << "Check Backward Data";
+  VLOG(MKLDNN_TESTS) << "Check Backward Data";
  // TODO(TJ): uncomment me when batch norm ready
  // const bool isBN = dnnLayer_->getType() == "mkldnn_batch_norm";
  for (size_t i = 0; i < dataLayers_[DNN].size(); ++i) {
    const MatrixPtr& dnnDiff = dataLayers_[DNN][i]->getOutputGrad();
    const MatrixPtr& refDiff = dataLayers_[REF][i]->getOutputGrad();
-    VLOG(lvl_) << "Mkldnn Backward Output BotDiff " << i;
+    VLOG(MKLDNN_ALL) << "MKLDNN Backward Result: InputGrad " << i;
    printMatrix(dnnDiff);
-    VLOG(lvl_) << "Reference Backward Output BotDiff " << i;
+    VLOG(MKLDNN_ALL) << "Reference Backward Result: InputGrad " << i;
    printMatrix(refDiff);

    double delta = compareMatrix(dnnDiff, refDiff);
@@ -153,7 +153,7 @@ void MKLDNNTester::checkBackwardData() {
 }

 void MKLDNNTester::checkBackwardWgts() {
-  VLOG(MKLDNN_ALL) << "Check Backward Weight";
+  VLOG(MKLDNN_TESTS) << "Check Backward Weight";
  CHECK_EQ(parameters_[DNN].size(), parameters_[REF].size());
  vector<VectorPtr> dnnWgts;  // used to temply save mkldnn weights
  saveWgt(parameters_[DNN], dnnWgts);
@@ -165,9 +165,11 @@ void MKLDNNTester::checkBackwardWgts() {
  for (size_t i = 0; i < parameters_[DNN].size(); ++i) {
    const VectorPtr& dnn = parameters_[DNN][i]->getBuf(PARAMETER_VALUE);
    const VectorPtr& ref = parameters_[REF][i]->getBuf(PARAMETER_VALUE);
-    VLOG(lvl_) << "Mkldnn Output weight " << parameters_[DNN][i]->getName();
+    VLOG(MKLDNN_ALL) << "MKLDNN Result: weight value "
+                     << parameters_[DNN][i]->getName();
    printVector(dnn);
-    VLOG(lvl_) << "Reference Output weight " << parameters_[REF][i]->getName();
+    VLOG(MKLDNN_ALL) << "Reference Result: weight value "
+                     << parameters_[REF][i]->getName();
    printVector(ref);

    double delta = compareVector(dnn, ref);
@@ -240,7 +242,8 @@ void MKLDNNTester::printTopDatas() {
  }

  for (int n = 0; n < NUM; ++n) {
-    VLOG(lvl_) << testLayers_[n]->getType() << " forward output TopData: ";
+    VLOG(MKLDNN_ALL) << testLayers_[n]->getType()
+                     << " Forward Result: OutputValue";
    printMatrix(testLayers_[n]->getOutputValue());
  }
 }
@@ -252,7 +255,7 @@ void MKLDNNTester::printMatrix(const MatrixPtr& m) {

  std::ostringstream ostr;
  m->print(ostr);
-  VLOG(lvl_) << std::endl << ostr.str();
+  VLOG(MKLDNN_ALL) << std::endl << ostr.str();
 }

 void MKLDNNTester::printVector(const VectorPtr& v) {
@@ -262,7 +265,7 @@ void MKLDNNTester::printVector(const VectorPtr& v) {

  std::ostringstream ostr;
  v->print(ostr, v->getSize());
-  VLOG(lvl_) << std::endl << ostr.str();
+  VLOG(MKLDNN_ALL) << std::endl << ostr.str();
 }

 double MKLDNNTester::getDelta(const real* d1,
@@ -314,7 +317,7 @@ void MKLDNNTester::runOnce() {
  UpdateCallback updateCallback = [](Parameter* para) {
    auto& grad = para->getBuf(PARAMETER_GRADIENT);
    auto& value = para->getBuf(PARAMETER_VALUE);
-    real lr = 1e-3;
+    real lr = 1e-2;
    value->add(*grad, lr);
    grad->zeroMem();
  };
@@ -340,10 +343,9 @@ void MKLDNNTester::run(const TestConfig& dnn,
                       size_t batchSize,
                       size_t inputImgH,
                       size_t inputImgW,
+                       bool printDetails,
                       size_t iter,
-                       float epsilon,
-                       bool log,
-                       int level) {
+                       float epsilon) {
  CHECK(dnn.layerConfig.type().compare(0, 7, "mkldnn_") == 0 ||
        dnn.layerConfig.active_type().compare(0, 7, "mkldnn_") == 0)
      << "should be MKLDNN layer or MKLDNN activation";
@@ -359,10 +361,9 @@ void MKLDNNTester::run(const TestConfig& dnn,

  ih_ = inputImgH;
  iw_ = inputImgW;
+  log_ = printDetails;
  iter_ = iter;
  eps_ = epsilon;
-  log_ = log;
-  lvl_ = level;

  // Firstly test mkldnn init from PARAM_FORMAT_ORIGINAL weight
  reset(dnn, ref, batchSize);
@@ -531,9 +532,11 @@ void MKLDNNTester::getOutResult(const std::string& configPath,
 void MKLDNNTester::compareResult(DataOut& ref, DataOut& dnn, float eps) {
  CHECK_EQ(ref.outValues.size(), dnn.outValues.size());
  CHECK_EQ(ref.paraValues.size(), dnn.paraValues.size());
+  VLOG(MKLDNN_TESTS) << "compare value size: " << ref.outValues.size();
  for (size_t i = 0; i < ref.outValues.size(); i++) {
    EXPECT_LE(fabs(compareMatrix(ref.outValues[i], dnn.outValues[i])), eps);
  }
+  VLOG(MKLDNN_TESTS) << "compare param size: " << ref.paraValues.size();
  for (size_t i = 0; i < ref.paraValues.size(); i++) {
    EXPECT_LE(fabs(compareVector(ref.paraValues[i], dnn.paraValues[i])), eps);
  }
@@ -544,9 +547,10 @@ void MKLDNNTester::runBranchesTest(const std::string& configPath,
                                   float eps) {
  DataIn in;
  initArgument(in, configPath, iter);
-
  DataOut outCpu, outDnn;
+  VLOG(MKLDNN_TESTS) << "running cpu network";
  getOutResult(configPath, in, outCpu, false, iter);
+  VLOG(MKLDNN_TESTS) << "running mkldnn network";
  getOutResult(configPath, in, outDnn, true, iter);

  compareResult(outCpu, outDnn, eps);

--- a/paddle/gserver/tests/MKLDNNTester.h
+++ b/paddle/gserver/tests/MKLDNNTester.h
@@ -58,8 +58,6 @@ protected:
  size_t iter_;
  /// whether to print out the details
  bool log_;
-  /// vlog level to print the matrix details datas
-  int lvl_;
  /// epsilon
  float eps_;
  /// input image size, default 1
@@ -70,7 +68,6 @@ public:
    iter_ = iter;
    eps_ = epsilon;
    log_ = false;
-    lvl_ = MKLDNN_ALL;
  }

  ~MKLDNNTester() {}
@@ -81,10 +78,9 @@ public:
           size_t batchSize,
           size_t inputImgH = 1,
           size_t inputImgW = 1,
+           bool printDetails = false,
           size_t iter = 3,
-           float epsilon = 1e-4,
-           bool log = false,
-           int level = MKLDNN_ALL);
+           float epsilon = 1e-4);
  static void runBranchesTest(const std::string& configPath,
                              size_t iter = 3,
                              float eps = 1e-4);

--- a/paddle/gserver/tests/mkldnn_branches_fc.conf
+++ b/paddle/gserver/tests/mkldnn_branches_fc.conf
+# Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from paddle.trainer_config_helpers import *
+
+settings(batch_size=16)
+channels = get_config_arg("channels", int, 2)
+
+def two_fc(input, group_name):
+  out1 = fc_layer(input=input,
+            name=group_name+'_fc1',
+            size=channels,
+            bias_attr=False,
+            act=LinearActivation())
+
+  out2 = fc_layer(input=input,
+            name=group_name+'_fc2',
+            size=channels,
+            bias_attr=False,
+            act=LinearActivation())
+  return out1, out2
+
+data = data_layer(name="input", size=channels*16*16)
+
+conv = img_conv_layer(input=data,
+            num_channels=channels,
+            filter_size=3,
+            num_filters=channels,
+            padding=1,
+            shared_biases=True,
+            act=LinearActivation())
+
+pool = img_pool_layer(input=conv,
+            pool_size=3,
+            stride=2,
+            padding=1,
+            pool_type=AvgPooling())
+
+a1, a2 = two_fc(input=pool, group_name='a')
+
+concat = concat_layer(input=[a1, a2])
+
+b1, b2 = two_fc(input=pool, group_name='b')
+
+addto = addto_layer(input=[b1, b2])
+
+outputs([concat, addto])
--- a/paddle/gserver/tests/mkldnn_branches_pool.conf
+++ b/paddle/gserver/tests/mkldnn_branches_pool.conf
+# Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from paddle.trainer_config_helpers import *
+
+settings(batch_size=16)
+channels = get_config_arg("channels", int, 2)
+
+def two_pool(input, group_name):
+  out1 = img_pool_layer(input=input,
+            name=group_name+'_pool1',
+            pool_size=3,
+            stride=2,
+            padding=0,
+            pool_type=MaxPooling())
+
+  out2 = img_pool_layer(input=input,
+            name=group_name+'_pool2',
+            pool_size=5,
+            stride=2,
+            padding=1,
+            pool_type=MaxPooling())
+  return out1, out2
+
+data = data_layer(name="input", size=channels*16*16)
+
+conv = img_conv_layer(input=data,
+            num_channels=channels,
+            filter_size=3,
+            num_filters=channels,
+            padding=1,
+            shared_biases=True,
+            act=LinearActivation())
+
+pool = img_pool_layer(input=conv,
+            pool_size=3,
+            stride=1,
+            padding=1,
+            pool_type=AvgPooling())
+
+a1, a2 = two_pool(input=pool, group_name='a')
+
+concat = concat_layer(input=[a1, a2])
+
+b1, b2 = two_pool(input=pool, group_name='b')
+
+addto = addto_layer(input=[b1, b2])
+
+outputs([concat, addto])
--- a/paddle/gserver/tests/test_MKLDNN.cpp
+++ b/paddle/gserver/tests/test_MKLDNN.cpp
@@ -250,7 +250,7 @@ TEST(MKLDNNActivation, Activations) {

 DECLARE_string(config_args);
 TEST(MKLDNNLayer, branches) {
-  std::vector<std::string> cases = {"conv"};
+  std::vector<std::string> cases = {"conv", "pool", "fc"};
  for (auto name : cases) {
    std::string config = "./gserver/tests/mkldnn_branches_" + name + ".conf";
    for (auto channels : {2, 32}) {

--- a/paddle/gserver/tests/test_PyDataProvider2.py
+++ b/paddle/gserver/tests/test_PyDataProvider2.py
@@ -51,7 +51,10 @@ def test_sparse_non_value_no_seq(setting, filename):
        yield [(i + 1) * (j + 1) for j in xrange(10)]


-@provider(input_types=[sparse_vector(30000, seq_type=SequenceType.NO_SEQUENCE)])
+@provider(input_types=[
+    sparse_float_vector(
+        30000, seq_type=SequenceType.NO_SEQUENCE)
+])
 def test_sparse_value_no_seq(setting, filename):
    for i in xrange(200):
        yield [((i + 1) * (j + 1), float(j) / float(i + 1)) for j in xrange(10)]

--- a/paddle/math/MKLDNNMatrix.cpp
+++ b/paddle/math/MKLDNNMatrix.cpp
@@ -18,7 +18,7 @@ using namespace mkldnn;  // NOLINT

 namespace paddle {

-MKLDNNMatrixPtr MKLDNNMatrix::create(MatrixPtr m, memory::primitive_desc pd) {
+MKLDNNMatrixPtr MKLDNNMatrix::create(memory::primitive_desc pd, MatrixPtr m) {
  memory::desc md = pd.desc();
  size_t ndims = md.data.ndims;
  int* dims = md.data.dims;
@@ -41,12 +41,12 @@ MKLDNNMatrixPtr MKLDNNMatrix::create(MatrixPtr m, memory::primitive_desc pd) {
  return std::make_shared<MKLDNNMatrix>(cpuMatrix, pd);
 }

-MKLDNNMatrixPtr MKLDNNMatrix::create(MatrixPtr m,
-                                     memory::dims dims,
+MKLDNNMatrixPtr MKLDNNMatrix::create(memory::dims dims,
                                     memory::format fmt,
                                     engine& eg,
+                                     MatrixPtr m,
                                     mkldnn::memory::data_type dtype) {
-  return create(m, memory::primitive_desc(memory::desc(dims, dtype, fmt), eg));
+  return create(createPrimitiveDesc(dims, fmt, eg, dtype), m);
 }

 std::shared_ptr<reorder> MKLDNNMatrix::createReorder(const MKLDNNMatrixPtr& src,

--- a/paddle/math/MKLDNNMatrix.h
+++ b/paddle/math/MKLDNNMatrix.h
@@ -40,24 +40,37 @@ public:
  /**
   * Create MKLDNNMatrix from a MatrixPtr and memory primitive_desc
   */
-  static MKLDNNMatrixPtr create(MatrixPtr m, mkldnn::memory::primitive_desc pd);
+  static MKLDNNMatrixPtr create(mkldnn::memory::primitive_desc pd,
+                                MatrixPtr m = nullptr);

  /**
   * Create MKLDNNMatrix from a MatrixPtr and memory details info
   */
  static MKLDNNMatrixPtr create(
-      MatrixPtr m,
      mkldnn::memory::dims dims,
      mkldnn::memory::format fmt,
      mkldnn::engine& eg,
+      MatrixPtr m = nullptr,
      mkldnn::memory::data_type dtype = mkldnn::memory::data_type::f32);

+  /**
+   * Create primitive descriptor.
+   * default with f32 dtype
+   */
+  static mkldnn::memory::primitive_desc createPrimitiveDesc(
+      const mkldnn::memory::dims dims,
+      const mkldnn::memory::format& fmt,
+      const mkldnn::engine& eg,
+      const mkldnn::memory::data_type& dtype = mkldnn::memory::data_type::f32) {
+    return mkldnn::memory::primitive_desc(memory::desc(dims, dtype, fmt), eg);
+  }
+
  /**
   * Create Memory descriptor.
   * default with any format and f32 dtype
   */
  static mkldnn::memory::desc createMemoryDesc(
-      const mkldnn::memory::dims& dims,
+      const mkldnn::memory::dims dims,
      const mkldnn::memory::format& fmt = mkldnn::memory::format::any,
      const mkldnn::memory::data_type& dtype = mkldnn::memory::data_type::f32) {
    return mkldnn::memory::desc(dims, dtype, fmt);
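
The reordering of `create` above moves `MatrixPtr m` behind the descriptor arguments so that it can default to `nullptr`; C++ only permits default arguments on trailing parameters. A minimal sketch of that pattern follows. `PrimitiveDesc`, `BufferPtr`, and `Wrapped` are hypothetical stand-ins, not MKL-DNN or Paddle types. What the real `create` does with a null `MatrixPtr` is not shown in this hunk, so allocating a fresh buffer is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-ins for mkldnn::memory::primitive_desc and MatrixPtr.
struct PrimitiveDesc {
  std::vector<int> dims;
};
using BufferPtr = std::shared_ptr<std::vector<float>>;

struct Wrapped {
  BufferPtr buffer;
  bool owns_new_buffer;  // true when no buffer was passed in
};

// The optional buffer is the last parameter so that it can take a default.
inline Wrapped create(const PrimitiveDesc& pd, BufferPtr m = nullptr) {
  std::size_t count = 1;
  for (int d : pd.dims) {
    assert(d > 0);
    count *= static_cast<std::size_t>(d);
  }
  if (m == nullptr) {
    // Assumption: allocate a zero-filled buffer when the caller passes none.
    return {std::make_shared<std::vector<float>>(count, 0.0f), true};
  }
  assert(m->size() == count);  // a supplied buffer must match the descriptor
  return {m, false};
}
```

With this ordering, callers that have no existing matrix can write `create(pd)`, while callers that do pass it as the second argument.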

--- a/paddle/operators/accuracy_op.cc
+++ b/paddle/operators/accuracy_op.cc
@@ -69,5 +69,8 @@ information, or not. But the output only shares the LoD with input `Inference`.

 namespace ops = paddle::operators;
 REGISTER_OP_WITHOUT_GRADIENT(accuracy, ops::AccuracyOp, ops::AccuracyOpMaker);
-REGISTER_OP_CPU_KERNEL(accuracy,
-                       ops::AccuracyKernel<paddle::platform::CPUPlace, float>);
+REGISTER_OP_CPU_KERNEL(
+    accuracy, ops::AccuracyKernel<paddle::platform::CPUPlace, float>,
+    ops::AccuracyKernel<paddle::platform::CPUPlace, int>,
+    ops::AccuracyKernel<paddle::platform::CPUPlace, double>,
+    ops::AccuracyKernel<paddle::platform::CPUPlace, int64_t>);
--- a/paddle/operators/accuracy_op.cu
+++ b/paddle/operators/accuracy_op.cu
@@ -21,9 +21,9 @@ namespace paddle {
 namespace operators {
 using platform::PADDLE_CUDA_NUM_THREADS;

-template <int BlockSize>
-__global__ void AccuracyCudaKernel(const int N, const int D, const int* Xdata,
-                                   const int* labeldata, float* accuracy) {
+template <typename T, int BlockSize>
+__global__ void AccuracyCudaKernel(const int N, const int D, const T* Xdata,
+                                   const T* labeldata, float* accuracy) {
  int count = 0;
  __shared__ int total[BlockSize];

@@ -57,8 +57,8 @@ class AccuracyOpCUDAKernel : public framework::OpKernel<T> {
    auto* accuracy = ctx.Output<Tensor>("Accuracy");
    // FIXME(typhoonzero): only support indices currently
    // if add support for output values, how to detect the data type?
-    const int* inference_data = inference->data<int>();
-    const int* label_data = label->data<int>();
+    const T* inference_data = inference->data<T>();
+    const T* label_data = label->data<T>();
    float* accuracy_data = accuracy->mutable_data<float>(ctx.GetPlace());

    size_t num_samples = inference->dims()[0];
@@ -69,7 +69,7 @@ class AccuracyOpCUDAKernel : public framework::OpKernel<T> {
      return;
    }

-    AccuracyCudaKernel<PADDLE_CUDA_NUM_THREADS><<<
+    AccuracyCudaKernel<T, PADDLE_CUDA_NUM_THREADS><<<
        1, PADDLE_CUDA_NUM_THREADS, 0,
        reinterpret_cast<const platform::CUDADeviceContext&>(
            ctx.device_context())
@@ -81,5 +81,7 @@ class AccuracyOpCUDAKernel : public framework::OpKernel<T> {
 }  // namespace operators
 }  // namespace paddle

-REGISTER_OP_GPU_KERNEL(accuracy,
-                       paddle::operators::AccuracyOpCUDAKernel<float>);
+REGISTER_OP_GPU_KERNEL(accuracy, paddle::operators::AccuracyOpCUDAKernel<float>,
+                       paddle::operators::AccuracyOpCUDAKernel<double>,
+                       paddle::operators::AccuracyOpCUDAKernel<int>,
+                       paddle::operators::AccuracyOpCUDAKernel<int64_t>);
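
The change above templates the accuracy kernel on the element type `T`, so the same logic can be registered for `float`, `double`, `int`, and `int64_t`. A CPU-side sketch of the idea follows. It assumes a sample counts as correct when any of its `D` predicted indices equals its label; the kernel body is not shown in this hunk, so that rule is an assumption, and `TopKAccuracy` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// indices holds N rows of D predicted indices (row-major); labels holds N.
// The result is always float, matching the float* accuracy output above.
template <typename T>
float TopKAccuracy(const std::vector<T>& indices, const std::vector<T>& labels,
                   int D) {
  assert(D > 0);
  const std::size_t N = labels.size();
  const std::size_t width = static_cast<std::size_t>(D);
  assert(indices.size() == N * width);
  if (N == 0) return 0.0f;

  int correct = 0;
  for (std::size_t i = 0; i < N; ++i) {
    for (std::size_t j = 0; j < width; ++j) {
      if (indices[i * width + j] == labels[i]) {
        ++correct;
        break;  // count each sample at most once
      }
    }
  }
  return static_cast<float>(correct) / static_cast<float>(N);
}
```

Only the index and label types vary with `T`, which is why one template covers every registered kernel type.
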
--- a/paddle/operators/adam_op.cc
+++ b/paddle/operators/adam_op.cc
@@ -43,10 +43,6 @@ class AdamOp : public framework::OperatorWithKernel {
                   "Output(Moment1Out) of AdamOp should not be null.");
    PADDLE_ENFORCE(ctx->HasOutput("Moment2Out"),
                   "Output(Moment2Out) of AdamOp should not be null.");
-    PADDLE_ENFORCE(ctx->HasOutput("Beta1PowOut"),
-                   "Output(Beta1PowOut) of AdamOp should not be null.");
-    PADDLE_ENFORCE(ctx->HasOutput("Beta2PowOut"),
-                   "Output(Beta2PowOut) of AdamOp should not be null.");

    auto lr_dims = ctx->GetInputDim("LearningRate");
    PADDLE_ENFORCE_EQ(framework::product(lr_dims), 1,
@@ -72,8 +68,6 @@ class AdamOp : public framework::OperatorWithKernel {
    ctx->SetOutputDim("ParamOut", param_dims);
    ctx->SetOutputDim("Moment1Out", param_dims);
    ctx->SetOutputDim("Moment2Out", param_dims);
-    ctx->SetOutputDim("Beta1PowOut", beta1_pow_dims);
-    ctx->SetOutputDim("Beta2PowOut", beta2_pow_dims);
  }
 };

@@ -92,8 +86,6 @@ class AdamOpMaker : public framework::OpProtoAndCheckerMaker {
    AddOutput("ParamOut", "(Tensor) Output parameter");
    AddOutput("Moment1Out", "(Tensor) Output first moment");
    AddOutput("Moment2Out", "(Tensor) Output second moment");
-    AddOutput("Beta1PowOut", "(Tensor) Output beta1 power accumulator");
-    AddOutput("Beta2PowOut", "(Tensor) Output beta2 power accumulator");

    AddAttr<float>("beta1",
                   "(float, default 0.9) "
@@ -121,10 +113,8 @@ Adam updates:

 moment1_out = beta1 * moment1 + (1 − beta1) * grad
 moment2_out = beta2 * moment2 + (1 − beta2) * grad * grad
-beta1_pow_out = beta1_pow * beta1
-beta2_pow_out = beta2_pow * beta2
 learning_rate_t = learning_rate_t *
-                  sqrt(1 - beta2_pow_out) / (1 - beta1_pow_out)
+                  sqrt(1 - beta2_pow) / (1 - beta1_pow)
 param_out = param - learning_rate_t * moment1/ (sqrt(moment2) + epsilon)

 References:

--- a/paddle/operators/adam_op.h
+++ b/paddle/operators/adam_op.h
@@ -26,14 +26,10 @@ class AdamOpKernel : public framework::OpKernel<T> {
    auto param_out_tensor = ctx.Output<framework::Tensor>("ParamOut");
    auto moment1_out_tensor = ctx.Output<framework::Tensor>("Moment1Out");
    auto moment2_out_tensor = ctx.Output<framework::Tensor>("Moment2Out");
-    auto beta1_pow_out_tensor = ctx.Output<framework::Tensor>("Beta1PowOut");
-    auto beta2_pow_out_tensor = ctx.Output<framework::Tensor>("Beta2PowOut");

    param_out_tensor->mutable_data<T>(ctx.GetPlace());
    moment1_out_tensor->mutable_data<T>(ctx.GetPlace());
    moment2_out_tensor->mutable_data<T>(ctx.GetPlace());
-    beta1_pow_out_tensor->mutable_data<T>(ctx.GetPlace());
-    beta2_pow_out_tensor->mutable_data<T>(ctx.GetPlace());

    float beta1 = ctx.Attr<float>("beta1");
    float beta2 = ctx.Attr<float>("beta2");
@@ -56,18 +52,13 @@ class AdamOpKernel : public framework::OpKernel<T> {
    auto param_out = framework::EigenVector<T>::Flatten(*param_out_tensor);
    auto moment1_out = framework::EigenVector<T>::Flatten(*moment1_out_tensor);
    auto moment2_out = framework::EigenVector<T>::Flatten(*moment2_out_tensor);
-    auto beta1_pow_out =
-        framework::EigenVector<T>::Flatten(*beta1_pow_out_tensor);
-    auto beta2_pow_out =
-        framework::EigenVector<T>::Flatten(*beta2_pow_out_tensor);
    auto place = ctx.GetEigenDevice<Place>();

    moment1_out.device(place) = beta1 * moment1 + (1 - beta1) * grad;
    moment2_out.device(place) = beta2 * moment2 + (1 - beta2) * grad.square();
-    beta1_pow_out.device(place) = beta1_pow * beta1;
-    beta2_pow_out.device(place) = beta2_pow * beta2;
+
    // All of these are tensors of 1 element
-    auto lr_t = lr * (1 - beta2_pow_out).sqrt() / (1 - beta1_pow_out);
+    auto lr_t = lr * (1 - beta2_pow).sqrt() / (1 - beta1_pow);
    // Eigen does not support automatic broadcast
    // Get dimensions of moment vector to broadcast lr_t
    Eigen::DSizes<int, 1> m_dsize(moment1_out_tensor->numel());
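
After this change the kernel only reads `beta1_pow` and `beta2_pow`; it no longer writes updated powers. A scalar sketch of the resulting step follows, without Eigen. `AdamState` and `AdamStep` are hypothetical names. The parameter update uses the freshly updated moments, as in standard Adam; the kernel's own parameter-update line is not shown in this hunk, so treat that detail as an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct AdamState {
  std::vector<float> param;
  std::vector<float> moment1;
  std::vector<float> moment2;
};

// beta1_pow and beta2_pow are read-only inputs; the caller is responsible
// for advancing them between steps.
inline void AdamStep(AdamState* s, const std::vector<float>& grad, float lr,
                     float beta1, float beta2, float epsilon, float beta1_pow,
                     float beta2_pow) {
  assert(s != nullptr);
  assert(s->param.size() == grad.size());
  assert(s->moment1.size() == grad.size());
  assert(s->moment2.size() == grad.size());

  // Bias-corrected learning rate, mirroring lr_t in the kernel above.
  const float lr_t = lr * std::sqrt(1.0f - beta2_pow) / (1.0f - beta1_pow);

  for (std::size_t i = 0; i < grad.size(); ++i) {
    s->moment1[i] = beta1 * s->moment1[i] + (1.0f - beta1) * grad[i];
    s->moment2[i] = beta2 * s->moment2[i] + (1.0f - beta2) * grad[i] * grad[i];
    s->param[i] -= lr_t * s->moment1[i] / (std::sqrt(s->moment2[i]) + epsilon);
  }
}
```

Because the op no longer produces `Beta1PowOut` or `Beta2PowOut`, something outside it has to multiply the powers by `beta1` and `beta2` after each step; this hunk does not show where that happens.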

--- a/paddle/operators/adamax_op.cc
+++ b/paddle/operators/adamax_op.cc
@@ -41,8 +41,6 @@ class AdamaxOp : public framework::OperatorWithKernel {
                   "Output(MomentOut) of AdamaxOp should not be null.");
    PADDLE_ENFORCE(ctx->HasOutput("InfNormOut"),
                   "Output(InfNormOut) of AdamaxOp should not be null.");
-    PADDLE_ENFORCE(ctx->HasOutput("Beta1PowOut"),
-                   "Output(Beta1PowOut) of AdamaxOp should not be null.");

    auto lr_dims = ctx->GetInputDim("LearningRate");
    PADDLE_ENFORCE_EQ(framework::product(lr_dims), 1,
@@ -64,7 +62,6 @@ class AdamaxOp : public framework::OperatorWithKernel {
    ctx->SetOutputDim("ParamOut", param_dims);
    ctx->SetOutputDim("MomentOut", param_dims);
    ctx->SetOutputDim("InfNormOut", param_dims);
-    ctx->SetOutputDim("Beta1PowOut", beta1_pow_dims);
  }
 };

@@ -86,7 +83,6 @@ class AdamaxOpMaker : public framework::OpProtoAndCheckerMaker {
    AddOutput("InfNormOut",
              "(Tensor) "
              "Output exponentially weighted infinity norm");
-    AddOutput("Beta1PowOut", "(Tensor) Output beta1 power accumulator");

    AddAttr<float>("beta1",
                   "(float, default 0.9) "
@@ -113,8 +109,7 @@ Adamax updates:

 moment_out = beta1 * moment + (1 - beta1) * grad
 inf_norm_out = max(beta2 * inf_norm + epsilon, abs(grad))
-beta1_pow_out = beta1_pow * beta1
-learning_rate_t = learning_rate/(1 - beta1_pow_out)
+learning_rate_t = learning_rate/(1 - beta1_pow)
 param_out = param - learning_rate_t * moment_out/inf_norm_out

 The original paper does not have an epsilon attribute.

--- a/paddle/operators/adamax_op.h
+++ b/paddle/operators/adamax_op.h
@@ -26,12 +26,10 @@ class AdamaxOpKernel : public framework::OpKernel<T> {
    auto param_out_tensor = ctx.Output<framework::Tensor>("ParamOut");
    auto moment_out_tensor = ctx.Output<framework::Tensor>("MomentOut");
    auto inf_norm_out_tensor = ctx.Output<framework::Tensor>("InfNormOut");
-    auto beta1_pow_out_tensor = ctx.Output<framework::Tensor>("Beta1PowOut");

    param_out_tensor->mutable_data<T>(ctx.GetPlace());
    moment_out_tensor->mutable_data<T>(ctx.GetPlace());
    inf_norm_out_tensor->mutable_data<T>(ctx.GetPlace());
-    beta1_pow_out_tensor->mutable_data<T>(ctx.GetPlace());

    float beta1 = ctx.Attr<float>("beta1");
    float beta2 = ctx.Attr<float>("beta2");
@@ -53,15 +51,12 @@ class AdamaxOpKernel : public framework::OpKernel<T> {
    auto moment_out = framework::EigenVector<T>::Flatten(*moment_out_tensor);
    auto inf_norm_out =
        framework::EigenVector<T>::Flatten(*inf_norm_out_tensor);
-    auto beta1_pow_out =
-        framework::EigenVector<T>::Flatten(*beta1_pow_out_tensor);
    auto place = ctx.GetEigenDevice<Place>();

    moment_out.device(place) = beta1 * moment + (1 - beta1) * grad;
    inf_norm_out.device(place) =
        grad.abs().cwiseMax((beta2 * inf_norm) + epsilon);
-    beta1_pow_out.device(place) = beta1_pow * beta1;
-    auto lr_t = lr / (1 - beta1_pow_out);
+    auto lr_t = lr / (1 - beta1_pow);
    Eigen::DSizes<int, 1> m_dsize(moment_out_tensor->numel());
    param_out.device(place) =
        param - lr_t.broadcast(m_dsize) * (moment_out / inf_norm_out);

--- a/paddle/operators/clip_op.cc
+++ b/paddle/operators/clip_op.cc
@@ -27,8 +27,8 @@ class ClipOp : public framework::OperatorWithKernel {
    PADDLE_ENFORCE(ctx->HasOutput("Out"),
                   "Output(Out) of ClipOp should not be null.");
    auto x_dims = ctx->GetInputDim("X");
-    auto max = Attr<float>("max");
-    auto min = Attr<float>("min");
+    auto max = ctx->Attrs().Get<float>("max");
+    auto min = ctx->Attrs().Get<float>("min");
    PADDLE_ENFORCE_LT(min, max, "max should be greater than min.");
    ctx->SetOutputDim("Out", x_dims);
    ctx->ShareLoD("X", /*->*/ "Out");

--- a/paddle/operators/conv2d_op.h
+++ b/paddle/operators/conv2d_op.h
@@ -108,17 +108,17 @@ class GemmConv2DKernel : public framework::OpKernel<T> {
    int in_step = input_channels / groups;
    int out_step = output_channels / groups;
    for (int i = 0; i < batch_size; i++) {
-      Tensor in_batch = input->Slice<T>(i, i + 1).Resize(input_shape);
-      Tensor out_batch = output->Slice<T>(i, i + 1).Resize(output_matrix_shape);
+      Tensor in_batch = input->Slice(i, i + 1).Resize(input_shape);
+      Tensor out_batch = output->Slice(i, i + 1).Resize(output_matrix_shape);
      for (int g = 0; g < groups; g++) {
        // im2col
-        Tensor in_slice = in_batch.Slice<T>(g * in_step, (g + 1) * in_step);
+        Tensor in_slice = in_batch.Slice(g * in_step, (g + 1) * in_step);
        im2col(context.device_context(), in_slice, col, strides[0], strides[1],
               paddings[0], paddings[1]);

        // gemm
-        Tensor out_slice = out_batch.Slice<T>(g * out_step, (g + 1) * out_step);
-        Tensor filter_slice = filter.Slice<T>(g * out_step, (g + 1) * out_step);
+        Tensor out_slice = out_batch.Slice(g * out_step, (g + 1) * out_step);
+        Tensor filter_slice = filter.Slice(g * out_step, (g + 1) * out_step);
        math::matmul<Place, T>(context.device_context(), filter_slice, false,
                               col_matrix, false, T(1.0), &out_slice, T(0.0));
      }
@@ -198,22 +198,20 @@ class GemmConvGrad2DKernel : public framework::OpKernel<T> {

      for (int i = 0; i < batch_size; i++) {
        Tensor out_grad_batch =
-            output_grad->Slice<T>(i, i + 1).Resize(output_matrix_shape);
-        Tensor in_grad_batch =
-            input_grad->Slice<T>(i, i + 1).Resize(input_shape);
+            output_grad->Slice(i, i + 1).Resize(output_matrix_shape);
+        Tensor in_grad_batch = input_grad->Slice(i, i + 1).Resize(input_shape);
        for (int g = 0; g < groups; g++) {
          // gemm
          Tensor out_grad_slice =
-              out_grad_batch.Slice<T>(g * out_step, (g + 1) * out_step);
-          Tensor filter_slice =
-              filter.Slice<T>(g * out_step, (g + 1) * out_step);
+              out_grad_batch.Slice(g * out_step, (g + 1) * out_step);
+          Tensor filter_slice = filter.Slice(g * out_step, (g + 1) * out_step);
          math::matmul<Place, T>(context.device_context(), filter_slice, true,
                                 out_grad_slice, false, T(1.0), &col_matrix,
                                 T(0.0));

          // col2im
          Tensor in_grad_slice =
-              in_grad_batch.Slice<T>(g * in_step, (g + 1) * in_step);
+              in_grad_batch.Slice(g * in_step, (g + 1) * in_step);
          col2im(context.device_context(), in_grad_slice, col, strides[0],
                 strides[1], paddings[0], paddings[1]);
        }
@@ -229,19 +227,19 @@ class GemmConvGrad2DKernel : public framework::OpKernel<T> {

      for (int i = 0; i < batch_size; i++) {
        Tensor out_grad_batch =
-            output_grad->Slice<T>(i, i + 1).Resize(output_matrix_shape);
-        Tensor in_batch = input->Slice<T>(i, i + 1).Resize(input_shape);
+            output_grad->Slice(i, i + 1).Resize(output_matrix_shape);
+        Tensor in_batch = input->Slice(i, i + 1).Resize(input_shape);
        for (int g = 0; g < groups; g++) {
          // im2col
          Tensor out_grad_slice =
-              out_grad_batch.Slice<T>(g * out_step, (g + 1) * out_step);
-          Tensor in_slice = in_batch.Slice<T>(g * in_step, (g + 1) * in_step);
+              out_grad_batch.Slice(g * out_step, (g + 1) * out_step);
+          Tensor in_slice = in_batch.Slice(g * in_step, (g + 1) * in_step);
          im2col(context.device_context(), in_slice, col, strides[0],
                 strides[1], paddings[0], paddings[1]);

          // gemm
          Tensor filter_grad_slice =
-              filter_grad_.Slice<T>(g * out_step, (g + 1) * out_step);
+              filter_grad_.Slice(g * out_step, (g + 1) * out_step);
          math::matmul<Place, T>(context.device_context(), out_grad_slice,
                                 false, col_matrix, true, T(1.0),
                                 &filter_grad_slice, T(1.0));

--- a/paddle/operators/dynamic_recurrent_op.cc
+++ b/paddle/operators/dynamic_recurrent_op.cc
@@ -23,6 +23,7 @@ using framework::Scope;
 using framework::TensorArray;
 using framework::LoDTensor;
 using framework::Variable;
+using framework::OperatorBase;
 using framework::DySeqMetaBatch;

 namespace detail {
@@ -43,72 +44,72 @@ inline void CreateVariables(Scope& scope,
 * be reordered, but the RNN op should not change the `boot_state` as an input
 * variable's content.
 */
-template <typename T>
-inline void ReorderBootState(const DySeqMetaBatch& metas,
-                             const LoDTensor& boot_state, LoDTensor* tensor,
-                             const platform::Place& dst_place) {
+inline void ReorderInitialState(const DySeqMetaBatch& metas,
+                                const LoDTensor& boot_state, LoDTensor* tensor,
+                                const platform::Place& dst_place) {
  for (size_t seq_id = 0; seq_id < metas.size(); seq_id++) {
-    auto slice = tensor->Slice<T>(seq_id, seq_id + 1);
+    auto slice = tensor->Slice(seq_id, seq_id + 1);
    auto boot_slice =
-        boot_state.Slice<T>(metas[seq_id].ori_idx, metas[seq_id].ori_idx + 1);
+        boot_state.Slice(metas[seq_id].ori_idx, metas[seq_id].ori_idx + 1);
    // TODO(superjom) pass in device context as an argument
-    slice.template CopyFrom<T>(boot_slice, dst_place,
-                               platform::CPUDeviceContext());
+    slice.CopyFrom(boot_slice, dst_place, platform::CPUDeviceContext());
  }
 }

-}  // namespace detail
-
-class DynamicRecurrentOpProtoAndCheckerMaker
-    : public framework::OpProtoAndCheckerMaker {
- public:
-  DynamicRecurrentOpProtoAndCheckerMaker(framework::OpProto* proto,
-                                         framework::OpAttrChecker* op_checker)
-      : OpProtoAndCheckerMaker(proto, op_checker) {
-    const auto& name = DynamicRecurrentOp::kArgName;
-    // inputs and outputs stored in proto
-    AddInput(name.inlinks,
-             "the inputs that need to be segmented for each step.")
-        .AsDuplicable();
-    AddInput(name.boot_memories, "variables to initialize memories.")
-        .AsDuplicable();
-
-    AddOutput(name.outlinks, "the outputs that need to concated for all steps.")
-        .AsDuplicable();
-    AddOutput(name.step_scopes, "step scopes");
-
-    // Attributes stored in AttributeMap
-    AddAttr<std::vector<std::string>>(name.pre_memories,
-                                      "names of pre-memories");
-    AddAttr<std::vector<std::string>>(name.memories, "names of memories");
-
-    AddComment("This is a RNN operator for varience-length sequences.");
+inline void RestoreInitialState(const DySeqMetaBatch& metas,
+                                const LoDTensor& tensor, LoDTensor* boot_state,
+                                const platform::Place& dst_place) {
+  for (size_t seq_id = 0; seq_id < metas.size(); seq_id++) {
+    auto slice = tensor.Slice(seq_id, seq_id + 1);
+    auto boot_slice =
+        boot_state->Slice(metas[seq_id].ori_idx, metas[seq_id].ori_idx + 1);
+    boot_slice.CopyFrom(slice, dst_place, platform::CPUDeviceContext());
  }
-};
+}

-void DynamicRecurrentOp::Run(const Scope& scope,
-                             const platform::DeviceContext& dev_ctx) const {
-  cache_.Init(kArgName, *this, scope, &arg_);
+}  // namespace detail
+
+// Implementation for forward propagation.
+template <>
+void RNNAlgorithm::Run<RNNAlgorithm::ComputeMode::kForward>(
+    const framework::Scope& scope, const framework::OperatorBase& op,
+    const platform::DeviceContext& dev_ctx) {
+  SetComputeMode(ComputeMode::kForward);
+  cache_.Init(kArgNames[mode_], op, scope, &dev_ctx, &arg_);
  SplitInputs();
  CreateScopes();
  WriteStepInputs();
  InitStates();
  WriteStepOutputs();
+  RunSteps();
+  ConcatOutputs();
+}

-  // call stepnet in all the time steps
-  for (size_t step = 0; step < cache_.num_steps; step++) {
-    auto& step_scope = cache_.GetScope(step);
-    stepnet_->Run(step_scope, dev_ctx);
+// Implementation for backward propagation.
+template <>
+void RNNAlgorithm::Run<RNNAlgorithm::ComputeMode::kBackward>(
+    const framework::Scope& scope, const framework::OperatorBase& op,
+    const platform::DeviceContext& dev_ctx) {
+  SetComputeMode(ComputeMode::kBackward);
+  cache_.Init(kArgNames[mode_], op, scope, &dev_ctx, &arg_);
+  SplitInputs();
+  WriteStepInputs();
+  InitStates();
+  WriteStepOutputs();
+  RunSteps();
+  // copy boot-states' gradients back.
+  for (const auto& state : arg_.states) {
+    ExportInitialStateGradient(state);
  }

  ConcatOutputs();
 }

-void DynamicRecurrentOp::SplitInputs() const {
+void RNNAlgorithm::SplitInputs() {
  // TODO(superjom) make level a config
  // TODO(superjom) check all the inputs has the same LoD
  int level = 0;
-  for (const auto& item : cache_.inlinks) {
+  for (const auto& item : cache_.inputs) {
    const auto& var = item.second;
    const auto& tensor = var->Get<LoDTensor>();
    TensorArray& ta = step_inputs_[item.first];
@@ -125,8 +126,8 @@ void DynamicRecurrentOp::SplitInputs() const {
  }
 }

-void DynamicRecurrentOp::WriteStepInputs() const {
-  for (const auto& item : cache_.inlinks) {
+void RNNAlgorithm::WriteStepInputs() {
+  for (const auto& item : cache_.inputs) {
    auto ta_it = step_inputs_.find(item.first);
    PADDLE_ENFORCE(ta_it != step_inputs_.end(),
                   "step_inputs_ not compatible with memory set");
@@ -138,20 +139,20 @@ void DynamicRecurrentOp::WriteStepInputs() const {
      if (var == nullptr) {
        var = step_scope.Var(item.first);
      }
-      var->GetMutable<LoDTensor>()->ShareDataWith<value_type>(tensor);
+      var->GetMutable<LoDTensor>()->ShareDataWith(tensor);
    }
  }
 }

-void DynamicRecurrentOp::WriteStepOutputs() const {
+void RNNAlgorithm::WriteStepOutputs() {
  // initialize step outputs
-  for (const auto& item : cache_.outlinks) {
+  for (const auto& item : cache_.outputs) {
    step_outputs_.emplace(item.first, TensorArray());
  }
  PADDLE_ENFORCE_GT(step_outputs_.size(), 0UL);
 }

-void DynamicRecurrentOp::CreateScopes() const {
+void RNNAlgorithm::CreateScopes() {
  PADDLE_ENFORCE_GT(cache_.num_steps, 0);
  // resize scopes
  size_t num_scopes_need_create = cache_.num_steps - cache_.scopes->size();
@@ -160,19 +161,19 @@ void DynamicRecurrentOp::CreateScopes() const {
  }

  // init temporary inputs
-  PADDLE_ENFORCE_NOT_NULL(stepnet_, "stepnet should be set first");
-  std::vector<std::string> memories;
-  std::vector<std::string> pre_memories;
-  std::vector<std::string> stepnet_outputs;
-  std::transform(arg_.memories.begin(), arg_.memories.end(),
-                 std::back_inserter(memories),
-                 [](const rnn::MemoryAttr& m) { return m.var; });
-  std::transform(arg_.memories.begin(), arg_.memories.end(),
-                 std::back_inserter(pre_memories),
-                 [](const rnn::MemoryAttr& m) { return m.pre_var; });
-  for (const auto& item : stepnet_->Outputs()) {
+  PADDLE_ENFORCE_NOT_NULL(step_unit_, "stepnet should be set first");
+  std::vector<std::string> states;
+  std::vector<std::string> ex_states;
+  std::vector<std::string> step_unit_outputs;
+  std::transform(arg_.states.begin(), arg_.states.end(),
+                 std::back_inserter(states),
+                 [](const rnn::StateAttr& m) { return m.var; });
+  std::transform(arg_.states.begin(), arg_.states.end(),
+                 std::back_inserter(ex_states),
+                 [](const rnn::StateAttr& m) { return m.pre_var; });
+  for (const auto& item : step_unit_->Outputs()) {
    for (const auto& var : item.second) {
-      stepnet_outputs.push_back(var);
+      step_unit_outputs.push_back(var);
    }
  }

@@ -180,13 +181,13 @@ void DynamicRecurrentOp::CreateScopes() const {
    auto& scope = cache_.GetScope(step);
    detail::CreateVariables(scope, arg_.inlinks);
    detail::CreateVariables(scope, arg_.outlinks);
-    detail::CreateVariables(scope, memories);
-    detail::CreateVariables(scope, pre_memories);
-    detail::CreateVariables(scope, stepnet_outputs);
+    detail::CreateVariables(scope, states);
+    detail::CreateVariables(scope, ex_states);
+    detail::CreateVariables(scope, step_unit_outputs);
  }
 }

-void DynamicRecurrentOp::ConcatOutputs() const {
+void RNNAlgorithm::ConcatOutputs() {
  // TODO(superjom) transform this to a config
  int level = 0;
  for (size_t step = 0; step < cache_.num_steps; step++) {
@@ -199,31 +200,45 @@ void DynamicRecurrentOp::ConcatOutputs() const {
      item.second.WriteShared(step, *tensor);
    }
  }
-  // the inlinks' lods should be the same, so randomly get one lod.
+  // the inputs' LoDs should all be the same, so take the first one.
  const auto& some_lod =
      cache_.scope->FindVar(arg_.inlinks.front())->Get<LoDTensor>().lod();
  const auto& some_meta = dy_seq_metas_[arg_.inlinks.front()];
  for (auto& item : step_outputs_) {
    auto tensor = item.second.Pack(level, some_meta, some_lod);
-    auto* output = cache_.outlinks[item.first]->GetMutable<LoDTensor>();
-    const_cast<LoDTensor*>(output)->ShareDataWith<value_type>(tensor);
+    auto* output = cache_.outputs[item.first]->GetMutable<LoDTensor>();
+    const_cast<LoDTensor*>(output)->ShareDataWith(tensor);
+  }
+}
+
+void RNNAlgorithm::RunSteps() {
+  if (IsBackward()) {
+    // run the step unit over all time steps in reverse order
+    for (int step = cache_.num_steps - 1; step >= 0; step--) {
+      auto& step_scope = cache_.GetScope(step);
+      step_unit_->Run(step_scope, *cache_.dev_ctx);
+    }
+  } else {
+    for (size_t step = 0; step < cache_.num_steps; step++) {
+      auto& step_scope = cache_.GetScope(step);
+      step_unit_->Run(step_scope, *cache_.dev_ctx);
+    }
  }
 }

-void DynamicRecurrentOp::InitStates() const {
+void RNNAlgorithm::InitStates() {
  for (size_t step = 0; step < cache_.num_steps; step++) {
-    for (const auto& memory : arg_.memories) {
-      CreateState(memory, step);
-      LinkState(memory, step);
+    for (const auto& state : arg_.states) {
+      CreateState(state, step);
+      LinkState(state, step);
    }
  }
 }

-void DynamicRecurrentOp::CreateState(const rnn::MemoryAttr& memory,
-                                     size_t step) const {
+void RNNAlgorithm::CreateState(const rnn::StateAttr& state_attr, size_t step) {
  auto& scope = cache_.GetScope(step);
-  auto& state = *cache_.GetTensor(scope, memory.var);
-  auto& boot_state = *cache_.GetTensor(*cache_.scope, memory.boot_var);
+  auto& state = *cache_.GetTensor(scope, state_attr.var);
+  auto& boot_state = *cache_.GetTensor(*cache_.scope, state_attr.boot_var);

  size_t num_instances =
      step_inputs_[arg_.inlinks.front()].Read(step).dims()[0];
@@ -232,56 +247,79 @@ void DynamicRecurrentOp::CreateState(const rnn::MemoryAttr& memory,

  state.Resize(dims);
  state.mutable_data<value_type>(platform::CPUPlace());
-  states_[memory.var].WriteShared(step, state);
+  states_[state_attr.var].WriteShared(step, state);
 }

-void DynamicRecurrentOp::LinkState(const rnn::MemoryAttr& memory,
-                                   size_t step) const {
+void RNNAlgorithm::LinkState(const rnn::StateAttr& state, size_t step) {
  auto& scope = cache_.GetScope(step);
-  auto& state_pre = *cache_.GetTensor(scope, memory.pre_var);
+  auto& state_pre = *cache_.GetTensor(scope, state.pre_var);
+
+  // Process the boot-state of the first state (step 0 in forward mode, or
+  // the last step in backward mode).
+  // Only forward mode needs to link the boot-state to the `pre-state` in the
+  // first time step. Backward mode instead copies the gradient of the
+  // `pre-state` in the first time step to the gradient of the `boot-state`.
+  if (step == 0 && IsForward()) {
+    LinkInitialState(state);
+  } else {
+    size_t num_instances =
+        step_inputs_[arg_.inlinks.front()].Read(step).dims()[0];
+    auto* pre_state = cache_.GetTensor(cache_.GetScope(step - 1), state.var);
+    // shrink and share from previous state
+    auto shrinked_pre_state = pre_state->Slice(0, num_instances);
+    state_pre.ShareDataWith(shrinked_pre_state);
+  }
+}

+void RNNAlgorithm::LinkInitialState(const rnn::StateAttr& state) {
  // all the step_inputs' metas should be the same, just randomly select one
  // and get the dyseq meta.
  const auto& some_meta = dy_seq_metas_[arg_.inlinks.front()];
-  size_t num_instances =
-      step_inputs_[arg_.inlinks.front()].Read(step).dims()[0];
+  auto& scope = cache_.GetScope(0);
+  auto& state_pre = *cache_.GetTensor(scope, state.pre_var);
+  auto* pre_state = cache_.GetTensor(*cache_.scope, state.boot_var);
+  pre_state->mutable_data<float>(platform::CPUPlace());
+  // allocate state
+  state_pre.Resize(pre_state->dims());
+  state_pre.mutable_data<value_type>(platform::CPUPlace());
+  detail::ReorderInitialState(some_meta, *pre_state, &state_pre,
+                              pre_state->place());
+}

-  LoDTensor* pre_state{nullptr};
-  if (step == 0) {
-    pre_state = cache_.GetTensor(*cache_.scope, memory.boot_var);
-    pre_state->mutable_data<float>(platform::CPUPlace());
-    // allocate memory
-    state_pre.Resize(pre_state->dims());
-    state_pre.mutable_data<value_type>(platform::CPUPlace());
-    detail::ReorderBootState<value_type>(some_meta, *pre_state, &state_pre,
-                                         pre_state->place());
-  } else {
-    pre_state = cache_.GetTensor(cache_.GetScope(step - 1), memory.var);
-  }
+void RNNAlgorithm::ExportInitialStateGradient(const rnn::StateAttr& state) {
+  // all the step_inputs' metas should be the same, just randomly select one
+  // and get the dyseq meta.
+  const auto& some_meta = dy_seq_metas_[arg_.inlinks.front()];
+  auto& scope = cache_.GetScope(0);

-  // shink and share from previous state
-  auto shrinked_pre_state = pre_state->Slice<value_type>(0, num_instances);
-  state_pre.ShareDataWith<value_type>(shrinked_pre_state);
+  auto& state_pre = *cache_.GetTensor(scope, state.pre_var);
+  auto& pre_state = *cache_.GetTensor(*cache_.scope, state.boot_var);
+  pre_state.Resize(state_pre.dims());
+  detail::RestoreInitialState(some_meta, state_pre, &pre_state,
+                              pre_state.place());
 }

-void DynamicRecurrentOp::ArgCache::Init(
-    const rnn::ArgumentName& name, const paddle::framework::OperatorBase& op,
-    const paddle::framework::Scope& scope, rnn::Argument* arg) {
+void RNNAlgorithm::ArgCache::Init(const rnn::ArgumentName& name,
+                                  const paddle::framework::OperatorBase& op,
+                                  const paddle::framework::Scope& scope,
+                                  platform::DeviceContext const* dev_ctx,
+                                  rnn::Argument* arg) {
  this->scope = &scope;
  InitArgument(name, op, arg);
  CacheScopes(scope, *arg);
  CacheInlinks(scope, arg->inlinks);
  CacheOutlinks(scope, arg->outlinks);
+  this->dev_ctx = dev_ctx;
 }

-void DynamicRecurrentOp::ArgCache::InitArgument(const rnn::ArgumentName& name,
-                                                const OperatorBase& op,
-                                                rnn::Argument* arg) {
+void RNNAlgorithm::ArgCache::InitArgument(const rnn::ArgumentName& name,
+                                          const OperatorBase& op,
+                                          rnn::Argument* arg) {
  rnn::InitArgument(name, arg, op, false /*is_grad*/);
 }

-void DynamicRecurrentOp::ArgCache::CacheScopes(const Scope& scope,
-                                               const rnn::Argument& arg) {
+void RNNAlgorithm::ArgCache::CacheScopes(const Scope& scope,
+                                         const rnn::Argument& arg) {
  auto scopes_var = scope.FindVar(arg.step_scopes);
  PADDLE_ENFORCE(scopes_var != nullptr,
                 "the step_scopes output argument [%s] should be created first "
@@ -290,45 +328,85 @@ void DynamicRecurrentOp::ArgCache::CacheScopes(const Scope& scope,
  this->scopes = scopes_var->GetMutable<std::vector<Scope*>>();
 }

-void DynamicRecurrentOp::ArgCache::CacheInlinks(
+void RNNAlgorithm::ArgCache::CacheInlinks(
    const Scope& scope, const std::vector<std::string>& names) {
  for (auto name : names) {
    auto* var = GetVariable(scope, name);
-    inlinks[name] = var;
+    inputs[name] = var;
  }
 }

-void DynamicRecurrentOp::ArgCache::CacheOutlinks(
+void RNNAlgorithm::ArgCache::CacheOutlinks(
    const Scope& scope, const std::vector<std::string>& names) {
  for (auto name : names) {
    auto* var = GetVariable(scope, name);
-    outlinks[name] = var;
+    outputs[name] = var;
  }
 }

-Variable* DynamicRecurrentOp::ArgCache::GetVariable(const Scope& scope,
-                                                    const std::string& name) {
+Variable* RNNAlgorithm::ArgCache::GetVariable(const Scope& scope,
+                                              const std::string& name) {
  auto* var = scope.FindVar(name);
  PADDLE_ENFORCE_NOT_NULL(var, "variable [%s] not exist in scope", name);
  return var;
 }

-LoDTensor* DynamicRecurrentOp::ArgCache::GetTensor(
-    const framework::Scope& scope, const std::string& name) {
+LoDTensor* RNNAlgorithm::ArgCache::GetTensor(const framework::Scope& scope,
+                                             const std::string& name) {
  auto* var = GetVariable(scope, name);
  return var->GetMutable<LoDTensor>();
 }

-const rnn::ArgumentName DynamicRecurrentOp::kArgName{
-    "step_net", "step_scopes",  "inlinks",      "outlinks",
-    "memories", "pre_memories", "boot_memories"};
+const std::array<rnn::ArgumentName, 2> RNNAlgorithm::kArgNames{
+    {rnn::ArgumentName{"step_unit", "step_scopes", "inputs", "outputs",
+                       "states", "ex_states", "initial_states"},
+     rnn::ArgumentName{"step_unit", "step_scopes@GRAD", "outputs@GRAD",
+                       "inputs@GRAD", "states", "ex_states",
+                       "initial_states@GRAD"}}};
+
+void DynamicRecurrentOp::Run(const framework::Scope& scope,
+                             const platform::DeviceContext& dev_ctx) const {
+  rnn.Run<RNNAlgorithm::ComputeMode::kForward>(
+      scope, *dynamic_cast<const OperatorBase*>(this), dev_ctx);
+}

 void DynamicRecurrentGradientOp::Run(
-    const Scope& scope, const platform::DeviceContext& dev_ctx) const {}
+    const Scope& scope, const platform::DeviceContext& dev_ctx) const {
+  rnn.Run<RNNAlgorithm::ComputeMode::kBackward>(
+      scope, *dynamic_cast<const OperatorBase*>(this), dev_ctx);
+}
+
+class DynamicRecurrentOpProtoAndCheckerMaker
+    : public framework::OpProtoAndCheckerMaker {
+ public:
+  DynamicRecurrentOpProtoAndCheckerMaker(framework::OpProto* proto,
+                                         framework::OpAttrChecker* op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    const auto& name =
+        RNNAlgorithm::kArgNames[RNNAlgorithm::ComputeMode::kForward];
+    // inputs and outputs stored in proto
+    AddInput(name.inlinks,
+             "the inputs that need to be segmented for each step.")
+        .AsDuplicable();
+    AddInput(name.initial_states, "variables to initialize states.")
+        .AsDuplicable();
+
+    AddOutput(name.outlinks, "the outputs to be concatenated over all steps.")
+        .AsDuplicable();
+    AddOutput(name.step_scopes, "step scopes");
+
+    // Attributes stored in AttributeMap
+    AddAttr<std::vector<std::string>>(name.ex_states, "names of ex_states");
+    AddAttr<std::vector<std::string>>(name.states, "names of states");
+
+    AddComment("This is an RNN operator for variable-length sequences.");
+  }
+};

 }  // namespace operators
 }  // namespace paddle

-REGISTER_OP_WITHOUT_GRADIENT(
-    dynamic_recurrent, paddle::operators::DynamicRecurrentOp,
-    paddle::operators::DynamicRecurrentOpProtoAndCheckerMaker);
+REGISTER_OP(dynamic_recurrent, paddle::operators::DynamicRecurrentOp,
+            paddle::operators::DynamicRecurrentOpProtoAndCheckerMaker,
+            dynamic_recurrent_grad,
+            paddle::operators::DynamicRecurrentGradientOp);
--- a/paddle/operators/dynamic_recurrent_op.h
+++ b/paddle/operators/dynamic_recurrent_op.h
@@ -27,47 +27,39 @@
 namespace paddle {
 namespace operators {

-class DynamicRecurrentOp : public framework::OperatorBase {
+class RNNAlgorithm {
 public:
-  static const rnn::ArgumentName kArgName;
+  enum ComputeMode { kForward = 0, kBackward = 1 };
+  static const std::array<rnn::ArgumentName, 2> kArgNames;
  using value_type = float;

-  DynamicRecurrentOp(const std::string& type,
-                     const framework::VariableNameMap& inputs,
-                     const framework::VariableNameMap& outputs,
-                     const framework::AttributeMap& attrs)
-      : OperatorBase(type, inputs, outputs, attrs) {}
-
-  DynamicRecurrentOp(const DynamicRecurrentOp& o)
-      : framework::OperatorBase(
-            static_cast<const framework::OperatorBase&>(o)) {
-    // TODO(yuyang18): Implement copy ctor well.
-    PADDLE_THROW("Not implemented");
-  }
-
-  void Run(const framework::Scope& scope,
-           const platform::DeviceContext& dev_ctx) const override;
-
+  /*
+   * Separate `Run` methods for forward and backward; `_` exists only for
+   * template specialization.
+   */
+  template <ComputeMode _>
+  void Run(const framework::Scope& scope, const framework::OperatorBase& op,
+           const platform::DeviceContext& dev_ctx);
  /*
   * Split the inputs(LoDTensors) to segments for each time step.
   */
-  void SplitInputs() const;
+  void SplitInputs();

  /*
   * Create step-scopes to store temporary outputs in each time steps.
   */
-  void CreateScopes() const;
+  void CreateScopes();

  /*
   * Link TensorArray steps to the corresponding variables located in
   * step-scopes.
   */
-  void WriteStepInputs() const;
+  void WriteStepInputs();

  /*
   * Write output of each step to the corresponding TensorArray.
   */
-  void WriteStepOutputs() const;
+  void WriteStepOutputs();

  /*
   * Initialize the states, each state will have a corresponding pre-state,
@@ -75,54 +67,83 @@ class DynamicRecurrentOp : public framework::OperatorBase {
   * pre-state in the first time step will be initialized with an zero tensor or
   * a tensor in parent scope if is provided.
   */
-  void InitStates() const;
+  void InitStates();

  /*
   * Create state variables for each time step.
   */
-  void CreateState(const rnn::MemoryAttr& memory, size_t step) const;
+  void CreateState(const rnn::StateAttr& state, size_t step);

  /*
   * Link pre-state variable in current scope to the state variable in the
-   * previous time step (scope).
+   * previous time step (scope) by reference.
+   */
+  void LinkState(const rnn::StateAttr& state, size_t step);
+
+  /*
+   * Link the pre-state of the first time step to the `boot-state` in parent's
+   * scope.
+   */
+  void LinkInitialState(const rnn::StateAttr& state);
+
+  /*
+   * Copy the gradient from `pre-state` in the first step-scope to the
+   * `boot-state` in parent's scope.
+   */
+  void ExportInitialStateGradient(const rnn::StateAttr& state);
+
+  /*
+   * Calculate time steps.
   */
-  void LinkState(const rnn::MemoryAttr& memory, size_t step) const;
+  void RunSteps();

  /*
   * Concatenate outputs in each time step and generate a LoDTensor.
   */
-  void ConcatOutputs() const;
+  void ConcatOutputs();
+
+  void SetComputeMode(ComputeMode mode) { mode_ = mode; }
+  bool IsForward() const { return mode_ == ComputeMode::kForward; }
+  bool IsBackward() const { return mode_ == ComputeMode::kBackward; }

  /*
-   * set a stepnet that is created according to a RecurrentOp's stepnet.
+   * set a step unit that is created according to a RecurrentOp's step unit.
   */
-  void SetStepNet(std::unique_ptr<OperatorBase> net) {
-    PADDLE_ENFORCE_NOT_NULL(net);
-    stepnet_ = std::move(net);
+  void SetStepUnit(std::unique_ptr<framework::OperatorBase> step_unit) {
+    PADDLE_ENFORCE_NOT_NULL(step_unit);
+    step_unit_ = std::move(step_unit);
  }
-  const OperatorBase& GetStepNet() const { return *stepnet_; }
+  const framework::OperatorBase& GetStepUnit() const { return *step_unit_; }

  const framework::TensorArray& state(const std::string& name) const {
-    return states_[name];
+    auto it = states_.find(name);
+    PADDLE_ENFORCE(it != states_.end());
+    return it->second;
  }
  const framework::TensorArray& step_input(const std::string& name) const {
-    return step_inputs_[name];
+    auto it = step_inputs_.find(name);
+    PADDLE_ENFORCE(it != step_inputs_.end());
+    return it->second;
  }
  const framework::TensorArray& step_output(const std::string& name) const {
-    return step_outputs_[name];
+    auto it = step_outputs_.find(name);
+    PADDLE_ENFORCE(it != step_outputs_.end());
+    return it->second;
  }

 protected:
  struct ArgCache {
    framework::Scope const* scope;
    std::vector<framework::Scope*>* scopes;
-    std::map<std::string, framework::Variable*> inlinks;
-    std::map<std::string, framework::Variable*> outlinks;
+    std::map<std::string, framework::Variable*> inputs;
+    std::map<std::string, framework::Variable*> outputs;
+    platform::DeviceContext const* dev_ctx;

    size_t num_steps{0};

-    void Init(const rnn::ArgumentName& name, const OperatorBase& op,
-              const framework::Scope& scope, rnn::Argument* arg);
+    void Init(const rnn::ArgumentName& name, const framework::OperatorBase& op,
+              const framework::Scope& scope,
+              platform::DeviceContext const* dev_ctx, rnn::Argument* arg);

    framework::Scope& GetScope(size_t index) {
      PADDLE_ENFORCE_LT(index, num_steps);
@@ -133,8 +154,8 @@ class DynamicRecurrentOp : public framework::OperatorBase {
                                    const std::string& name);

   private:
-    void InitArgument(const rnn::ArgumentName& name, const OperatorBase& op,
-                      rnn::Argument* arg);
+    void InitArgument(const rnn::ArgumentName& name,
+                      const framework::OperatorBase& op, rnn::Argument* arg);
    void CacheScopes(const framework::Scope& scope, const rnn::Argument& arg);
    void CacheInlinks(const framework::Scope& scope,
                      const std::vector<std::string>& names);
@@ -145,27 +166,49 @@ class DynamicRecurrentOp : public framework::OperatorBase {
  };

 private:
-  std::unique_ptr<OperatorBase> stepnet_;
-  mutable std::map<std::string, framework::TensorArray> states_;
-  mutable std::map<std::string, framework::TensorArray> step_inputs_;
-  mutable std::map<std::string, framework::TensorArray> step_outputs_;
-  mutable std::map<std::string, std::vector<framework::DySeqMeta>>
-      dy_seq_metas_;
-  mutable rnn::Argument arg_;
-  mutable ArgCache cache_;
+  std::unique_ptr<framework::OperatorBase> step_unit_;
+  std::map<std::string, framework::TensorArray> states_;
+  std::map<std::string, framework::TensorArray> step_inputs_;
+  std::map<std::string, framework::TensorArray> step_outputs_;
+  std::map<std::string, std::vector<framework::DySeqMeta>> dy_seq_metas_;
+  rnn::Argument arg_;
+  ArgCache cache_;
+  ComputeMode mode_{ComputeMode::kForward};

 #ifdef PADDLE_WITH_TESTING
-  friend class DynamicRecurrentOpTestHelper;
-  FRIEND_TEST(DynamicRecurrentOpTestHelper, SplitInputs);
-  FRIEND_TEST(DynamicRecurrentOpTestHelper, CreateCache);
-  FRIEND_TEST(DynamicRecurrentOpTestHelper, CreateScopes);
-  FRIEND_TEST(DynamicRecurrentOpTestHelper, WriteStepInputs);
-  FRIEND_TEST(DynamicRecurrentOpTestHelper, WriteStepOutputs);
-  FRIEND_TEST(DynamicRecurrentOpTestHelper, InitStates);
-  FRIEND_TEST(DynamicRecurrentOpTestHelper, ConcatOutputs);
+  // test forward
+  friend class RNNAlgorithmTestHelper;
+  FRIEND_TEST(RNNAlgorithmTestHelper, SplitInputs);
+  FRIEND_TEST(RNNAlgorithmTestHelper, CreateCache);
+  FRIEND_TEST(RNNAlgorithmTestHelper, CreateScopes);
+  FRIEND_TEST(RNNAlgorithmTestHelper, WriteStepInputs);
+  FRIEND_TEST(RNNAlgorithmTestHelper, WriteStepOutputs);
+  FRIEND_TEST(RNNAlgorithmTestHelper, InitStates);
+  FRIEND_TEST(RNNAlgorithmTestHelper, ConcatOutputs);
+// TODO(superjom) test backward
 #endif
 };

+class DynamicRecurrentOp : public framework::OperatorBase {
+ public:
+  DynamicRecurrentOp(const std::string& type,
+                     const framework::VariableNameMap& inputs,
+                     const framework::VariableNameMap& outputs,
+                     const framework::AttributeMap& attrs)
+      : OperatorBase(type, inputs, outputs, attrs) {}
+
+  DynamicRecurrentOp(const DynamicRecurrentOp& o)
+      : framework::OperatorBase(
+            static_cast<const framework::OperatorBase&>(o)) {
+    PADDLE_THROW("Not implemented");
+  }
+
+  void Run(const framework::Scope& scope,
+           const platform::DeviceContext& dev_ctx) const override;
+
+  mutable RNNAlgorithm rnn;
+};
+
 class DynamicRecurrentGradientOp : public framework::OperatorBase {
 public:
  DynamicRecurrentGradientOp(const std::string& type,
@@ -174,8 +217,16 @@ class DynamicRecurrentGradientOp : public framework::OperatorBase {
                             const framework::AttributeMap& attrs)
      : OperatorBase(type, inputs, outputs, attrs) {}

+  DynamicRecurrentGradientOp(const DynamicRecurrentGradientOp& o)
+      : framework::OperatorBase(
+            static_cast<const framework::OperatorBase&>(o)) {
+    PADDLE_THROW("Not implemented");
+  }
+
  void Run(const framework::Scope& scope,
           const platform::DeviceContext& dev_ctx) const override;
+
+  mutable RNNAlgorithm rnn;
 };

 }  // namespace operators

--- a/paddle/operators/dynamic_recurrent_op_test.cc
+++ b/paddle/operators/dynamic_recurrent_op_test.cc
@@ -43,16 +43,16 @@ LoDTensor* CreateVar(Scope& scope, std::string name, framework::DDim dims,
  return tensor;
 }

-class DynamicRecurrentOpTestHelper : public ::testing::Test {
+class RNNAlgorithmTestHelper : public ::testing::Test {
 protected:
-  const rnn::ArgumentName argname = DynamicRecurrentOp::kArgName;
+  const rnn::ArgumentName argname = RNNAlgorithm::kArgNames[0];

  virtual void SetUp() override {
    CreateGlobalVariables();

    auto op_desc = CreateOpDesc();
    op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
-    dop = dynamic_cast<DynamicRecurrentOp*>(op.get());
+    dop = &(dynamic_cast<DynamicRecurrentOp*>(op.get())->rnn);
    InitCacheManually();
    InitStepNet();
  }
@@ -63,20 +63,20 @@ class DynamicRecurrentOpTestHelper : public ::testing::Test {
    op_desc.set_type("dynamic_recurrent");

    OpDescNewVar(argname.inlinks, {"in0"}, op_desc.add_inputs());
-    OpDescNewVar(argname.boot_memories, {"boot_mem"}, op_desc.add_inputs());
+    OpDescNewVar(argname.initial_states, {"boot_mem"}, op_desc.add_inputs());
    OpDescNewVar(argname.step_scopes, {"step_scopes"}, op_desc.add_outputs());
    OpDescNewVar(argname.outlinks, {"out0"}, op_desc.add_outputs());

-    // set pre-memories
+    // set pre-states
    auto pre_memories = op_desc.mutable_attrs()->Add();
-    pre_memories->set_name(argname.pre_memories);
+    pre_memories->set_name(argname.ex_states);
    pre_memories->set_type(paddle::framework::AttrType::STRINGS);
    auto pre_memories_item = pre_memories->add_strings();
    *pre_memories_item = "mem@pre";

-    // set memories
+    // set states
    auto memories = op_desc.mutable_attrs()->Add();
-    memories->set_name(argname.memories);
+    memories->set_name(argname.states);
    memories->set_type(paddle::framework::AttrType::STRINGS);
    auto memories_item = memories->add_strings();
    *memories_item = "mem";
@@ -113,32 +113,33 @@ class DynamicRecurrentOpTestHelper : public ::testing::Test {
  }

  void InitCacheManually() {
-    dop->cache_.Init(DynamicRecurrentOp::kArgName, *dop, scope, &dop->arg_);
+    dop->cache_.Init(RNNAlgorithm::kArgNames[0], *op, scope, &device_context,
+                     &dop->arg_);
  }

  void InitStepNet() {
    std::unique_ptr<framework::OperatorBase> stepnet{new NetOp};
    dynamic_cast<NetOp*>(stepnet.get())
        ->AppendOp(std::unique_ptr<TestOp>(new TestOp(
-            "test", {{"inlinks", {"in0"}}, {"boot_memories", {"boot_mem"}}},
-            {{"outlinks", {"out0"}}, {"step_scopes", {"step_scopes"}}}, {})));
-    dop->SetStepNet(std::move(stepnet));
+            "test", {{"inputs", {"in0"}}, {"initial_states", {"boot_mem"}}},
+            {{"outputs", {"out0"}}, {"step_scopes", {"step_scopes"}}}, {})));
+    dop->SetStepUnit(std::move(stepnet));
  }

 protected:
-  DynamicRecurrentOp* dop;
+  RNNAlgorithm* dop;
  std::unique_ptr<framework::OperatorBase> op;
  paddle::platform::CPUDeviceContext device_context;
  paddle::framework::Scope scope;
 };

-TEST_F(DynamicRecurrentOpTestHelper, CreateCache) {
+TEST_F(RNNAlgorithmTestHelper, CreateCache) {
  const rnn::Argument& arg = dop->arg_;
  ASSERT_EQ(arg.inlinks.size(), 1UL);
  ASSERT_EQ(arg.outlinks.size(), 1UL);
 }

-TEST_F(DynamicRecurrentOpTestHelper, SplitInputs) {
+TEST_F(RNNAlgorithmTestHelper, SplitInputs) {
  dop->SplitInputs();
  auto& in0_ta = dop->step_inputs_["in0"];
  ASSERT_EQ(in0_ta.size(), 4UL);
@@ -153,14 +154,14 @@ TEST_F(DynamicRecurrentOpTestHelper, SplitInputs) {
  EXPECT_EQ(batch3.dims()[0], 1);
 }

-TEST_F(DynamicRecurrentOpTestHelper, CreateScopes) {
+TEST_F(RNNAlgorithmTestHelper, CreateScopes) {
  dop->SplitInputs();
  dop->CreateScopes();
  ASSERT_EQ(dop->cache_.num_steps, 4UL);
  ASSERT_EQ(dop->cache_.scopes->size(), 4UL);
 }

-TEST_F(DynamicRecurrentOpTestHelper, WriteStepInputs) {
+TEST_F(RNNAlgorithmTestHelper, WriteStepInputs) {
  dop->SplitInputs();
  dop->CreateScopes();
  dop->WriteStepInputs();
@@ -173,7 +174,7 @@ TEST_F(DynamicRecurrentOpTestHelper, WriteStepInputs) {
  }
 }

-TEST_F(DynamicRecurrentOpTestHelper, WriteStepOutputs) {
+TEST_F(RNNAlgorithmTestHelper, WriteStepOutputs) {
  dop->SplitInputs();
  dop->CreateScopes();
  dop->WriteStepInputs();
@@ -187,11 +188,12 @@ TEST_F(DynamicRecurrentOpTestHelper, WriteStepOutputs) {
  }
 }

-TEST_F(DynamicRecurrentOpTestHelper, ConcatOutputs) {
+TEST_F(RNNAlgorithmTestHelper, ConcatOutputs) {
  // Let's leave this test to python unittest.
 }

-TEST_F(DynamicRecurrentOpTestHelper, InitStates) {
+TEST_F(RNNAlgorithmTestHelper, InitStates) {
+  dop->SetComputeMode(RNNAlgorithm::ComputeMode::kForward);
  dop->SplitInputs();
  dop->CreateScopes();
  dop->WriteStepInputs();
@@ -208,12 +210,6 @@ TEST_F(DynamicRecurrentOpTestHelper, InitStates) {

    auto* boot_state = scope.FindVar("boot_mem");
    ASSERT_TRUE(boot_state != nullptr);
-
-    if (step == 0) {
-      // check pre_state is a reference of boot_state
-      ASSERT_EQ(boot_state->Get<LoDTensor>().data<float>(),
-                pre_state->Get<LoDTensor>().data<float>());
-    }
  }
 }


--- a/paddle/operators/elementwise_op_function.h
+++ b/paddle/operators/elementwise_op_function.h
@@ -108,7 +108,7 @@ void ElementwiseCompute(const framework::ExecutionContext& ctx) {
  PADDLE_ENFORCE_GE(x_dims.size(), y_dims.size(),
                    "Rank of first input must >= rank of second input.")

-  if (x_dims == y_dims || product(y_dims) == 1) {
+  if (x_dims == y_dims) {
    functor f;
    f.template Run<Place, T>(x, y, z, ctx);
    return;
@@ -174,12 +174,6 @@ void ElementwiseGradCompute(const framework::ExecutionContext& ctx) {
    return;
  }

-  if (product(y_dims) == 1) {
-    functor1 f;
-    f(place, x, y, out, dx, dy, dout);
-    return;
-  }
-
  int axis = ctx.Attr<int>("axis");
  axis = (axis == -1 ? x_dims.size() - y_dims.size() : axis);


--- a/paddle/operators/feed_op.cc
+++ b/paddle/operators/feed_op.cc
@@ -26,8 +26,9 @@ class FeedOp : public framework::OperatorBase {
      : OperatorBase(type, inputs, outputs, attrs) {}
  void Run(const framework::Scope &scope,
           const platform::DeviceContext &dev_ctx) const override {
-    auto feed_var_name = Input("Input");
+    auto feed_var_name = Input("X");
    auto *feed_var = scope.FindVar(feed_var_name);
+
    PADDLE_ENFORCE(feed_var != nullptr,
                   "Cannot find feed_var in scope, feed_var_name is %s",
                   feed_var_name);
@@ -40,18 +41,32 @@ class FeedOp : public framework::OperatorBase {

    auto col = Attr<int>("col");

+    VLOG(3) << "Feed Var " << feed_var_name << "'s " << col
+            << " column to var " << out_name;
+
    auto &feed_list = feed_var->Get<framework::FeedFetchList>();
    auto &feed_item = feed_list.at(static_cast<size_t>(col));
    auto *out_item = out_var->GetMutable<framework::FeedFetchType>();
-    out_item->CopyFromTensor(feed_item, dev_ctx.GetPlace(), dev_ctx);
+    out_item->CopyFrom(feed_item, dev_ctx.GetPlace(), dev_ctx);
    out_item->set_lod(feed_item.lod());
  }
 };

+class FeedOpInfoMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  FeedOpInfoMaker(framework::OpProto *proto,
+                  framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("X", "The input of feed op");
+    AddOutput("Out", "The output of feed op");
+    AddComment("feed op; it should not be configured by users directly");
+    AddAttr<int>("col", "column of feed");
+  }
+};
+
 }  // namespace operators
 }  // namespace paddle

-// We do not need to register OpInfoMaker,
-// since feed operator will not be used by end users directly
 REGISTER_OPERATOR(feed, paddle::operators::FeedOp,
-                  paddle::framework::EmptyGradOpMaker);
+                  paddle::framework::EmptyGradOpMaker,
+                  paddle::operators::FeedOpInfoMaker);
--- a/paddle/operators/fetch_op.cc
+++ b/paddle/operators/fetch_op.cc
@@ -27,7 +27,7 @@ class FetchOp : public framework::OperatorBase {

  void Run(const framework::Scope &scope,
           const platform::DeviceContext &dev_ctx) const override {
-    auto fetch_var_name = Input("Input");
+    auto fetch_var_name = Input("X");
    auto *fetch_var = scope.FindVar(fetch_var_name);
    PADDLE_ENFORCE(fetch_var != nullptr,
                   "Cannot find fetch variable in scope, fetch_var_name is %s",
@@ -51,14 +51,26 @@ class FetchOp : public framework::OperatorBase {

    // FIXME(yuyang18): Should we assume the fetch operator always generate
    // CPU outputs?
-    dst_item.CopyFromTensor(src_item, platform::CPUPlace(), dev_ctx);
+    dst_item.CopyFrom(src_item, platform::CPUPlace(), dev_ctx);
+
+    VLOG(3) << "Fetch variable " << fetch_var_name << " to " << out_name;
  }
 };

+class FetchOpInfoMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  FetchOpInfoMaker(framework::OpProto *proto,
+                   framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("X", "The input of fetch op");
+    AddOutput("Out", "The output of fetch op");
+    AddComment("fetch op; it should not be configured by users directly");
+    AddAttr<int>("col", "column of fetch");
+  }
+};
 }  // namespace operators
 }  // namespace paddle

-// We do not need to register OpInfoMaker,
-// since fetch operator will not be used by end users directly
 REGISTER_OPERATOR(fetch, paddle::operators::FetchOp,
-                  paddle::framework::EmptyGradOpMaker);
+                  paddle::framework::EmptyGradOpMaker,
+                  paddle::operators::FetchOpInfoMaker);
--- a/paddle/operators/gaussian_random_op.cc
+++ b/paddle/operators/gaussian_random_op.cc
@@ -59,7 +59,7 @@ class GaussianRandomOp : public framework::OperatorWithKernel {
 protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
-    return static_cast<framework::DataType>(Attr<int>("data_type"));
+    return static_cast<framework::DataType>(ctx.Attr<int>("data_type"));
  }
 };


--- a/paddle/operators/increment_op.cc
+++ b/paddle/operators/increment_op.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#include "paddle/operators/increment_op.h"
+
+namespace paddle {
+namespace operators {
+
+class IncrementOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+  void InferShape(framework::InferShapeContext *ctx) const override {
+    PADDLE_ENFORCE(ctx->HasInput("X"),
+                   "Input(X) of IncrementOp should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("Out"),
+                   "Output(Out) of IncrementOp should not be null.");
+    ctx->SetOutputDim("Out", ctx->GetInputDim("X"));
+    ctx->ShareLoD("X", /*->*/ "Out");
+  }
+};
+
+template <typename AttrType>
+class IncrementOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  IncrementOpMaker(framework::OpProto *proto,
+                   framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("X", "(Tensor) The input tensor of increment operator.");
+    AddOutput("Out", "(Tensor) The output tensor of increment operator.");
+    AddComment(R"DOC(Increment operator
+
+The equation is: Out = X + step
+)DOC");
+    AddAttr<AttrType>("step",
+                      "The step size by which the "
+                      "input tensor will be incremented.")
+        .SetDefault(1.0);
+  }
+};
+
+class IncrementGradOpMaker : public framework::SingleGradOpDescMaker {
+ public:
+  using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;
+
+  std::unique_ptr<framework::OpDescBind> Apply() const override {
+    auto *grad_op = new framework::OpDescBind();
+    grad_op->SetType("scale");
+    grad_op->SetInput("X", OutputGrad("Out"));
+    grad_op->SetOutput("Out", InputGrad("X"));
+    grad_op->SetAttr("scale", 1.0f);
+    return std::unique_ptr<framework::OpDescBind>(grad_op);
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+
+REGISTER_OPERATOR(increment, ops::IncrementOp, ops::IncrementOpMaker<float>,
+                  ops::IncrementGradOpMaker);
+REGISTER_OP_CPU_KERNEL(increment,
+                       ops::IncrementKernel<paddle::platform::CPUPlace, float>);
--- a/paddle/operators/increment_op.cu
+++ b/paddle/operators/increment_op.cu
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#include "paddle/operators/increment_op.h"
+
+REGISTER_OP_GPU_KERNEL(
+    increment,
+    paddle::operators::IncrementKernel<paddle::platform::GPUPlace, float>);
--- a/paddle/operators/increment_op.h
+++ b/paddle/operators/increment_op.h
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#pragma once
+
+#include "paddle/framework/eigen.h"
+#include "paddle/framework/op_registry.h"
+
+namespace paddle {
+namespace operators {
+template <typename Place, typename T, typename AttrType = T>
+class IncrementKernel : public framework::OpKernel<T> {
+ public:
+  virtual void Compute(const framework::ExecutionContext& context) const {
+    auto* tensor = context.Output<framework::Tensor>("Out");
+    auto* in = context.Input<framework::Tensor>("X");
+    tensor->mutable_data<T>(in->place());
+
+    auto step = static_cast<T>(context.Attr<AttrType>("step"));
+
+    auto eigen_out = framework::EigenVector<T>::Flatten(*tensor);
+    auto eigen_in = framework::EigenVector<T>::Flatten(*in);
+    auto& place = context.GetEigenDevice<Place>();
+    eigen_out.device(place) = eigen_in + step;
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
--- a/paddle/operators/lstm_op.h
+++ b/paddle/operators/lstm_op.h
@@ -91,17 +91,17 @@ class LSTMKernel : public framework::OpKernel<T> {
      int bstart = static_cast<int>(batch_starts[n]);
      int bend = static_cast<int>(batch_starts[n + 1]);

-      Tensor gate_t = batch_gate->Slice<T>(bstart, bend);
-      Tensor out_t = batch_out.Slice<T>(bstart, bend);
-      Tensor cell_t = batch_cell.Slice<T>(bstart, bend);
-      Tensor cell_pre_act_t = batch_cell_pre_act.Slice<T>(bstart, bend);
+      Tensor gate_t = batch_gate->Slice(bstart, bend);
+      Tensor out_t = batch_out.Slice(bstart, bend);
+      Tensor cell_t = batch_cell.Slice(bstart, bend);
+      Tensor cell_pre_act_t = batch_cell_pre_act.Slice(bstart, bend);

      int cur_batch_size = bend - bstart;

      if (n != 0) {
        int pre_h_start = static_cast<int>(batch_starts[n - 1]);
        int pre_h_end = pre_h_start + cur_batch_size;
-        auto pre_hidden_t = batch_out.Slice<T>(pre_h_start, pre_h_end);
+        auto pre_hidden_t = batch_out.Slice(pre_h_start, pre_h_end);
        math::matmul<Place, T>(ctx.device_context(), pre_hidden_t, false,
                               *weight, false, static_cast<T>(1.0), &gate_t,
                               static_cast<T>(1.0));

--- a/paddle/operators/math/im2col_test.cc
+++ b/paddle/operators/math/im2col_test.cc
@@ -64,7 +64,7 @@ void testIm2col() {
  if (paddle::platform::is_cpu_place(*place)) {
    input = input_tmp;
  } else {
-    input.CopyFrom<float>(input_tmp, *place, *context);
+    input.CopyFrom(input_tmp, *place, *context);
  }
  output_cfo.mutable_data<float>(
      {1, filter_size, filter_size, output_height, output_width}, *place);
@@ -85,8 +85,7 @@ void testIm2col() {
  if (paddle::platform::is_cpu_place(*place)) {
    out_cfo_ptr = output_cfo.data<float>();
  } else {
-    output_tmp.CopyFrom<float>(output_cfo, paddle::platform::CPUPlace(),
-                               *context);
+    output_tmp.CopyFrom(output_cfo, paddle::platform::CPUPlace(), *context);
    out_cfo_ptr = output_tmp.data<float>();
  }
  EXPECT_EQ(out_cfo_ptr[0], 0);
@@ -102,8 +101,7 @@ void testIm2col() {
  if (paddle::platform::is_cpu_place(*place)) {
    out_ocf_ptr = output_ocf.data<float>();
  } else {
-    output_tmp.CopyFrom<float>(output_ocf, paddle::platform::CPUPlace(),
-                               *context);
+    output_tmp.CopyFrom(output_ocf, paddle::platform::CPUPlace(), *context);
    out_ocf_ptr = output_tmp.data<float>();
  }
  EXPECT_EQ(out_ocf_ptr[0], 0);

--- a/paddle/operators/math/math_function_test.cu
+++ b/paddle/operators/math/math_function_test.cu
@@ -16,15 +16,15 @@ TEST(math_function, notrans_mul_trans) {
  auto* gpu_place = new paddle::platform::GPUPlace(0);
  paddle::platform::CUDADeviceContext context(*gpu_place);

-  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
-  input2_gpu.CopyFrom<float>(input1, *gpu_place, context);
+  input1_gpu.CopyFrom(input1, *gpu_place, context);
+  input2_gpu.CopyFrom(input1, *gpu_place, context);

  out_gpu.mutable_data<float>({2, 2}, *gpu_place);

  paddle::operators::math::matmul<paddle::platform::GPUPlace, float>(
      context, input1_gpu, false, input2_gpu, true, 1, &out_gpu, 0);

-  out.CopyFrom<float>(out_gpu, *cpu_place, context);
+  out.CopyFrom(out_gpu, *cpu_place, context);

  float* out_ptr = out.data<float>();
  context.Wait();
@@ -50,15 +50,15 @@ TEST(math_function, trans_mul_notrans) {
  auto* gpu_place = new paddle::platform::GPUPlace(0);
  paddle::platform::CUDADeviceContext context(*gpu_place);

-  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
-  input2_gpu.CopyFrom<float>(input1, *gpu_place, context);
+  input1_gpu.CopyFrom(input1, *gpu_place, context);
+  input2_gpu.CopyFrom(input1, *gpu_place, context);

  out_gpu.mutable_data<float>({3, 3}, *gpu_place);

  paddle::operators::math::matmul<paddle::platform::GPUPlace, float>(
      context, input1_gpu, true, input2_gpu, false, 1, &out_gpu, 0);

-  out.CopyFrom<float>(out_gpu, *cpu_place, context);
+  out.CopyFrom(out_gpu, *cpu_place, context);

  float* out_ptr = out.data<float>();
  context.Wait();
@@ -99,9 +99,9 @@ TEST(math_function, gemm_notrans_cublas) {
  auto* gpu_place = new paddle::platform::GPUPlace(0);
  paddle::platform::CUDADeviceContext context(*gpu_place);

-  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
-  input2_gpu.CopyFrom<float>(input2, *gpu_place, context);
-  input3_gpu.CopyFrom<float>(input3, *gpu_place, context);
+  input1_gpu.CopyFrom(input1, *gpu_place, context);
+  input2_gpu.CopyFrom(input2, *gpu_place, context);
+  input3_gpu.CopyFrom(input3, *gpu_place, context);
  float* a = input1_gpu.data<float>();
  float* b = input2_gpu.data<float>();
  float* c = input3_gpu.mutable_data<float>(*gpu_place);
@@ -109,7 +109,7 @@ TEST(math_function, gemm_notrans_cublas) {
  paddle::operators::math::gemm<paddle::platform::GPUPlace, float>(
      context, false, false, m, n, k, 1, a, 3, b + 1, 4, 1, c + 1, 4);

-  input3.CopyFrom<float>(input3_gpu, *cpu_place, context);
+  input3.CopyFrom(input3_gpu, *cpu_place, context);

  // numpy code:
  // a = np.arange(6).reshape(2, 3)
@@ -154,9 +154,9 @@ TEST(math_function, gemm_trans_cublas) {
  auto* gpu_place = new paddle::platform::GPUPlace(0);
  paddle::platform::CUDADeviceContext context(*gpu_place);

-  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
-  input2_gpu.CopyFrom<float>(input2, *gpu_place, context);
-  input3_gpu.CopyFrom<float>(input3, *gpu_place, context);
+  input1_gpu.CopyFrom(input1, *gpu_place, context);
+  input2_gpu.CopyFrom(input2, *gpu_place, context);
+  input3_gpu.CopyFrom(input3, *gpu_place, context);
  float* a = input1_gpu.data<float>();
  float* b = input2_gpu.data<float>();
  float* c = input3_gpu.mutable_data<float>(*gpu_place);
@@ -164,7 +164,7 @@ TEST(math_function, gemm_trans_cublas) {
  paddle::operators::math::gemm<paddle::platform::GPUPlace, float>(
      context, false, true, m, n, k, 1, a, 3, b + 3, 3, 1, c + 1, 4);

-  input3.CopyFrom<float>(input3_gpu, *cpu_place, context);
+  input3.CopyFrom(input3_gpu, *cpu_place, context);
  context.Wait();

  EXPECT_EQ(input3_ptr[0], 0);

--- a/paddle/operators/math/selected_rows_functor_test.cu
+++ b/paddle/operators/math/selected_rows_functor_test.cu
@@ -67,7 +67,7 @@ TEST(selected_rows_functor, gpu_add) {
  EXPECT_EQ(out_rows[6], 9);

  Tensor out_cpu;
-  out_cpu.CopyFrom<float>(*out_value, cpu_place, ctx);
+  out_cpu.CopyFrom(*out_value, cpu_place, ctx);
  ctx.Wait();

  auto* out_cpu_data = out_cpu.data<float>();
@@ -94,7 +94,7 @@ TEST(selected_rows_functor, gpu_add) {
  add_tensor_functor(ctx, *output, *tensor1, tensor2.get());

  Tensor tensor2_cpu;
-  tensor2_cpu.CopyFrom<float>(*tensor2, cpu_place, ctx);
+  tensor2_cpu.CopyFrom(*tensor2, cpu_place, ctx);
  ctx.Wait();

  auto* tensor2_cpu_data = tensor2_cpu.data<float>();

--- a/paddle/operators/math/sequence2batch.h
+++ b/paddle/operators/math/sequence2batch.h
@@ -107,11 +107,8 @@ class LoDTensor2BatchFunctor {
        size_t seq_len = seq_info[i].length;
        int start = seq_info[i].start;
        if (n < seq_len) {
-          if (!is_reverse) {
-            seq2batch_idx[batch_id] = start + n;
-          } else {
-            seq2batch_idx[batch_id] = start + seq_len - 1 - n;
-          }
+          seq2batch_idx[batch_id] =
+              is_reverse ? start + seq_len - 1 - n : start + n;
          batch_id++;
        } else {
          break;
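
Reviewer sketch (not part of the patch): the ternary above picks, for step `n`, either the `n`-th element from the front of a sequence or the `n`-th from the back. A minimal Python model of that index computation follows. The function name `seq2batch_index` and the `(start, length)` tuples are hypothetical, and the input is assumed to be sorted by length in descending order, which is what the `break` in the surrounding loop suggests.

```python
def seq2batch_index(seq_info, is_reverse=False):
    """Build the sequence-to-batch index list.

    seq_info: list of (start, length) pairs, ASSUMED sorted by length,
    longest first (hypothetical layout used only for this sketch).
    """
    max_len = seq_info[0][1] if seq_info else 0
    idx = []
    for n in range(max_len):  # n is the time step
        for start, seq_len in seq_info:
            if n < seq_len:
                # Same expression as the ternary in the diff.
                idx.append(start + seq_len - 1 - n if is_reverse else start + n)
            else:
                # Shorter sequences are exhausted; the rest are shorter still.
                break
    return idx


forward = seq2batch_index([(0, 3), (3, 2)])
backward = seq2batch_index([(0, 3), (3, 2)], is_reverse=True)
```

With two sequences of lengths 3 and 2, the forward order interleaves the sequences step by step, and the reverse order does the same starting from each sequence's last element.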

--- a/paddle/operators/math/vol2col_test.cc
+++ b/paddle/operators/math/vol2col_test.cc
@@ -78,7 +78,7 @@ void testVol2col() {
  if (paddle::platform::is_cpu_place(*place)) {
    input = input_tmp;
  } else {
-    input.CopyFrom<float>(input_tmp, *place, *context);
+    input.CopyFrom(input_tmp, *place, *context);
  }
  output.mutable_data<float>({1, filter_size, filter_size, filter_size,
                              output_depth, output_height, output_width},
@@ -93,7 +93,7 @@ void testVol2col() {
  if (paddle::platform::is_cpu_place(*place)) {
    out_cfo_ptr = output.data<float>();
  } else {
-    output_tmp.CopyFrom<float>(output, paddle::platform::CPUPlace(), *context);
+    output_tmp.CopyFrom(output, paddle::platform::CPUPlace(), *context);
    out_cfo_ptr = output_tmp.data<float>();
  }

@@ -107,7 +107,7 @@ void testVol2col() {
  if (paddle::platform::is_cpu_place(*place)) {
    input = input_tmp;
  } else {
-    input.CopyFrom<float>(input_tmp, *place, *context);
+    input.CopyFrom(input_tmp, *place, *context);
  }

  paddle::operators::math::Col2VolFunctor<Place, float> col2vol;
@@ -118,7 +118,7 @@ void testVol2col() {
  if (paddle::platform::is_cpu_place(*place)) {
    in_ptr = input.data<float>();
  } else {
-    input_tmp.CopyFrom<float>(input, paddle::platform::CPUPlace(), *context);
+    input_tmp.CopyFrom(input, paddle::platform::CPUPlace(), *context);
    in_ptr = input_tmp.data<float>();
  }


--- a/paddle/operators/matmul_op.h
+++ b/paddle/operators/matmul_op.h
@@ -46,7 +46,7 @@ class MatMulKernel : public framework::OpKernel<T> {
 template <typename T>
 inline Tensor Reshape(const Tensor& input, const DDim& dims) {
  Tensor output;
-  output.ShareDataWith<T>(input);
+  output.ShareDataWith(input);
  output.Resize(dims);
  return output;
 }
@@ -56,7 +56,7 @@ inline Tensor Reshape(const Tensor& input, const DDim& dims) {
 template <typename T>
 Tensor CombineBatchAndM(const Tensor& input) {
  Tensor output;
-  output.ShareDataWith<T>(input);
+  output.ShareDataWith(input);
  auto in_dims = input.dims();
  if (in_dims.size() == 3) {
    std::vector<int64_t> out_dims = {in_dims[0] * in_dims[1], in_dims[2]};
@@ -80,7 +80,7 @@ Tensor CombineBatchAndN(const framework::ExecutionContext& context,
    std::vector<int64_t> out_dims = {in_dims[1], in_dims[0] * in_dims[2]};
    output.Resize(make_ddim(out_dims));
  } else {
-    output.ShareDataWith<T>(input);
+    output.ShareDataWith(input);
  }
  return output;
 }

--- a/paddle/operators/momentum_op.cc
+++ b/paddle/operators/momentum_op.cc
@@ -75,12 +75,17 @@ class MomentumOpMaker : public framework::OpProtoAndCheckerMaker {
    AddOutput("VelocityOut", "(Tensor) Output updated velocity");

    AddAttr<float>("mu", "(float) Momentum coefficient");
+    AddAttr<bool>("useNesterov", "(bool) Use Nesterov Momentum")
+        .SetDefault(false);
    AddComment(R"DOC(

-Momentum Algorithm (momentum).
+Momentum Algorithm with a flag for Nesterov Momentum (momentum).

 velocity = mu * velocity + gradient
-param = param - learning_rate * velocity
+if (use_nesterov):
+  param = param - gradient * learning_rate + mu * velocity * learning_rate
+else:
+  param = param - learning_rate * velocity

 )DOC");
  }

--- a/paddle/operators/momentum_op.h
+++ b/paddle/operators/momentum_op.h
@@ -34,6 +34,7 @@ class MomentumOpKernel : public framework::OpKernel<T> {
    velocity_out->mutable_data<T>(ctx.GetPlace());

    float mu = ctx.Attr<float>("mu");
+    bool use_nesterov = ctx.Attr<bool>("useNesterov");

    auto p_out = framework::EigenVector<T>::Flatten(*param_out);
    auto v_out = framework::EigenVector<T>::Flatten(*velocity_out);
@@ -46,8 +47,14 @@ class MomentumOpKernel : public framework::OpKernel<T> {
    auto place = ctx.GetEigenDevice<Place>();

    Eigen::DSizes<int, 1> grad_dsize(grad->numel());
+
    v_out.device(place) = v * mu + g;
-    p_out.device(place) = p - lr.broadcast(grad_dsize) * v_out;
+    if (use_nesterov) {
+      p_out.device(place) = p - g * lr.broadcast(grad_dsize) +
+                            v_out * mu * lr.broadcast(grad_dsize);
+    } else {
+      p_out.device(place) = p - lr.broadcast(grad_dsize) * v_out;
+    }
  }
 };
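
Reviewer sketch (not part of the patch): a scalar Python model of the update rule exactly as the kernel above writes it, useful for checking values by hand. The function name `momentum_step` is hypothetical. Note that the Nesterov branch adds the `mu * velocity_out * lr` term, following the diff; the commonly cited Nesterov formulation subtracts it, i.e. `param - (grad + mu * velocity_out) * lr`, so the sign may be worth double-checking.

```python
def momentum_step(param, grad, velocity, lr, mu, use_nesterov=False):
    """One momentum update on scalars, mirroring the kernel in this diff."""
    # v_out = v * mu + g
    velocity_out = mu * velocity + grad
    if use_nesterov:
        # As written in the diff: p - g * lr + v_out * mu * lr
        param_out = param - grad * lr + velocity_out * mu * lr
    else:
        # Plain momentum: p - lr * v_out
        param_out = param - lr * velocity_out
    return param_out, velocity_out


plain = momentum_step(1.0, 0.5, 0.2, lr=0.1, mu=0.9)
nesterov = momentum_step(1.0, 0.5, 0.2, lr=0.1, mu=0.9, use_nesterov=True)
```

Both branches share the same velocity update; only the parameter update differs.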


--- a/paddle/operators/mul_op.h
+++ b/paddle/operators/mul_op.h
@@ -36,12 +36,12 @@ class MulKernel : public framework::OpKernel<T> {
    Tensor* z = context.Output<Tensor>("Out");
    const Tensor x_matrix =
        x->dims().size() > 2
-            ? framework::ReshapeToMatrix<T>(
+            ? framework::ReshapeToMatrix(
                  *x, context.template Attr<int>("x_num_col_dims"))
            : *x;
    const Tensor y_matrix =
        y->dims().size() > 2
-            ? framework::ReshapeToMatrix<T>(
+            ? framework::ReshapeToMatrix(
                  *y, context.template Attr<int>("y_num_col_dims"))
            : *y;

@@ -59,30 +59,30 @@ class MulGradKernel : public framework::OpKernel<T> {
    int y_num_col_dims = ctx.template Attr<int>("y_num_col_dims");
    const Tensor* x = ctx.Input<Tensor>("X");
    const Tensor* y = ctx.Input<Tensor>("Y");
-    const Tensor x_matrix =
-        x->dims().size() > 2 ? framework::ReshapeToMatrix<T>(*x, x_num_col_dims)
-                             : *x;
-    const Tensor y_matrix =
-        y->dims().size() > 2 ? framework::ReshapeToMatrix<T>(*y, y_num_col_dims)
-                             : *y;
+    const Tensor x_matrix = x->dims().size() > 2
+                                ? framework::ReshapeToMatrix(*x, x_num_col_dims)
+                                : *x;
+    const Tensor y_matrix = y->dims().size() > 2
+                                ? framework::ReshapeToMatrix(*y, y_num_col_dims)
+                                : *y;
    const Tensor* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));

    Tensor* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
    Tensor* dy = ctx.Output<Tensor>(framework::GradVarName("Y"));
    if (dx) {
      dx->mutable_data<T>(ctx.GetPlace());
-      Tensor dx_matrix = dx->dims().size() > 2 ? framework::ReshapeToMatrix<T>(
-                                                     *dx, x_num_col_dims)
-                                               : *dx;
+      Tensor dx_matrix = dx->dims().size() > 2
+                             ? framework::ReshapeToMatrix(*dx, x_num_col_dims)
+                             : *dx;
      // dx = dout * y'. dx: M x K, dout : M x N, y : K x N
      math::matmul<Place, T>(ctx.device_context(), *dout, false, y_matrix, true,
                             1, &dx_matrix, 0);
    }
    if (dy) {
      dy->mutable_data<T>(ctx.GetPlace());
-      Tensor dy_matrix = dy->dims().size() > 2 ? framework::ReshapeToMatrix<T>(
-                                                     *dy, y_num_col_dims)
-                                               : *dy;
+      Tensor dy_matrix = dy->dims().size() > 2
+                             ? framework::ReshapeToMatrix(*dy, y_num_col_dims)
+                             : *dy;
      // dy = x' * dout. dy K x N, dout : M x N, x : M x K
      math::matmul<Place, T>(ctx.device_context(), x_matrix, true, *dout, false,
                             1, &dy_matrix, 0);
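
Reviewer sketch (not part of the patch): the kernels above flatten any input of rank greater than 2 into a matrix before calling `matmul`. Assuming `ReshapeToMatrix` multiplies the first `num_col_dims` dimensions into the row count and the remaining ones into the column count (an assumption about the helper, which is not shown in this diff), the resulting shape can be modelled as follows. The name `flatten_to_2d_shape` is hypothetical.

```python
from functools import reduce
from operator import mul


def flatten_to_2d_shape(dims, num_col_dims):
    """Return the (rows, cols) shape of a tensor flattened to a matrix.

    ASSUMPTION: the first num_col_dims dimensions form the rows and the
    remaining dimensions form the columns.
    """
    rows = reduce(mul, dims[:num_col_dims], 1)
    cols = reduce(mul, dims[num_col_dims:], 1)
    return (rows, cols)


shape_a = flatten_to_2d_shape((2, 3, 4, 5), 2)
shape_b = flatten_to_2d_shape((2, 3, 4), 1)
```

Under that assumption, the shape comments in the gradient kernel (`dx: M x K`, `dy: K x N`) refer to these flattened shapes rather than to the original tensor ranks.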

--- a/paddle/operators/multiplex_op.cu
+++ b/paddle/operators/multiplex_op.cu
@@ -33,8 +33,7 @@ class MultiplexGPUKernel : public framework::OpKernel<T> {
    auto cols = ins[0]->numel() / rows;
    // copy index to cpu
    Tensor index_t_cpu;
-    index_t_cpu.CopyFrom<int32_t>(*ids, platform::CPUPlace(),
-                                  ctx.device_context());
+    index_t_cpu.CopyFrom(*ids, platform::CPUPlace(), ctx.device_context());
    auto* index = index_t_cpu.data<int32_t>();
    auto stream = reinterpret_cast<const platform::CUDADeviceContext&>(
                      ctx.device_context())
@@ -71,8 +70,7 @@ class MultiplexGradGPUKernel : public framework::OpKernel<T> {
    auto cols = ins[0]->numel() / rows;
    // copy index to cpu
    Tensor index_t_cpu;
-    index_t_cpu.CopyFrom<int32_t>(*ids, platform::CPUPlace(),
-                                  ctx.device_context());
+    index_t_cpu.CopyFrom(*ids, platform::CPUPlace(), ctx.device_context());
    auto* index = index_t_cpu.data<int32_t>();

    auto stream = reinterpret_cast<const platform::CUDADeviceContext&>(

--- a/paddle/operators/recurrent_op.cc
+++ b/paddle/operators/recurrent_op.cc
--- a/paddle/operators/reduce_op.cc
+++ b/paddle/operators/reduce_op.cc
--- a/paddle/operators/reshape_op.h
+++ b/paddle/operators/reshape_op.h
@@ -33,7 +33,7 @@ class ReshapeKernel : public framework::OpKernel<T> {
    std::transform(shape.begin(), shape.end(), shape_int64.begin(),
                   [](int a) { return static_cast<int64_t>(a); });
    auto out_dims = framework::make_ddim(shape_int64);
-    out->CopyFrom<T>(*in, ctx.GetPlace(), ctx.device_context());
+    out->CopyFrom(*in, ctx.GetPlace(), ctx.device_context());
    out->Resize(out_dims);
  }
 };
@@ -47,7 +47,7 @@ class ReshapeGradKernel : public framework::OpKernel<T> {
    d_x->mutable_data<T>(ctx.GetPlace());

    auto in_dims = d_x->dims();
-    d_x->CopyFrom<T>(*d_out, ctx.GetPlace(), ctx.device_context());
+    d_x->CopyFrom(*d_out, ctx.GetPlace(), ctx.device_context());
    d_x->Resize(in_dims);
  }
 };

--- a/paddle/operators/rnn/recurrent_op_utils.cc
+++ b/paddle/operators/rnn/recurrent_op_utils.cc
--- a/paddle/operators/rnn/recurrent_op_utils.h
+++ b/paddle/operators/rnn/recurrent_op_utils.h
@@ -31,7 +31,7 @@ using Scope = framework::Scope;
 * boot memories in father scope. Other attributes are copied from Op's proto
 * attributes.
 */
-struct MemoryAttr {
+struct StateAttr {
  // name of current state variable
  std::string var;
  // name of previous step's state variable
@@ -46,7 +46,7 @@ struct Argument {
  std::string step_scopes;
  std::vector<std::string> inlinks;
  std::vector<std::string> outlinks;
-  std::vector<rnn::MemoryAttr> memories;
+  std::vector<rnn::StateAttr> states;
 };

 struct ArgumentName {
@@ -54,9 +54,9 @@ struct ArgumentName {
  std::string step_scopes;
  std::string inlinks;
  std::string outlinks;
-  std::string memories;       // the memory name
-  std::string pre_memories;   // the previous memory name
-  std::string boot_memories;  // the boot memory name
+  std::string states;          // the state name
+  std::string ex_states;       // the previous state name
+  std::string initial_states;  // the initial (boot) state name
 };

 /**
@@ -74,7 +74,7 @@ void ConcatOutputs(const std::vector<Scope*>& step_scopes,
                   const size_t seq_len, const platform::DeviceContext& ctx);

 void LinkMemories(const std::vector<Scope*>& step_scopes,
-                  const std::vector<MemoryAttr>& memories, const size_t step_id,
+                  const std::vector<StateAttr>& memories, const size_t step_id,
                  const int offset);

 void InitArgument(const ArgumentName& name, Argument* arg,

--- a/paddle/operators/scatter_op.cu
+++ b/paddle/operators/scatter_op.cu
@@ -30,7 +30,7 @@ class ScatterOpCUDAKernel : public framework::OpKernel<T> {
    auto *Updates = ctx.Input<Tensor>("Updates");
    auto *Out = ctx.Output<Tensor>("Out");

-    Out->ShareDataWith<T>(*Ref);
+    Out->ShareDataWith(*Ref);

    GPUScatterAssign<T>(ctx.device_context(), *Updates, *Index, Out);
  }
@@ -48,7 +48,7 @@ class ScatterGradOpCUDAKernel : public framework::OpKernel<T> {
    auto *dOut = ctx.Input<Tensor>(framework::GradVarName("Out"));

    // In place gradient: dRef = dO
-    dRef->ShareDataWith<T>(*dOut);
+    dRef->ShareDataWith(*dOut);
    dUpdates->mutable_data<T>(ctx.GetPlace());
    // Gradient by Gather: dUpdates = dO[Index]
    GPUGather<T>(ctx.device_context(), *dOut, *Index, dUpdates);

--- a/paddle/operators/scatter_op.h
+++ b/paddle/operators/scatter_op.h
--- a/paddle/operators/sequence_concat_op.h
+++ b/paddle/operators/sequence_concat_op.h
--- a/paddle/operators/sequence_pool_op.h
+++ b/paddle/operators/sequence_pool_op.h
--- a/paddle/operators/sequence_softmax_op.h
+++ b/paddle/operators/sequence_softmax_op.h
--- a/paddle/operators/softmax_with_cross_entropy_op.cu
+++ b/paddle/operators/softmax_with_cross_entropy_op.cu
--- a/paddle/operators/softmax_with_cross_entropy_op.h
+++ b/paddle/operators/softmax_with_cross_entropy_op.h
--- a/paddle/operators/uniform_random_op.cc
+++ b/paddle/operators/uniform_random_op.cc
--- a/paddle/pybind/protobuf.cc
+++ b/paddle/pybind/protobuf.cc
--- a/paddle/pybind/pybind.cc
+++ b/paddle/pybind/pybind.cc
--- a/paddle/scripts/docker/build.sh
+++ b/paddle/scripts/docker/build.sh
--- a/paddle/trainer/tests/sample_trainer_config_branch_net.conf
+++ b/paddle/trainer/tests/sample_trainer_config_branch_net.conf
--- a/paddle/trainer/tests/sample_trainer_config_simple_net.conf
+++ b/paddle/trainer/tests/sample_trainer_config_simple_net.conf
--- a/python/paddle/trainer/PyDataProvider2.py
+++ b/python/paddle/trainer/PyDataProvider2.py
--- a/python/paddle/v2/framework/executor.py
+++ b/python/paddle/v2/framework/executor.py
--- a/python/paddle/v2/framework/framework.py
+++ b/python/paddle/v2/framework/framework.py
--- a/python/paddle/v2/framework/layer_helper.py
+++ b/python/paddle/v2/framework/layer_helper.py
--- a/python/paddle/v2/framework/layers.py
+++ b/python/paddle/v2/framework/layers.py
--- a/python/paddle/v2/framework/nets.py
+++ b/python/paddle/v2/framework/nets.py
--- a/python/paddle/v2/framework/optimizer.py
+++ b/python/paddle/v2/framework/optimizer.py
--- a/python/paddle/v2/framework/tests/.gitignore
+++ b/python/paddle/v2/framework/tests/.gitignore
--- a/python/paddle/v2/framework/tests/test_adam_op.py
+++ b/python/paddle/v2/framework/tests/test_adam_op.py
--- a/python/paddle/v2/framework/tests/test_adamax_op.py
+++ b/python/paddle/v2/framework/tests/test_adamax_op.py
--- a/python/paddle/v2/framework/tests/test_cross_entropy_op.py
+++ b/python/paddle/v2/framework/tests/test_cross_entropy_op.py
--- a/python/paddle/v2/framework/tests/test_dynamic_recurrent_op.py
+++ b/python/paddle/v2/framework/tests/test_dynamic_recurrent_op.py
--- a/python/paddle/v2/framework/tests/test_elementwise_add_op.py
+++ b/python/paddle/v2/framework/tests/test_elementwise_add_op.py
--- a/python/paddle/v2/framework/tests/test_executor_and_mul.py
+++ b/python/paddle/v2/framework/tests/test_executor_and_mul.py
--- a/python/paddle/v2/framework/tests/test_feed_fetch_method.py
+++ b/python/paddle/v2/framework/tests/test_feed_fetch_method.py
--- a/python/paddle/v2/framework/tests/test_fit_a_line.py
+++ b/python/paddle/v2/framework/tests/test_fit_a_line.py
--- a/python/paddle/v2/framework/tests/test_increment_op.py
+++ b/python/paddle/v2/framework/tests/test_increment_op.py
--- a/python/paddle/v2/framework/tests/test_layers.py
+++ b/python/paddle/v2/framework/tests/test_layers.py
--- a/python/paddle/v2/framework/tests/test_momentum_op.py
+++ b/python/paddle/v2/framework/tests/test_momentum_op.py
--- a/python/paddle/v2/framework/tests/test_op_support_gpu.py
+++ b/python/paddle/v2/framework/tests/test_op_support_gpu.py
--- a/python/paddle/v2/framework/tests/test_optimizer.py
+++ b/python/paddle/v2/framework/tests/test_optimizer.py
--- a/python/paddle/v2/framework/tests/test_recognize_digits_conv.py
+++ b/python/paddle/v2/framework/tests/test_recognize_digits_conv.py
--- a/python/paddle/v2/framework/tests/test_recognize_digits_mlp.py
+++ b/python/paddle/v2/framework/tests/test_recognize_digits_mlp.py
--- a/python/paddle/v2/framework/tests/test_recurrent_op.py
+++ b/python/paddle/v2/framework/tests/test_recurrent_op.py
--- a/python/paddle/v2/framework/tests/test_reduce_op.py
+++ b/python/paddle/v2/framework/tests/test_reduce_op.py
--- a/python/paddle/v2/framework/tests/test_rmsprop_op.py
+++ b/python/paddle/v2/framework/tests/test_rmsprop_op.py
--- a/python/paddle/v2/framework/tests/test_uniform_random_op.py
+++ b/python/paddle/v2/framework/tests/test_uniform_random_op.py
--- a/python/paddle/v2/parameters.py
+++ b/python/paddle/v2/parameters.py
--- a/python/paddle/v2/tests/CMakeLists.txt
+++ b/python/paddle/v2/tests/CMakeLists.txt
--- a/python/paddle/v2/tests/test_data_feeder.py
+++ b/python/paddle/v2/tests/test_data_feeder.py
--- a/python/paddle/v2/tests/test_paramconf_order.py
+++ b/python/paddle/v2/tests/test_paramconf_order.py
--- a/python/paddle/v2/topology.py
+++ b/python/paddle/v2/topology.py
--- a/python/paddle/v2/trainer.py
+++ b/python/paddle/v2/trainer.py