Merge remote-tracking branch 'upstream/develop' into factorization_machine_layer

b3cd6796 · wangmeng28 · 2ce8f187 · f43b1a90 · b3cd6796 · b3cd6796
109 changed files
--- a/doc/design/executor.md
+++ b/doc/design/executor.md
+# Executor Design Doc
+## Motivation
+We use executor to do the runtime evaluation of a `ProgramDesc`.
+## Overview
+An executor takes a `ProgramDesc`, a `block_id` and a `Scope`. The `ProgramDesc` is a list of blocks, and each block contains the protobuf definition of all the parameters and operators. The `block_id` specifies the entrance block. The `Scope` is the container of all the variable instances, which is persistent throughout different runs.
+### What does executor do?
+It evaluates all the operators in the `block_id`th block of a `ProgramDesc`.
+### What does executor NOT do?
+It does not do runtime optimization, meaning it does not intelligently parse the dependencies of each op and choose which ops to run and in which order.
+It does not do graph partitioning, meaning it does not divide the `ProgramDesc` into several small pieces and execute them on different devices.
+## Implementation
+`Executor` evaluates a `ProgramDesc`. Essentially, it instantiates Variables and Operators, then runs all the operators in sequence. [[code]](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/executor.cc)
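Reviewer note: the sequential evaluation described in this design doc can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Paddle's executor: `run_block`, the `OPS` table, and the dict-based program layout are hypothetical names introduced only for illustration.

```python
# Toy model of sequential block execution. All names are hypothetical.

# Each "kernel" reads its inputs from the scope and returns the output value.
OPS = {
    "fill_constant": lambda scope, ins, attrs: attrs["value"],
    "add": lambda scope, ins, attrs: scope[ins[0]] + scope[ins[1]],
    "mul": lambda scope, ins, attrs: scope[ins[0]] * scope[ins[1]],
}


def run_block(program, block_id, scope):
    """Run every op of program["blocks"][block_id] in declaration order.

    There is no dependency analysis, no reordering, and no partitioning
    across devices. The scope is mutated in place, so variables persist
    across runs.
    """
    for op in program["blocks"][block_id]["ops"]:
        kernel = OPS[op["type"]]
        scope[op["output"]] = kernel(scope, op.get("inputs", []), op.get("attrs", {}))
    return scope


program = {
    "blocks": [
        {
            "ops": [
                {"type": "fill_constant", "output": "a", "attrs": {"value": 2.0}},
                {"type": "fill_constant", "output": "b", "attrs": {"value": 3.0}},
                {"type": "add", "inputs": ["a", "b"], "output": "c"},
                {"type": "mul", "inputs": ["c", "b"], "output": "d"},
            ]
        }
    ]
}

scope = {}
run_block(program, 0, scope)
```

Because the scope outlives a single call, a second `run_block` on another program could read `c` and `d` without recomputing them, which is the persistence property the doc attributes to `Scope`.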
--- a/doc/design/infer_var_type.md
+++ b/doc/design/infer_var_type.md
+# Design Doc: InferVarType
+## The Problem Posed
+A variable in our design can hold different types, such as `LoDTensor` and `SelectedRows`. An operator should be able to infer the variable types of its outputs.
+For example, a `lookup table` operator takes two `LoDTensor`; one is a float tensor as the embedding table, the other is an int tensor as word ID. The gradient operator of `lookup table` will generate a `SelectedRows` as its output. A `sum` operator can take both `LoDTensor` and `SelectedRows` as its inputs and will generate a `LoDTensor` if any of its inputs is `LoDTensor`, otherwise, the `sum` operator will generate `SelectedRows` as its output.
+The variable type will be constant at runtime. Every variable's type can either be set by the user (input data and parameters) or be inferred by the operator at compile time.
+## Proposed Solution
+`InferVarType` is a compile-time function that is registered to each operator. The interface of that function is:
+```c++
+using InferVarTypeFN = std::function<
+    void (const OpDescBind& /*op_desc*/, BlockDescBind* /*block*/)>;
+```
+It takes an operator description as its input, infers the output variable types, and stores them in the block description.
+The `InferVarTypeFN` will be registered in `OpInfo`, to replace `infer_var_type_` field. The `OpInfo` should be
+```cpp
+struct OpInfo {
+  InferVarTypeFN infer_var_type_;
+  ...
+};
+```
+The default `InferVarType` sets the output type to `LoDTensor`. `GetInferVarType()` returns this default when no function is registered.
+```cpp
+void DefaultInferVarType(const OpDescBind& op_desc, BlockDescBind* block) {
+  // set the output type of variable as `LoDTensor`.
+  // ...
+}
+struct OpInfo {
+  InferVarTypeFN infer_var_type_;
+  InferVarTypeFN GetInferVarType() const {
+    if (infer_var_type_) {
+      return infer_var_type_;
+    } else {
+      return DefaultInferVarType;
+    }
+  }
+};
+```
+## Register InferVarType
+We provide a thin base class for registering an `InferVarTypeFN`. Using a base class eases the implementation of the registry, since we can detect whether a registry entry is an `InferVarTypeFN` or not.
+```cpp
+class VarTypeInferer {
+public:
+  virtual void operator()(const OpDescBind& op_desc, BlockDescBind* block) const = 0;
+};
+```
+Operator developers can write a specialized `VarTypeInferer` as follows.
+```cpp
+class SpecialVarTypeInferer : public VarTypeInferer {
+public:
+  virtual void operator()(const OpDescBind& op_desc, BlockDescBind* block) const {
+    // .. own logic
+  }
+};
+```
+Then users can register the `InferVarType` just like `GradOpDescMaker` and `OpInfoMaker`.
+```
+REGISTER_OPERATOR(some_op, OpType, SpecialVarTypeInferer, ...);
+```
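Reviewer note: the inference rules stated in this doc (a default of `LoDTensor`, plus the special rule for `sum`) can be sketched as follows. This is a Python toy model, not the C++ registry; `register_infer_var_type`, `infer_var_type`, and the dict-based op and block layout are hypothetical.

```python
# Toy model of compile-time variable type inference. All names are hypothetical.
LOD_TENSOR = "LOD_TENSOR"
SELECTED_ROWS = "SELECTED_ROWS"

INFER_VAR_TYPE = {}  # op type -> inference function


def register_infer_var_type(op_type):
    """Decorator that stores an inference function for one op type."""
    def decorator(fn):
        INFER_VAR_TYPE[op_type] = fn
        return fn
    return decorator


def default_infer_var_type(op, var_types):
    # Default rule: every output is a LoDTensor.
    for name in op["outputs"]:
        var_types[name] = LOD_TENSOR


@register_infer_var_type("sum")
def sum_infer_var_type(op, var_types):
    # LoDTensor if any input is a LoDTensor, otherwise SelectedRows.
    any_dense = any(var_types[name] == LOD_TENSOR for name in op["inputs"])
    for name in op["outputs"]:
        var_types[name] = LOD_TENSOR if any_dense else SELECTED_ROWS


def infer_var_type(op, var_types):
    # Fall back to the default when the op registered nothing.
    INFER_VAR_TYPE.get(op["type"], default_infer_var_type)(op, var_types)


var_types = {"g1": SELECTED_ROWS, "g2": SELECTED_ROWS, "x": LOD_TENSOR}
infer_var_type({"type": "sum", "inputs": ["g1", "g2"], "outputs": ["s1"]}, var_types)
infer_var_type({"type": "sum", "inputs": ["g1", "x"], "outputs": ["s2"]}, var_types)
infer_var_type({"type": "mul", "inputs": ["x", "x"], "outputs": ["y"]}, var_types)
```

The fallback in `infer_var_type` plays the role that `GetInferVarType()` plays in the doc: callers never need to check whether an op registered its own function.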
--- a/doc/design/python_api.md
+++ b/doc/design/python_api.md
@@ -179,40 +179,104 @@ init_attr={
 `optimize_op_attrs` is not in the `VarDesc` message, but kept in the Python instance, as it will be used in the Python space when creating the optimize operator's `OpDesc`, and will be in the `OpDesc` message.
-## Layer Functions
+## Layer Function
-A layer is a Python function that creates some operators and variables.  Layers simplify the work of application programmers.
+A layer is a Python function that creates some operators and variables. Layers simplify the work of application programmers.
-### Data Layer
+Layer functions take `Variable`s and configuration parameters as their input and return the output variable(s).
+For example, `FullyConnected` takes one or more variables as its input. The input could be input data or another layer's output. There are many configuration options for a `FullyConnected` layer, such as layer size, activation, parameter names, initialization strategies of parameters, and so on. The `FullyConnected` layer will return an output variable.
+### Necessity for reusing code between layer functions
+There is a lot of code that can be reused, such as
+* Give the default value of configuration, e.g., the default initialization strategy for parameters is uniform random with `min = -1.0`, `max = 1.0`, and the default initialization strategy for bias is to fill zero.
+* Append the activation operator.
+* Create a temporary variable.
+* Create parameter.
+* Generate a unique name.
+* Add a bias.
+* ...
+A mechanism to reuse code between layer functions is necessary. It will be around [150 lines of code](https://github.com/PaddlePaddle/Paddle/pull/4724/files#diff-823b27e07e93914ada859232ae23f846R12) if we write a `FullyConnected` layer without any helper functions.
+### Comparison between global functions and helper class
+The `FullyConnected` layer will be as follows when we provide global functions:
 ```python
-def data_layer(name, type, column_name):
+def fc_layer(input, size, param_attr=None, bias_attr=None, act=None, name=None):
-    block = the_current_program.glolal_block()
+  if name is None:
-    var = block.create_global_var(
+    name = unique_name("fc")
-            name=name,
+  input = multiple_input(input)
-            shape=[None] + type.dims(),
+  param_attr = default_param_attr(param_attr)
-            dtype=type.dtype)
+  param_attr = multiple_param_attr(param_attr, len(input))
-    block.prepend_operator(block,
-                           type="Feed",
+  # mul
-                           inputs = None,
+  mul_results = []
-                           outputs = [var],
+  for ipt, attr in zip(input, param_attr):
-                           {column_name: column_name})
+    shape = ipt.shape[1:] + [size]
-    return var
+    w = g_program.global_block().create_parameter(shape, ipt.dtype, name, attr)
+    tmp = create_tmp_var(name)
+    g_program.current_block().append_op("mul", {ipt, w}, {tmp})
+    mul_results.append(tmp)
+  # add sum
+  ...
+  # add bias
+  ...
+  # add activation
+  ...
+  return out
 ```
-The input to the feed operator is a special variable in the global scope, which is the output of [Python readers](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/reader/README.md).
+We can provide many helper functions for layer developers. However, there are several disadvantages to global helper functions:
+1. We need a namespace for these methods, so that layer developers can quickly figure out which methods they can use.
+2. Global functions force layer developers to pass the same parameters again and again.
+So we provide a helper class, `LayerHelper`, to share code between layer functions. The `FullyConnected` layer will be as follows.
+```python
+def fc_layer(input, size, param_attr=None, bias_attr=None, act=None, name=None):
+  helper = LayerHelper(**locals())  # pass all parameters to LayerHelper
+  mul_results = []
+  for ipt, param in helper.iter_multiple_input_and_param():
+    w = helper.create_parameter(shape=ipt.shape[1:] + [size], dtype = ipt.dtype)
+    tmp = helper.create_tmp_variable()
+    helper.append_op('mul', {ipt, w}, {tmp})
+    mul_results.append(tmp)
+  pre_bias = helper.add_sum(mul_results)
+  pre_activation = helper.add_bias(pre_bias)
+  return helper.add_activation(pre_activation)
+```
+We not only use fewer lines of code to write `fc_layer` but also make the code easier to understand. At the same time, layer developers can figure out which functions they can invoke by typing `helper.` in a Python editor.
+### Implementation of layer helper
-### FC Layer
+We just keep all parameters of a layer function as a dictionary in the layer helper, as a private data member. Every method of the layer helper looks up the dictionary when it is invoked. In that way, we can implement one layer helper for all layer functions, even if some layers do not contain some operators. For example, the `activation` is used by the `FullyConnected` layer and convolution layers, but a cross-entropy layer does not use it. The example code of `add_activation` is:
 ```python
-def fc_layer(input, size, ...):
+class LayerHelper(object):
-    block = program.current_block()
+  def __init__(self, **kwargs):  # kwargs is short for `keyword arguments`
-    w = block.create_parameter(...)
+    self.kwargs = kwargs
-    b = block.create_parameter(...)
-    out = block.create_var()
+  def add_activation(self, input_var):
-    op = block.append_operator("FC", X=input, W=w, b=b, out=out)
+    act = self.kwargs.get("act", None)  # default value is None
-    out.writer = op
+    if act is None:  # do nothing if no act
-    return out
+      return input_var
+    tmp = self.create_tmp_var()
+    self.append_op(type=act, input=input_var, output=tmp)
+    return tmp
 ```
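Reviewer note: a runnable toy version of the helper idea may make the dictionary lookup easier to follow. It is only a sketch: the `ops` list stands in for the current block, and the temporary-variable naming scheme is an assumption, not something the design doc specifies.

```python
# Runnable toy version of the helper idea. The op list and the naming scheme
# are hypothetical stand-ins for a real program block.
import itertools


class LayerHelper(object):
    _counter = itertools.count()  # shared source of unique suffixes

    def __init__(self, ops, **kwargs):
        self.ops = ops        # list that stands in for the current block
        self.kwargs = kwargs  # all arguments of the layer function

    def create_tmp_variable(self):
        return "tmp_%d" % next(LayerHelper._counter)

    def append_op(self, type, input, output):
        self.ops.append({"type": type, "input": input, "output": output})

    def add_activation(self, input_var):
        act = self.kwargs.get("act", None)
        if act is None:  # layers without an activation pass through unchanged
            return input_var
        tmp = self.create_tmp_variable()
        self.append_op(type=act, input=input_var, output=tmp)
        return tmp


ops = []
with_act = LayerHelper(ops, act="relu", size=10).add_activation("x")
without_act = LayerHelper(ops, size=10).add_activation("x")
```

A layer that never passes `act` gets its input back untouched, so the same helper method serves layers with and without an activation.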
 ## Optimizer

--- a/paddle/framework/CMakeLists.txt
+++ b/paddle/framework/CMakeLists.txt
@@ -43,16 +43,11 @@ cc_library(backward SRCS backward.cc DEPS net_op)
 cc_test(backward_test SRCS backward_test.cc DEPS backward recurrent_op device_context)
 cc_library(executor SRCS executor.cc DEPS op_registry device_context scope framework_proto backward)
-set(EXECUTOR_TEST_OP elementwise_add_op gaussian_random_op feed_op fetch_op
-    mul_op sum_op squared_l2_distance_op fill_constant_op sgd_op mean_op)
-if(WITH_GPU)
-    nv_test(executor_test SRCS executor_test.cc DEPS executor ${EXECUTOR_TEST_OP})
-else()
-    cc_test(executor_test SRCS executor_test.cc DEPS executor ${EXECUTOR_TEST_OP})
-endif()
 cc_library(tensor_array SRCS tensor_array.cc DEPS lod_tensor)
 cc_test(tensor_array_test SRCS tensor_array_test.cc DEPS tensor_array place)
+cc_test(var_type_inference_test SRCS var_type_inference_test.cc DEPS op_registry
+        proto_desc)
 cc_library(selected_rows SRCS selected_rows.cc DEPS tensor)
 cc_test(selected_rows_test SRCS selected_rows_test.cc DEPS selected_rows)
--- a/paddle/framework/attribute.h
+++ b/paddle/framework/attribute.h
@@ -120,6 +120,57 @@ class EnumInContainer {
  std::unordered_set<T> container_;
 };
+template <typename T>
+struct ExtractAttribute {
+  explicit ExtractAttribute(const std::string& attr_name)
+      : attr_name_(attr_name) {}
+  T* operator()(Attribute& attr) const {
+    T* attr_value = nullptr;
+    try {
+      attr_value = &boost::get<T>(attr);
+    } catch (boost::bad_get& bad_get) {
+      PADDLE_THROW("Cannot get attribute %s by type %s, its type is %s",
+                   attr_name_, typeid(T).name(), attr.type().name());
+    }
+    return attr_value;
+  }
+  const std::string& attr_name_;
+};
+// special handle bool
+// FIXME(yuyang18): Currently we cast bool into int in the Python binding. It
+// is hard to change the logic there. On the other hand, we should correctly
+// handle the case where the user sets `some_flag=1`.
+//
+// FIX ME anytime if there is a better solution.
+template <>
+struct ExtractAttribute<bool> {
+  explicit ExtractAttribute(const std::string& attr_name)
+      : attr_name_(attr_name) {}
+  bool* operator()(Attribute& attr) const {
+    if (attr.type() == typeid(int)) {  // NOLINT
+      int val = boost::get<int>(attr);
+      attr = static_cast<bool>(val);
+    } else if (attr.type() == typeid(float)) {  // NOLINT
+      float val = boost::get<float>(attr);
+      attr = static_cast<bool>(val);
+    }
+    bool* attr_value = nullptr;
+    try {
+      attr_value = &boost::get<bool>(attr);
+    } catch (boost::bad_get& bad_get) {
+      PADDLE_THROW("Cannot get attribute %s by type bool, its type is %s",
+                   attr_name_, attr.type().name());
+    }
+    return attr_value;
+  }
+  const std::string& attr_name_;
+};
 // check whether a certain attribute fit its limits
 // an attribute can have more than one limits
 template <typename T>
@@ -171,9 +222,10 @@ class TypedAttrChecker {
      attr_map[attr_name_] = val;
    }
    Attribute& attr = attr_map.at(attr_name_);
-    T& attr_value = boost::get<T>(attr);
+    ExtractAttribute<T> extract_attr(attr_name_);
+    T* attr_value = extract_attr(attr);
    for (const auto& checker : value_checkers_) {
-      checker(attr_value);
+      checker(*attr_value);
    }
  }
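Reviewer note: the bool special case in `ExtractAttribute` can be illustrated with a Python toy model. `extract_attribute` is a hypothetical name, and exact type matching via `type(...)` is a choice made here to imitate the variant-based check, not how the C++ code is written.

```python
# Toy model of typed attribute extraction with bool coercion.
# `extract_attribute` is a hypothetical name, not a Paddle API.


def extract_attribute(attr_map, name, expected_type):
    value = attr_map[name]
    if expected_type is bool and type(value) in (int, float):
        # A language binding may hand over 1 or 1.0 for a flag. Normalize it
        # in place so that later readers see a real bool.
        value = bool(value)
        attr_map[name] = value
    if type(value) is not expected_type:
        raise TypeError(
            "Cannot get attribute %s by type %s, its type is %s"
            % (name, expected_type.__name__, type(value).__name__)
        )
    return value


attrs = {"some_flag": 1, "scale": 0.5, "axis": "rows"}
flag = extract_attribute(attrs, "some_flag", bool)
scale = extract_attribute(attrs, "scale", float)
try:
    extract_attribute(attrs, "axis", int)
    error = None
except TypeError as exc:
    error = str(exc)
```

As in the diff, the coercion rewrites the stored attribute, so the checkers that run afterwards operate on the normalized value.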

--- a/paddle/framework/backward.cc
+++ b/paddle/framework/backward.cc
@@ -433,7 +433,7 @@ ParamGradInfoMap AppendBackward(
      new OpDescBind("fill_constant", {}, {{"Out", {fill_one_op_out}}},
                     {{"shape", std::vector<int>{1}},
                      {"value", static_cast<float>(1.0)},
-                      {"dataType", framework::DataType::FP32}}));
+                      {"data_type", framework::DataType::FP32}}));
  all_ops.push_back(std::move(fill_one_op));
  size_t forward_op_num = all_ops.size();
  size_t forward_block_num = program_desc.Size();

--- a/paddle/framework/details/op_registry.h
+++ b/paddle/framework/details/op_registry.h
@@ -18,6 +18,7 @@
 #include "paddle/framework/op_info.h"
 #include "paddle/framework/op_proto_maker.h"
 #include "paddle/framework/operator.h"
+#include "paddle/framework/var_type_inference.h"
 namespace paddle {
 namespace framework {
@@ -26,7 +27,8 @@ namespace details {
 enum OpInfoFillType {
  kOperator = 0,
  kOpProtoAndCheckerMaker = 1,
-  kGradOpDescMaker = 2
+  kGradOpDescMaker = 2,
+  kVarTypeInference = 3
 };
 template <typename T>
@@ -38,7 +40,9 @@ struct OpInfoFillTypeID {
                      ? kOpProtoAndCheckerMaker
                      : (std::is_base_of<GradOpDescMakerBase, T>::value
                             ? kGradOpDescMaker
-                             : static_cast<OpInfoFillType>(-1)));
+                             : (std::is_base_of<VarTypeInference, T>::value
+                                    ? kVarTypeInference
+                                    : static_cast<OpInfoFillType>(-1))));
  }
 };
@@ -106,6 +110,17 @@ struct OpInfoFiller<T, kGradOpDescMaker> {
    };
  }
 };
+template <typename T>
+struct OpInfoFiller<T, kVarTypeInference> {
+  void operator()(const char* op_type, OpInfo* info) const {
+    info->infer_var_type_ = [](const OpDescBind& fwd_op, BlockDescBind* block) {
+      T inference;
+      inference(fwd_op, block);
+    };
+  }
+};
 }  // namespace details
 }  // namespace framework
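Reviewer note: `OpInfoFillTypeID` picks a filler by checking which base class a registered type derives from. The same dispatch can be sketched in Python with `issubclass`. Every name below is hypothetical, and raising `TypeError` for an unrecognized class is a choice made for this sketch.

```python
# Toy model of classifying registration arguments by base class.
# Every name here is hypothetical; only the dispatch idea mirrors the diff.


class OperatorBase(object):
    pass


class GradOpDescMakerBase(object):
    pass


class VarTypeInference(object):
    def __call__(self, op_desc, block):
        raise NotImplementedError


# Ordered (base class, info field) pairs; the first match wins.
FILL_RULES = [
    (OperatorBase, "creator"),
    (GradOpDescMakerBase, "grad_op_maker"),
    (VarTypeInference, "infer_var_type"),
]


def register_operator(op_info_map, op_type, *classes):
    info = op_info_map.setdefault(op_type, {})
    for cls in classes:
        for base, field in FILL_RULES:
            if issubclass(cls, base):
                info[field] = cls
                break
        else:
            raise TypeError("%s is not a recognized registry entry" % cls.__name__)
    return info


class SumOp(OperatorBase):
    pass


class SumVarTypeInference(VarTypeInference):
    def __call__(self, op_desc, block):
        block[op_desc["output"]] = "LOD_TENSOR"


op_info_map = {}
info = register_operator(op_info_map, "sum", SumOp, SumVarTypeInference)
block = {}
info["infer_var_type"]()({"output": "out"}, block)
```

Adding a new kind of registry entry then means adding one base class and one rule, which matches the shape of the change to the enum and the filler specialization in this hunk.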

--- a/paddle/framework/executor_test.cc
+++ b/paddle/framework/executor_test.cc
-/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-    http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License. */
-#include "paddle/framework/executor.h"
-#include <memory>
-#include <vector>
-#include "gflags/gflags.h"
-#include "gtest/gtest.h"
-#include "paddle/framework/attribute.h"
-#include "paddle/framework/backward.h"
-#include "paddle/framework/block_desc.h"
-#include "paddle/framework/op_desc.h"
-#include "paddle/framework/op_registry.h"
-#include "paddle/framework/operator.h"
-USE_OP(elementwise_add);
-USE_OP(gaussian_random);
-USE_OP(feed);
-USE_OP(fetch);
-USE_OP(mul);
-USE_OP(sum);
-USE_OP(squared_l2_distance);
-USE_OP(fill_constant);
-USE_OP(mean);
-USE_OP(sgd);
-using namespace paddle::platform;
-using namespace paddle::framework;
-void AddOp(const std::string& type, const VariableNameMap& inputs,
-           const VariableNameMap& outputs, AttributeMap attrs,
-           paddle::framework::BlockDescBind* block) {
-  // insert output
-  for (auto kv : outputs) {
-    for (auto v : kv.second) {
-      // <<<<<<< HEAD
-      //       auto var = block->Var(v);
-      //       var->SetType(VarDesc::LOD_TENSOR);
-      //       var->SetDataType(paddle::framework::DataType::FP32);
-      // =======
-      if (!block->HasVar(v)) {
-        auto var = block->Var(v);
-        var->SetDataType(paddle::framework::DataType::FP32);
-      }
-      // >>>>>>> origin/develop
-    }
-  }
-  // insert op
-  auto op = block->AppendOp();
-  op->SetType(type);
-  for (auto& kv : inputs) {
-    op->SetInput(kv.first, kv.second);
-  }
-  for (auto& kv : outputs) {
-    op->SetOutput(kv.first, kv.second);
-  }
-  op->SetAttrMap(attrs);
-  op->CheckAttrs();
-}
-// Tensors in feed value variable will only be in CPUPlace
-// So we can memcpy the data from vector<T> to feed_value
-template <typename T>
-void SetFeedVariable(const std::vector<std::vector<T>>& inputs,
-                     const std::vector<std::vector<int64_t>>& dims) {
-  Variable* g_feed_value = GetGlobalScope().FindVar("feed_value");
-  auto& feed_inputs =
-      *(g_feed_value->GetMutable<std::vector<paddle::framework::Tensor>>());
-  size_t size = inputs.size();
-  feed_inputs.resize(size);
-  for (size_t i = 0; i < size; i++) {
-    T* dst = feed_inputs[i].mutable_data<T>(make_ddim(dims[i]), CPUPlace());
-    memcpy(dst, inputs[i].data(), inputs[i].size() * sizeof(T));
-  }
-}
-// Tensors in fetch value variable will only be in CPUPlace
-// So we can memcpy the data from fetch_value to vector<T>
-template <typename T>
-std::vector<std::vector<T>> GetFetchVariable() {
-  Variable* g_fetch_value = GetGlobalScope().FindVar("fetch_value");
-  auto& fetch_outputs =
-      *(g_fetch_value->GetMutable<std::vector<paddle::framework::Tensor>>());
-  size_t size = fetch_outputs.size();
-  std::vector<std::vector<T>> result;
-  result.reserve(size);
-  for (size_t i = 0; i < size; i++) {
-    std::vector<T> tmp;
-    tmp.resize(fetch_outputs[i].numel());
-    memcpy(tmp.data(), fetch_outputs[i].data<T>(),
-           fetch_outputs[i].numel() * sizeof(T));
-    result.push_back(tmp);
-  }
-  return result;
-}
-class ExecutorTesterRandom : public ::testing::Test {
- public:
-  virtual void SetUp() override {
-    int input_dim = 3, batch_size = 2, embed_dim = 5;
-    auto temp_init_root_block = init_pdesc_.add_blocks();
-    temp_init_root_block->set_idx(0);
-    temp_init_root_block->set_parent_idx(-1);
-    paddle::framework::ProgramDescBind& init_program =
-        paddle::framework::ProgramDescBind::Instance(&init_pdesc_);
-    paddle::framework::BlockDescBind* init_root_block = init_program.Block(0);
-    AddOp("gaussian_random", {}, {{"Out", {"w1"}}},
-          {{"dims", std::vector<int>{input_dim, embed_dim}}}, init_root_block);
-    AddOp("gaussian_random", {}, {{"Out", {"w2"}}},
-          {{"dims", std::vector<int>{embed_dim, input_dim}}}, init_root_block);
-    AddOp("fetch", {{"Input", {"w1"}}}, {}, {{"col", 0}}, init_root_block);
-    AddOp("fetch", {{"Input", {"w2"}}}, {}, {{"col", 1}}, init_root_block);
-    // flush
-    init_program.Proto();
-    // run block
-    auto temp_root_block = pdesc_.add_blocks();
-    temp_root_block->set_idx(0);
-    temp_root_block->set_parent_idx(-1);
-    paddle::framework::ProgramDescBind& program =
-        paddle::framework::ProgramDescBind::Instance(&pdesc_);
-    paddle::framework::BlockDescBind* root_block = program.Block(0);
-    // feed data
-    inputs_.push_back({1.0, 1.0, 1.0, 1.0, 1.0, 1.0});
-    dims_.push_back({batch_size, input_dim});
-    AddOp("feed", {}, {{"Out", {"a"}}},
-          {{"dims", std::vector<int>{batch_size, input_dim}}, {"col", 0}},
-          root_block);
-    // forward
-    AddOp("mul", {{"X", {"a"}}, {"Y", {"w1"}}}, {{"Out", {"b"}}}, {},
-          root_block);
-    AddOp("mul", {{"X", {"b"}}, {"Y", {"w2"}}}, {{"Out", {"a_out"}}}, {},
-          root_block);
-    AddOp("squared_l2_distance", {{"X", {"a"}}, {"Y", {"a_out"}}},
-          {{"Out", {"l2_distance"}}, {"sub_result", {"l2_distance_sub"}}}, {},
-          root_block);
-    AddOp("mean", {{"X", {"l2_distance"}}}, {{"Out", {"mean_out"}}}, {},
-          root_block);
-    // backward
-    auto target = VarDescBind("mean_out");
-    AppendBackward(program, target, {});
-    // update
-    AddOp("fill_constant", {}, {{"Out", {"learning_rate"}}},
-          {{"shape", std::vector<int>{1}}, {"value", float(0.001)}},
-          root_block);
-    AddOp("sgd", {{"Param", {"w1"}},
-                  {"LearningRate", {"learning_rate"}},
-                  {"Grad", {"w1@GRAD"}}},
-          {{"ParamOut", {"w1"}}}, {}, root_block);
-    AddOp("sgd", {{"Param", {"w2"}},
-                  {"LearningRate", {"learning_rate"}},
-                  {"Grad", {"w2@GRAD"}}},
-          {{"ParamOut", {"w2"}}}, {}, root_block);
-    AddOp("fetch", {{"Input", {"w1"}}}, {}, {{"col", 0}}, root_block);
-    AddOp("fetch", {{"Input", {"w2"}}}, {}, {{"col", 1}}, root_block);
-    AddOp("fetch", {{"Input", {"l2_distance"}}}, {}, {{"col", 0}}, root_block);
-    // flush
-    program.Proto();
-  }
- protected:
-  ProgramDesc init_pdesc_;
-  ProgramDesc pdesc_;
-  std::vector<std::vector<float>> inputs_;
-  std::vector<std::vector<int64_t>> dims_;
-};
-class ExecutorTesterFeedAndFetch : public ::testing::Test {
- public:
-  virtual void SetUp() override {
-    auto temp_root_block = pdesc_.add_blocks();
-    temp_root_block->set_idx(0);
-    temp_root_block->set_parent_idx(-1);
-    // wrap to BlockDescBind
-    paddle::framework::ProgramDescBind& program =
-        paddle::framework::ProgramDescBind::Instance(&pdesc_);
-    paddle::framework::BlockDescBind* root_block = program.Block(0);
-    std::vector<int> dim{6};
-    AddOp("feed", {}, {{"Out", {"a"}}}, {{"dims", dim}, {"col", 0}},
-          root_block);
-    AddOp("feed", {}, {{"Out", {"b"}}}, {{"dims", dim}, {"col", 1}},
-          root_block);
-    AddOp("fetch", {{"Input", {"a"}}}, {}, {{"col", 0}}, root_block);
-    AddOp("fetch", {{"Input", {"b"}}}, {}, {{"col", 1}}, root_block);
-    // flush
-    program.Proto();
-    std::vector<float> vec1 = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
-    std::vector<float> vec2 = {4.0, 5.0, 6.0, 7.0, 8.0, 9.0};
-    inputs_.push_back(vec1);
-    inputs_.push_back(vec2);
-    dims_.push_back({static_cast<int64_t>(vec1.size())});
-    dims_.push_back({static_cast<int64_t>(vec2.size())});
-  }
- protected:
-  ProgramDesc pdesc_;
-  std::vector<std::vector<float>> inputs_;
-  std::vector<std::vector<int64_t>> dims_;
-};
-#ifndef PADDLE_WITH_CUDA
-TEST_F(ExecutorTesterRandom, CPU) {
-  std::vector<Place> places;
-  CPUPlace cpu_place;
-  places.push_back(cpu_place);
-  // We have a global Scope and BuddyAllocator, and we must ensure
-  // global BuddyAllocator is initialized before global Scope. Thus,
-  // global Scope will deconstruct before BuddyAllocator. Otherwise,
-  // "pointer being freed was not allocated" error will appear.
-  paddle::memory::Used(cpu_place);
-  std::unique_ptr<Executor> executor(new Executor(places));
-  executor->Run(init_pdesc_, &GetGlobalScope(), 0);
-  SetFeedVariable<float>(inputs_, dims_);
-  executor->Run(pdesc_, &GetGlobalScope(), 0);
-  std::vector<std::vector<float>> result = GetFetchVariable<float>();
-}
-TEST_F(ExecutorTesterFeedAndFetch, CPU) {
-  std::vector<Place> places;
-  CPUPlace cpu_place;
-  places.push_back(cpu_place);
-  // We have a global Scope and BuddyAllocator, and we must ensure
-  // global BuddyAllocator is initialized before global Scope. Thus,
-  // global Scope will deconstruct before BuddyAllocator. Otherwise,
-  // "pointer being freed was not allocated" error will appear.
-  paddle::memory::Used(cpu_place);
-  std::unique_ptr<Executor> executor(new Executor(places));
-  for (int batch_id = 0; batch_id < 3; batch_id++) {
-    SetFeedVariable<float>(inputs_, dims_);
-    executor->Run(pdesc_, &GetGlobalScope(), 0);
-    std::vector<std::vector<float>> result = GetFetchVariable<float>();
-    PADDLE_ENFORCE_EQ(result.size(), inputs_.size());
-    for (size_t i = 0; i < result.size(); ++i) {
-      PADDLE_ENFORCE_EQ(result[i].size(), inputs_[i].size());
-      for (size_t j = 0; j < result[i].size(); ++j) {
-        PADDLE_ENFORCE_EQ(result[i][j], inputs_[i][j]);
-      }
-    }
-  }
-}
-#else
-TEST_F(ExecutorTesterRandom, GPU) {
-  std::vector<Place> places;
-  GPUPlace gpu_place(0);
-  places.push_back(gpu_place);
-  // We have a global Scope and BuddyAllocator, and we must ensure
-  // global BuddyAllocator is initialized before global Scope. Thus,
-  // global Scope will deconstruct before BuddyAllocator. Otherwise,
-  // "pointer being freed was not allocated" error will appear.
-  // If paddle is compiled with GPU, both CPU and GPU BuddyAllocator
-  // need to be used at first.
-  paddle::memory::Used(CPUPlace());
-  paddle::memory::Used(gpu_place);
-  std::unique_ptr<Executor> executor(new Executor(places));
-  executor->Run(init_pdesc_, &GetGlobalScope(), 0);
-  for (int batch_id = 0; batch_id < 3; batch_id++) {
-    SetFeedVariable<float>(inputs_, dims_);
-    executor->Run(pdesc_, &GetGlobalScope(), 0);
-  }
-}
-TEST_F(ExecutorTesterFeedAndFetch, GPU) {
-  std::vector<Place> places;
-  GPUPlace gpu_place(0);
-  places.push_back(gpu_place);
-  // We have a global Scope and BuddyAllocator, and we must ensure
-  // global BuddyAllocator is initialized before global Scope. Thus,
-  // global Scope will deconstruct before BuddyAllocator. Otherwise,
-  // "pointer being freed was not allocated" error will appear.
-  // If paddle is compiled with GPU, both CPU and GPU BuddyAllocator
-  // need to be used at first.
-  paddle::memory::Used(CPUPlace());
-  paddle::memory::Used(gpu_place);
-  std::unique_ptr<Executor> executor(new Executor(places));
-  for (int batch_id = 0; batch_id < 3; batch_id++) {
-    SetFeedVariable<float>(inputs_, dims_);
-    executor->Run(pdesc_, &GetGlobalScope(), 0);
-    std::vector<std::vector<float>> result = GetFetchVariable<float>();
-    PADDLE_ENFORCE_EQ(result.size(), inputs_.size());
-    for (size_t i = 0; i < result.size(); ++i) {
-      PADDLE_ENFORCE_EQ(result[i].size(), inputs_[i].size());
-      for (size_t j = 0; j < result[i].size(); ++j) {
-        PADDLE_ENFORCE_EQ(result[i][j], inputs_[i][j]);
-      }
-    }
-  }
-}
-DECLARE_double(fraction_of_gpu_memory_to_use);
-int main(int argc, char** argv) {
-  testing::InitGoogleTest(&argc, argv);
-  // Use less GPU memory for unittest.
-  FLAGS_fraction_of_gpu_memory_to_use = 0.25;
-  return RUN_ALL_TESTS();
-}
-#endif
--- a/paddle/operators/fetch_op.h
+++ b/paddle/operators/fetch_op.h
@@ -13,33 +13,38 @@ See the License for the specific language governing permissions and
 limitations under the License. */
 #pragma once
-#include "paddle/framework/eigen.h"
+#include "paddle/framework/scope.h"
-#include "paddle/framework/op_registry.h"
+#include "paddle/framework/variable.h"
 namespace paddle {
-namespace operators {
+namespace framework {
 template <typename T>
-class FetchKernel : public framework::OpKernel<T> {
+void SetFeedVariable(const LoDTensor& input, const std::string& var_name,
- public:
+                     size_t index) {
-  void Compute(const framework::ExecutionContext& ctx) const override {
+  // If var_name Variable is not found in GlobalScope, a new variable will
-    const framework::Tensor* input = ctx.Input<framework::Tensor>("Input");
+  // be created.
-    framework::Variable* g_fetch_variable =
+  Variable* g_feed_value = GetGlobalScope().Var(var_name);
-        framework::GetGlobalScope().FindVar("fetch_value");
+  auto& feed_inputs =
-    auto* tensors =
+      *(g_feed_value->GetMutable<std::vector<paddle::framework::LoDTensor>>());
-        g_fetch_variable->GetMutable<std::vector<framework::Tensor>>();
+  if (index >= feed_inputs.size()) {
-    int col = ctx.template Attr<int>("col");
+    feed_inputs.resize(index + 1);
-    if (tensors->size() < static_cast<size_t>(col + 1)) {
-      tensors->resize(col + 1);
-    }
-    PADDLE_ENFORCE_GT(tensors->size(), static_cast<size_t>(col));
-    (*tensors)[col].Resize(input->dims());
-    (*tensors)[col].mutable_data<T>(platform::CPUPlace());
-    (*tensors)[col].CopyFrom<T>(*input, platform::CPUPlace(),
-                                ctx.device_context());
-    // TODO(qijun): need to handle LodTensor later
  }
-};
+  // shared data with input tensor
+  feed_inputs[index].ShareDataWith<T>(input);
+  // set lod
+  feed_inputs[index].set_lod(input.lod());
+}
-}  // namespace operators
+LoDTensor& GetFetchVariable(const std::string& var_name, size_t index) {
+  // Since we want to fetch LoDTensor from a variable, the variable must
+  // be created already.
+  Variable* g_fetch_value = GetGlobalScope().FindVar(var_name);
+  auto& fetch_outputs =
+      *(g_fetch_value->GetMutable<std::vector<paddle::framework::LoDTensor>>());
+  PADDLE_ENFORCE_LT(index, fetch_outputs.size());
+  return fetch_outputs[index];
+}
+}  // namespace framework
 }  // namespace paddle
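Reviewer note: the new feed and fetch helpers treat one scope variable as a list of tensors addressed by index. A Python toy model of that contract follows. `GLOBAL_SCOPE`, `set_feed_variable`, and `get_fetch_variable` are hypothetical, and plain lists stand in for `LoDTensor`.

```python
# Toy model of the feed/fetch helpers: a variable in a global scope holds a
# list of tensors addressed by index. All names are hypothetical.
GLOBAL_SCOPE = {}


def set_feed_variable(tensor, var_name, index):
    # Create the variable on first use, then grow the list if needed.
    slots = GLOBAL_SCOPE.setdefault(var_name, [])
    if index >= len(slots):
        slots.extend([None] * (index + 1 - len(slots)))
    slots[index] = tensor  # stores a reference, so data is shared, not copied


def get_fetch_variable(var_name, index):
    # The variable must already exist; fetching never creates it.
    slots = GLOBAL_SCOPE[var_name]
    if index >= len(slots):
        raise IndexError("index %d out of range for %s" % (index, var_name))
    return slots[index]


data = [1.0, 2.0, 3.0]
set_feed_variable(data, "feed_value", 2)
fetched = get_fetch_variable("feed_value", 2)
```

The asymmetry is deliberate in the diff as well: feeding may create and resize the variable, while fetching enforces that the index is already in range.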
--- a/paddle/framework/feed_fetch_type.h
+++ b/paddle/framework/feed_fetch_type.h
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+   http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+#pragma once
+#include <vector>
+#include "paddle/framework/lod_tensor.h"
+namespace paddle {
+namespace framework {
+using FeedFetchType = LoDTensor;
+using FeedFetchList = std::vector<FeedFetchType>;
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/framework/grad_op_desc_maker.h
+++ b/paddle/framework/grad_op_desc_maker.h
@@ -149,5 +149,13 @@ class DefaultGradOpDescMaker : public SingleGradOpDescMaker {
  }
 };
+class EmptyGradOpMaker : public GradOpDescMakerBase {
+ public:
+  using GradOpDescMakerBase::GradOpDescMakerBase;
+  std::vector<std::unique_ptr<OpDescBind>> operator()() const override {
+    return {};
+  }
+};
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/op_desc.cc
+++ b/paddle/framework/op_desc.cc
@@ -220,9 +220,12 @@ static InferShapeFuncMap &InferShapeFuncs() {
 void OpDescBind::CheckAttrs() {
  PADDLE_ENFORCE(!Type().empty(),
                 "CheckAttr() can not be called before type is setted.");
-  const auto *checker = OpInfoMap::Instance().Get(Type()).Checker();
+  auto *checker = OpInfoMap::Instance().Get(Type()).Checker();
-  PADDLE_ENFORCE_NOT_NULL(checker, "Operator \"%s\" has no registered checker.",
+  if (checker == nullptr) {
-                          Type());
+    // checker is not configured. That operator could be generated by Paddle,
+    // not by users.
+    return;
+  }
  checker->Check(attrs_);
 }
@@ -236,5 +239,19 @@ void OpDescBind::InferShape(const BlockDescBind &block) const {
  it->second(&ctx);
 }
+void OpDescBind::InferVarType(BlockDescBind *block) const {
+  auto &info = OpInfoMap::Instance().Get(this->Type());
+  if (info.infer_var_type_) {
+    info.infer_var_type_(*this, block);
+  } else {
+    // all output types are LoDTensor by default
+    for (auto &out_pair : this->outputs_) {
+      for (auto &out_var_name : out_pair.second) {
+        block->Var(out_var_name)->SetType(VarDesc::LOD_TENSOR);
+      }
+    }
+  }
+}
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/op_desc.h
+++ b/paddle/framework/op_desc.h
@@ -102,6 +102,8 @@ class OpDescBind {
  void InferShape(const BlockDescBind &block) const;
+  void InferVarType(BlockDescBind *block) const;
  void Flush();
 private:

--- a/paddle/framework/op_info.h
+++ b/paddle/framework/op_info.h
@@ -19,7 +19,6 @@
 #include <unordered_map>
 #include "paddle/framework/attribute.h"
-#include "paddle/framework/op_desc.h"
 #include "paddle/framework/type_defs.h"
 #include "paddle/platform/macros.h"
@@ -31,6 +30,7 @@ struct OpInfo {
  GradOpMakerFN grad_op_maker_;
  OpProto* proto_{nullptr};
  OpAttrChecker* checker_{nullptr};
+  InferVarTypeFN infer_var_type_;
  bool HasOpProtoAndChecker() const {
    return proto_ != nullptr && checker_ != nullptr;

--- a/paddle/framework/op_registry.h
+++ b/paddle/framework/op_registry.h
@@ -82,16 +82,6 @@ class OpRegistry {
  static std::unique_ptr<OperatorBase> CreateOp(const OpDescBind& op_desc);
 };
-template <typename OpType, typename ProtoMakerType, typename GradOpType>
-class OpRegistrar : public Registrar {
- public:
-  explicit OpRegistrar(const char* op_type) { OpRegistrar(op_type, ""); }
-  OpRegistrar(const char* op_type, const char* grad_op_type) {
-    OpRegistry::RegisterOp<OpType, ProtoMakerType, GradOpType>(op_type,
-                                                               grad_op_type);
-  }
-};
 template <typename PlaceType, bool at_end, size_t I, typename... KernelType>
 struct OpKernelRegistrarFunctor;

--- a/paddle/framework/selected_rows.h
+++ b/paddle/framework/selected_rows.h
@@ -10,6 +10,7 @@ See the License for the specific language governing permissions and
 limitations under the License. */
 #pragma once
+#include "paddle/framework/lod_tensor.h"
 #include "paddle/framework/tensor.h"
 namespace paddle {
@@ -34,9 +35,9 @@ class SelectedRows {
  void set_height(int64_t height) { height_ = height; }
-  const std::vector<int64_t>& rows() const { return rows_; }
+  const Vector<int64_t>& rows() const { return rows_; }
-  void set_rows(const std::vector<int64_t>& rows) { rows_ = rows; }
+  void set_rows(const Vector<int64_t>& rows) { rows_ = rows; }
  DDim GetCompleteDims() const {
    std::vector<int64_t> dims = vectorize(value_->dims());
@@ -45,7 +46,10 @@ class SelectedRows {
  }
 private:
-  std::vector<int64_t> rows_;
+  // Notice: rows can be duplicated. We can have {0, 4, 7, 0, 5, 7, 9} here.
+  // SelectedRows are simply concatenated when added together. Duplicate rows
+  // are handled only when a SelectedRows is added to a Tensor.
+  Vector<int64_t> rows_;
  std::unique_ptr<Tensor> value_{nullptr};
  int64_t height_;
 };

--- a/paddle/framework/tensor.h
+++ b/paddle/framework/tensor.h
@@ -100,6 +100,22 @@ class Tensor {
  inline void CopyFrom(const Tensor& src, const platform::Place& dst_place,
                       const platform::DeviceContext& ctx);
+  // FIXME(yuyang18): CopyFrom should not take the template parameter T;
+  // replace `CopyFrom` with `CopyFromTensor`.
+  inline void CopyFromTensor(const Tensor& src,
+                             const platform::Place& dst_place,
+                             const platform::DeviceContext& ctx) {
+    // NOLINTNEXTLINES_8 cpplint.py mistakes the lines below for functions.
+    // That is a bug in cpplint.py; just skip linting these lines.
+    if (src.type() == std::type_index(typeid(double))) {
+      CopyFrom<double>(src, dst_place, ctx);
+    } else if (src.type() == std::type_index(typeid(float))) {
+      CopyFrom<float>(src, dst_place, ctx);
+    } else if (src.type() == std::type_index(typeid(int))) {
+      CopyFrom<int>(src, dst_place, ctx);
+    }
+  }
  /**
   * @brief   Copy the content of an external vector to a tensor.
   *

--- a/paddle/framework/type_defs.h
+++ b/paddle/framework/type_defs.h
@@ -16,12 +16,18 @@
 #include <functional>
 #include <map>
 #include <memory>
+#include <string>
+#include <unordered_map>
+#include <unordered_set>
+#include <vector>
 #include "paddle/platform/variant.h"
 namespace paddle {
 namespace framework {
 class OperatorBase;
 class OpDescBind;
+class BlockDescBind;
+class BlockDesc;
 using VariableNameMap = std::map<std::string, std::vector<std::string>>;
 // The order should be as same as framework.proto
@@ -40,5 +46,8 @@ using GradOpMakerFN = std::function<std::vector<std::unique_ptr<OpDescBind>>(
    const OpDescBind&, const std::unordered_set<std::string>& /*no_grad_set*/,
    std::unordered_map<std::string, std::string>* /*grad_to_var*/)>;
+using InferVarTypeFN = std::function<void(const OpDescBind& /*op_desc*/,
+                                          BlockDescBind* /*block*/)>;
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/var_type_inference.h
+++ b/paddle/framework/var_type_inference.h
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+   http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+#pragma once
+#include "paddle/framework/type_defs.h"
+namespace paddle {
+namespace framework {
+class VarTypeInference {
+ public:
+  virtual ~VarTypeInference() {}
+  virtual void operator()(const OpDescBind& op_desc,
+                          BlockDescBind* block) const = 0;
+};
+}  // namespace framework
+}  // namespace paddle
--- a/paddle/framework/var_type_inference_test.cc
+++ b/paddle/framework/var_type_inference_test.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+   http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+#include "paddle/framework/var_type_inference.h"
+#include <algorithm>
+#include "gtest/gtest.h"
+#include "paddle/framework/op_registry.h"
+#include "paddle/framework/operator.h"
+#include "paddle/framework/program_desc.h"
+namespace paddle {
+namespace framework {
+class SumOpMaker : public OpProtoAndCheckerMaker {
+ public:
+  SumOpMaker(OpProto *proto, OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("X", "").AsDuplicable();
+    AddOutput("Out", "");
+    AddComment("");
+  }
+};
+class SumOpVarTypeInference : public VarTypeInference {
+ public:
+  void operator()(const OpDescBind &op_desc,
+                  BlockDescBind *block) const override {
+    auto &inputs = op_desc.Input("X");
+    auto default_var_type = VarDesc::SELECTED_ROWS;
+    bool any_input_is_lod_tensor = std::any_of(
+        inputs.begin(), inputs.end(), [block](const std::string &name) {
+          return block->Var(name)->GetType() == VarDesc::LOD_TENSOR;
+        });
+    if (any_input_is_lod_tensor) {
+      default_var_type = VarDesc::LOD_TENSOR;
+    }
+    auto out_var_name = op_desc.Output("Out").front();
+    block->Var(out_var_name)->SetType(default_var_type);
+  }
+};
+}  // namespace framework
+}  // namespace paddle
+REGISTER_OPERATOR(sum, paddle::framework::NOP, paddle::framework::SumOpMaker,
+                  paddle::framework::SumOpVarTypeInference);
+REGISTER_OPERATOR(sum_without_infer_var_type, paddle::framework::NOP,
+                  paddle::framework::SumOpMaker);
+namespace paddle {
+namespace framework {
+TEST(InferVarType, sum_op) {
+  auto &prog = ProgramDescBind::Instance(&GetProgramDesc());
+  auto *op = prog.Block(0)->AppendOp();
+  op->SetType("sum");
+  op->SetInput("X", {"test_a", "test_b", "test_c"});
+  op->SetOutput("Out", {"test_out"});
+  prog.Block(0)->Var("test_a")->SetType(VarDesc::SELECTED_ROWS);
+  prog.Block(0)->Var("test_b")->SetType(VarDesc::SELECTED_ROWS);
+  prog.Block(0)->Var("test_c")->SetType(VarDesc::SELECTED_ROWS);
+  prog.Block(0)->Var("test_out");
+  op->InferVarType(prog.Block(0));
+  ASSERT_EQ(VarDesc::SELECTED_ROWS, prog.Block(0)->Var("test_out")->GetType());
+  prog.Block(0)->Var("test_b")->SetType(VarDesc::LOD_TENSOR);
+  op->InferVarType(prog.Block(0));
+  ASSERT_EQ(VarDesc::LOD_TENSOR, prog.Block(0)->Var("test_out")->GetType());
+}
+TEST(InferVarType, sum_op_without_infer_var_type) {
+  auto &prog = ProgramDescBind::Instance(&GetProgramDesc());
+  auto *op = prog.Block(0)->AppendOp();
+  op->SetType("sum_without_infer_var_type");
+  op->SetInput("X", {"test2_a", "test2_b", "test2_c"});
+  op->SetOutput("Out", {"test2_out"});
+  prog.Block(0)->Var("test2_a")->SetType(VarDesc::SELECTED_ROWS);
+  prog.Block(0)->Var("test2_b")->SetType(VarDesc::SELECTED_ROWS);
+  prog.Block(0)->Var("test2_c")->SetType(VarDesc::SELECTED_ROWS);
+  prog.Block(0)->Var("test2_out");
+  op->InferVarType(prog.Block(0));
+  ASSERT_EQ(VarDesc::LOD_TENSOR,
+            prog.Block(0)->Var("test2_out")->GetType());
+}
+}  // namespace framework
+}  // namespace paddle
\ No newline at end of file
--- a/paddle/operators/accuracy_op.cc
+++ b/paddle/operators/accuracy_op.cc
@@ -21,7 +21,6 @@ class AccuracyOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Inference"),
                   "Input(Inference) of AccuracyOp should not be null.");

--- a/paddle/operators/activation_op.cc
+++ b/paddle/operators/activation_op.cc
@@ -21,7 +21,6 @@ class ActivationOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    ctx->SetOutputDim("Y", ctx->GetInputDim("X"));
    ctx->ShareLoD("X", /*->*/ "Y");
@@ -32,7 +31,6 @@ class ActivationOpGrad : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("Y"));
  }

--- a/paddle/operators/adadelta_op.cc
+++ b/paddle/operators/adadelta_op.cc
@@ -21,7 +21,6 @@ class AdadeltaOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Param"),
                   "Input(Param) of AdadeltaOp should not be null.");

--- a/paddle/operators/adagrad_op.cc
+++ b/paddle/operators/adagrad_op.cc
@@ -21,7 +21,6 @@ class AdagradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Param"),
                   "Input(Param) of AdagradOp should not be null.");

--- a/paddle/operators/adam_op.cc
+++ b/paddle/operators/adam_op.cc
@@ -21,7 +21,6 @@ class AdamOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Param"),
                   "Input(Param) of AdamOp should not be null.");

--- a/paddle/operators/adamax_op.cc
+++ b/paddle/operators/adamax_op.cc
@@ -21,7 +21,6 @@ class AdamaxOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Param"),
                   "Input(Param) of AdamaxOp should not be null.");

--- a/paddle/operators/clip_op.cc
+++ b/paddle/operators/clip_op.cc
@@ -21,7 +21,6 @@ class ClipOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
                   "Input(X) of ClipOp should not be null.");
@@ -60,7 +59,6 @@ class ClipOpGrad : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null");
    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),

--- a/paddle/operators/concat_op.cc
+++ b/paddle/operators/concat_op.cc
@@ -23,7 +23,6 @@ class ConcatOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE_GE(ctx->Inputs("X").size(), 1UL,
                      "Inputs(X) of ConcatOp should be empty.")
@@ -82,7 +81,6 @@ class ConcatOpGrad : public framework::OperatorWithKernel {
               const framework::AttributeMap &attrs)
      : OperatorWithKernel(type, inputs, outputs, attrs) {}
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    ctx->SetOutputsDim(framework::GradVarName("X"), ctx->GetInputsDim("X"));
  }

--- a/paddle/operators/conv2d_op.h
+++ b/paddle/operators/conv2d_op.h
@@ -44,7 +44,6 @@ class Conv2DOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override;
 };
@@ -52,7 +51,6 @@ class Conv2DOpGrad : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override;
 };

--- a/paddle/operators/conv_shift_op.cc
+++ b/paddle/operators/conv_shift_op.cc
@@ -27,7 +27,6 @@ class ConvShiftOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null.");
    PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) should be not null.");
@@ -54,7 +53,6 @@ class ConvShiftGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null.");
    PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) should be not null.");

--- a/paddle/operators/cos_sim_op.cc
+++ b/paddle/operators/cos_sim_op.cc
@@ -23,7 +23,6 @@ class CosSimOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    // notnull check
    PADDLE_ENFORCE(ctx->HasInput("X"),
@@ -97,7 +96,6 @@ class CosSimOpGrad : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    // notnull check
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) must not be null.");

--- a/paddle/operators/crop_op.cc
+++ b/paddle/operators/crop_op.cc
@@ -24,7 +24,6 @@ class CropOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
                   "Input(X) of CropOp should not be null.");
@@ -114,7 +113,6 @@ class CropOpGrad : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null");
    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),

--- a/paddle/operators/cross_entropy_op.cc
+++ b/paddle/operators/cross_entropy_op.cc
@@ -21,7 +21,6 @@ class CrossEntropyOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null.");
    PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) should be not null.");
@@ -34,13 +33,13 @@ class CrossEntropyOp : public framework::OperatorWithKernel {
    PADDLE_ENFORCE_EQ(x_dims[0], label_dims[0],
                      "The 1st dimension of Input(X) and Input(Label) should "
                      "be equal.");
-    if (ctx->Attrs().Get<bool>("softLabel")) {
+    if (ctx->Attrs().Get<bool>("soft_label")) {
      PADDLE_ENFORCE_EQ(x_dims[1], label_dims[1],
-                        "If Attr(softLabel) == true, the 2nd dimension of "
+                        "If Attr(soft_label) == true, the 2nd dimension of "
                        "Input(X) and Input(Label) should be equal.");
    } else {
      PADDLE_ENFORCE_EQ(label_dims[1], 1,
-                        "If Attr(softLabel) == false, the 2nd dimension of "
+                        "If Attr(soft_label) == false, the 2nd dimension of "
                        "Input(Label) should be 1.");
    }
@@ -48,6 +47,7 @@ class CrossEntropyOp : public framework::OperatorWithKernel {
    ctx->ShareLoD("X", /*->*/ "Y");
  }
+ protected:
  // CrossEntropy's data type just determined by "X"
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
@@ -59,7 +59,6 @@ class CrossEntropyGradientOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null.");
    PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) should be not null.");
@@ -82,18 +81,19 @@ class CrossEntropyGradientOp : public framework::OperatorWithKernel {
                      "be equal.");
    PADDLE_ENFORCE_EQ(dy_dims[1], 1,
                      "The 2nd dimension of Input(Y@Grad) should be 1.");
-    if (ctx->Attrs().Get<bool>("softLabel")) {
+    if (ctx->Attrs().Get<bool>("soft_label")) {
      PADDLE_ENFORCE_EQ(x_dims[1], label_dims[1],
-                        "When Attr(softLabel) == true, the 2nd dimension of "
+                        "When Attr(soft_label) == true, the 2nd dimension of "
                        "Input(X) and Input(Label) should be equal.");
    } else {
      PADDLE_ENFORCE_EQ(label_dims[1], 1,
-                        "When Attr(softLabel) == false, the 2nd dimension of "
+                        "When Attr(soft_label) == false, the 2nd dimension of "
                        "Input(Label) should be 1.");
    }
    ctx->SetOutputDim(framework::GradVarName("X"), x_dims);
  }
+ protected:
  // CrossEntropy's data type just determined by "X"
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
@@ -115,15 +115,15 @@ class CrossEntropyOpMaker : public framework::OpProtoAndCheckerMaker {
        "Label",
        "(Tensor, default Tensor<int>), the ground truth which is "
        "a 2-D tensor. "
-        "When softLabel is set to false, `Label` is a Tensor<int> with shape "
+        "When soft_label is set to false, `Label` is a Tensor<int> with shape "
        "[N x 1]. "
-        "When softLabel is set to true, `Label` is a Tensor<float/double> "
+        "When soft_label is set to true, `Label` is a Tensor<float/double> "
        "with shape [N x K].");
    AddOutput("Y",
              "(Tensor, default Tensor<float>), a 2-D tensor "
              "with shape [N x 1]. The cross entropy loss.");
    AddAttr<bool>(
-        "softLabel",
+        "soft_label",
        "(bool, default false), a flag to indicate whether to interpretate "
        "the given labels as soft labels.")
        .SetDefault(false);
@@ -133,12 +133,12 @@ CrossEntropy Operator.
 It supports both standard cross-entropy and soft-label cross-entropy loss
 computation.
 1) One-hot cross-entropy:
-    softLabel = false, Label[i, 0] indicates the class index for sample i:
+    soft_label = false, Label[i, 0] indicates the class index for sample i:
                Y[i] = -log(X[i, Label[i]])
 2) Soft-label cross-entropy:
-    softLabel = true, Label[i, j] indicates the soft label of class j
+    soft_label = true, Label[i, j] indicates the soft label of class j
    for sample i:
                Y[i] = \sum_j{-Label[i, j] * log(X[i, j])}

--- a/paddle/operators/cross_entropy_op.cu
+++ b/paddle/operators/cross_entropy_op.cu
@@ -56,7 +56,7 @@ class CrossEntropyOpCUDAKernel : public framework::OpKernel<T> {
    y->mutable_data<T>(ctx.GetPlace());
    math::CrossEntropyFunctor<platform::GPUPlace, T>()(
-        ctx.device_context(), y, x, label, ctx.Attr<bool>("softLabel"));
+        ctx.device_context(), y, x, label, ctx.Attr<bool>("soft_label"));
  }
 };
@@ -83,7 +83,7 @@ class CrossEntropyGradientOpCUDAKernel : public framework::OpKernel<T> {
    int block = 512;
    int grid = (batch_size * class_num + block - 1) / block;
-    if (ctx.Attr<bool>("softLabel")) {
+    if (ctx.Attr<bool>("soft_label")) {
      auto* label_data = label->data<T>();
      SoftCrossEntropyGradientKernel<T><<<
          grid, block, 0, reinterpret_cast<const platform::CUDADeviceContext&>(
@@ -91,7 +91,8 @@ class CrossEntropyGradientOpCUDAKernel : public framework::OpKernel<T> {
                              .stream()>>>(dx_data, dy_data, x_data, label_data,
                                           batch_size, class_num);
    } else {
-      math::SetConstant<platform::GPUPlace, T>(ctx.device_context(), dx, 0);
+      math::SetConstant<platform::GPUPlace, T> functor;
+      functor(ctx.device_context(), dx, 0);
      auto* label_data = label->data<int>();
      grid = (batch_size + block - 1) / block;
      CrossEntropyGradientKernel<T><<<

--- a/paddle/operators/cross_entropy_op.h
+++ b/paddle/operators/cross_entropy_op.h
@@ -38,7 +38,7 @@ class CrossEntropyOpKernel : public framework::OpKernel<T> {
    y->mutable_data<T>(ctx.GetPlace());
    math::CrossEntropyFunctor<platform::CPUPlace, T>()(
-        ctx.device_context(), y, x, labels, ctx.Attr<bool>("softLabel"));
+        ctx.device_context(), y, x, labels, ctx.Attr<bool>("soft_label"));
  }
 };
@@ -55,7 +55,7 @@ class CrossEntropyGradientOpKernel : public framework::OpKernel<T> {
    T* dx_data = dx->mutable_data<T>(ctx.GetPlace());
    int class_num = x->dims()[1];
-    if (ctx.Attr<bool>("softLabel")) {
+    if (ctx.Attr<bool>("soft_label")) {
      auto x_mat = EigenMatrix<T>::From(*x);
      auto dy_mat = EigenMatrix<T>::From(*dy);
      auto lbl_mat = EigenMatrix<T>::From(*label);
@@ -70,7 +70,8 @@ class CrossEntropyGradientOpKernel : public framework::OpKernel<T> {
      const T* x_data = x->data<T>();
      const int* label_data = label->data<int>();
-      math::SetConstant<platform::CPUPlace, T>(ctx.device_context(), dx, 0);
+      math::SetConstant<platform::CPUPlace, T> functor;
+      functor(ctx.device_context(), dx, 0);
      for (int i = 0; i < batch_size; ++i) {
        PADDLE_ASSERT(label_data[i] >= 0 || label_data[i] < class_num);

--- a/paddle/operators/decayed_adagrad_op.cc
+++ b/paddle/operators/decayed_adagrad_op.cc
@@ -21,7 +21,6 @@ class DecayedAdagradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Param"),
                   "Input(Param) of DecayedAdagradOp should not be null.");

--- a/paddle/operators/dropout_op.cc
+++ b/paddle/operators/dropout_op.cc
@@ -23,7 +23,6 @@ class DropoutOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) must not be null.");
    PADDLE_ENFORCE_GE(ctx->Attrs().Get<float>("dropout_prob"), 0);
@@ -69,7 +68,6 @@ class DropoutOpGrad : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE_EQ(ctx->Attrs().Get<bool>("is_training"), 1,
                      "GradOp is only callable when is_training is true");

--- a/paddle/operators/elementwise_op.h
+++ b/paddle/operators/elementwise_op.h
@@ -23,7 +23,6 @@ class ElementwiseOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  using Tensor = framework::Tensor;
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
@@ -105,7 +104,6 @@ class ElementwiseOpGrad : public framework::OperatorWithKernel {
  using framework::OperatorWithKernel::OperatorWithKernel;
  using Tensor = framework::Tensor;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null");
    PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) should not be null");

--- a/paddle/operators/feed_op.cc
+++ b/paddle/operators/feed_op.cc
 /* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
-Licensed under the Apache License, Version 2.0 (the "License");
+   Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
+   you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
+   You may obtain a copy of the License at
-    http://www.apache.org/licenses/LICENSE-2.0
+   http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
+   Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
+   distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
+   See the License for the specific language governing permissions and
-limitations under the License. */
+   limitations under the License. */
-#include "paddle/operators/feed_op.h"
+#include "paddle/framework/feed_fetch_type.h"
+#include "paddle/framework/op_registry.h"
+#include "paddle/framework/operator.h"
 namespace paddle {
 namespace operators {
+class FeedOp : public framework::OperatorBase {
-class FeedOp : public framework::OperatorWithKernel {
- public:
-  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
-  void InferShape(framework::InferShapeContext* ctx) const override {
-    PADDLE_ENFORCE(ctx->HasOutput("Out"), "Output should be not null.");
-    auto& shape = ctx->Attrs().Get<std::vector<int>>("dims");
-    std::vector<int64_t> shape_int64(shape.size(), 0);
-    std::transform(shape.begin(), shape.end(), shape_int64.begin(),
-                   [](int a) { return static_cast<int64_t>(a); });
-    ctx->SetOutputDim("Out", framework::make_ddim(shape_int64));
-    // TODO(qijun): need to handle LodTensor later
-  }
-  framework::DataType IndicateDataType(
-      const framework::ExecutionContext& ctx) const override {
-    return static_cast<framework::DataType>(Attr<int>("dataType"));
-  }
-};
-class FeedOpMaker : public framework::OpProtoAndCheckerMaker {
 public:
-  FeedOpMaker(framework::OpProto* proto, framework::OpAttrChecker* op_checker)
+  FeedOp(const std::string &type, const framework::VariableNameMap &inputs,
-      : OpProtoAndCheckerMaker(proto, op_checker) {
+         const framework::VariableNameMap &outputs,
-    AddAttr<int>("dataType", "output data type")
+         const framework::AttributeMap &attrs)
-        .SetDefault(framework::DataType::FP32);
+      : OperatorBase(type, inputs, outputs, attrs) {}
-    AddAttr<int>("col", "The col in global feed variable").SetDefault(0);
+  void Run(const framework::Scope &scope,
-    AddAttr<std::vector<int>>("dims", "The dimension of feed tensor.");
+           const platform::DeviceContext &dev_ctx) const override {
-    AddOutput("Out", "The output of feed op.");
+    auto feed_var_name = Input("Input");
-    AddComment(R"DOC(Feed data from global feed variable)DOC");
+    auto *feed_var = scope.FindVar(feed_var_name);
+    PADDLE_ENFORCE(feed_var != nullptr,
+                   "Cannot find feed_var in scope, feed_var_name is %s",
+                   feed_var_name);
+    auto out_name = this->Output("Out");
+    auto *out_var = scope.FindVar(out_name);
+    PADDLE_ENFORCE(out_var != nullptr,
+                   "Cannot find out_var in scope, out_var_name is %s",
+                   out_name);
+    auto col = Attr<int>("col");
+    auto &feed_list = feed_var->Get<framework::FeedFetchList>();
+    auto &feed_item = feed_list.at(static_cast<size_t>(col));
+    auto *out_item = out_var->GetMutable<framework::FeedFetchType>();
+    out_item->CopyFromTensor(feed_item, dev_ctx.GetPlace(), dev_ctx);
+    out_item->set_lod(feed_item.lod());
  }
 };
 }  // namespace operators
 }  // namespace paddle
-namespace ops = paddle::operators;
+// We do not need to register OpInfoMaker,
-REGISTER_OP_WITHOUT_GRADIENT(feed, ops::FeedOp, ops::FeedOpMaker);
+// since feed operator will not be used by end users directly
-REGISTER_OP_CPU_KERNEL(feed, ops::FeedKernel<float>);
+REGISTER_OPERATOR(feed, paddle::operators::FeedOp,
+                  paddle::framework::EmptyGradOpMaker);
--- a/paddle/operators/feed_op.cu
+++ b/paddle/operators/feed_op.cu
-/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License. */
-#include "paddle/operators/feed_op.h"
-namespace ops = paddle::operators;
-REGISTER_OP_GPU_KERNEL(feed, ops::FeedKernel<float>);
--- a/paddle/operators/fetch_op.cc
+++ b/paddle/operators/fetch_op.cc
 /* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
-Licensed under the Apache License, Version 2.0 (the "License");
+   Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
+   you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
+   You may obtain a copy of the License at
-    http://www.apache.org/licenses/LICENSE-2.0
+   http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
+   Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
+   distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
+   See the License for the specific language governing permissions and
-limitations under the License. */
+   limitations under the License. */
-#include "paddle/operators/fetch_op.h"
+#include "paddle/framework/feed_fetch_type.h"
+#include "paddle/framework/op_registry.h"
 namespace paddle {
 namespace operators {
-class FetchOp : public framework::OperatorWithKernel {
+class FetchOp : public framework::OperatorBase {
 public:
-  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
-  void InferShape(framework::InferShapeContext* ctx) const override {
-    PADDLE_ENFORCE(ctx->HasInput("Input"), "Input should be not null.");
-  }
-  framework::DataType IndicateDataType(
-      const framework::ExecutionContext& ctx) const override {
-    return static_cast<framework::DataType>(Attr<int>("dataType"));
-  }
-};
-class FetchOpMaker : public framework::OpProtoAndCheckerMaker {
- public:
-  FetchOpMaker(framework::OpProto* proto, framework::OpAttrChecker* op_checker)
-      : OpProtoAndCheckerMaker(proto, op_checker) {
-    AddAttr<int>("dataType", "output data type")
-        .SetDefault(framework::DataType::FP32);
-    AddAttr<int>("col", "The col in global fetch variable").SetDefault(0);
-    AddInput("Input", "The output of fetch op.");
-    AddComment(R"DOC(Fetch data to global fetch variable)DOC");
+  FetchOp(const std::string &type, const framework::VariableNameMap &inputs,
+          const framework::VariableNameMap &outputs,
+          const framework::AttributeMap &attrs)
+      : OperatorBase(type, inputs, outputs, attrs) {}
+  void Run(const framework::Scope &scope,
+           const platform::DeviceContext &dev_ctx) const override {
+    auto fetch_var_name = Input("Input");
+    auto *fetch_var = scope.FindVar(fetch_var_name);
+    PADDLE_ENFORCE(fetch_var != nullptr,
+                   "Cannot find fetch variable in scope, fetch_var_name is %s",
+                   fetch_var_name);
+    auto out_name = this->Output("Out");
+    auto *out_var = scope.FindVar(out_name);
+    PADDLE_ENFORCE(out_var != nullptr,
+                   "Cannot find out_var in scope, out_var_name is %s",
+                   out_name);
+    auto col = static_cast<size_t>(Attr<int>("col"));
+    auto *fetch_list = out_var->GetMutable<framework::FeedFetchList>();
+    auto &src_item = fetch_var->Get<framework::FeedFetchType>();
+    if (col >= fetch_list->size()) {
+      fetch_list->resize(col + 1);
+    }
+    auto &dst_item = fetch_list->at(col);
+    // FIXME(yuyang18): Should we assume the fetch operator always generates
+    // CPU outputs?
+    dst_item.CopyFromTensor(src_item, platform::CPUPlace(), dev_ctx);
  }
 };
 }  // namespace operators
 }  // namespace paddle
-namespace ops = paddle::operators;
-REGISTER_OP_WITHOUT_GRADIENT(fetch, ops::FetchOp, ops::FetchOpMaker);
-REGISTER_OP_CPU_KERNEL(fetch, ops::FetchKernel<float>);
+// We do not need to register OpInfoMaker,
+// since fetch operator will not be used by end users directly
+REGISTER_OPERATOR(fetch, paddle::operators::FetchOp,
+                  paddle::framework::EmptyGradOpMaker);
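
Review note: the new `FetchOp::Run` writes the fetched value into column `col` of a shared list and grows the list when `col` is past the end. The following is a minimal standalone sketch of that grow-on-demand pattern, not Paddle code. `FetchList` and `StoreAt` are hypothetical names, `std::vector<std::string>` stands in for `FeedFetchList`, and the non-negative check on the column is an addition of this sketch (the diff casts `col` to `size_t` without checking it).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for framework::FeedFetchList.
using FetchList = std::vector<std::string>;

// Store `item` in column `col_attr`, resizing the list first if needed.
// This mirrors the resize-then-at() sequence in FetchOp::Run.
void StoreAt(FetchList* list, int col_attr, const std::string& item) {
  assert(list != nullptr);
  // Assumption of this sketch: reject negative columns, since a negative
  // int would become a huge index after the cast below.
  assert(col_attr >= 0);
  auto col = static_cast<std::size_t>(col_attr);
  if (col >= list->size()) {
    list->resize(col + 1);  // new slots are default-constructed
  }
  list->at(col) = item;
}
```

Columns can be written in any order under this scheme; slots that have not been written yet hold default-constructed values.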
--- a/paddle/operators/fetch_op.cu
+++ b/paddle/operators/fetch_op.cu
-/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-http://www.apache.org/licenses/LICENSE-2.0
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License. */
-#include "paddle/operators/fetch_op.h"
-namespace ops = paddle::operators;
-REGISTER_OP_GPU_KERNEL(fetch, ops::FetchKernel<float>);
--- a/paddle/operators/fill_constant_op.cc
+++ b/paddle/operators/fill_constant_op.cc
@@ -21,7 +21,6 @@ class FillConstantOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasOutput("Out"),
                   "Output(Out) of FillConstantOp should not be null.");
@@ -33,9 +32,10 @@ class FillConstantOp : public framework::OperatorWithKernel {
    ctx->SetOutputDim("Out", dims);
  }
+ protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext &ctx) const override {
-    return static_cast<framework::DataType>(ctx.Attr<int>("dataType"));
+    return static_cast<framework::DataType>(ctx.Attr<int>("data_type"));
  }
 };
@@ -44,7 +44,7 @@ class FillConstantOpMaker : public framework::OpProtoAndCheckerMaker {
  FillConstantOpMaker(framework::OpProto *proto,
                      framework::OpAttrChecker *op_checker)
      : framework::OpProtoAndCheckerMaker(proto, op_checker) {
-    AddAttr<int>("dataType",
+    AddAttr<int>("data_type",
                 "(int, default 5 (FP32)) "
                 "Output data type")
        .SetDefault(framework::DataType::FP32);

--- a/paddle/operators/fill_zeros_like_op.cc
+++ b/paddle/operators/fill_zeros_like_op.cc
@@ -21,7 +21,6 @@ class FillZerosLikeOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
                   "Input(X) of FillZerosLikeOp should not be null.");

--- a/paddle/operators/gather_op.cc
+++ b/paddle/operators/gather_op.cc
@@ -22,7 +22,6 @@ class GatherOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
                   "Input(X) of GatherOp should not be null.");
@@ -40,6 +39,7 @@ class GatherOp : public framework::OperatorWithKernel {
    ctx->SetOutputDim("Out", output_dims);
  }
+ protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
    return framework::ToDataType(ctx.Input<Tensor>("X")->type());
@@ -50,11 +50,11 @@ class GatherGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("X"));
  }
+ protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
    return framework::ToDataType(ctx.Input<Tensor>("X")->type());

--- a/paddle/operators/gaussian_random_op.cc
+++ b/paddle/operators/gaussian_random_op.cc
@@ -42,7 +42,6 @@ class GaussianRandomOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasOutput("Out"),
                   "Output(Out) of GaussianRandomOp should not be null.");
@@ -57,6 +56,7 @@ class GaussianRandomOp : public framework::OperatorWithKernel {
    ctx->SetOutputDim("Out", framework::make_ddim(temp));
  }
+ protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
    return static_cast<framework::DataType>(Attr<int>("data_type"));

--- a/paddle/operators/gru_unit_op.cc
+++ b/paddle/operators/gru_unit_op.cc
@@ -23,7 +23,6 @@ class GRUUnitOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Input"),
                   "Input(%s) of GRUUnitOp should not be null.", "Input");
@@ -131,7 +130,6 @@ class GRUUnitGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Input"),
                   "Input(%s) of GRUUnitGradOp should not be null.", "Input");

--- a/paddle/operators/lookup_table_op.cc
+++ b/paddle/operators/lookup_table_op.cc
@@ -21,7 +21,6 @@ class LookupTableOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("W"),
                   "Input(W) of LookupTableOp should not be null.");
@@ -37,6 +36,7 @@ class LookupTableOp : public framework::OperatorWithKernel {
    ctx->ShareLoD("Ids", /*->*/ "Out");
  }
+ protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
    return framework::ToDataType(ctx.Input<Tensor>("W")->type());
@@ -69,12 +69,12 @@ class LookupTableOpGrad : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    auto table_dims = ctx->GetInputDim("W");
    ctx->SetOutputDim(framework::GradVarName("W"), table_dims);
  }
+ protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
    return framework::ToDataType(ctx.Input<Tensor>("W")->type());

--- a/paddle/operators/lstm_unit_op.cc
+++ b/paddle/operators/lstm_unit_op.cc
@@ -21,7 +21,6 @@ class LstmUnitOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) of LSTM should not be null.");
    PADDLE_ENFORCE(ctx->HasInput("C_prev"),
@@ -76,7 +75,6 @@ class LstmUnitGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("C")),
                   "Input(C@GRAD) should not be null");

--- a/paddle/operators/margin_rank_loss_op.cc
+++ b/paddle/operators/margin_rank_loss_op.cc
@@ -21,7 +21,6 @@ class MarginRankLossOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    // input check
    PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) shouldn't be null.");
@@ -94,7 +93,6 @@ class MarginRankLossGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) shouldn't be null.");
    PADDLE_ENFORCE(ctx->HasInput("X1"), "Input(X1) shouldn't be null.");

--- a/paddle/operators/math/CMakeLists.txt
+++ b/paddle/operators/math/CMakeLists.txt
 if(WITH_GPU)
    nv_library(math_function SRCS math_function.cc math_function.cu im2col.cc im2col.cu DEPS cblas device_context operator)
-    nv_test(math_function_test SRCS math_function_test.cc DEPS math_function tensor)
+    nv_test(math_function_gpu_test SRCS math_function_test.cu DEPS math_function tensor)
+    nv_library(selected_rows_functor SRCS selected_rows_functor.cc selected_rows_functor.cu DEPS selected_rows math_function)
+    nv_test(selected_rows_functor_gpu_test SRCS selected_rows_functor_test.cu DEPS selected_rows_functor)
    nv_library(softmax SRCS softmax.cc softmax.cu DEPS operator)
    nv_library(cross_entropy SRCS cross_entropy.cc cross_entropy.cu DEPS operator)
    nv_library(pooling SRCS pooling.cc pooling.cu DEPS device_context)
    nv_library(vol2col SRCS vol2col.cc vol2col.cu DEPS device_context)
 else()
    cc_library(math_function SRCS math_function.cc im2col.cc DEPS cblas device_context operator)
-    cc_test(math_function_test SRCS math_function_test.cc DEPS math_function tensor)
+    cc_library(selected_rows_functor SRCS selected_rows_functor.cc DEPS selected_rows math_function)
    cc_library(softmax SRCS softmax.cc DEPS operator)
    cc_library(cross_entropy SRCS cross_entropy.cc DEPS operator)
    cc_library(pooling SRCS pooling.cc DEPS device_context)
    cc_library(vol2col SRCS vol2col.cc DEPS device_context)
 endif()
+cc_test(math_function_test SRCS math_function_test.cc DEPS math_function tensor)
+cc_test(selected_rows_functor_test SRCS selected_rows_functor_test.cc DEPS selected_rows_functor)
 cc_test(im2col_test SRCS im2col_test.cc DEPS math_function tensor)
 cc_test(vol2col_test SRCS vol2col_test.cc DEPS vol2col tensor)
--- a/paddle/operators/math/math_function.cc
+++ b/paddle/operators/math/math_function.cc
@@ -130,6 +130,8 @@ void matmul<platform::CPUPlace, double>(
      matrix_b.data<double>(), beta, matrix_out->data<double>());
 }
+template struct SetConstant<platform::CPUPlace, float>;
 }  // namespace math
 }  // namespace operators
 }  // namespace paddle
--- a/paddle/operators/math/math_function.cu
+++ b/paddle/operators/math/math_function.cu
@@ -155,6 +155,8 @@ void matmul<platform::GPUPlace, double>(
      matrix_b.data<double>(), beta, matrix_out->data<double>());
 }
+template struct SetConstant<platform::GPUPlace, float>;
 }  // namespace math
 }  // namespace operators
 }  // namespace paddle
--- a/paddle/operators/math/math_function.h
+++ b/paddle/operators/math/math_function.h
@@ -86,11 +86,14 @@ void matmul(const platform::DeviceContext& context,
            framework::Tensor* matrix_out, T beta);
 template <typename Place, typename T>
-void SetConstant(const platform::DeviceContext& context,
-                 framework::Tensor* tensor, T num) {
-  auto t = framework::EigenVector<T>::Flatten(*tensor);
-  t.device(*context.GetEigenDevice<Place>()) = t.constant(static_cast<T>(num));
-}
+struct SetConstant {
+  void operator()(const platform::DeviceContext& context,
+                  framework::Tensor* tensor, T num) {
+    auto t = framework::EigenVector<T>::Flatten(*tensor);
+    t.device(*context.GetEigenDevice<Place>()) =
+        t.constant(static_cast<T>(num));
+  }
+};
 }  // namespace math
 }  // namespace operators
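
Review note: this hunk turns `SetConstant` from a function template into a class template with `operator()`. The diff does not state the motivation. One property of the functor form, which the `SelectedRowsAdd` functors elsewhere in this change use, is that a class template can be partially specialized per device while a function template cannot. A minimal sketch of that pattern, assuming a hypothetical device tag `CpuTag` and a `std::vector` in place of `framework::Tensor` (`FillSketch` is not a Paddle name):

```cpp
#include <cassert>
#include <vector>

// Hypothetical device tag standing in for platform::CPUPlace.
struct CpuTag {};

// Primary template: declared only. Each device supplies a specialization.
template <typename Place, typename T>
struct FillSketch;

// Partial specialization: fixed device, generic element type.
template <typename T>
struct FillSketch<CpuTag, T> {
  void operator()(std::vector<T>* data, T value) const {
    assert(data != nullptr);
    for (auto& x : *data) x = value;
  }
};

// Explicit instantiation, analogous to
// `template struct SetConstant<platform::CPUPlace, float>;` in the diff.
template struct FillSketch<CpuTag, float>;
```

Call sites then construct the functor once and invoke it, which is the shape the updated `math_function_test.cc` uses.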

--- a/paddle/operators/math/math_function_test.cc
+++ b/paddle/operators/math/math_function_test.cc
 #include "paddle/operators/math/math_function.h"
 #include "gtest/gtest.h"
-#ifdef PADDLE_WITH_CUDA
-TEST(math_function, notrans_mul_trans) {
-  paddle::framework::Tensor input1;
-  paddle::framework::Tensor input1_gpu;
-  paddle::framework::Tensor input2_gpu;
-  paddle::framework::Tensor out_gpu;
-  paddle::framework::Tensor out;
-  auto* cpu_place = new paddle::platform::CPUPlace();
-  float* input1_ptr = input1.mutable_data<float>({2, 3}, *cpu_place);
-  float arr[6] = {0, 1, 2, 3, 4, 5};
-  memcpy(input1_ptr, arr, 6 * sizeof(float));
-  auto* gpu_place = new paddle::platform::GPUPlace(0);
-  paddle::platform::CUDADeviceContext context(*gpu_place);
-  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
-  input2_gpu.CopyFrom<float>(input1, *gpu_place, context);
-  out_gpu.mutable_data<float>({2, 2}, *gpu_place);
-  paddle::operators::math::matmul<paddle::platform::GPUPlace, float>(
-      context, input1_gpu, false, input2_gpu, true, 1, &out_gpu, 0);
-  out.CopyFrom<float>(out_gpu, *cpu_place, context);
-  float* out_ptr = out.data<float>();
-  context.Wait();
-  EXPECT_EQ(out_ptr[0], 5);
-  EXPECT_EQ(out_ptr[1], 14);
-  EXPECT_EQ(out_ptr[2], 14);
-  EXPECT_EQ(out_ptr[3], 50);
-  delete gpu_place;
-}
-TEST(math_function, trans_mul_notrans) {
-  paddle::framework::Tensor input1;
-  paddle::framework::Tensor input1_gpu;
-  paddle::framework::Tensor input2_gpu;
-  paddle::framework::Tensor out_gpu;
-  paddle::framework::Tensor out;
-  auto* cpu_place = new paddle::platform::CPUPlace();
-  float* input1_ptr = input1.mutable_data<float>({2, 3}, *cpu_place);
-  float arr[6] = {0, 1, 2, 3, 4, 5};
-  memcpy(input1_ptr, arr, 6 * sizeof(float));
-  auto* gpu_place = new paddle::platform::GPUPlace(0);
-  paddle::platform::CUDADeviceContext context(*gpu_place);
-  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
-  input2_gpu.CopyFrom<float>(input1, *gpu_place, context);
-  out_gpu.mutable_data<float>({3, 3}, *gpu_place);
-  paddle::operators::math::matmul<paddle::platform::GPUPlace, float>(
-      context, input1_gpu, true, input2_gpu, false, 1, &out_gpu, 0);
-  out.CopyFrom<float>(out_gpu, *cpu_place, context);
-  float* out_ptr = out.data<float>();
-  context.Wait();
-  EXPECT_EQ(out_ptr[0], 9);
-  EXPECT_EQ(out_ptr[1], 12);
-  EXPECT_EQ(out_ptr[2], 15);
-  EXPECT_EQ(out_ptr[3], 12);
-  EXPECT_EQ(out_ptr[4], 17);
-  EXPECT_EQ(out_ptr[5], 22);
-  EXPECT_EQ(out_ptr[6], 15);
-  EXPECT_EQ(out_ptr[7], 22);
-  EXPECT_EQ(out_ptr[8], 29);
-  delete gpu_place;
-}
-TEST(math_function, gemm_notrans_cublas) {
-  paddle::framework::Tensor input1;
-  paddle::framework::Tensor input2;
-  paddle::framework::Tensor input3;
-  paddle::framework::Tensor input1_gpu;
-  paddle::framework::Tensor input2_gpu;
-  paddle::framework::Tensor input3_gpu;
-  int m = 2;
-  int n = 3;
-  int k = 3;
-  auto* cpu_place = new paddle::platform::CPUPlace();
-  float* input1_ptr = input1.mutable_data<float>({2, 3}, *cpu_place);
-  float arr1[6] = {0, 1, 2, 3, 4, 5};
-  memcpy(input1_ptr, arr1, 6 * sizeof(float));
-  float* input2_ptr = input2.mutable_data<float>({3, 4}, *cpu_place);
-  float arr2[12] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
-  memcpy(input2_ptr, arr2, 12 * sizeof(float));
-  float* input3_ptr = input3.mutable_data<float>({2, 4}, *cpu_place);
-  float arr3[8] = {0, 1, 2, 3, 4, 5, 6, 7};
-  memcpy(input3_ptr, arr3, 8 * sizeof(float));
-  auto* gpu_place = new paddle::platform::GPUPlace(0);
-  paddle::platform::CUDADeviceContext context(*gpu_place);
-  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
-  input2_gpu.CopyFrom<float>(input2, *gpu_place, context);
-  input3_gpu.CopyFrom<float>(input3, *gpu_place, context);
-  float* a = input1_gpu.data<float>();
-  float* b = input2_gpu.data<float>();
-  float* c = input3_gpu.mutable_data<float>(*gpu_place);
-  paddle::operators::math::gemm<paddle::platform::GPUPlace, float>(
-      context, false, false, m, n, k, 1, a, 3, b + 1, 4, 1, c + 1, 4);
-  input3.CopyFrom<float>(input3_gpu, *cpu_place, context);
-  // numpy code:
-  // a = np.arange(6).reshape(2, 3)
-  // b = np.arange(12).reshape(3, 4)[:, 1:]
-  // c = np.arange(8).reshape(2, 4)[:, 1:]
-  // out = np.arange(8).reshape(2, 4)
-  // out[:, 1:] = np.dot(a, b) + c
-  context.Wait();
-  EXPECT_EQ(input3_ptr[0], 0);
-  EXPECT_EQ(input3_ptr[1], 24);
-  EXPECT_EQ(input3_ptr[2], 28);
-  EXPECT_EQ(input3_ptr[3], 32);
-  EXPECT_EQ(input3_ptr[4], 4);
-  EXPECT_EQ(input3_ptr[5], 73);
-  EXPECT_EQ(input3_ptr[6], 86);
-  EXPECT_EQ(input3_ptr[7], 99);
-  delete gpu_place;
-}
-TEST(math_function, gemm_trans_cublas) {
-  paddle::framework::Tensor input1;
-  paddle::framework::Tensor input2;
-  paddle::framework::Tensor input3;
-  paddle::framework::Tensor input1_gpu;
-  paddle::framework::Tensor input2_gpu;
-  paddle::framework::Tensor input3_gpu;
-  int m = 2;
-  int n = 3;
-  int k = 3;
-  auto* cpu_place = new paddle::platform::CPUPlace();
-  float* input1_ptr = input1.mutable_data<float>({2, 3}, *cpu_place);
-  float arr1[6] = {0, 1, 2, 3, 4, 5};
-  memcpy(input1_ptr, arr1, 6 * sizeof(float));
-  float* input2_ptr = input2.mutable_data<float>({4, 3}, *cpu_place);
-  float arr2[12] = {0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11};
-  memcpy(input2_ptr, arr2, 12 * sizeof(float));
-  float* input3_ptr = input3.mutable_data<float>({2, 4}, *cpu_place);
-  float arr3[8] = {0, 1, 2, 3, 4, 5, 6, 7};
-  memcpy(input3_ptr, arr3, 8 * sizeof(float));
-  auto* gpu_place = new paddle::platform::GPUPlace(0);
-  paddle::platform::CUDADeviceContext context(*gpu_place);
-  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
-  input2_gpu.CopyFrom<float>(input2, *gpu_place, context);
-  input3_gpu.CopyFrom<float>(input3, *gpu_place, context);
-  float* a = input1_gpu.data<float>();
-  float* b = input2_gpu.data<float>();
-  float* c = input3_gpu.mutable_data<float>(*gpu_place);
-  paddle::operators::math::gemm<paddle::platform::GPUPlace, float>(
-      context, false, true, m, n, k, 1, a, 3, b + 3, 3, 1, c + 1, 4);
-  input3.CopyFrom<float>(input3_gpu, *cpu_place, context);
-  context.Wait();
-  EXPECT_EQ(input3_ptr[0], 0);
-  EXPECT_EQ(input3_ptr[1], 24);
-  EXPECT_EQ(input3_ptr[2], 28);
-  EXPECT_EQ(input3_ptr[3], 32);
-  EXPECT_EQ(input3_ptr[4], 4);
-  EXPECT_EQ(input3_ptr[5], 73);
-  EXPECT_EQ(input3_ptr[6], 86);
-  EXPECT_EQ(input3_ptr[7], 99);
-  delete gpu_place;
-}
-#endif
 TEST(math_function, gemm_notrans_cblas) {
  paddle::framework::Tensor input1;
  paddle::framework::Tensor input2;
@@ -253,15 +74,15 @@ TEST(math_function, zero) {
  auto* cpu_place = new paddle::platform::CPUPlace();
  float* t = tensor.mutable_data<float>({2, 2}, *cpu_place);
  paddle::platform::CPUDeviceContext context(*cpu_place);
-  paddle::operators::math::SetConstant<paddle::platform::CPUPlace, float>(
-      context, &tensor, 0);
+  paddle::operators::math::SetConstant<paddle::platform::CPUPlace, float>
+      functor;
+  functor(context, &tensor, 0);
  EXPECT_EQ(t[0], 0);
  EXPECT_EQ(t[1], 0);
  EXPECT_EQ(t[2], 0);
  EXPECT_EQ(t[3], 0);
-  paddle::operators::math::SetConstant<paddle::platform::CPUPlace, float>(
-      context, &tensor, 1);
+  functor(context, &tensor, 1);
  EXPECT_EQ(t[0], 1);
  EXPECT_EQ(t[1], 1);

--- a/paddle/operators/math/math_function_test.cu
+++ b/paddle/operators/math/math_function_test.cu
+#include "gtest/gtest.h"
+#include "paddle/operators/math/math_function.h"
+TEST(math_function, notrans_mul_trans) {
+  paddle::framework::Tensor input1;
+  paddle::framework::Tensor input1_gpu;
+  paddle::framework::Tensor input2_gpu;
+  paddle::framework::Tensor out_gpu;
+  paddle::framework::Tensor out;
+  auto* cpu_place = new paddle::platform::CPUPlace();
+  float* input1_ptr = input1.mutable_data<float>({2, 3}, *cpu_place);
+  float arr[6] = {0, 1, 2, 3, 4, 5};
+  memcpy(input1_ptr, arr, 6 * sizeof(float));
+  auto* gpu_place = new paddle::platform::GPUPlace(0);
+  paddle::platform::CUDADeviceContext context(*gpu_place);
+  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
+  input2_gpu.CopyFrom<float>(input1, *gpu_place, context);
+  out_gpu.mutable_data<float>({2, 2}, *gpu_place);
+  paddle::operators::math::matmul<paddle::platform::GPUPlace, float>(
+      context, input1_gpu, false, input2_gpu, true, 1, &out_gpu, 0);
+  out.CopyFrom<float>(out_gpu, *cpu_place, context);
+  float* out_ptr = out.data<float>();
+  context.Wait();
+  EXPECT_EQ(out_ptr[0], 5);
+  EXPECT_EQ(out_ptr[1], 14);
+  EXPECT_EQ(out_ptr[2], 14);
+  EXPECT_EQ(out_ptr[3], 50);
+  delete gpu_place;
+}
+TEST(math_function, trans_mul_notrans) {
+  paddle::framework::Tensor input1;
+  paddle::framework::Tensor input1_gpu;
+  paddle::framework::Tensor input2_gpu;
+  paddle::framework::Tensor out_gpu;
+  paddle::framework::Tensor out;
+  auto* cpu_place = new paddle::platform::CPUPlace();
+  float* input1_ptr = input1.mutable_data<float>({2, 3}, *cpu_place);
+  float arr[6] = {0, 1, 2, 3, 4, 5};
+  memcpy(input1_ptr, arr, 6 * sizeof(float));
+  auto* gpu_place = new paddle::platform::GPUPlace(0);
+  paddle::platform::CUDADeviceContext context(*gpu_place);
+  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
+  input2_gpu.CopyFrom<float>(input1, *gpu_place, context);
+  out_gpu.mutable_data<float>({3, 3}, *gpu_place);
+  paddle::operators::math::matmul<paddle::platform::GPUPlace, float>(
+      context, input1_gpu, true, input2_gpu, false, 1, &out_gpu, 0);
+  out.CopyFrom<float>(out_gpu, *cpu_place, context);
+  float* out_ptr = out.data<float>();
+  context.Wait();
+  EXPECT_EQ(out_ptr[0], 9);
+  EXPECT_EQ(out_ptr[1], 12);
+  EXPECT_EQ(out_ptr[2], 15);
+  EXPECT_EQ(out_ptr[3], 12);
+  EXPECT_EQ(out_ptr[4], 17);
+  EXPECT_EQ(out_ptr[5], 22);
+  EXPECT_EQ(out_ptr[6], 15);
+  EXPECT_EQ(out_ptr[7], 22);
+  EXPECT_EQ(out_ptr[8], 29);
+  delete gpu_place;
+}
+TEST(math_function, gemm_notrans_cublas) {
+  paddle::framework::Tensor input1;
+  paddle::framework::Tensor input2;
+  paddle::framework::Tensor input3;
+  paddle::framework::Tensor input1_gpu;
+  paddle::framework::Tensor input2_gpu;
+  paddle::framework::Tensor input3_gpu;
+  int m = 2;
+  int n = 3;
+  int k = 3;
+  auto* cpu_place = new paddle::platform::CPUPlace();
+  float* input1_ptr = input1.mutable_data<float>({2, 3}, *cpu_place);
+  float arr1[6] = {0, 1, 2, 3, 4, 5};
+  memcpy(input1_ptr, arr1, 6 * sizeof(float));
+  float* input2_ptr = input2.mutable_data<float>({3, 4}, *cpu_place);
+  float arr2[12] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
+  memcpy(input2_ptr, arr2, 12 * sizeof(float));
+  float* input3_ptr = input3.mutable_data<float>({2, 4}, *cpu_place);
+  float arr3[8] = {0, 1, 2, 3, 4, 5, 6, 7};
+  memcpy(input3_ptr, arr3, 8 * sizeof(float));
+  auto* gpu_place = new paddle::platform::GPUPlace(0);
+  paddle::platform::CUDADeviceContext context(*gpu_place);
+  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
+  input2_gpu.CopyFrom<float>(input2, *gpu_place, context);
+  input3_gpu.CopyFrom<float>(input3, *gpu_place, context);
+  float* a = input1_gpu.data<float>();
+  float* b = input2_gpu.data<float>();
+  float* c = input3_gpu.mutable_data<float>(*gpu_place);
+  paddle::operators::math::gemm<paddle::platform::GPUPlace, float>(
+      context, false, false, m, n, k, 1, a, 3, b + 1, 4, 1, c + 1, 4);
+  input3.CopyFrom<float>(input3_gpu, *cpu_place, context);
+  // numpy code:
+  // a = np.arange(6).reshape(2, 3)
+  // b = np.arange(12).reshape(3, 4)[:, 1:]
+  // c = np.arange(8).reshape(2, 4)[:, 1:]
+  // out = np.arange(8).reshape(2, 4)
+  // out[:, 1:] = np.dot(a, b) + c
+  context.Wait();
+  EXPECT_EQ(input3_ptr[0], 0);
+  EXPECT_EQ(input3_ptr[1], 24);
+  EXPECT_EQ(input3_ptr[2], 28);
+  EXPECT_EQ(input3_ptr[3], 32);
+  EXPECT_EQ(input3_ptr[4], 4);
+  EXPECT_EQ(input3_ptr[5], 73);
+  EXPECT_EQ(input3_ptr[6], 86);
+  EXPECT_EQ(input3_ptr[7], 99);
+  delete gpu_place;
+}
+TEST(math_function, gemm_trans_cublas) {
+  paddle::framework::Tensor input1;
+  paddle::framework::Tensor input2;
+  paddle::framework::Tensor input3;
+  paddle::framework::Tensor input1_gpu;
+  paddle::framework::Tensor input2_gpu;
+  paddle::framework::Tensor input3_gpu;
+  int m = 2;
+  int n = 3;
+  int k = 3;
+  auto* cpu_place = new paddle::platform::CPUPlace();
+  float* input1_ptr = input1.mutable_data<float>({2, 3}, *cpu_place);
+  float arr1[6] = {0, 1, 2, 3, 4, 5};
+  memcpy(input1_ptr, arr1, 6 * sizeof(float));
+  float* input2_ptr = input2.mutable_data<float>({4, 3}, *cpu_place);
+  float arr2[12] = {0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11};
+  memcpy(input2_ptr, arr2, 12 * sizeof(float));
+  float* input3_ptr = input3.mutable_data<float>({2, 4}, *cpu_place);
+  float arr3[8] = {0, 1, 2, 3, 4, 5, 6, 7};
+  memcpy(input3_ptr, arr3, 8 * sizeof(float));
+  auto* gpu_place = new paddle::platform::GPUPlace(0);
+  paddle::platform::CUDADeviceContext context(*gpu_place);
+  input1_gpu.CopyFrom<float>(input1, *gpu_place, context);
+  input2_gpu.CopyFrom<float>(input2, *gpu_place, context);
+  input3_gpu.CopyFrom<float>(input3, *gpu_place, context);
+  float* a = input1_gpu.data<float>();
+  float* b = input2_gpu.data<float>();
+  float* c = input3_gpu.mutable_data<float>(*gpu_place);
+  paddle::operators::math::gemm<paddle::platform::GPUPlace, float>(
+      context, false, true, m, n, k, 1, a, 3, b + 3, 3, 1, c + 1, 4);
+  input3.CopyFrom<float>(input3_gpu, *cpu_place, context);
+  context.Wait();
+  EXPECT_EQ(input3_ptr[0], 0);
+  EXPECT_EQ(input3_ptr[1], 24);
+  EXPECT_EQ(input3_ptr[2], 28);
+  EXPECT_EQ(input3_ptr[3], 32);
+  EXPECT_EQ(input3_ptr[4], 4);
+  EXPECT_EQ(input3_ptr[5], 73);
+  EXPECT_EQ(input3_ptr[6], 86);
+  EXPECT_EQ(input3_ptr[7], 99);
+  delete gpu_place;
+}
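
Review note: the expected values in `gemm_notrans_cublas` follow from the numpy comment in that test (`out[:, 1:] = np.dot(a, b) + c`), where `b + 1` and `c + 1` with leading dimension 4 select columns 1..3 of the row-major buffers. The sketch below is a plain CPU reference loop written for this note, not the Paddle `gemm`, and `RefGemm` is a hypothetical name. It assumes row-major storage, no transposes, and the same argument order as the call in the test.

```cpp
#include <cassert>

// Reference GEMM: C = alpha * A * B + beta * C.
// A is m x k with row stride lda, B is k x n with row stride ldb,
// C is m x n with row stride ldc. All buffers are row-major.
void RefGemm(int m, int n, int k, float alpha, const float* a, int lda,
             const float* b, int ldb, float beta, float* c, int ldc) {
  assert(a != nullptr && b != nullptr && c != nullptr);
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (int p = 0; p < k; ++p) {
        acc += a[i * lda + p] * b[p * ldb + j];
      }
      c[i * ldc + j] = alpha * acc + beta * c[i * ldc + j];
    }
  }
}
```

With the test's inputs, column 0 of the output buffer is never written, which is why the test expects `input3_ptr[0] == 0` and `input3_ptr[4] == 4` to keep their initial values.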
--- a/paddle/operators/math/selected_rows_functor.cc
+++ b/paddle/operators/math/selected_rows_functor.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+#include "paddle/operators/math/selected_rows_functor.h"
+#include "paddle/operators/math/math_function.h"
+namespace paddle {
+namespace operators {
+namespace math {
+template <typename T>
+struct SelectedRowsAdd<platform::CPUPlace, T> {
+  void operator()(const platform::DeviceContext& context,
+                  const framework::SelectedRows& input1,
+                  const framework::SelectedRows& input2,
+                  framework::SelectedRows* output) {
+    auto in1_height = input1.height();
+    PADDLE_ENFORCE_EQ(in1_height, input2.height());
+    output->set_height(in1_height);
+    auto& in1_rows = input1.rows();
+    auto& in2_rows = input2.rows();
+    std::vector<int64_t> out_rows;
+    out_rows.reserve(in1_rows.size() + in2_rows.size());
+    // concat rows
+    out_rows.insert(out_rows.end(), in1_rows.begin(), in1_rows.end());
+    out_rows.insert(out_rows.end(), in2_rows.begin(), in2_rows.end());
+    output->set_rows(out_rows);
+    auto* out_value = output->mutable_value();
+    auto& in1_value = input1.value();
+    auto& in2_value = input2.value();
+    auto in1_row_numel = in1_value.numel() / in1_rows.size();
+    PADDLE_ENFORCE_EQ(in1_row_numel, in2_value.numel() / in2_rows.size());
+    PADDLE_ENFORCE_EQ(in1_row_numel, out_value->numel() / out_rows.size());
+    auto in1_place = input1.place();
+    PADDLE_ENFORCE(platform::is_cpu_place(in1_place));
+    auto in2_place = input2.place();
+    PADDLE_ENFORCE(platform::is_cpu_place(in2_place));
+    auto out_place = context.GetPlace();
+    PADDLE_ENFORCE(platform::is_cpu_place(out_place));
+    auto* out_data = out_value->data<T>();
+    auto* in1_data = in1_value.data<T>();
+    memory::Copy(boost::get<platform::CPUPlace>(out_place), out_data,
+                 boost::get<platform::CPUPlace>(in1_place), in1_data,
+                 in1_value.numel() * sizeof(T));
+    auto* in2_data = in2_value.data<T>();
+    memory::Copy(boost::get<platform::CPUPlace>(out_place),
+                 out_data + in1_value.numel(),
+                 boost::get<platform::CPUPlace>(in2_place), in2_data,
+                 in2_value.numel() * sizeof(T));
+  }
+};
+template struct SelectedRowsAdd<platform::CPUPlace, float>;
+template <typename T>
+struct SelectedRowsAddTensor<platform::CPUPlace, T> {
+  void operator()(const platform::DeviceContext& context,
+                  const framework::SelectedRows& input1,
+                  const framework::Tensor& input2, framework::Tensor* output) {
+    auto in1_height = input1.height();
+    auto in2_dims = input2.dims();
+    auto out_dims = output->dims();
+    PADDLE_ENFORCE_EQ(in1_height, in2_dims[0]);
+    PADDLE_ENFORCE_EQ(in1_height, out_dims[0]);
+    auto& in1_value = input1.value();
+    auto& in1_rows = input1.rows();
+    int64_t in1_row_numel = in1_value.numel() / in1_rows.size();
+    PADDLE_ENFORCE_EQ(in1_row_numel, input2.numel() / in1_height);
+    PADDLE_ENFORCE_EQ(in1_row_numel, output->numel() / in1_height);
+    SetConstant<platform::CPUPlace, T> functor;
+    functor(context, output, 0.0);
+    auto* in1_data = in1_value.data<T>();
+    auto* out_data = output->data<T>();
+    for (size_t i = 0; i < in1_rows.size(); i++) {
+      for (int64_t j = 0; j < in1_row_numel; j++) {
+        out_data[in1_rows[i] * in1_row_numel + j] +=
+            in1_data[i * in1_row_numel + j];
+      }
+    }
+    auto out_eigen = framework::EigenVector<T>::Flatten(*output);
+    auto in2_eigen = framework::EigenVector<T>::Flatten(input2);
+    out_eigen.device(*context.GetEigenDevice<platform::CPUPlace>()) =
+        out_eigen + in2_eigen;
+  }
+};
+template struct SelectedRowsAddTensor<platform::CPUPlace, float>;
+}  // namespace math
+}  // namespace operators
+}  // namespace paddle
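Reviewer note: the two CPU functors above can be summarised with a small standalone sketch. It is not Paddle code: `SparseRows`, `ConcatRows` and `AddToDense` are hypothetical names invented for illustration, and plain `std::vector<float>` stands in for `framework::Tensor`. It only aims to show that the SelectedRows + SelectedRows case concatenates without doing arithmetic, while the SelectedRows + Tensor case scatter-adds rows (including duplicated row indices) into a zeroed buffer and then adds the dense input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for framework::SelectedRows. rows[i] is the index, in
// the full (height x width) tensor, of the i-th row stored in `value`.
struct SparseRows {
  int64_t height;
  int64_t width;
  std::vector<int64_t> rows;
  std::vector<float> value;  // rows.size() * width elements, row-major
};

// Same idea as SelectedRowsAdd: concatenate rows and values. Duplicated row
// indices are kept as-is; nothing is summed here.
SparseRows ConcatRows(const SparseRows& a, const SparseRows& b) {
  assert(a.height == b.height && a.width == b.width);
  SparseRows out{a.height, a.width, a.rows, a.value};
  out.rows.insert(out.rows.end(), b.rows.begin(), b.rows.end());
  out.value.insert(out.value.end(), b.value.begin(), b.value.end());
  return out;
}

// Same idea as SelectedRowsAddTensor: zero the output, scatter-add every
// stored row into its target row, then add the dense input element-wise.
std::vector<float> AddToDense(const SparseRows& s,
                              const std::vector<float>& dense) {
  assert(dense.size() == static_cast<std::size_t>(s.height * s.width));
  std::vector<float> out(dense.size(), 0.0f);
  for (std::size_t i = 0; i < s.rows.size(); ++i) {
    for (int64_t j = 0; j < s.width; ++j) {
      out[s.rows[i] * s.width + j] += s.value[i * s.width + j];
    }
  }
  for (std::size_t k = 0; k < out.size(); ++k) {
    out[k] += dense[k];
  }
  return out;
}
```

With two inputs that both contain row 0, the concatenated result keeps both copies, and the dense result for row 0 is the sum of both stored rows plus the dense value. This matches the `1.0 + 2.0 + 3.0` expectation in the unit tests further down in this diff.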
--- a/paddle/operators/math/selected_rows_functor.cu
+++ b/paddle/operators/math/selected_rows_functor.cu
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+#include "paddle/operators/math/math_function.h"
+#include "paddle/operators/math/selected_rows_functor.h"
+#include "paddle/platform/cuda_helper.h"
+namespace paddle {
+namespace operators {
+namespace math {
+template <typename T>
+struct SelectedRowsAdd<platform::GPUPlace, T> {
+  void operator()(const platform::DeviceContext& context,
+                  const framework::SelectedRows& input1,
+                  const framework::SelectedRows& input2,
+                  framework::SelectedRows* output) {
+    auto in1_height = input1.height();
+    PADDLE_ENFORCE_EQ(in1_height, input2.height());
+    output->set_height(in1_height);
+    auto& in1_rows = input1.rows();
+    auto& in2_rows = input2.rows();
+    std::vector<int64_t> out_rows;
+    out_rows.reserve(in1_rows.size() + in2_rows.size());
+    // concat rows
+    out_rows.insert(out_rows.end(), in1_rows.begin(), in1_rows.end());
+    out_rows.insert(out_rows.end(), in2_rows.begin(), in2_rows.end());
+    output->set_rows(out_rows);
+    auto* out_value = output->mutable_value();
+    auto& in1_value = input1.value();
+    auto& in2_value = input2.value();
+    auto in1_row_numel = in1_value.numel() / in1_rows.size();
+    PADDLE_ENFORCE_EQ(in1_row_numel, in2_value.numel() / in2_rows.size());
+    PADDLE_ENFORCE_EQ(in1_row_numel, out_value->numel() / out_rows.size());
+    auto* out_data = out_value->data<T>();
+    auto* in1_data = in1_value.data<T>();
+    auto in1_place = input1.place();
+    PADDLE_ENFORCE(platform::is_gpu_place(in1_place));
+    auto in2_place = input2.place();
+    PADDLE_ENFORCE(platform::is_gpu_place(in2_place));
+    auto out_place = context.GetPlace();
+    PADDLE_ENFORCE(platform::is_gpu_place(out_place));
+    memory::Copy(
+        boost::get<platform::GPUPlace>(out_place), out_data,
+        boost::get<platform::GPUPlace>(in1_place), in1_data,
+        in1_value.numel() * sizeof(T),
+        reinterpret_cast<const platform::CUDADeviceContext&>(context).stream());
+    auto* in2_data = in2_value.data<T>();
+    memory::Copy(
+        boost::get<platform::GPUPlace>(out_place), out_data + in1_value.numel(),
+        boost::get<platform::GPUPlace>(in2_place), in2_data,
+        in2_value.numel() * sizeof(T),
+        reinterpret_cast<const platform::CUDADeviceContext&>(context).stream());
+  }
+};
+template struct SelectedRowsAdd<platform::GPUPlace, float>;
+namespace {
+template <typename T>
+__global__ void SelectedRowsAddTensorKernel(const T* selected_rows,
+                                            const int64_t* rows, T* tensor_out,
+                                            int64_t row_numel, int block_size) {
+  const int ty = blockIdx.y;
+  int tid = threadIdx.x;
+  selected_rows += ty * row_numel;
+  tensor_out += rows[ty] * row_numel;
+  for (int index = tid; index < row_numel; index += block_size) {
+    // Row indices in a SelectedRows may be duplicated, so a plain
+    // tensor_out[index] += selected_rows[index] could race between blocks.
+    // Use an atomic add to avoid concurrent-write errors.
+    paddle::platform::CudaAtomicAdd(tensor_out + index, selected_rows[index]);
+  }
+}
+}  // namespace
+template <typename T>
+struct SelectedRowsAddTensor<platform::GPUPlace, T> {
+  void operator()(const platform::DeviceContext& context,
+                  const framework::SelectedRows& input1,
+                  const framework::Tensor& input2, framework::Tensor* output) {
+    auto in1_height = input1.height();
+    auto in2_dims = input2.dims();
+    auto out_dims = output->dims();
+    PADDLE_ENFORCE_EQ(in1_height, in2_dims[0]);
+    PADDLE_ENFORCE_EQ(in1_height, out_dims[0]);
+    auto& in1_value = input1.value();
+    auto& in1_rows = input1.rows();
+    int64_t in1_row_numel = in1_value.numel() / in1_rows.size();
+    PADDLE_ENFORCE_EQ(in1_row_numel, input2.numel() / in1_height);
+    PADDLE_ENFORCE_EQ(in1_row_numel, output->numel() / in1_height);
+    auto* in1_data = in1_value.data<T>();
+    auto* in2_data = input2.data<T>();
+    auto* out_data = output->data<T>();
+    SetConstant<platform::GPUPlace, T> functor;
+    functor(context, output, 0.0);
+    int block_size = 256;
+    dim3 threads(block_size, 1);
+    dim3 grid(1, in1_rows.size());
+    SelectedRowsAddTensorKernel<
+        T><<<grid, threads, 0,
+             reinterpret_cast<const platform::CUDADeviceContext&>(context)
+                 .stream()>>>(in1_data, in1_rows.data(), out_data,
+                              in1_row_numel, block_size);
+    auto out_eigen = framework::EigenVector<T>::Flatten(*output);
+    auto in2_eigen = framework::EigenVector<T>::Flatten(input2);
+    out_eigen.device(*context.GetEigenDevice<platform::GPUPlace>()) =
+        out_eigen + in2_eigen;
+  }
+};
+template struct SelectedRowsAddTensor<platform::GPUPlace, float>;
+}  // namespace math
+}  // namespace operators
+}  // namespace paddle
--- a/paddle/operators/feed_op.h
+++ b/paddle/operators/math/selected_rows_functor.h
@@ -11,32 +11,31 @@ distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */
 #pragma once
-#include "paddle/framework/eigen.h"
+#include "paddle/framework/selected_rows.h"
-#include "paddle/framework/op_registry.h"
+#include "paddle/platform/device_context.h"
 namespace paddle {
 namespace operators {
+namespace math {
+// SelectedRows + SelectedRows simply concatenates the values and rows.
+// The real computation happens when dealing with LoDTensor.
+template <typename Place, typename T>
+struct SelectedRowsAdd {
+  void operator()(const platform::DeviceContext& context,
+                  const framework::SelectedRows& input1,
+                  const framework::SelectedRows& input2,
+                  framework::SelectedRows* output);
+};
-template <typename T>
+template <typename Place, typename T>
-class FeedKernel : public framework::OpKernel<T> {
+struct SelectedRowsAddTensor {
- public:
+  void operator()(const platform::DeviceContext& context,
-  void Compute(const framework::ExecutionContext& ctx) const override {
+                  const framework::SelectedRows& input1,
-    framework::Tensor* out = ctx.Output<framework::Tensor>("Out");
+                  const framework::Tensor& input2, framework::Tensor* output);
-    out->mutable_data<T>(ctx.GetPlace());
-    framework::Variable* g_feed_variable =
-        framework::GetGlobalScope().FindVar("feed_value");
-    const auto& tensors =
-        g_feed_variable->Get<std::vector<framework::Tensor>>();
-    int col = ctx.template Attr<int>("col");
-    PADDLE_ENFORCE_GT(tensors.size(), static_cast<size_t>(col));
-    // TODO(qijun):
-    //   check tensors[col].dims() with attribute,
-    //   except the first dimenson.
-    out->CopyFrom<T>(tensors[col], ctx.GetPlace(), ctx.device_context());
-  }
 };
+}  // namespace math
 }  // namespace operators
 }  // namespace paddle
--- a/paddle/operators/math/selected_rows_functor_test.cc
+++ b/paddle/operators/math/selected_rows_functor_test.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+#include "paddle/operators/math/selected_rows_functor.h"
+#include "gtest/gtest.h"
+#include "paddle/operators/math/math_function.h"
+TEST(selected_rows_functor, cpu_add) {
+  using namespace paddle::framework;
+  using namespace paddle::platform;
+  using namespace paddle::operators::math;
+  CPUPlace cpu_place;
+  CPUDeviceContext ctx(cpu_place);
+  SetConstant<CPUPlace, float> functor;
+  int64_t height = 10;
+  int64_t row_numel = 10;
+  std::vector<int64_t> rows1{0, 4, 7};
+  std::unique_ptr<SelectedRows> selected_rows1{new SelectedRows(rows1, height)};
+  auto* in1_value = selected_rows1->mutable_value();
+  in1_value->mutable_data<float>(
+      make_ddim({static_cast<int64_t>(rows1.size()), row_numel}), cpu_place);
+  functor(ctx, in1_value, 1.0);
+  std::vector<int64_t> rows2{0, 5, 7, 9};
+  std::unique_ptr<SelectedRows> selected_rows2{new SelectedRows(rows2, height)};
+  auto* in2_value = selected_rows2->mutable_value();
+  in2_value->mutable_data<float>(
+      make_ddim({static_cast<int64_t>(rows2.size()), row_numel}), cpu_place);
+  functor(ctx, in2_value, 2.0);
+  std::unique_ptr<SelectedRows> output{new SelectedRows()};
+  auto* out_value = output->mutable_value();
+  // simply concat two SelectedRows
+  out_value->mutable_data<float>(make_ddim({7, 10}), cpu_place);
+  SelectedRowsAdd<CPUPlace, float> add_functor;
+  add_functor(ctx, *selected_rows1, *selected_rows2, output.get());
+  auto out_height = output->height();
+  EXPECT_EQ(out_height, height);
+  auto& out_rows = output->rows();
+  // input1 rows
+  EXPECT_EQ(out_rows[0], 0);
+  EXPECT_EQ(out_rows[1], 4);
+  EXPECT_EQ(out_rows[2], 7);
+  // input2 rows
+  EXPECT_EQ(out_rows[3], 0);
+  EXPECT_EQ(out_rows[4], 5);
+  EXPECT_EQ(out_rows[5], 7);
+  EXPECT_EQ(out_rows[6], 9);
+  auto* out_data = output->value().data<float>();
+  // input1 value
+  EXPECT_EQ(out_data[0 * row_numel + 0], 1.0);
+  EXPECT_EQ(out_data[0 * row_numel + 8], 1.0);
+  EXPECT_EQ(out_data[1 * row_numel + 1], 1.0);
+  EXPECT_EQ(out_data[2 * row_numel + 6], 1.0);
+  // input2 value
+  EXPECT_EQ(out_data[3 * row_numel + 3], 2.0);
+  EXPECT_EQ(out_data[3 * row_numel + 8], 2.0);
+  EXPECT_EQ(out_data[4 * row_numel + 4], 2.0);
+  EXPECT_EQ(out_data[5 * row_numel + 7], 2.0);
+  EXPECT_EQ(out_data[6 * row_numel + 9], 2.0);
+  std::unique_ptr<Tensor> tensor1{new Tensor()};
+  tensor1->mutable_data<float>(make_ddim({height, row_numel}), cpu_place);
+  functor(ctx, tensor1.get(), 3.0);
+  std::unique_ptr<Tensor> tensor2{new Tensor()};
+  tensor2->mutable_data<float>(make_ddim({height, row_numel}), cpu_place);
+  SelectedRowsAddTensor<CPUPlace, float> add_tensor_functor;
+  add_tensor_functor(ctx, *output, *tensor1, tensor2.get());
+  auto* tensor2_data = tensor2->data<float>();
+  // row0: 1.0 + 2.0 + 3.0
+  EXPECT_EQ(tensor2_data[0 * row_numel + 0], 6.0);
+  // row1: 3.0
+  EXPECT_EQ(tensor2_data[1 * row_numel + 1], 3.0);
+  // row4: 1.0 + 3.0
+  EXPECT_EQ(tensor2_data[4 * row_numel + 6], 4.0);
+  // row5: 2.0 + 3.0
+  EXPECT_EQ(tensor2_data[5 * row_numel + 7], 5.0);
+  // row6: 3.0
+  EXPECT_EQ(tensor2_data[6 * row_numel + 1], 3.0);
+  // row7: 1.0 + 2.0 + 3.0
+  EXPECT_EQ(tensor2_data[7 * row_numel + 3], 6.0);
+  // row9: 2.0 + 3.0
+  EXPECT_EQ(tensor2_data[9 * row_numel + 6], 5.0);
+}
--- a/paddle/operators/math/selected_rows_functor_test.cu
+++ b/paddle/operators/math/selected_rows_functor_test.cu
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+#include "gtest/gtest.h"
+#include "paddle/operators/math/math_function.h"
+#include "paddle/operators/math/selected_rows_functor.h"
+TEST(selected_rows_functor, gpu_add) {
+  using namespace paddle::framework;
+  using namespace paddle::platform;
+  using namespace paddle::operators::math;
+  GPUPlace gpu_place(0);
+  CPUPlace cpu_place;
+  CUDADeviceContext ctx(gpu_place);
+  SetConstant<GPUPlace, float> functor;
+  int64_t height = 10;
+  int64_t row_numel = 10;
+  std::vector<int64_t> rows1{0, 4, 7};
+  std::unique_ptr<SelectedRows> selected_rows1{new SelectedRows(rows1, height)};
+  auto* in1_value = selected_rows1->mutable_value();
+  in1_value->mutable_data<float>(
+      make_ddim({static_cast<int64_t>(rows1.size()), row_numel}), gpu_place);
+  functor(ctx, in1_value, 1.0);
+  std::vector<int64_t> rows2{0, 5, 7, 9};
+  std::unique_ptr<SelectedRows> selected_rows2{new SelectedRows(rows2, height)};
+  auto* in2_value = selected_rows2->mutable_value();
+  in2_value->mutable_data<float>(
+      make_ddim({static_cast<int64_t>(rows2.size()), row_numel}), gpu_place);
+  functor(ctx, in2_value, 2.0);
+  std::unique_ptr<SelectedRows> output{new SelectedRows()};
+  auto* out_value = output->mutable_value();
+  // simply concat two SelectedRows
+  out_value->mutable_data<float>(make_ddim({7, 10}), gpu_place);
+  SelectedRowsAdd<GPUPlace, float> add_functor;
+  add_functor(ctx, *selected_rows1, *selected_rows2, output.get());
+  auto out_height = output->height();
+  EXPECT_EQ(out_height, height);
+  auto& out_rows = output->rows();
+  // input1 rows
+  EXPECT_EQ(out_rows[0], 0);
+  EXPECT_EQ(out_rows[1], 4);
+  EXPECT_EQ(out_rows[2], 7);
+  // input2 rows
+  EXPECT_EQ(out_rows[3], 0);
+  EXPECT_EQ(out_rows[4], 5);
+  EXPECT_EQ(out_rows[5], 7);
+  EXPECT_EQ(out_rows[6], 9);
+  Tensor out_cpu;
+  out_cpu.CopyFrom<float>(*out_value, cpu_place, ctx);
+  ctx.Wait();
+  auto* out_cpu_data = out_cpu.data<float>();
+  // input1 value
+  EXPECT_EQ(out_cpu_data[0 * row_numel + 0], 1.0);
+  EXPECT_EQ(out_cpu_data[0 * row_numel + 8], 1.0);
+  EXPECT_EQ(out_cpu_data[1 * row_numel + 1], 1.0);
+  EXPECT_EQ(out_cpu_data[2 * row_numel + 6], 1.0);
+  // input2 value
+  EXPECT_EQ(out_cpu_data[3 * row_numel + 3], 2.0);
+  EXPECT_EQ(out_cpu_data[3 * row_numel + 8], 2.0);
+  EXPECT_EQ(out_cpu_data[4 * row_numel + 4], 2.0);
+  EXPECT_EQ(out_cpu_data[5 * row_numel + 7], 2.0);
+  EXPECT_EQ(out_cpu_data[6 * row_numel + 9], 2.0);
+  std::unique_ptr<Tensor> tensor1{new Tensor()};
+  tensor1->mutable_data<float>(make_ddim({height, row_numel}), gpu_place);
+  functor(ctx, tensor1.get(), 3.0);
+  std::unique_ptr<Tensor> tensor2{new Tensor()};
+  tensor2->mutable_data<float>(make_ddim({height, row_numel}), gpu_place);
+  SelectedRowsAddTensor<GPUPlace, float> add_tensor_functor;
+  add_tensor_functor(ctx, *output, *tensor1, tensor2.get());
+  Tensor tensor2_cpu;
+  tensor2_cpu.CopyFrom<float>(*tensor2, cpu_place, ctx);
+  ctx.Wait();
+  auto* tensor2_cpu_data = tensor2_cpu.data<float>();
+  // row0: 1.0 + 2.0 + 3.0
+  EXPECT_EQ(tensor2_cpu_data[0 * row_numel + 0], 6.0);
+  // row1: 3.0
+  EXPECT_EQ(tensor2_cpu_data[1 * row_numel + 1], 3.0);
+  // row4: 1.0 + 3.0
+  EXPECT_EQ(tensor2_cpu_data[4 * row_numel + 6], 4.0);
+  // row5: 2.0 + 3.0
+  EXPECT_EQ(tensor2_cpu_data[5 * row_numel + 7], 5.0);
+  // row6: 3.0
+  EXPECT_EQ(tensor2_cpu_data[6 * row_numel + 1], 3.0);
+  // row7: 1.0 + 2.0 + 3.0
+  EXPECT_EQ(tensor2_cpu_data[7 * row_numel + 3], 6.0);
+  // row9: 2.0 + 3.0
+  EXPECT_EQ(tensor2_cpu_data[9 * row_numel + 6], 5.0);
+}
--- a/paddle/operators/mean_op.cc
+++ b/paddle/operators/mean_op.cc
@@ -21,7 +21,6 @@ class MeanOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
                   "Input(X) of MeanOp should not be null.");
@@ -46,7 +45,6 @@ class MeanGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("X"));
  }

--- a/paddle/operators/minus_op.cc
+++ b/paddle/operators/minus_op.cc
@@ -25,7 +25,6 @@ class MinusOp : public framework::OperatorWithKernel {
          const framework::AttributeMap &attrs)
      : OperatorWithKernel(type, inputs, outputs, attrs) {}
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
                   "Input(X) of MinusOp should not be null.");

--- a/paddle/operators/modified_huber_loss_op.cc
+++ b/paddle/operators/modified_huber_loss_op.cc
@@ -21,7 +21,6 @@ class ModifiedHuberLossOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "X must be initialized.");
    PADDLE_ENFORCE(ctx->HasInput("Y"), "Y must be initialized.");
@@ -73,7 +72,6 @@ class ModifiedHuberLossGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "X must be initialized.");
    PADDLE_ENFORCE(ctx->HasInput("Y"), "Y must be initialized.");

--- a/paddle/operators/mul_op.cc
+++ b/paddle/operators/mul_op.cc
@@ -23,7 +23,6 @@ class MulOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) of MulOp should not be null.");
    PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) of MulOp should not be null.");
@@ -96,7 +95,6 @@ class MulOpGrad : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null");
    PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) should not be null");

--- a/paddle/operators/multiplex_op.cc
+++ b/paddle/operators/multiplex_op.cc
@@ -23,7 +23,6 @@ class MultiplexOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Ids"), "Input(Ids) shouldn't be null.");
    PADDLE_ENFORCE(!ctx->Inputs("X").empty(),
@@ -51,6 +50,7 @@ class MultiplexOp : public framework::OperatorWithKernel {
    ctx->SetOutputDim("Out", in_dim);
  }
+ protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
    return framework::ToDataType(ctx.MultiInput<Tensor>("X")[0]->type());
@@ -89,7 +89,6 @@ class MultiplexGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(!ctx->Inputs("X").empty(), "Input(X) should not be null.");
    PADDLE_ENFORCE(!ctx->Outputs(framework::GradVarName("X")).empty(),
@@ -105,6 +104,7 @@ class MultiplexGradOp : public framework::OperatorWithKernel {
    ctx->SetOutputsDim(framework::GradVarName("X"), d_ins);
  }
+ protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
    return framework::ToDataType(ctx.MultiInput<Tensor>("X")[0]->type());

--- a/paddle/operators/name_convention.md
+++ b/paddle/operators/name_convention.md
@@ -11,7 +11,7 @@ When defining an operator in Paddle, a corresponding [OpProtoMaker](https://gith
  - If an operator's Input/Output are tensors in math, not match to any meaningful words, input name should starts from `X`. e.g. `X`, `Y`, and output name should starts from `Out`. e.g. `Out`. This rule intends making operators which have few inputs/outputs unified.
 - Attribute.
-  - Attribute name follows the **camelCase**. e.g. `x`, `y`, `axis`, `rowwiseMatrix`. Also, attribute name prefers to meaningful English words.
+  - Attribute names follow **snake_case**, e.g. `x`, `y`, `axis`, `rowwise_matrix`. Attribute names should also prefer meaningful English words.
 - Comments.
  - Input/Output/Attr comment follow the format of **(type,default value) usage**, corresponding to which type it can be and how it will be used in the operator. e.g.  Attribute in Accumulator`"gamma" `,`(float, default 1.0) Accumulation multiplier`.

--- a/paddle/operators/pad_op.cc
+++ b/paddle/operators/pad_op.cc
@@ -23,7 +23,6 @@ class PadOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) of PadOp should not be null.");
    PADDLE_ENFORCE(ctx->HasOutput("Out"),
@@ -97,7 +96,6 @@ class PadOpGrad : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null");
    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),

--- a/paddle/operators/pool_op.cc
+++ b/paddle/operators/pool_op.cc
@@ -29,7 +29,7 @@ void PoolOp::InferShape(framework::InferShapeContext *ctx) const {
  auto in_x_dims = ctx->GetInputDim("X");
-  std::string pooling_type = ctx->Attrs().Get<std::string>("poolingType");
+  std::string pooling_type = ctx->Attrs().Get<std::string>("pooling_type");
  std::vector<int> ksize = ctx->Attrs().Get<std::vector<int>>("ksize");
  std::vector<int> strides = ctx->Attrs().Get<std::vector<int>>("strides");
  std::vector<int> paddings = ctx->Attrs().Get<std::vector<int>>("paddings");
@@ -37,7 +37,7 @@ void PoolOp::InferShape(framework::InferShapeContext *ctx) const {
  PADDLE_ENFORCE(in_x_dims.size() == 4 || in_x_dims.size() == 5,
                 "Pooling intput should be 4-D or 5-D tensor.");
-  if (ctx->Attrs().Get<bool>("globalPooling")) {
+  if (ctx->Attrs().Get<bool>("global_pooling")) {
    ksize.resize(static_cast<size_t>(in_x_dims.size()) - 2);
    for (size_t i = 0; i < ksize.size(); ++i)
      ksize[i] = static_cast<int>(in_x_dims[i + 2]);
@@ -80,23 +80,23 @@ Pool2dOpMaker::Pool2dOpMaker(framework::OpProto *proto,
            "the number of channels, H and W is the height and "
            "width of feature.");
-  AddAttr<std::string>("poolingType",
+  AddAttr<std::string>("pooling_type",
-                       "PoolingType of pooling operator."
+                       "Pooling_type of pooling operator."
                       "Str constant equal to 'max' or 'avg'.")
      .InEnum({"max", "avg"});
  AddAttr<std::vector<int>>(
      "ksize",
      "The pooling window size(height, width) of pooling operator."
-      "If globalPooling = true, ksize is ignored and need not be "
+      "If global_pooling = true, ksize is ignored and need not be "
      "specified.");  // TODO(Chengduo): Add checker. (Currently,
                      // TypedAttrChecker don't support vector type.)
  AddAttr<bool>(
-      "globalPooling",
+      "global_pooling",
-      "Whether to use the globalPooling."
+      "Whether to use the global_pooling."
      "Bool constant equal to false or true."
      "Default false."
-      "If globalPooling = true, ksize is ignored and need not be specified.")
+      "If global_pooling = true, ksize is ignored and need not be specified.")
      .SetDefault(false);
  AddAttr<std::vector<int>>("strides",
                            "The strides(height, width) of pooling window."
@@ -146,7 +146,7 @@ Pool3dOpMaker::Pool3dOpMaker(framework::OpProto *proto,
            "the number of channels, D, H and W is the depth, height and "
            "width of feature.");
-  AddAttr<std::string>("poolingType",
+  AddAttr<std::string>("pooling_type",
                       "PoolingType of pooling operator."
                       "Str constant equal to 'max' or 'avg'.")
      .InEnum({"max", "avg"});
@@ -154,15 +154,15 @@ Pool3dOpMaker::Pool3dOpMaker(framework::OpProto *proto,
  AddAttr<std::vector<int>>(
      "ksize",
      "The pooling window size(depth, height, width) of pooling operator."
-      "If globalPooling = true, ksize is ignored and need not be "
+      "If global_pooling = true, ksize is ignored and need not be "
      "specified.");  // TODO(Chengduo): Add checker. (Currently,
                      // TypedAttrChecker don't support vector type.)
  AddAttr<bool>(
-      "globalPooling",
+      "global_pooling",
-      "Whether to use the globalPooling."
+      "Whether to use the global_pooling."
      "Bool constant equal to false or true."
      "Default false."
-      "If globalPooling = true, ksize is ignored and need not be specified.")
+      "If global_pooling = true, ksize is ignored and need not be specified.")
      .SetDefault(false);
  AddAttr<std::vector<int>>("strides",
                            "Strides(depth, height, width) of pooling operator."

--- a/paddle/operators/pool_op.h
+++ b/paddle/operators/pool_op.h
@@ -28,7 +28,6 @@ class PoolOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override;
 };
@@ -36,7 +35,6 @@ class PoolOpGrad : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override;
 };
@@ -59,11 +57,11 @@ class PoolKernel : public framework::OpKernel<T> {
    const Tensor* in_x = context.Input<Tensor>("X");
    Tensor* out = context.Output<Tensor>("Out");
-    std::string pooling_type = context.Attr<std::string>("poolingType");
+    std::string pooling_type = context.Attr<std::string>("pooling_type");
    std::vector<int> ksize = context.Attr<std::vector<int>>("ksize");
    std::vector<int> strides = context.Attr<std::vector<int>>("strides");
    std::vector<int> paddings = context.Attr<std::vector<int>>("paddings");
-    if (context.Attr<bool>("globalPooling")) {
+    if (context.Attr<bool>("global_pooling")) {
      for (size_t i = 0; i < ksize.size(); ++i) {
        ksize[i] = static_cast<int>(in_x->dims()[i + 2]);
      }
@@ -119,12 +117,12 @@ class PoolGradKernel : public framework::OpKernel<T> {
        context.Input<Tensor>(framework::GradVarName("Out"));
    Tensor* in_x_grad = context.Output<Tensor>(framework::GradVarName("X"));
-    std::string pooling_type = context.Attr<std::string>("poolingType");
+    std::string pooling_type = context.Attr<std::string>("pooling_type");
    std::vector<int> ksize = context.Attr<std::vector<int>>("ksize");
    std::vector<int> strides = context.Attr<std::vector<int>>("strides");
    std::vector<int> paddings = context.Attr<std::vector<int>>("paddings");
-    if (context.Attr<bool>("globalPooling")) {
+    if (context.Attr<bool>("global_pooling")) {
      for (size_t i = 0; i < ksize.size(); ++i)
        ksize[i] = static_cast<int>(in_x->dims()[i + 2]);
    }

--- a/paddle/operators/pool_with_index_op.cc
+++ b/paddle/operators/pool_with_index_op.cc
@@ -27,7 +27,6 @@ class MaxPoolWithIndexOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
                   "X(Input) of Pooling should not be null.");
@@ -45,7 +44,7 @@ class MaxPoolWithIndexOp : public framework::OperatorWithKernel {
    PADDLE_ENFORCE(in_x_dims.size() == 4 || in_x_dims.size() == 5,
                   "Pooling intput should be 4-D or 5-D tensor.");
-    if (ctx->Attrs().Get<bool>("globalPooling")) {
+    if (ctx->Attrs().Get<bool>("global_pooling")) {
      ksize.resize(static_cast<size_t>(in_x_dims.size()) - 2);
      for (size_t i = 0; i < ksize.size(); ++i)
        ksize[i] = static_cast<int>(in_x_dims[i + 2]);
@@ -72,7 +71,6 @@ class MaxPoolWithIndexOpGrad : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Mask"), "Input(Mask) must not be null.");
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) must not be null.");
@@ -108,15 +106,15 @@ class MaxPool2dWithIndexOpMaker : public framework::OpProtoAndCheckerMaker {
    AddAttr<std::vector<int>>(
        "ksize",
        "The pooling window size(height, width) of pooling operator."
-        "If globalPooling = true, ksize is ignored and need not be "
+        "If global_pooling = true, ksize is ignored and need not be "
        "specified.");  // TODO(Chengduo): Add checker. (Currently,
                        // TypedAttrChecker don't support vector type.)
    AddAttr<bool>(
-        "globalPooling",
+        "global_pooling",
-        "Whether to use the globalPooling."
+        "Whether to use the global_pooling."
        "Bool constant equal to false or true."
        "Default false."
-        "If globalPooling = true, ksize is ignored and need not be specified.")
+        "If global_pooling = true, ksize is ignored and need not be specified.")
        .SetDefault(false);
    AddAttr<std::vector<int>>("strides",
                              "The strides(height, width) of pooling window."
@@ -179,15 +177,15 @@ class MaxPool3dWithIndexOpMaker : public framework::OpProtoAndCheckerMaker {
    AddAttr<std::vector<int>>(
        "ksize",
        "The pooling window size(depth, height, width) of pooling operator."
-        "If globalPooling = true, ksize is ignored and need not be "
+        "If global_pooling = true, ksize is ignored and need not be "
        "specified.");  // TODO(Chengduo): Add checker. (Currently,
                        // TypedAttrChecker don't support vector type.)
    AddAttr<bool>(
-        "globalPooling",
+        "global_pooling",
-        "Whether to use the globalPooling."
+        "Whether to use the global_pooling."
        "Bool constant equal to false or true."
        "Default false."
-        "If globalPooling = true, ksize is ignored and need not be specified.")
+        "If global_pooling = true, ksize is ignored and need not be specified.")
        .SetDefault(false);
    AddAttr<std::vector<int>>(
        "strides",

--- a/paddle/operators/pool_with_index_op.h
+++ b/paddle/operators/pool_with_index_op.h
@@ -35,7 +35,7 @@ class MaxPoolWithIndexKernel : public framework::OpKernel<T> {
    std::vector<int> ksize = context.Attr<std::vector<int>>("ksize");
    std::vector<int> strides = context.Attr<std::vector<int>>("strides");
    std::vector<int> paddings = context.Attr<std::vector<int>>("paddings");
-    if (context.Attr<bool>("globalPooling")) {
+    if (context.Attr<bool>("global_pooling")) {
      for (size_t i = 0; i < ksize.size(); ++i) {
        ksize[i] = static_cast<int>(in_x->dims()[i + 2]);
      }
@@ -70,7 +70,7 @@ class MaxPoolWithIndexGradKernel : public framework::OpKernel<T> {
    std::vector<int> ksize = context.Attr<std::vector<int>>("ksize");
    std::vector<int> strides = context.Attr<std::vector<int>>("strides");
    std::vector<int> paddings = context.Attr<std::vector<int>>("paddings");
-    if (context.Attr<bool>("globalPooling")) {
+    if (context.Attr<bool>("global_pooling")) {
      for (size_t i = 0; i < ksize.size(); ++i) {
        ksize[i] = static_cast<int>(in_x_grad->dims()[i + 2]);
      }

--- a/paddle/operators/prelu_op.cc
+++ b/paddle/operators/prelu_op.cc
@@ -25,7 +25,6 @@ class PReluOp : public framework::OperatorWithKernel {
          const framework::AttributeMap &attrs)
      : OperatorWithKernel(type, inputs, outputs, attrs) {}
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null");
    PADDLE_ENFORCE(ctx->HasInput("Alpha"), "Input(Alpha) should not be null");
@@ -62,7 +61,6 @@ class PReluGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) must not be null.");
    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),

--- a/paddle/operators/rank_loss_op.cc
+++ b/paddle/operators/rank_loss_op.cc
@@ -24,7 +24,6 @@ class RankLossOp : public framework::OperatorWithKernel {
             const framework::AttributeMap &attrs)
      : OperatorWithKernel(type, inputs, outputs, attrs) {}
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    // input check
    PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) shouldn't be null");
@@ -89,7 +88,6 @@ class RankLossGradOp : public framework::OperatorWithKernel {
                 const framework::AttributeMap &attrs)
      : OperatorWithKernel(type, inputs, outputs, attrs) {}
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) shouldn't be null.");
    PADDLE_ENFORCE(ctx->HasInput("Left"), "Input(Left) shouldn't be null.");

--- a/paddle/operators/reduce_op.cc
+++ b/paddle/operators/reduce_op.cc
@@ -23,7 +23,6 @@ class ReduceOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
                   "Input(X) of ReduceOp should not be null.");
@@ -57,7 +56,6 @@ class ReduceGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null.");
    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),

--- a/paddle/operators/reshape_op.cc
+++ b/paddle/operators/reshape_op.cc
@@ -25,7 +25,6 @@ class ReshapeOp : public framework::OperatorWithKernel {
            const framework::AttributeMap &attrs)
      : OperatorWithKernel(type, inputs, outputs, attrs) {}
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    // input check
    PADDLE_ENFORCE(ctx->HasInput("X"),
@@ -93,7 +92,6 @@ class ReshapeGradOp : public framework::OperatorWithKernel {
                const framework::AttributeMap &attrs)
      : OperatorWithKernel(type, inputs, outputs, attrs) {}
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) shouldn't be null.");
    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),

--- a/paddle/operators/rmsprop_op.cc
+++ b/paddle/operators/rmsprop_op.cc
@@ -21,7 +21,6 @@ class RmspropOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Param"),
                   "Input(Param) of RmspropOp should not be null.");

--- a/paddle/operators/scale_op.cc
+++ b/paddle/operators/scale_op.cc
@@ -25,7 +25,6 @@ class ScaleOp : public framework::OperatorWithKernel {
          const framework::AttributeMap &attrs)
      : OperatorWithKernel(type, inputs, outputs, attrs) {}
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
                   "Input(X) of ScaleOp should not be null.");
@@ -56,7 +55,6 @@ class ScaleGradMaker : public framework::SingleGradOpDescMaker {
 public:
  using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;
- protected:
  std::unique_ptr<framework::OpDescBind> Apply() const override {
    auto *grad_op = new framework::OpDescBind();
    grad_op->SetType("scale");

--- a/paddle/operators/scatter_op.cc
+++ b/paddle/operators/scatter_op.cc
@@ -22,7 +22,6 @@ class ScatterOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Ref"),
                   "Input(Ref) of ScatterOp should not be null.");
@@ -49,6 +48,7 @@ class ScatterOp : public framework::OperatorWithKernel {
    ctx->SetOutputDim("Out", ref_dims);
  }
+ protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
    return framework::ToDataType(ctx.Input<Tensor>("Ref")->type());
@@ -59,13 +59,13 @@ class ScatterGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    ctx->SetOutputDim(framework::GradVarName("Updates"),
                      ctx->GetInputDim("Updates"));
    ctx->SetOutputDim(framework::GradVarName("Ref"), ctx->GetInputDim("Ref"));
  }
+ protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
    return framework::ToDataType(ctx.Input<Tensor>("Ref")->type());

--- a/paddle/operators/sequence_concat_op.cc
+++ b/paddle/operators/sequence_concat_op.cc
@@ -21,7 +21,6 @@ class SequenceConcatOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInputs("X"),
                   "Inputs(X) of SequenceConcatOp should not be null.");
@@ -105,7 +104,6 @@ class SequenceConcatGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
                   "The gradient of Out should not be null.");

--- a/paddle/operators/sequence_pool_op.cc
+++ b/paddle/operators/sequence_pool_op.cc
@@ -21,7 +21,6 @@ class SequencePoolOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
                   "Input(X) of SequencePoolOp should not be null.");
@@ -72,7 +71,6 @@ class SequencePoolGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
                   "Gradient of Out should not be null.");

--- a/paddle/operators/sequence_pool_op.h
+++ b/paddle/operators/sequence_pool_op.h
@@ -111,7 +111,8 @@ class SequencePoolGradKernel : public framework::OpKernel<T> {
    in_g->mutable_data<T>(context.GetPlace());
    if (strategy == LAST || strategy == FIRST) {
      // set X@Grad be zero at first when strategy is LAST/FIRST
-      math::SetConstant<Place, T>(context.device_context(), in_g, 0);
+      math::SetConstant<Place, T> functor;
+      functor(context.device_context(), in_g, 0);
    }
    auto place = context.GetEigenDevice<Place>();
    for (int i = 0; i < static_cast<int>(lod.size()) - 1; ++i) {

--- a/paddle/operators/sequence_softmax_op.cc
+++ b/paddle/operators/sequence_softmax_op.cc
@@ -21,7 +21,6 @@ class SequenceSoftmaxOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
                   "Input(X) of SequenceSoftmaxOp should not be null.");
@@ -66,7 +65,6 @@ class SequenceSoftmaxGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Out"),
                   "Input(Out) of SequenceSoftmaxGradOp should not be null.");

--- a/paddle/operators/sgd_op.cc
+++ b/paddle/operators/sgd_op.cc
@@ -21,7 +21,6 @@ class SGDOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Param"),
                   "Input(Param) of SGDOp should not be null.");

--- a/paddle/operators/sigmoid_cross_entropy_with_logits_op.cc
+++ b/paddle/operators/sigmoid_cross_entropy_with_logits_op.cc
@@ -23,7 +23,6 @@ class SigmoidCrossEntropyWithLogitsOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null.");
    PADDLE_ENFORCE(ctx->HasInput("Labels"),
@@ -52,7 +51,6 @@ class SigmoidCrossEntropyWithLogitsGradOp
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should be not null.");
    PADDLE_ENFORCE(ctx->HasInput("Labels"),

--- a/paddle/operators/smooth_l1_loss_op.cc
+++ b/paddle/operators/smooth_l1_loss_op.cc
@@ -21,7 +21,6 @@ class SmoothL1LossOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "X must be initialized.");
    PADDLE_ENFORCE(ctx->HasInput("Y"), "Y must be initialized.");
@@ -93,7 +92,6 @@ class SmoothL1LossGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    auto in_dims = ctx->GetInputDim("X");
    auto out_dims = ctx->GetInputDim(framework::GradVarName("Out"));

--- a/paddle/operators/softmax_op.cc
+++ b/paddle/operators/softmax_op.cc
@@ -21,7 +21,6 @@ class SoftmaxOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
                   "Input(X) of SoftmaxOp should not be null.");
@@ -68,7 +67,6 @@ class SoftmaxOpGrad : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) should be not null.");
    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Y")),

--- a/paddle/operators/softmax_with_cross_entropy_op.cc
+++ b/paddle/operators/softmax_with_cross_entropy_op.cc
@@ -46,7 +46,7 @@ class SoftmaxWithCrossEntropyOpMaker
              "(Tensor, default: Tensor<float>), A 2-D tensor. The cross "
              "entropy loss with shape [N x 1].");
    AddAttr<bool>(
-        "softLabel",
+        "soft_label",
        "(bool, default: false), A flag to indicate whether to interpret "
        "the given labels as soft labels.")
        .SetDefault(false);
@@ -82,7 +82,6 @@ class SoftmaxWithCrossEntropyOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("Logits"),
                   "Input(Logits) should be not null.");
@@ -100,13 +99,13 @@ class SoftmaxWithCrossEntropyOp : public framework::OperatorWithKernel {
    PADDLE_ENFORCE_EQ(labels_dims.size(), 2UL,
                      "The labels should be a 2-D tensor.");
-    if (ctx->Attrs().Get<bool>("softLabel")) {
+    if (ctx->Attrs().Get<bool>("soft_label")) {
      PADDLE_ENFORCE_EQ(logits_dims[1], labels_dims[1],
-                        "If Attr(softLabel) == true, the 2nd dimension of "
+                        "If Attr(soft_label) == true, the 2nd dimension of "
                        "Input(X) and Input(Label) should be equal.");
    } else {
      PADDLE_ENFORCE_EQ(labels_dims[1], 1UL,
-                        "If Attr(softLabel) == false, the 2nd dimension of "
+                        "If Attr(soft_label) == false, the 2nd dimension of "
                        "Input(Label) should be 1.");
    }
@@ -117,6 +116,7 @@ class SoftmaxWithCrossEntropyOp : public framework::OperatorWithKernel {
    ctx->ShareLoD("Logits", /*->*/ "Loss");
  }
+ protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
    return framework::ToDataType(ctx.Input<Tensor>("Logits")->type());
@@ -127,7 +127,6 @@ class SoftmaxWithCrossEntropyOpGrad : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Loss")),
                   "Input(Loss@Grad) should not be null.");
@@ -142,13 +141,13 @@ class SoftmaxWithCrossEntropyOpGrad : public framework::OperatorWithKernel {
    PADDLE_ENFORCE_EQ(labels_dims.size(), 2UL,
                      "The labels should be a 2-D tensor.");
-    if (ctx->Attrs().Get<bool>("softLabel")) {
+    if (ctx->Attrs().Get<bool>("soft_label")) {
      PADDLE_ENFORCE_EQ(softmax_dims[1], labels_dims[1],
-                        "When Attr(softLabel) == true, the 2nd dimension of "
+                        "When Attr(soft_label) == true, the 2nd dimension of "
                        "Input(X) and Input(Label) should be equal.");
    } else {
      PADDLE_ENFORCE_EQ(labels_dims[1], 1UL,
-                        "When Attr(softLabel) == false, the 2nd dimension of "
+                        "When Attr(soft_label) == false, the 2nd dimension of "
                        "Input(Label) should be 1.");
    }
@@ -156,6 +155,7 @@ class SoftmaxWithCrossEntropyOpGrad : public framework::OperatorWithKernel {
                      ctx->GetInputDim("Softmax"));
  }
+ protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
    return framework::ToDataType(

--- a/paddle/operators/softmax_with_cross_entropy_op.cu
+++ b/paddle/operators/softmax_with_cross_entropy_op.cu
@@ -70,7 +70,7 @@ class SoftmaxWithCrossEntropyCUDAKernel : public framework::OpKernel<T> {
                                                  logits, softmax);
    math::CrossEntropyFunctor<platform::GPUPlace, T>()(
        context.device_context(), loss, softmax, labels,
-        context.Attr<bool>("softLabel"));
+        context.Attr<bool>("soft_label"));
  }
 };
@@ -93,7 +93,7 @@ class SoftmaxWithCrossEntropyGradCUDAKernel : public framework::OpKernel<T> {
    int block = 512;
    int grid = (batch_size * class_num + block - 1) / block;
-    if (context.Attr<bool>("softLabel")) {
+    if (context.Attr<bool>("soft_label")) {
      const T* label_data = labels->data<T>();
      SoftCrossEntropyGradientKernel<T><<<
          grid, block, 0, reinterpret_cast<const platform::CUDADeviceContext&>(

--- a/paddle/operators/softmax_with_cross_entropy_op.h
+++ b/paddle/operators/softmax_with_cross_entropy_op.h
@@ -44,7 +44,7 @@ class SoftmaxWithCrossEntropyKernel : public framework::OpKernel<T> {
                                                  logits, softmax);
    math::CrossEntropyFunctor<platform::CPUPlace, T>()(
        context.device_context(), loss, softmax, labels,
-        context.Attr<bool>("softLabel"));
+        context.Attr<bool>("soft_label"));
  }
 };
@@ -60,7 +60,7 @@ class SoftmaxWithCrossEntropyGradKernel : public framework::OpKernel<T> {
    logit_grad->ShareDataWith<T>(*context.Input<Tensor>("Softmax"));
    const int class_num = logit_grad->dims()[1];
-    if (context.Attr<bool>("softLabel")) {
+    if (context.Attr<bool>("soft_label")) {
      auto out_grad_mat = EigenMatrix<T>::From(*out_grad);
      auto logit_grad_mat = EigenMatrix<T>::From(*logit_grad);
      auto lbl_mat = EigenMatrix<T>::From(*labels);

--- a/paddle/operators/split_op.cc
+++ b/paddle/operators/split_op.cc
@@ -23,7 +23,6 @@ class SplitOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
                   "Input(X) of SplitOp should not be null.");

--- a/paddle/operators/squared_l2_distance_op.cc
+++ b/paddle/operators/squared_l2_distance_op.cc
@@ -21,7 +21,6 @@ class SquaredL2DistanceOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
                   "Input(X) of SquaredL2DistanceOp should not be null.");
@@ -85,7 +84,6 @@ class SquaredL2DistanceGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
                   "Gradient of Out should not be null");

--- a/paddle/operators/sum_op.cc
+++ b/paddle/operators/sum_op.cc
@@ -21,7 +21,6 @@ class SumOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInputs("X"), "Inputs(X) should not be null");
    auto x_dims = ctx->GetInputsDim("X");

--- a/paddle/operators/top_k_op.cc
+++ b/paddle/operators/top_k_op.cc
@@ -21,7 +21,6 @@ class TopkOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext *ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
                   "Input(X) of TopkOp should not be null.");

--- a/paddle/operators/transpose_op.cc
+++ b/paddle/operators/transpose_op.cc
@@ -23,7 +23,6 @@ class TransposeOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null");
    PADDLE_ENFORCE(ctx->HasOutput("Out"), "Output(Out) should not be null");
@@ -92,7 +91,6 @@ class TransposeOpGrad : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null");
    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),

--- a/paddle/operators/uniform_random_op.cc
+++ b/paddle/operators/uniform_random_op.cc
@@ -46,7 +46,6 @@ class UniformRandomOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;
- protected:
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasOutput("Out"),
                   "Output(Out) of UniformRandomOp should not be null.");
@@ -63,6 +62,7 @@ class UniformRandomOp : public framework::OperatorWithKernel {
    ctx->SetOutputDim("Out", framework::make_ddim(temp));
  }
+ protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
    return static_cast<framework::DataType>(Attr<int>("data_type"));

--- a/paddle/pybind/pybind.cc
+++ b/paddle/pybind/pybind.cc
@@ -16,7 +16,9 @@ limitations under the License. */
 #include "paddle/framework/backward.h"
 #include "paddle/framework/executor.h"
+#include "paddle/framework/feed_fetch_method.h"
 #include "paddle/framework/lod_tensor.h"
+#include "paddle/framework/selected_rows.h"
 #include "paddle/framework/tensor_array.h"
 #include "paddle/operators/cond_op.h"
 #include "paddle/operators/dynamic_recurrent_op.h"
@@ -138,6 +140,32 @@ PYBIND11_PLUGIN(core) {
 #endif
      });
+  py::class_<SelectedRows>(m, "SelectedRows")
+      .def("__init__",
+           [](SelectedRows &instance) { new (&instance) SelectedRows(); })
+      .def("__init__",
+           [](SelectedRows &instance, const std::vector<int64_t> rows,
+              const int64_t &height) {
+             new (&instance) SelectedRows(rows, height);
+           })
+      .def("get_tensor",
+           [](SelectedRows &self) { return self.mutable_value(); },
+           py::return_value_policy::reference)
+      .def("set_height", &SelectedRows::set_height)
+      .def("height", &SelectedRows::height)
+      .def("set_rows", &SelectedRows::set_rows)
+      .def("rows", [](SelectedRows &self) {
+#ifndef PADDLE_WITH_CUDA
+        return self.rows();
+#else
+         auto rows = self.rows();
+         std::vector<int64_t> new_rows;
+         new_rows.reserve(rows.size());
+         std::copy(rows.begin(), rows.end(), std::back_inserter(new_rows));
+         return new_rows;
+#endif
+      });
  py::class_<Variable>(m, "Variable", R"DOC(Variable Class.
 All parameter, weight, gradient are variables in Paddle.
@@ -403,6 +431,10 @@ All parameter, weight, gradient are variables in Paddle.
  m.def("unique_integer", UniqueIntegerGenerator);
  m.def("is_compile_gpu", IsCompileGPU);
+  m.def("set_feed_variable_float", framework::SetFeedVariable<float>);
+  m.def("set_feed_variable_double", framework::SetFeedVariable<double>);
+  m.def("set_feed_variable_int", framework::SetFeedVariable<int>);
+  m.def("get_fetch_variable", framework::GetFetchVariable);
  BindProgramDesc(m);
  BindBlockDesc(m);

--- a/python/paddle/v2/framework/framework.py
+++ b/python/paddle/v2/framework/framework.py
@@ -153,7 +153,8 @@ class OpProtoHolder(object):
            self.op_proto_map[proto.type] = proto
    def get_op_proto(self, type):
-        assert type in self.op_proto_map, "Operator \"%s\" has not been registered." % type
+        if type not in self.op_proto_map:
+            raise ValueError("Operator \"%s\" has not been registered." % type)
        return self.op_proto_map[type]
@@ -374,10 +375,10 @@ class Program(object):
            cls._instance = cls()
        return cls._instance
-    def __init__(self):
+    def __init__(self, desc=None):
-        assert not hasattr(self.__class__,
+        if desc is None:
-                           '_instance'), 'Do not call constructor directly!'
+            desc = core.ProgramDesc.instance()
-        self.desc = core.ProgramDesc.instance()
+        self.desc = desc
        self.blocks = [Block(self, 0)]
        self.current_block_idx = 0
@@ -428,7 +429,6 @@ class Parameter(Variable):
            if each < 0:
                raise ValueError("Parameter shape should not be related with "
                                 "batch-size")
        Variable.__init__(self, block, shape=shape, dtype=dtype, **kwargs)
        self.trainable = kwargs.get('trainable', True)
        self.init_attr = kwargs.get('initialize_attr', {
@@ -441,7 +441,7 @@ class Parameter(Variable):
        self._append_initialize_ops_()
    def _append_initialize_ops_(self):
-        attr = copy.deepcopy(self.init_attr)
+        attr = self.init_attr
        op_type = attr.pop('type', None)
        block = self.block
        assert isinstance(block, Block)
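
Review note on the `_append_initialize_ops_` hunk above: it replaces `copy.deepcopy(self.init_attr)` with a plain reference, and the next line calls `attr.pop('type', None)`. With a reference, the `pop` also removes `'type'` from `self.init_attr`. Whether that matters depends on whether `init_attr` is read again later, which this diff does not show. The following is a minimal standalone sketch of the difference. The two helper functions are hypothetical and are not part of Paddle.

```python
import copy


def pop_type_aliased(init_attr):
    # Hypothetical helper mirroring the new code path: no copy is made,
    # so pop() mutates the dict that the caller passed in.
    attr = init_attr
    return attr.pop('type', None)


def pop_type_copied(init_attr):
    # Hypothetical helper mirroring the old code path: pop() only touches
    # the deep copy, so the caller's dict keeps its 'type' key.
    attr = copy.deepcopy(init_attr)
    return attr.pop('type', None)


shared = {'type': 'uniform_random', 'min': -1.0, 'max': 1.0}
copied_result = pop_type_copied(shared)
type_kept_after_copy = 'type' in shared

aliased_result = pop_type_aliased(shared)
type_kept_after_alias = 'type' in shared
```

In this sketch `type_kept_after_copy` is true and `type_kept_after_alias` is false, so a second call on the same dict would see no `'type'` entry.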

--- a/python/paddle/v2/framework/layer_helper.py
+++ b/python/paddle/v2/framework/layer_helper.py
+from paddle.v2.framework.framework import Variable, OpProtoHolder, g_program
+import paddle.v2.framework.core as core
+import copy
+import itertools
+def unique_name(prefix):
+    uid = core.unique_integer()  # unique during whole process.
+    return "_".join([prefix, str(uid)])
+class LayerHelper(object):
+    def __init__(self, layer_type, **kwargs):
+        self.kwargs = kwargs
+        self.layer_type = layer_type
+        name = self.kwargs.get('name', None)
+        if name is None:
+            self.kwargs['name'] = unique_name(self.layer_type)
+    @property
+    def name(self):
+        return self.kwargs['name']
+    @property
+    def program(self):
+        prog = self.kwargs.get('program', None)
+        if prog is None:
+            return g_program
+        else:
+            return prog
+    def append_op(self, *args, **kwargs):
+        return self.program.current_block().append_op(*args, **kwargs)
+    def multiple_input(self, input_param_name='input'):
+        inputs = self.kwargs.get(input_param_name, [])
+        type_error = TypeError(
+            "Input of {0} layer should be Variable or sequence of Variable".
+            format(self.layer_type))
+        if isinstance(inputs, Variable):
+            inputs = [inputs]
+        elif not isinstance(inputs, list) and not isinstance(inputs, tuple):
+            raise type_error
+        else:
+            for each in inputs:
+                if not isinstance(each, Variable):
+                    raise type_error
+        return inputs
+    def input(self, input_param_name='input'):
+        inputs = self.multiple_input(input_param_name)
+        if len(inputs) != 1:
+            raise ValueError(
+                "{0} layer only takes one input".format(self.layer_type))
+        return inputs[0]
+    @property
+    def param_attr(self):
+        default = {
+            'name': None,
+            'init_attr': {
+                'type': 'uniform_random',
+                'min': -1.0,
+                'max': 1.0
+            }
+        }
+        actual = self.kwargs.get('param_attr', None)
+        return actual if actual is not None else default
+    def bias_attr(self, size, dtype):
+        bias_attr = self.kwargs.get('bias_attr', False)
+        if bias_attr is None or bias_attr:
+            bias_attr = {
+                'name': None,
+                'init_attr': {
+                    'type': 'fill_constant',
+                    'value': 0.0,
+                    'shape': [size],
+                    'dataType': dtype
+                }
+            }
+        return bias_attr
+    def multiple_param_attr(self, length):
+        param_attr = self.param_attr
+        if isinstance(param_attr, dict):
+            param_attr = [param_attr]
+        if len(param_attr) != 1 and len(param_attr) != length:
+            raise ValueError("parameter number mismatch")
+        elif len(param_attr) == 1 and length != 1:
+            tmp = [None] * length
+            for i in xrange(length):
+                tmp[i] = copy.deepcopy(param_attr[0])
+            param_attr = tmp
+        return param_attr
+    def iter_inputs_and_params(self, input_param_name='input'):
+        inputs = self.multiple_input(input_param_name)
+        param_attrs = self.multiple_param_attr(len(inputs))
+        for ipt, param_attr in itertools.izip(inputs, param_attrs):
+            yield ipt, param_attr
+    def input_dtype(self, input_param_name='input'):
+        inputs = self.multiple_input(input_param_name)
+        dtype = None
+        for each in inputs:
+            if dtype is None:
+                dtype = each.data_type
+            elif dtype != each.data_type:
+                raise ValueError("Data Type mismatch")
+        return dtype
+    def create_parameter(self, attr, shape, dtype, suffix='w'):
+        if attr['name'] is None:
+            attr['name'] = unique_name(".".join([self.name, suffix]))
+        return self.program.global_block().create_parameter(
+            name=attr['name'],
+            dtype=dtype,
+            shape=shape,
+            initialize_attr=attr['init_attr'])
+    def create_tmp_variable(self, dtype):
+        return self.program.current_block().create_var(
+            name=unique_name(".".join([self.name, 'tmp'])), dtype=dtype)
+    def create_global_variable(self, *args, **kwargs):
+        return self.program.global_block().create_var(*args, **kwargs)
+    def append_bias_op(self, input_var):
+        bias_attr = self.bias_attr(
+            self.kwargs['size'], dtype=input_var.data_type)
+        if not bias_attr:
+            return input_var
+        b = self.create_parameter(
+            attr=bias_attr,
+            shape=[self.kwargs['size']],
+            dtype=input_var.data_type,
+            suffix='b')
+        tmp = self.create_tmp_variable(dtype=input_var.data_type)
+        self.append_op(
+            type='elementwise_add',
+            inputs={'X': [input_var],
+                    'Y': [b]},
+            outputs={'Out': [tmp]})
+        return tmp
+    def append_activation(self, input_var):
+        act = self.kwargs.get('act', None)
+        if act is None:
+            return input_var
+        if isinstance(act, basestring):
+            act = {'type': act}
+        tmp = self.create_tmp_variable(dtype=input_var.data_type)
+        act_type = act.pop('type')
+        self.append_op(
+            type=act_type,
+            inputs={"X": [input_var]},
+            outputs={"Y": [tmp]},
+            attrs=act)
+        return tmp
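The dtype check in `input_dtype` above can be tried in isolation. The following is a minimal sketch, not the framework code: `FakeVar` and `common_dtype` are hypothetical names introduced here, and `FakeVar` only mimics the `data_type` attribute that the loop reads.

```python
from collections import namedtuple

# Hypothetical stand-in for a framework Variable; only `data_type` is used.
FakeVar = namedtuple("FakeVar", ["name", "data_type"])


def common_dtype(variables):
    """Return the data type shared by all variables (None for an empty list).

    Mirrors the loop in `input_dtype`: the first variable fixes the dtype,
    and any later variable with a different dtype raises ValueError.
    """
    dtype = None
    for each in variables:
        if dtype is None:
            dtype = each.data_type
        elif dtype != each.data_type:
            raise ValueError("Data Type mismatch")
    return dtype


same = [FakeVar("a", "float32"), FakeVar("b", "float32")]
print(common_dtype(same))

mixed = [FakeVar("a", "float32"), FakeVar("ids", "int64")]
try:
    common_dtype(mixed)
except ValueError as err:
    print("rejected:", err)
```

Failing early here means a layer with mixed input types is rejected while the program is being built rather than when an operator runs.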
--- a/python/paddle/v2/framework/layers.py
+++ b/python/paddle/v2/framework/layers.py
+from paddle.v2.framework.layer_helper import LayerHelper
+import paddle.v2.framework.core as core
+from paddle.v2.framework.framework import OpProtoHolder, Variable
+import re
+__all__ = ['fc_layer', 'data_layer', 'cross_entropy', 'square_error_cost']
+def fc_layer(input,
+             size,
+             param_attr=None,
+             bias_attr=True,
+             name=None,
+             act=None,
+             num_flatten_dims=1,
+             program=None):
+    # create helper
+    helper = LayerHelper('fc', **locals())
+    dtype = helper.input_dtype()
+    # mul: multiply each input by its own weight parameter
+    mul_results = []
+    for input_var, param_attr in helper.iter_inputs_and_params():
+        input_shape = input_var.shape
+        param_shape = list(input_shape[num_flatten_dims:]) + [size]
+        w = helper.create_parameter(
+            attr=param_attr, shape=param_shape, dtype=dtype)
+        tmp = helper.create_tmp_variable(dtype)
+        helper.append_op(
+            type="mul",
+            inputs={
+                "X": input_var,
+                "Y": w,
+            },
+            outputs={"Out": tmp},
+            attrs={'x_num_col_dims': num_flatten_dims})
+        mul_results.append(tmp)
+    # sum: add up the products when there is more than one input
+    if len(mul_results) == 1:
+        pre_bias = mul_results[0]
+    else:
+        pre_bias = helper.create_tmp_variable(dtype)
+        helper.append_op(
+            type="sum", inputs={"X": mul_results}, outputs={"Out": pre_bias})
+    # add bias
+    pre_activation = helper.append_bias_op(pre_bias)
+    # add activation
+    return helper.append_activation(pre_activation)
+def data_layer(name,
+               shape,
+               data_type='float32',
+               type=core.VarDesc.VarType.LOD_TENSOR,
+               program=None):
+    helper = LayerHelper('data', **locals())
+    shape = [-1] + shape  # prepend batch size as -1
+    return helper.create_global_variable(
+        name=name, shape=shape, dtype=data_type, type=type)
+def _convert_(name):
+    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
+    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()
+def _create_op_func_(op_type):
+    op_proto = OpProtoHolder.instance().get_op_proto(op_type)
+    if len(op_proto.outputs) != 1:
+        raise ValueError(
+            "Only operators with exactly one output can be automatically "
+            "generated")
+    if op_proto.outputs[0].duplicable:
+        raise ValueError(
+            "Only operators with a non-duplicable output can be "
+            "automatically generated")
+    o_name = op_proto.outputs[0].name
+    def func(**kwargs):
+        helper = LayerHelper(op_type, **kwargs)
+        inputs = dict()
+        dtype = None
+        for ipt in op_proto.inputs:
+            name = _convert_(ipt.name)
+            val = kwargs.pop(name, [])
+            if not isinstance(val, list) and not isinstance(val, tuple):
+                val = [val]
+            for each in val:
+                if not isinstance(each, Variable):
+                    raise ValueError(
+                        "inputs of {0} must be Variable instances".format(
+                            op_type))
+                if dtype is None:
+                    dtype = each.data_type
+                elif dtype != each.data_type:
+                    raise ValueError(
+                        "all inputs of operator {0} must have the same dtype".
+                        format(op_type))
+            inputs[ipt.name] = val
+        out = helper.create_tmp_variable(dtype=dtype)
+        helper.append_op(
+            type=op_type, inputs=inputs, outputs={o_name: [out]}, attrs=kwargs)
+        return out
+    func.__name__ = op_type
+    globals()[op_type] = func
+    global __all__
+    __all__.append(op_type)
+_create_op_func_('mean')
+def cross_entropy(input, label, **kwargs):
+    helper = LayerHelper('cross_entropy', **kwargs)
+    out = helper.create_tmp_variable(dtype=input.data_type)
+    helper.append_op(
+        type='cross_entropy',
+        inputs={'X': [input],
+                'Label': [label]},
+        outputs={'Y': [out]},
+        attrs=kwargs)
+    return out
+def square_error_cost(input, label, **kwargs):
+    helper = LayerHelper('square_error_cost', **kwargs)
+    minus_out = helper.create_tmp_variable(dtype=input.data_type)
+    helper.append_op(
+        type='elementwise_sub',
+        inputs={'X': [input],
+                'Y': [label]},
+        outputs={'Out': [minus_out]})
+    square_out = helper.create_tmp_variable(dtype=input.data_type)
+    helper.append_op(
+        type='pow',
+        inputs={'X': [minus_out]},
+        outputs={'Y': [square_out]},
+        attrs={'factor': 2.0})
+    return square_out
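The generated op functions in `layers.py` look up each input under a keyword produced by `_convert_`, so it helps to see what that name mapping yields. Below is a minimal standalone sketch: `camel_to_snake` is a hypothetical name for a copy of the same two regex passes, and the sample names are illustrative rather than taken from any particular op proto.

```python
import re


def camel_to_snake(name):
    """Standalone copy of the two-pass regex used by `_convert_`."""
    # Pass 1: insert "_" before a capitalized word that follows any character.
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    # Pass 2: insert "_" between a lowercase letter or digit and an uppercase
    # letter, then lowercase the whole string.
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()


# Illustrative CamelCase names; not a list of real proto fields.
for proto_name in ['X', 'Label', 'SoftmaxOut', 'XNumColDims']:
    print(proto_name, '->', camel_to_snake(proto_name))
```

Under this mapping, an input that an op proto names `X` is read from the keyword argument `x` of the generated function, and any keyword arguments left over after the inputs are popped are forwarded as op attributes.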
--- a/python/paddle/v2/framework/tests/test_cross_entropy_op.py
+++ b/python/paddle/v2/framework/tests/test_cross_entropy_op.py
--- a/python/paddle/v2/framework/tests/test_feed_fetch_method.py
+++ b/python/paddle/v2/framework/tests/test_feed_fetch_method.py
--- a/python/paddle/v2/framework/tests/test_layers.py
+++ b/python/paddle/v2/framework/tests/test_layers.py
--- a/python/paddle/v2/framework/tests/test_operator_desc.py
+++ b/python/paddle/v2/framework/tests/test_operator_desc.py
--- a/python/paddle/v2/framework/tests/test_pool2d_op.py
+++ b/python/paddle/v2/framework/tests/test_pool2d_op.py
--- a/python/paddle/v2/framework/tests/test_pool3d_op.py
+++ b/python/paddle/v2/framework/tests/test_pool3d_op.py
--- a/python/paddle/v2/framework/tests/test_pool_max_op.py
+++ b/python/paddle/v2/framework/tests/test_pool_max_op.py
--- a/python/paddle/v2/framework/tests/test_selected_rows.py
+++ b/python/paddle/v2/framework/tests/test_selected_rows.py
--- a/python/paddle/v2/framework/tests/test_softmax_with_cross_entropy_op.py
+++ b/python/paddle/v2/framework/tests/test_softmax_with_cross_entropy_op.py