Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into op_transpose

35967e86 · xzl · 5ede6fd4 · d59295f2 · 35967e86 · 35967e86
79 changed file
--- a/doc/design/block.md
+++ b/doc/design/block.md
+# Design Doc: Block and Scope
+
+## The Representation of Computation
+
+Both deep learning systems and programming languages help users describe computation procedures.  These systems use various representations of computation:
+
+- Caffe, Torch, and Paddle: sequences of layers.
+- TensorFlow, Caffe2, MXNet: graphs of operators.
+- PaddlePaddle: nested blocks, like C++ and Java programs.
+
+## Block in Programming Languages and Deep Learning
+
+In programming languages, a block is a pair of curly braces that encloses local variable definitions and a sequence of instructions, or operators.
+
+Blocks work with control flow structures like `if`, `else`, and `for`, which have equivalents in deep learning:
+
+| programming languages | PaddlePaddle          |
+|-----------------------|-----------------------|
+| for, while loop       | RNN, WhileOp          |
+| if, if-else, switch   | IfElseOp, SwitchOp    |
+| sequential execution  | a sequence of layers  |
+
+A key difference is that a C++ program describes a one-pass computation, whereas a deep learning program describes both the forward and backward passes.
+
+## Stack Frames and the Scope Hierarchy
+
+The existence of the backward pass makes the execution of a block in PaddlePaddle differ from that in traditional programs:
+
+| programming languages      | PaddlePaddle                     |
+|----------------------------|----------------------------------|
+| stack                      | scope hierarchy                  |
+| stack frame                | scope                            |
+| push when entering a block | push when entering a block       |
+| pop when leaving a block   | destroy when minibatch completes |
+
+1. In traditional programs:
+
+   - When the execution enters the left curly brace of a block, the runtime pushes a frame onto the stack, where it realizes local variables.
+   - After the execution leaves the right curly brace, the runtime pops the frame.
+   - The maximum number of frames in the stack is the maximum depth of nested blocks.
+
+1. In PaddlePaddle:
+
+   - When the execution enters a block, PaddlePaddle adds a new scope, where it realizes variables.
+   - PaddlePaddle doesn't pop a scope after the execution of the block because variables therein are to be used by the backward pass.  So it has a stack forest known as a *scope hierarchy*.
+   - The height of the highest tree is the maximum depth of nested blocks.
+   - After processing a minibatch, PaddlePaddle destroys the scope hierarchy.
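
The contrast with a call stack can be illustrated with a small sketch. This is not PaddlePaddle code: the `Scope` class and its `new_child`, `find_var`, `height`, and `drop_children` methods are hypothetical names chosen for illustration. It only shows that scopes created on block entry stay alive until the minibatch is done, and that lookups fall back to the parent scope.

```python
class Scope:
    """Hypothetical scope: local variables plus a link to the parent scope."""

    def __init__(self, parent=None):
        self.parent = parent
        self.vars = {}
        self.children = []

    def new_child(self):
        # Entering a block adds a scope. It is not popped when the block
        # exits, so a backward pass could still read its variables.
        child = Scope(parent=self)
        self.children.append(child)
        return child

    def find_var(self, name):
        # Look in this scope first, then walk up through the ancestors.
        scope = self
        while scope is not None:
            if name in scope.vars:
                return scope.vars[name]
            scope = scope.parent
        return None

    def height(self):
        # Height of the tree rooted at this scope.
        return 1 + max((c.height() for c in self.children), default=0)

    def drop_children(self):
        # Called once the minibatch is finished.
        self.children.clear()


root = Scope()
root.vars["W"] = 2
for x in [10, 20, 30]:
    step_scope = root.new_child()  # one scope per execution of the block
    step_scope.vars["fc_out"] = step_scope.find_var("W") * x

# All three step scopes are still reachable after their blocks finished.
kept = [c.vars["fc_out"] for c in root.children]
height_before = root.height()

root.drop_children()  # the minibatch completes: destroy the hierarchy
```

A real implementation would also have to decide who owns the variables' memory; the sketch leaves that to Python's garbage collector.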
+
+## Use Blocks in C++ and PaddlePaddle Programs
+
+Let us consolidate the discussion by presenting some examples.
+
+### Blocks with `if-else` and `IfElseOp`
+
+The following C++ program shows how blocks are used with the `if-else` structure:
+
+```c++
+int x = 10;
+int y = 20;
+int out;
+bool cond = false;
+if (cond) {
+  int z = x + y;
+  out = softmax(z);
+} else {
+  int z = fc(x);
+  out = z;
+}
+```
+
+An equivalent PaddlePaddle program from the design doc of the [IfElseOp operator](./if_else_op.md) is as follows:
+
+```python
+import paddle as pd
+
+x = var(10)
+y = var(20)
+cond = var(False)
+ie = pd.create_ifelseop(inputs=[x], output_num=1)
+with ie.true_block():
+    x = ie.inputs(True, 0)   # instances of x where cond is true
+    z = operator.add(x, y)
+    ie.set_output(True, 0, operator.softmax(z))
+with ie.false_block():
+    x = ie.inputs(False, 0)  # instances of x where cond is false
+    z = layer.fc(x)
+    ie.set_output(False, 0, z)
+out = ie(cond)
+```
+
+In both examples, the true branch computes `softmax(x+y)` and the false branch computes `fc(x)`.
+
+A difference is that variables in the C++ program contain scalar values, whereas those in the PaddlePaddle program are mini-batches of instances.  The `ie.inputs(True, 0)` invocation returns, as the local variable `x`, the instances in the 0-th input, `x`, that correspond to true values in `cond`, whereas `ie.inputs(False, 0)` returns the instances corresponding to false values.
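
The per-instance routing described above can be sketched in plain Python. This is only an illustration of the semantics, not the IfElseOp implementation: `if_else` is a hypothetical helper, and the two lambdas stand in for the `softmax(x+y)` and `fc(x)` branches.

```python
def if_else(cond, xs, true_fn, false_fn):
    """Route each instance of a mini-batch through one branch, keeping order."""
    # Split the mini-batch by the condition, like the per-branch inputs.
    true_in = [x for c, x in zip(cond, xs) if c]
    false_in = [x for c, x in zip(cond, xs) if not c]

    # Each branch processes its own sub-batch as a whole.
    true_out = iter(true_fn(true_in))
    false_out = iter(false_fn(false_in))

    # Merge the branch outputs back into the original instance order.
    return [next(true_out) if c else next(false_out) for c in cond]


out = if_else(
    cond=[True, False, True],
    xs=[1, 2, 3],
    true_fn=lambda batch: [v * 10 for v in batch],  # stand-in for the true block
    false_fn=lambda batch: [-v for v in batch],     # stand-in for the false block
)
```

The output has as many instances as `cond`, with each position filled by whichever branch handled that instance.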
+
+### Blocks with `for` and `RNNOp`
+
+The following RNN model from the [RNN design doc](./rnn.md)
+
+```python
+x = sequence([10, 20, 30])
+m = var(0)
+W = tensor()
+U = tensor()
+
+rnn = create_rnn(inputs=[x])
+with rnn.stepnet() as net:
+  x = net.set_inputs(0)
+  h = net.add_memory(init=m)
+  fc_out = pd.matmul(W, x)
+  hidden_out = pd.matmul(U, h.pre(n=1))
+  sum = pd.add_two(fc_out, hidden_out)
+  act = pd.sigmoid(sum)
+  h.update(act)                       # update memory with act
+  net.set_outputs(0, act, hidden_out) # two outputs
+
+o1, o2 = rnn()
+print o1, o2
+```
+
+has its equivalent C++ program as follows
+
+```c++
+int x[] = {10, 20, 30};
+int m = 0;
+int W = some_value();
+int U = some_other_value();
+
+int mem[sizeof(x) / sizeof(x[0]) + 1];
+int o1[sizeof(x) / sizeof(x[0]) + 1];
+int o2[sizeof(x) / sizeof(x[0]) + 1];
+for (int i = 1; i <= sizeof(x)/sizeof(x[0]); ++i) {
+  int x_i = x[i-1];               // current element of the input sequence
+  if (i == 1) mem[0] = m;
+  int fc_out = W * x_i;
+  int hidden_out = U * mem[i-1];
+  int sum = fc_out + hidden_out;
+  int act = sigmoid(sum);
+  mem[i] = act;
+  o1[i] = act;
+  o2[i] = hidden_out;
+}
+
+print_array(o1);
+print_array(o2);
+```
+
+
+## Compilation and Execution
+
+Like TensorFlow programs, a PaddlePaddle program is written in Python.  The first part describes a neural network as a protobuf message, and the rest executes the message for training or inference.
+
+Generating this protobuf message is like a compiler generating a binary executable file.  Executing the message is like the OS executing that binary file.
+
+## The "Binary Executable File Format"
+
+The definition of the protobuf message is as follows:
+
+```protobuf
+message BlockDesc {
+  repeated VarDesc vars = 1;
+  repeated OpDesc ops = 2;
+}
+```
+
+The step net in the above RNN example would look like
+
+```
+BlockDesc {
+  vars = {
+    VarDesc {...} // x
+    VarDesc {...} // h
+    VarDesc {...} // fc_out
+    VarDesc {...} // hidden_out
+    VarDesc {...} // sum
+    VarDesc {...} // act
+  }
+  ops = {
+    OpDesc {...} // matmul
+    OpDesc {...} // add_two
+    OpDesc {...} // sigmoid
+  }
+};
+```
+
+Also, the RNN operator in the above example is serialized into a protobuf message of type `OpDesc` and would look like:
+
+```
+OpDesc {
+  inputs = {0} // the index of x
+  outputs = {5, 3} // indices of act and hidden_out
+  attrs {
+    "memories" : {1} // the index of h
+    "step_net" : <above step net>
+  }
+};
+```
+
+This `OpDesc` value is in the `ops` field of the `BlockDesc` value representing the global block.
+
+
+## The Compilation of Blocks
+
+During the generation of the Protobuf message, the Block should store VarDesc (the Protobuf message which describes Variable) and OpDesc (the Protobuf message which describes Operator).
+
+Each VarDesc in a block should live in the block's own name scope, so that local variables do not affect the parent block's name scope.
+A child block's name scope should inherit from the parent's, so that an OpDesc in the child block can reference a VarDesc stored in the parent block. For example:
+
+```python
+a = pd.Variable(shape=[20, 20])
+b = pd.fc(a, params=["fc.w", "fc.b"])
+
+rnn = pd.create_rnn()
+with rnn.stepnet() as net:
+    x = net.set_inputs(a)
+    # reuse fc's parameter
+    fc_without_b = pd.get_variable("fc.w")
+    net.set_outputs(fc_without_b)
+
+out = rnn()
+```
+The method `pd.get_variable` retrieves a Variable by name. A Variable may be stored in a parent block but retrieved in a child block, so a block should have a variable scope that supports inheritance.
+
+In compiler design, the symbol table is a data structure created and maintained by compilers to store information about the occurrence of various entities such as variable names, function names, classes, etc.
+
+To store the definition of variables and operators, we define a C++ class `SymbolTable`, like the one used in compilers.
+
+`SymbolTable` can do the following:
+
+- store the definitions (names and attributes) of variables and operators,
+- verify whether a variable was declared,
+- make it possible to implement type checking (by offering Protobuf message pointers to `InferShape` handlers).
+
+
+```c++
+// Information in SymbolTable is enough to trace the dependency graph, so it
+// may be enough for the Eval() interface to take a SymbolTable.
+class SymbolTable {
+ public:
+  SymbolTable(SymbolTable* parent) : parent_(parent) {}
+
+  OpDesc* NewOp(const string& name="");
+
+  // TODO determine whether name is generated by python or C++
+  // currently assume that a unique name will be generated by C++ if the
+  // argument name left default.
+  VarDesc* NewVar(const string& name="");
+
+  // Find a VarDesc by name. If recursive is true, also search the parent
+  // SymbolTables recursively.
+  // This interface is introduced to support InferShape: it finds the protobuf
+  // messages of variables and operators and passes pointers to them into
+  // InferShape.
+  //
+  // NOTE maybe some C++ classes such as VarDescBuilder and OpDescBuilder should
+  // be proposed and embedded into pybind to enable python operate on C++ pointers.
+  VarDesc* FindVar(const string& name, bool recursive=true);
+
+  OpDesc* FindOp(const string& name);
+
+  BlockDesc Compile() const;
+
+ private:
+  SymbolTable* parent_;
+
+  map<string, OpDesc> ops_;
+  map<string, VarDesc> vars_;
+};
+```
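
The inherited lookup that `FindVar` describes can be illustrated with a small Python sketch. It mirrors only the `NewVar`/`FindVar` part of the C++ interface above. The dictionary used as a stand-in for `VarDesc` and the `tmp_N` naming scheme for unnamed variables are assumptions made for the example, not part of the design.

```python
import itertools


class SymbolTable:
    """Hypothetical Python analogue of the SymbolTable declared above."""

    _ids = itertools.count()  # source of unique names (assumed scheme)

    def __init__(self, parent=None):
        self.parent = parent
        self.vars = {}

    def new_var(self, name=""):
        # Generate a unique name when the caller leaves it empty.
        if not name:
            name = "tmp_%d" % next(SymbolTable._ids)
        if name in self.vars:
            raise ValueError("variable %r already declared" % name)
        desc = {"name": name}  # stand-in for a VarDesc message
        self.vars[name] = desc
        return desc

    def find_var(self, name, recursive=True):
        # Search the local table first, then the ancestors if allowed.
        if name in self.vars:
            return self.vars[name]
        if recursive and self.parent is not None:
            return self.parent.find_var(name, recursive)
        return None


global_table = SymbolTable()
fc_w = global_table.new_var("fc.w")

# The step net's table inherits from the global one.
step_table = SymbolTable(parent=global_table)
step_table.new_var("x")

found_in_child = step_table.find_var("fc.w")
found_locally = step_table.find_var("fc.w", recursive=False)
leaked_to_parent = global_table.find_var("x")
```

The child sees `fc.w` through its parent, while `x` stays invisible to the global table, which is the isolation the name-scope discussion asks for.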
+
+After the descriptions of all variables and operators are added into the SymbolTable,
+the block has enough information to run.
+
+The `Block` class takes a `BlockDesc` as input and provides the `Run` and `InferShape` functions.
+
+
+```c++
+namespace {
+
+class Block : public OperatorBase {
+public:
+  Block(const BlockDesc& desc) : desc_(desc) {}
+
+  void InferShape(const framework::Scope& scope) const override {
+    if (!symbols_ready_) {
+      CreateVariables(scope);
+      CreateOperators();
+    }
+    // should run InferShape first.
+    for (auto& op : runtime_table_.ops()) {
+      op->InferShape(scope);
+    }
+  }
+
+  void Run(const framework::Scope& scope,
+           const platform::DeviceContext& dev_ctx) const override {
+    PADDLE_ENFORCE(symbols_ready_, "operators and variables should be created first.");
+    for (auto& op : runtime_table_.ops()) {
+      op->Run(scope, dev_ctx);
+    }
+  }
+
+  void CreateVariables(const framework::Scope& scope);
+  void CreateOperators();
+
+  // some other necessary interfaces of NetOp are listed below
+  // ...
+
+private:
+  BlockDesc desc_;
+  bool symbols_ready_{false};
+};
+
+}  // namespace
+```
+
+## The Execution of Blocks
+
+Block inherits from OperatorBase, which has a Run method.
+Block's Run method runs its operators sequentially.
+
+Another important interface is `Eval`. It takes a list of arguments called targets, prunes the block to the minimal graph that has the targets as its end points, and creates a new Block from that graph.
+After running the new Block, `Eval` fetches the latest values of the targets and returns them.
+
+The definition of Eval is as follows:
+
+```c++
+// clean a block description by targets using the corresponding dependency graph.
+// return a new BlockDesc with minimal number of operators.
+// NOTE not return a Block but the block's description so that this can be distributed
+// to a cluster.
+BlockDesc Prune(const BlockDesc& desc, vector<string> targets);
+
+void Block::Eval(const vector<string>& targets,
+                 const framework::Scope& scope,
+                 const platform::DeviceContext& dev_ctx) {
+  BlockDesc min_desc = Prune(desc_, targets);
+  Block min_block(min_desc);
+  min_block.Run(scope, dev_ctx);
+}
+```
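
The design declares `Prune` but does not define it. One possible reading is a backward walk over the operator list that keeps only the operators whose outputs are needed. The sketch below assumes that each operator is a `(name, inputs, outputs)` tuple and that the list is already in execution order. Both are illustrative assumptions, not the actual `OpDesc` layout.

```python
def prune(ops, targets):
    """Keep only the operators needed to compute `targets`.

    `ops` is a list of (name, inputs, outputs) tuples in execution order.
    """
    needed = set(targets)
    kept = []
    # Walk backward so that a kept operator can mark its own inputs as needed
    # before the operators producing those inputs are visited.
    for name, inputs, outputs in reversed(ops):
        if needed.intersection(outputs):
            kept.append((name, inputs, outputs))
            needed.update(inputs)
    kept.reverse()  # restore execution order
    return kept


# Operators of the RNN step net from the example above.
ops = [
    ("matmul_1", ["W", "x"], ["fc_out"]),
    ("matmul_2", ["U", "h"], ["hidden_out"]),
    ("add_two", ["fc_out", "hidden_out"], ["sum"]),
    ("sigmoid", ["sum"], ["act"]),
]

only_hidden = prune(ops, ["hidden_out"])
all_for_act = prune(ops, ["act"])
```

Asking for `hidden_out` keeps a single matmul, whereas asking for `act` keeps the whole step net. Returning a description rather than a runnable object matches the note above that the pruned result should be easy to distribute.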
--- a/doc/design/if_else_op.md
+++ b/doc/design/if_else_op.md
-IfOp should have only one branch. An IfOp operator takes a `cond` variable whose value must be a vector of N boolean elements. Its return value has M (M<=N) instances, each corresponds to a true element in `cond`.
-
-```python
-import paddle as pd
-
-x = var()
-y = var()
-cond = var()
-
-b = pd.create_ifop(inputs=[x], output_num=1)
-with b.true_block():
-    x = b.inputs(0)
-    z = operator.add(x, y)
-    b.set_output(0, operator.softmax(z))
-
-out = b(cond)
-```
-
-If we want the output still has N instances, we can use IfElseOp with a default value, whose minibatch size must be N:
+IfOp should have only one branch. An IfOp operator takes a `cond` variable whose value must be a vector of N boolean elements. Its return value has N instances. If cond[i] == True, input instance input[i] will go through true_block() and generate output[i]; otherwise it will produce output from false_block().

 ```python
 import paddle as pd
@@ -39,7 +21,7 @@ with b.false_block():
 out = b(cond)
 ```

-If only true_block is set in an IfElseOp, we can have a default value for false as:
+As a special case, if only true_block is set in an IfElseOp, we can have a default value for the false branch:
 ```python
 import paddle as pd


--- a/paddle/cuda/include/hl_tensor_ops.h
+++ b/paddle/cuda/include/hl_tensor_ops.h
@@ -461,7 +461,7 @@ class add<float32x4_t> {
 public:
  INLINE float32x4_t operator()(const float32x4_t a,
                                const float32x4_t b) const {
-    return vmulq_f32(a, b);
+    return vaddq_f32(a, b);
  }
 };


--- a/paddle/framework/tensor_impl.h
+++ b/paddle/framework/tensor_impl.h
@@ -22,7 +22,7 @@ namespace framework {
 template <typename T>
 inline void Tensor::check_memory_size() const {
  PADDLE_ENFORCE_NOT_NULL(
-      holder_, "Tenosr holds no memory. Call Tensor::mutable_data first.");
+      holder_, "Tensor holds no memory. Call Tensor::mutable_data first.");
  PADDLE_ENFORCE_GE(
      holder_->size(), numel() * sizeof(T) + offset_,
      "Tensor's dims_ is out of bound. Call Tensor::mutable_data "

--- a/paddle/framework/tensor_test.cc
+++ b/paddle/framework/tensor_test.cc
@@ -36,7 +36,7 @@ TEST(Tensor, DataAssert) {
  } catch (paddle::platform::EnforceNotMet err) {
    caught = true;
    std::string msg =
-        "holder_ should not be null\nTenosr holds no memory. Call "
+        "holder_ should not be null\nTensor holds no memory. Call "
        "Tensor::mutable_data first.";
    const char* what = err.what();
    for (size_t i = 0; i < msg.length(); ++i) {
@@ -112,7 +112,7 @@ TEST(Tensor, ShareDataWith) {
    } catch (paddle::platform::EnforceNotMet err) {
      caught = true;
      std::string msg =
-          "holder_ should not be null\nTenosr holds no memory. Call "
+          "holder_ should not be null\nTensor holds no memory. Call "
          "Tensor::mutable_data first.";
      const char* what = err.what();
      for (size_t i = 0; i < msg.length(); ++i) {
@@ -274,4 +274,4 @@ TEST(Tensor, ReshapeToMatrix) {
  Tensor res = ReshapeToMatrix<int>(src, 2);
  ASSERT_EQ(res.dims()[0], 2 * 3);
  ASSERT_EQ(res.dims()[1], 4 * 9);
-}
\ No newline at end of file
+}
--- a/paddle/gserver/layers/ExpandConvBaseLayer.cpp
+++ b/paddle/gserver/layers/ExpandConvBaseLayer.cpp
-/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License. */
-
-#include "ExpandConvBaseLayer.h"
-
-#include "paddle/utils/Logging.h"
-namespace paddle {
-
-bool ExpandConvBaseLayer::init(const LayerMap &layerMap,
-                               const ParameterMap &parameterMap) {
-  /* Initialize the basic convolutional parent class */
-  ConvBaseLayer::init(layerMap, parameterMap);
-
-  int index = 0;
-  for (auto &inputConfig : config_.inputs()) {
-    const ConvConfig &conf = inputConfig.conv_conf();
-    /* Consistent caffe mode for multiple input */
-    caffeMode_ = conf.caffe_mode();
-
-    // create a new weight
-    size_t height, width;
-    height = filterPixels_[index] * filterChannels_[index];
-    width = (!isDeconv_) ? numFilters_ : channels_[index];
-    CHECK_EQ(parameters_[index]->getSize(), width * height);
-    Weight *w = new Weight(height, width, parameters_[index]);
-    weights_.emplace_back(w);
-    index++;
-  }
-  if (biasParameter_.get()) {
-    if (sharedBiases_) {
-      CHECK_EQ((size_t)numFilters_, biasParameter_->getSize());
-      biases_ =
-          std::unique_ptr<Weight>(new Weight(numFilters_, 1, biasParameter_));
-    } else {
-      biases_ =
-          std::unique_ptr<Weight>(new Weight(getSize(), 1, biasParameter_));
-    }
-  }
-  getOutputSize();
-
-  return true;
-}
-
-size_t ExpandConvBaseLayer::getOutputSize() {
-  CHECK_NE(inputLayers_.size(), 0UL);
-  size_t layerSize = ConvBaseLayer::calOutputSize();
-  return layerSize;
-}
-
-void ExpandConvBaseLayer::addSharedBias() {
-  size_t mapW = getOutputSize() / numFilters_;
-  size_t mapH = getOutputValue()->getElementCnt() / mapW;
-  MatrixPtr out =
-      Matrix::create(getOutputValue()->getData(), mapH, mapW, false, useGpu_);
-
-  Matrix::resizeOrCreate(transOutValue_, mapW, mapH, false, useGpu_);
-
-  out->transpose(transOutValue_, false);  // false means no memory allocation
-  transOutValue_->reshape(transOutValue_->getElementCnt() / numFilters_,
-                          numFilters_);
-
-  MatrixPtr bias = Matrix::create(biases_->getW()->getData(),
-                                  1,
-                                  biases_->getW()->getElementCnt(),
-                                  false,
-                                  useGpu_);
-  transOutValue_->addBias(*bias, 1.0f);
-
-  transOutValue_->reshape(mapW, mapH);
-  transOutValue_->transpose(out, false);  // false means no memory allocation
-
-  out->clear();
-  bias->clear();
-}
-
-void ExpandConvBaseLayer::addUnsharedBias() {
-  MatrixPtr outValue = getOutputValue();
-  MatrixPtr bias = Matrix::create(biases_->getW()->getData(),
-                                  1,
-                                  biases_->getW()->getElementCnt(),
-                                  false,
-                                  useGpu_);
-  outValue->addBias(*bias, 1.0f);
-}
-
-void ExpandConvBaseLayer::bpropSharedBias(MatrixPtr biases, MatrixPtr v) {
-  size_t mapW = getOutputSize() / numFilters_;
-  size_t mapH = v->getElementCnt() / mapW;
-  MatrixPtr vTmp = Matrix::create(v->getData(), mapH, mapW, false, useGpu_);
-
-  Matrix::resizeOrCreate(transOutValue_, mapW, mapH, false, useGpu_);
-
-  vTmp->transpose(transOutValue_, false);  // false means no memory allocation
-  transOutValue_->reshape(transOutValue_->getElementCnt() / numFilters_,
-                          numFilters_);
-  biases->collectBias(*transOutValue_, 1.0f);
-}
-
-void ExpandConvBaseLayer::bpropBiases(MatrixPtr v) {
-  MatrixPtr biases = Matrix::create(biases_->getWGrad()->getData(),
-                                    1,
-                                    biases_->getWGrad()->getElementCnt(),
-                                    false,
-                                    useGpu_);
-  if (sharedBiases_) {
-    bpropSharedBias(biases, v);
-  } else {
-    biases->collectBias(*v, 1.0f);
-  }
-  biases->clear();
-}
-
-}  // namespace paddle
--- a/paddle/gserver/layers/ExpandConvLayer.cpp
+++ b/paddle/gserver/layers/ExpandConvLayer.cpp
@@ -36,7 +36,36 @@ inline bool isDepthwiseConv(int channels, int groups) {
 bool ExpandConvLayer::init(const LayerMap &layerMap,
                           const ParameterMap &parameterMap) {
  /* Initialize the basic convolutional parent class */
-  ExpandConvBaseLayer::init(layerMap, parameterMap);
+  ConvBaseLayer::init(layerMap, parameterMap);
+
+  int index = 0;
+  for (auto &inputConfig : config_.inputs()) {
+    const ConvConfig &conf = inputConfig.conv_conf();
+    /* Consistent caffe mode for multiple input */
+    caffeMode_ = conf.caffe_mode();
+
+    // create a new weight
+    size_t height, width;
+    height = filterPixels_[index] * filterChannels_[index];
+    width = (!isDeconv_) ? numFilters_ : channels_[index];
+    CHECK_EQ(parameters_[index]->getSize(), width * height);
+    Weight *w = new Weight(height, width, parameters_[index]);
+    weights_.emplace_back(w);
+    index++;
+  }
+
+  if (biasParameter_.get()) {
+    if (sharedBiases_) {
+      CHECK_EQ((size_t)numFilters_, biasParameter_->getSize());
+      biases_ = std::unique_ptr<Weight>(
+          new Weight(1, numFilters_, biasParameter_, 0));
+    } else {
+      biases_ =
+          std::unique_ptr<Weight>(new Weight(1, getSize(), biasParameter_, 0));
+    }
+  }
+
+  getOutputSize();

  size_t numInputs = config_.inputs_size();
  inputShape_.resize(numInputs);
@@ -108,6 +137,12 @@ bool ExpandConvLayer::init(const LayerMap &layerMap,
  return true;
 }

+size_t ExpandConvLayer::getOutputSize() {
+  CHECK_NE(inputLayers_.size(), 0UL);
+  size_t layerSize = ConvBaseLayer::calOutputSize();
+  return layerSize;
+}
+
 // i is the index of input layers
 #define BACKWARD_INPUT(i, inputs, outputs) \
  backward_[2 * i]->calc(inputs, outputs)
@@ -155,11 +190,7 @@ void ExpandConvLayer::forward(PassType passType) {

  /* add the bias-vector */
  if (biases_.get()) {
-    if (sharedBiases_) {
-      addSharedBias();
-    } else {
-      addUnsharedBias();
-    }
+    output_.value->addBias(*biases_->getW(), 1.0, sharedBiases_);
  }

  /* activation */
@@ -171,7 +202,7 @@ void ExpandConvLayer::backward(const UpdateCallback &callback) {

  MatrixPtr outGrad = getOutputGrad();
  if (biases_ && biases_->getWGrad()) {
-    bpropBiases(outGrad);
+    biases_->getWGrad()->collectBias(*getOutputGrad(), 1, sharedBiases_);
    /* Increasing the number of gradient */
    biases_->getParameterPtr()->incUpdate(callback);
  }

--- a/paddle/gserver/layers/ExpandConvLayer.h
+++ b/paddle/gserver/layers/ExpandConvLayer.h
@@ -15,7 +15,7 @@ limitations under the License. */
 #pragma once

 #include <vector>
-#include "ExpandConvBaseLayer.h"
+#include "ConvBaseLayer.h"
 #include "paddle/math/Matrix.h"

 namespace paddle {
@@ -28,10 +28,9 @@ namespace paddle {
 * The config file api is img_conv_layer.
 */

-class ExpandConvLayer : public ExpandConvBaseLayer {
+class ExpandConvLayer : public ConvBaseLayer {
 public:
-  explicit ExpandConvLayer(const LayerConfig& config)
-      : ExpandConvBaseLayer(config) {}
+  explicit ExpandConvLayer(const LayerConfig& config) : ConvBaseLayer(config) {}

  ~ExpandConvLayer() {}

@@ -41,6 +40,8 @@ public:
  void forward(PassType passType) override;
  void backward(const UpdateCallback& callback) override;

+  size_t getOutputSize();
+
 protected:
  std::vector<TensorShape> inputShape_;
  std::vector<TensorShape> filterShape_;

--- a/paddle/gserver/layers/MKLDNNConvLayer.cpp
+++ b/paddle/gserver/layers/MKLDNNConvLayer.cpp
@@ -285,10 +285,9 @@ void MKLDNNConvLayer::resetWgtBiasValue(
  wgt = MKLDNNMatrix::create(weight_->getW(), pd->weights_primitive_desc());
  VLOG(MKLDNN_FMTS) << "Weight value format: " << wgt->getFormat();

-  bias = nullptr;
-  if (biases_ && biases_->getW()) {
-    bias = MKLDNNMatrix::create(biases_->getW(), pd->bias_primitive_desc());
-  }
+  bias = (biases_ && biases_->getW())
+             ? MKLDNNMatrix::create(biases_->getW(), pd->bias_primitive_desc())
+             : nullptr;
 }

 void MKLDNNConvLayer::resetOutValue(
@@ -356,6 +355,7 @@ void MKLDNNConvLayer::resetBwdWgtPD(

 void MKLDNNConvLayer::resetBwdDataPD(
    std::shared_ptr<conv_bwdData::primitive_desc>& pd) {
+  pd = nullptr;
  if (inputLayers_[0]->getOutput().grad == nullptr) {
    return;
  }
@@ -476,6 +476,7 @@ void MKLDNNConvLayer::resetWgtBiasGrad(
      << "primitive desc of weight grad and value should be equal";
  VLOG(MKLDNN_FMTS) << "weight grad format: " << wgt->getFormat();

+  bias = nullptr;
  if (biasVal_ == nullptr) {
    return;
  }

--- a/paddle/gserver/layers/MKLDNNFcLayer.cpp
+++ b/paddle/gserver/layers/MKLDNNFcLayer.cpp
@@ -17,9 +17,6 @@ limitations under the License. */

 using namespace mkldnn;  // NOLINT
 typedef memory::format format;
-typedef inner_product_forward fc_fwd;
-typedef inner_product_backward_weights fc_bwdWgt;
-typedef inner_product_backward_data fc_bwdData;

 namespace paddle {

@@ -93,35 +90,88 @@ void MKLDNNFcLayer::reshape(
  printSizeInfo();
 }

-void MKLDNNFcLayer::resetFwd(std::vector<mkldnn::primitive>& pipeline,
+void MKLDNNFcLayer::resetFwd(std::vector<primitive>& pipeline,
                             MKLDNNMatrixPtr& in,
                             MKLDNNMatrixPtr& wgt,
                             MKLDNNMatrixPtr& bias,
                             MKLDNNMatrixPtr& out) {
-  pipeline.clear();
-  bool hasBias = biases_ && biases_->getW();
-  const MatrixPtr& wgtVal = weight_->getW();
-  const MatrixPtr& biasVal = hasBias ? biases_->getW() : nullptr;
-  const MatrixPtr& outVal = output_.value;
+  resetFwdBuffers(in, wgt, bias, out);
+
+  resetFwdPD(fwdPD_, in, wgt, bias, out);
+
+  resetFwdPipeline(pipeline, fwdPD_, in, wgt, bias, out);
+
+  printValueFormatFlow();
+}
+
+void MKLDNNFcLayer::resetBwd(std::vector<primitive>& pipeline,
+                             MKLDNNMatrixPtr& in,
+                             MKLDNNMatrixPtr& wgt,
+                             MKLDNNMatrixPtr& bias,
+                             MKLDNNMatrixPtr& out) {
+  std::shared_ptr<fc_bwdWgt::primitive_desc> bwdWgtPD;
+  std::shared_ptr<fc_bwdData::primitive_desc> bwdDataPD;
+
+  resetBwdBuffers(in, wgt, bias, out);
+
+  resetBwdWgtPD(bwdWgtPD, wgt, bias, out);
+
+  resetBwdDataPD(bwdDataPD, in, out);
+
+  resetBwdPipeline(pipeline, bwdWgtPD, bwdDataPD, in, wgt, bias, out);
+
+  printGradFormatFlow();
+}
+
+void MKLDNNFcLayer::updateInputData() {
+  inVal_->setData(getInputValue(0, CPU_DEVICE)->getData());
+}

+void MKLDNNFcLayer::updateWeights(const UpdateCallback& callback) {
+  weight_->getParameterPtr()->incUpdate(callback);
+  if (biases_ && biases_->getWGrad()) {
+    biases_->getParameterPtr()->incUpdate(callback);
+  }
+}
+
+void MKLDNNFcLayer::resetFwdBuffers(MKLDNNMatrixPtr& in,
+                                    MKLDNNMatrixPtr& wgt,
+                                    MKLDNNMatrixPtr& bias,
+                                    MKLDNNMatrixPtr& out) {
+  resetInValue(in);
+
+  resetWgtBiasValue(wgt, bias);
+
+  resetOutValue(out);
+}
+
+void MKLDNNFcLayer::resetInValue(MKLDNNMatrixPtr& in) {
  if (inputIsOnlyMKLDNN()) {
-    const MatrixPtr& inVal = getInputValue(0);
-    in = std::dynamic_pointer_cast<MKLDNNMatrix>(inVal);
+    const MatrixPtr& dnnIn = getInputValue(0);
+    in = std::dynamic_pointer_cast<MKLDNNMatrix>(dnnIn);
    CHECK(in) << "Input should be MKLDNNMatrix";
  } else {
    CHECK_EQ(getPrev(0)->getDeviceId(), CPU_DEVICE) << "Only support CPU yet";
-    const MatrixPtr& inVal = getInputValue(0, CPU_DEVICE);
+    const MatrixPtr& cpuIn = getInputValue(0, CPU_DEVICE);
    in = MKLDNNMatrix::create(
-        inVal, memory::dims{bs_, ic_, ih_, iw_}, format::nchw, engine_);
+        cpuIn, {bs_, ic_, ih_, iw_}, format::nchw, engine_);
  }
  in->downSpatial();
+}
+
+void MKLDNNFcLayer::resetWgtBiasValue(MKLDNNMatrixPtr& wgt,
+                                      MKLDNNMatrixPtr& bias) {
  wgt = MKLDNNMatrix::create(
-      wgtVal, memory::dims{oc_, ic_, ih_, iw_}, format::oihw, engine_);
+      weight_->getW(), {oc_, ic_, ih_, iw_}, format::oihw, engine_);
  wgt->downSpatial();
-  bias = hasBias ? MKLDNNMatrix::create(biasVal, {oc_}, format::x, engine_)
-                 : nullptr;
-  out = MKLDNNMatrix::create(outVal, {bs_, oc_}, format::nc, engine_);

+  bias = (biases_ && biases_->getW())
+             ? MKLDNNMatrix::create(biases_->getW(), {oc_}, format::x, engine_)
+             : nullptr;
+}
+
+void MKLDNNFcLayer::resetOutValue(MKLDNNMatrixPtr& out) {
+  out = MKLDNNMatrix::create(output_.value, {bs_, oc_}, format::nc, engine_);
  // change original output value to mkldnn output value
  output_.value = std::dynamic_pointer_cast<Matrix>(out);
  if (!outputIsOnlyMKLDNN()) {
@@ -129,46 +179,59 @@ void MKLDNNFcLayer::resetFwd(std::vector<mkldnn::primitive>& pipeline,
    // just share point
    getOutput(CPU_DEVICE).value->setData(output_.value->getData());
  }
+}

-  // create forward handle
+void MKLDNNFcLayer::resetFwdPD(std::shared_ptr<fc_fwd::primitive_desc>& pd,
+                               MKLDNNMatrixPtr in,
+                               MKLDNNMatrixPtr wgt,
+                               MKLDNNMatrixPtr bias,
+                               MKLDNNMatrixPtr out) {
+  CHECK(in);
+  CHECK(wgt);
+  CHECK(out);
  prop_kind pk = prop_kind::forward;
-  fc_fwd::desc fwdDesc = hasBias ? fc_fwd::desc(pk,
-                                                in->getMemoryDesc(),
-                                                wgt->getMemoryDesc(),
-                                                bias->getMemoryDesc(),
-                                                out->getMemoryDesc())
-                                 : fc_fwd::desc(pk,
-                                                in->getMemoryDesc(),
-                                                wgt->getMemoryDesc(),
-                                                out->getMemoryDesc());
-  fc_fwd::primitive_desc fwdPD = fc_fwd::primitive_desc(fwdDesc, engine_);
-  if (hasBias) {
-    fwd_.reset(new fc_fwd(fwdPD, *in, *wgt, *bias, *out));
+  fc_fwd::desc fwdDesc = bias != nullptr ? fc_fwd::desc(pk,
+                                                        in->getMemoryDesc(),
+                                                        wgt->getMemoryDesc(),
+                                                        bias->getMemoryDesc(),
+                                                        out->getMemoryDesc())
+                                         : fc_fwd::desc(pk,
+                                                        in->getMemoryDesc(),
+                                                        wgt->getMemoryDesc(),
+                                                        out->getMemoryDesc());
+  pd.reset(new fc_fwd::primitive_desc(fwdDesc, engine_));
+}
+
+void MKLDNNFcLayer::resetFwdPipeline(
+    std::vector<primitive>& pipeline,
+    std::shared_ptr<fc_fwd::primitive_desc>& pd,
+    MKLDNNMatrixPtr& in,
+    MKLDNNMatrixPtr& wgt,
+    MKLDNNMatrixPtr& bias,
+    MKLDNNMatrixPtr& out) {
+  pipeline.clear();
+
+  if (bias) {
+    fwd_.reset(new fc_fwd(*pd, *in, *wgt, *bias, *out));
  } else {
-    fwd_.reset(new fc_fwd(fwdPD, *in, *wgt, *out));
+    fwd_.reset(new fc_fwd(*pd, *in, *wgt, *out));
  }
-  printValueFormatFlow();

  pipeline.push_back(*fwd_);
 }

-void MKLDNNFcLayer::resetBwd(std::vector<mkldnn::primitive>& pipeline,
-                             MKLDNNMatrixPtr& in,
-                             MKLDNNMatrixPtr& wgt,
-                             MKLDNNMatrixPtr& bias,
-                             MKLDNNMatrixPtr& out) {
-  pipeline.clear();
-  if (!needResetBwd_) {
-    return;
-  }
-  needResetBwd_ = false;
-  bool hasBias = biases_ && biases_->getWGrad();
+void MKLDNNFcLayer::resetBwdBuffers(MKLDNNMatrixPtr& in,
+                                    MKLDNNMatrixPtr& wgt,
+                                    MKLDNNMatrixPtr& bias,
+                                    MKLDNNMatrixPtr& out) {
+  resetOutGrad(out);
+
+  resetWgtBiasGrad(wgt, bias);

-  /// backward weight
-  CHECK(inVal_) << "Should have input value";
-  const MatrixPtr& wgtGrad = weight_->getWGrad();
-  const MatrixPtr& biasGrad = hasBias ? biases_->getWGrad() : nullptr;
+  resetInGrad(in);
+}

+void MKLDNNFcLayer::resetOutGrad(MKLDNNMatrixPtr& out) {
  // TODO(TJ): merge outgrad
  int device = outputIsOnlyMKLDNN() ? MKLDNN_DEVICE : CPU_DEVICE;
  // for MKLDNN device:
@@ -178,66 +241,88 @@ void MKLDNNFcLayer::resetBwd(std::vector<mkldnn::primitive>& pipeline,
  // for CPU device:
  // fc do not need to convert from cpu device since output is always nc format
  // only need create from cpu device
-  const MatrixPtr& outGrad = getOutput(device).grad;
-  out = MKLDNNMatrix::create(outGrad, outVal_->getPrimitiveDesc());
-  wgt = MKLDNNMatrix::create(wgtGrad, wgtVal_->getPrimitiveDesc());
-  bias = hasBias ? MKLDNNMatrix::create(biasGrad, biasVal_->getPrimitiveDesc())
-                 : nullptr;
-
-  // create memory primitive desc
-  fc_fwd::desc fwdDesc = fc_fwd::desc(prop_kind::forward,
-                                      inVal_->getMemoryDesc(),
-                                      wgt->getMemoryDesc(),
-                                      out->getMemoryDesc());
-  fc_fwd::primitive_desc fwdPD = fc_fwd::primitive_desc(fwdDesc, engine_);
-  fc_bwdWgt::desc bwdWgtDesc = hasBias
-                                   ? fc_bwdWgt::desc(inVal_->getMemoryDesc(),
-                                                     wgt->getMemoryDesc(),
-                                                     bias->getMemoryDesc(),
-                                                     out->getMemoryDesc())
-                                   : fc_bwdWgt::desc(inVal_->getMemoryDesc(),
-                                                     wgt->getMemoryDesc(),
-                                                     out->getMemoryDesc());
-  fc_bwdWgt::primitive_desc bwdWgtPD =
-      fc_bwdWgt::primitive_desc(bwdWgtDesc, engine_, fwdPD);
-
-  if (hasBias) {
-    bwdWgt_.reset(new fc_bwdWgt(bwdWgtPD, *inVal_, *out, *wgt, *bias));
-  } else {
-    bwdWgt_.reset(new fc_bwdWgt(bwdWgtPD, *inVal_, *out, *wgt));
+  CHECK(outVal_);
+  out =
+      MKLDNNMatrix::create(getOutput(device).grad, outVal_->getPrimitiveDesc());
+}
+
+void MKLDNNFcLayer::resetWgtBiasGrad(MKLDNNMatrixPtr& wgt,
+                                     MKLDNNMatrixPtr& bias) {
+  CHECK(wgtVal_);
+  wgt = MKLDNNMatrix::create(weight_->getWGrad(), wgtVal_->getPrimitiveDesc());
+
+  bias = nullptr;
+  if (biasVal_ == nullptr) {
+    return;
  }
-  pipeline.push_back(*bwdWgt_);
+  bias =
+      MKLDNNMatrix::create(biases_->getWGrad(), biasVal_->getPrimitiveDesc());
+}

-  /// backward data
+void MKLDNNFcLayer::resetInGrad(MKLDNNMatrixPtr& in) {
+  in = nullptr;
  const MatrixPtr& inGrad = inputLayers_[0]->getOutput().grad;
  if (inGrad == nullptr) {
    return;
  }
-  if (getInput(0, MKLDNN_DEVICE).getAllCount() > 1) {
-    // TODO(TJ): use outputMaps_ ways to get the inGrad_ when merge outgrad done
-  } else {
-    in = MKLDNNMatrix::create(inGrad, inVal_->getPrimitiveDesc());
-  }
-
-  fc_bwdData::desc bwdDataDesc = fc_bwdData::desc(
-      inVal_->getMemoryDesc(), wgt->getMemoryDesc(), out->getMemoryDesc());
-  fc_bwdData::primitive_desc bwdDataPD =
-      fc_bwdData::primitive_desc(bwdDataDesc, engine_, fwdPD);
+  // TODO(TJ): get inGrad_ through outputMaps_ once merging of outgrad is done
+  CHECK(inVal_);
+  in = MKLDNNMatrix::create(inGrad, inVal_->getPrimitiveDesc());
+}

-  CHECK(wgtVal_) << "Should have weight memory";
-  bwdData_.reset(new fc_bwdData(bwdDataPD, *out, *wgtVal_, *in));
-  printGradFormatFlow();
-  pipeline.push_back(*bwdData_);
+void MKLDNNFcLayer::resetBwdWgtPD(
+    std::shared_ptr<fc_bwdWgt::primitive_desc>& pd,
+    MKLDNNMatrixPtr& wgt,
+    MKLDNNMatrixPtr& bias,
+    MKLDNNMatrixPtr& out) {
+  CHECK(inVal_);
+  fc_bwdWgt::desc bwdWgtDesc = bias ? fc_bwdWgt::desc(inVal_->getMemoryDesc(),
+                                                      wgt->getMemoryDesc(),
+                                                      bias->getMemoryDesc(),
+                                                      out->getMemoryDesc())
+                                    : fc_bwdWgt::desc(inVal_->getMemoryDesc(),
+                                                      wgt->getMemoryDesc(),
+                                                      out->getMemoryDesc());
+  pd.reset(new fc_bwdWgt::primitive_desc(bwdWgtDesc, engine_, *fwdPD_));
 }

-void MKLDNNFcLayer::updateInputData() {
-  inVal_->setData(getInputValue(0, CPU_DEVICE)->getData());
+void MKLDNNFcLayer::resetBwdDataPD(
+    std::shared_ptr<fc_bwdData::primitive_desc>& pd,
+    MKLDNNMatrixPtr& in,
+    MKLDNNMatrixPtr& out) {
+  pd = nullptr;
+  if (in == nullptr) {
+    return;
+  }
+  CHECK(wgtVal_);
+  fc_bwdData::desc bwdDataDesc = fc_bwdData::desc(
+      in->getMemoryDesc(), wgtVal_->getMemoryDesc(), out->getMemoryDesc());
+  pd.reset(new fc_bwdData::primitive_desc(bwdDataDesc, engine_, *fwdPD_));
 }

-void MKLDNNFcLayer::updateWeights(const UpdateCallback& callback) {
-  weight_->getParameterPtr()->incUpdate(callback);
-  if (biases_ && biases_->getWGrad()) {
-    biases_->getParameterPtr()->incUpdate(callback);
+void MKLDNNFcLayer::resetBwdPipeline(
+    std::vector<primitive>& pipeline,
+    std::shared_ptr<fc_bwdWgt::primitive_desc>& bwdWgtPD,
+    std::shared_ptr<fc_bwdData::primitive_desc>& bwdDataPD,
+    MKLDNNMatrixPtr& in,
+    MKLDNNMatrixPtr& wgt,
+    MKLDNNMatrixPtr& bias,
+    MKLDNNMatrixPtr& out) {
+  pipeline.clear();
+  CHECK(inVal_);
+  if (bias) {
+    bwdWgt_.reset(new fc_bwdWgt(*bwdWgtPD, *inVal_, *out, *wgt, *bias));
+  } else {
+    bwdWgt_.reset(new fc_bwdWgt(*bwdWgtPD, *inVal_, *out, *wgt));
+  }
+  pipeline.push_back(*bwdWgt_);
+
+  if (bwdDataPD == nullptr) {
+    return;
  }
+  CHECK(wgtVal_) << "Should have weight memory";
+  bwdData_.reset(new fc_bwdData(*bwdDataPD, *out, *wgtVal_, *in));
+  pipeline.push_back(*bwdData_);
 }
+
 }  // namespace paddle
--- a/paddle/gserver/layers/MKLDNNFcLayer.h
+++ b/paddle/gserver/layers/MKLDNNFcLayer.h
@@ -18,6 +18,9 @@ limitations under the License. */
 #include "mkldnn.hpp"

 namespace paddle {
+typedef mkldnn::inner_product_forward fc_fwd;
+typedef mkldnn::inner_product_backward_weights fc_bwdWgt;
+typedef mkldnn::inner_product_backward_data fc_bwdData;

 /**
 * @brief A subclass of MKLDNNLayer fc layer.
@@ -32,6 +35,9 @@ protected:
  // if has already init the weight
  bool hasInitedWgt_;

+  // Save the forward primitive_desc so that the backward pass can reuse it.
+  std::shared_ptr<fc_fwd::primitive_desc> fwdPD_;
+
  // fc weight and bias
  std::unique_ptr<Weight> weight_;
  std::unique_ptr<Weight> biases_;
@@ -67,6 +73,59 @@ public:
  void convertWeightsFromPaddle() override;

  void convertWeightsToPaddle() override;
+
+protected:
+  /**
+   * Forward functions: reset buffers(input, output, weight and bias),
+   *                    reset primitive descriptor,
+   *                    reset pipeline.
+   */
+  void resetFwdBuffers(MKLDNNMatrixPtr& in,
+                       MKLDNNMatrixPtr& wgt,
+                       MKLDNNMatrixPtr& bias,
+                       MKLDNNMatrixPtr& out);
+  void resetInValue(MKLDNNMatrixPtr& in);
+  void resetWgtBiasValue(MKLDNNMatrixPtr& wgt, MKLDNNMatrixPtr& bias);
+  void resetOutValue(MKLDNNMatrixPtr& out);
+  void resetFwdPD(std::shared_ptr<fc_fwd::primitive_desc>& pd,
+                  MKLDNNMatrixPtr in,
+                  MKLDNNMatrixPtr wgt,
+                  MKLDNNMatrixPtr bias,
+                  MKLDNNMatrixPtr out);
+  void resetFwdPipeline(std::vector<mkldnn::primitive>& pipeline,
+                        std::shared_ptr<fc_fwd::primitive_desc>& pd,
+                        MKLDNNMatrixPtr& in,
+                        MKLDNNMatrixPtr& wgt,
+                        MKLDNNMatrixPtr& bias,
+                        MKLDNNMatrixPtr& out);
+
+  /**
+   * Backward functions: reset buffers(input, output, weight and bias),
+   *                     reset primitive descriptor for backward weight,
+   *                     reset primitive descriptor for backward data,
+   *                     reset pipeline.
+   */
+  void resetBwdBuffers(MKLDNNMatrixPtr& in,
+                       MKLDNNMatrixPtr& wgt,
+                       MKLDNNMatrixPtr& bias,
+                       MKLDNNMatrixPtr& out);
+  void resetOutGrad(MKLDNNMatrixPtr& out);
+  void resetWgtBiasGrad(MKLDNNMatrixPtr& wgt, MKLDNNMatrixPtr& bias);
+  void resetInGrad(MKLDNNMatrixPtr& in);
+  void resetBwdWgtPD(std::shared_ptr<fc_bwdWgt::primitive_desc>& pd,
+                     MKLDNNMatrixPtr& wgt,
+                     MKLDNNMatrixPtr& bias,
+                     MKLDNNMatrixPtr& out);
+  void resetBwdDataPD(std::shared_ptr<fc_bwdData::primitive_desc>& pd,
+                      MKLDNNMatrixPtr& in,
+                      MKLDNNMatrixPtr& out);
+  void resetBwdPipeline(std::vector<mkldnn::primitive>& pipeline,
+                        std::shared_ptr<fc_bwdWgt::primitive_desc>& bwdWgtPD,
+                        std::shared_ptr<fc_bwdData::primitive_desc>& bwdDataPD,
+                        MKLDNNMatrixPtr& in,
+                        MKLDNNMatrixPtr& wgt,
+                        MKLDNNMatrixPtr& bias,
+                        MKLDNNMatrixPtr& out);
 };

 }  // namespace paddle
--- a/paddle/math/BaseMatrix.cu
+++ b/paddle/math/BaseMatrix.cu
@@ -17,6 +17,7 @@ limitations under the License. */
 #include <cmath>
 #include "BaseMatrix.h"
 #include "MathFunctions.h"
+#include "NEONFunctions.h"
 #include "SIMDFunctions.h"
 #include "hl_matrix_apply.cuh"
 #include "hl_matrix_base.cuh"
@@ -666,6 +667,13 @@ void BaseMatrixT<T>::relu(BaseMatrixT& b) {
  applyBinary(binary::Relu<T>(), b);
 }

+#if defined(__ARM_NEON__) || defined(__ARM_NEON)
+template <>
+void BaseMatrixT<float>::relu(BaseMatrixT& b) {
+  neon::relu(data_, b.data_, height_ * width_);
+}
+#endif
+
 DEFINE_MATRIX_BINARY_OP(ReluDerivative, a *= (b > 0.0f ? 1.0f : 0.0f));
 template <class T>
 void BaseMatrixT<T>::reluDerivative(BaseMatrixT& b) {

--- a/paddle/math/MKLDNNMatrix.h
+++ b/paddle/math/MKLDNNMatrix.h
@@ -66,11 +66,12 @@ public:
  /**
   * Create reorder primitive.
   * Create a mkldnn::reorder handle for converting src MKLDNNMatrix to dst.
-   * checkData: for whether to check the data handle of src and dst is the same.
-   *            if true, means check it and do not want support inplace reorder;
-   *            otherwise do not check data which means the created reorder
-   *            maybe inplace buffer and do not guarantee the logical is correct
-   *            since not all format or conversion support inplace.
+   * checkData: whether to check the data handles of src and dst.
+   *            If true, check them and do not allow them to be equal;
+   *            otherwise, skip the check, so the created reorder
+   *            may operate on an in-place buffer.
+   *            Do not set it to false unless you can guarantee that
+   *            in-place operation is valid for your reorder.
   */
  static std::shared_ptr<mkldnn::reorder> createReorder(
      const MKLDNNMatrixPtr& src,

--- a/paddle/math/NEONFunctions.cpp
+++ b/paddle/math/NEONFunctions.cpp
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#if defined(__ARM_NEON__) || defined(__ARM_NEON)
+
+#include "NEONFunctions.h"
+#include <arm_neon.h>
+
+namespace paddle {
+namespace neon {
+
+// b[i] = a[i] > 0.0f ? a[i] : 0.0f
+void relu(const float* a, float* b, int len) {
+  int offset = len % 16;
+  float32x4_t ma0, ma1, ma2, ma3;
+  float32x4_t mb0, mb1, mb2, mb3;
+
+  float32x4_t zero = vdupq_n_f32(0.f);
+  for (int k = 0; k < len / 16; k++, a += 16, b += 16) {
+    ma0 = vld1q_f32(a);
+    ma1 = vld1q_f32(a + 4);
+    ma2 = vld1q_f32(a + 8);
+    ma3 = vld1q_f32(a + 12);
+
+    mb0 = vmaxq_f32(ma0, zero);
+    mb1 = vmaxq_f32(ma1, zero);
+    mb2 = vmaxq_f32(ma2, zero);
+    mb3 = vmaxq_f32(ma3, zero);
+
+    vst1q_f32(b, mb0);
+    vst1q_f32(b + 4, mb1);
+    vst1q_f32(b + 8, mb2);
+    vst1q_f32(b + 12, mb3);
+  }
+
+  for (int i = 0; i < offset; i++) {
+    b[i] = a[i] > 0.0f ? a[i] : 0.0f;
+  }
+}
+
+}  // namespace neon
+}  // namespace paddle
+
+#endif
--- a/paddle/gserver/layers/ExpandConvBaseLayer.h
+++ b/paddle/gserver/layers/ExpandConvBaseLayer.h
@@ -14,44 +14,10 @@ limitations under the License. */

 #pragma once

-#include <vector>
-#include "ConvBaseLayer.h"
-#include "paddle/math/Matrix.h"
-
 namespace paddle {
+namespace neon {

-/**
- * @brief A subclass of ConvBaseLayer that is a superclass of both
- * ExpandConvLayer and ExpandConvTransLayer
- */
-class ExpandConvBaseLayer : public ConvBaseLayer {
-protected:
-  /// The transpose of output, which is an auxiliary matrix.
-  MatrixPtr transOutValue_;
-
-public:
-  explicit ExpandConvBaseLayer(const LayerConfig& config)
-      : ConvBaseLayer(config) {}
-
-  ~ExpandConvBaseLayer() {}
-
-  bool init(const LayerMap& layerMap,
-            const ParameterMap& parameterMap) override;
-
-  size_t getOutputSize();
-
-  /**
-   * Add shared bias.
-   */
-  void addSharedBias();
-
-  /**
-   * Add unshared bias.
-   */
-  void addUnsharedBias();
-
-  void bpropSharedBias(MatrixPtr biases, MatrixPtr v);
-  void bpropBiases(MatrixPtr v);
-};
+void relu(const float* a, float* b, int len);

+}  // namespace neon
 }  // namespace paddle
--- a/paddle/memory/memcpy.cc
+++ b/paddle/memory/memcpy.cc
@@ -62,6 +62,24 @@ void Copy<platform::GPUPlace, platform::GPUPlace>(platform::GPUPlace dst_place,
  }
 }

+template <>
+void Copy<platform::CPUPlace, platform::GPUPlace>(platform::CPUPlace dst_place,
+                                                  void* dst,
+                                                  platform::GPUPlace src_place,
+                                                  const void* src, size_t num) {
+  platform::SetDeviceId(src_place.device);
+  platform::GpuMemcpySync(dst, src, num, cudaMemcpyDeviceToHost);
+}
+
+template <>
+void Copy<platform::GPUPlace, platform::CPUPlace>(platform::GPUPlace dst_place,
+                                                  void* dst,
+                                                  platform::CPUPlace src_place,
+                                                  const void* src, size_t num) {
+  platform::SetDeviceId(dst_place.device);
+  platform::GpuMemcpySync(dst, src, num, cudaMemcpyHostToDevice);
+}
+
 #endif  // PADDLE_ONLY_CPU

 }  // namespace memory

--- a/paddle/operators/CMakeLists.txt
+++ b/paddle/operators/CMakeLists.txt
@@ -80,9 +80,11 @@ endfunction()
 add_subdirectory(math)

 set(DEPS_OPS
-    recurrent_op)
+    recurrent_op
+    cond_op)
 op_library(recurrent_op SRCS recurrent_op.cc rnn/recurrent_op_utils.cc
  DEPS framework_proto tensor net_op)
+op_library(cond_op SRCS cond_op.cc DEPS framework_proto tensor operator net_op)

 list(REMOVE_ITEM GENERAL_OPS ${DEPS_OPS})
 foreach(src ${GENERAL_OPS})

--- a/paddle/operators/accuracy_op.cc
+++ b/paddle/operators/accuracy_op.cc
@@ -23,10 +23,15 @@ class AccuracyOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
-    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Inference"),
-                            "Input of Inference must be initialized.");
+    PADDLE_ENFORCE_NOT_NULL(
+        ctx.InputVar("Inference"),
+        "Input(Inference) of AccuracyOp should not be null.");
    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Label"),
-                            "Input of Inference must be initialized.");
+                            "Input(Label) of AccuracyOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(
+        ctx.OutputVar("Accuracy"),
+        "Output(Accuracy) of AccuracyOp should not be null.");
+
    auto *inference = ctx.Input<framework::Tensor>("Inference");
    auto *label = ctx.Input<framework::Tensor>("Label");


--- a/paddle/operators/accuracy_op.cu
+++ b/paddle/operators/accuracy_op.cu
@@ -12,26 +12,38 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */

+#include <thrust/execution_policy.h>
+#include <thrust/reduce.h>
 #include "paddle/operators/accuracy_op.h"
+#include "paddle/platform/cuda_helper.h"

 namespace paddle {
 namespace operators {
+using platform::PADDLE_CUDA_NUM_THREADS;

-__global__ void AccuracySingleKernel(const int N, const int D, const int top_k,
-                                     const int* Xdata, const int* labelData,
-                                     float* accuracy) {
-  int correct = 0;
-  for (int row = 0; row < N; row++) {
-    const int label = labelData[row];
-    for (int col = 0; col < D; col++) {
-      const int pred = Xdata[row * D + col];
-      if (pred == label) {
-        ++correct;
+template <int BlockSize>
+__global__ void AccuracyCudaKernel(const int N, const int D, const int* Xdata,
+                                   const int* labeldata, float* accuracy) {
+  int count = 0;
+  __shared__ int total[BlockSize];
+
+  // support only 1 block
+  for (int i = threadIdx.x; i < (N); i += BlockSize) {
+    for (int j = 0; j < D; ++j) {
+      if (Xdata[i * D + j] == labeldata[i]) {
+        ++count;
        break;
      }
    }
  }
-  *accuracy = static_cast<float>(correct) / static_cast<float>(N);
+  total[threadIdx.x] = count;
+  __syncthreads();
+
+  // reduce the count with init value 0, and output accuracy.
+  int result = thrust::reduce(thrust::device, total, total + BlockSize, 0);
+  if (threadIdx.x == 0) {
+    *accuracy = static_cast<float>(result) / static_cast<float>(N);
+  }
 }

 template <typename T>
@@ -57,8 +69,8 @@ class AccuracyOpCUDAKernel : public framework::OpKernel {
      return;
    }

-    AccuracySingleKernel<<<1, 1>>>(num_samples, infer_width, 1, inference_data,
-                                   label_data, accuracy_data);
+    AccuracyCudaKernel<PADDLE_CUDA_NUM_THREADS><<<1, PADDLE_CUDA_NUM_THREADS>>>(
+        num_samples, infer_width, inference_data, label_data, accuracy_data);
  }
 };
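
Review note: the new kernel has each thread count matches for rows `threadIdx.x, threadIdx.x + BlockSize, ...`, stores the per-thread count in shared memory, and then reduces the counts. The following host-side sketch emulates that scheme sequentially so the indexing can be checked without a GPU. It is an illustration only; `accuracy_block_stride` and its parameters are hypothetical names, and it is not the CUDA code path.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// x holds N rows of d predicted labels; labels holds one true label per row.
float accuracy_block_stride(const std::vector<int>& x,
                            const std::vector<int>& labels,
                            std::size_t d, std::size_t block_size) {
  const std::size_t n = labels.size();
  assert(block_size > 0 && n > 0 && x.size() == n * d);

  // total[t] plays the role of the __shared__ array: one count per "thread".
  std::vector<int> total(block_size, 0);
  for (std::size_t t = 0; t < block_size; ++t) {
    // "Thread" t visits rows t, t + block_size, t + 2 * block_size, ...
    for (std::size_t i = t; i < n; i += block_size) {
      for (std::size_t j = 0; j < d; ++j) {
        if (x[i * d + j] == labels[i]) {
          ++total[t];
          break;  // count each row at most once
        }
      }
    }
  }

  // Reduce the partial counts, then normalize by the number of rows.
  const int correct = std::accumulate(total.begin(), total.end(), 0);
  return static_cast<float>(correct) / static_cast<float>(n);
}
```

The result should not depend on `block_size`, since every row is visited by exactly one emulated thread.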


--- a/paddle/operators/add_op.cc
+++ b/paddle/operators/add_op.cc
@@ -23,6 +23,13 @@ class AddOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"),
+                            "Input(X) of AddOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Y"),
+                            "Input(Y) of AddOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Out"),
+                            "Output(Out) of AddOp should not be null.");
+
    PADDLE_ENFORCE_EQ(ctx.Input<Tensor>("X")->dims(),
                      ctx.Input<Tensor>("Y")->dims(),
                      "Two input of Add Op's dimension must be same.");

--- a/paddle/operators/concat_op.cc
+++ b/paddle/operators/concat_op.cc
@@ -25,6 +25,9 @@ class ConcatOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Out"),
+                            "Output(Out) of ConcatOp should not be null.");
+
    auto ins = ctx.MultiInput<framework::Tensor>("X");
    auto *out = ctx.Output<framework::LoDTensor>("Out");
    size_t axis = static_cast<size_t>(ctx.Attr<int>("axis"));

--- a/paddle/operators/cond_op.cc
+++ b/paddle/operators/cond_op.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/operators/cond_op.h"
+
+#include <cstring>
+#include <sstream>
+
+#include "paddle/framework/op_registry.h"
+#include "paddle/operators/gather.h"
+#include "paddle/operators/net_op.h"
+#include "paddle/operators/scatter.h"
+
+namespace paddle {
+namespace operators {
+
+using Scope = framework::Scope;
+using Variable = framework::Variable;
+using Tensor = framework::Tensor;
+using LoDTensor = framework::LoDTensor;
+using DDim = framework::DDim;
+
+void CondOp::CreateScope(const Scope& scope) const {
+  auto sub_scopes_var = scope.FindVar("SubScopes");
+  PADDLE_ENFORCE_NOT_NULL(sub_scopes_var,
+                          "Output(SubScopes) of CondOp should not be null.");
+  auto sub_scopes = sub_scopes_var->GetMutable<std::vector<Scope*>>();
+  auto& sub_scope = scope.NewScope();
+  sub_scopes->push_back(&sub_scope);
+}
+
+void CondOp::CreateIndexTensor(const Scope& scope) const {
+  auto index_tensors_var = scope.FindVar("IndexTensors");
+  PADDLE_ENFORCE_NOT_NULL(index_tensors_var,
+                          "Output(IndexTensors) of CondOp should not be null.");
+  auto& index_tensors =
+      *index_tensors_var->GetMutable<std::vector<LoDTensor>>();
+  index_tensors.push_back(LoDTensor());
+}
+
+void CondOp::InferShape(const Scope& scope) const {
+  auto sub_scopes_var = scope.FindVar("SubScopes");
+  PADDLE_ENFORCE_NOT_NULL(sub_scopes_var,
+                          "Output(SubScopes) of CondOp should not be null.");
+  auto& sub_scopes = *sub_scopes_var->GetMutable<std::vector<Scope*>>();
+
+  for (int i = 0; i < 2; ++i) {
+    // Create two sub scopes for true and false branches
+    // sub_scopes[0] for the true branch and sub_scopes[1] for the false
+    // branch
+    CreateScope(scope);
+
+    // Create two tensors for true and false indices
+    // index_tensors[0] for the true branch and index_tensors[1] for the false
+    // branch
+    CreateIndexTensor(scope);
+
+    PADDLE_ENFORCE(!Inputs("Xs").empty(),
+                   "Inputs(Xs) of CondOp can't be empty.");
+    for (auto& input : Inputs("Xs")) {
+      // Create a new tensor in sub-scope for input-type tensor
+      Variable* v = sub_scopes[i]->NewVar(input);
+      LoDTensor* sub_input = v->GetMutable<LoDTensor>();
+      sub_input->Resize(scope.FindVar(input)->GetMutable<LoDTensor>()->dims());
+    }
+
+    for (auto& output : (*sub_net_op_[i]).Outputs()) {
+      for (auto& var_name : output.second) {
+        sub_scopes[i]->NewVar(var_name);
+      }
+    }
+
+    // each net calls InferShape
+    sub_net_op_[i]->InferShape(*sub_scopes[i]);
+  }
+
+  for (auto& output : Outputs("Outs")) {
+    LoDTensor* tensor_t_out =
+        sub_scopes[0]->FindVar(output)->GetMutable<LoDTensor>();
+    PADDLE_ENFORCE_NOT_NULL(tensor_t_out, "True output should not be NULL");
+    LoDTensor* tensor_f_out =
+        sub_scopes[1]->FindVar(output)->GetMutable<LoDTensor>();
+    PADDLE_ENFORCE_NOT_NULL(tensor_f_out, "False output should not be NULL");
+
+    auto* tensor_out_var = scope.FindVar(output);
+    PADDLE_ENFORCE_NOT_NULL(tensor_out_var, "Output not found");
+    LoDTensor* tensor_out = tensor_out_var->GetMutable<LoDTensor>();
+    PADDLE_ENFORCE_NOT_NULL(tensor_out,
+                            "Output tensor should not be NULL");
+
+    // check output size should be same
+    PADDLE_ENFORCE_EQ(tensor_t_out->dims(), tensor_f_out->dims(),
+                      "Outputs not of the same shape");
+    tensor_out->Resize(tensor_t_out->dims());
+    // tensor_out->mutable_data<float>(tensor_out->dims(),
+    // platform::CPUPlace());
+    tensor_out->mutable_data<float>(platform::CPUPlace());
+  }
+}
+
+void CondOp::Run(const Scope& scope,
+                 const platform::DeviceContext& dev_ctx) const {
+  auto* sub_scopes_var = scope.FindVar("SubScopes");
+  PADDLE_ENFORCE_NOT_NULL(sub_scopes_var,
+                          "Output(SubScopes) of CondOp should not be null.");
+  auto sub_scopes = sub_scopes_var->Get<std::vector<Scope*>>();
+  auto* index_tensors_var = scope.FindVar("IndexTensors");
+  PADDLE_ENFORCE_NOT_NULL(index_tensors_var,
+                          "Output(IndexTensors) of CondOp should not be null.");
+  auto index_tensors = index_tensors_var->Get<std::vector<LoDTensor>>();
+
+  std::string cond_name = Input("Cond");
+  Variable* cond_var = scope.FindVar(cond_name);
+  PADDLE_ENFORCE_NOT_NULL(cond_var,
+                          "Input(Cond) of CondOp should not be null.");
+  const LoDTensor* cond = cond_var->GetMutable<LoDTensor>();
+
+  // Step 1: get the true/false index at runtime
+  // index_[0]: vector<int>, contains all indices i with cond[i] == true
+  // index_[1]: vector<int>, contains all indices i with cond[i] == false
+  for (int i = 0; i < 2; ++i) index_[i].clear();
+
+  const int* cond_data = cond->data<int>();
+  for (int i = 0; i < cond->dims()[0]; ++i) {
+    if (cond_data[i])
+      index_[0].push_back(i);
+    else
+      index_[1].push_back(i);
+  }
+
+  // put index_[0] and index_[1] into two tensors:
+  // index_tensors[0] and index_tensors[1]
+  DDim dim = paddle::framework::make_ddim({0});
+  for (int i = 0; i < 2; ++i) {
+    dim[0] = index_[i].size();
+    int* tmp_ptr =
+        index_tensors[i].mutable_data<int>(dim, platform::CPUPlace());
+    index_tensors[i].Resize(dim);
+    memcpy(tmp_ptr, index_[i].data(), dim[0] * sizeof(int));
+  }
+
+  // Step 2: collect data by calling gather
+  for (int i = 0; i < 2; ++i) {
+    // i = 0/1 for the true and false branches respectively
+    for (auto& input : Inputs("Xs")) {
+      // find Tensor
+      Variable* v = scope.FindVar(input);
+      PADDLE_ENFORCE_NOT_NULL(v);
+      LoDTensor* tensor_parent = v->GetMutable<LoDTensor>();
+
+      v = sub_scopes[i]->FindVar(input);
+      PADDLE_ENFORCE_NOT_NULL(v);
+      LoDTensor* tensor_child = v->GetMutable<LoDTensor>();
+
+      // Resize child
+      DDim dim = tensor_child->dims();
+      dim[0] = index_[i].size();
+      tensor_child->Resize(dim);
+      tensor_child->mutable_data<float>(dim, platform::CPUPlace());
+
+      Gather<float>(dev_ctx.GetPlace(), tensor_parent, &index_tensors[i],
+                    tensor_child);
+    }
+  }
+
+  // Step 3: run
+  for (int i = 0; i < 2; ++i) {
+    sub_net_op_[i]->Run(*sub_scopes[i], dev_ctx);
+  }
+
+  // Step 4: merge output results
+  PADDLE_ENFORCE(!Outputs("Outs").empty(),
+                 "Outputs(Outs) of CondOp can't be empty.");
+  for (int i = 0; i < 2; ++i) {
+    // i = 0/1 for the true and false branches respectively
+    for (auto& output : Outputs("Outs")) {
+      // find Tensor
+      Variable* v = scope.FindVar(output);
+      PADDLE_ENFORCE_NOT_NULL(v);
+      LoDTensor* tensor_parent = v->GetMutable<LoDTensor>();
+
+      v = sub_scopes[i]->FindVar(output);
+      PADDLE_ENFORCE_NOT_NULL(v);
+      LoDTensor* tensor_child = v->GetMutable<LoDTensor>();
+
+      ScatterUpdate<float>(dev_ctx.GetPlace(), tensor_child, &index_tensors[i],
+                           tensor_parent);
+    }
+  }
+}
+
+class CondOpProtoAndCheckerMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  CondOpProtoAndCheckerMaker(framework::OpProto* proto,
+                             framework::OpAttrChecker* op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("Cond", "The condition, which is a 0/1 int vector");
+    AddInput("Xs", "Inputs of Subnets").AsDuplicable();
+    AddOutput("Outs", "Outputs of Cond_Op after merge").AsDuplicable();
+
+    AddOutput("SubScopes", "sub scopes for true and false branches");
+    AddOutput("IndexTensors", "Index Tensors contains indices for true/false");
+
+    AddComment(R"DOC(
+Sample-dependent Cond Operator:
+Given Cond[i] as a 1/0 vector to indicate true/false,
+the equation is:
+Out[i] = subnet_t[i], if Cond[i] == true
+Out[i] = subnet_f[i], if Cond[i] == false
+)DOC");
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+REGISTER_OP_WITHOUT_GRADIENT(cond, paddle::operators::CondOp,
+                             paddle::operators::CondOpProtoAndCheckerMaker);
--- a/paddle/operators/cond_op.h
+++ b/paddle/operators/cond_op.h
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include <vector>
+#include "glog/logging.h"
+#include "paddle/framework/ddim.h"
+#include "paddle/framework/eigen.h"
+#include "paddle/framework/operator.h"
+#include "paddle/framework/tensor.h"
+#include "paddle/operators/net_op.h"
+
+namespace paddle {
+namespace operators {
+
+/*
+ * @brief CondOp is a dynamic if-else Operator
+ *
+ * It has an input tensor named cond that indicates which NetOp each instance
+ * will run.
+ *
+ * if cond == 1, it will run true_net, which is a NetOp.
+ *
+ * if cond == 0, it will run false_net, which is another NetOp.
+ */
+class CondOp : public framework::OperatorBase {
+ public:
+  CondOp(const std::string& type, const framework::VariableNameMap& inputs,
+         const framework::VariableNameMap& outputs,
+         const framework::AttributeMap& attrs)
+      : OperatorBase(type, inputs, outputs, attrs) {
+    index_.resize(2);
+    sub_net_op_.resize(2);
+  }
+
+  CondOp(const CondOp& o)
+      : framework::OperatorBase(
+            static_cast<const framework::OperatorBase&>(o)) {
+    // TODO(yuyang18): Implement copy ctor well.
+    PADDLE_THROW("Not implemented");
+  }
+
+  void CreateScope(const framework::Scope& scope) const;
+
+  void CreateIndexTensor(const framework::Scope& scope) const;
+
+  /*
+   * InferShape must be called before Run.
+   */
+  void InferShape(const framework::Scope& scope) const override;
+
+  /*
+   * Set True Block
+   */
+  void set_truenet(std::unique_ptr<OperatorBase>&& net) {
+    sub_net_op_[0] = std::move(net);
+  }
+
+  /*
+   * Set False Block
+   */
+  void set_falsenet(std::unique_ptr<OperatorBase>&& net) {
+    sub_net_op_[1] = std::move(net);
+  }
+
+  void Run(const framework::Scope& scope,
+           const platform::DeviceContext& dev_ctx) const override;
+
+ private:
+  // sub_net_op_[0]: subnet_t
+  // sub_net_op_[1]: subnet_f
+  std::vector<std::unique_ptr<framework::OperatorBase>> sub_net_op_;
+
+  // index_[0]: True_index;
+  // index_[1]: False_index;
+  mutable std::vector<std::vector<int>> index_;
+};
+
+}  // namespace operators
+}  // namespace paddle
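
Review note: `CondOp::Run` follows four steps: split the batch indices by condition, gather the rows for each branch, run each sub-net on its own rows, and scatter the results back to their original positions. The sketch below shows that flow on plain `std::vector` data as a way to reason about the indexing. It assumes each branch returns one output row per input row; `cond_apply`, `Batch`, and `SubNet` are hypothetical stand-ins, not Paddle types.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Row = std::vector<float>;
using Batch = std::vector<Row>;
using SubNet = std::function<Batch(const Batch&)>;  // stand-in for a NetOp

Batch cond_apply(const std::vector<int>& cond, const Batch& xs,
                 const SubNet& true_net, const SubNet& false_net) {
  assert(cond.size() == xs.size());

  // Step 1: split row indices by condition (index[0] = true, index[1] = false).
  std::vector<std::size_t> index[2];
  for (std::size_t i = 0; i < cond.size(); ++i) {
    index[cond[i] ? 0 : 1].push_back(i);
  }

  Batch out(xs.size());
  const SubNet* nets[2] = {&true_net, &false_net};
  for (int b = 0; b < 2; ++b) {
    // Step 2: gather the rows that belong to this branch.
    Batch sub;
    for (std::size_t i : index[b]) sub.push_back(xs[i]);

    // Step 3: run the branch on its own sub-batch.
    Batch sub_out = (*nets[b])(sub);
    assert(sub_out.size() == sub.size());

    // Step 4: scatter the results back to their original positions.
    for (std::size_t k = 0; k < index[b].size(); ++k) {
      out[index[b][k]] = sub_out[k];
    }
  }
  return out;
}
```

Both branches always run here, even when one receives an empty sub-batch, which matches the unconditional loop over `sub_net_op_` in `Run`.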
--- a/paddle/operators/cos_sim_op.cc
+++ b/paddle/operators/cos_sim_op.cc
@@ -26,8 +26,16 @@ class CosSimOp : public framework::OperatorWithKernel {
 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
    // notnull check
-    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"), "Input(X) must not be null.");
-    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Y"), "Input(Y) must not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"),
+                            "Input(X) of CosSimOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Y"),
+                            "Input(Y) of CosSimOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Out"),
+                            "Output(Out) of CosSimOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("XNorm"),
+                            "Output(XNorm) of CosSimOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("YNorm"),
+                            "Output(YNorm) of CosSimOp should not be null.");

    // shape check
    auto x_dims = ctx.Input<Tensor>("X")->dims();

--- a/paddle/operators/cos_sim_op.h
+++ b/paddle/operators/cos_sim_op.h
@@ -56,7 +56,7 @@ class CosSimKernel : public framework::OpKernel {
    x_norm.device(place) = x.square().sum(row_along).sqrt();
    y_norm.device(place) = y.square().sum(row_along).sqrt();
    if (rows_x == rows_y) {
-      auto xy = (x * y).sum(Eigen::array<int, 1>({1}));
+      auto xy = (x * y).sum(Eigen::array<int, 1>({{1}}));
      z.device(place) = xy / x_norm / y_norm;
    } else {
      Eigen::DSizes<int, 2> bcast(rows_x, 1);
@@ -134,7 +134,7 @@ class CosSimGradKernel : public framework::OpKernel {
        out_grad_y->mutable_data<T>(context.GetPlace());
        auto dy = EigenMatrix<T>::Reshape(*out_grad_y, 1);
        auto grad = x / norm_prod_bcast - z_bcast * y_bcast / y_snorm_bcast;
-        dy.device(place) = (dz_bcast * grad).sum(Eigen::array<int, 1>({0}));
+        dy.device(place) = (dz_bcast * grad).sum(Eigen::array<int, 1>({{0}}));
      }
    }
  }

--- a/paddle/operators/elementwise_mul_op.cc
+++ b/paddle/operators/elementwise_mul_op.cc
@@ -25,8 +25,14 @@ class ElementWiseMulOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
-    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"), "Input(X) should not be null");
-    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Y"), "Input(Y) should not be null");
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"),
+                            "Input(X) of ElementWiseMulOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Y"),
+                            "Input(Y) of ElementWiseMulOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(
+        ctx.OutputVar("Out"),
+        "Output(Out) of ElementWiseMulOp should not be null.");
+
    auto x_dim = ctx.Input<Tensor>("X")->dims();
    auto y_dim = ctx.Input<Tensor>("Y")->dims();
    PADDLE_ENFORCE_GE(x_dim.size(), y_dim.size(),

--- a/paddle/operators/elementwise_mul_op.h
+++ b/paddle/operators/elementwise_mul_op.h
@@ -13,10 +13,8 @@
   limitations under the License. */

 #pragma once
-#include <iostream>
 #include "paddle/framework/eigen.h"
 #include "paddle/framework/op_registry.h"
-#include "paddle/operators/math/math_function.h"

 namespace paddle {
 namespace operators {

--- a/paddle/operators/fc_op.cc
+++ b/paddle/operators/fc_op.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/framework/op_registry.h"
+#include "paddle/operators/net_op.h"
+
+namespace paddle {
+namespace operators {
+
+class FCOp : public NetOp {
+ public:
+  FCOp(const std::string &type, const framework::VariableNameMap &inputs,
+       const framework::VariableNameMap &outputs,
+       const framework::AttributeMap &attrs)
+      : NetOp(type, inputs, outputs, attrs) {
+    PADDLE_ENFORCE(!Inputs("X").empty(),
+                   "Inputs(X) of FCOp should not be null.");
+    PADDLE_ENFORCE(!Inputs("W").empty(),
+                   "Inputs(W) of FCOp should not be null.");
+    PADDLE_ENFORCE(!Outputs("MulOut").empty(),
+                   "Outputs(MulOut) of FCOp should not be null.");
+    PADDLE_ENFORCE_NE(Output("Out"), framework::kEmptyVarName,
+                      "Output(Out) of FCOp should not be null.");
+
+    auto x = Inputs("X");
+    auto w = Inputs("W");
+    auto mul_out = Outputs("MulOut");
+    PADDLE_ENFORCE_EQ(
+        x.size(), w.size(),
+        "The size of inputs X(%d) should be the same as that of weights W(%d).",
+        x.size(), w.size());
+    PADDLE_ENFORCE_EQ(mul_out.size(), x.size(),
+                      "The size of intermediate mul_out(%d) should be the same "
+                      "as that of inputs X(%d).",
+                      mul_out.size(), x.size());
+
+    size_t n = x.size();
+    PADDLE_ENFORCE_GE(n, static_cast<size_t>(1),
+                      "The size of inputs X(%d) should be no less than 1.", n);
+
+    auto x_num_col_dims = Attr<std::vector<int>>("xNumColDims");
+
+    // Set all values or set no values (use the default value)
+    if (!x_num_col_dims.empty()) {
+      PADDLE_ENFORCE_EQ(x_num_col_dims.size(), n,
+                        "The size of attribute xNumColDims(%d) should be the "
+                        "same as that of inputs X(%d).",
+                        x_num_col_dims.size(), n);
+    } else {
+      x_num_col_dims.resize(n);
+      for (size_t i = 0; i < n; i++) {
+        x_num_col_dims[i] = 1;
+      }
+    }
+
+    // mul_out[i] = X[i] * W[i]
+    for (size_t i = 0; i < n; i++) {
+      framework::AttributeMap mul_attr;
+      mul_attr["x_num_col_dims"] = static_cast<int>(x_num_col_dims[i]);
+      mul_attr["y_num_col_dims"] = static_cast<int>(1);
+      AppendOp(
+          framework::OpRegistry::CreateOp("mul", {{"X", {x[i]}}, {"Y", {w[i]}}},
+                                          {{"Out", {mul_out[i]}}}, mul_attr));
+    }
+
+    // sum_out = X[0] * W[0] + ... + X[n-1] * W[n-1]
+    auto sum_out = mul_out[0];
+    if (n > 1) {
+      PADDLE_ENFORCE_NE(Output("SumOut"), framework::kEmptyVarName,
+                        "Output(SumOut) of FCOp should not be null when the "
+                        "size of Inputs(X) > 1.");
+
+      sum_out = Output("SumOut");
+      AppendOp(framework::OpRegistry::CreateOp("sum", {{"X", {mul_out}}},
+                                               {{"Out", {sum_out}}}, {}));
+    } else {
+      if (Output("SumOut") != framework::kEmptyVarName) {
+        this->Rename(Output("SumOut"), framework::kEmptyVarName);
+      }
+    }
+
+    // add_out = sum_out + b
+    auto b = Input("B");
+    auto add_out = sum_out;
+    if (b != framework::kEmptyVarName) {
+      PADDLE_ENFORCE_NE(
+          Output("AddOut"), framework::kEmptyVarName,
+          "Output(AddOut) of FCOp should not be null when Input(B) is set.");
+
+      add_out = Output("AddOut");
+      AppendOp(framework::OpRegistry::CreateOp(
+          "rowwise_add", {{"X", {sum_out}}, {"b", {Input("B")}}},
+          {{"Out", {add_out}}}, {}));
+    } else {
+      if (Output("AddOut") != framework::kEmptyVarName) {
+        this->Rename(Output("AddOut"), framework::kEmptyVarName);
+      }
+    }
+
+    auto activation = Attr<std::string>("activation");
+    AppendOp(framework::OpRegistry::CreateOp(activation, {{"X", {add_out}}},
+                                             {{"Y", {Output("Out")}}}, {}));
+    CompleteAddOp(false);
+  }
+};
+
+class FCOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  FCOpMaker(framework::OpProto *proto, framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("X",
+             "(A vector of Tensors) each input Tensor can be of arbitrary "
+             "dimension, and will be reshaped to a 2-D matrix of size "
+             "(minibatch, number_of_input_features) according to attribute "
+             "xNumColDims.")
+        .AsDuplicable();
+    AddInput("W",
+             "(A vector of Tensors) the weights of FC operator, a "
+             "vector of 2-D matrices of size "
+             "(number_of_input_features, number_of_neurons).")
+        .AsDuplicable();
+    AddInput("B",
+             "(Tensor) the bias of FC operator, a 1-D vector of size "
+             "number_of_neurons.");
+
+    AddOutput("Out",
+              "(Tensor) the activated output matrix of FC operator, a 2-D "
+              "matrix of size (minibatch, number_of_neurons).");
+    AddOutput("MulOut",
+              "(A vector of Tensors) the intermediate outputs of FC operator, "
+              "each Tensor saving the product of X_i * W_i.")
+        .AsIntermediate()
+        .AsDuplicable();
+    AddOutput(
+        "SumOut",
+        "(Tensor) the intermediate output of FC operator, "
+        "saving the sum of the products of X and W, that is sum{X_i * W_i}.")
+        .AsIntermediate();
+    AddOutput("AddOut",
+              "(Tensor) the non-activated output of FC operator, "
+              "saving sum{X_i * W_i} + B.")
+        .AsIntermediate();
+    AddAttr<std::string>(
+        "activation",
+        "(string, default identity) the activation type of FC operator.")
+        .SetDefault("identity")
+        .InEnum({"identity", "sigmoid", "softmax"});
+    AddAttr<std::vector<int>>(
+        "xNumColDims",
+        "(std::vector<int>) The inputs Tensors of FC operator can be of "
+        "more than 2 dimensions. In that case, each input Tensor `X_i` will be "
+        "reshaped to a 2-D matrix. The matrix's first dimension "
+        "(the length of column) will be the product of `X_i`'s first "
+        "`xNumColDims_i` dimensions, that is "
+        "`X_i.dims[0] x ... x X_i.dims[xNumColDims_i - 1]`. "
+        "The matrix's second dimension (the length of row) will be the product "
+        "of `X_i`'s last `rank - xNumColDims_i` dimensions, that is "
+        "`X_i.dims[xNumColDims_i] x ... x X_i.dims[rank - 1]`."
+        .SetDefault(std::vector<int>{});
+
+    AddComment(R"DOC(
+Fully Connected Operator, known as Fully Connected Layer or Inner Product Layer
+in Convolutional Neural Networks. Neurons in a fully connected layer have
+full connections to all activations in the previous layer.
+It multiplies each input by a matrix of learned weights, sums the
+products, and optionally adds a bias offset.
+The result is then passed through the activation.
+
+Equation:
+  Out = Act(sum_i{X_i * W_i} + B)
+
+where X_i is a Tensor that will be reshaped to a 2-D matrix of size (M x K),
+usually M is the minibatch size and K is the number of input features.
+W_i is a 2-D matrix of size (K x N), where N means the number of neurons
+in the fully connected layer. B is a 1-D vector of size N.
+Thus, the output Out is a 2-D matrix of size (M x N).
+Activation type can be set to `identity` (default), `sigmoid` or `softmax`.
+)DOC");
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+REGISTER_OP_WITHOUT_GRADIENT(fc, ops::FCOp, ops::FCOpMaker);
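
Review note on `fc_op.cc`: the equation in the operator comment, `Out = Act(sum{X_i * W_i} + B)`, can be illustrated with a minimal standalone sketch. It assumes the identity activation and plain nested vectors; `Matrix` and `FcForward` are hypothetical names used only for illustration and are not part of Paddle, which builds the same computation from `mul`, `sum`, and `rowwise_add` sub-operators.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix type for illustration; not a Paddle type.
using Matrix = std::vector<std::vector<float>>;

// Computes sum_i{X_i * W_i} + B with the identity activation.
// Each xs[i] is (M x K_i), each ws[i] is (K_i x N), and b has N entries.
// An empty b means that no bias is applied.
Matrix FcForward(const std::vector<Matrix>& xs, const std::vector<Matrix>& ws,
                 const std::vector<float>& b) {
  assert(!xs.empty() && xs.size() == ws.size());
  assert(!ws[0].empty());
  const std::size_t m = xs[0].size();
  const std::size_t n = ws[0][0].size();
  Matrix out(m, std::vector<float>(n, 0.0f));
  for (std::size_t i = 0; i < xs.size(); ++i) {
    const std::size_t k = ws[i].size();
    assert(xs[i].size() == m);
    for (std::size_t r = 0; r < m; ++r) {
      assert(xs[i][r].size() == k);
      for (std::size_t c = 0; c < n; ++c) {
        for (std::size_t j = 0; j < k; ++j) {
          // Accumulate the product X_i * W_i into the running sum.
          out[r][c] += xs[i][r][j] * ws[i][j][c];
        }
      }
    }
  }
  if (!b.empty()) {
    assert(b.size() == n);
    // Row-wise add: the same bias vector is added to every row.
    for (std::size_t r = 0; r < m; ++r) {
      for (std::size_t c = 0; c < n; ++c) out[r][c] += b[c];
    }
  }
  return out;
}
```

Accumulating into a single output mirrors the `sum` step; the operator itself keeps the intermediate `MulOut`, `SumOut`, and `AddOut` variables separate.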
--- a/paddle/operators/fill_zeros_like_op.cc
+++ b/paddle/operators/fill_zeros_like_op.cc
@@ -23,6 +23,13 @@ class FillZerosLikeOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(
+        ctx.InputVar("Src"),
+        "Input(Src) of FillZerosLikeOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(
+        ctx.OutputVar("Dst"),
+        "Output(Dst) of FillZerosLikeOp should not be null.");
+
    ctx.Output<framework::LoDTensor>("Dst")->Resize(
        ctx.Input<framework::Tensor>("Src")->dims());
  }

--- a/paddle/operators/gather_op.cc
+++ b/paddle/operators/gather_op.cc
@@ -24,6 +24,13 @@ class GatherOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"),
+                            "Input(X) of GatherOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Index"),
+                            "Input(Index) of GatherOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Out"),
+                            "Output(Out) of GatherOp should not be null.");
+
    int batch_size = ctx.Input<Tensor>("Index")->dims()[0];
    PADDLE_ENFORCE_GE(batch_size, 0, "Batch size must be >0");
    framework::DDim output_dims(ctx.Input<Tensor>("X")->dims());

--- a/paddle/operators/gaussian_random_op.cc
+++ b/paddle/operators/gaussian_random_op.cc
@@ -43,8 +43,12 @@ class GaussianRandomOp : public framework::OperatorWithKernel {
  using framework::OperatorWithKernel::OperatorWithKernel;

 protected:
-  void InferShape(const framework::InferShapeContext& context) const override {
-    auto* tensor = context.Output<framework::LoDTensor>("Out");
+  void InferShape(const framework::InferShapeContext& ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(
+        ctx.OutputVar("Out"),
+        "Output(Out) of GaussianRandomOp should not be null.");
+
+    auto* tensor = ctx.Output<framework::LoDTensor>("Out");
    auto dims = Attr<std::vector<int>>("dims");
    std::vector<int64_t> temp;
    temp.reserve(dims.size());

--- a/paddle/operators/identity_op.cc
+++ b/paddle/operators/identity_op.cc
@@ -27,7 +27,7 @@ class IdentityOpMaker : public framework::OpProtoAndCheckerMaker {
                  framework::OpAttrChecker *op_checker)
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X", "The input tensor of identity operator.");
-    AddOutput("Out", "The output tensor of identity operator.");
+    AddOutput("Y", "The output tensor of identity operator.");
    AddComment(R"DOC(
 The identity operator is an alias of the scale operator
 with the attribute scale fixed to 1.0.
@@ -42,9 +42,15 @@ class IdentityOp : public NetOp {
             const framework::VariableNameMap &outputs,
             const framework::AttributeMap &attrs)
      : NetOp(type, inputs, outputs, attrs) {
+    PADDLE_ENFORCE_NE(Input("X"), framework::kEmptyVarName,
+                      "Input(X) of IdentityOp should not be null.");
+    PADDLE_ENFORCE_NE(Output("Y"), framework::kEmptyVarName,
+                      "Output(Y) of IdentityOp should not be null.");
+
    AppendOp(framework::OpRegistry::CreateOp(
-        "scale", {{"X", {Input("X")}}}, {{"Out", {Output("Out")}}},
+        "scale", {{"X", {Input("X")}}}, {{"Out", {Output("Y")}}},
        {{"scale", static_cast<AttrType>(1)}}));
+    CompleteAddOp(false);
  }
 };


--- a/paddle/operators/lookup_table_op.cc
+++ b/paddle/operators/lookup_table_op.cc
@@ -22,10 +22,17 @@ class LookupTableOp : public framework::OperatorWithKernel {
  using framework::OperatorWithKernel::OperatorWithKernel;

 protected:
-  void InferShape(const framework::InferShapeContext &context) const override {
-    auto table_t = context.Input<Tensor>("W");
-    auto ids_t = context.Input<Tensor>("Ids");
-    auto output_t = context.Output<framework::LoDTensor>("Out");
+  void InferShape(const framework::InferShapeContext &ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("W"),
+                            "Input(W) of LookupTableOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Ids"),
+                            "Input(Ids) of LookupTableOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Out"),
+                            "Output(Out) of LookupTableOp should not be null.");
+
+    auto table_t = ctx.Input<Tensor>("W");
+    auto ids_t = ctx.Input<Tensor>("Ids");
+    auto output_t = ctx.Output<framework::LoDTensor>("Out");

    output_t->Resize({ids_t->dims()[0], table_t->dims()[1]});
  }

--- a/paddle/operators/mean_op.cc
+++ b/paddle/operators/mean_op.cc
@@ -24,7 +24,9 @@ class MeanOp : public framework::OperatorWithKernel {
 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"),
-                            "Input of MeanOp must be initialized.");
+                            "Input(X) of MeanOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Out"),
+                            "Output(Out) of MeanOp should not be null.");
    ctx.Output<framework::LoDTensor>("Out")->Resize({1});
  }
 };

--- a/paddle/operators/minus_op.cc
+++ b/paddle/operators/minus_op.cc
@@ -27,6 +27,13 @@ class MinusOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"),
+                            "Input(X) of MinusOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Y"),
+                            "Input(Y) of MinusOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Out"),
+                            "Output(Out) of MinusOp should not be null.");
+
    auto *left_tensor = ctx.Input<framework::Tensor>("X");
    auto *right_tensor = ctx.Input<framework::Tensor>("Y");

@@ -64,7 +71,7 @@ class MinusGradOp : public NetOp {

    // x_grad = out_grad
    AppendOp(framework::OpRegistry::CreateOp("identity", {{"X", {out_grad}}},
-                                             {{"Out", {x_grad}}}, {}));
+                                             {{"Y", {x_grad}}}, {}));

    framework::AttributeMap scale_attr;
    scale_attr["scale"] = static_cast<AttrType>(-1);
@@ -77,8 +84,6 @@ class MinusGradOp : public NetOp {
 }  // namespace operators
 }  // namespace paddle

-USE_OP(scale);
-USE_NO_KERNEL_OP(identity);
 namespace ops = paddle::operators;
 REGISTER_OP(minus, ops::MinusOp, ops::MinusOpMaker, minus_grad,
            ops::MinusGradOp<float>);

--- a/paddle/operators/mul_op.cc
+++ b/paddle/operators/mul_op.cc
@@ -26,6 +26,13 @@ class MulOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"),
+                            "Input(X) of MulOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Y"),
+                            "Input(Y) of MulOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Out"),
+                            "Output(Out) of MulOp should not be null.");
+
    auto x_dims = ctx.Input<Tensor>("X")->dims();
    auto y_dims = ctx.Input<Tensor>("Y")->dims();
    int x_num_col_dims = Attr<int>("x_num_col_dims");

--- a/paddle/operators/onehot_cross_entropy_op.cc
+++ b/paddle/operators/onehot_cross_entropy_op.cc
@@ -23,6 +23,16 @@ class OnehotCrossEntropyOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(
+        ctx.InputVar("X"),
+        "Input(X) of OnehotCrossEntropyOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(
+        ctx.InputVar("label"),
+        "Input(label) of OnehotCrossEntropyOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(
+        ctx.OutputVar("Y"),
+        "Output(Y) of OnehotCrossEntropyOp should not be null.");
+
    auto *X = ctx.Input<Tensor>("X");
    auto *label = ctx.Input<Tensor>("label");


--- a/paddle/operators/pad_op.cc
+++ b/paddle/operators/pad_op.cc
@@ -25,6 +25,11 @@ class PadOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"),
+                            "Input(X) of PadOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Out"),
+                            "Output(Out) of PadOp should not be null.");
+
    auto x_dim = ctx.Input<Tensor>("X")->dims();
    auto paddings = Attr<std::vector<int>>("paddings");
    PADDLE_ENFORCE_EQ(x_dim.size() * 2, int64_t(paddings.size()),

--- a/paddle/operators/reshape_op.cc
+++ b/paddle/operators/reshape_op.cc
@@ -28,7 +28,11 @@ class ReshapeOp : public framework::OperatorWithKernel {
 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
    // input check
-    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"), "Input(X) shouldn't be null");
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"),
+                            "Input(X) of ReshapeOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Out"),
+                            "Output(Out) of ReshapeOp should not be null.");
+
    auto shape = ctx.Attr<std::vector<int>>("shape");
    PADDLE_ENFORCE(shape.size() > 0, "Attr(shape) shouldn't be empty.");
    for (auto dim : shape) {

--- a/paddle/operators/rowwise_add_op.cc
+++ b/paddle/operators/rowwise_add_op.cc
@@ -25,6 +25,13 @@ class RowwiseAddOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"),
+                            "Input(X) of RowwiseAddOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("b"),
+                            "Input(b) of RowwiseAddOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Out"),
+                            "Output(Out) of RowwiseAddOp should not be null.");
+
    auto x_dims = ctx.Input<Tensor>("X")->dims();
    auto b_dims = ctx.Input<Tensor>("b")->dims();
    PADDLE_ENFORCE_GT(

--- a/paddle/operators/scale_op.cc
+++ b/paddle/operators/scale_op.cc
@@ -27,6 +27,11 @@ class ScaleOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"),
+                            "Input(X) of ScaleOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Out"),
+                            "Output(Out) of ScaleOp should not be null.");
+
    auto *in = ctx.Input<framework::Tensor>("X");
    auto *out = ctx.Output<framework::LoDTensor>("Out");
    out->Resize(in->dims());

--- a/paddle/operators/scatter_op.cc
+++ b/paddle/operators/scatter_op.cc
@@ -24,6 +24,15 @@ class ScatterOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Ref"),
+                            "Input(Ref) of ScatterOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Index"),
+                            "Input(Index) of ScatterOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Updates"),
+                            "Input(Updates) of ScatterOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Out"),
+                            "Output(Out) of ScatterOp should not be null.");
+
    PADDLE_ENFORCE_EQ(ctx.Input<Tensor>("Index")->dims().size(), 1,
                      "Update Index should be 1-D.");
    PADDLE_ENFORCE_EQ(ctx.Input<Tensor>("Ref")->dims().size(),

--- a/paddle/operators/sequence_avg_pool_op.cc
+++ b/paddle/operators/sequence_avg_pool_op.cc
@@ -23,9 +23,12 @@ class SequenceAvgPoolOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext& ctx) const override {
-    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"),
-                            "Input of SequenceAvgPoolOp"
-                            "must be initialized.");
+    PADDLE_ENFORCE_NOT_NULL(
+        ctx.InputVar("X"), "Input(X) of SequenceAvgPoolOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(
+        ctx.OutputVar("Out"),
+        "Output(Out) of SequenceAvgPoolOp should not be null.");
+
    auto* x = ctx.Input<framework::LoDTensor>("X");
    auto dims = x->dims();
    auto lod = x->lod();
@@ -60,7 +63,9 @@ class SequenceAvgPoolGradOp : public framework::OperatorWithKernel {
 protected:
  void InferShape(const framework::InferShapeContext& ctx) const override {
    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar(framework::GradVarName("Out")),
-                            "Gradient of Out should not be null");
+                            "Gradient of Out should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"),
+                            "The input X should not be null.");
    auto og_dims =
        ctx.Input<framework::LoDTensor>(framework::GradVarName("Out"))->dims();
    auto x_dims = ctx.Input<framework::LoDTensor>("X")->dims();

--- a/paddle/operators/sequence_avg_pool_op.h
+++ b/paddle/operators/sequence_avg_pool_op.h
@@ -21,6 +21,9 @@ namespace operators {

 using Tensor = framework::Tensor;
 using LoDTensor = framework::LoDTensor;
+template <typename T, int MajorType = Eigen::RowMajor,
+          typename IndexType = Eigen::DenseIndex>
+using EigenVector = framework::EigenVector<T, MajorType, IndexType>;
 template <typename T, int MajorType = Eigen::RowMajor,
          typename IndexType = Eigen::DenseIndex>
 using EigenMatrix = framework::EigenMatrix<T, MajorType, IndexType>;
@@ -43,8 +46,8 @@ class SequenceAvgPoolKernel : public framework::OpKernel {
                                 static_cast<int>(lod[0][i + 1]));
      Tensor out_t = out->Slice<T>(i, i + 1);
      int64_t h = static_cast<int64_t>(lod[0][i + 1] - lod[0][i]);
-      auto in_e = EigenMatrix<T>::From(in_t, {h, w});
-      auto out_e = EigenMatrix<T>::From(out_t, {h, w});
+      auto in_e = EigenMatrix<T>::From(in_t, framework::make_ddim({h, w}));
+      auto out_e = EigenVector<T>::Flatten(out_t);
      out_e.device(place) = in_e.mean(Eigen::array<int, 1>({{0}}));
    }
  }
@@ -54,9 +57,9 @@ template <typename Place, typename T>
 class SequenceAvgPoolGradKernel : public framework::OpKernel {
 public:
  void Compute(const framework::ExecutionContext& context) const override {
-    auto* in = context.Output<LoDTensor>("X");
-    auto* in_g = context.Output<LoDTensor>(framework::GradVarName("X"));
+    auto* in = context.Input<LoDTensor>("X");
    auto* out_g = context.Input<LoDTensor>(framework::GradVarName("Out"));
+    auto* in_g = context.Output<LoDTensor>(framework::GradVarName("X"));

    auto dims = in->dims();
    auto lod = in->lod();
@@ -71,7 +74,7 @@ class SequenceAvgPoolGradKernel : public framework::OpKernel {
      int64_t h = static_cast<int64_t>(lod[0][i + 1] - lod[0][i]);
      auto in_g_e = EigenMatrix<T>::From(in_g_t, {h, w});
      auto out_g_e = EigenMatrix<T>::From(out_g_t, {1, w});
-      Eigen::DSizes<int, 2> bcast(h, w);
+      Eigen::DSizes<int, 2> bcast(h, 1);
      in_g_e.device(place) = (out_g_e / static_cast<T>(h)).broadcast(bcast);
    }
  }
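
Review note on the `bcast(h, 1)` change: Eigen's `broadcast` replicates a tensor along each dimension by the given factor, so tiling the `(1 x w)` output gradient by `(h, 1)` yields the `(h x w)` input gradient, whereas factors of `(h, w)` would not match that shape. A minimal sketch of the intended result, using plain vectors and a hypothetical helper name (`AvgPoolGradRows` is not a Paddle function):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gradient of averaging h rows: every input row receives out_grad / h.
// This equals tiling a (1 x w) row h times along dim 0 and once along dim 1.
std::vector<std::vector<float>> AvgPoolGradRows(
    const std::vector<float>& out_grad, std::size_t h) {
  assert(h > 0);
  std::vector<float> row(out_grad.size());
  for (std::size_t c = 0; c < out_grad.size(); ++c) {
    row[c] = out_grad[c] / static_cast<float>(h);
  }
  // h identical copies of the scaled row.
  return std::vector<std::vector<float>>(h, row);
}
```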

--- a/paddle/operators/sgd_op.cc
+++ b/paddle/operators/sgd_op.cc
@@ -23,6 +23,13 @@ class SGDOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("param"),
+                            "Input(param) of SGDOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("grad"),
+                            "Input(grad) of SGDOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("param_out"),
+                            "Output(param_out) of SGDOp should not be null.");
+
    PADDLE_ENFORCE_EQ(ctx.Input<Tensor>("param")->dims(),
                      ctx.Input<Tensor>("grad")->dims(),
                      "Two input of SGD Op's dimension must be same.");

--- a/paddle/operators/sigmoid_op.cc
+++ b/paddle/operators/sigmoid_op.cc
@@ -23,6 +23,11 @@ class SigmoidOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"),
+                            "Input(X) of SigmoidOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Y"),
+                            "Output(Y) of SigmoidOp should not be null.");
+
    ctx.Output<framework::LoDTensor>("Y")->Resize(
        ctx.Input<Tensor>("X")->dims());
  }

--- a/paddle/operators/softmax_op.cc
+++ b/paddle/operators/softmax_op.cc
@@ -23,6 +23,11 @@ class SoftmaxOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"),
+                            "Input(X) of SoftmaxOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Y"),
+                            "Output(Y) of SoftmaxOp should not be null.");
+
    PADDLE_ENFORCE(ctx.Input<Tensor>("X")->dims().size() == 2UL,
                   "The input of softmax op must be a matrix.");
    ctx.Output<framework::LoDTensor>("Y")->Resize(

--- a/paddle/operators/split_op.cc
+++ b/paddle/operators/split_op.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "paddle/operators/split_op.h"
+#include "paddle/operators/net_op.h"
+
+namespace paddle {
+namespace operators {
+using framework::Tensor;
+
+class SplitOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+ protected:
+  void InferShape(const framework::InferShapeContext &ctx) const override {
+    // infershape
+    auto *in = ctx.Input<framework::Tensor>("X");
+    auto outs = ctx.MultiOutput<framework::LoDTensor>("Out");
+    size_t axis = static_cast<size_t>(ctx.Attr<int>("axis"));
+    size_t num = static_cast<size_t>(ctx.Attr<int>("num"));
+    std::vector<int> sections =
+        static_cast<std::vector<int>>(ctx.Attr<std::vector<int>>("sections"));
+    const size_t n = outs.size();
+
+    if (num > 0) {
+      int64_t in_axis_dim = in->dims()[axis];
+      PADDLE_ENFORCE_EQ(in_axis_dim % num, 0,
+                        "tensor split does not result"
+                        " in an equal division");
+      size_t out_axis_dim = in_axis_dim / num;
+      for (size_t i = 0; i < n; ++i) {
+        auto dim = in->dims();
+        dim[axis] = out_axis_dim;
+        outs[i]->Resize(dim);
+      }
+    } else if (sections.size() > 0) {
+      PADDLE_ENFORCE_EQ(sections.size(), n,
+                        "tensor split sections size "
+                        "should be equal to output size.");
+      for (size_t i = 0; i < n; ++i) {
+        auto dim = in->dims();
+        dim[axis] = sections[i];
+        outs[i]->Resize(dim);
+      }
+    } else {
+      PADDLE_ENFORCE_NOT_NULL(nullptr, "split operator should"
+                              " specify num or sections.");
+    }
+  }
+};
+
+class SplitOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  SplitOpMaker(framework::OpProto *proto, framework::OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput("X", "the input tensor of split operator.");
+    AddOutput("Out", "the output tensors of split operator.").AsDuplicable();
+    AddComment(R"DOC(
+      Split the input tensor into multiple sub-tensors.
+      Example:
+        Input = [[1,2],
+                 [3,4],
+                 [5,6]]
+        sections = [2,1]
+        axis = 0
+        Output[0] = [[1,2],
+                     [3,4]]
+        Output[1] = [[5,6]]
+
+    )DOC");
+    AddAttr<std::vector<int>>("sections",
+                              "the length of each "
+                              "output along the specified axis.")
+        .SetDefault(std::vector<int>{});
+    AddAttr<int>("num",
+                 "number of the sub-tensors, it must evenly divide "
+                 "Input.dims()[axis]")
+        .SetDefault(0);
+    AddAttr<int>("axis", "The axis along which the input will be split.")
+        .SetDefault(0);
+  }
+};
+
+class SplitOpGrad : public NetOp {
+ public:
+  SplitOpGrad(const std::string &type, const framework::VariableNameMap &inputs,
+              const framework::VariableNameMap &outputs,
+              const framework::AttributeMap &attrs)
+      : NetOp(type, inputs, outputs, attrs) {
+    auto out_grad = Inputs(framework::GradVarName("Out"));
+    auto x_grad = Output(framework::GradVarName("X"));
+    AppendOp(framework::OpRegistry::CreateOp("concat", {{"X", out_grad}},
+                                             {{"Out", {x_grad}}}, attrs));
+    CompleteAddOp(false);
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+USE_CPU_ONLY_OP(concat);
+REGISTER_OP(split, ops::SplitOp, ops::SplitOpMaker, split_grad,
+            ops::SplitOpGrad);
+REGISTER_OP_CPU_KERNEL(split,
+                       ops::SplitKernel<paddle::platform::CPUPlace, float>);
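`SplitOpGrad` above builds a `NetOp` containing a single `concat` op that maps the gradients of `Out` to the gradient of `X`, forwarding the same `attrs`. The sketch below, in plain Python over row-major flat lists, illustrates why concatenating along the same axis undoes a split. `concat_flat` and its parameters are hypothetical names used only for illustration; this is not Paddle's concat kernel.

```python
def concat_flat(parts, part_dims, axis):
    """Concatenate row-major flat buffers along `axis`.

    parts:     list of flat lists, one per piece
    part_dims: list of shapes, one per piece; shapes may differ only at `axis`
    """
    # Number of "outer" slices before the axis and elements after it.
    before = 1
    for d in part_dims[0][:axis]:
        before *= d
    after = 1
    for d in part_dims[0][axis + 1:]:
        after *= d

    out = []
    for j in range(before):
        # For each outer slice, lay the pieces side by side.
        for part, dims in zip(parts, part_dims):
            chunk = dims[axis] * after
            out.extend(part[j * chunk:(j + 1) * chunk])
    return out


# Gradients of the two outputs of a split along axis 1 of a 3x2 tensor.
d_out0 = [1, 3, 5]   # shape (3, 1)
d_out1 = [2, 4, 6]   # shape (3, 1)
d_x = concat_flat([d_out0, d_out1], [[3, 1], [3, 1]], axis=1)
```

Because split only copies elements, each element of the input gradient is just the matching element of one output gradient, which is what the concatenation produces.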
--- a/paddle/operators/split_op.h
+++ b/paddle/operators/split_op.h
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+
+#include <vector>
+#include "paddle/framework/op_registry.h"
+
+namespace paddle {
+namespace operators {
+
+template <typename Place, typename T>
+class SplitKernel : public framework::OpKernel {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    auto* in = ctx.Input<framework::Tensor>("X");
+    auto outs = ctx.MultiOutput<framework::Tensor>("Out");
+    int64_t axis = static_cast<int64_t>(ctx.Attr<int>("axis"));
+    size_t before = 1, after = 1;
+    const size_t n = outs.size();
+    size_t input_axis_dim = in->dims()[axis];
+
+    for (int64_t i = 0; i < in->dims().size(); ++i) {
+      if (i == axis) {
+        continue;
+      }
+      if (i < axis) {
+        before *= in->dims()[i];
+      } else {
+        after *= in->dims()[i];
+      }
+    }
+    size_t input_offset = 0;
+    for (size_t i = 0; i < n; i++) {
+      auto& out = outs[i];
+      size_t axis_dim = out->dims()[axis];
+      for (size_t j = 0; j < before; j++) {
+        size_t len = axis_dim * after * sizeof(T);
+        T* dest =
+            out->mutable_data<T>(platform::CPUPlace()) + axis_dim * after * j;
+        const T* src =
+            in->data<T>() + input_offset + input_axis_dim * after * j;
+        memcpy(dest, src, len);
+      }
+      input_offset += axis_dim * after;
+    }
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
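The copy loop in `SplitKernel::Compute` can be hard to follow from the pointer arithmetic alone. The sketch below mirrors the same `before` / `after` / `input_offset` bookkeeping in plain Python on a row-major flat list. `split_flat` is a hypothetical helper, and taking `sections` as an argument (rather than reading each output's dims) is an assumption made to keep the example self-contained.

```python
def split_flat(data, dims, axis, sections):
    """Split a row-major flat buffer of shape `dims` along `axis`."""
    # Product of dimensions before and after the split axis.
    before = 1
    for d in dims[:axis]:
        before *= d
    after = 1
    for d in dims[axis + 1:]:
        after *= d
    input_axis_dim = dims[axis]
    assert sum(sections) == input_axis_dim

    outs = []
    input_offset = 0
    for axis_dim in sections:
        chunk = axis_dim * after  # contiguous elements copied per outer slice
        out = []
        for j in range(before):
            start = input_offset + input_axis_dim * after * j
            out.extend(data[start:start + chunk])
        outs.append(out)
        input_offset += chunk
    return outs


# The example from the operator's DOC string: a 3x2 input, sections = [2, 1], axis = 0.
rows_split = split_flat([1, 2, 3, 4, 5, 6], [3, 2], 0, [2, 1])
# The same input split column-wise.
cols_split = split_flat([1, 2, 3, 4, 5, 6], [3, 2], 1, [1, 1])
```

Each output receives `before` contiguous runs of `axis_dim * after` elements, which is why the kernel can use one `memcpy` per outer slice.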
--- a/paddle/operators/squared_l2_distance_op.cc
+++ b/paddle/operators/squared_l2_distance_op.cc
@@ -23,12 +23,18 @@ class SquaredL2DistanceOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext& ctx) const override {
-    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"),
-                            "Input of SquaredL2DistanceOp "
-                            "must be initialized.");
-    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("Y"),
-                            "Target of SquaredL2DistanceOp "
-                            "must be initialized.");
+    PADDLE_ENFORCE_NOT_NULL(
+        ctx.InputVar("X"),
+        "Input(X) of SquaredL2DistanceOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(
+        ctx.InputVar("Y"),
+        "Input(Y) of SquaredL2DistanceOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(
+        ctx.OutputVar("sub_result"),
+        "Output(sub_result) of SquaredL2DistanceOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(
+        ctx.OutputVar("Out"),
+        "Output(Out) of SquaredL2DistanceOp should not be null.");

    auto* x = ctx.Input<Tensor>("X");
    auto x_dims = x->dims();

--- a/paddle/operators/sum_op.cc
+++ b/paddle/operators/sum_op.cc
@@ -22,6 +22,11 @@ class SumOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
+    PADDLE_ENFORCE(!ctx.MultiInputVar("X").empty(),
+                   "Input(X) of SumOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Out"),
+                            "Output(Out) of SumOp should not be null.");
+
    auto ins = ctx.MultiInput<framework::Tensor>("X");
    auto *out = ctx.Output<framework::LoDTensor>("Out");
    int N = ins.size();

--- a/paddle/operators/top_k_op.cc
+++ b/paddle/operators/top_k_op.cc
@@ -24,7 +24,12 @@ class TopkOp : public framework::OperatorWithKernel {
 protected:
  void InferShape(const framework::InferShapeContext &ctx) const override {
    PADDLE_ENFORCE_NOT_NULL(ctx.InputVar("X"),
-                            "Input of TopkOP must be initialized.");
+                            "Input(X) of TopkOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Out"),
+                            "Output(Out) of TopkOp should not be null.");
+    PADDLE_ENFORCE_NOT_NULL(ctx.OutputVar("Indices"),
+                            "Output(Indices) of TopkOp should not be null.");
+
    auto *input = ctx.Input<framework::Tensor>("X");
    const int k = static_cast<int>(ctx.Attr<int>("k"));


--- a/paddle/operators/uniform_random_op.cc
+++ b/paddle/operators/uniform_random_op.cc
@@ -48,6 +48,10 @@ class UniformRandomOp : public framework::OperatorWithKernel {

 protected:
  void InferShape(const framework::InferShapeContext& ctx) const override {
+    PADDLE_ENFORCE_NOT_NULL(
+        ctx.OutputVar("Out"),
+        "Output(Out) of UniformRandomOp should not be null.");
+
    PADDLE_ENFORCE(Attr<float>("min") < Attr<float>("max"),
                   "uniform_random's min must less then max");
    auto* tensor = ctx.Output<framework::LoDTensor>("Out");

--- a/paddle/platform/CMakeLists.txt
+++ b/paddle/platform/CMakeLists.txt
@@ -24,3 +24,4 @@ cc_library(device_context SRCS device_context.cc DEPS memory buddy_allocator
 nv_test(device_context_test SRCS device_context_test.cc DEPS device_context gpu_info)

 nv_test(cudnn_helper_test SRCS cudnn_helper_test.cc DEPS dynload_cuda)
+nv_test(transform_test SRCS transform_test.cu DEPS paddle_memory place)
--- a/paddle/platform/cuda_helper.h
+++ b/paddle/platform/cuda_helper.h
@@ -24,6 +24,11 @@ namespace platform {
 #define USE_CUDA_ATOMIC(op, T) \
  CUDA_ATOMIC_WRAPPER(op, T) { return atomic##op(address, val); }

+// Default thread count per block (or block size).
+// TODO(typhoonzero): need to benchmark against setting this value
+//                    to 1024.
+constexpr int PADDLE_CUDA_NUM_THREADS = 512;
+
 // For atomicAdd.
 USE_CUDA_ATOMIC(Add, float);


--- a/paddle/platform/details/device_ptr_cast.h
+++ b/paddle/platform/details/device_ptr_cast.h
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+
+#ifndef __NVCC__
+#error device_ptr_cast.h must be included by a .cu file
+#endif
+
+#include <thrust/device_ptr.h>
+#include <type_traits>
+
+namespace paddle {
+namespace platform {
+namespace details {
+template <typename T, bool is_ptr>
+struct DevicePtrCast;
+
+template <typename T>
+struct DevicePtrCast<T, true> {
+  using ELEM = typename std::remove_pointer<T>::type;
+  using RTYPE = thrust::device_ptr<ELEM>;
+
+  inline thrust::device_ptr<ELEM> operator()(ELEM* ele) const {
+    return thrust::device_pointer_cast(ele);
+  }
+};
+
+template <typename T>
+struct DevicePtrCast<T, false> {
+  using RTYPE = T;
+  inline RTYPE operator()(RTYPE it) const { return it; }
+};
+
+// Cast T to thrust::device_ptr if T is a pointer.
+// Otherwise, e.g., when T is an iterator, return T itself.
+template <typename T>
+auto DevPtrCast(T t) ->
+    typename DevicePtrCast<T, std::is_pointer<T>::value>::RTYPE {
+  DevicePtrCast<T, std::is_pointer<T>::value> cast;
+  return cast(t);
+}
+
+}  // namespace details
+}  // namespace platform
+}  // namespace paddle
--- a/paddle/platform/transform.h
+++ b/paddle/platform/transform.h
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#pragma once
+
+#include "paddle/platform/enforce.h"
+#include "paddle/platform/hostdevice.h"
+#include "paddle/platform/place.h"
+
+#include <algorithm>
+#include <type_traits>
+#ifdef __NVCC__
+#include <thrust/transform.h>
+#include "paddle/platform/details/device_ptr_cast.h"
+#endif
+
+namespace paddle {
+namespace platform {
+// Transform on host or device. It provides the same API as std::transform.
+template <typename Place, typename InputIter, typename OutputIter,
+          typename UnaryOperation>
+void Transform(Place place, InputIter first, InputIter last, OutputIter result,
+               UnaryOperation op) {
+  if (is_cpu_place(place)) {
+    std::transform(first, last, result, op);
+  } else {
+#ifdef __NVCC__
+    using namespace details;
+    thrust::transform(DevPtrCast(first), DevPtrCast(last), DevPtrCast(result),
+                      op);
+#else
+    PADDLE_THROW("Do not invoke `Transform<GPUPlace>` in .cc file");
+#endif
+  }
+}
+
+template <typename Place, typename InputIter1, typename InputIter2,
+          typename OutputIter, typename BinaryOperation>
+void Transform(Place place, InputIter1 first1, InputIter1 last1,
+               InputIter2 first2, OutputIter result, BinaryOperation op) {
+  if (is_cpu_place(place)) {
+    std::transform(first1, last1, first2, result, op);
+  } else {
+#ifdef __NVCC__
+    using namespace details;
+    thrust::transform(DevPtrCast(first1), DevPtrCast(last1), DevPtrCast(first2),
+                      DevPtrCast(result), op);
+#else
+    PADDLE_THROW("Do not invoke `Transform<GPUPlace>` in .cc file");
+#endif
+  }
+}
+
+}  // namespace platform
+}  // namespace paddle
--- a/paddle/platform/transform_test.cu
+++ b/paddle/platform/transform_test.cu
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+
+#include <gtest/gtest.h>
+#include "paddle/memory/memcpy.h"
+#include "paddle/memory/memory.h"
+#include "paddle/platform/transform.h"
+
+template <typename T>
+class Scale {
+ public:
+  explicit Scale(const T& scale) : scale_(scale) {}
+
+  HOSTDEVICE T operator()(const T& a) const { return a * scale_; }
+
+ private:
+  T scale_;
+};
+
+template <typename T>
+class Multiply {
+ public:
+  HOSTDEVICE T operator()(const T& a, const T& b) const { return a * b; }
+};
+
+TEST(Transform, CPUUnary) {
+  using namespace paddle::platform;
+  float buf[4] = {0.1, 0.2, 0.3, 0.4};
+  Transform(CPUPlace(), buf, buf + 4, buf, Scale<float>(10));
+  for (int i = 0; i < 4; ++i) {
+    ASSERT_NEAR(buf[i], static_cast<float>(i + 1), 1e-5);
+  }
+}
+
+TEST(Transform, GPUUnary) {
+  using namespace paddle::platform;
+  using namespace paddle::memory;
+  GPUPlace gpu0(0);
+  float cpu_buf[4] = {0.1, 0.2, 0.3, 0.4};
+  float* gpu_buf = static_cast<float*>(Alloc(gpu0, sizeof(float) * 4));
+  Copy(gpu0, gpu_buf, CPUPlace(), cpu_buf, sizeof(cpu_buf));
+  Transform(gpu0, gpu_buf, gpu_buf + 4, gpu_buf, Scale<float>(10));
+  Copy(CPUPlace(), cpu_buf, gpu0, gpu_buf, sizeof(cpu_buf));
+  Free(gpu0, gpu_buf);
+  for (int i = 0; i < 4; ++i) {
+    ASSERT_NEAR(cpu_buf[i], static_cast<float>(i + 1), 1e-5);
+  }
+}
+
+TEST(Transform, CPUBinary) {
+  using namespace paddle::platform;
+  using namespace paddle::memory;
+  int buf[4] = {1, 2, 3, 4};
+  Transform(CPUPlace(), buf, buf + 4, buf, buf, Multiply<int>());
+  for (int i = 0; i < 4; ++i) {
+    ASSERT_EQ((i + 1) * (i + 1), buf[i]);
+  }
+}
+
+TEST(Transform, GPUBinary) {
+  using namespace paddle::platform;
+  using namespace paddle::memory;
+  int buf[4] = {1, 2, 3, 4};
+  GPUPlace gpu0(0);
+  int* gpu_buf = static_cast<int*>(Alloc(gpu0, sizeof(buf)));
+  Copy(gpu0, gpu_buf, CPUPlace(), buf, sizeof(buf));
+  Transform(gpu0, gpu_buf, gpu_buf + 4, gpu_buf, gpu_buf, Multiply<int>());
+  Copy(CPUPlace(), buf, gpu0, gpu_buf, sizeof(buf));
+  Free(gpu0, gpu_buf);
+  for (int i = 0; i < 4; ++i) {
+    ASSERT_EQ((i + 1) * (i + 1), buf[i]);
+  }
+}
\ No newline at end of file
--- a/paddle/pybind/CMakeLists.txt
+++ b/paddle/pybind/CMakeLists.txt
 if(WITH_PYTHON)
-cc_library(paddle_pybind SHARED
+  cc_library(paddle_pybind SHARED
    SRCS pybind.cc
    DEPS pybind python backward
    ${GLOB_OP_LIB})

--- a/paddle/pybind/pybind.cc
+++ b/paddle/pybind/pybind.cc
@@ -19,6 +19,7 @@ limitations under the License. */
 #include "paddle/framework/backward.h"
 #include "paddle/framework/lod_tensor.h"
 #include "paddle/framework/op_registry.h"
+#include "paddle/operators/cond_op.h"
 #include "paddle/operators/net_op.h"
 #include "paddle/operators/recurrent_op.h"
 #include "paddle/platform/enforce.h"
@@ -288,6 +289,28 @@ All parameter, weight, gradient are variables in Paddle.
           [](operators::RecurrentOp &self, const operators::NetOp &net)
               -> void { self.set_stepnet(net.Clone()); });

+  // cond_op
+  py::class_<operators::CondOp, OperatorBase>(m, "CondOp")
+      .def_static("create",
+                  [](py::bytes protobin) -> operators::CondOp * {
+                    OpDesc desc;
+                    PADDLE_ENFORCE(desc.ParsePartialFromString(protobin),
+                                   "Cannot parse user input to OpDesc");
+                    PADDLE_ENFORCE(desc.IsInitialized(),
+                                   "User OpDesc is not initialized, reason %s",
+                                   desc.InitializationErrorString());
+                    auto cond_op = OpRegistry::CreateOp(desc);
+                    return static_cast<operators::CondOp *>(cond_op.release());
+                  })
+      .def("set_truenet",
+           [](operators::CondOp &self, const operators::NetOp &net) -> void {
+             self.set_truenet(net.Clone());
+           })
+      .def("set_falsenet",
+           [](operators::CondOp &self, const operators::NetOp &net) -> void {
+             self.set_falsenet(net.Clone());
+           });
+
  m.def("unique_integer", UniqueIntegerGenerator);

  m.def("is_compile_gpu", IsCompileGPU);

--- a/python/paddle/trainer_config_helpers/networks.py
+++ b/python/paddle/trainer_config_helpers/networks.py
--- a/python/paddle/v2/framework/op.py
+++ b/python/paddle/v2/framework/op.py
@@ -215,5 +215,27 @@ class __RecurrentOp__(object):
        return core.RecurrentOp.create(proto.SerializeToString())


+class __CondOp__(object):
+    __proto__ = None
+    type = "cond"
+
+    def __init__(self):
+        # cache cond_op's proto
+        if self.__proto__ is None:
+            for op_proto in get_all_op_protos():
+                if op_proto.type == self.type:
+                    self.__proto__ = op_proto
+
+    def __call__(self, *args, **kwargs):
+        if self.type not in args and "type" not in kwargs:
+            kwargs["type"] = self.type
+        # create proto
+        create_method = OpDescCreationMethod(self.__proto__)
+        proto = create_method(*args, **kwargs)
+        # create condop
+        return core.CondOp.create(proto.SerializeToString())
+
+
 Operator = OperatorFactory()  # The default global factory
 RecurrentOp = __RecurrentOp__()
+CondOp = __CondOp__()
--- a/python/paddle/v2/framework/tests/op_test.py
+++ b/python/paddle/v2/framework/tests/op_test.py
@@ -28,10 +28,10 @@ def create_op(scope, op_type, inputs, outputs, attrs):
        if out_name in outputs:
            kwargs[out_name] = []
            if out_dup:
-                sub_in = outputs[out_name]
-                for sub_in_name, _ in sub_in:
-                    var = scope.new_var(sub_in_name)
-                    kwargs[out_name].append(sub_in_name)
+                sub_out = outputs[out_name]
+                for sub_out_name, _ in sub_out:
+                    var = scope.new_var(sub_out_name)
+                    kwargs[out_name].append(sub_out_name)
            else:
                var = scope.new_var(out_name)
                kwargs[out_name].append(out_name)
@@ -39,6 +39,7 @@ def create_op(scope, op_type, inputs, outputs, attrs):
    for attr_name in Operator.get_op_attr_names(op_type):
        if attr_name in attrs:
            kwargs[attr_name] = attrs[attr_name]
+
    return Operator(op_type, **kwargs)


@@ -47,17 +48,24 @@ def set_input(scope, op, inputs, place):
        if in_name in inputs:
            if in_dup:
                sub_in = inputs[in_name]
-                for sub_in_name, sub_in_array in sub_in:
+                for sub_in_name, sub_in_val in sub_in:
                    var = scope.find_var(sub_in_name)
                    tensor = var.get_tensor()
+                    sub_in_array = sub_in_val[0] \
+                        if isinstance(sub_in_val, tuple) else sub_in_val
                    tensor.set_dims(sub_in_array.shape)
                    tensor.set(sub_in_array, place)
+                    if isinstance(sub_in_val, tuple):
+                        tensor.set_lod(sub_in_val[1])
            else:
                var = scope.find_var(in_name)
                tensor = var.get_tensor()
-                arr = inputs[in_name]
-                tensor.set_dims(arr.shape)
-                tensor.set(arr, place)
+                in_val = inputs[in_name]
+                in_array = in_val[0] if isinstance(in_val, tuple) else in_val
+                tensor.set_dims(in_array.shape)
+                tensor.set(in_array, place)
+                if isinstance(in_val, tuple):
+                    tensor.set_lod(in_val[1])


 def set_output_grad(scope, op, outputs, place):
@@ -172,8 +180,9 @@ class OpTest(unittest.TestCase):
    def check_output_with_place(self, place):
        self.scope = core.Scope()
        op_inputs = self.inputs if hasattr(self, "inputs") else dict()
+        op_outputs = self.outputs if hasattr(self, "outputs") else dict()
        op_attrs = self.attrs if hasattr(self, "attrs") else dict()
-        self.op = create_op(self.scope, self.op_type, op_inputs, self.outputs,
+        self.op = create_op(self.scope, self.op_type, op_inputs, op_outputs,
                            op_attrs)
        if isinstance(place, core.GPUPlace) and not self.op.support_gpu():
            return
@@ -185,21 +194,26 @@ class OpTest(unittest.TestCase):
        for out_name, out_dup in Operator.get_op_outputs(self.op.type()):
            if out_dup:
                sub_out = self.outputs[out_name]
-                for sub_out_name in sub_out:
+                if not isinstance(sub_out, list):
+                    raise AssertionError("sub_out type %s is not list" %
+                                         type(sub_out))
+
+                for sub_out_name, expect in sub_out:
                    actual = np.array(
                        self.scope.find_var(sub_out_name).get_tensor())
-                    expect = sub_out[sub_out_name]
                    self.assertTrue(
                        np.allclose(
                            actual, expect, atol=1e-05),
-                        "output name: " + out_name + "has diff")
+                        "output name: " + out_name + " has diff")
            else:
-                actual = np.array(self.scope.find_var(out_name).get_tensor())
-                expect = self.outputs[out_name]
-                self.assertTrue(
-                    np.allclose(
-                        actual, expect, atol=1e-05),
-                    "output name: " + out_name + "has diff")
+                var = self.scope.find_var(out_name)
+                if var is not None:
+                    actual = np.array(var.get_tensor())
+                    expect = self.outputs[out_name]
+                    self.assertTrue(
+                        np.allclose(
+                            actual, expect, atol=1e-05),
+                        "output name: " + out_name + " has diff")

    def check_output(self):
        places = [core.CPUPlace()]
@@ -234,8 +248,9 @@ class OpTest(unittest.TestCase):
                   max_relative_error=0.005):
        self.scope = core.Scope()
        op_inputs = self.inputs if hasattr(self, "inputs") else dict()
+        op_outputs = self.outputs if hasattr(self, "outputs") else dict()
        op_attrs = self.attrs if hasattr(self, "attrs") else dict()
-        self.op = create_op(self.scope, self.op_type, op_inputs, self.outputs,
+        self.op = create_op(self.scope, self.op_type, op_inputs, op_outputs,
                            op_attrs)
        if no_grad_set is None:
            no_grad_set = set()
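The `set_input` change above lets a test pass an input either as a bare array or as an `(array, lod)` tuple, in which case the LoD is also set on the tensor. A minimal sketch of that convention follows; `unpack_input` is a hypothetical helper that does not exist in `op_test.py`, and plain lists stand in for NumPy arrays.

```python
def unpack_input(value):
    """Return (array, lod) for an input given as `array` or `(array, lod)`."""
    if isinstance(value, tuple):
        # Tuple form: first element is the data, second is the LoD.
        return value[0], value[1]
    # Bare form: no LoD information attached.
    return value, None


# A bare input and a LoD input, as a test's `self.inputs` might hold them.
plain_array, plain_lod = unpack_input([1.0, 2.0, 3.0])
seq_array, seq_lod = unpack_input(([1.0, 2.0, 3.0], [[0, 2, 3]]))
```

Checking for `tuple` specifically keeps existing tests unchanged, since they pass arrays (or lists of `(name, array)` pairs for duplicable inputs) rather than tuples at this level.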

--- a/python/paddle/v2/framework/tests/test_accuracy_op.py
+++ b/python/paddle/v2/framework/tests/test_accuracy_op.py
@@ -6,16 +6,17 @@ from op_test import OpTest
 class TestAccuracyOp(OpTest):
    def setUp(self):
        self.op_type = "accuracy"
-        infer = np.random.randint(0, 2, (32, 1)).astype("int")
-        label = np.random.randint(0, 2, (32, )).astype("int")
+        n = 8192
+        infer = np.random.randint(0, 2, (n, 1)).astype("int")
+        label = np.random.randint(0, 2, (n, )).astype("int")
        self.inputs = {'Inference': infer, "Label": label}
        num_correct = 0
-        for rowid in xrange(32):
+        for rowid in xrange(n):
            for ele in infer[rowid]:
                if ele == label[rowid]:
                    num_correct += 1
                    break
-        self.outputs = {'Accuracy': [num_correct / 32.0]}
+        self.outputs = {'Accuracy': [num_correct / float(n)]}

    def test_check_output(self):
        self.check_output()

--- a/python/paddle/v2/framework/tests/test_add_two_op.py
+++ b/python/paddle/v2/framework/tests/test_add_two_op.py
--- a/python/paddle/v2/framework/tests/test_cond_op.py
+++ b/python/paddle/v2/framework/tests/test_cond_op.py
+import logging
+import paddle.v2.framework.core as core
+import unittest
+import numpy as np
+from paddle.v2.framework.op import Operator, CondOp
+
+
+class PySimpleCond(object):
+    '''
+    A simple implementation of dynamic if-else based on numpy
+    '''
+
+    def __init__(self):
+        array = [1] * 10
+        for i in range(1, 10, 2):
+            array[i] = 0
+        self.cond = np.array(array)
+        self.x = np.ones(shape=(10, 1))
+
+    def forward(self):
+        self.index_t = np.where(self.cond == 1)
+        self.index_f = np.where(self.cond == 0)
+        y_t = self.x[self.index_t]
+        y_f = self.x[self.index_f]
+        y_t = y_t * 2.
+        y_f = y_f * (-2.)
+        output = np.zeros(shape=(10, 1))
+        output[self.index_t] = y_t
+        output[self.index_f] = y_f
+        return output
+
+
+class PySimpleCondTest(unittest.TestCase):
+    def setUp(self):
+        self.condnn = PySimpleCond()
+
+    def test_forward(self):
+        output = self.condnn.forward()
+
+
+def create_tensor(scope, name, shape, np_data):
+    tensor = scope.new_var(name).get_tensor()
+    tensor.set_dims(shape)
+    tensor.set(np_data, core.CPUPlace())
+    return tensor
+
+
+class TestCondOp(unittest.TestCase):
+    '''
+    Test CondOp
+
+    equation:
+        cond = [True, False, True, False, ...]
+        y[index_t] = x[index_t] * 2.
+        y[index_f] = x[index_f] * -2.
+    outputs:
+        y
+    '''
+
+    def setUp(self):
+        self.py_cond = PySimpleCond()
+
+    def forward(self):
+        self.scope = core.Scope()
+        self.create_global_variables()
+        self.create_cond_op()
+        self.create_sub_net()
+        ctx = core.DeviceContext.create(core.CPUPlace())
+        self.condop.infer_shape(self.scope)
+        self.condop.run(self.scope, ctx)
+        return np.array(self.scope.find_var("Out").get_tensor())
+
+    def create_global_variables(self):
+        x_np_data = self.py_cond.x
+        create_tensor(self.scope, "X", [10, 1], x_np_data)
+        cond_np_data = self.py_cond.cond.astype("int32")
+        create_tensor(self.scope, "cond", [10, 1], cond_np_data)
+        self.scope.new_var("SubScopes")
+        self.scope.new_var("IndexTensors")
+        self.scope.new_var("Out")
+
+    def create_cond_op(self):
+        self.condop = CondOp(
+            Cond="cond",
+            Xs=["X"],
+            Outs=["Out"],
+            SubScopes="SubScopes",
+            IndexTensors="IndexTensors")
+
+    def create_sub_net(self):
+        truenet = core.Net.create()
+        scale_op_t = Operator("scale", X='X', Out='Out', scale=2.)
+        truenet.append_op(scale_op_t)
+        truenet.complete_add_op(True)
+        self.condop.set_truenet(truenet)
+
+        falsenet = core.Net.create()
+        scale_op_f = Operator("scale", X='X', Out='Out', scale=-2.)
+        falsenet.append_op(scale_op_f)
+        falsenet.complete_add_op(True)
+        self.condop.set_falsenet(falsenet)
+
+    def test_forward(self):
+        print 'test cond op forward'
+        pd_output = self.forward()
+        py_output = self.py_cond.forward()
+        print 'pd_output', pd_output
+        print
+        print 'py_output', py_output
+        self.assertEqual(pd_output.shape, py_output.shape)
+        print 'test passed'
+        return 0
+
+
+if __name__ == "__main__":
+    unittest.main()
--- a/python/paddle/v2/framework/tests/test_fc_op.py
+++ b/python/paddle/v2/framework/tests/test_fc_op.py
+import unittest
+import numpy as np
+from op_test import OpTest
+
+
+class TestFCOp1(OpTest):
+    def setUp(self):
+        x0 = np.random.random((16, 32)).astype("float32")
+        w0 = np.random.random((32, 10)).astype("float32")
+
+        mul_out0 = np.dot(x0, w0)
+        identity_out = mul_out0
+
+        self.op_type = "fc"
+        self.inputs = {"X": [("X0", x0)], "W": [("W0", w0)]}
+        self.outputs = {"MulOut": [("MulOut0", mul_out0)], "Out": identity_out}
+
+    def test_check_output(self):
+        self.check_output()
+
+    def test_check_grad(self):
+        self.check_grad(["X0", "W0"], "Out", max_relative_error=0.01)
+
+
+class TestFCOp2(OpTest):
+    def setUp(self):
+        x0 = np.random.random((16, 4, 8)).astype("float32")
+        x1 = np.random.random((4, 4, 32)).astype("float32")
+        w0 = np.random.random((32, 10)).astype("float32")
+        w1 = np.random.random((32, 10)).astype("float32")
+        b = np.random.random(10).astype("float32")
+
+        mul_out0 = np.dot(x0.reshape(16, 4 * 8), w0)
+        mul_out1 = np.dot(x1.reshape(4 * 4, 32), w1)
+        sum_out = mul_out0 + mul_out1
+        add_out = np.add(sum_out, b)
+        sigmoid_out = 1 / (1 + np.exp(-add_out))
+
+        self.op_type = "fc"
+        self.inputs = {
+            "X": [("X0", x0), ("X1", x1)],
+            "W": [("W0", w0), ("W1", w1)],
+            "B": b
+        }
+        self.attrs = {"xNumColDims": [1, 2], "activation": "sigmoid"}
+        self.outputs = {
+            "MulOut": [("MulOut0", mul_out0), ("MulOut1", mul_out1)],
+            "SumOut": sum_out,
+            "AddOut": add_out,
+            "Out": sigmoid_out
+        }
+
+    def test_check_output(self):
+        self.check_output()
+
+    def test_check_grad(self):
+        self.check_grad(
+            ["X0", "X1", "W0", "W1", "B"], "Out", max_relative_error=0.01)
+
+
+if __name__ == '__main__':
+    unittest.main()
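In `TestFCOp2` the reference outputs reshape `x0` of shape `(16, 4, 8)` to `(16, 4 * 8)` and `x1` of shape `(4, 4, 32)` to `(4 * 4, 32)`, with `xNumColDims` set to `[1, 2]`. This suggests that each entry counts how many leading dimensions are folded into the row dimension before the matrix multiply. The sketch below encodes that reading; it is an assumption inferred from the test, not taken from the operator's documentation, and `flatten_to_2d` is a hypothetical name.

```python
def flatten_to_2d(shape, num_col_dims):
    """Fold the first `num_col_dims` dims into rows and the rest into columns."""
    rows = 1
    for d in shape[:num_col_dims]:
        rows *= d
    cols = 1
    for d in shape[num_col_dims:]:
        cols *= d
    return rows, cols


# Shapes used by TestFCOp2 with xNumColDims = [1, 2].
x0_2d = flatten_to_2d((16, 4, 8), 1)
x1_2d = flatten_to_2d((4, 4, 32), 2)
```

Under this reading both inputs become 16 x 32 matrices, which matches the `(32, 10)` weights and lets the two products be summed.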
--- a/python/paddle/v2/framework/tests/test_gaussian_random_op.py
+++ b/python/paddle/v2/framework/tests/test_gaussian_random_op.py
@@ -4,7 +4,7 @@ from paddle.v2.framework.op import Operator
 import numpy


-class GaussianRandomTest(unittest.TestCase):
+class TestGaussianRandomOp(unittest.TestCase):
    def test_cpu(self):
        self.gaussian_random_test(place=core.CPUPlace())


--- a/python/paddle/v2/framework/tests/test_identity_op.py
+++ b/python/paddle/v2/framework/tests/test_identity_op.py
+import unittest
+import numpy as np
+from op_test import OpTest
+
+
+class TestIdentityOp(OpTest):
+    def setUp(self):
+        self.op_type = "identity"
+        self.inputs = {'X': np.random.random((10, 10)).astype("float32")}
+        self.outputs = {'Y': self.inputs['X']}
+
+    def test_check_output(self):
+        self.check_output()
+
+    def test_check_grad(self):
+        self.check_grad(['X'], 'Y')
+
+
+if __name__ == "__main__":
+    unittest.main()
--- a/python/paddle/v2/framework/tests/test_lookup_table.py
+++ b/python/paddle/v2/framework/tests/test_lookup_table.py
--- a/python/paddle/v2/framework/tests/test_minus_op.py
+++ b/python/paddle/v2/framework/tests/test_minus_op.py
@@ -3,7 +3,7 @@ import numpy as np
 from op_test import OpTest


-class MinusOpTest(OpTest):
+class TestMinusOp(OpTest):
    def setUp(self):
        self.op_type = "minus"
        self.inputs = {

--- a/python/paddle/v2/framework/tests/test_cross_entropy_op.py
+++ b/python/paddle/v2/framework/tests/test_cross_entropy_op.py
@@ -3,7 +3,7 @@ import numpy
 from op_test import OpTest


-class TestCrossEntropy(OpTest):
+class TestOnehotCrossEntropyOp(OpTest):
    def setUp(self):
        self.op_type = "onehot_cross_entropy"
        batch_size = 30

--- a/python/paddle/v2/framework/tests/test_scale_and_identity_op.py
+++ b/python/paddle/v2/framework/tests/test_scale_and_identity_op.py
@@ -3,20 +3,7 @@ import numpy as np
 from op_test import OpTest


-class IdentityTest(OpTest):
-    def setUp(self):
-        self.op_type = "identity"
-        self.inputs = {'X': np.random.random((10, 10)).astype("float32")}
-        self.outputs = {'Out': self.inputs['X']}
-
-    def test_check_output(self):
-        self.check_output()
-
-    def test_check_grad(self):
-        self.check_grad(['X'], 'Out')
-
-
-class ScaleTest(OpTest):
+class TestScaleOp(OpTest):
    def setUp(self):
        self.op_type = "scale"
        self.inputs = {'X': np.random.random((10, 10)).astype("float32")}

--- a/python/paddle/v2/framework/tests/test_seq_pool.py
+++ b/python/paddle/v2/framework/tests/test_seq_pool.py
+import unittest
+import numpy as np
+from op_test import OpTest
+
+
+class TestSeqAvgPool1D(OpTest):
+    def setUp(self):
+        self.op_type = 'sequence_avg_pool'
+        # one level, batch size is 4
+        x = np.random.uniform(0.1, 1, [11, 23]).astype('float32')
+        lod = [[0, 4, 5, 8, 11]]
+
+        out = np.zeros((4, 23)).astype('float32')
+        for i in range(4):
+            sub_x = x[lod[0][i]:lod[0][i + 1], :]
+            out[i] = sub_x.mean(axis=0)
+
+        self.inputs = {'X': (x, lod)}
+        self.outputs = {'Out': out}
+
+    def test_check_output(self):
+        self.check_output()
+
+    def test_check_grad(self):
+        self.check_grad(["X"], "Out")
+
+
+class TestSeqAvgPool2D(OpTest):
+    def setUp(self):
+        self.op_type = 'sequence_avg_pool'
+        # one-level LoD: 13 rows split into 4 sequences (lengths 4, 1, 3, 5)
+        x = np.random.uniform(0.1, 1, [13, 3, 17]).astype('float32')
+        lod = [[0, 4, 5, 8, 13]]
+
+        out = np.zeros((4, 3, 17)).astype('float32')
+        for i in range(4):
+            sub_x = np.reshape(x[lod[0][i]:lod[0][i + 1], :], (-1, 3 * 17))
+            out[i] = np.reshape(sub_x.mean(axis=0), (3, 17))
+
+        self.inputs = {'X': (x, lod)}
+        self.outputs = {'Out': out}
+
+    def test_check_output(self):
+        self.check_output()
+
+    def test_check_grad(self):
+        self.check_grad(["X"], "Out")
+
+
+if __name__ == '__main__':
+    unittest.main()
--- a/python/paddle/v2/framework/tests/test_sgd_op.py
+++ b/python/paddle/v2/framework/tests/test_sgd_op.py
@@ -3,7 +3,7 @@ import numpy as np
 from op_test import OpTest


-class TestSGD(OpTest):
+class TestSGDOp(OpTest):
    def setUp(self):
        self.op_type = "sgd"
        w = np.random.random((102, 105)).astype("float32")

--- a/python/paddle/v2/framework/tests/test_sigmoid_op.py
+++ b/python/paddle/v2/framework/tests/test_sigmoid_op.py
@@ -3,7 +3,7 @@ import numpy as np
 from op_test import OpTest


-class TestSigmoid(OpTest):
+class TestSigmoidOp(OpTest):
    def setUp(self):
        self.op_type = "sigmoid"
        self.inputs = {

--- a/python/paddle/v2/framework/tests/test_split_op.py
+++ b/python/paddle/v2/framework/tests/test_split_op.py
+import unittest
+import numpy as np
+from op_test import OpTest
+
+
+class TestSplitOp(OpTest):
+    def setUp(self):
+        self.op_type = "split"
+        axis = 0
+        num = 2
+        x = np.random.random((4, 2)).astype('float32')
+        out = np.split(x, num, axis)
+        self.inputs = {'X': x}
+        self.attrs = {'axis': axis, 'num': num}
+        self.outputs = {
+            'Out': [('out%d' % i, out[i]) for i in range(len(out))]
+        }
+
+    def test_check_output(self):
+        self.check_output()
+
+    def test_check_grad(self):
+        self.check_grad(['X'], ['out0', 'out1'])
+
+
+if __name__ == '__main__':
+    unittest.main()
--- a/python/paddle/v2/framework/tests/test_top_k_op.py
+++ b/python/paddle/v2/framework/tests/test_top_k_op.py
@@ -21,6 +21,9 @@ class TestTopkOp(OpTest):

        self.outputs = {'Out': output, 'Indices': indices}

+    def test_check_output(self):
+        self.check_output()
+

 class TestTopkOp3d(OpTest):
    def setUp(self):
@@ -42,6 +45,9 @@ class TestTopkOp3d(OpTest):

        self.outputs = {'Out': output, 'Indices': indices}

+    def test_check_output(self):
+        self.check_output()
+

 if __name__ == "__main__":
    unittest.main()
--- a/python/paddle/v2/framework/tests/test_uniform_random_op.py
+++ b/python/paddle/v2/framework/tests/test_uniform_random_op.py
@@ -4,7 +4,7 @@ import paddle.v2.framework.core as core
 import numpy


-class UniformRandomTest(unittest.TestCase):
+class TestUniformRandomOp(unittest.TestCase):
    def test_uniform_random_cpu(self):
        self.uniform_random_test(place=core.CPUPlace())