Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into lstm_fix

03bfd761 · dangqingqing · 1f53a72f · a3435044 · 03bfd761 · 03bfd761
76 changed files
--- a/benchmark/IntelOptimizedPaddle.md
+++ b/benchmark/IntelOptimizedPaddle.md
+# Benchmark
+Machine:
+- Server
+ 	- Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, 2 Sockets, 20 Cores per socket
+- Laptop
+ 	- DELL XPS15-9560-R1745: i7-7700HQ 8G 256GSSD
+ 	- i5 MacBook Pro (Retina, 13-inch, Early 2015)
+- Desktop
+ 	- i7-6700k
+System: CentOS release 6.3 (Final), Docker 1.12.1.
+PaddlePaddle: paddlepaddle/paddle:latest (TODO: will rerun after 0.11.0)
+- MKL-DNN tag v0.10
+- MKLML 2018.0.20170720
+- OpenBLAS v0.2.20
+On each machine, we test and compare single-node training performance using MKL-DNN, MKLML, and OpenBLAS respectively.
+## Benchmark Model
+### Server
+Test on batch size 64, 128, 256 on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
+Input image size - 3 * 224 * 224, Throughput: images/second
+- VGG-19
+| BatchSize    | 64    | 128   | 256    |
+|--------------|-------|-------|--------|
+| OpenBLAS     | 7.82  | 8.62  | 10.34  |
+| MKLML        | 11.02 | 12.86 | 15.33  |
+| MKL-DNN      | 27.69 | 28.8  | 29.27  |
+chart on batch size 128
+TBD
+ - ResNet
+ - GoogLeNet
+### Laptop
+TBD
+### Desktop
+TBD
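
Reviewer note: the VGG-19 table reports raw throughput only. A minimal sketch of how the relative speedups over OpenBLAS could be derived from those numbers is shown below. The struct and helper names (`Vgg19Row`, `Speedup`, `PrintSpeedups`) are hypothetical and not part of PaddlePaddle; the figures are copied from the table.

```cpp
#include <cassert>
#include <cstdio>

// Throughput figures (images/second) copied from the VGG-19 table.
struct Vgg19Row {
  int batch_size;
  double openblas;
  double mklml;
  double mkldnn;
};

constexpr Vgg19Row kVgg19[] = {
    {64, 7.82, 11.02, 27.69},
    {128, 8.62, 12.86, 28.80},
    {256, 10.34, 15.33, 29.27},
};

// Ratio of a candidate throughput to the baseline throughput.
// Hypothetical helper for illustration only.
inline double Speedup(double baseline, double candidate) {
  assert(baseline > 0.0);
  return candidate / baseline;
}

// Prints one line per batch size with the speedups over OpenBLAS.
inline void PrintSpeedups() {
  for (const Vgg19Row& row : kVgg19) {
    std::printf("batch %d: MKLML %.2fx, MKL-DNN %.2fx\n", row.batch_size,
                Speedup(row.openblas, row.mklml),
                Speedup(row.openblas, row.mkldnn));
  }
}
```

Calling `PrintSpeedups()` from a test or a small driver prints one ratio pair per batch size, which may be easier to compare across machines than the raw images/second values.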
--- a/doc/howto/cross_compiling/cross_compiling_for_ios_cn.md
+++ b/doc/howto/cross_compiling/cross_compiling_for_ios_cn.md
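+# Building the PaddlePaddle Library for iOS
+Cross-compiling the PaddlePaddle library for iOS must be done on macOS. This document describes how to cross-compile the PaddlePaddle library for iOS from source on macOS.
+## Preparing the cross-compilation environment
+Apple provides a complete cross-compilation toolchain and integrated development environment for iOS development; users only need to download and install Xcode from the App Store. It can also be downloaded from the official website: [Xcode](https://developer.apple.com/cn/xcode/). After installation, run `xcodebuild -version` on the command line to check whether the installation succeeded.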
+```bash
+$ xcodebuild -version
+Xcode 9.0
+Build version 9A235
+```
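+## Configuring the cross-compilation parameters
+PaddlePaddle provides a toolchain configuration file for cross-compilation, [cmake/cross_compiling/ios.cmake](https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cross_compiling/ios.cmake), which supplies default compiler and build-parameter settings.
+Some parameters must be configured when cross-compiling the iOS version of the PaddlePaddle library:
+- `CMAKE_SYSTEM_NAME`: the target platform for the CMake build; must be set to `iOS`. Once `CMAKE_SYSTEM_NAME=iOS` is set, PaddlePaddle's CMake system automatically builds all third-party dependencies and forces the values of some PaddlePaddle parameters (`WITH_C_API=ON`, `WITH_GPU=OFF`, `WITH_AVX=OFF`, `WITH_PYTHON=OFF`, `WITH_RDMA=OFF`).
+- `WITH_C_API`: whether to build the C-API inference library; must be set to `ON`. On iOS, inference is supported only through the C-API.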
+- `WITH_SWIG_PY`: must be set to `OFF`. Training or inference through swig calls is not supported on iOS.
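+Optional parameters for the iOS platform:
+- `IOS_PLATFORM`: can be set to `OS/SIMULATOR`; the default is `OS`.
+  - `OS`: build for `arm` physical devices such as an iPhone or iPad.
+  - `SIMULATOR`: build for the `x86` simulator platform.
+- `IOS_ARCH`: the target architecture. The architectures available for each `IOS_PLATFORM` are listed in the following table:
+   | IOS_PLATFORM | IOS_ARCH                       |
+   |--------------|--------------------------------|
+   |   OS         | armv7, armv7s, arm64 (default) |
+   | SIMULATOR    | i386, x86_64 (default)         |
+- `IOS_DEPLOYMENT_TARGET`: the minimum iOS deployment version; the default is `7.0`.
+- `IOS_ENABLE_BITCODE`: whether to enable [Bitcode](https://developer.apple.com/library/content/documentation/IDEs/Conceptual/AppDistributionGuide/AppThinning/AppThinning.html#//apple_ref/doc/uid/TP40012582-CH35-SW3); can be set to `ON/OFF`; the default is `ON`.
+- `IOS_USE_VECLIB_FOR_BLAS`: whether to use the [vecLib](https://developer.apple.com/documentation/accelerate/veclib) framework for BLAS matrix computation; can be set to `ON/OFF`; the default is `OFF`.
+- `IOS_DEVELOPMENT_ROOT`: the `Developer` directory; can be set explicitly to `/path/to/platform/Developer`. If it is not set explicitly, PaddlePaddle automatically selects the `Developer` directory of the `Xcode` `platform` that corresponds to `IOS_PLATFORM`.
+- `IOS_SDK_ROOT`: the root directory of the `SDK` to use; can be set explicitly to `/path/to/platform/Developer/SDKs/SDK`. If it is not set explicitly, PaddlePaddle automatically selects the latest `SDK` version under the `IOS_DEVELOPMENT_ROOT` directory.
+Other parameters:
+- `USE_EIGEN_FOR_BLAS`: whether to use the Eigen library for matrix computation; takes effect only when `IOS_USE_VECLIB_FOR_BLAS=OFF`. Can be set to `ON/OFF`; the default is `OFF`.
+- `HOST_C/CXX_COMPILER`: the C/C++ compiler of the host machine. The default is the value of the environment variables `CC/CXX`; if `CC/CXX` are not set, the `cc/c++` compilers are used.
+Commonly used cmake configurations are as follows: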
+```bash
+cmake -DCMAKE_SYSTEM_NAME=iOS \
+      -DIOS_PLATFORM=OS \
+      -DIOS_ARCH="arm64" \
+      -DIOS_ENABLE_BITCODE=ON \
+      -DIOS_USE_VECLIB_FOR_BLAS=ON \
+      -DCMAKE_INSTALL_PREFIX=your/path/to/install \
+      -DWITH_C_API=ON \
+      -DWITH_TESTING=OFF \
+      -DWITH_SWIG_PY=OFF \
+      ..
+```
+```bash
+cmake -DCMAKE_SYSTEM_NAME=iOS \
+      -DIOS_PLATFORM=SIMULATOR \
+      -DIOS_ARCH="x86_64" \
+      -DIOS_USE_VECLIB_FOR_BLAS=ON \
+      -DCMAKE_INSTALL_PREFIX=your/path/to/install \
+      -DWITH_C_API=ON \
+      -DWITH_TESTING=OFF \
+      -DWITH_SWIG_PY=OFF \
+      ..
+```
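+Users can also set other build parameters as needed. For example, to minimize the size of the generated library, set `CMAKE_BUILD_TYPE` to `MinSizeRel`; for the fastest execution speed, set `CMAKE_BUILD_TYPE` to `Release`. The build can also be influenced by setting `CMAKE_C/CXX_FLAGS` manually.
+**Performance tips**: for the fastest computation speed, the following CMake settings are recommended:
+- Set `CMAKE_BUILD_TYPE` to `Release`.
+- Set `IOS_USE_VECLIB_FOR_BLAS=ON` to call the BLAS functions provided by the `vecLib` framework for matrix computation.
+## Building and installing
+After the CMake configuration is complete, run the following commands. PaddlePaddle will automatically download and build all third-party dependencies, then build and install the PaddlePaddle inference library.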
+```
+$ make
+$ make install
+```
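+Note: if you have previously built the PaddlePaddle library for another platform in the source directory, first delete the `third_party` and `build` directories with `rm -rf` to ensure that all third-party dependencies and PaddlePaddle code are rebuilt for the new CMake configuration.
+After the install command finishes, the `your/path/to/install` directory contains the following:
+- the `include` directory, which contains all C-API header files
+- the `lib` directory, which contains the PaddlePaddle C-API static library
+- the `third_party` directory, which contains all third-party libraries that PaddlePaddle depends on
+Note: it is recommended to install the PaddlePaddle libraries for different architectures into different directories, and then use the `lipo` tool to merge the static libraries into a single fat library that supports multiple architectures.
+At this point the PaddlePaddle library is installed. Users can use the merged fat library in deep-learning iOS apps; see the C-API documentation for how to call it.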
--- a/doc/howto/cross_compiling/cross_compiling_for_raspberry_cn.md
+++ b/doc/howto/cross_compiling/cross_compiling_for_raspberry_cn.md
@@ -59,4 +59,4 @@ make install
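 Note: if you have previously built the PaddlePaddle library for another platform in the source directory, first delete the `third_party` and `build` directories with `rm -rf` to ensure that all third-party dependencies and PaddlePaddle code are rebuilt for the new CMake configuration.
-After the install command finishes,, the `your/path/to/install` directory contains the `include` and `lib` directories: `include` holds the C-API header files, and `lib` holds a Raspberry Pi version of the library.
+After the install command finishes, the `your/path/to/install` directory contains the `include` and `lib` directories: `include` holds the C-API header files, and `lib` holds a Raspberry Pi version of the library.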
--- a/doc/howto/cross_compiling/cross_compiling_for_raspberry_en.md
+++ b/doc/howto/cross_compiling/cross_compiling_for_raspberry_en.md
@@ -44,7 +44,7 @@ cmake -DCMAKE_SYSTEM_NAME=RPi \
      ..
 ```
-To build the inference library, please set the argument WITH_API to ON: `WITH_C_API=ON`.
+To build the inference library, please set the argument WITH\_C\_API to ON: `WITH_C_API=ON`.
 You can add more arguments. For example, to minimize the size of the generated inference library, you may use `CMAKE_BUILD_TYPE=MinSizeRel`. For performance optimization, you may use `CMAKE_BUILD_TYPE=Release`.

--- a/paddle/framework/attribute.cc
+++ b/paddle/framework/attribute.cc
@@ -19,7 +19,7 @@ limitations under the License. */
 namespace paddle {
 namespace framework {
-Attribute GetAttrValue(const OpDesc::Attr& attr_desc, ProgramDesc* program) {
+Attribute GetAttrValue(const OpDesc::Attr& attr_desc) {
  switch (attr_desc.type()) {
    case framework::AttrType::BOOLEAN: {
      return attr_desc.b();
@@ -61,13 +61,9 @@ Attribute GetAttrValue(const OpDesc::Attr& attr_desc, ProgramDesc* program) {
      }
      return val;
    }
-    case framework::AttrType::BLOCK: {
+    default:
-      PADDLE_ENFORCE(program != nullptr,
+      PADDLE_THROW("Unsupported attr type %d", attr_desc.type());
-                     "Need to specify ProgramDesc when get a block attr");
-      return program->mutable_blocks(attr_desc.block_idx());
-    }
  }
-  PADDLE_ENFORCE(false, "Unknown OpDesc::AttrDesc::type !");
  return boost::blank();
 }

--- a/paddle/framework/attribute.h
+++ b/paddle/framework/attribute.h
@@ -32,7 +32,7 @@ inline AttrType AttrTypeID() {
  return static_cast<AttrType>(tmp.which() - 1);
 }
-Attribute GetAttrValue(const OpDesc::Attr& attr_desc, ProgramDesc* desc);
+Attribute GetAttrValue(const OpDesc::Attr& attr_desc);
 class AttrReader {
 public:

--- a/paddle/framework/backward.cc
+++ b/paddle/framework/backward.cc
@@ -18,6 +18,7 @@
 #include <deque>
 #include <list>
 #include <memory>
+#include <unordered_set>
 #include "paddle/framework/block_desc.h"
 #include "paddle/framework/op_registry.h"
@@ -285,6 +286,15 @@ static bool AllGradInSet(const std::vector<std::string>& names,
  return true;
 }
+static std::string FwdName(const std::string& grad_name) {
+  auto pos = grad_name.find("@GRAD");
+  if (pos == std::string::npos) {
+    return "";
+  } else {
+    return grad_name.substr(0, pos);
+  }
+}
 static void CreateGradVarInBlock(
    size_t grad_op_start_index,
    const std::unordered_map<std::string, std::string>& param_name_map,
@@ -294,6 +304,7 @@ static void CreateGradVarInBlock(
  for (size_t op_index = grad_op_start_index; op_index < ops.size();
       ++op_index) {
    bool need_infer_shape = false;
+    std::unordered_set<std::string> new_vars;
    ForEachVarName(ops[op_index]->Outputs(),
                   [&](const std::string& grad_var_name) {
                     if (block_desc->HasVar(grad_var_name)) {
@@ -301,8 +312,7 @@ static void CreateGradVarInBlock(
                     }
                     need_infer_shape = true;
                     auto var = block_desc->Var(grad_var_name);
-                     // FIXME(qiao) infer the datatype
+                     new_vars.insert(var->Name());
-                     var->SetDataType(framework::DataType::FP32);
                     auto it = param_name_map.find(grad_var_name);
                     if (it == param_name_map.end()) {
                       return false;
@@ -316,6 +326,21 @@ static void CreateGradVarInBlock(
                   });
    if (need_infer_shape) {
      ops[op_index]->InferVarType(block_desc);
+      for (auto& arg : ops[op_index]->OutputArgumentNames()) {
+        if (new_vars.find(arg) == new_vars.end()) {
+          continue;
+        }
+        auto pname = FwdName(arg);
+        auto* param = block_desc->FindVar(pname);
+        auto* grad = block_desc->FindVar(arg);
+        if (param == nullptr) {
+          LOG(WARNING) << "Cannot find forward variable of " << arg
+                       << ". Set its gradient to FP32";
+          grad->SetDataType(DataType::FP32);
+        } else {
+          grad->SetDataType(param->GetDataType());
+        }
+      }
      ops[op_index]->InferShape(*block_desc);
    }
  }
@@ -368,7 +393,7 @@ std::vector<std::unique_ptr<OpDescBind>> MakeBlockBackward(
    ProgramDescBind& program_desc, int block_idx,
    std::unordered_set<std::string>* no_grad_vars,
    std::unordered_map<std::string, std::string>* grad_to_var) {
-  BlockDescBind* cur_block = program_desc.Block(block_idx);
+  BlockDescBind* cur_block = program_desc.MutableBlock(block_idx);
  std::vector<OpDescBind*> op_descs = cur_block->AllOps();
  std::unordered_map<std::string, std::vector<size_t>> dup_out_ops;
  size_t grad_desc_idx = 0;
@@ -443,7 +468,7 @@ ParamGradInfoMap AppendBackward(
  }
  const int root_block_idx = 0;
-  auto root_block = program_desc.Block(root_block_idx);
+  auto root_block = program_desc.MutableBlock(root_block_idx);
  // insert fill one op for target
  // TODO(qiao) add some check to the target.
@@ -492,7 +517,7 @@ ParamGradInfoMap AppendBackward(
  CreateGradVarInBlock(forward_op_num, grad_to_var, root_block, &retv);
  for (size_t block_index = forward_block_num;
       block_index < program_desc.Size(); ++block_index) {
-    CreateGradVarInBlock(0, grad_to_var, program_desc.Block(block_index),
+    CreateGradVarInBlock(0, grad_to_var, program_desc.MutableBlock(block_index),
                         &retv);
  }
  return retv;
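
Reviewer note: the new gradient-dtype logic in `CreateGradVarInBlock` can be illustrated outside the framework. The sketch below reuses the `FwdName` logic from this hunk; `DataType` is a two-value stand-in for `framework::DataType`, and `GradDataType` with its plain map is a hypothetical replacement for the `BlockDescBind::FindVar` lookup.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Stand-in for framework::DataType; only the two values used here.
enum class DataType { FP32, FP64 };

// Same logic as FwdName in the diff: keep everything before the first
// "@GRAD". Returns an empty string when the name has no "@GRAD" marker.
inline std::string FwdName(const std::string& grad_name) {
  const auto pos = grad_name.find("@GRAD");
  return pos == std::string::npos ? std::string() : grad_name.substr(0, pos);
}

// Hypothetical helper: the map stands in for the block's variable table.
// A gradient inherits the forward variable's type, or FP32 if the forward
// variable cannot be found.
inline DataType GradDataType(
    const std::unordered_map<std::string, DataType>& forward_types,
    const std::string& grad_name) {
  assert(!grad_name.empty());
  const auto it = forward_types.find(FwdName(grad_name));
  return it == forward_types.end() ? DataType::FP32 : it->second;
}
```

With a table containing only `w` as `FP64`, `GradDataType(table, "w@GRAD")` yields `FP64`, while a gradient whose forward variable is missing falls back to `FP32`, matching the warning branch in the diff.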

--- a/paddle/framework/backward_test.cc
+++ b/paddle/framework/backward_test.cc
@@ -499,7 +499,7 @@ TEST(Backward, linear_net_intermediate_variable_has_no_grad) {
 TEST(Backward, simple_single_op) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  f::OpDescBind *op = block->AppendOp();
  op->SetType("rowwise_add");
@@ -535,7 +535,7 @@ TEST(Backward, simple_single_op) {
 TEST(Backward, default_attribute) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  f::OpDescBind *op = block->AppendOp();
  op->SetType("mul");
  op->SetInput("X", {"x"});
@@ -561,7 +561,7 @@ TEST(Backward, default_attribute) {
 TEST(Backward, simple_mult_op) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  f::OpDescBind *op1 = block->AppendOp();
  op1->SetType("rowwise_add");
  op1->SetInput("X", {"x1"});
@@ -644,7 +644,7 @@ TEST(Backward, simple_mult_op) {
 TEST(Backward, intermedia_var_no_grad) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  f::OpDescBind *op1 = block->AppendOp();
  op1->SetType("rowwise_add");
  op1->SetInput("X", {"x1"});
@@ -714,7 +714,7 @@ TEST(Backward, intermedia_var_no_grad) {
 TEST(Backward, var_no_grad) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  f::OpDescBind *op1 = block->AppendOp();
  op1->SetType("mult_in_out");
  op1->SetInput("X", {"x1"});
@@ -790,7 +790,7 @@ TEST(Backward, var_no_grad) {
 TEST(Backward, shared_var) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  f::OpDescBind *op1 = block->AppendOp();
  op1->SetType("rowwise_add");
  op1->SetInput("X", {"x1"});
@@ -880,7 +880,7 @@ TEST(Backward, shared_var) {
 TEST(Backward, half_backward) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  auto *op1 = block->AppendOp();
  op1->SetType("minus");
  op1->SetInput("X", {"a"});

--- a/paddle/framework/block_desc.cc
+++ b/paddle/framework/block_desc.cc
@@ -113,7 +113,7 @@ BlockDescBind *BlockDescBind::ParentBlock() const {
  if (this->desc_->parent_idx() == kNoneBlockIndex) {
    return nullptr;
  }
-  return prog_->Block(static_cast<size_t>(this->desc_->parent_idx()));
+  return prog_->MutableBlock(static_cast<size_t>(this->desc_->parent_idx()));
 }
 BlockDesc *BlockDescBind::Proto() {

--- a/paddle/framework/executor.cc
+++ b/paddle/framework/executor.cc
@@ -73,33 +73,32 @@ static void CreateTensor(Variable* var, VarDesc::VarType var_type) {
  }
 }
-void Executor::Run(const ProgramDesc& pdesc, Scope* scope, int block_id) {
+void Executor::Run(const ProgramDescBind& pdesc, Scope* scope, int block_id) {
  // TODO(tonyyang-svail):
  //    - only runs on the first device (i.e. no interdevice communication)
  //    - will change to use multiple blocks for RNN op and Cond Op
-  PADDLE_ENFORCE_GT(pdesc.blocks_size(), block_id);
+  PADDLE_ENFORCE_LT(block_id, pdesc.Size());
-  auto& block = pdesc.blocks(block_id);
+  auto& block = pdesc.Block(block_id);
  auto& device = device_contexts_[0];
  Scope& local_scope = scope->NewScope();
-  for (auto& var : block.vars()) {
+  for (auto& var : block.AllVars()) {
-    if (var.persistable()) {
+    if (var->Persistable()) {
-      auto* ptr = scope->Var(var.name());
+      auto* ptr = scope->Var(var->Name());
-      CreateTensor(ptr, var.type());
+      CreateTensor(ptr, var->GetType());
-      VLOG(3) << "Create Variable " << var.name()
+      VLOG(3) << "Create Variable " << var->Name()
              << " global, which pointer is " << ptr;
    } else {
-      auto* ptr = local_scope.Var(var.name());
+      auto* ptr = local_scope.Var(var->Name());
-      CreateTensor(ptr, var.type());
+      CreateTensor(ptr, var->GetType());
-      VLOG(3) << "Create Variable " << var.name()
+      VLOG(3) << "Create Variable " << var->Name()
              << " locally, which pointer is " << ptr;
    }
  }
-  for (auto& op_desc : block.ops()) {
+  for (auto& op_desc : block.AllOps()) {
-    auto op = paddle::framework::OpRegistry::CreateOp(
+    auto op = paddle::framework::OpRegistry::CreateOp(*op_desc);
-        op_desc, const_cast<ProgramDesc*>(&pdesc));
    op->Run(local_scope, *device);
  }

--- a/paddle/framework/executor.h
+++ b/paddle/framework/executor.h
@@ -14,8 +14,8 @@ limitations under the License. */
 #pragma once
-#include "paddle/framework/framework.pb.h"
 #include "paddle/framework/op_info.h"
+#include "paddle/framework/program_desc.h"
 #include "paddle/framework/scope.h"
 #include "paddle/framework/tensor.h"
@@ -34,7 +34,7 @@ class Executor {
   *  ProgramDesc
   *  Scope
   */
-  void Run(const ProgramDesc&, Scope*, int);
+  void Run(const ProgramDescBind&, Scope*, int);
 private:
  std::vector<platform::DeviceContext*> device_contexts_;

--- a/paddle/framework/op_desc.cc
+++ b/paddle/framework/op_desc.cc
@@ -52,6 +52,22 @@ class CompileTimeInferShapeContext : public InferShapeContext {
  const std::vector<std::string> &Outputs(
      const std::string &name) const override;
+  void ShareLoD(const std::string &in, const std::string &out, size_t i = 0,
+                size_t j = 0) const override {
+    PADDLE_ENFORCE_LT(i, Inputs(in).size());
+    PADDLE_ENFORCE_LT(j, Outputs(out).size());
+    auto *in_var = block_.FindVarRecursive(Inputs(in)[i]);
+    auto *out_var = block_.FindVarRecursive(Outputs(out)[j]);
+    if (in_var->GetType() != VarDesc::LOD_TENSOR) {
+      VLOG(3) << "input " << in << " is not LoDTensor";
+      return;
+    }
+    PADDLE_ENFORCE_EQ(out_var->GetType(), VarDesc::LOD_TENSOR,
+                      "The %d-th output of Output(%s) must be LoDTensor.", j,
+                      out);
+    out_var->SetLoDLevel(in_var->GetLodLevel());
+  }
 private:
  DDim GetDim(const std::string &name) const override;
@@ -98,7 +114,12 @@ OpDescBind::OpDescBind(const OpDesc &desc, ProgramDescBind *prog)
  // restore attrs_
  for (const OpDesc::Attr &attr : desc_.attrs()) {
    std::string attr_name = attr.name();
-    attrs_[attr_name] = GetAttrValue(attr, prog->Proto());
+    if (attr.type() != AttrType::BLOCK) {
+      attrs_[attr_name] = GetAttrValue(attr);
+    } else {
+      auto bid = attr.block_idx();
+      attrs_[attr_name] = prog->MutableBlock(bid);
+    }
  }
 }
@@ -172,8 +193,7 @@ void OpDescBind::SetAttr(const std::string &name, const Attribute &v) {
 }
 void OpDescBind::SetBlockAttr(const std::string &name, BlockDescBind &block) {
-  BlockDesc *desc = block.Proto();
+  this->attrs_[name] = &block;
-  this->attrs_[name] = desc;
  need_update_ = true;
 }
@@ -192,7 +212,7 @@ Attribute OpDescBind::GetAttr(const std::string &name) const {
 int OpDescBind::GetBlockAttr(const std::string &name) const {
  auto it = attrs_.find(name);
  PADDLE_ENFORCE(it != attrs_.end(), "Attribute %s is not found", name);
-  return boost::get<BlockDesc *>(it->second)->idx();
+  return boost::get<BlockDescBind *>(it->second)->ID();
 }
 const std::unordered_map<std::string, Attribute> &OpDescBind::GetAttrMap()
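
Reviewer note: a minimal sketch of the compile-time `ShareLoD` contract, assuming it should mirror the runtime version in `operator.cc`, which copies LoD information from the input to the output. `VarType`, `VarInfo`, and `ShareLoDLevel` are hypothetical stand-ins for `VarDesc::VarType`, `VarDescBind`, and the member function.

```cpp
#include <cassert>

// Minimal stand-in for VarDesc::VarType.
enum class VarType { LOD_TENSOR, SELECTED_ROWS };

// Minimal stand-in for VarDescBind: a type tag plus a LoD level.
struct VarInfo {
  VarType type;
  int lod_level;
};

// Copies the LoD level from `in` to `out` when `in` is a LoD tensor.
// Returns true if the level was shared, false if `in` was skipped.
inline bool ShareLoDLevel(const VarInfo& in, VarInfo* out) {
  assert(out != nullptr);
  if (in.type != VarType::LOD_TENSOR) {
    return false;  // nothing to share for non-LoD inputs
  }
  // The output must be a LoD tensor to receive a LoD level.
  assert(out->type == VarType::LOD_TENSOR);
  out->lod_level = in.lod_level;
  return true;
}
```

The two points to check against the hunk are the variable whose type is enforced (the output) and the direction of the copy (input to output).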

--- a/paddle/framework/op_registry.cc
+++ b/paddle/framework/op_registry.cc
@@ -43,13 +43,15 @@ static VariableNameMap ConvertOpDescVarsToVarNameMap(
  return ret_val;
 }
-std::unique_ptr<OperatorBase> OpRegistry::CreateOp(const OpDesc& op_desc,
+std::unique_ptr<OperatorBase> OpRegistry::CreateOp(const OpDesc& op_desc) {
-                                                   ProgramDesc* program) {
+  VLOG(1) << "CreateOp directly from OpDesc is deprecated. It should only be "
+             "used in unit tests. Use CreateOp(const OpDescBind& op_desc) "
+             "instead.";
  VariableNameMap inputs = ConvertOpDescVarsToVarNameMap(op_desc.inputs());
  VariableNameMap outputs = ConvertOpDescVarsToVarNameMap(op_desc.outputs());
  AttributeMap attrs;
  for (auto& attr : op_desc.attrs()) {
-    attrs[attr.name()] = GetAttrValue(attr, program);
+    attrs[attr.name()] = GetAttrValue(attr);
  }
  return CreateOp(op_desc.type(), inputs, outputs, attrs);

--- a/paddle/framework/op_registry.h
+++ b/paddle/framework/op_registry.h
@@ -77,8 +77,7 @@ class OpRegistry {
                                                const VariableNameMap& outputs,
                                                AttributeMap attrs);
-  static std::unique_ptr<OperatorBase> CreateOp(const OpDesc& op_desc,
+  static std::unique_ptr<OperatorBase> CreateOp(const OpDesc& op_desc);
-                                                ProgramDesc* program);
  static std::unique_ptr<OperatorBase> CreateOp(const OpDescBind& op_desc);
 };

--- a/paddle/framework/op_registry_test.cc
+++ b/paddle/framework/op_registry_test.cc
@@ -74,7 +74,7 @@ TEST(OpRegistry, CreateOp) {
  attr->set_type(paddle::framework::AttrType::FLOAT);
  attr->set_f(scale);
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
  paddle::framework::Scope scope;
  paddle::platform::CPUDeviceContext dev_ctx;
  op->Run(scope, dev_ctx);
@@ -95,7 +95,7 @@ TEST(OpRegistry, IllegalAttr) {
  bool caught = false;
  try {
-    paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+    paddle::framework::OpRegistry::CreateOp(op_desc);
  } catch (paddle::platform::EnforceNotMet err) {
    caught = true;
    std::string msg = "larger_than check fail";
@@ -115,7 +115,7 @@ TEST(OpRegistry, DefaultValue) {
  ASSERT_TRUE(op_desc.IsInitialized());
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
  paddle::framework::Scope scope;
  paddle::platform::CPUDeviceContext dev_ctx;
  op->Run(scope, dev_ctx);
@@ -131,7 +131,7 @@ TEST(OpRegistry, CustomChecker) {
  // attr 'test_attr' is not set
  bool caught = false;
  try {
-    paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+    paddle::framework::OpRegistry::CreateOp(op_desc);
  } catch (paddle::platform::EnforceNotMet err) {
    caught = true;
    std::string msg = "Attribute 'test_attr' is required!";
@@ -149,7 +149,7 @@ TEST(OpRegistry, CustomChecker) {
  attr->set_i(3);
  caught = false;
  try {
-    paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+    paddle::framework::OpRegistry::CreateOp(op_desc);
  } catch (paddle::platform::EnforceNotMet err) {
    caught = true;
    std::string msg = "'test_attr' must be even!";
@@ -166,7 +166,7 @@ TEST(OpRegistry, CustomChecker) {
  attr->set_name("test_attr");
  attr->set_type(paddle::framework::AttrType::INT);
  attr->set_i(4);
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
  paddle::platform::CPUDeviceContext dev_ctx;
  paddle::framework::Scope scope;
  op->Run(scope, dev_ctx);

--- a/paddle/framework/operator.cc
+++ b/paddle/framework/operator.cc
@@ -37,32 +37,32 @@ ExecutionContext::GetEigenDevice<platform::GPUPlace, Eigen::GpuDevice>() const {
 std::string OperatorBase::Input(const std::string& name) const {
  auto& ins = Inputs(name);
  PADDLE_ENFORCE_LE(ins.size(), 1UL,
-                    "Op %s input %s should contain only one variable", type_,
+                    "Operator %s's input %s should contain only one variable.",
-                    name);
+                    type_, name);
  return ins.empty() ? kEmptyVarName : ins[0];
 }
 const std::vector<std::string>& OperatorBase::Inputs(
    const std::string& name) const {
  auto it = inputs_.find(name);
-  PADDLE_ENFORCE(it != inputs_.end(), "Op %s do not have input %s", type_,
+  PADDLE_ENFORCE(it != inputs_.end(), "Operator %s does not have the input %s.",
-                 name);
+                 type_, name);
  return it->second;
 }
 std::string OperatorBase::Output(const std::string& name) const {
  auto& outs = Outputs(name);
  PADDLE_ENFORCE_LE(outs.size(), 1UL,
-                    "Op %s output %s should contain only one variable", type_,
+                    "Operator %s's output %s should contain only one variable.",
-                    name);
+                    type_, name);
  return outs.empty() ? kEmptyVarName : outs[0];
 }
 const std::vector<std::string>& OperatorBase::Outputs(
    const std::string& name) const {
  auto it = outputs_.find(name);
-  PADDLE_ENFORCE(it != outputs_.end(), "Op %s does not have output called %s",
+  PADDLE_ENFORCE(it != outputs_.end(),
-                 type_, name);
+                 "Operator %s does not have an output called %s.", type_, name);
  return it->second;
 }
@@ -351,6 +351,20 @@ class RuntimeInferShapeContext : public InferShapeContext {
    return op_.Outputs(name);
  }
+  void ShareLoD(const std::string& in, const std::string& out, size_t i = 0,
+                size_t j = 0) const override {
+    PADDLE_ENFORCE_LT(i, Inputs(in).size());
+    PADDLE_ENFORCE_LT(j, Outputs(out).size());
+    Variable* in_var = scope_.FindVar(Inputs(in)[i]);
+    Variable* out_var = scope_.FindVar(Outputs(out)[j]);
+    if (!in_var->IsType<LoDTensor>()) return;
+    PADDLE_ENFORCE(out_var->IsType<LoDTensor>(),
+                   "The %d-th output of Output(%s) must be LoDTensor.", j, out);
+    auto in_tensor = in_var->Get<LoDTensor>();
+    auto* out_tensor = out_var->GetMutable<LoDTensor>();
+    out_tensor->set_lod(in_tensor.lod());
+  }
 private:
  DDim GetDim(const std::string& name) const override {
    Variable* var = scope_.FindVar(name);

--- a/paddle/framework/operator.h
+++ b/paddle/framework/operator.h
@@ -427,7 +427,8 @@ class OperatorWithKernel : public OperatorBase {
            int tmp = static_cast<int>(ToDataType(t->type()));
            VLOG(3) << "Input " << ipt_name << " with data_type " << tmp;
            PADDLE_ENFORCE(tmp == data_type || data_type == -1,
-                           "DataType of Paddle Op %s must be same.", Type());
+                           "DataType of Paddle Op %s must be the same.",
+                           Type());
            data_type = tmp;
          }
        }

--- a/paddle/framework/operator_test.cc
+++ b/paddle/framework/operator_test.cc
@@ -83,7 +83,7 @@ TEST(OperatorBase, all) {
  paddle::platform::CPUDeviceContext device_context;
  paddle::framework::Scope scope;
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
  scope.Var("OUT1");
  ASSERT_EQ(paddle::framework::op_run_num, 0);
  op->Run(scope, device_context);
@@ -208,7 +208,7 @@ TEST(OpKernel, all) {
  paddle::platform::CPUDeviceContext cpu_device_context;
  paddle::framework::Scope scope;
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
  ASSERT_EQ(paddle::framework::cpu_kernel_run_num, 0);
  op->Run(scope, cpu_device_context);
  ASSERT_EQ(paddle::framework::cpu_kernel_run_num, 1);
@@ -244,7 +244,7 @@ TEST(OpKernel, multi_inputs) {
  scope.Var("y0")->GetMutable<LoDTensor>();
  scope.Var("y1")->GetMutable<LoDTensor>();
-  auto op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+  auto op = paddle::framework::OpRegistry::CreateOp(op_desc);
  op->Run(scope, cpu_device_context);
 }

--- a/paddle/framework/program_desc.h
+++ b/paddle/framework/program_desc.h
@@ -37,7 +37,9 @@ class ProgramDescBind {
  BlockDescBind *AppendBlock(const BlockDescBind &parent);
-  BlockDescBind *Block(size_t idx) { return blocks_[idx].get(); }
+  BlockDescBind *MutableBlock(size_t idx) { return blocks_[idx].get(); }
+  const BlockDescBind &Block(size_t idx) const { return *blocks_[idx]; }
  size_t Size() const { return blocks_.size(); }
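
Reviewer note: the `Block` / `MutableBlock` split follows a common const-correctness pattern. The sketch below shows it in isolation; `BlockStub`, `ProgramStub`, and `CountOps` are hypothetical stand-ins, and the bounds assertions are an addition that the diff does not contain.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for BlockDescBind with a single mutable field.
struct BlockStub {
  int op_count = 0;
};

// Stand-in for ProgramDescBind showing the accessor pair from the diff.
class ProgramStub {
 public:
  ProgramStub() { blocks_.emplace_back(new BlockStub()); }

  // Writable access: only callable on a non-const program.
  BlockStub* MutableBlock(std::size_t idx) {
    assert(idx < blocks_.size());
    return blocks_[idx].get();
  }

  // Read-only access: callable through a const reference.
  const BlockStub& Block(std::size_t idx) const {
    assert(idx < blocks_.size());
    return *blocks_[idx];
  }

  std::size_t Size() const { return blocks_.size(); }

 private:
  std::vector<std::unique_ptr<BlockStub>> blocks_;
};

// A reader (compare Executor::Run taking a const program) needs only the
// const overload.
inline int CountOps(const ProgramStub& program) {
  int total = 0;
  for (std::size_t i = 0; i < program.Size(); ++i) {
    total += program.Block(i).op_count;
  }
  return total;
}
```

Code that builds or edits a program calls `MutableBlock`, while consumers holding a `const` reference can only reach `Block`, so accidental mutation fails at compile time.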

--- a/paddle/framework/program_desc_test.cc
+++ b/paddle/framework/program_desc_test.cc
@@ -20,7 +20,7 @@ namespace paddle {
 namespace framework {
 TEST(ProgramDesc, copy_ctor) {
  ProgramDescBind program;
-  auto* global_block = program.Block(0);
+  auto* global_block = program.MutableBlock(0);
  auto* x = global_block->Var("X");
  x->SetType(VarDesc_VarType_LOD_TENSOR);
  x->SetLoDLevel(0);
@@ -44,7 +44,7 @@ TEST(ProgramDesc, copy_ctor) {
  ProgramDescBind program_copy(program);
-  auto* global_block_copy = program_copy.Block(0);
+  auto* global_block_copy = program_copy.MutableBlock(0);
  ASSERT_NE(global_block, global_block_copy);
  auto assert_same_var = [&](const std::string& name, VarDescBind* var_before) {
@@ -82,7 +82,7 @@ TEST(ProgramDesc, copy_ctor) {
 TEST(ProgramDescBind, serialize_and_deserialize) {
  ProgramDescBind program_origin;
-  auto* global_block = program_origin.Block(0);
+  auto* global_block = program_origin.MutableBlock(0);
  auto* x = global_block->Var("X");
  x->SetType(VarDesc_VarType_LOD_TENSOR);
  x->SetLoDLevel(0);
@@ -108,7 +108,7 @@ TEST(ProgramDescBind, serialize_and_deserialize) {
  program_origin.Proto()->SerializeToString(&binary_str);
  ProgramDescBind program_restored(binary_str);
-  auto* global_block_restored = program_restored.Block(0);
+  auto* global_block_restored = program_restored.MutableBlock(0);
  ASSERT_NE(global_block, global_block_restored);
  auto assert_same_var = [&](const std::string& name, VarDescBind* var_before) {

--- a/paddle/framework/prune_test.cc
+++ b/paddle/framework/prune_test.cc
@@ -52,7 +52,7 @@ void AddOp(const std::string &type, const f::VariableNameMap &inputs,
 TEST(Prune, one_operator) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  AddOp("one_one", {{"input", {"a"}}}, {{"output", {"b"}}}, {}, block);
@@ -69,7 +69,7 @@ TEST(Prune, one_operator) {
 TEST(Prune, forward) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  AddOp("one_one", {{"input", {"a"}}}, {{"output", {"b"}}}, {}, block);
  AddOp("one_one", {{"input", {"b"}}}, {{"output", {"c"}}}, {}, block);
@@ -88,7 +88,7 @@ TEST(Prune, forward) {
 TEST(Prune, multi_input_op) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  AddOp("one_one", {{"input", {"a0"}}}, {{"output", {"b0"}}}, {}, block);
  AddOp("one_one", {{"input", {"a1"}}}, {{"output", {"b1"}}}, {}, block);
@@ -106,7 +106,7 @@ TEST(Prune, multi_input_op) {
 TEST(Prune, multi_output_op) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  AddOp("one_two", {{"input", {"a"}}}, {{"output", {"b", "c"}}}, {}, block);
  AddOp("one_one", {{"input", {"b"}}}, {{"output", {"b1"}}}, {}, block);
@@ -122,7 +122,7 @@ TEST(Prune, multi_output_op) {
 TEST(Prune, multi_target) {
  f::ProgramDescBind program;
-  f::BlockDescBind *block = program.Block(0);
+  f::BlockDescBind *block = program.MutableBlock(0);
  AddOp("one_two", {{"input", {"a"}}}, {{"output", {"b", "c"}}}, {}, block);
  AddOp("one_one", {{"input", {"b"}}}, {{"output", {"b1"}}}, {}, block);

--- a/paddle/framework/shape_inference.cc
+++ b/paddle/framework/shape_inference.cc
@@ -28,9 +28,6 @@ void InferShapeContext::SetOutputsDim(
  SetDims(names, dims);
 }
-void InferShapeContext::ShareLoD(const std::string &in, const std::string &out,
-                                 size_t i, size_t j) const {}
 std::vector<framework::DDim> InferShapeContext::GetDims(
    const std::vector<std::string> &names) const {
  std::vector<framework::DDim> ret;

--- a/paddle/framework/shape_inference.h
+++ b/paddle/framework/shape_inference.h
@@ -43,9 +43,8 @@ class InferShapeContext {
  virtual const std::vector<std::string> &Outputs(
      const std::string &name) const = 0;
-  // TODO(qiao) implement this function
+  virtual void ShareLoD(const std::string &in, const std::string &out,
-  void ShareLoD(const std::string &in, const std::string &out, size_t i = 0,
+                        size_t i = 0, size_t j = 0) const = 0;
-                size_t j = 0) const;
 protected:
  virtual framework::DDim GetDim(const std::string &name) const = 0;

--- a/paddle/framework/tensor.h
+++ b/paddle/framework/tensor.h
@@ -118,10 +118,12 @@ class Tensor {
                             const platform::DeviceContext& ctx);
  /**
-   * @brief   Return the slice of the tensor.
+   * @brief  Return a sub-tensor of the given tensor.
   *
-   * @param[in] begin_idx   The begin index of the slice.
+   * @param[in] begin_idx   The index of the start row (inclusive) to slice.
-   * @param[in] end_idx     The end index of the slice.
+   *                        The index number begins from 0.
+   * @param[in] end_idx     The index of the end row (exclusive) to slice.
+   *                        The index number begins from 0.
   */
  inline Tensor Slice(const int& begin_idx, const int& end_idx) const;
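
The revised comment describes a half-open, 0-based row range. A minimal sketch of that convention follows, assuming a plain vector of rows as a stand-in for the tensor. `Rows` and `SliceRows` are hypothetical names for illustration, not part of Paddle, and the sketch copies rows for simplicity without making any claim about how `Tensor::Slice` manages memory.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a 2-D tensor: one inner vector per row.
using Rows = std::vector<std::vector<float>>;

// Returns rows [begin_idx, end_idx): begin is inclusive, end is exclusive,
// and both are counted from 0. The checks mirror the three conditions
// enforced in the diff (begin >= 0, end <= row count, begin < end).
Rows SliceRows(const Rows& t, int begin_idx, int end_idx) {
  if (begin_idx < 0) {
    throw std::out_of_range("start row index must be >= 0");
  }
  if (end_idx > static_cast<int>(t.size())) {
    throw std::out_of_range("end row index is out of bound");
  }
  if (begin_idx >= end_idx) {
    throw std::out_of_range("start row index must be less than end row index");
  }
  return Rows(t.begin() + begin_idx, t.begin() + end_idx);
}
```

With four rows, `SliceRows(t, 1, 3)` yields the two rows at indices 1 and 2.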

--- a/paddle/framework/tensor_impl.h
+++ b/paddle/framework/tensor_impl.h
@@ -112,9 +112,10 @@ inline void* Tensor::mutable_data(platform::Place place, std::type_index type) {
  if (holder_ != nullptr) {
    holder_->set_type(type);
  }
-  PADDLE_ENFORCE_GT(numel(), 0,
-                    "Tensor's numel must be larger than zero to call "
-                    "Tensor::mutable_data. Call Tensor::set_dim first.");
+  PADDLE_ENFORCE_GT(
+      numel(), 0,
+      "When calling this method, the Tensor's numel must be larger than zero. "
+      "Please check Tensor::Resize has been called first.");
  int64_t size = numel() * SizeOfType(type);
  /* some versions of boost::variant don't have operator!= */
  if (holder_ == nullptr || !(holder_->place() == place) ||
@@ -229,10 +230,12 @@ inline void Tensor::CopyFromVector(const std::vector<T>& src,
 inline Tensor Tensor::Slice(const int& begin_idx, const int& end_idx) const {
  check_memory_size();
-  PADDLE_ENFORCE_GE(begin_idx, 0, "Slice begin index is less than zero.");
-  PADDLE_ENFORCE_LE(end_idx, dims_[0], "Slice end index is out of bound.");
-  PADDLE_ENFORCE_LT(begin_idx, end_idx,
-                    "Begin index must be less than end index.");
+  PADDLE_ENFORCE_GE(begin_idx, 0,
+                    "The start row index must be greater than or equal to 0.");
+  PADDLE_ENFORCE_LE(end_idx, dims_[0], "The end row index is out of bound.");
+  PADDLE_ENFORCE_LT(
+      begin_idx, end_idx,
+      "The start row index must be less than the end row index.");
  if (dims_[0] == 1) {
    return *this;

--- a/paddle/framework/type_defs.h
+++ b/paddle/framework/type_defs.h
@@ -36,7 +36,7 @@ using VariableNameMap = std::map<std::string, std::vector<std::string>>;
 using Attribute =
    boost::variant<boost::blank, int, float, std::string, std::vector<int>,
                   std::vector<float>, std::vector<std::string>, bool,
-                   std::vector<bool>, BlockDesc*>;
+                   std::vector<bool>, BlockDescBind*>;
 using AttributeMap = std::unordered_map<std::string, Attribute>;

--- a/paddle/framework/var_type_inference_test.cc
+++ b/paddle/framework/var_type_inference_test.cc
@@ -63,41 +63,43 @@ namespace framework {
 TEST(InferVarType, sum_op) {
  ProgramDescBind prog;
-  auto *op = prog.Block(0)->AppendOp();
+  auto *op = prog.MutableBlock(0)->AppendOp();
  op->SetType("sum");
  op->SetInput("X", {"test_a", "test_b", "test_c"});
  op->SetOutput("Out", {"test_out"});
-  prog.Block(0)->Var("test_a")->SetType(VarDesc::SELECTED_ROWS);
-  prog.Block(0)->Var("test_b")->SetType(VarDesc::SELECTED_ROWS);
-  prog.Block(0)->Var("test_c")->SetType(VarDesc::SELECTED_ROWS);
-  prog.Block(0)->Var("test_out");
-  op->InferVarType(prog.Block(0));
-  ASSERT_EQ(VarDesc::SELECTED_ROWS, prog.Block(0)->Var("test_out")->GetType());
-  prog.Block(0)->Var("test_b")->SetType(VarDesc::LOD_TENSOR);
-  op->InferVarType(prog.Block(0));
-  ASSERT_EQ(VarDesc::LOD_TENSOR, prog.Block(0)->Var("test_out")->GetType());
+  prog.MutableBlock(0)->Var("test_a")->SetType(VarDesc::SELECTED_ROWS);
+  prog.MutableBlock(0)->Var("test_b")->SetType(VarDesc::SELECTED_ROWS);
+  prog.MutableBlock(0)->Var("test_c")->SetType(VarDesc::SELECTED_ROWS);
+  prog.MutableBlock(0)->Var("test_out");
+  op->InferVarType(prog.MutableBlock(0));
+  ASSERT_EQ(VarDesc::SELECTED_ROWS,
+            prog.MutableBlock(0)->Var("test_out")->GetType());
+  prog.MutableBlock(0)->Var("test_b")->SetType(VarDesc::LOD_TENSOR);
+  op->InferVarType(prog.MutableBlock(0));
+  ASSERT_EQ(VarDesc::LOD_TENSOR,
+            prog.MutableBlock(0)->Var("test_out")->GetType());
 }
 TEST(InferVarType, sum_op_without_infer_var_type) {
  ProgramDescBind prog;
-  auto *op = prog.Block(0)->AppendOp();
+  auto *op = prog.MutableBlock(0)->AppendOp();
  op->SetType("sum_without_infer_var_type");
  op->SetInput("X", {"test2_a", "test2_b", "test2_c"});
  op->SetOutput("Out", {"test2_out"});
-  prog.Block(0)->Var("test2_a")->SetType(VarDesc::SELECTED_ROWS);
-  prog.Block(0)->Var("test2_b")->SetType(VarDesc::SELECTED_ROWS);
-  prog.Block(0)->Var("test2_c")->SetType(VarDesc::SELECTED_ROWS);
-  prog.Block(0)->Var("test2_out");
-  op->InferVarType(prog.Block(0));
+  prog.MutableBlock(0)->Var("test2_a")->SetType(VarDesc::SELECTED_ROWS);
+  prog.MutableBlock(0)->Var("test2_b")->SetType(VarDesc::SELECTED_ROWS);
+  prog.MutableBlock(0)->Var("test2_c")->SetType(VarDesc::SELECTED_ROWS);
+  prog.MutableBlock(0)->Var("test2_out");
+  op->InferVarType(prog.MutableBlock(0));
  ASSERT_EQ(VarDesc_VarType_LOD_TENSOR,
-            prog.Block(0)->Var("test2_out")->GetType());
+            prog.MutableBlock(0)->Var("test2_out")->GetType());
 }
 }  // namespace framework

--- a/paddle/gserver/layers/CRFLayer.cpp
+++ b/paddle/gserver/layers/CRFLayer.cpp
@@ -101,8 +101,10 @@ void CRFLayer::backward(const UpdateCallback& callback) {
                              : real(1.0f);
    instanceWeight *= coeff_;
-    MatrixPtr grad = output.grad->subRowMatrix(starts[i], starts[i + 1]);
-    grad->add(*crfs_[i].getXGrad(), real(1.0f), instanceWeight);
+    if (output.grad) {
+      MatrixPtr grad = output.grad->subRowMatrix(starts[i], starts[i + 1]);
+      grad->add(*crfs_[i].getXGrad(), real(1.0f), instanceWeight);
+    }
    if (needWGrad) {
      weight_->getWGrad()->add(
          *crfs_[i].getWGrad(), real(1.0f), instanceWeight);

--- a/paddle/gserver/layers/LinearChainCRF.cpp
+++ b/paddle/gserver/layers/LinearChainCRF.cpp
@@ -102,7 +102,6 @@ real LinearChainCRF::forward(real* x, int* s, int length) {
 }
 void LinearChainCRF::backward(real* x, int* s, int length, bool needWGrad) {
-  MatrixPtr matX = Matrix::create(x, length, numClasses_);
  Matrix::resizeOrCreate(matGrad_, length, numClasses_);
  Matrix::resizeOrCreate(beta_, length, numClasses_);
  real* b = b_->getData();

--- a/paddle/gserver/layers/SequenceReshapeLayer.cpp
+++ b/paddle/gserver/layers/SequenceReshapeLayer.cpp
@@ -70,11 +70,23 @@ void SequenceReshapeLayer::forward(PassType passType) {
  size_t outDim = getSize();
  size_t numSequences = input.getNumSequences();
-  auto startPositions = input.sequenceStartPositions->getVector(false);
-  const int* starts = startPositions->getData();
-  CHECK_EQ(starts[numSequences], input.getBatchSize());
-  CHECK_EQ(numSequences, startPositions->getSize() - 1);
+  // by default, we assume each instance as a sequence
+  IVectorPtr seqStarts;
+  IVector::resizeOrCreate(seqStarts, input.getBatchSize() + 1, false);
+  int* startsData = seqStarts->getData();
+  for (int i = 0; i < input.getBatchSize() + 1; i++) {
+    startsData[i] = i;
+  }
+  const int* starts = startsData;
+  // if there is sequence, then use start positions
+  if (input.sequenceStartPositions) {
+    auto startPositions = input.sequenceStartPositions->getVector(false);
+    starts = startPositions->getData();
+    CHECK_EQ(starts[numSequences], input.getBatchSize());
+    CHECK_EQ(numSequences, startPositions->getSize() - 1);
+  }
  for (size_t seqID = 0; seqID < numSequences; seqID++) {
    size_t inNumIns = starts[seqID + 1] - starts[seqID];
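
The fallback added in this hunk treats every instance in the batch as a sequence of length one, so the start positions are simply 0, 1, ..., batchSize. A minimal sketch of that default, using `std::vector` in place of `IVector`; `DefaultSeqStarts` is a hypothetical helper name and does not exist in Paddle.

```cpp
#include <numeric>
#include <vector>

// Hypothetical helper: builds start positions for a batch in which each
// instance is its own sequence. The result has batchSize + 1 entries so that
// starts[i + 1] - starts[i] is the length of sequence i (always 1 here).
std::vector<int> DefaultSeqStarts(int batchSize) {
  std::vector<int> starts(batchSize + 1);
  std::iota(starts.begin(), starts.end(), 0);  // fills 0, 1, ..., batchSize
  return starts;
}
```

The extra trailing entry is what lets the loop over `seqID` read `starts[seqID + 1]` for the last sequence without a special case.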

--- a/paddle/gserver/tests/MKLDNNTester.cpp
+++ b/paddle/gserver/tests/MKLDNNTester.cpp
@@ -273,31 +273,37 @@ void MKLDNNTester::printVector(const VectorPtr& v) {
  VLOG(MKLDNN_ALL) << std::endl << ostr.str();
 }
-double MKLDNNTester::getDelta(const real* d1,
-                              const real* d2,
+double MKLDNNTester::getDelta(const real* refer,
+                              const real* value,
                              size_t len,
                              const float failRate,
                              const float thres) {
  double delta = 0, sum = 0;
  int failCnt = 0;
  const double eps = 1e-5;
-  double maxOut = 0;
+  double maxRatio = 0;
  for (size_t i = 0; i < len; ++i) {
-    double ref = fabs(d2[i]);
-    double diff = fabs(d1[i] - d2[i]);
+    double ref = fabs(refer[i]);
+    double val = fabs(value[i]);
+    double diff = fabs(refer[i] - value[i]);
    delta += diff;
    sum += ref;
-    if (ref > eps && fabs(d1[i]) > eps && diff / ref > thres) {
-      maxOut = std::max(maxOut, diff / ref);
+    if (ref < eps && val < eps) {  // both values are very small
+      continue;
+    }
+    double ratio = diff / ref;
+    if (ratio > thres) {
+      maxRatio = std::max(maxRatio, ratio);
      failCnt++;
    }
  }
-  EXPECT_TRUE(std::isnormal(sum));
  EXPECT_FALSE(std::isinf(sum));
+  EXPECT_FALSE(std::isnan(sum));
  EXPECT_FALSE(std::isnan(delta));
  VLOG(MKLDNN_ALL) << "reference avg data: " << sum / len
                   << ", delta: " << delta / sum << ", failCnt:" << failCnt;
-  return (failCnt / (float)len) > failRate ? maxOut : delta / sum;
+  double res = sum > eps ? delta / sum : eps;
+  return (failCnt / (float)len) > failRate ? maxRatio : res;
 }
 double MKLDNNTester::compareMatrix(const MatrixPtr& m1, const MatrixPtr& m2) {
@@ -515,12 +521,16 @@ void MKLDNNTester::getOutResult(const std::string& configPath,
    gradientMachine->forward(in.inArgs[i], &outArgs, PASS_TRAIN);
    // save forward result
    for (size_t k = 0; k < outArgs.size(); k++) {
-      MatrixPtr value = Matrix::create(outArgs[k].value->getHeight(),
-                                       outArgs[k].value->getWidth(),
-                                       false,
-                                       false);
-      value->copyFrom(*outArgs[k].value);
-      out.outValues.push_back(value);
+      const MatrixPtr& src = outArgs[k].value;
+      MatrixPtr dst =
+          Matrix::create(src->getHeight(), src->getWidth(), false, false);
+      if (typeid(*src) == typeid(MKLDNNMatrix)) {
+        MKLDNNMatrixPtr dnnSrc = std::dynamic_pointer_cast<MKLDNNMatrix>(src);
+        dnnSrc->copyTo(*dst);
+      } else {
+        dst->copyFrom(*src);
+      }
+      out.outValues.push_back(dst);
    }
    // random backward input
@@ -543,19 +553,19 @@ void MKLDNNTester::getOutResult(const std::string& configPath,
 void MKLDNNTester::compareResult(DataOut& ref, DataOut& dnn, float eps) {
  CHECK_EQ(ref.outValues.size(), dnn.outValues.size());
  CHECK_EQ(ref.paraValues.size(), dnn.paraValues.size());
-  VLOG(MKLDNN_TESTS) << "compare value size: " << ref.outValues.size();
  for (size_t i = 0; i < ref.outValues.size(); i++) {
+    VLOG(MKLDNN_TESTS) << "compare value index: " << i;
    EXPECT_LE(fabs(compareMatrix(ref.outValues[i], dnn.outValues[i])), eps);
  }
-  VLOG(MKLDNN_TESTS) << "compare param size: " << ref.outValues.size();
  for (size_t i = 0; i < ref.paraValues.size(); i++) {
+    VLOG(MKLDNN_TESTS) << "compare param index: " << i;
    EXPECT_LE(fabs(compareVector(ref.paraValues[i], dnn.paraValues[i])), eps);
  }
 }
-void MKLDNNTester::runBranchesTest(const std::string& configPath,
-                                   size_t iter,
-                                   float eps) {
+void MKLDNNTester::runNetTest(const std::string& configPath,
+                              size_t iter,
+                              float eps) {
  DataIn in;
  initArgument(in, configPath, iter);
  DataOut outCpu, outDnn;

--- a/paddle/gserver/tests/MKLDNNTester.h
+++ b/paddle/gserver/tests/MKLDNNTester.h
@@ -85,17 +85,17 @@ public:
           bool printDetails = false,
           size_t iter = 3,
           float epsilon = 1e-4);
-  static void runBranchesTest(const std::string& configPath,
-                              size_t iter = 3,
-                              float eps = 1e-4);
+  static void runNetTest(const std::string& configPath,
+                         size_t iter = 2,
+                         float eps = 1e-4);
  static void initArgument(DataIn& data,
                           const std::string& configPath,
-                           size_t iter = 3);
+                           size_t iter = 2);
  static void getOutResult(const std::string& configPath,
                           DataIn& in,
                           DataOut& out,
                           bool use_mkldnn,
-                           size_t iter = 3);
+                           size_t iter = 2);
 private:
  void reset(const TestConfig& dnn, const TestConfig& ref, size_t batchSize);
@@ -128,13 +128,13 @@ private:
  /**
   * Get delta percent
-   * if many(>failRate) wrong(abs(dnn-ref)/abs(ref)>thres) points return the
-   * max(diff/ref)
-   * else return sum(abs(a-b)) / sum(abs(b))
+   * if many(>failRate) wrong(abs(val-ref)/abs(ref) > thres) points
+   * return the max(diff/ref)
+   * else return sum(abs(diff)) / sum(abs(ref))
   * The return value should be smaller than eps when passing.
   */
-  static double getDelta(const real* d1,
-                         const real* d2,
+  static double getDelta(const real* refer,
+                         const real* value,
                         size_t len,
                         const float failRate = 1e-3,
                         const float thres = 0.1);
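
The comment above defines the comparison metric in prose. A standalone sketch of the same rule follows, written against plain `float` arrays so it can be run outside the test harness. `RelativeDelta` is a hypothetical name; the body follows the new `getDelta` logic shown in the `MKLDNNTester.cpp` hunk, with the gtest checks and logging left out.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Illustrative re-statement of the metric, not the tester's own function.
// Returns max(diff/ref) when more than failRate of the points differ by
// more than thres (relative to the reference); otherwise returns
// sum(abs(diff)) / sum(abs(ref)).
double RelativeDelta(const float* refer, const float* value, std::size_t len,
                     float failRate = 1e-3f, float thres = 0.1f) {
  const double eps = 1e-5;
  double delta = 0, sum = 0, maxRatio = 0;
  std::size_t failCnt = 0;
  for (std::size_t i = 0; i < len; ++i) {
    double ref = std::fabs(refer[i]);
    double val = std::fabs(value[i]);
    double diff = std::fabs(refer[i] - value[i]);
    delta += diff;
    sum += ref;
    if (ref < eps && val < eps) {
      continue;  // both values are near zero: never counted as a failure
    }
    double ratio = diff / ref;
    if (ratio > thres) {
      maxRatio = std::max(maxRatio, ratio);
      ++failCnt;
    }
  }
  // When the reference is all (near) zero, fall back to eps instead of 0/0.
  double res = sum > eps ? delta / sum : eps;
  return (static_cast<double>(failCnt) / len) > failRate ? maxRatio : res;
}
```

The near-zero skip and the `sum > eps` fallback are what allow an all-zero reference to be compared, for example the output of a ReLU whose inputs are all negative.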

--- a/paddle/trainer/tests/sample_trainer_config_branch_net.conf
+++ b/paddle/trainer/tests/sample_trainer_config_branch_net.conf
@@ -14,36 +14,82 @@
 from paddle.trainer_config_helpers import *
-################################### Data Configuration ###################################
+settings(batch_size=16)
-TrainData(ProtoData(files = "trainer/tests/mnist.list"))
+channels = get_config_arg("channels", int, 2)
-################################### Algorithm Configuration ###################################
-settings(batch_size = 128,
+def two_conv(input, group_name):
-         learning_method = MomentumOptimizer(momentum=0.5, sparse=False))
+  out1 = img_conv_layer(input=input,
-################################### Network Configuration ###################################
+              name=group_name+'_conv1_',
-data = data_layer(name ="input", size=784)
+              filter_size=1,
+              num_filters=channels,
+              padding=0,
+              shared_biases=True,
+              act=ReluActivation())
+  out2 = img_conv_layer(input=input,
+              name=group_name+'_conv2_',
+              filter_size=3,
+              num_filters=channels,
+              padding=1,
+              shared_biases=True,
+              act=ReluActivation())
+  return out1, out2
+def two_conv_bn(input, group_name):
+  out1, out2 = two_conv(input, group_name)
+  out1 = batch_norm_layer(input=out1,
+              name=group_name+'_bn1_',
+              use_global_stats=False,
+              act=ReluActivation())
+  out2 = batch_norm_layer(input=out2,
+              name=group_name+'_bn2_',
+              use_global_stats=False,
+              act=ReluActivation())
+  return out1, out2
+def two_conv_pool(input, group_name):
+  out1, out2 = two_conv(input, group_name)
+  out1 = img_pool_layer(input=out1,
+              name=group_name+'_pool1_',
+              pool_size=3,
+              stride=2,
+              padding=0,
+              pool_type=MaxPooling())
+  out2 = img_pool_layer(input=out2,
+              name=group_name+'_pool2_',
+              pool_size=5,
+              stride=2,
+              padding=1,
+              pool_type=MaxPooling())
+  return out1, out2
+def two_fc(input, group_name):
+  out1 = fc_layer(input=input,
+            name=group_name+'_fc1_',
+            size=channels,
+            bias_attr=False,
+            act=LinearActivation())
-tmp = img_conv_layer(input=data,
+  out2 = fc_layer(input=input,
-            num_channels=1,
+            name=group_name+'_fc2_',
-            filter_size=3,
+            size=channels,
-            num_filters=32,
+            bias_attr=False,
-            padding=1,
+            act=LinearActivation())
-            shared_biases=True,
+  return out1, out2
-            act=ReluActivation())
-a1 = img_conv_layer(input=tmp,
+data = data_layer(name ="input", size=channels*16*16)
-            filter_size=1,
-            num_filters=32,
-            padding=0,
-            shared_biases=True,
-            act=ReluActivation())
-a2 = img_conv_layer(input=tmp,
+tmp = img_conv_layer(input=data,
+            num_channels=channels,
            filter_size=3,
-            num_filters=32,
+            num_filters=channels,
            padding=1,
            shared_biases=True,
            act=ReluActivation())
+a1, a2 = two_conv(tmp, 'conv_branch')
 tmp = addto_layer(input=[a1, a2],
            act=ReluActivation(),
            bias_attr=False)
@@ -54,36 +100,11 @@ tmp = img_pool_layer(input=tmp,
            padding=1,
            pool_type=AvgPooling())
-b1 = img_conv_layer(input=tmp,
+b1, b2 = two_conv_pool(tmp, 'pool_branch')
-            filter_size=3,
-            num_filters=32,
-            padding=1,
-            shared_biases=True,
-            act=ReluActivation())
-b1 = img_pool_layer(input=b1,
-            pool_size=3,
-            stride=2,
-            padding=0,
-            pool_type=MaxPooling())
-b2 = img_conv_layer(input=tmp,
-            filter_size=3,
-            num_filters=64,
-            padding=1,
-            shared_biases=True,
-            act=ReluActivation())
-b2 = img_pool_layer(input=b2,
-            pool_size=5,
-            stride=2,
-            padding=1,
-            pool_type=MaxPooling())
 tmp = concat_layer(input=[b1, b2])
 tmp = img_pool_layer(input=tmp,
-            num_channels=96,
+            num_channels=channels*2,
            pool_size=3,
            stride=2,
            padding=1,
@@ -91,8 +112,9 @@ tmp = img_pool_layer(input=tmp,
 tmp = img_conv_layer(input=tmp,
            filter_size=3,
-            num_filters=32,
+            num_filters=channels,
            padding=1,
+            stride=2,
            shared_biases=True,
            act=LinearActivation(),
            bias_attr=False)
@@ -101,33 +123,20 @@ tmp = batch_norm_layer(input=tmp,
            use_global_stats=False,
            act=ReluActivation())
-c1 = img_conv_layer(input=tmp,
+c1, c2 = two_conv_bn(tmp, 'bn_branch')
-            filter_size=1,
-            num_filters=32,
-            padding=0,
-            shared_biases=True,
-            act=ReluActivation())
-c2 = img_conv_layer(input=tmp,
-            filter_size=3,
-            num_filters=32,
-            padding=1,
-            shared_biases=True,
-            act=ReluActivation())
 tmp = addto_layer(input=[c1, c2],
            act=ReluActivation(),
            bias_attr=False)
-tmp = fc_layer(input=tmp, size=64,
+tmp = fc_layer(input=tmp, size=channels,
-            bias_attr=False,
+            bias_attr=True,
-            act=TanhActivation())
+            act=ReluActivation())
-output = fc_layer(input=tmp, size=10,
+d1, d2 = two_fc(tmp, 'fc_branch')
+tmp = addto_layer(input=[d1, d2])
+out = fc_layer(input=tmp, size=10,
            bias_attr=True,
            act=SoftmaxActivation())
-lbl = data_layer(name ="label", size=10)
+outputs(out)
-cost = classification_cost(input=output, label=lbl)
-outputs(cost)
--- a/paddle/gserver/tests/mkldnn_branches_fc.conf
+++ b/paddle/gserver/tests/mkldnn_branches_fc.conf
-# Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from paddle.trainer_config_helpers import *
-settings(batch_size=16)
-channels = get_config_arg("channels", int, 2)
-def two_fc(input, group_name):
-  out1 = fc_layer(input=input,
-            name=group_name+'_fc1',
-            size=channels,
-            bias_attr=False,
-            act=LinearActivation())
-  out2 = fc_layer(input=input,
-            name=group_name+'_fc2',
-            size=channels,
-            bias_attr=False,
-            act=LinearActivation())
-  return out1, out2
-data = data_layer(name ="input", size=channels*16*16)
-conv = img_conv_layer(input=data,
-            num_channels=channels,
-            filter_size=3,
-            num_filters=channels,
-            padding=1,
-            shared_biases=True,
-            act=LinearActivation())
-pool = img_pool_layer(input=conv,
-            pool_size=3,
-            stride=2,
-            padding=1,
-            pool_type=AvgPooling())
-a1, a2 = two_fc(input=pool, group_name='a')
-concat = concat_layer(input=[a1, a2])
-b1, b2 = two_fc(input=pool, group_name='b')
-addto = addto_layer(input=[b1, b2])
-outputs([concat, addto])
--- a/paddle/gserver/tests/mkldnn_branches_pool.conf
+++ b/paddle/gserver/tests/mkldnn_branches_pool.conf
-# Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from paddle.trainer_config_helpers import *
-settings(batch_size=16)
-channels = get_config_arg("channels", int, 2)
-def two_pool(input, group_name):
-  out1 = img_pool_layer(input=input,
-            name=group_name+'_pool1',
-            pool_size=3,
-            stride=2,
-            padding=0,
-            pool_type=MaxPooling())
-  out2 = img_pool_layer(input=input,
-            name=group_name+'_pool2',
-            pool_size=5,
-            stride=2,
-            padding=1,
-            pool_type=MaxPooling())
-  return out1, out2
-data = data_layer(name ="input", size=channels*16*16)
-conv = img_conv_layer(input=data,
-            num_channels=channels,
-            filter_size=3,
-            num_filters=channels,
-            padding=1,
-            shared_biases=True,
-            act=LinearActivation())
-pool = img_pool_layer(input=conv,
-            pool_size=3,
-            stride=1,
-            padding=1,
-            pool_type=AvgPooling())
-a1, a2 = two_pool(input=pool, group_name='a')
-concat = concat_layer(input=[a1, a2])
-b1, b2 = two_pool(input=pool, group_name='b')
-addto = addto_layer(input=[b1, b2])
-outputs([concat, addto])
--- a/paddle/gserver/tests/mkldnn_branches_conv.conf
+++ b/paddle/gserver/tests/mkldnn_branches_conv.conf
@@ -17,40 +17,48 @@ from paddle.trainer_config_helpers import *
 settings(batch_size=16)
 channels = get_config_arg("channels", int, 2)
-def two_conv(input, group_name):
+data = data_layer(name ="input", size=channels*16*16)
-  out1 = img_conv_layer(input=input,
-            name=group_name+'_conv1',
-            filter_size=1,
-            num_filters=channels,
-            padding=0,
-            shared_biases=True,
-            act=ReluActivation())
-  out2 = img_conv_layer(input=input,
+tmp = img_conv_layer(input=data,
-            name=group_name+'_conv2',
+            num_channels=channels,
            filter_size=3,
            num_filters=channels,
            padding=1,
            shared_biases=True,
            act=ReluActivation())
-  return out1, out2
-data = data_layer(name ="input", size=channels*16*16)
+tmp = img_pool_layer(input=tmp,
+            pool_size=3,
+            stride=1,
+            padding=0,
+            pool_type=AvgPooling())
-conv = img_conv_layer(input=data,
+tmp = img_conv_layer(input=tmp,
-            num_channels=channels,
            filter_size=3,
            num_filters=channels,
            padding=1,
            shared_biases=True,
-            act=ReluActivation())
+            act=LinearActivation(),
+            bias_attr=False)
-a1, a2 = two_conv(input=conv, group_name='a')
+tmp = batch_norm_layer(input=tmp,
+            use_global_stats=False,
+            act=ReluActivation())
-concat = concat_layer(input=[a1, a2])
+tmp = img_pool_layer(input=tmp,
+            pool_size=3,
+            stride=2,
+            padding=1,
+            pool_type=MaxPooling())
-b1, b2 = two_conv(input=conv, group_name='b')
+tmp = fc_layer(input=tmp,
+            size=channels,
+            bias_attr=False,
+            act=ReluActivation())
-addto = addto_layer(input=[b1, b2])
+out = fc_layer(input=tmp,
+            size=10,
+            bias_attr=True,
+            act=SoftmaxActivation())
-outputs([concat, addto])
+outputs(out)
--- a/paddle/gserver/tests/test_MKLDNN.cpp
+++ b/paddle/gserver/tests/test_MKLDNN.cpp
@@ -234,8 +234,7 @@ static void getMKLDNNBatchNormConfig(TestConfig& cfg,
  cfg.inputDefs.push_back({INPUT_DATA, "layer_2_moving_var", 1, size_t(pm.ic)});
  cfg.inputDefs.back().isStatic = true;
  LayerInputConfig* input = cfg.layerConfig.add_inputs();
-  // TODO(TJ): uncomment me when refine and support comparing all zeroes vector
-  // cfg.layerConfig.set_active_type("relu");
+  cfg.layerConfig.set_active_type("relu");
  cfg.layerConfig.add_inputs();
  cfg.layerConfig.add_inputs();
  ImageConfig* img_conf = input->mutable_image_conf();
@@ -309,15 +308,15 @@ TEST(MKLDNNActivation, Activations) {
 }
 DECLARE_string(config_args);
-TEST(MKLDNNLayer, branches) {
-  std::vector<std::string> cases = {"conv", "pool", "fc"};
+TEST(MKLDNNNet, net) {
+  std::vector<std::string> cases = {"simple", "branch"};
  for (auto name : cases) {
-    std::string config = "./gserver/tests/mkldnn_branches_" + name + ".conf";
+    std::string config = "./gserver/tests/mkldnn_" + name + "_net.conf";
    for (auto channels : {2, 32}) {
      std::ostringstream oss;
      oss << "channels=" << channels;
      FLAGS_config_args = oss.str();
-      MKLDNNTester::runBranchesTest(config);
+      MKLDNNTester::runNetTest(config);
    }
  }
 }

--- a/paddle/math/MKLDNNMatrix.h
+++ b/paddle/math/MKLDNNMatrix.h
@@ -102,6 +102,11 @@ public:
    m_->copyFrom(src);
  }
+  void copyTo(Matrix& dst) {
+    // TODO(TJ): reorder data if this format is not nchw or x
+    dst.copyFrom(*m_);
+  }
 public:
  /**
   * Reorder this MKLDNNMatrix from other format.

--- a/paddle/memory/detail/buddy_allocator.cc
+++ b/paddle/memory/detail/buddy_allocator.cc
@@ -27,11 +27,11 @@ BuddyAllocator::BuddyAllocator(SystemAllocator* system_allocator,
      system_allocator_(std::move(system_allocator)) {}
 BuddyAllocator::~BuddyAllocator() {
-  VLOG(3) << "BuddyAllocator Disconstructor makes sure that all of these "
-             "have actually been freed";
+  VLOG(10) << "BuddyAllocator Destructor makes sure that all of these "
+              "have actually been freed";
  while (!pool_.empty()) {
    auto block = static_cast<MemoryBlock*>(std::get<2>(*pool_.begin()));
-    VLOG(3) << "Free from block (" << block << ", " << max_chunk_size_ << ")";
+    VLOG(10) << "Free from block (" << block << ", " << max_chunk_size_ << ")";
    system_allocator_->Free(block, max_chunk_size_, block->index(cache_));
    cache_.invalidate(block);
@@ -51,11 +51,12 @@ void* BuddyAllocator::Alloc(size_t unaligned_size) {
  // acquire the allocator lock
  std::lock_guard<std::mutex> lock(mutex_);
-  VLOG(3) << "Allocate " << unaligned_size << " bytes from chunk size " << size;
+  VLOG(10) << "Allocate " << unaligned_size << " bytes from chunk size "
+           << size;
  // if the allocation is huge, send directly to the system allocator
  if (size > max_chunk_size_) {
-    VLOG(3) << "Allocate from system allocator.";
+    VLOG(10) << "Allocate from system allocator.";
    return SystemAlloc(size);
  }
@@ -70,9 +71,9 @@ void* BuddyAllocator::Alloc(size_t unaligned_size) {
      return nullptr;
    }
  } else {
-    VLOG(3) << "Allocation from existing memory block " << std::get<2>(*it)
-            << " at address "
-            << reinterpret_cast<MemoryBlock*>(std::get<2>(*it))->data();
+    VLOG(10) << "Allocation from existing memory block " << std::get<2>(*it)
+             << " at address "
+             << reinterpret_cast<MemoryBlock*>(std::get<2>(*it))->data();
  }
  total_used_ += size;
@@ -89,10 +90,10 @@ void BuddyAllocator::Free(void* p) {
  // Acquire the allocator lock
  std::lock_guard<std::mutex> lock(mutex_);
-  VLOG(3) << "Free from address " << block;
+  VLOG(10) << "Free from address " << block;
  if (block->type(cache_) == MemoryBlock::HUGE_CHUNK) {
-    VLOG(3) << "Free directly from system allocator";
+    VLOG(10) << "Free directly from system allocator";
    system_allocator_->Free(block, block->total_size(cache_),
                            block->index(cache_));
@@ -109,8 +110,8 @@ void BuddyAllocator::Free(void* p) {
  // Trying to merge the right buddy
  if (block->has_right_buddy(cache_)) {
-    VLOG(3) << "Merging this block " << block << " with its right buddy "
-            << block->right_buddy(cache_);
+    VLOG(10) << "Merging this block " << block << " with its right buddy "
+             << block->right_buddy(cache_);
    auto right_buddy = block->right_buddy(cache_);
@@ -127,8 +128,8 @@ void BuddyAllocator::Free(void* p) {
  // Trying to merge the left buddy
  if (block->has_left_buddy(cache_)) {
-    VLOG(3) << "Merging this block " << block << " with its left buddy "
-            << block->left_buddy(cache_);
+    VLOG(10) << "Merging this block " << block << " with its left buddy "
+             << block->left_buddy(cache_);
    auto left_buddy = block->left_buddy(cache_);
@@ -144,8 +145,8 @@ void BuddyAllocator::Free(void* p) {
  }
  // Dumping this block into pool
-  VLOG(3) << "Inserting free block (" << block << ", "
-          << block->total_size(cache_) << ")";
+  VLOG(10) << "Inserting free block (" << block << ", "
+           << block->total_size(cache_) << ")";
  pool_.insert(
      IndexSizeAddress(block->index(cache_), block->total_size(cache_), block));
@@ -164,7 +165,7 @@ void* BuddyAllocator::SystemAlloc(size_t size) {
  size_t index = 0;
  void* p = system_allocator_->Alloc(index, size);
-  VLOG(3) << "Allocated " << p << " from system allocator.";
+  VLOG(10) << "Allocated " << p << " from system allocator.";
  if (p == nullptr) return nullptr;
@@ -190,8 +191,8 @@ BuddyAllocator::PoolSet::iterator BuddyAllocator::RefillPool() {
  if (p == nullptr) return pool_.end();
-  VLOG(3) << "Creating and inserting new block " << p
-          << " from system allocator";
+  VLOG(10) << "Creating and inserting new block " << p
+           << " from system allocator";
  static_cast<MemoryBlock*>(p)->init(cache_, MemoryBlock::FREE_CHUNK, index,
                                     max_chunk_size_, nullptr, nullptr);
@@ -235,19 +236,19 @@ void* BuddyAllocator::SplitToAlloc(BuddyAllocator::PoolSet::iterator it,
  auto block = static_cast<MemoryBlock*>(std::get<2>(*it));
  pool_.erase(it);
-  VLOG(3) << "Split block (" << block << ", " << block->total_size(cache_)
-          << ") into";
+  VLOG(10) << "Split block (" << block << ", " << block->total_size(cache_)
+           << ") into";
  block->split(cache_, size);
-  VLOG(3) << "Left block (" << block << ", " << block->total_size(cache_)
-          << ")";
+  VLOG(10) << "Left block (" << block << ", " << block->total_size(cache_)
+           << ")";
  block->set_type(cache_, MemoryBlock::ARENA_CHUNK);
  // the rest of memory if exist
  if (block->has_right_buddy(cache_)) {
    if (block->right_buddy(cache_)->type(cache_) == MemoryBlock::FREE_CHUNK) {
-      VLOG(3) << "Insert right block (" << block->right_buddy(cache_) << ", "
-              << block->right_buddy(cache_)->total_size(cache_) << ")";
+      VLOG(10) << "Insert right block (" << block->right_buddy(cache_) << ", "
+               << block->right_buddy(cache_)->total_size(cache_) << ")";
      pool_.insert(
          IndexSizeAddress(block->right_buddy(cache_)->index(cache_),
@@ -274,7 +275,7 @@ void BuddyAllocator::CleanIdleFallBackAlloc() {
      return;
    }
-    VLOG(3) << "Return block " << block << " to fallback allocator.";
+    VLOG(10) << "Return block " << block << " to fallback allocator.";
    system_allocator_->Free(block, max_chunk_size_, block->index(cache_));
    cache_.invalidate(block);
@@ -310,7 +311,7 @@ void BuddyAllocator::CleanIdleNormalAlloc() {
    MemoryBlock* block = static_cast<MemoryBlock*>(std::get<2>(*pool));
-    VLOG(3) << "Return block " << block << " to base allocator.";
+    VLOG(10) << "Return block " << block << " to base allocator.";
    system_allocator_->Free(block, max_chunk_size_, block->index(cache_));
    cache_.invalidate(block);

--- a/paddle/memory/detail/meta_cache.cc
+++ b/paddle/memory/detail/meta_cache.cc
@@ -30,7 +30,7 @@ Metadata MetadataCache::load(const MemoryBlock* block) {
    return existing_metadata->second;
  } else {
    auto* meta = reinterpret_cast<const Metadata*>(block);
-    VLOG(3) << "Load MetaData type=" << meta->type;
+    VLOG(10) << "Load MetaData type=" << meta->type;
    PADDLE_ASSERT(meta->check_guards());
    return *reinterpret_cast<const Metadata*>(block);
  }

--- a/paddle/memory/detail/system_allocator.cc
+++ b/paddle/memory/detail/system_allocator.cc
@@ -41,7 +41,16 @@ void* CPUAllocator::Alloc(size_t& index, size_t size) {
  index = 0;  // unlock memory
-  void* p = malloc(size);
+  void* p;
+#ifdef PADDLE_USE_MKLDNN
+  // refer to https://github.com/01org/mkl-dnn/blob/master/include/mkldnn.hpp
+  // memory alignment
+  PADDLE_ENFORCE_EQ(posix_memalign(&p, 4096ul, size), 0);
+#else
+  PADDLE_ENFORCE_EQ(posix_memalign(&p, 32ul, size), 0);
+#endif
+  PADDLE_ENFORCE(p, "Failed to allocate CPU memory: size = %d.", size);
  if (p != nullptr) {
    if (FLAGS_use_pinned_memory) {
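The `CPUAllocator::Alloc` hunk above replaces `malloc` with `posix_memalign`, using a 4096-byte alignment when `PADDLE_USE_MKLDNN` is defined and a 32-byte alignment otherwise. A minimal standalone sketch of that call pattern follows, assuming a POSIX system. The helper names `AlignedAlloc` and `IsAligned` are hypothetical and are not part of Paddle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Hypothetical helper (not Paddle code): returns `size` bytes whose address is
// a multiple of `alignment`, or nullptr on failure.
// posix_memalign requires `alignment` to be a power of two and a multiple of
// sizeof(void*).
void* AlignedAlloc(std::size_t alignment, std::size_t size) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  void* p = nullptr;
  // posix_memalign returns 0 on success and an error number otherwise.
  if (posix_memalign(&p, alignment, size) != 0) return nullptr;
  return p;
}

// Hypothetical helper: true when `p` is a multiple of `alignment`.
bool IsAligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Memory obtained this way is released with `free`, exactly like memory from `malloc`.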

--- a/paddle/memory/memory.cc
+++ b/paddle/memory/memory.cc
@@ -39,15 +39,15 @@ BuddyAllocator* GetCPUBuddyAllocator() {
 template <>
 void* Alloc<platform::CPUPlace>(platform::CPUPlace place, size_t size) {
-  VLOG(3) << "Allocate " << size << " bytes on " << platform::Place(place);
+  VLOG(10) << "Allocate " << size << " bytes on " << platform::Place(place);
  void* p = GetCPUBuddyAllocator()->Alloc(size);
-  VLOG(3) << "  pointer=" << p;
+  VLOG(10) << "  pointer=" << p;
  return p;
 }
 template <>
 void Free<platform::CPUPlace>(platform::CPUPlace place, void* p) {
-  VLOG(3) << "Free pointer=" << p << " on " << platform::Place(place);
+  VLOG(10) << "Free pointer=" << p << " on " << platform::Place(place);
  GetCPUBuddyAllocator()->Free(p);
 }
@@ -69,11 +69,12 @@ BuddyAllocator* GetGPUBuddyAllocator(int gpu_id) {
                                   platform::GpuMinChunkSize(),
                                   platform::GpuMaxChunkSize());
    }
-    VLOG(3) << "\n\nNOTE: each GPU device use "
+    VLOG(10) << "\n\nNOTE: each GPU device use "
-            << FLAGS_fraction_of_gpu_memory_to_use * 100 << "% of GPU memory.\n"
+             << FLAGS_fraction_of_gpu_memory_to_use * 100
-            << "You can set environment variable '"
+             << "% of GPU memory.\n"
-            << platform::kEnvFractionGpuMemoryToUse
+             << "You can set environment variable '"
-            << "' to change the fraction of GPU usage.\n\n";
+             << platform::kEnvFractionGpuMemoryToUse
+             << "' to change the fraction of GPU usage.\n\n";
  }
  platform::SetDeviceId(gpu_id);
  return as[gpu_id];

--- a/paddle/operators/cross_entropy_op.cc
+++ b/paddle/operators/cross_entropy_op.cc
@@ -28,8 +28,9 @@ class CrossEntropyOp : public framework::OperatorWithKernel {
    auto x_dims = ctx->GetInputDim("X");
    auto label_dims = ctx->GetInputDim("Label");
-    PADDLE_ENFORCE_EQ(x_dims.size(), 2, "Input(X)'s rank should be 2.");
+    PADDLE_ENFORCE_EQ(x_dims.size(), 2UL, "Input(X)'s rank should be 2.");
-    PADDLE_ENFORCE_EQ(label_dims.size(), 2, "Input(Label)'s rank should be 2.");
+    PADDLE_ENFORCE_EQ(label_dims.size(), 2UL,
+                      "Input(Label)'s rank should be 2.");
    PADDLE_ENFORCE_EQ(x_dims[0], label_dims[0],
                      "The 1st dimension of Input(X) and Input(Label) should "
                      "be equal.");
@@ -38,8 +39,8 @@ class CrossEntropyOp : public framework::OperatorWithKernel {
                        "If Attr(soft_label) == true, the 2nd dimension of "
                        "Input(X) and Input(Label) should be equal.");
    } else {
-      PADDLE_ENFORCE_EQ(label_dims[1], 1,
+      PADDLE_ENFORCE_EQ(label_dims[1], 1UL,
-                        "If Attr(soft_label) == false, the 2nd dimension of "
+                        "If Attr(softLabel) == false, the 2nd dimension of "
                        "Input(Label) should be 1.");
    }
@@ -48,7 +49,8 @@ class CrossEntropyOp : public framework::OperatorWithKernel {
  }
 protected:
-  // CrossEntropy's data type just determined by "X"
+  // Explicitly set that the data type of the output of the cross_entropy
+  // operator is determined by its input "X".
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
    return framework::ToDataType(ctx.Input<Tensor>("X")->type());

--- a/paddle/operators/dynamic_recurrent_op_test.cc
+++ b/paddle/operators/dynamic_recurrent_op_test.cc
@@ -51,7 +51,7 @@ class RNNAlgorithmTestHelper : public ::testing::Test {
    CreateGlobalVariables();
    auto op_desc = CreateOpDesc();
-    op = paddle::framework::OpRegistry::CreateOp(op_desc, nullptr);
+    op = paddle::framework::OpRegistry::CreateOp(op_desc);
    dop = &(dynamic_cast<DynamicRecurrentOp*>(op.get())->rnn);
    InitCacheManually();
    InitStepNet();

--- a/paddle/operators/gaussian_random_op.cc
+++ b/paddle/operators/gaussian_random_op.cc
@@ -45,14 +45,14 @@ class GaussianRandomOp : public framework::OperatorWithKernel {
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasOutput("Out"),
                   "Output(Out) of GaussianRandomOp should not be null.");
-    auto dims = ctx->Attrs().Get<std::vector<int>>("dims");
+    auto shape = ctx->Attrs().Get<std::vector<int>>("shape");
    std::vector<int64_t> temp;
-    temp.reserve(dims.size());
+    temp.reserve(shape.size());
-    for (auto dim : dims) {
+    for (auto dim : shape) {
      temp.push_back(static_cast<int64_t>(dim));
    }
-    PADDLE_ENFORCE(dims.size() > 0UL,
+    PADDLE_ENFORCE(shape.size() > 0UL,
-                   "dims can be one int or array. dims must be set.");
+                   "shape can be one int or array. shape must be set.");
    ctx->SetOutputDim("Out", framework::make_ddim(temp));
  }
@@ -74,7 +74,7 @@ GaussianRandom operator.
 Use to initialize tensor with gaussian random generator.
 )DOC");
-    AddAttr<std::vector<int>>("dims", "The dimension of random tensor.");
+    AddAttr<std::vector<int>>("shape", "The dimension of random tensor.");
    AddAttr<float>("mean", "mean of random tensor.").SetDefault(.0f);
    AddAttr<float>("std", "std of random tensor.").SetDefault(1.0f);
    AddAttr<int>("seed",

--- a/paddle/operators/linear_chain_crf_op.cc
+++ b/paddle/operators/linear_chain_crf_op.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+#include "paddle/operators/linear_chain_crf_op.h"
+namespace paddle {
+namespace operators {
+class LinearChainCRFOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  LinearChainCRFOpMaker(framework::OpProto* proto,
+                        framework::OpAttrChecker* op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput(
+        "Emission",
+        "(LoDTensor, default: LoDTensor<float>). "
+        "The unscaled emission weight matrix for the linear chain CRF. "
+        "This input is a LoDTensor with shape [N x D] where N is the size of "
+        "the mini-batch and D is the total tag number.");
+    AddInput(
+        "Transition",
+        "(Tensor, default: Tensor<float>). A Tensor with shape [(D + 2) x D]. "
+        "The learnable parameter for the linear_chain_crf operator. "
+        "See more details in the operator's comments.");
+    AddInput(
+        "Label",
+        "(LoDTensor, default: LoDTensor<int>). The ground truth which is a 2-D "
+        "LoDTensor with shape [N x 1], where N is the total element number in "
+        "a mini-batch.");
+    AddOutput(
+        "Alpha",
+        "(Tensor, default: Tensor<float>). The forward vectors for the entire "
+        "batch. A two dimensional tensor with shape [N x D], "
+        "denoted as \f$\alpha\f$. \f$\alpha\f$ is a memo table used to "
+        "calculate the normalization factor in CRF. \f$\alpha[k, v]\f$ stores "
+        "the unnormalized probabilities of all possible unfinished sequences "
+        "of tags that end at position \f$k\f$ with tag \f$v\f$. For each "
+        "\f$k\f$, \f$\alpha[k, v]\f$ is a vector of length \f$D\f$ with a "
+        "component for each tag value \f$v\f$. This vector is called a forward "
+        "vector and will also be used in backward computations.")
+        .AsIntermediate();
+    AddOutput("EmissionExps",
+              "The exponentials of Input(Emission). This is an intermediate "
+              "computational result in forward computation, and will be reused "
+              "in backward computation.")
+        .AsIntermediate();
+    AddOutput("TransitionExps",
+              "The exponentials of Input(Transition). This is an intermediate "
+              "computational result in forward computation, and will be reused "
+              "in backward computation.")
+        .AsIntermediate();
+    AddOutput(
+        "LogLikelihood",
+        "(Tensor, default: Tensor<float>). The logarithm of the conditional "
+        "likelihood of each training sample in a mini-batch. This is a 2-D "
+        "tensor with shape [S x 1], where S is the number of sequences in a "
+        "mini-batch. The output is no longer a LoDTensor.");
+    AddComment(R"DOC(
+Conditional Random Field defines an undirected probabilistic graph with nodes
+denoting random variables and edges denoting dependencies between these
+variables. CRF learns the conditional probability \f$P(Y|X)\f$, where
+\f$X = (x_1, x_2, ... , x_n)\f$ are structured inputs and
+\f$Y = (y_1, y_2, ... , y_n)\f$ are labels for the inputs.
+Linear chain CRF is a special case of CRF that is useful for sequence labeling
+tasks. Sequence labeling tasks do not assume a lot of conditional
+independences among inputs. The only constraint they impose is that the input
+and output must be linear sequences. Thus, the graph of such a CRF is a simple
+chain or a line, which results in the linear chain CRF.
+This operator implements the Forward-Backward algorithm for the linear chain
+CRF. Please see http://www.cs.columbia.edu/~mcollins/fb.pdf and
+http://cseweb.ucsd.edu/~elkan/250Bwinter2012/loglinearCRFs.pdf for reference.
+Equation:
+- Denote Input(Emission) to this operator as \f$x\f$ here.
+- The first D values of Input(Transition) to this operator are for starting
+weights, denoted as \f$a\f$ here.
+- The next D values of Input(Transition) of this operator are for ending
+weights, denoted as \f$b\f$ here.
+- The remaining values of Input(Transition) are for transition weights,
+denoted as \f$w\f$ here.
+- Denote Input(Label) as \f$s\f$ here.
+The probability of a sequence \f$s\f$ of length \f$L\f$ is defined as:
+\f$P(s) = (1/Z) exp(a_{s_1} + b_{s_L}
+                 + \sum_{l=1}^L x_{s_l}
+                 + \sum_{l=2}^L w_{s_{l-1},s_l})\f$
+where \f$Z\f$ is a normalization value so that the sum of \f$P(s)\f$ over
+all possible sequences is \f$1\f$, and \f$x\f$ is the emission feature weight
+to the linear chain CRF.
+Finally, the linear chain CRF operator outputs the logarithm of the conditional
+likelihood of each training sample in a mini-batch.
+NOTE:
+1. The feature function for a CRF is made up of the emission features and the
+transition features. The emission feature weights are NOT computed in
+this operator. They MUST be computed first before this operator is called.
+2. Because this operator performs global normalization over all possible
+sequences internally, it expects UNSCALED emission feature weights.
+Please do not call this op with the emission feature being output of any
+nonlinear activation.
+3. The 2nd dimension of Input(Emission) MUST be equal to the tag number.
+)DOC");
+  }
+};
+class LinearChainCRFOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+  void InferShape(framework::InferShapeContext* ctx) const override {
+    PADDLE_ENFORCE(ctx->HasInput("Emission"),
+                   "Input(Emission) should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Transition"),
+                   "Input(Transition) should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Label"), "Input(Label) should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("Alpha"),
+                   "Output(Alpha) should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("EmissionExps"),
+                   "Output(EmissionExps) should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("TransitionExps"),
+                   "Output(TransitionExps) should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("LogLikelihood"),
+                   "Output(LogLikelihood) should not be null.");
+    auto emission_dims = ctx->GetInputDim("Emission");
+    PADDLE_ENFORCE_EQ(emission_dims.size(), 2UL,
+                      "The Input(Emission) should be a 2-D tensor.");
+    PADDLE_ENFORCE(emission_dims[0], "An empty mini-batch is not allowed.");
+    auto transition_dims = ctx->GetInputDim("Transition");
+    PADDLE_ENFORCE_EQ(transition_dims.size(), 2UL,
+                      "The Input(Transition) should be a 2-D tensor.");
+    PADDLE_ENFORCE_EQ(
+        transition_dims[0] - 2, transition_dims[1],
+        "An invalid dimension for the Input(Transition), which should "
+        "be a 2-D tensor with shape [(D + 2) x D].");
+    PADDLE_ENFORCE_EQ(
+        emission_dims[1], transition_dims[1],
+        "The 2nd dimension of the Input(Emission) and the Input(Transition) "
+        "should be equal to the tag number.");
+    auto label_dims = ctx->GetInputDim("Label");
+    PADDLE_ENFORCE(label_dims.size() == 2UL && label_dims[1] == 1UL,
+                   "The Input(Label) should be a 2-D tensor with the 2nd "
+                   "dimension fixed to 1.");
+    PADDLE_ENFORCE_EQ(
+        emission_dims[0], label_dims[0],
+        "The height of Input(Emission) and the height of Input(Label) "
+        "should be the same.");
+    ctx->SetOutputDim("Alpha", emission_dims);
+    ctx->SetOutputDim("EmissionExps", emission_dims);
+    ctx->SetOutputDim("TransitionExps", transition_dims);
+    // TODO(caoying) This is tricky. The 1st dimension of Output(LogLikelihood)
+    // is the sequence number in a mini-batch. The dimension set here should be
+    // resized to its correct size in the function Compute. Fix this once we can
+    // get LoD information in the InferShape interface.
+    ctx->SetOutputDim("LogLikelihood", {emission_dims[0], 1});
+  }
+ protected:
+  // Explicitly set that the data type of output of the linear_chain_crf
+  // operator is determined by its input "Emission".
+  framework::DataType IndicateDataType(
+      const framework::ExecutionContext& ctx) const override {
+    return framework::ToDataType(ctx.Input<LoDTensor>("Emission")->type());
+  }
+};
+class LinearChainCRFGradOp : public framework::OperatorWithKernel {
+ public:
+  using framework::OperatorWithKernel::OperatorWithKernel;
+  void InferShape(framework::InferShapeContext* ctx) const override {
+    PADDLE_ENFORCE(ctx->HasInput("EmissionExps"),
+                   "Input(EmissionExps) should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("TransitionExps"),
+                   "Input(TransitionExps) should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("LogLikelihood")),
+                   "Input(LogLikelihood@GRAD) should not be null.");
+    auto emission_exps_dims = ctx->GetInputDim("EmissionExps");
+    PADDLE_ENFORCE_EQ(emission_exps_dims.size(), 2UL,
+                      "The Input(EmissionExps) should be a 2-D tensor.");
+    PADDLE_ENFORCE(emission_exps_dims[0],
+                   "An empty mini-batch is not allowed.");
+    auto transition_exps_dims = ctx->GetInputDim("TransitionExps");
+    PADDLE_ENFORCE_EQ(transition_exps_dims.size(), 2UL,
+                      "The Input(TransitionExps) should be a 2-D tensor.");
+    PADDLE_ENFORCE_EQ(
+        transition_exps_dims[0] - 2, transition_exps_dims[1],
+        "An invalid dimension for the Input(TransitionExps), which should "
+        "be a 2-D tensor with shape [(D + 2) x D].");
+    PADDLE_ENFORCE_EQ(
+        emission_exps_dims[1], transition_exps_dims[1],
+        "The 2nd dimension of the Input(EmissionExps) and the "
+        "Input(TransitionExps) should be equal to the tag number.");
+    auto label_dims = ctx->GetInputDim("Label");
+    PADDLE_ENFORCE(label_dims.size() == 2UL && label_dims[1] == 1UL,
+                   "The Input(Label) should be a 2-D tensor with the 2nd "
+                   "dimension fixed to 1.");
+    PADDLE_ENFORCE_EQ(
+        emission_exps_dims[0], label_dims[0],
+        "The height of Input(EmissionExps) and the height of Input(Label) "
+        "should be the same.");
+    if (ctx->HasOutput(framework::GradVarName("Emission"))) {
+      ctx->SetOutputDim(framework::GradVarName("Emission"), emission_exps_dims);
+    }
+    if (ctx->HasOutput(framework::GradVarName("Transition"))) {
+      ctx->SetOutputDim(framework::GradVarName("Transition"),
+                        transition_exps_dims);
+    }
+  }
+ protected:
+  // Explicitly set that the data type of output of the linear_chain_crf_grad
+  // operator is determined by its input: gradients of LogLikelihood.
+  framework::DataType IndicateDataType(
+      const framework::ExecutionContext& ctx) const override {
+    return framework::ToDataType(
+        ctx.Input<LoDTensor>(framework::GradVarName("LogLikelihood"))->type());
+  }
+};
+}  // namespace operators
+}  // namespace paddle
+namespace ops = paddle::operators;
+REGISTER_OP(linear_chain_crf, ops::LinearChainCRFOp, ops::LinearChainCRFOpMaker,
+            linear_chain_crf_grad, ops::LinearChainCRFGradOp);
+REGISTER_OP_CPU_KERNEL(
+    linear_chain_crf,
+    ops::LinearChainCRFOpKernel<paddle::platform::CPUPlace, float>,
+    ops::LinearChainCRFOpKernel<paddle::platform::CPUPlace, double>);
+REGISTER_OP_CPU_KERNEL(
+    linear_chain_crf_grad,
+    ops::LinearChainCRFGradOpKernel<paddle::platform::CPUPlace, float>,
+    ops::LinearChainCRFGradOpKernel<paddle::platform::CPUPlace, double>);
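The operator comment in `linear_chain_crf_op.cc` defines P(s) from start weights a, end weights b, transition weights w, and emission weights x. The sketch below evaluates log P(s) for a single sequence, as a worked illustration of that formula only. It is not the kernel from `linear_chain_crf_op.h`: the operator keeps exponentials (`EmissionExps`, `TransitionExps`), whereas this sketch runs the forward recursion in log space for brevity. The names `Matrix`, `LogSumExp`, and `CrfLogLikelihood` are hypothetical, and the row layout (row 0 = a, row 1 = b, rows 2 and up = w) is assumed from the "first D values / next D values / remaining values" description.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Numerically stable log(sum_i exp(v[i])) for a non-empty vector.
double LogSumExp(const std::vector<double>& v) {
  const double m = *std::max_element(v.begin(), v.end());
  double sum = 0.0;
  for (double e : v) sum += std::exp(e - m);
  return m + std::log(sum);
}

// x:          [L x D] emission weights of one sequence.
// transition: [(D + 2) x D]; row 0 = a, row 1 = b, rows 2.. = w (assumed).
// s:          tag sequence of length L with values in [0, D).
// Returns log P(s) = score(s) - log Z.
double CrfLogLikelihood(const Matrix& x, const Matrix& transition,
                        const std::vector<int>& s) {
  const std::size_t L = x.size();
  const std::size_t D = transition[0].size();
  assert(L > 0 && s.size() == L && transition.size() == D + 2);
  const std::vector<double>& a = transition[0];
  const std::vector<double>& b = transition[1];

  // Unnormalized log-score: a_{s_1} + b_{s_L} + sum x + sum w.
  double score = a[s[0]] + b[s[L - 1]];
  for (std::size_t l = 0; l < L; ++l) {
    score += x[l][s[l]];
    if (l > 0) score += transition[2 + s[l - 1]][s[l]];
  }

  // Forward recursion: alpha[v] is the log of the summed scores of all
  // partial sequences that end at the current position with tag v.
  std::vector<double> alpha(D), next(D), tmp(D);
  for (std::size_t v = 0; v < D; ++v) alpha[v] = a[v] + x[0][v];
  for (std::size_t l = 1; l < L; ++l) {
    for (std::size_t v = 0; v < D; ++v) {
      for (std::size_t u = 0; u < D; ++u) {
        tmp[u] = alpha[u] + transition[2 + u][v];
      }
      next[v] = x[l][v] + LogSumExp(tmp);
    }
    alpha.swap(next);
  }
  for (std::size_t v = 0; v < D; ++v) tmp[v] = alpha[v] + b[v];
  return score - LogSumExp(tmp);  // subtract log Z
}
```

With all weights set to zero, D = 2 and L = 3, every one of the 8 possible tag sequences is equally likely, so the function returns log(1/8) for any of them.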
--- a/paddle/operators/linear_chain_crf_op.cu
+++ b/paddle/operators/linear_chain_crf_op.cu
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+#include "paddle/operators/linear_chain_crf_op.h"
+namespace ops = paddle::operators;
+REGISTER_OP_GPU_KERNEL(
+    linear_chain_crf,
+    ops::LinearChainCRFOpKernel<paddle::platform::GPUPlace, float>,
+    ops::LinearChainCRFOpKernel<paddle::platform::GPUPlace, double>);
+REGISTER_OP_GPU_KERNEL(
+    linear_chain_crf_grad,
+    ops::LinearChainCRFGradOpKernel<paddle::platform::GPUPlace, float>,
+    ops::LinearChainCRFGradOpKernel<paddle::platform::GPUPlace, double>);
--- a/paddle/operators/linear_chain_crf_op.h
+++ b/paddle/operators/linear_chain_crf_op.h
--- a/paddle/operators/lookup_table_op.cc
+++ b/paddle/operators/lookup_table_op.cc
@@ -43,7 +43,7 @@ class LookupTableOp : public framework::OperatorWithKernel {
 protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
-    return framework::ToDataType(ctx.Input<Tensor>("W")->type());
+    return framework::ToDataType(ctx.Input<LoDTensor>("W")->type());
  }
 };
@@ -93,7 +93,7 @@ class LookupTableOpGrad : public framework::OperatorWithKernel {
 protected:
  framework::DataType IndicateDataType(
      const framework::ExecutionContext& ctx) const override {
-    return framework::ToDataType(ctx.Input<Tensor>("W")->type());
+    return framework::ToDataType(ctx.Input<LoDTensor>("W")->type());
  }
 };

--- a/paddle/operators/lookup_table_op.cu
+++ b/paddle/operators/lookup_table_op.cu
@@ -61,16 +61,16 @@ template <typename T>
 class LookupTableCUDAKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& context) const override {
-    auto table_t = context.Input<Tensor>("W");
+    auto* table_t = context.Input<LoDTensor>("W");
-    auto ids_t = context.Input<Tensor>("Ids");
+    auto* ids_t = context.Input<LoDTensor>("Ids");
-    auto output_t = context.Output<Tensor>("Out");
+    auto* output_t = context.Output<LoDTensor>("Out");
    size_t N = table_t->dims()[0];
    size_t D = table_t->dims()[1];
    size_t K = ids_t->numel();
-    auto ids = ids_t->data<int64_t>();
+    auto* ids = ids_t->data<int64_t>();
-    auto table = table_t->data<T>();
+    auto* table = table_t->data<T>();
-    auto output = output_t->mutable_data<T>(context.GetPlace());
+    auto* output = output_t->mutable_data<T>(context.GetPlace());
    dim3 threads(128, 8);
    dim3 grids(8, 1);
@@ -87,9 +87,9 @@ class LookupTableGradCUDAKernel : public framework::OpKernel<T> {
  void Compute(const framework::ExecutionContext& context) const override {
    bool is_sparse = context.Attr<bool>("is_sparse");
    if (is_sparse) {
-      auto* ids = context.Input<Tensor>("Ids");
+      auto* ids = context.Input<LoDTensor>("Ids");
-      auto* table = context.Input<Tensor>("W");
+      auto* table = context.Input<LoDTensor>("W");
-      auto* d_output = context.Input<Tensor>(framework::GradVarName("Out"));
+      auto* d_output = context.Input<LoDTensor>(framework::GradVarName("Out"));
      auto* d_table = context.Output<SelectedRows>(framework::GradVarName("W"));
      auto* ids_data = ids->data<int64_t>();
@@ -116,12 +116,12 @@ class LookupTableGradCUDAKernel : public framework::OpKernel<T> {
      auto* d_output_data = d_output->data<T>();
      PADDLE_ENFORCE_EQ(d_table_value->dims(), d_output->dims());
      memory::Copy(gpu_place, d_table_data, gpu_place, d_output_data,
-                   d_output->numel(), stream);
+                   d_output->numel() * sizeof(T), stream);
    } else {
-      auto ids_t = context.Input<Tensor>("Ids");
+      auto ids_t = context.Input<LoDTensor>("Ids");
-      auto d_output_t = context.Input<Tensor>(framework::GradVarName("Out"));
+      auto d_output_t = context.Input<LoDTensor>(framework::GradVarName("Out"));
-      auto d_table_t = context.Output<Tensor>(framework::GradVarName("W"));
+      auto d_table_t = context.Output<LoDTensor>(framework::GradVarName("W"));
      int N = d_table_t->dims()[0];
      int D = d_table_t->dims()[1];
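The sparse branch of `LookupTableGradCUDAKernel` above now passes `d_output->numel() * sizeof(T)` to `memory::Copy`, because the copy length is a byte count and not an element count. A host-side analogue using `std::memcpy` is sketched below. `CopyElements` is a hypothetical helper, and `std::memcpy` stands in for `memory::Copy` only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not Paddle code): copies `count` elements of type T.
// memcpy takes a length in bytes, so the element count is scaled by sizeof(T).
// Passing `count` alone would copy only a fraction of the data when
// sizeof(T) > 1.
template <typename T>
void CopyElements(T* dst, const T* src, std::size_t count) {
  assert(dst != nullptr && src != nullptr);
  std::memcpy(dst, src, count * sizeof(T));
}
```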

--- a/paddle/operators/lookup_table_op.h
+++ b/paddle/operators/lookup_table_op.h
@@ -19,22 +19,22 @@
 namespace paddle {
 namespace operators {
-using Tensor = framework::Tensor;
+using LoDTensor = framework::LoDTensor;
 using SelectedRows = framework::SelectedRows;
 template <typename T>
 class LookupTableKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& context) const override {
-    auto table_t = context.Input<Tensor>("W");      // float tensor
+    auto* table_t = context.Input<LoDTensor>("W");      // float tensor
-    auto ids_t = context.Input<Tensor>("Ids");      // int tensor
+    auto* ids_t = context.Input<LoDTensor>("Ids");      // int tensor
-    auto output_t = context.Output<Tensor>("Out");  // float tensor
+    auto* output_t = context.Output<LoDTensor>("Out");  // float tensor
    int N = table_t->dims()[0];
    int D = table_t->dims()[1];
-    auto ids = ids_t->data<int64_t>();
+    auto* ids = ids_t->data<int64_t>();
-    auto table = table_t->data<T>();
+    auto* table = table_t->data<T>();
-    auto output = output_t->mutable_data<T>(context.GetPlace());
+    auto* output = output_t->mutable_data<T>(context.GetPlace());
    for (int64_t i = 0; i < ids_t->numel(); ++i) {
      PADDLE_ENFORCE_LT(ids[i], N);
      PADDLE_ENFORCE_GE(ids[i], 0);
@@ -49,9 +49,9 @@ class LookupTableGradKernel : public framework::OpKernel<T> {
  void Compute(const framework::ExecutionContext& context) const override {
    bool is_sparse = context.Attr<bool>("is_sparse");
    if (is_sparse) {
-      auto* ids = context.Input<Tensor>("Ids");
+      auto* ids = context.Input<LoDTensor>("Ids");
-      auto* table = context.Input<Tensor>("W");
+      auto* table = context.Input<LoDTensor>("W");
-      auto* d_output = context.Input<Tensor>(framework::GradVarName("Out"));
+      auto* d_output = context.Input<LoDTensor>(framework::GradVarName("Out"));
      auto* d_table = context.Output<SelectedRows>(framework::GradVarName("W"));
      auto* ids_data = ids->data<int64_t>();
@@ -76,10 +76,10 @@ class LookupTableGradKernel : public framework::OpKernel<T> {
      PADDLE_ENFORCE_EQ(d_table_value->dims(), d_output->dims());
      memcpy(d_table_data, d_output_data, sizeof(T) * d_output->numel());
    } else {
-      auto* ids = context.Input<Tensor>("Ids");
+      auto* ids = context.Input<LoDTensor>("Ids");
-      auto* d_output = context.Input<Tensor>(framework::GradVarName("Out"));
+      auto* d_output = context.Input<LoDTensor>(framework::GradVarName("Out"));
-      auto* d_table = context.Output<Tensor>(framework::GradVarName("W"));
+      auto* d_table = context.Output<LoDTensor>(framework::GradVarName("W"));
-      auto* table = context.Input<Tensor>("W");
+      auto* table = context.Input<LoDTensor>("W");
      auto* ids_data = ids->data<int64_t>();
      auto ids_dim = ids->dims();

--- a/paddle/operators/nccl_op_test.cu
+++ b/paddle/operators/nccl_op_test.cu
@@ -185,7 +185,7 @@ TEST_F(NCCLTester, ncclAllReduceOp) {
        recv_tensor.numel() * sizeof(float),
        static_cast<p::CUDADeviceContext *>(dev_ctxs[i])->stream());
-    for (size_t j = 0; j < f::product(kDims); ++j) {
+    for (int64_t j = 0; j < f::product(kDims); ++j) {
      ASSERT_NEAR(ct[j], result, 1e-5);
    }
  }
@@ -234,7 +234,7 @@ TEST_F(NCCLTester, ncclReduceOp) {
      recv_tensor.numel() * sizeof(float),
      static_cast<p::CUDADeviceContext *>(dev_ctxs[kRoot])->stream());
-  for (int j = 0; j < f::product(kDims); ++j) {
+  for (int64_t j = 0; j < f::product(kDims); ++j) {
    ASSERT_NEAR(ct[j], result, 1e-5);
  }
 }
@@ -282,7 +282,7 @@ TEST_F(NCCLTester, ncclBcastOp) {
      recv_tensor.numel() * sizeof(float),
      static_cast<p::CUDADeviceContext *>(dev_ctxs[idx])->stream());
-  for (size_t j = 0; j < f::product(kDims); ++j) {
+  for (int64_t j = 0; j < f::product(kDims); ++j) {
    ASSERT_NEAR(ct[j], result, 1e-5);
  }
 }

--- a/paddle/operators/reshape_op.cc
+++ b/paddle/operators/reshape_op.cc
@@ -36,7 +36,7 @@ class ReshapeOp : public framework::OperatorWithKernel {
    PADDLE_ENFORCE(shape.size() > 0, "Attr(shape) shouldn't be empty.");
    auto x_dims = ctx->GetInputDim("X");
    // TODO(qiao) change batch_size
-    for (int i = 1; i < shape.size(); ++i) {
+    for (size_t i = 1; i < shape.size(); ++i) {
      PADDLE_ENFORCE(shape[i] > 0,
                     "Each dimension of shape "
                     "must be positiv except the first.");

--- a/paddle/operators/save_load_op_test.cc
+++ b/paddle/operators/save_load_op_test.cc
@@ -34,7 +34,7 @@ TEST(SaveLoadOp, CPU) {
  tensor->set_lod(expect_lod);
  int* expect = tensor->mutable_data<int>(place);
-  for (size_t i = 0; i < paddle::framework::product(tensor->dims()); ++i) {
+  for (int64_t i = 0; i < tensor->numel(); ++i) {
    expect[i] = static_cast<int>(i);
  }
  paddle::framework::AttributeMap attrs;
@@ -50,7 +50,7 @@ TEST(SaveLoadOp, CPU) {
      "load", {}, {{"Out", {"out_var"}}}, attrs);
  load_op->Run(scope, ctx);
  int* actual = target->data<int>();
-  for (size_t i = 0; i < paddle::framework::product(tensor->dims()); ++i) {
+  for (int64_t i = 0; i < tensor->numel(); ++i) {
    EXPECT_EQ(expect[i], actual[i]);
  }
  auto& actual_lod = target->lod();
@@ -60,4 +60,4 @@ TEST(SaveLoadOp, CPU) {
      EXPECT_EQ(expect_lod[i][j], actual_lod[i][j]);
    }
  }
 }
\ No newline at end of file
--- a/paddle/operators/sequence_conv_op.cc
+++ b/paddle/operators/sequence_conv_op.cc
@@ -89,7 +89,7 @@ class SequenceConvGradOp : public framework::OperatorWithKernel {
    }
    if (ctx->HasOutput(framework::GradVarName("X"))) {
      ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("X"));
-      ctx->ShareLoD(framework::GradVarName("X"), "X");
+      ctx->ShareLoD("X", framework::GradVarName("X"));
    }
    if (ctx->HasOutput(framework::GradVarName("Filter"))) {
      ctx->SetOutputDim(framework::GradVarName("Filter"),

--- a/paddle/operators/sequence_pool_op.cc
+++ b/paddle/operators/sequence_pool_op.cc
@@ -39,15 +39,14 @@ class SequencePoolOpMaker : public framework::OpProtoAndCheckerMaker {
    AddOutput("Out",
              "(Tensor), output of SequencePoolOp, which does not contain LoD "
              "infomation.");
-    AddAttr<int>(
+    AddAttr<std::string>(
-        "strategy",
+        "pooltype",
-        "(int, default AVERAGE) the pooling strategy of SequencePoolOp.")
+        "(string, default AVERAGE) the pooling type of SequencePoolOp.")
-        .SetDefault(AVERAGE)
+        .SetDefault("AVERAGE");
-        .InEnum({AVERAGE, SUM, SQRT, MAX, LAST, FIRST});
    AddComment(R"DOC(
    SequencePoolOp pools features of all time-steps of each instance.
-    It supports six pooling strategy:
+    It supports six pooling types (pooltype):
    - AVERAGE: Out[i] = average_{for each instance in i-th sequence}{X[i]}
    - SUM:     Out[i] = sum_{for each instance in i-th sequence}{X[i]}
    - SQRT:    Out[i] = sum_{for each instance in i-th sequence}{X[i]} 
@@ -63,7 +62,7 @@ class SequencePoolOpMaker : public framework::OpProtoAndCheckerMaker {
    and the value of X = [[1, 3], [2, 4, 6], [5, 1]].
    Thus, Out is a [3,1,1] Tensor without LoD infomation.
-    And for different strategy, the value of Out is as follows:
+    For each pooltype, the value of Out is as follows:
    - AVERAGE: [2, 4, 3], where 2=(1+3)/2, 4=(2+4+6)/3, 3=(5+1)/2
    - SUM: [4, 12, 6], where 4=1+3, 12=2+4+6, 6=5+1
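The six `pooltype` values can be illustrated on the example X = [[1, 3], [2, 4, 6], [5, 1]] from the comment above. The sketch below is a standalone approximation for scalar features, assuming each sequence is a plain `std::vector<double>`. `SequencePool` is a hypothetical helper and is not the Eigen-based kernel in `sequence_pool_op.h`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not Paddle's kernel): reduces every sequence of scalar
// features to a single value, selected by a string pooltype.
std::vector<double> SequencePool(const std::vector<std::vector<double>>& seqs,
                                 const std::string& pooltype) {
  std::vector<double> out;
  out.reserve(seqs.size());
  for (const auto& seq : seqs) {
    assert(!seq.empty());  // every sequence needs at least one time-step
    const double sum = std::accumulate(seq.begin(), seq.end(), 0.0);
    const double n = static_cast<double>(seq.size());
    if (pooltype == "AVERAGE") {
      out.push_back(sum / n);
    } else if (pooltype == "SUM") {
      out.push_back(sum);
    } else if (pooltype == "SQRT") {
      out.push_back(sum / std::sqrt(n));  // sum scaled by sqrt(length)
    } else if (pooltype == "MAX") {
      out.push_back(*std::max_element(seq.begin(), seq.end()));
    } else if (pooltype == "LAST") {
      out.push_back(seq.back());
    } else if (pooltype == "FIRST") {
      out.push_back(seq.front());
    } else {
      throw std::invalid_argument("unsupported pooltype: " + pooltype);
    }
  }
  return out;
}
```

For the example X, `SequencePool(X, "AVERAGE")` gives [2, 4, 3] and `SequencePool(X, "SUM")` gives [4, 12, 6], matching the values listed in the operator comment.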

--- a/paddle/operators/sequence_pool_op.h
+++ b/paddle/operators/sequence_pool_op.h
@@ -29,22 +29,13 @@ template <typename T, int MajorType = Eigen::RowMajor,
          typename IndexType = Eigen::DenseIndex>
 using EigenMatrix = framework::EigenMatrix<T, MajorType, IndexType>;
-enum SeqPoolType {
-  AVERAGE = 0,
-  SUM = 1,
-  SQRT = 2,  // square_root_n
-  MAX = 3,
-  LAST = 4,
-  FIRST = 5
-};
 template <typename Place, typename T>
 class SequencePoolKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& context) const override {
    auto* in = context.Input<LoDTensor>("X");
    auto* out = context.Output<LoDTensor>("Out");
-    int strategy = context.Attr<int>("strategy");
+    std::string pooltype = context.Attr<std::string>("pooltype");
    auto dims = in->dims();
    auto lod = in->lod();
@@ -71,28 +62,21 @@ class SequencePoolKernel : public framework::OpKernel<T> {
      auto in_e = EigenMatrix<T>::From(in_t, framework::make_ddim({h, w}));
      auto out_e = EigenVector<T>::Flatten(out_t);
-      switch (strategy) {
+      if (pooltype == "AVERAGE") {
-        case AVERAGE:
+        out_e.device(place) = in_e.mean(Eigen::array<int, 1>({{0}}));
-          out_e.device(place) = in_e.mean(Eigen::array<int, 1>({{0}}));
+      } else if (pooltype == "SUM") {
-          break;
+        out_e.device(place) = in_e.sum(Eigen::array<int, 1>({{0}}));
-        case SUM:
+      } else if (pooltype == "SQRT") {
-          out_e.device(place) = in_e.sum(Eigen::array<int, 1>({{0}}));
+        out_e.device(place) = in_e.sum(Eigen::array<int, 1>({{0}})) /
-          break;
+                              std::sqrt(static_cast<T>(h));
-        case SQRT:
+      } else if (pooltype == "MAX") {
-          out_e.device(place) = in_e.sum(Eigen::array<int, 1>({{0}})) /
+        out_e.device(place) = in_e.maximum(Eigen::array<int, 1>({{0}}));
-                                std::sqrt(static_cast<T>(h));
+      } else if (pooltype == "LAST") {
-          break;
+        out_e.device(place) = in_e.chip(h - 1, 0);
-        case MAX:
+      } else if (pooltype == "FIRST") {
-          out_e.device(place) = in_e.maximum(Eigen::array<int, 1>({{0}}));
+        out_e.device(place) = in_e.chip(0, 0);
-          break;
+      } else {
-        case LAST:
+        PADDLE_THROW("unsupported pooltype");
-          out_e.device(place) = in_e.chip(h - 1, 0);
-          break;
-        case FIRST:
-          out_e.device(place) = in_e.chip(0, 0);
-          break;
-        default:
-          PADDLE_THROW("unsupported pooling strategy");
      }
    }
  }
@@ -105,15 +89,15 @@ class SequencePoolGradKernel : public framework::OpKernel<T> {
    auto* in = context.Input<LoDTensor>("X");
    auto* in_g = context.Output<LoDTensor>(framework::GradVarName("X"));
    auto* out_g = context.Input<LoDTensor>(framework::GradVarName("Out"));
-    int strategy = context.Attr<int>("strategy");
+    std::string pooltype = context.Attr<std::string>("pooltype");
    auto dims = in->dims();
    auto lod = in->lod()[0];
    int64_t w = in->numel() / dims[0];
    in_g->mutable_data<T>(context.GetPlace());
-    if (strategy == LAST || strategy == FIRST) {
+    if (pooltype == "LAST" || pooltype == "FIRST") {
-      // set X@Grad be zero at first when strategy is LAST/FIRST
+      // zero-fill X@Grad first when pooltype is LAST or FIRST
      math::SetConstant<Place, T> functor;
      functor(context.device_context(), in_g, 0);
    }
@@ -127,41 +111,33 @@ class SequencePoolGradKernel : public framework::OpKernel<T> {
      auto out_g_e = EigenMatrix<T>::From(out_g_t, {1, w});
      Eigen::DSizes<int, 2> bcast(h, 1);
-      switch (strategy) {
+      if (pooltype == "AVERAGE") {
-        case AVERAGE:
+        in_g_e.device(place) = (out_g_e / static_cast<T>(h)).broadcast(bcast);
-          in_g_e.device(place) = (out_g_e / static_cast<T>(h)).broadcast(bcast);
+      } else if (pooltype == "SUM") {
-          break;
+        in_g_e.device(place) = (out_g_e).broadcast(bcast);
-        case SUM:
+      } else if (pooltype == "SQRT") {
-          in_g_e.device(place) = (out_g_e).broadcast(bcast);
+        in_g_e.device(place) =
-          break;
+            (out_g_e / std::sqrt(static_cast<T>(h))).broadcast(bcast);
-        case SQRT:
+      } else if (pooltype == "MAX") {
-          in_g_e.device(place) =
+        auto in_t =
-              (out_g_e / std::sqrt(static_cast<T>(h))).broadcast(bcast);
+            in->Slice(static_cast<int>(lod[i]), static_cast<int>(lod[i + 1]));
-          break;
+        Eigen::Map<const Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>>
-        case MAX: {
+            in_t_map(in_t.data<T>(), h, w);
-          auto in_t =
+        int row_id;
-              in->Slice(static_cast<int>(lod[i]), static_cast<int>(lod[i + 1]));
+        Eigen::array<int, 2> extents{{1, 1}};
-          Eigen::Map<const Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>>
+        for (int col_id = 0; col_id < w; col_id++) {
-              in_t_map(in_t.data<T>(), h, w);
+          in_t_map.col(col_id).maxCoeff(&row_id);
-          int row_id;
+          Eigen::array<int, 2> in_offsets{{row_id, col_id}};
-          Eigen::array<int, 2> extents{{1, 1}};
+          Eigen::array<int, 2> out_offsets{{0, col_id}};
-          for (int col_id = 0; col_id < w; col_id++) {
+          in_g_e.slice(in_offsets, extents).device(place) =
-            in_t_map.col(col_id).maxCoeff(&row_id);
+              out_g_e.slice(out_offsets, extents);
-            Eigen::array<int, 2> in_offsets{{row_id, col_id}};
-            Eigen::array<int, 2> out_offsets{{0, col_id}};
-            in_g_e.slice(in_offsets, extents).device(place) =
-                out_g_e.slice(out_offsets, extents);
-          }
-          break;
        }
-        case LAST:
+      } else if (pooltype == "LAST") {
-          in_g_e.chip(h - 1, 0).device(place) = out_g_e;
+        in_g_e.chip(h - 1, 0).device(place) = out_g_e;
-          break;
+      } else if (pooltype == "FIRST") {
-        case FIRST:
+        in_g_e.chip(0, 0).device(place) = out_g_e;
-          in_g_e.chip(0, 0).device(place) = out_g_e;
+      } else {
-          break;
+        PADDLE_THROW("unsupported pooltype");
-        default:
-          PADDLE_THROW("unsupported pooling strategy");
      }
    }
  }
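
Review note: the six `pooltype` branches in `SequencePoolKernel` can be cross-checked against the example in the op's DOC string with a small reference sketch. This is a plain-Python illustration, not the kernel itself. The function name `sequence_pool`, the flat `rows` layout, and the offsets `[0, 2, 5, 7]` are assumptions chosen to match `X = [[1, 3], [2, 4, 6], [5, 1]]` with one scalar feature per time-step.

```python
import math


def sequence_pool(rows, lod, pooltype):
    """Pool each sequence of scalar time-steps into a single value.

    rows: flat list of time-step values.
    lod:  offsets delimiting the sequences (hypothetical level-0 LoD).
    """
    out = []
    for start, end in zip(lod[:-1], lod[1:]):
        seq = rows[start:end]
        h = len(seq)  # number of time-steps in this sequence
        if pooltype == "AVERAGE":
            out.append(sum(seq) / float(h))
        elif pooltype == "SUM":
            out.append(sum(seq))
        elif pooltype == "SQRT":
            out.append(sum(seq) / math.sqrt(h))
        elif pooltype == "MAX":
            out.append(max(seq))
        elif pooltype == "LAST":
            out.append(seq[-1])
        elif pooltype == "FIRST":
            out.append(seq[0])
        else:
            raise ValueError("unsupported pooltype: %s" % pooltype)
    return out


# X = [[1, 3], [2, 4, 6], [5, 1]] stored flat, with assumed offsets.
x = [1, 3, 2, 4, 6, 5, 1]
lod = [0, 2, 5, 7]
average = sequence_pool(x, lod, "AVERAGE")
total = sequence_pool(x, lod, "SUM")
```

Only the exact strings handled by the `if`/`else if` chain are accepted, so any caller-side validation has to use the same spellings.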

--- a/paddle/operators/softmax_with_cross_entropy_op.cc
+++ b/paddle/operators/softmax_with_cross_entropy_op.cc
@@ -32,9 +32,9 @@ class SoftmaxWithCrossEntropyOpMaker
    AddInput("Label",
             "(Tensor, default: Tensor<int>), The ground truth which is a 2-D "
             "tensor. "
-             "If softLable is set to 0, Label is a Tensor<int> with shape [N x "
+             "If softLabel is set to false, Label is a Tensor<int> with shape "
-             "1]. "
+             "[N x 1]. "
-             "If softLable is set to 1, Label is a Tensor<float/double> "
+             "If softLabel is set to true, Label is a Tensor<float/double> "
             "with shape [N x K].");
    AddOutput(
        "Softmax",
@@ -60,19 +60,23 @@ Because this operators performs a softmax on logits internally, it expects
 unscaled logits. Please do not call this op with the output of softmax operator,
 which will produce incorrect results.
-This operators expects mutually exclusive hard labels, each sample in a batch
+When the attribute softLabel is set to false, this operator expects mutually
-is in exactly one class with probabilities 1. Each sample in the batch with one
+exclusive hard labels, each sample in a batch is in exactly one class with
-and only one label.
+probability 1. Each sample in the batch has one and only one label.
 Equation:
 1) hard label (one-hot label)
-Loss_j = -\text{Logit}_{Label_j} + \log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right), j = 1, ..., K
+Loss_j = \f$ -\text{Logit}_{Label_j} +
+\log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right),
+j = 1, ..., K \f$
 2) soft label (a distribution over all classes)
-Loss_j = -\sum_{i=0}^{K}\text{Label}_i\left(\text{Logit}_i-\log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right)\right), j = 1,...,K
+Loss_j = \f$ -\sum_{i=0}^{K}\text{Label}_i\left(\text{Logit}_i -
+\log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right)\right),
+j = 1,...,K \f$
 )DOC");
  }
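
Review note: the two equations in the DOC string describe the same quantity when the soft label is a one-hot distribution. The sketch below checks that numerically for a single sample. The helper names (`log_sum_exp`, `hard_label_loss`, `soft_label_loss`) are illustrative and not part of the operator; the max-subtraction inside `log_sum_exp` is a standard stability trick and is an assumption about how one would implement it, not a statement about the kernel.

```python
import math


def log_sum_exp(logits):
    m = max(logits)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in logits))


def hard_label_loss(logits, label):
    # Loss = -Logit[label] + log(sum_i exp(Logit_i))
    return -logits[label] + log_sum_exp(logits)


def soft_label_loss(logits, label_dist):
    # Loss = -sum_i Label_i * (Logit_i - log(sum_i exp(Logit_i)))
    lse = log_sum_exp(logits)
    return -sum(p * (v - lse) for p, v in zip(label_dist, logits))


logits = [2.0, 1.0, 0.1]
hard = hard_label_loss(logits, 0)
soft = soft_label_loss(logits, [1.0, 0.0, 0.0])
```

Both functions take unscaled logits, which matches the warning in the DOC string about not feeding the op an already-normalized softmax output.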

--- a/paddle/pybind/protobuf.cc
+++ b/paddle/pybind/protobuf.cc
@@ -129,7 +129,8 @@ void BindProgramDesc(py::module &m) {
             }
             return retv;
           })
-      .def("block", &ProgramDescBind::Block, py::return_value_policy::reference)
+      .def("block", &ProgramDescBind::MutableBlock,
+           py::return_value_policy::reference)
      .def("num_blocks", &ProgramDescBind::Size)
      .def("serialize_to_string",
           [](ProgramDescBind &program_desc) -> py::bytes {

--- a/paddle/pybind/pybind.cc
+++ b/paddle/pybind/pybind.cc
@@ -275,7 +275,7 @@ All parameter, weight, gradient are variables in Paddle.
                    const std::vector<std::array<size_t, 2>> &targets) {
    ProgramDescBind prog_with_targets(origin);
    for (const auto &t : targets) {
-      prog_with_targets.Block(t[0])->Op(t[1])->MarkAsTarget();
+      prog_with_targets.MutableBlock(t[0])->Op(t[1])->MarkAsTarget();
    }
    ProgramDesc pruned_desc;
    Prune(*prog_with_targets.Proto(), &pruned_desc);
@@ -335,7 +335,7 @@ All parameter, weight, gradient are variables in Paddle.
                    PADDLE_ENFORCE(desc.IsInitialized(),
                                   "User OpDesc is not initialized, reason %s",
                                   desc.InitializationErrorString());
-                    return OpRegistry::CreateOp(desc, nullptr);
+                    return OpRegistry::CreateOp(desc);
                  })
      .def("backward",
           [](const OperatorBase &forwardOp,
@@ -439,7 +439,7 @@ All parameter, weight, gradient are variables in Paddle.
            PADDLE_ENFORCE(desc.IsInitialized(),
                           "User OpDesc is not initialized, reason %s",
                           desc.InitializationErrorString());
-            auto rnn_op = OpRegistry::CreateOp(desc, nullptr);
+            auto rnn_op = OpRegistry::CreateOp(desc);
            return static_cast<operators::RecurrentOp *>(rnn_op.release());
          })
      .def("set_stepnet", [](operators::RecurrentOp &self,
@@ -457,7 +457,7 @@ All parameter, weight, gradient are variables in Paddle.
                    PADDLE_ENFORCE(desc.IsInitialized(),
                                   "User OpDesc is not initialized, reason %s",
                                   desc.InitializationErrorString());
-                    auto rnn_op = OpRegistry::CreateOp(desc, nullptr);
+                    auto rnn_op = OpRegistry::CreateOp(desc);
                    return static_cast<operators::DynamicRecurrentOp *>(
                        rnn_op.release());
                  })
@@ -484,7 +484,7 @@ All parameter, weight, gradient are variables in Paddle.
                    PADDLE_ENFORCE(desc.IsInitialized(),
                                   "User OpDesc is not initialized, reason %s",
                                   desc.InitializationErrorString());
-                    auto cond_op = OpRegistry::CreateOp(desc, nullptr);
+                    auto cond_op = OpRegistry::CreateOp(desc);
                    return static_cast<operators::CondOp *>(cond_op.release());
                  })
      .def("set_truenet",
@@ -498,10 +498,7 @@ All parameter, weight, gradient are variables in Paddle.
  py::class_<framework::Executor>(m, "Executor")
      .def(py::init<std::vector<platform::Place> &>())
-      .def("run", [](Executor &self, ProgramDescBind *program_bind,
+      .def("run", &Executor::Run);
-                     Scope *scope, int block_id) {
-        self.Run(*program_bind->Proto(), scope, block_id);
-      });
  m.def("unique_integer", UniqueIntegerGenerator);
  m.def("init_gflags", InitGflags);

--- a/paddle/scripts/docker/build_android.sh
+++ b/paddle/scripts/docker/build_android.sh
@@ -4,6 +4,10 @@ set -xe
 if [ $ANDROID_ABI == "arm64-v8a" ]; then
  ANDROID_ARCH=arm64
+  if [ $ANDROID_API -lt 21 ]; then
+    echo "Warning: arm64-v8a requires ANDROID_API >= 21."
+    ANDROID_API=21
+  fi
 else # armeabi, armeabi-v7a
  ANDROID_ARCH=arm
 fi

--- a/paddle/trainer/MergeModel.cpp
+++ b/paddle/trainer/MergeModel.cpp
@@ -27,6 +27,13 @@ using namespace paddle;  // NOLINT
 using namespace std;     // NOLINT
 int main(int argc, char** argv) {
  initMain(argc, argv);
  initPython(argc, argv);
+  if (FLAGS_model_dir.empty() || FLAGS_config_file.empty() ||
+      FLAGS_model_file.empty()) {
+    LOG(INFO) << "Usage: ./paddle_merge_model --model_dir=pass-00000 "
+                 "--config_file=config.py --model_file=out.paddle";
+    return 0;
+  }

--- a/paddle/trainer/tests/CMakeLists.txt
+++ b/paddle/trainer/tests/CMakeLists.txt
@@ -37,22 +37,6 @@ add_test(NAME test_CompareTwoNets
            --config_file_a=trainer/tests/sample_trainer_config_qb_rnn.conf --config_file_b=trainer/tests/sample_trainer_config_rnn.conf
    WORKING_DIRECTORY ${PADDLE_SOURCE_DIR}/paddle/)
-################ test_CompareMKLDNNandCPU ######################
-if(WITH_MKLDNN)
-  macro(gen_command VAR_NAME CONFIG_FILE)
-    set(${VAR_NAME} "${PADDLE_SOURCE_DIR}/paddle/.set_python_path.sh" "-d" "${PADDLE_SOURCE_DIR}/python/"
-                    "${CMAKE_CURRENT_BINARY_DIR}/test_CompareMKLDNNandCPU --use_gpu=False"
-                    "--config_file_a=trainer/tests/${CONFIG_FILE} --use_mkldnn_a=True"
-                    "--config_file_b=trainer/tests/${CONFIG_FILE} --use_mkldnn_b=False"
-                    "WORKING_DIRECTORY" "${PADDLE_SOURCE_DIR}/paddle/")
-  endmacro()
-  add_unittest_without_exec(test_CompareMKLDNNandCPU test_CompareTwoNets.cpp)
-  gen_command(compare_simple_net "sample_trainer_config_simple_net.conf")
-  gen_command(compare_branch_net "sample_trainer_config_branch_net.conf")
-  add_test(NAME test_CompareMKLDNNandCPU_simple_net COMMAND ${compare_simple_net})
-  add_test(NAME test_CompareMKLDNNandCPU_branch_net COMMAND ${compare_branch_net})
-endif()
 ############### test_CompareTwoOpts ###################
 add_unittest_without_exec(test_CompareTwoOpts
    test_CompareTwoOpts.cpp)

--- a/paddle/trainer/tests/sample_trainer_config_simple_net.conf
+++ b/paddle/trainer/tests/sample_trainer_config_simple_net.conf
-# Copyright (c) 2017 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from paddle.trainer_config_helpers import *
-################################### Data Configuration ###################################
-TrainData(ProtoData(files = "trainer/tests/mnist.list"))
-################################### Algorithm Configuration ###################################
-settings(batch_size = 128,
-         learning_method = MomentumOptimizer(momentum=0.5, sparse=False))
-################################### Network Configuration ###################################
-data = data_layer(name ="input", size=784)
-tmp = img_conv_layer(input=data,
-            num_channels=1,
-            filter_size=3,
-            num_filters=32,
-            padding=1,
-            shared_biases=True,
-            act=ReluActivation())
-tmp = img_pool_layer(input=tmp,
-            pool_size=3,
-            stride=2,
-            padding=1,
-            pool_type=AvgPooling())
-tmp = img_conv_layer(input=tmp,
-            filter_size=3,
-            num_filters=32,
-            padding=1,
-            shared_biases=True,
-            act=LinearActivation(),
-            bias_attr=False)
-tmp = batch_norm_layer(input=tmp,
-            use_global_stats=False,
-            act=ReluActivation())
-tmp = img_pool_layer(input=tmp,
-            pool_size=3,
-            stride=2,
-            padding=1,
-            pool_type=MaxPooling())
-tmp = fc_layer(input=tmp, size=64,
-               bias_attr=True,
-               act=ReluActivation())
-output = fc_layer(input=tmp, size=10,
-                  bias_attr=True,
-                  act=SoftmaxActivation())
-lbl = data_layer(name ="label", size=10)
-cost = classification_cost(input=output, label=lbl)
-outputs(cost)
--- a/paddle/trainer/tests/test_CompareTwoNets.cpp
+++ b/paddle/trainer/tests/test_CompareTwoNets.cpp
@@ -26,15 +26,12 @@ DECLARE_int32(gpu_id);
 DECLARE_bool(local);
 DECLARE_bool(use_gpu);
-DECLARE_bool(use_mkldnn);
 DECLARE_string(config);
 DECLARE_string(nics);
 DEFINE_string(config_file_a, "", "config of one network to compare");
 DEFINE_string(config_file_b, "", "config of another network to compare");
-DEFINE_bool(use_mkldnn_a, false, "whether to use mkldnn to run config_file_a");
-DEFINE_bool(use_mkldnn_b, false, "whether to use mkldnn to run config_file_b");
 DEFINE_bool(need_high_accuracy,
            false,
            "whether need to run in double accuracy");
@@ -131,12 +128,6 @@ void compareGradient(ComData& comDataA, ComData& comDataB) {
                matA.getWidth());
  }
-  if (FLAGS_use_mkldnn_a || FLAGS_use_mkldnn_b) {
-    // some format of mkldnn parameter is different with cpu
-    // test_MKLDNN will check the parameters
-    return;
-  }
  vector<ParameterPtr>& parametersA = comDataA.parameters;
  vector<ParameterPtr>& parametersB = comDataB.parameters;
@@ -176,12 +167,10 @@ void compareGradient(ComData& comDataA, ComData& comDataB) {
 TEST(Trainer, create) {
  ComData dataA;
-  FLAGS_use_mkldnn = FLAGS_use_mkldnn_a;
  calcGradient(dataA, FLAGS_config_file_a);
  LOG(INFO) << "\n\nforwardBackward of Network A is finished\n\n";
  ComData dataB;
-  FLAGS_use_mkldnn = FLAGS_use_mkldnn_b;
  calcGradient(dataB, FLAGS_config_file_b);
  LOG(INFO) << "\n\nforwardBackward of the Network B is finished\n\n";

--- a/python/paddle/v2/dataset/imdb.py
+++ b/python/paddle/v2/dataset/imdb.py
@@ -116,7 +116,7 @@ def reader_creator(pos_pattern, neg_pattern, word_idx, buffer_size):
            yield [word_idx.get(w, UNK) for w in doc], i % 2
            doc = qs[i % 2].get()
-    return reader()
+    return reader
 def train(word_idx):
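
Review note: the `return reader` change matters because a reader creator is expected to hand back a callable, not an already-started generator. A minimal sketch of the pattern, with a simplified `reader_creator` signature and made-up samples (both are assumptions for illustration, not the IMDB code):

```python
def reader_creator(samples):
    """Return a reader: a zero-argument callable that yields samples."""

    def reader():
        for sample in samples:
            yield sample

    # Returning `reader` (not `reader()`) lets callers start a fresh pass
    # over the data each time they call it.
    return reader


train_reader = reader_creator([([3, 7], 0), ([5], 1)])
first_pass = list(train_reader())
second_pass = list(train_reader())
```

Had the creator returned `reader()`, the second pass would have needed a new creator call, because a generator object is exhausted after one iteration.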

--- a/python/paddle/v2/framework/framework.py
+++ b/python/paddle/v2/framework/framework.py
@@ -354,8 +354,8 @@ class Block(object):
    def create_var(self, *args, **kwargs):
        var = Variable(self, *args, **kwargs)
-        if 'init_attr' in kwargs:
+        if 'initializer' in kwargs:
-            self._prepend_initialize_ops_(var, kwargs['init_attr'])
+            kwargs['initializer'](var, self)
        return var
    def has_var(self, name):
@@ -364,8 +364,8 @@ class Block(object):
    def create_parameter(self, *args, **kwargs):
        global_block = self.program.global_block()
        param = Parameter(global_block, *args, **kwargs)
-        if 'init_attr' in kwargs:
+        if 'initializer' in kwargs:
-            self._prepend_initialize_ops_(param, kwargs['init_attr'])
+            kwargs['initializer'](param, self)
        return param
    def append_op(self, *args, **kwargs):
@@ -424,17 +424,6 @@ class Block(object):
        for index in range(len(self.ops)):
            assert self.ops[index].desc == ops_in_cpp[index]
-    def _prepend_initialize_ops_(self, param, init_attr):
-        op_type = init_attr['type']
-        init_attr['shape'] = param.shape
-        init_attr['data_type'] = int(param.data_type)
-        op = self.prepend_op(
-            type=op_type,
-            inputs=None,
-            outputs={'Out': [param]},
-            attrs=init_attr)
-        param.op = op
 class Program(object):
    def __init__(self):

--- a/python/paddle/v2/framework/initializer.py
+++ b/python/paddle/v2/framework/initializer.py
+import paddle.v2.framework.framework as framework
+__all__ = ['ConstantInitializer', 'UniformInitializer', 'NormalInitializer']
+class Initializer(object):
+    """Base class for variable initializers
+    Defines the common interface of variable initializers.
+    They add operations to the init program that are used
+    to initialize variables. Users should not use this class
+    directly, but need to use one of its implementations.
+    """
+    def __init__(self):
+        pass
+    def __call__(self, param, block):
+        """Add corresponding initialization operations to the network
+        """
+        raise NotImplementedError()
+class ConstantInitializer(Initializer):
+    """Implements the constant initializer
+    """
+    def __init__(self, value=0.0):
+        """Constructor for ConstantInitializer
+        Args:
+            value: constant value to initialize the variable
+        """
+        assert value is not None
+        super(ConstantInitializer, self).__init__()
+        self._value = value
+    def __call__(self, var, block):
+        """Add constant initialization ops for a variable
+        Args:
+            var: Variable that needs to be initialized
+            block: The block in which initialization ops
+                   should be added
+        Returns:
+            the initialization op
+        """
+        assert isinstance(var, framework.Variable)
+        assert isinstance(block, framework.Block)
+        # Initialization Ops should be prepended and not appended
+        op = block.prepend_op(
+            type="fill_constant",
+            outputs={"Out": var},
+            attrs={
+                "shape": var.shape,
+                "data_type": int(var.data_type),
+                "value": self._value
+            })
+        var.op = op
+        return op
+class UniformInitializer(Initializer):
+    """Implements the random uniform distribution initializer
+    """
+    def __init__(self, low=-1.0, high=1.0, seed=0):
+        """Constructor for UniformInitializer
+        Args:
+            low: lower boundary of the uniform distribution
+            high: upper boundary of the uniform distribution
+            seed: random seed
+        """
+        assert low is not None
+        assert high is not None
+        assert high >= low
+        assert seed is not None
+        super(UniformInitializer, self).__init__()
+        self._low = low
+        self._high = high
+        self._seed = seed
+    def __call__(self, var, block):
+        """Add uniform distribution initialization ops for a variable
+        Args:
+            var: Variable that needs to be initialized
+            block: The block in which initialization ops
+                   should be added
+        Returns:
+            the initialization op
+        """
+        assert isinstance(var, framework.Variable)
+        assert isinstance(block, framework.Block)
+        # Initialization Ops should be prepended and not appended
+        op = block.prepend_op(
+            type="uniform_random",
+            outputs={"Out": var},
+            attrs={
+                "shape": var.shape,
+                "data_type": int(var.data_type),
+                "min": self._low,
+                "max": self._high,
+                "seed": self._seed
+            })
+        var.op = op
+        return op
+class NormalInitializer(Initializer):
+    """Implements the random normal (Gaussian) distribution initializer
+    """
+    def __init__(self, loc=0.0, scale=1.0, seed=0):
+        """Constructor for NormalInitializer
+        Args:
+            loc: mean of the normal distribution
+            scale: standard deviation of the normal distribution
+            seed: random seed
+        """
+        assert loc is not None
+        assert scale is not None
+        assert seed is not None
+        super(NormalInitializer, self).__init__()
+        self._mean = loc
+        self._std_dev = scale
+        self._seed = seed
+    def __call__(self, var, block):
+        """Add normal distribution initialization ops for a variable
+        Args:
+            var: Variable that needs to be initialized
+            block: The block in which initialization ops
+                   should be added
+        Returns:
+            the initialization op
+        """
+        assert isinstance(var, framework.Variable)
+        assert isinstance(block, framework.Block)
+        # Initialization Ops should be prepended and not appended
+        op = block.prepend_op(
+            type="gaussian_random",
+            outputs={"Out": var},
+            attrs={
+                "shape": var.shape,
+                "data_type": int(var.data_type),
+                "mean": self._mean,
+                "std": self._std_dev,
+                "seed": self._seed
+            })
+        var.op = op
+        return op
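
Review note: the initializer objects replace the old `init_attr` dictionaries with callables that `create_var` and `create_parameter` invoke as `initializer(var, block)`. The sketch below reproduces that calling convention without Paddle. `FakeBlock`, `FakeVar`, `ConstantInit`, and the free function `create_var` are hypothetical stand-ins written for this note; only the `prepend_op` call shape and the `fill_constant` attributes are taken from the diff.

```python
class FakeBlock(object):
    """Hypothetical stand-in for framework.Block that only records ops."""

    def __init__(self):
        self.ops = []

    def prepend_op(self, **op):
        # Initialization ops go to the front, as in the real initializers.
        self.ops.insert(0, op)
        return op


class FakeVar(object):
    """Hypothetical stand-in for framework.Variable."""

    def __init__(self, name, shape):
        self.name = name
        self.shape = shape
        self.op = None


class ConstantInit(object):
    """Callable initializer mirroring the shape of ConstantInitializer."""

    def __init__(self, value=0.0):
        self._value = value

    def __call__(self, var, block):
        op = block.prepend_op(
            type="fill_constant",
            outputs={"Out": var},
            attrs={"shape": var.shape, "value": self._value})
        var.op = op
        return op


def create_var(block, name, shape, **kwargs):
    var = FakeVar(name, shape)
    if 'initializer' in kwargs:
        kwargs['initializer'](var, block)
    return var


block = FakeBlock()
w = create_var(block, "w", [2, 3], initializer=ConstantInit(1.0))
b = create_var(block, "b", [3], initializer=ConstantInit())
```

Because each initializer prepends its op, the most recently created variable's init op ends up first in the block.
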
--- a/python/paddle/v2/framework/layer_helper.py
+++ b/python/paddle/v2/framework/layer_helper.py
@@ -5,6 +5,8 @@ import paddle.v2.framework.core as core
 from paddle.v2.framework.framework import Variable, g_program, \
    g_init_program
+from paddle.v2.framework.initializer import ConstantInitializer, \
+    UniformInitializer
 def unique_name(prefix):
@@ -66,14 +68,7 @@ class LayerHelper(object):
    @property
    def param_attr(self):
-        default = {
+        default = {'name': None, 'initializer': UniformInitializer()}
-            'name': None,
-            'init_attr': {
-                'type': 'uniform_random',
-                'min': -1.0,
-                'max': 1.0
-            }
-        }
        actual = self.kwargs.get('param_attr', None)
        if actual is None:
            actual = default
@@ -83,13 +78,7 @@ class LayerHelper(object):
        return actual
    def bias_attr(self):
-        default = {
+        default = {'name': None, 'initializer': ConstantInitializer()}
-            'name': None,
-            'init_attr': {
-                'type': 'fill_constant',
-                'value': 0.0
-            }
-        }
        bias_attr = self.kwargs.get('bias_attr', None)
        if bias_attr is True:
            bias_attr = default
@@ -153,8 +142,24 @@ class LayerHelper(object):
        return self.program.global_block().create_var(
            *args, persistable=False, **kwargs)
-    def append_bias_op(self, input_var):
+    def append_bias_op(self, input_var, num_flatten_dims=None):
-        size = list(input_var.shape[1:])
+        """
+        Append the bias operator and return its output. If the user does not
+        set bias_attr, append_bias_op returns input_var unchanged.
+        :param input_var: the input variable. len(input_var.shape) must be
+        greater than or equal to 2.
+        :param num_flatten_dims: the input tensor is flattened into a matrix
+        when the bias is added:
+        `matrix.shape = product(input_var.shape[0:num_flatten_dims]), product(
+                input_var.shape[num_flatten_dims:])`
+        """
+        if num_flatten_dims is None:
+            num_flatten_dims = self.kwargs.get('num_flatten_dims', None)
+            if num_flatten_dims is None:
+                num_flatten_dims = 1
+        size = list(input_var.shape[num_flatten_dims:])
        bias_attr = self.bias_attr()
        if not bias_attr:
            return input_var
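
Review note: the effect of `num_flatten_dims` on the bias shape is easy to get wrong, so here is a small shape-only sketch. The helpers `product`, `flatten_to_matrix`, and `bias_size` are hypothetical names; they restate the docstring formula and the `input_var.shape[num_flatten_dims:]` slice, nothing more.

```python
from functools import reduce
import operator


def product(dims):
    return reduce(operator.mul, dims, 1)


def flatten_to_matrix(shape, num_flatten_dims=1):
    """Return the 2-D shape the input is viewed as when the bias is added."""
    return (product(shape[:num_flatten_dims]),
            product(shape[num_flatten_dims:]))


def bias_size(shape, num_flatten_dims=1):
    """Shape of the bias parameter: input_var.shape[num_flatten_dims:]."""
    return list(shape[num_flatten_dims:])


shape = [8, 5, 16]
default_matrix = flatten_to_matrix(shape)
seq_matrix = flatten_to_matrix(shape, num_flatten_dims=2)
```

With the default of 1, a `[8, 5, 16]` input gets a `[5, 16]` bias; with `num_flatten_dims=2` the bias shrinks to `[16]` and is shared across the first two dimensions.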

--- a/python/paddle/v2/framework/layers.py
+++ b/python/paddle/v2/framework/layers.py
 from paddle.v2.framework.layer_helper import LayerHelper, unique_name
 import paddle.v2.framework.core as core
 from paddle.v2.framework.framework import OpProtoHolder, Variable, Program
+from paddle.v2.framework.initializer import ConstantInitializer
 import re
 __all__ = [
    'fc', 'data', 'cross_entropy', 'conv2d', 'pool2d', 'embedding', 'concat',
-    'StaticRNN', 'cast', 'sequence_conv', 'sequence_pool', 'accuracy'
+    'StaticRNN', 'cast', 'sequence_conv', 'sequence_pool', 'sums', 'cos_sim',
+    'batch_norm', 'accuracy'
 ]
@@ -165,18 +167,6 @@ _create_op_func_('dropout')
 _create_op_func_('reshape')
-def cast(x, data_type, program=None):
-    helper = LayerHelper('cast', **locals())
-    out = helper.create_tmp_variable(dtype=data_type)
-    helper.append_op(
-        type='cast',
-        inputs={'X': [x]},
-        outputs={'Out': [out]},
-        attrs={'in_data_type': x.data_type,
-               'out_data_type': out.data_type})
-    return out
 def cast(x, data_type, program=None):
    helper = LayerHelper('cast', **locals())
    out = helper.create_tmp_variable(dtype=data_type)
@@ -191,9 +181,7 @@ def cast(x, data_type, program=None):
 def concat(input, axis, program=None, init_program=None):
    helper = LayerHelper('concat', **locals())
-    if not isinstance(input, list) and not isinstance(input, tuple):
+    out = helper.create_tmp_variable(dtype=helper.input_dtype())
-        input = [input]
-    out = helper.create_tmp_variable(dtype=input[0].data_type)
    helper.append_op(
        type='concat',
        inputs={'X': input},
@@ -202,6 +190,28 @@ def concat(input, axis, program=None, init_program=None):
    return out
+def sums(input, program=None, init_program=None):
+    helper = LayerHelper('sum', **locals())
+    out = helper.create_tmp_variable(dtype=helper.input_dtype())
+    helper.append_op(type='sum', inputs={'X': [input]}, outputs={'Out': out})
+    return out
+def cos_sim(X, Y, program=None, init_program=None):
+    helper = LayerHelper('cos_sim', **locals())
+    out = helper.create_tmp_variable(dtype=helper.input_dtype("X"))
+    xnorm = helper.create_tmp_variable(dtype=helper.input_dtype("X"))
+    ynorm = helper.create_tmp_variable(dtype=helper.input_dtype("X"))
+    helper.append_op(
+        type='cos_sim',
+        inputs={'X': [X],
+                'Y': [Y]},
+        outputs={'Out': [out],
+                 'XNorm': [xnorm],
+                 'YNorm': [ynorm]})
+    return out, xnorm, ynorm
 def cross_entropy(input, label, **kwargs):
    helper = LayerHelper('cross_entropy', **kwargs)
    out = helper.create_tmp_variable(dtype=input.data_type)
@@ -254,9 +264,7 @@ def accuracy(input, label, k=1, **kwargs):
 def sequence_conv(input,
                  num_filters,
-                  name=None,
                  filter_size=3,
-                  act=None,
                  stride=1,
                  padding=None,
                  bias_attr=None,
@@ -270,7 +278,7 @@ def sequence_conv(input,
    helper = LayerHelper('sequence_conv', **locals())
    dtype = helper.input_dtype()
-    filter_shape = [num_filters, filter_size]
+    filter_shape = [filter_size * input.shape[1], num_filters]
    filter = helper.create_parameter(
        attr=helper.param_attr, shape=filter_shape, dtype=dtype)
    pre_bias = helper.create_tmp_variable(dtype)
@@ -279,7 +287,7 @@ def sequence_conv(input,
        type='sequence_conv',
        inputs={
            'X': [input],
-            'Filter': filter,
+            'Filter': [filter],
        },
        outputs={"Out": pre_bias},
        attrs={
@@ -287,7 +295,6 @@ def sequence_conv(input,
            'context_start': 0,
            'context_length': filter_size
        })
    pre_act = helper.append_bias_op(pre_bias)
    return helper.append_activation(pre_act)
@@ -344,31 +351,21 @@ def conv2d(input,
    return helper.append_activation(pre_act)
-def sequence_pool(input,
+def sequence_pool(input, pool_type, **kwargs):
-                  pool_size,
+    ENUM_POOL_TYPE = set(["AVERAGE", "SUM", "SQRT", "MAX", "LAST", "FIRST"])
-                  pool_type,
+    if pool_type.upper() not in ENUM_POOL_TYPE:
-                  pool_stride=1,
-                  pool_padding=0,
-                  global_pooling=False,
-                  program=None,
-                  init_program=None):
-    # FIXME(dzh) : want to unify the argument of python layer
-    # function. So we ignore some unecessary attributes
-    ENUM_POOL_TYPE = set(["max", "avg", "sqrt", "last", "first"])
-    if pool_type not in ENUM_POOL_TYPE:
        raise ValueError("Unknown pool_type: '%s'. It can only be %s." %
                         (str(pool_type), " ".join(ENUM_POOL_TYPE)))
-    helper = LayerHelper('sequence_pool', **locals())
+    helper = LayerHelper('sequence_pool', **kwargs)
    dtype = helper.input_dtype()
    pool_out = helper.create_tmp_variable(dtype)
    helper.append_op(
        type="sequence_pool",
        inputs={"X": [input]},
-        outputs={"Out": pool_out},
+        outputs={"Out": [pool_out]},
-        attrs={"strategy": pool_type})
+        attrs={"pooltype": pool_type.upper()})
    return pool_out
@@ -433,26 +430,12 @@ def batch_norm(input,
        else:
            raise ValueError("unsupported data layout:" + data_layout)
-    def get_init_attr(value):
-        if not isinstance(value, float):
-            raise ValueError("attr value should be a float")
-        return {'type': 'fill_constant', 'value': value}
-    def prepend_init_op(var, init_attr):
-        assert isinstance(var, Variable)
-        op_type = init_attr['type']
-        init_attr['shape'] = var.shape
-        init_attr['data_type'] = int(var.data_type)
-        op = var.block.prepend_op(
-            type=op_type, inputs=None, outputs={'Out': [var]}, attrs=init_attr)
-        return op
-    def create_persistable_var(dtype, shape, init_attr=None):
+    def create_persistable_var(dtype, shape, initializer=None):
        name = unique_name(".".join([helper.name, "xxxx"]))
        var = init_program.global_block().create_var(
            dtype=dtype, shape=shape, name=name, persistable=True)
-        if 'init_attr' is not None:
-            prepend_init_op(var, init_attr)
+        if initializer is not None:
+            initializer(var, var.block)
        return program.global_block().create_var(
            name=name, dtype=dtype, shape=shape, persistable=True)
@@ -465,8 +448,9 @@ def batch_norm(input,
        attr=helper.param_attr, shape=param_shape, dtype=dtype)
    # create input
-    mean = create_persistable_var(dtype, param_shape, get_init_attr(0.0))
-    variance = create_persistable_var(dtype, param_shape, get_init_attr(1.0))
+    mean = create_persistable_var(dtype, param_shape, ConstantInitializer(0.0))
+    variance = create_persistable_var(dtype, param_shape,
+                                      ConstantInitializer(1.0))
    # create output
    # mean and mean_out share the same memory
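
The new `sequence_conv` filter shape, `[filter_size * input.shape[1], num_filters]`, is consistent with treating the op as a context projection followed by a matrix multiply. The NumPy sketch below illustrates that reading for a single sequence. The function name, the zero padding past the end of the sequence, and the row layout of the filter are assumptions made for illustration, not the operator's confirmed behaviour.

```python
import numpy as np


def sequence_conv_sketch(x, filt, filter_size, context_start=0):
    """Illustrative only. x: [T, D] (one sequence);
    filt: [filter_size * D, num_filters]."""
    T, D = x.shape
    assert filt.shape[0] == filter_size * D
    # Row t of ctx concatenates rows t + context_start ... of x;
    # positions that fall outside the sequence stay zero (assumed padding).
    ctx = np.zeros((T, filter_size * D), dtype=x.dtype)
    for t in range(T):
        for k in range(filter_size):
            src = t + context_start + k
            if 0 <= src < T:
                ctx[t, k * D:(k + 1) * D] = x[src]
    # One matmul maps every context window to num_filters outputs.
    return ctx.dot(filt)


x = np.arange(6, dtype="float64").reshape(3, 2)  # T=3, D=2
filt = np.ones((2 * 2, 4))                       # filter_size=2, num_filters=4
out = sequence_conv_sketch(x, filt, filter_size=2)
```

With the old shape `[num_filters, filter_size]` the multiply above would not be well defined for `D > 1`, which is presumably what the change addresses.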

--- a/python/paddle/v2/framework/nets.py
+++ b/python/paddle/v2/framework/nets.py
@@ -101,24 +101,19 @@ def img_conv_group(input,
 def sequence_conv_pool(input,
                       num_filters,
                       filter_size,
-                       pool_size,
-                       pool_stride,
-                       act,
+                       pool_type="max",
                       program=None,
                       init_program=None):
    conv_out = layers.sequence_conv(
        input=input,
        num_filters=num_filters,
        filter_size=filter_size,
-        act=act,
        program=program,
        init_program=init_program)
    pool_out = layers.sequence_pool(
        input=conv_out,
-        pool_size=pool_size,
-        pool_type='max',
-        pool_stride=pool_stride,
+        pool_type=pool_type,
        program=program,
        init_program=init_program)
    return pool_out
--- a/python/paddle/v2/framework/tests/test_gaussian_random_op.py
+++ b/python/paddle/v2/framework/tests/test_gaussian_random_op.py
@@ -19,7 +19,7 @@ class TestGaussianRandomOp(unittest.TestCase):
        op = Operator(
            "gaussian_random",
            Out='Out',
-            dims=[1000, 784],
+            shape=[1000, 784],
            mean=.0,
            std=1.,
            seed=10)

--- a/python/paddle/v2/framework/tests/test_initializer.py
+++ b/python/paddle/v2/framework/tests/test_initializer.py
+import unittest
+import paddle.v2.framework.framework as framework
+import paddle.v2.framework.initializer as initializer
+DELTA = 0.00001
+class TestConstantInitializer(unittest.TestCase):
+    def test_constant_initializer_default_value(self):
+        """Test the constant initializer with default value
+        """
+        program = framework.Program()
+        block = program.global_block()
+        block.create_parameter(
+            dtype="float32",
+            shape=[5, 10],
+            lod_level=0,
+            name="param",
+            initializer=initializer.ConstantInitializer())
+        self.assertEqual(len(block.ops), 1)
+        init_op = block.ops[0]
+        self.assertEqual(init_op.type, 'fill_constant')
+        self.assertAlmostEqual(init_op.attr('value'), 0.0, delta=DELTA)
+    def test_constant_initializer(self):
+        """Test constant initializer with supplied value
+        """
+        program = framework.Program()
+        block = program.global_block()
+        block.create_parameter(
+            dtype="float32",
+            shape=[5, 10],
+            lod_level=0,
+            name="param",
+            initializer=initializer.ConstantInitializer(2.3))
+        self.assertEqual(len(block.ops), 1)
+        init_op = block.ops[0]
+        self.assertEqual(init_op.type, 'fill_constant')
+        self.assertAlmostEqual(init_op.attr('value'), 2.3, delta=DELTA)
+class TestUniformInitializer(unittest.TestCase):
+    def test_uniform_initializer_default_value(self):
+        """Test the uniform initializer with default value
+        """
+        program = framework.Program()
+        block = program.global_block()
+        block.create_parameter(
+            dtype="float32",
+            shape=[5, 10],
+            lod_level=0,
+            name="param",
+            initializer=initializer.UniformInitializer())
+        self.assertEqual(len(block.ops), 1)
+        init_op = block.ops[0]
+        self.assertEqual(init_op.type, 'uniform_random')
+        self.assertAlmostEqual(init_op.attr('min'), -1.0, delta=DELTA)
+        self.assertAlmostEqual(init_op.attr('max'), 1.0, delta=DELTA)
+        self.assertEqual(init_op.attr('seed'), 0)
+    def test_uniform_initializer(self):
+        """Test uniform initializer with supplied attributes
+        """
+        program = framework.Program()
+        block = program.global_block()
+        block.create_parameter(
+            dtype="float32",
+            shape=[5, 10],
+            lod_level=0,
+            name="param",
+            initializer=initializer.UniformInitializer(-4.2, 3.1, 123))
+        self.assertEqual(len(block.ops), 1)
+        init_op = block.ops[0]
+        self.assertEqual(init_op.type, 'uniform_random')
+        self.assertAlmostEqual(init_op.attr('min'), -4.2, delta=DELTA)
+        self.assertAlmostEqual(init_op.attr('max'), 3.1, delta=DELTA)
+        self.assertEqual(init_op.attr('seed'), 123)
+class TestNormalInitializer(unittest.TestCase):
+    def test_normal_initializer_default_value(self):
+        """Test the normal initializer with default value
+        """
+        program = framework.Program()
+        block = program.global_block()
+        block.create_parameter(
+            dtype="float32",
+            shape=[5, 10],
+            lod_level=0,
+            name="param",
+            initializer=initializer.NormalInitializer())
+        self.assertEqual(len(block.ops), 1)
+        init_op = block.ops[0]
+        self.assertEqual(init_op.type, 'gaussian_random')
+        self.assertAlmostEqual(init_op.attr('mean'), 0.0, delta=DELTA)
+        self.assertAlmostEqual(init_op.attr('std'), 1.0, delta=DELTA)
+        self.assertEqual(init_op.attr('seed'), 0)
+    def test_normal_initializer(self):
+        """Test normal initializer with supplied attributes
+        """
+        program = framework.Program()
+        block = program.global_block()
+        block.create_parameter(
+            dtype="float32",
+            shape=[5, 10],
+            lod_level=0,
+            name="param",
+            initializer=initializer.NormalInitializer(2.3, 1.9, 123))
+        self.assertEqual(len(block.ops), 1)
+        init_op = block.ops[0]
+        self.assertEqual(init_op.type, 'gaussian_random')
+        self.assertAlmostEqual(init_op.attr('mean'), 2.3, delta=DELTA)
+        self.assertAlmostEqual(init_op.attr('std'), 1.9, delta=DELTA)
+        self.assertEqual(init_op.attr('seed'), 123)
+if __name__ == '__main__':
+    unittest.main()
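
The tests above rely on one contract: an initializer is a callable that, given a variable and a block, adds exactly one init op to that block. The toy classes below mimic that contract without importing Paddle. `ToyBlock`, `ToyVar`, and `ToyConstantInitializer` are hypothetical stand-ins, and the attribute layout of the op is an assumption.

```python
class ToyBlock(object):
    """Hypothetical stand-in for a program block: a plain list of ops."""

    def __init__(self):
        self.ops = []

    def prepend_op(self, type, outputs, attrs):
        op = {"type": type, "outputs": outputs, "attrs": attrs}
        self.ops.insert(0, op)
        return op


class ToyVar(object):
    """Hypothetical stand-in for a variable that knows its block."""

    def __init__(self, name, shape, block):
        self.name, self.shape, self.block = name, shape, block


class ToyConstantInitializer(object):
    """Callable that emits one fill_constant op for the given variable."""

    def __init__(self, value=0.0):
        self.value = value

    def __call__(self, var, block):
        return block.prepend_op(
            type="fill_constant",
            outputs={"Out": [var.name]},
            attrs={"shape": var.shape, "value": float(self.value)})


block = ToyBlock()
var = ToyVar("param", [5, 10], block)
# Same call shape as `initializer(var, var.block)` in batch_norm.
ToyConstantInitializer(2.3)(var, var.block)
```

Passing a callable instead of an `init_attr` dict keeps the choice of op type and attributes inside the initializer, so callers such as `create_persistable_var` no longer need to know about them.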
--- a/python/paddle/v2/framework/tests/test_linear_chain_crf_op.py
+++ b/python/paddle/v2/framework/tests/test_linear_chain_crf_op.py
+import unittest
+import random
+import numpy as np
+from op_test import OpTest
+class LinearChainCrfForward(object):
+    def __init__(self, seq_start_positions, emission_weights, emission_row_max,
+                 emission_exps, transition_weights, transition_exps, labels):
+        self.tag_num = emission_weights.shape[1]
+        self.seq_num = len(seq_start_positions) - 1
+        self.seq_start_positions = seq_start_positions
+        self.labels = labels
+        self.x = emission_weights
+        self.x_row_max = emission_row_max
+        self.x_exps = emission_exps
+        # unnormalized logits of the transition weights for the start mark.
+        self.a = transition_weights[0, :]
+        self.a_exps = transition_exps[0, :]
+        # unnormalized logits of the transition weights for the end mark.
+        self.b = transition_weights[1, :]
+        self.b_exps = transition_exps[1, :]
+        # unnormalized logits of the transition weights for all the other tags.
+        self.w = transition_weights[2:, :]
+        self.w_exps = transition_exps[2:, :]
+        # The output of linear chain crf operator.
+        # alpha is a memo table in dynamic programming to calculate
+        # the normalization factor.
+        self.alpha = np.zeros(
+            (seq_start_positions[-1], self.tag_num), dtype="float64")
+        self.log_likelihood = np.zeros((self.seq_num, 1))
+    def _l1_norm(self, x):
+        s = np.sum(x)
+        x /= s
+        return s
+    def _forward_a_sequence(self, x, x_row_max, x_exps, label, alpha):
+        seq_len = x_row_max.shape[0]
+        log_likelihood = 0.
+        for i in range(self.tag_num):
+            alpha[0, i] = self.a_exps[i] * x_exps[0, i]
+        log_likelihood = -x_row_max[0] - np.log(self._l1_norm(alpha[0, :]))
+        # calculate the unnormalized logits of the normalization factor.
+        for k in range(1, seq_len):
+            for i in range(self.tag_num):
+                s = 0.
+                for j in range(self.tag_num):
+                    s += alpha[k - 1, j] * self.w_exps[j, i]
+                alpha[k, i] = x_exps[k, i] * s
+            log_likelihood -= x_row_max[k] + np.log(self._l1_norm(alpha[k, :]))
+        s = 0.
+        for i in range(self.tag_num):
+            s += alpha[-1, i] * self.b_exps[i]
+        log_likelihood -= np.log(s)
+        # calculate the numerator part.
+        log_likelihood += (
+            self.a[label[0]] + x[0, label[0]] + self.b[label[-1]])
+        for k in range(1, seq_len):
+            log_likelihood += (x[k, label[k]] + self.w[label[k - 1], label[k]])
+        return -log_likelihood
+    def crf_forward_compute(self):
+        for i in range(self.seq_num):
+            start = self.seq_start_positions[i]
+            end = self.seq_start_positions[i + 1]
+            self.log_likelihood[i] = self._forward_a_sequence(
+                self.x[start:end, :], self.x_row_max[start:end, :],
+                self.x_exps[start:end, :], self.labels[start:end, :],
+                self.alpha[start:end, :])
+        return self.alpha, self.log_likelihood
+class TestLinearChainCrfOp(OpTest):
+    def set_test_data(self):
+        # TODO(caoying) Fix the unit test by adding boundary cases where
+        # the sequence lengths are 1, 2, and 3.
+        SEQ_NUM = 3
+        TAG_NUM = 17
+        MAX_SEQ_LEN = 5
+        # the linear_chain_crf operator only supports sequence (LoD level = 1)
+        lod = [[0]]
+        for i in range(SEQ_NUM):
+            lod[-1].append(lod[-1][-1] + random.randint(1, MAX_SEQ_LEN))
+        emission = np.random.uniform(-1, 1,
+                                     [lod[-1][-1], TAG_NUM]).astype("float64")
+        emission_row_max = np.amax(emission, axis=1, keepdims=True)
+        emission_exps = np.exp(emission - emission_row_max)
+        transition = np.random.uniform(-0.5, 0.5,
+                                       [TAG_NUM + 2, TAG_NUM]).astype("float64")
+        transition_exps = np.exp(transition)
+        labels = np.random.randint(
+            low=0, high=TAG_NUM, size=(lod[-1][-1], 1), dtype="int32")
+        self.inputs = {
+            "Emission": (emission, lod),
+            "Transition": transition,
+            "Label": (labels, lod)
+        }
+        crf = LinearChainCrfForward(lod[0], emission, emission_row_max,
+                                    emission_exps, transition, transition_exps,
+                                    labels)
+        alpha, log_likelihood = crf.crf_forward_compute()
+        self.outputs = {
+            "Alpha": alpha,
+            "EmissionExps": emission_exps,
+            "TransitionExps": transition_exps,
+            "LogLikelihood": log_likelihood
+        }
+    def setUp(self):
+        self.op_type = "linear_chain_crf"
+        self.set_test_data()
+    def test_check_output(self):
+        self.check_output()
+    def test_check_grad(self):
+        self.check_grad(["Emission", "Transition"], "LogLikelihood")
+    def test_check_grad_ignore_transition(self):
+        self.check_grad(
+            ["Emission"], "LogLikelihood", no_grad_set=set(["Transition"]))
+if __name__ == "__main__":
+    unittest.main()
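
As a cross-check on the reference forward pass in `LinearChainCrfForward`, the same negative log-likelihood can be computed in log space and compared against brute-force enumeration of all tag paths. This is a minimal sketch: it assumes the transition layout used by the test (row 0 for start weights, row 1 for end weights, rows 2 onward for tag-to-tag weights), and the helper names are made up for illustration.

```python
import itertools

import numpy as np


def logsumexp(v):
    # Numerically stable log(sum(exp(v))) for a 1-D array.
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))


def path_score(x, transition, path):
    # Unnormalized log score of one tag path.
    a, b, w = transition[0], transition[1], transition[2:]
    s = a[path[0]] + x[0, path[0]] + b[path[-1]]
    for k in range(1, len(path)):
        s += x[k, path[k]] + w[path[k - 1], path[k]]
    return s


def crf_log_z(x, transition):
    # Forward recursion in log space; w[j, i] scores the move from tag j to i.
    a, b, w = transition[0], transition[1], transition[2:]
    seq_len, tag_num = x.shape
    log_alpha = a + x[0]
    for k in range(1, seq_len):
        log_alpha = np.array([
            x[k, i] + logsumexp(log_alpha + w[:, i]) for i in range(tag_num)
        ])
    return logsumexp(log_alpha + b)


def crf_nll(x, transition, label):
    return crf_log_z(x, transition) - path_score(x, transition, label)


def brute_force_log_z(x, transition):
    seq_len, tag_num = x.shape
    scores = [path_score(x, transition, p)
              for p in itertools.product(range(tag_num), repeat=seq_len)]
    return logsumexp(np.array(scores))


rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, (4, 3))             # 4 time steps, 3 tags
transition = rng.uniform(-0.5, 0.5, (5, 3))  # 3 tags + start and end rows
label = [0, 2, 1, 1]
nll = crf_nll(x, transition, label)
```

Brute force is exponential in the sequence length, so it is only practical for tiny cases like this one, but it exercises the short sequences that the TODO in `set_test_data` mentions.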
--- a/python/paddle/v2/framework/tests/test_recognize_digits_mlp.py
+++ b/python/paddle/v2/framework/tests/test_recognize_digits_mlp.py
@@ -3,9 +3,10 @@ import paddle.v2.framework.layers as layers
 import paddle.v2.framework.core as core
 import paddle.v2.framework.optimizer as optimizer
-from paddle.v2.framework.framework import Program, g_program
+from paddle.v2.framework.framework import Program
 from paddle.v2.framework.executor import Executor
 from paddle.v2.framework.regularizer import L2DecayRegularizer
+from paddle.v2.framework.initializer import UniformInitializer
 import numpy as np
@@ -21,11 +22,8 @@ image = layers.data(
 param_attr = {
    'name': None,
-    'init_attr': {
-        'type': 'uniform_random',
-        'min': -1.0,
-        'max': 1.0
-    },
+    'initializer': UniformInitializer(
+        low=-1.0, high=1.0),
    'regularization': L2DecayRegularizer(0.0005 * BATCH_SIZE)
 }
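
The `test_seq_pool.py` changes that follow switch the `sequence_pool` attribute from an integer `strategy` to a string `pooltype`. The NumPy sketch below shows what each string is taken to mean when pooling over level-1 LoD offsets. The function and table names are hypothetical, and the `SQRT` definition (sum divided by the square root of the sequence length) is an assumption, since that line of the test is not visible in the hunks.

```python
import numpy as np

# Map each pooltype string to a reduction over one sequence of shape [len, D].
POOLERS = {
    "AVERAGE": lambda s: s.mean(axis=0),
    "SUM": lambda s: s.sum(axis=0),
    "SQRT": lambda s: s.sum(axis=0) / np.sqrt(s.shape[0]),  # assumed definition
    "MAX": lambda s: s.max(axis=0),
    "LAST": lambda s: s[-1],
    "FIRST": lambda s: s[0],
}


def sequence_pool_sketch(x, lod, pool_type):
    key = pool_type.upper()
    if key not in POOLERS:
        raise ValueError("Unknown pool_type: '%s'. It can only be %s." %
                         (pool_type, " ".join(sorted(POOLERS))))
    offsets = lod[0]  # sequence i spans rows offsets[i]:offsets[i + 1]
    return np.stack([
        POOLERS[key](x[offsets[i]:offsets[i + 1]])
        for i in range(len(offsets) - 1)
    ])


x = np.arange(10, dtype="float64").reshape(5, 2)
lod = [[0, 2, 5]]  # two sequences: rows 0-1 and rows 2-4
summed = sequence_pool_sketch(x, lod, "sum")
```

String keys make an unknown mode fail with a readable message, whereas the old integer enum depended on the Python and C++ sides agreeing on the numbering.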

--- a/python/paddle/v2/framework/tests/test_seq_pool.py
+++ b/python/paddle/v2/framework/tests/test_seq_pool.py
@@ -3,15 +3,6 @@ import numpy as np
 from op_test import OpTest
-class SeqPoolType(OpTest):
-    AVERAGE = 0
-    SUM = 1
-    SQRT = 2
-    MAX = 3
-    LAST = 4
-    FIRST = 5
 class TestSeqAvgPool(OpTest):
    def set_data(self):
        self.op_type = 'sequence_pool'
@@ -25,7 +16,7 @@ class TestSeqAvgPool(OpTest):
        return x, lod, out
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.AVERAGE}
+        self.attrs = {'pooltype': "AVERAGE"}
        for i in range(4):
            sub_x = x[lod[0][i]:lod[0][i + 1], :]
            out[i] = sub_x.mean(axis=0)
@@ -54,7 +45,7 @@ class TestSeqAvgPool2D(TestSeqAvgPool):
        return x, lod, out
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.AVERAGE}
+        self.attrs = {'pooltype': "AVERAGE"}
        for i in range(4):
            sub_x = np.reshape(x[lod[0][i]:lod[0][i + 1], :], (-1, 3 * 17))
            out[i] = np.reshape(sub_x.mean(axis=0), (3, 17))
@@ -62,7 +53,7 @@ class TestSeqAvgPool2D(TestSeqAvgPool):
 class TestSeqSumPool(TestSeqAvgPool):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.SUM}
+        self.attrs = {'pooltype': "SUM"}
        for i in range(4):
            sub_x = x[lod[0][i]:lod[0][i + 1], :]
            out[i] = sub_x.sum(axis=0)
@@ -70,7 +61,7 @@ class TestSeqSumPool(TestSeqAvgPool):
 class TestSeqSumPool2D(TestSeqAvgPool2D):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.SUM}
+        self.attrs = {'pooltype': "SUM"}
        for i in range(4):
            sub_x = np.reshape(x[lod[0][i]:lod[0][i + 1], :], (-1, 3 * 17))
            out[i] = np.reshape(sub_x.sum(axis=0), (3, 17))
@@ -78,7 +69,7 @@ class TestSeqSumPool2D(TestSeqAvgPool2D):
 class TestSeqSqrtPool(TestSeqAvgPool):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.SQRT}
+        self.attrs = {'pooltype': "SQRT"}
        for i in range(4):
            sub_x = x[lod[0][i]:lod[0][i + 1], :]
            len = lod[0][i + 1] - lod[0][i]
@@ -87,7 +78,7 @@ class TestSeqSqrtPool(TestSeqAvgPool):
 class TestSeqSqrtPool2D(TestSeqAvgPool2D):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.SQRT}
+        self.attrs = {'pooltype': "SQRT"}
        for i in range(4):
            sub_x = np.reshape(x[lod[0][i]:lod[0][i + 1], :], (-1, 3 * 17))
            len = lod[0][i + 1] - lod[0][i]
@@ -99,7 +90,7 @@ class TestSeqSqrtPool2D(TestSeqAvgPool2D):
 class TestSeqMaxPool(TestSeqAvgPool):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.MAX}
+        self.attrs = {'pooltype': "MAX"}
        for i in range(4):
            sub_x = x[lod[0][i]:lod[0][i + 1], :]
            out[i] = np.amax(sub_x, axis=0)
@@ -111,7 +102,7 @@ class TestSeqMaxPool(TestSeqAvgPool):
 class TestSeqMaxPool2D(TestSeqAvgPool2D):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.MAX}
+        self.attrs = {'pooltype': "MAX"}
        for i in range(4):
            sub_x = np.reshape(x[lod[0][i]:lod[0][i + 1], :], (-1, 3 * 17))
            out[i] = np.reshape(np.amax(sub_x, axis=0), (3, 17))
@@ -123,7 +114,7 @@ class TestSeqMaxPool2D(TestSeqAvgPool2D):
 class TestSeqLastPool(TestSeqAvgPool):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.LAST}
+        self.attrs = {'pooltype': "LAST"}
        for i in range(4):
            sub_x = x[lod[0][i]:lod[0][i + 1], :]
            out[i] = sub_x[-1, :]
@@ -131,7 +122,7 @@ class TestSeqLastPool(TestSeqAvgPool):
 class TestSeqLastPool2D(TestSeqAvgPool2D):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.LAST}
+        self.attrs = {'pooltype': "LAST"}
        for i in range(4):
            sub_x = np.reshape(x[lod[0][i]:lod[0][i + 1], :], (-1, 3 * 17))
            out[i] = np.reshape(sub_x[-1, :], (3, 17))
@@ -139,7 +130,7 @@ class TestSeqLastPool2D(TestSeqAvgPool2D):
 class TestSeqFirstPool(TestSeqAvgPool):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.FIRST}
+        self.attrs = {'pooltype': "FIRST"}
        for i in range(4):
            sub_x = x[lod[0][i]:lod[0][i + 1], :]
            out[i] = sub_x[0, :]
@@ -147,7 +138,7 @@ class TestSeqFirstPool(TestSeqAvgPool):
 class TestSeqFirstPool2D(TestSeqAvgPool2D):
    def compute(self, x, lod, out):
-        self.attrs = {'strategy': SeqPoolType.FIRST}
+        self.attrs = {'pooltype': "FIRST"}
        for i in range(4):
            sub_x = np.reshape(x[lod[0][i]:lod[0][i + 1], :], (-1, 3 * 17))
            out[i] = np.reshape(sub_x[0, :], (3, 17))