Merge remote-tracking branch 'upstream/develop' into develop

331bfd98 · gx_wind · f382aa77 · 3388e52d · 331bfd98 · 331bfd98
148 changed files
--- a/.copyright.hook
+++ b/.copyright.hook
@@ -49,12 +49,12 @@ def generate_copyright(template, lang='C'):
        LANG_COMMENT_MARK = "//"
    lines = template.split(NEW_LINE_MARK)
-    ans = LANG_COMMENT_MARK + COPYRIGHT_HEADER + NEW_LINE_MARK
+    ans = LANG_COMMENT_MARK + " " + COPYRIGHT_HEADER + NEW_LINE_MARK
    for lino, line in enumerate(lines):
        if lino == 0 or lino == 1 or lino == len(lines) - 1: continue
-        ans += LANG_COMMENT_MARK + line + NEW_LINE_MARK
+        ans += LANG_COMMENT_MARK + " " + line + NEW_LINE_MARK
-    return ans
+    return ans + "\n"
 def lang_type(filename):
@@ -90,7 +90,7 @@ def main(argv=None):
    retv = 0
    for filename in args.filenames:
        first_line = io.open(filename).readline()
-        if "Copyright" in first_line: continue
+        if "COPYRIGHT" in first_line.upper(): continue
        original_contents = io.open(filename).read()
        new_contents = generate_copyright(
            COPYRIGHT, lang_type(filename)) + original_contents
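Reviewer note: the hook change above does two things. It puts one space between the comment mark and each license line, and it makes the "already has a header" test case-insensitive. The following is a minimal standalone sketch of that behavior, not the hook itself. `COPYRIGHT_HEADER` is a hypothetical placeholder value and `already_has_copyright` is a helper name invented for illustration.

```python
NEW_LINE_MARK = "\n"
# Hypothetical placeholder; the real hook defines its own header text.
COPYRIGHT_HEADER = "Copyright (c) 2018 Example Authors."


def generate_copyright(body_lines, comment_mark="#"):
    """Prefix the header and every license line with the comment mark plus one space."""
    ans = comment_mark + " " + COPYRIGHT_HEADER + NEW_LINE_MARK
    for line in body_lines:
        ans += comment_mark + " " + line + NEW_LINE_MARK
    # The trailing blank line separates the header from the file contents.
    return ans + "\n"


def already_has_copyright(first_line):
    """Case-insensitive check, so 'copyright' and 'COPYRIGHT' both count."""
    return "COPYRIGHT" in first_line.upper()
```

With the old check (`"Copyright" in first_line`), a first line such as `// copyright 2016` would not match and the file would receive a second header.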

--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
+# Contributor Covenant Code of Conduct
+## Our Pledge
+In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.
+## Our Standards
+Examples of behavior that contributes to creating a positive environment include:
+* Using welcoming and inclusive language
+* Being respectful of differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing empathy towards other community members
+Examples of unacceptable behavior by participants include:
+* The use of sexualized language or imagery and unwelcome sexual attention or advances
+* Trolling, insulting/derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or electronic address, without explicit permission
+* Other conduct which could reasonably be considered inappropriate in a professional setting
+## Our Responsibilities
+Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.
+Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.
+## Scope
+This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.
+## Enforcement
+Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at paddle-dev@baidu.com. The project team will review and investigate all complaints, and will respond in a way that it deems appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.
+Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.
+## Attribution
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, available at [http://contributor-covenant.org/version/1/4][version]
+[homepage]: http://contributor-covenant.org
+[version]: http://contributor-covenant.org/version/1/4/
--- a/CODE_OF_CONDUCT_cn.md
+++ b/CODE_OF_CONDUCT_cn.md
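+# Contributor Covenant
+## Our Pledge
+To foster an open, transparent, and welcoming environment, we as contributors and maintainers pledge that everyone who participates in our project and community has a harassment-free experience, regardless of age, race, ethnicity, gender identity and expression, body size, disability, level of experience, nationality, personal appearance, religion, or sexual orientation.
+## Our Standards
+Examples of behavior that helps create a positive environment include:
+* Using welcoming and inclusive language
+* Respecting differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing kindness towards other community members
+Examples of unacceptable behavior by participants include:
+* Using sexualized language or imagery, and unwelcome sexual harassment
+* Trolling, baiting, or making insulting/derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' personal information, such as a physical or electronic address, without permission
+* Other conduct that could reasonably be considered inappropriate or a violation of professional ethics
+## Our Responsibilities
+Project maintainers are responsible for clarifying the standards of "acceptable behavior" and for taking appropriate and fair corrective action in response to unacceptable behavior that has occurred.
+Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that violate this Code of Conduct, and may temporarily or permanently ban any contributor whose behavior they deem inappropriate, threatening, offensive, or harmful.
+## Scope
+This Code of Conduct applies within project spaces and in public spaces when an individual is representing the project or its community.
+Examples of representing the project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
+How the project is represented may be further defined and clarified by its project maintainers.
+## Enforcement
+Abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at paddle-dev@baidu.com.
+All complaints that the maintenance team considers necessary and appropriate will be reviewed and investigated, and will receive a corresponding response. The project team is obligated to keep the reporter of an incident confidential. Further details of specific enforcement policies may be published separately.
+Project maintainers who do not follow or enforce this Code of Conduct in good faith may be temporarily or permanently removed from their role by decision of the project leaders or other members.
+## Attribution
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+available at https://www.contributor-covenant.org/zh-tw/version/1/4/code-of-conduct.html
+[homepage]: https://www.contributor-covenant.org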
--- a/benchmark/tensorflow/image/googlenet_multi_gpu.py
+++ b/benchmark/tensorflow/image/googlenet_multi_gpu.py
+#  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
 from six.moves import xrange  # pylint: disable=redefined-builtin
 from datetime import datetime
 import math

--- a/doc/api/v2/fluid/layers.rst
+++ b/doc/api/v2/fluid/layers.rst
@@ -364,6 +364,12 @@ split
 ..  autofunction:: paddle.v2.fluid.layers.split
    :noindex:
+matmul
+------
+..  autofunction:: paddle.v2.fluid.layers.matmul
+    :noindex:
 logsigmoid
 ----------
 ..  autofunction:: paddle.v2.fluid.layers.logsigmoid
@@ -493,3 +499,8 @@ swish
 ------
 ..  autofunction:: paddle.v2.fluid.layers.swish
    :noindex:
+l2_normalize
+------------
+..  autofunction:: paddle.v2.fluid.layers.l2_normalize
+    :noindex:
--- a/doc/api/v2/fluid/nets.rst
+++ b/doc/api/v2/fluid/nets.rst
@@ -25,3 +25,9 @@ glu
 ..  autofunction:: paddle.v2.fluid.nets.glu
    :noindex:
+dot_product_attention
+---------------------
+..  autofunction:: paddle.v2.fluid.nets.dot_product_attention
+    :noindex:
--- a/doc/getstarted/concepts/src/infer.py
+++ b/doc/getstarted/concepts/src/infer.py
+#  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
 import paddle.v2 as paddle
 import numpy as np

--- a/doc/howto/usage/capi/organization_of_the_inputs_cn.md
+++ b/doc/howto/usage/capi/organization_of_the_inputs_cn.md
@@ -19,7 +19,7 @@
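 ### Basic usage concepts
- Inside PaddlePaddle, the input/output of a computation layer in a neural network is organized as an `Argument` struct. If the network has multiple inputs or multiple inputs, each input/input has its own `Argument`.
+- Inside PaddlePaddle, the input/output of a computation layer in a neural network is organized as an `Argument` struct. If the network has multiple inputs or multiple outputs, each input/output has its own `Argument`.
 - `Argument` does not actually "store" data; it organizes the input/output information together.
 - Inside `Argument`, `IVector` (the one-dimensional integer array mentioned above) and `Matrix` (the two-dimensional floating-point matrix mentioned above) actually store the data, while `Sequence Start Positions` (explained in detail below) describes the sequence information of the input/output.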

--- a/doc/howto/usage/cluster/fluid_cluster_train_en.md
+++ b/doc/howto/usage/cluster/fluid_cluster_train_en.md
+# Fluid Distributed Training
+## Introduction
+In this article, we'll explain how to configure and run distributed training jobs with PaddlePaddle Fluid in a bare metal cluster.
+## Preparations
+### Get your cluster ready
+Prepare the compute nodes in your cluster. A node can be of any specification that runs PaddlePaddle, and each node must have a unique IP address. Make sure the nodes can communicate with each other.
+### Have PaddlePaddle installed
+PaddlePaddle must be installed on all nodes. If you have GPU cards on your nodes, be sure to properly install drivers and CUDA libraries.
+The PaddlePaddle build and installation guide can be found [here](http://www.paddlepaddle.org/docs/develop/documentation/en/getstarted/build_and_install/index_en.html).
+### Update training script
+#### Non-cluster training script
+Let's take the first chapter of [Deep Learning 101](http://www.paddlepaddle.org/docs/develop/book/01.fit_a_line/index.html), "fit a line", as an example.
+The non-cluster version of this demo, written with the Fluid API, is as follows:
+``` python
+import paddle.v2 as paddle
+import paddle.v2.fluid as fluid
+x = fluid.layers.data(name='x', shape=[13], dtype='float32')
+y_predict = fluid.layers.fc(input=x, size=1, act=None)
+y = fluid.layers.data(name='y', shape=[1], dtype='float32')
+cost = fluid.layers.square_error_cost(input=y_predict, label=y)
+avg_cost = fluid.layers.mean(x=cost)
+sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
+sgd_optimizer.minimize(avg_cost)
+BATCH_SIZE = 20
+train_reader = paddle.batch(
+    paddle.reader.shuffle(
+        paddle.dataset.uci_housing.train(), buf_size=500),
+    batch_size=BATCH_SIZE)
+place = fluid.CPUPlace()
+feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
+exe = fluid.Executor(place)
+exe.run(fluid.default_startup_program())
+PASS_NUM = 100
+for pass_id in range(PASS_NUM):
+    fluid.io.save_persistables(exe, "./fit_a_line.model/")
+    fluid.io.load_persistables(exe, "./fit_a_line.model/")
+    for data in train_reader():
+        avg_loss_value, = exe.run(fluid.default_main_program(),
+                                  feed=feeder.feed(data),
+                                  fetch_list=[avg_cost])
+        if avg_loss_value[0] < 10.0:
+            exit(0)  # if avg cost less than 10.0, we think our code is good.
+exit(1)
+```
+This is a simple training program for a fully connected neural network, handed to the Fluid executor to run for 100 passes.
+Now let's try to convert it to a distributed version to run in a cluster.
+#### Introducing parameter server
+As you can see from the non-cluster version of the training script, there is only one role in it: the trainer, which does the computing and also holds the parameters. In cluster training, since multiple trainers work on the same task, they need one centralized place to hold and distribute parameters. This centralized place is called the Parameter Server in PaddlePaddle.
+![parameter server architecture](src/trainer.png)
+A parameter server in Fluid not only holds parameters but is also assigned a part of the program. Trainers communicate with parameter servers via send/receive OPs. For more technical details, please refer to this [document](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/dist_refactor/distributed_architecture.md).
+Now we need to create programs for both trainers and parameter servers. The question is how.
+#### Slice the program
+Fluid provides a tool called "Distribute Transpiler" that automatically converts a non-cluster program into a cluster program.
+The idea behind this tool is to find the optimize OPs and gradient parameters, slice the program into 2 pieces, and connect them with send/receive OPs.
+The optimize OPs and gradient parameters are the return values of the optimizer's minimize function.
+Putting it together:
+``` python
+... # define the program, cost, and create the SGD optimizer
+optimize_ops, params_grads = sgd_optimizer.minimize(avg_cost)  # get optimize OPs and gradient parameters
+t = fluid.DistributeTranspiler()  # create transpiler instance
+# Slice the program into 2 pieces using the optimize OPs and the gradient parameter list.
+# pserver_endpoints is a comma-separated list of IP:PORT; trainers is the number of trainers.
+t.transpile(optimize_ops, params_grads, pservers=pserver_endpoints, trainers=2)
+... # create executor
+# in pserver, run this
+exe.run(fluid.default_startup_program())
+# current_endpoint is the IP:PORT of the pserver you wish to run on
+exe.run(t.get_pserver_program(current_endpoint, optimize_ops))
+# in trainer, run this
+... # define data reader
+exe.run(fluid.default_startup_program())
+for pass_id in range(100):
+    for data in train_reader():
+        exe.run(t.get_trainer_program())
+```
+### E2E demo
+The complete demo can be found [here](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/fluid/tests/book_distribute/notest_dist_fit_a_line.py). On the parameter server node, run the following in the command line:
+``` bash
+PSERVERS=192.168.1.2:6174 SERVER_ENDPOINT=192.168.1.2:6174 TRAINING_ROLE=PSERVER python notest_dist_fit_a_line.py
+```
+*Note: we assume that your parameter server runs at 192.168.1.2:6174.*
+Wait until the prompt `Server listening on 192.168.1.2:6174` appears.
+Then, on each of your 2 trainer nodes, run this:
+``` bash
+PSERVERS=192.168.1.2:6174 SERVER_ENDPOINT=192.168.1.2:6174 TRAINING_ROLE=TRAINER python notest_dist_fit_a_line.py
+```
+*You need to run this command on 2 nodes because the script sets the trainer count to 2. You can change this setting on line 50 of the script.*
+Now you have 2 trainers and 1 parameter server up and running.
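Reviewer note: the demo commands in the new doc select the node's role through the environment variables `PSERVERS`, `SERVER_ENDPOINT`, and `TRAINING_ROLE`. The diff does not show how the demo script reads them, so the following is only a hedged sketch of one way a script could parse them. `read_cluster_config`, the default role, and the returned dictionary keys are all hypothetical and are not part of PaddlePaddle.

```python
import os


def read_cluster_config(environ=None):
    """Parse the environment variables used by the demo commands into a dict.

    `environ` defaults to os.environ; passing a plain dict makes testing easy.
    """
    env = os.environ if environ is None else environ
    # Assumption: default to the trainer role when TRAINING_ROLE is unset.
    role = env.get("TRAINING_ROLE", "TRAINER").upper()
    if role not in ("PSERVER", "TRAINER"):
        raise ValueError("TRAINING_ROLE must be PSERVER or TRAINER, got %r" % role)
    # PSERVERS is a comma-separated list of IP:PORT entries.
    endpoints = [e.strip() for e in env.get("PSERVERS", "").split(",") if e.strip()]
    return {
        "role": role,
        "pserver_endpoints": endpoints,
        "current_endpoint": env.get("SERVER_ENDPOINT", ""),
    }
```

A script could then branch on `config["role"]` to run either the pserver program or the trainer program from the transpiler.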
--- a/paddle/framework/lod_tensor.cc
+++ b/paddle/framework/lod_tensor.cc
@@ -135,6 +135,65 @@ bool operator==(const LoD &a, const LoD &b) {
  return true;
 }
+bool CheckLoD(const LoD &in, int tensor_height) {
+  if (in.empty()) return true;
+  for (const auto &level : in) {
+    // check: there should be at least 2 offsets in each level.
+    if (level.size() < 2) return false;
+    // check: the first offset (the begin offset) of each level should be 0.
+    if (level.front() != 0) return false;
+    // check: all the offsets in a level should be ascending (no same items
+    // allowed).
+    if (!std::is_sorted(level.begin(), level.end(), [](size_t a, size_t b) {
+          if (a < b) return true;
+          return false;
+        })) {
+      LOG(INFO) << "ascending error";
+      return false;
+    }
+  }
+  // check: the lowest level's last offset should equal `tensor_height` if
+  //        tensor_height>0.
+  if (tensor_height > 0 && (size_t)tensor_height != in.back().back())
+    return false;
+  // check: the higher level's last offset should equal the lower level's
+  // size-1.
+  // NOTE: LoD stores the levels from top to bottom, so the higher level goes
+  // first.
+  for (size_t level = 0; level < in.size() - 1; level++) {
+    if (in[level].back() != in[level + 1].size() - 1) return false;
+  }
+  return true;
+}
+bool CheckAbsLoD(const LoD &in, int tensor_height) {
+  if (in.empty()) return true;
+  for (const auto &level : in) {
+    // check: all the offsets in a level should be ascending (no same items
+    // allowed).
+    if (!std::is_sorted(level.begin(), level.end(), [](size_t a, size_t b) {
+          if (a < b) return true;
+          return false;
+        })) {
+      return false;
+    }
+    // check: there should be at least 2 offsets in each level.
+    if (level.size() < 2) return false;
+    // check: the first offset of each level should be 0, and the last should be
+    // the same (the height of the underlying tensor).
+    if (level.front() != 0) return false;
+    if (tensor_height < 0) {
+      tensor_height = level.back();
+    } else if ((size_t)tensor_height != level.back()) {
+      return false;
+    }
+  }
+  return true;
+}
 using LoDAndOffset = std::pair<LoD, std::pair<size_t, size_t>>;
 LoDAndOffset GetSubLoDAndAbsoluteOffset(const LoD &lod, size_t start_idx,
                                        size_t end_idx, size_t start_level) {
@@ -232,23 +291,32 @@ std::vector<LoDTensor> LoDTensor::SplitLoDTensor(
    const std::vector<platform::Place> places) const {
  check_memory_size();
  PADDLE_ENFORCE(lod().empty(), "Disable parallel lod for now");
-  PADDLE_ENFORCE(dims()[0] % places.size() == 0,
+  size_t result_size = std::min(static_cast<size_t>(dims()[0]), places.size());
-                 "Batch size should be divided by places size");
+  size_t remainder = dims()[0] % places.size();
-  std::vector<LoDTensor> lods;
+  std::vector<LoDTensor> results;
-  for (size_t place_idx = 0; place_idx < places.size(); ++place_idx) {
+  results.reserve(result_size);
-    int begin = place_idx * dims()[0] / places.size();
-    int end = (place_idx + 1) * dims()[0] / places.size();
+  int step_width = static_cast<int>(dims()[0] / result_size);
+  for (size_t i = 0; i < result_size; ++i) {
+    int begin = static_cast<int>(i * step_width);
+    int end = static_cast<int>((i + 1) * step_width);
+    if (i + 1 == places.size()) {  // last
+      end += remainder;
+    }
    auto src = Slice(begin, end);
-    auto &dst_place = places[place_idx];
+    auto &dst_place = places[i];
    LoDTensor dst;
+    if (!(dst_place == place())) {
      framework::Copy(src, dst_place, &dst);
+    } else {  // No need to copy if src_place and dst_place are the same.
-    lods.emplace_back(dst);
+      dst.ShareDataWith(src);
+    }
+    results.emplace_back(dst);
  }
-  return lods;
+  return results;
 }
 // TODO(tonyyang-svail): make this function support LoD
@@ -259,12 +327,17 @@ void LoDTensor::MergeLoDTensor(
  framework::DDim new_dim = lod_tensors[0]->dims();
  std::type_index new_type = lod_tensors[0]->type();
  auto new_layout = lod_tensors[0]->layout();
+  int64_t new_height = 0;
  for (auto *lod : lod_tensors) {
-    PADDLE_ENFORCE(new_dim == lod->dims());
+    new_height += lod->dims()[0];
-    PADDLE_ENFORCE(new_type == lod->type());
+    for (int i = 1; i < new_dim.size(); ++i) {
-    PADDLE_ENFORCE(new_layout == lod->layout());
+      PADDLE_ENFORCE_EQ(new_dim[i], lod->dims()[i]);
+    }
+    PADDLE_ENFORCE_EQ(new_type, lod->type());
+    PADDLE_ENFORCE_EQ(new_layout, lod->layout());
  }
-  new_dim[0] *= lod_tensors.size();
+  new_dim[0] = new_height;
  Resize(new_dim);
  set_layout(new_layout);
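Reviewer note: the reworked `SplitLoDTensor` no longer requires the batch size to be divisible by the number of places, and `MergeLoDTensor` now sums the first dimension instead of multiplying it. The following is a minimal Python sketch of that index arithmetic only, as read from the diff above. `split_ranges` and `merged_shape` are hypothetical helper names, and the empty-batch guard is an assumption that the C++ code does not contain.

```python
def split_ranges(batch_size, num_places):
    """Return (begin, end) row ranges, mirroring the arithmetic in SplitLoDTensor."""
    result_size = min(batch_size, num_places)
    if result_size == 0:
        # Assumption: nothing to split. The C++ code does not special-case this.
        return []
    remainder = batch_size % num_places
    step_width = batch_size // result_size
    ranges = []
    for i in range(result_size):
        begin = i * step_width
        end = (i + 1) * step_width
        if i + 1 == num_places:  # the last place also takes the leftover rows
            end += remainder
        ranges.append((begin, end))
    return ranges


def merged_shape(shapes):
    """Sum the first dimension; every trailing dimension must match the first shape."""
    first = list(shapes[0])
    for shape in shapes:
        if list(shape[1:]) != first[1:]:
            raise ValueError("trailing dimensions differ: %r vs %r" % (shape, first))
    return [sum(s[0] for s in shapes)] + first[1:]
```

For example, a batch of 10 rows over 4 places gives three slices of 2 rows and a last slice of 4 rows, and merging those slices restores a height of 10. When the batch is smaller than the number of places, only `batch_size` slices of one row each are produced.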

--- a/paddle/framework/lod_tensor.h
+++ b/paddle/framework/lod_tensor.h
@@ -71,6 +71,38 @@ LoD ToAbsOffset(const LoD& in);
 bool operator==(const LoD& a, const LoD& b);
+/*
+ * Check whether this lod's format is valid.
+ *
+ * ATTENTION:
+ *   - Empty lod is treated as valid.
+ *
+ * It will check five things:
+ *
+ *  1. all the offsets in a level should be ascending (no same items allowed).
+ *  2. there should be at least 2 offsets in each level.
+ *  3. the higher level's last offset should equal the lower level's size-1.
+ *  4. the first offset (the begin offset) of each level should be 0.
+ *  5. the lowest level's last offset should equal `tensor_height` if
+ * tensor_height>0.
+ */
+bool CheckLoD(const LoD& in, int tensor_height = -1);
+/*
+ * Check whether this absolute lod's format is valid.
+ *
+ * ATTENTION:
+ *   - Empty lod is treated as valid.
+ *
+ * It will check three things:
+ *  1. all the offsets in a level should be ascending (no same items allowed).
+ *  2. there should be at least 2 offsets in each level.
+ *  3. the first offset of each level should be 0, and the last should be the
+ *     same (the height of the underlying tensor) or `tensor_height` if
+ *     tensor_height>0.
+ */
+bool CheckAbsLoD(const LoD& in, int tensor_height = -1);
 /*
 * LoDTensor (Level of details Tensor)
 * see https://en.wikipedia.org/wiki/Level_of_details for reference.
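Reviewer note: the rules documented for `CheckLoD` can be sketched in a few lines of Python, which may help when reviewing the C++ implementation. This is an illustration of the documented rules, not a port of the Paddle code; `check_lod` is a hypothetical name. It enforces strictly ascending offsets as the header comment states, whereas `std::is_sorted` with a less-than comparator also accepts equal neighbours.

```python
def check_lod(lod, tensor_height=-1):
    """Validate a relative-offset LoD given as a list of levels (lists of ints)."""
    if not lod:
        return True  # an empty LoD is treated as valid
    for level in lod:
        if len(level) < 2:
            return False  # each level needs at least 2 offsets
        if level[0] != 0:
            return False  # each level starts at offset 0
        if any(a >= b for a, b in zip(level, level[1:])):
            return False  # offsets must be strictly ascending
    # The lowest level's last offset must equal the tensor height when given.
    if tensor_height > 0 and tensor_height != lod[-1][-1]:
        return False
    # Levels are stored top to bottom: a higher level's last offset must
    # equal the size of the next lower level minus one.
    for upper, lower in zip(lod, lod[1:]):
        if upper[-1] != len(lower) - 1:
            return False
    return True
```

The inputs used in the `TEST(LoD, CheckLoD)` case of this diff behave the same way under this sketch: the three-level LoD `{0, 2}`, `{0, 1, 3}`, `{0, 2, 4, 5}` is valid, and it stays valid with a tensor height of 5 but not 9.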

--- a/paddle/framework/lod_tensor_test.cc
+++ b/paddle/framework/lod_tensor_test.cc
-//  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
+/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
-//
-// Licensed under the Apache License, Version 2.0 (the "License");
+Licensed under the Apache License, Version 2.0 (the "License");
-// you may not use this file except in compliance with the License.
+you may not use this file except in compliance with the License.
-// You may obtain a copy of the License at
+You may obtain a copy of the License at
-//
-//    http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing, software
-// distributed under the License is distributed on an "AS IS" BASIS,
-// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-// See the License for the specific language governing permissions and
-// limitations under the License.
-/*
-  Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
-  Licensed under the Apache License, Version 2.0 (the "License");
-  you may not use this file except in compliance with the License.
-  You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
-  Unless required by applicable law or agreed to in writing, software
-  distributed under the License is distributed on an "AS IS" BASIS,
+Unless required by applicable law or agreed to in writing, software
-  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+distributed under the License is distributed on an "AS IS" BASIS,
-  See the License for the specific language governing permissions and
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  limitations under the License.
+See the License for the specific language governing permissions and
-*/
+limitations under the License. */
 #include "paddle/framework/lod_tensor.h"
@@ -35,38 +23,6 @@
 namespace paddle {
 namespace framework {
-const int kLodTensorSize = 20 * 128;
-class LoDTensorTester : public ::testing::Test {
- public:
-  virtual void SetUp() override {
-    // tensor's batch_size: 30
-    // 3 levels
-    // 0 10 20
-    // 0 5 10 15 20
-    // 0 2 5 7 10 12 15 20
-    LoD lod;
-    lod.push_back(std::vector<size_t>{0, 2, 3});
-    lod.push_back(std::vector<size_t>{0, 2, 5, 8});
-    lod.push_back(std::vector<size_t>{0, 2, 5, 7, 10, 12, 15, 17, 20});
-    ASSERT_EQ(lod.size(), 3UL);
-    lod_tensor_.Resize({20 /*batch size*/, 128 /*dim*/});
-    // malloc memory
-    float* dst_ptr = lod_tensor_.mutable_data<float>(place);
-    for (int i = 0; i < kLodTensorSize; ++i) {
-      dst_ptr[i] = i;
-    }
-    lod_tensor_.set_lod(lod);
-  }
- protected:
-  platform::CPUPlace place;
-  LoDTensor lod_tensor_;
-};
 TEST(LodExpand, test) {
  LoD lod{{0, 2}};
  LoDTensor tensor;
@@ -144,5 +100,53 @@ TEST(LoD, ToAbsOffset) {
  EXPECT_EQ(abs_lod, expected);
 }
+TEST(LoD, CheckLoD) {
+  LoD relative_lod;
+  relative_lod.push_back(std::vector<size_t>({0, 2}));
+  relative_lod.push_back(std::vector<size_t>({0, 1, 3}));
+  relative_lod.push_back(std::vector<size_t>({0, 2, 4, 5}));
+  // check compatible
+  ASSERT_TRUE(CheckLoD(relative_lod));
+  relative_lod[1].back()++;
+  ASSERT_FALSE(CheckLoD(relative_lod));
+  relative_lod[1].back()--;  // recover it
+  // check empty
+  LoD empty_lod;
+  ASSERT_TRUE(CheckLoD(empty_lod));
+  // check less than 2 offsets in a level
+  LoD some_lod0;
+  some_lod0.push_back(std::vector<size_t>({0}));
+  ASSERT_FALSE(CheckLoD(some_lod0));
+  // check with underlying tensor storage.
+  ASSERT_TRUE(CheckLoD(relative_lod, 5));
+  ASSERT_FALSE(CheckLoD(relative_lod, 9));
+}
+TEST(LoD, CheckAbsLoD) {
+  LoD relative_lod;
+  relative_lod.push_back(std::vector<size_t>({0, 2}));
+  relative_lod.push_back(std::vector<size_t>({0, 1, 3}));
+  relative_lod.push_back(std::vector<size_t>({0, 2, 4, 5}));
+  auto abs_lod = ToAbsOffset(relative_lod);
+  ASSERT_TRUE(CheckAbsLoD(abs_lod));
+  // check less than 2 offsets in a level.
+  // check the last item should be compatible with tensor height.
+  abs_lod.back().back()++;
+  ASSERT_FALSE(CheckAbsLoD(abs_lod));
+  abs_lod.back().back()--;  // restore
+  // check less than 2 offsets in a lod.
+  LoD abs_lod0;
+  abs_lod0.push_back(std::vector<size_t>({0}));
+  ASSERT_FALSE(CheckAbsLoD(abs_lod0));
+}
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/op_registry.h
+++ b/paddle/framework/op_registry.h
@@ -177,15 +177,15 @@ class OpKernelRegistrar : public Registrar {
 /**
 * Macro to register OperatorKernel.
 */
-#define REGISTER_OP_KERNEL(op_type, DEVICE_TYPE, place_class, ...)        \
+#define REGISTER_OP_KERNEL(op_type, LIBRARY_TYPE, place_class, ...)        \
  STATIC_ASSERT_GLOBAL_NAMESPACE(                                          \
-      __reg_op_kernel_##op_type##_##DEVICE_TYPE##__,                      \
+      __reg_op_kernel_##op_type##_##LIBRARY_TYPE##__,                      \
      "REGISTER_OP_KERNEL must be called in global namespace");            \
  static ::paddle::framework::OpKernelRegistrar<place_class, __VA_ARGS__>  \
-      __op_kernel_registrar_##op_type##_##DEVICE_TYPE##__(#op_type,       \
+      __op_kernel_registrar_##op_type##_##LIBRARY_TYPE##__(#op_type,       \
-                                                          #DEVICE_TYPE);  \
+                                                           #LIBRARY_TYPE); \
-  int TouchOpKernelRegistrar_##op_type##_##DEVICE_TYPE() {                \
+  int TouchOpKernelRegistrar_##op_type##_##LIBRARY_TYPE() {                \
-    __op_kernel_registrar_##op_type##_##DEVICE_TYPE##__.Touch();          \
+    __op_kernel_registrar_##op_type##_##LIBRARY_TYPE##__.Touch();          \
    return 0;                                                              \
  }
@@ -208,14 +208,14 @@ class OpKernelRegistrar : public Registrar {
  static int use_op_itself_##op_type##_ __attribute__((unused)) = \
      TouchOpRegistrar_##op_type()
-#define USE_OP_DEVICE_KERNEL(op_type, DEVICE_TYPE)               \
+#define USE_OP_DEVICE_KERNEL(op_type, LIBRARY_TYPE)               \
  STATIC_ASSERT_GLOBAL_NAMESPACE(                                 \
-      __use_op_kernel_##op_type##_##DEVICE_TYPE##__,             \
+      __use_op_kernel_##op_type##_##LIBRARY_TYPE##__,             \
      "USE_OP_DEVICE_KERNEL must be in global namespace");        \
-  extern int TouchOpKernelRegistrar_##op_type##_##DEVICE_TYPE(); \
+  extern int TouchOpKernelRegistrar_##op_type##_##LIBRARY_TYPE(); \
-  static int use_op_kernel_##op_type##_##DEVICE_TYPE##_          \
+  static int use_op_kernel_##op_type##_##LIBRARY_TYPE##_          \
      __attribute__((unused)) =                                   \
-          TouchOpKernelRegistrar_##op_type##_##DEVICE_TYPE()
+          TouchOpKernelRegistrar_##op_type##_##LIBRARY_TYPE()
 // TODO(fengjiayi): The following macros
 // seems ugly, do we have better method?

--- a/paddle/gserver/layers/MKLDNNConcatLayer.cpp
+++ b/paddle/gserver/layers/MKLDNNConcatLayer.cpp
@@ -43,7 +43,7 @@ void MKLDNNConcatLayer::reshape(
  channels_[0] = ic;
  oc = ic;
  for (size_t i = 1; i < inputLayers_.size(); i++) {
-    int batchsize, height, witdh;
+    int batchsize = 0, height = 0, witdh = 0;
    reshapeInput(batchsize, height, witdh, i);
    CHECK_EQ(bs, batchsize);
    CHECK_EQ(ih, height);
@@ -84,6 +84,7 @@ void MKLDNNConcatLayer::resetFwdBuffers(std::vector<MKLDNNMatrixPtr>& inputs,
  bool has8c = false, has16c = false, hasnc = false;
  for (size_t i = 0; i < inputs.size(); i++) {
    resetInValue(inputs[i], nullptr, i, channels_[i]);
    CHECK(inputs[i]);
+    inputs[i]->downSpatial();
    auto dm = inputs[i]->getDims();
    // inputs format can be different, but ndims must equal

--- a/paddle/gserver/tests/img_conv_cudnn.py
+++ b/paddle/gserver/tests/img_conv_cudnn.py
-#  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
+#  Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
 #
 #Licensed under the Apache License, Version 2.0 (the "License");
 #you may not use this file except in compliance with the License.
@@ -11,20 +11,6 @@
 #WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 #See the License for the specific language governing permissions and
 #limitations under the License.
-#edit-mode: -*- python -*-
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
 from paddle.trainer_config_helpers import *

--- a/paddle/gserver/tests/img_conv_exconv.py
+++ b/paddle/gserver/tests/img_conv_exconv.py
-#  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
+#  Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
 #
 #Licensed under the Apache License, Version 2.0 (the "License");
 #you may not use this file except in compliance with the License.
@@ -11,20 +11,6 @@
 #WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 #See the License for the specific language governing permissions and
 #limitations under the License.
-#edit-mode: -*- python -*-
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
 from paddle.trainer_config_helpers import *

--- a/paddle/gserver/tests/pyDataProvider.py
+++ b/paddle/gserver/tests/pyDataProvider.py
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
+#  Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
 #
-# Licensed under the Apache License, Version 2.0 (the "License");
+#Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
+#you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
+#You may obtain a copy of the License at
 #
 #    http://www.apache.org/licenses/LICENSE-2.0
 #
-# Unless required by applicable law or agreed to in writing, software
+#Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
+#distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
+#See the License for the specific language governing permissions and
-# limitations under the License.
+#limitations under the License.
 import numpy
 import struct
 import traceback

--- a/paddle/gserver/tests/rnn_data_provider.py
+++ b/paddle/gserver/tests/rnn_data_provider.py
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
+#  Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
 #
-# Licensed under the Apache License, Version 2.0 (the "License");
+#Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
+#you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
+#You may obtain a copy of the License at
 #
 #    http://www.apache.org/licenses/LICENSE-2.0
 #
-# Unless required by applicable law or agreed to in writing, software
+#Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
+#distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
+#See the License for the specific language governing permissions and
-# limitations under the License.
+#limitations under the License.
 from paddle.trainer.PyDataProvider2 import *
 # Note that each config should have an independent provider

--- a/paddle/gserver/tests/sequenceGen.py
+++ b/paddle/gserver/tests/sequenceGen.py
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
+#  Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
 #
-# Licensed under the Apache License, Version 2.0 (the "License");
+#Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
+#you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
+#You may obtain a copy of the License at
 #
 #    http://www.apache.org/licenses/LICENSE-2.0
 #
-# Unless required by applicable law or agreed to in writing, software
+#Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
+#distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
+#See the License for the specific language governing permissions and
-# limitations under the License.
+#limitations under the License.
 import os
 import sys

--- a/paddle/gserver/tests/sequence_nest_rnn_multi_unequalength_inputs.py
+++ b/paddle/gserver/tests/sequence_nest_rnn_multi_unequalength_inputs.py
-# edit-mode: -*- python -*-
+#  Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
 #
-# Licensed under the Apache License, Version 2.0 (the "License");
+#Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
+#you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
+#You may obtain a copy of the License at
 #
 #    http://www.apache.org/licenses/LICENSE-2.0
 #
-# Unless required by applicable law or agreed to in writing, software
+#Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
+#distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
+#See the License for the specific language governing permissions and
-# limitations under the License.
+#limitations under the License.
 from paddle.trainer_config_helpers import *
 ######################## data source ################################

--- a/paddle/gserver/tests/sequence_recurrent.py
+++ b/paddle/gserver/tests/sequence_recurrent.py
-#  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
+#  Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
 #
 #Licensed under the Apache License, Version 2.0 (the "License");
 #you may not use this file except in compliance with the License.
@@ -11,20 +11,6 @@
 #WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 #See the License for the specific language governing permissions and
 #limitations under the License.
-#!/usr/bin/env python
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
 from paddle.trainer_config_helpers import *

--- a/paddle/gserver/tests/sequence_recurrent_group.py
+++ b/paddle/gserver/tests/sequence_recurrent_group.py
-#!/usr/bin/env python
+#  Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -12,7 +11,6 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from paddle.trainer_config_helpers import *
 ######################## data source ################################

--- a/paddle/gserver/tests/sequence_rnn_matched_inputs.py
+++ b/paddle/gserver/tests/sequence_rnn_matched_inputs.py
-#  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
+#  Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
 #
 #Licensed under the Apache License, Version 2.0 (the "License");
 #you may not use this file except in compliance with the License.
@@ -11,20 +11,6 @@
 #WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 #See the License for the specific language governing permissions and
 #limitations under the License.
-# edit-mode: -*- python -*-
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
 from paddle.trainer_config_helpers import *

--- a/paddle/gserver/tests/sequence_rnn_mixed_inputs.py
+++ b/paddle/gserver/tests/sequence_rnn_mixed_inputs.py
-#  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
+#  Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
 #
 #Licensed under the Apache License, Version 2.0 (the "License");
 #you may not use this file except in compliance with the License.
@@ -11,20 +11,6 @@
 #WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 #See the License for the specific language governing permissions and
 #limitations under the License.
-# edit-mode: -*- python -*-
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
 from paddle.trainer_config_helpers import *

--- a/paddle/gserver/tests/sequence_rnn_multi_unequalength_inputs.py
+++ b/paddle/gserver/tests/sequence_rnn_multi_unequalength_inputs.py
-#  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
+#  Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
 #
 #Licensed under the Apache License, Version 2.0 (the "License");
 #you may not use this file except in compliance with the License.
@@ -11,20 +11,6 @@
 #WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 #See the License for the specific language governing permissions and
 #limitations under the License.
-#edit-mode: -*- python -*-
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
 from paddle.trainer_config_helpers import *

--- a/paddle/gserver/tests/test_PyDataProvider2.py
+++ b/paddle/gserver/tests/test_PyDataProvider2.py
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
+#  Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
 #
-# Licensed under the Apache License, Version 2.0 (the "License");
+#Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
+#you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
+#You may obtain a copy of the License at
 #
 #    http://www.apache.org/licenses/LICENSE-2.0
 #
-# Unless required by applicable law or agreed to in writing, software
+#Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
+#distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
+#See the License for the specific language governing permissions and
-# limitations under the License.
+#limitations under the License.
 import random
 from paddle.trainer.PyDataProvider2 import *

--- a/paddle/operators/CMakeLists.txt
+++ b/paddle/operators/CMakeLists.txt
@@ -178,14 +178,13 @@ foreach(src ${GENERAL_OPS})
 endforeach()
 file(APPEND ${pybind_file} "USE_OP(less_than);\nUSE_OP(logical_and);\nUSE_NO_KERNEL_OP(read_from_array);\n")
 set(GLOB_OP_LIB ${OP_LIBRARY} CACHE INTERNAL "Global OP library")
 cc_test(gather_test SRCS gather_test.cc DEPS tensor)
 cc_test(net_op_test SRCS net_op_test.cc DEPS net_op)
 cc_test(scatter_test SRCS scatter_test.cc DEPS tensor)
 cc_test(beam_search_decode_op_test SRCS beam_search_decode_op_test.cc DEPS lod_tensor)
+cc_test(beam_search_op_test SRCS beam_search_op_test.cc DEPS lod_tensor beam_search_op)
 cc_test(strided_memcpy_test SRCS strided_memcpy_test.cc DEPS tensor paddle_memory)
 if(WITH_GPU)
    cc_test(nccl_op_test SRCS nccl_op_test.cu.cc DEPS nccl_op gpu_info device_context)

--- a/paddle/operators/beam_search_op.cc
+++ b/paddle/operators/beam_search_op.cc
@@ -29,7 +29,7 @@ void BeamSearch::operator()(const framework::LoDTensor &pre_ids,
  PruneEndidCandidates(pre_ids, &selected_items);
  // calculate the output tensor's height
  size_t num_instances = std::accumulate(
-      std::begin(items), std::end(items), 0,
+      std::begin(selected_items), std::end(selected_items), 0,
      [](size_t a, std::vector<Item> &b) { return a + b.size(); });
  // the output tensor shape should be [num_instances, 1]
  auto dims = framework::make_ddim(
@@ -48,12 +48,20 @@ void BeamSearch::operator()(const framework::LoDTensor &pre_ids,
  size_t low_offset = 0;
  for (auto &items : selected_items) {
    low_level.push_back(low_offset);
+    sort(items.begin(), items.end(), [](const Item &a, const Item &b) {
+      if (a.offset != b.offset) {
+        return a.offset < b.offset;
+      }
+      return a.id < b.id;
+    });
    for (auto &item : items) {
      ids_data[low_offset] = item.id;
      scores_data[low_offset] = item.score;
      low_offset++;
    }
  }
+  low_level.push_back(low_offset);
  // fill lod
  auto abs_lod = framework::ToAbsOffset(ids_->lod());
  auto &high_level = abs_lod[lod_level_];
@@ -64,16 +72,21 @@ void BeamSearch::operator()(const framework::LoDTensor &pre_ids,
  selected_scores->set_lod(lod);
 }
-void BeamSearch::PruneEndidCandidates(const framework::LoDTensor &pre_ids,
+int BeamSearch::PruneEndidCandidates(const framework::LoDTensor &pre_ids,
                                     std::vector<std::vector<Item>> *items) {
  auto *pre_ids_data = pre_ids.data<int64_t>();
+  int res = 0;
  for (size_t offset = 0; offset < items->size(); offset++) {
    auto prefix_id = pre_ids_data[offset];
    if (prefix_id == end_id_) {
      items->at(offset).clear();
+    } else {
+      res++;
    }
  }
+  return res;
 }
 std::vector<std::vector<BeamSearch::Item>> BeamSearch::ToMap(
@@ -121,11 +134,7 @@ bool BeamSearch::NextItemSet(std::vector<BeamSearch::Item> *items) {
  auto ids = *ids_;
  auto scores = *scores_;
-  auto source_abs_two_level_lod = framework::SliceInLevel(
-      ids.lod(), lod_level_, sent_offset_, sent_offset_ + 1);
-  source_abs_two_level_lod = framework::ToAbsOffset(source_abs_two_level_lod);
  auto abs_lod = framework::ToAbsOffset(ids.lod());
-  PADDLE_ENFORCE_GE(source_abs_two_level_lod.size(), 2UL);
  auto *ids_data = ids.data<int64_t>();
  auto *scores_data = scores.data<float>();
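
Review note on the sort added in the hunk above: a comparator passed to std::sort has to be a strict weak ordering, so when two keys are compared the second key should only be consulted if the first keys are equal. The following is a minimal sketch of that lexicographic (offset, id) comparison using std::tie. The Item struct, ItemLess and SortItems are hypothetical stand-ins written for illustration; they are not the BeamSearch::Item type or any helper from this patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical stand-in for BeamSearch::Item; only the fields used here.
struct Item {
  std::size_t offset;
  std::size_t id;
  float score;
};

// Lexicographic comparison on (offset, id). std::tie builds tuples of
// references, and tuple operator< compares id only when offsets are equal,
// which makes this a valid strict weak ordering.
inline bool ItemLess(const Item &a, const Item &b) {
  return std::tie(a.offset, a.id) < std::tie(b.offset, b.id);
}

// Sorts the candidates in place by offset, then by id.
inline void SortItems(std::vector<Item> *items) {
  assert(items != nullptr);
  std::sort(items->begin(), items->end(), ItemLess);
}
```

With this ordering an item with a larger offset can never be placed before one with a smaller offset, whatever their ids are.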

--- a/paddle/operators/beam_search_op.h
+++ b/paddle/operators/beam_search_op.h
@@ -73,7 +73,15 @@ namespace operators {
 * second level:
 * [0, 2, 4]
 *
- * tensor's data
+ * id tensor's data
+ * [[
+ * 4,
+ * 1,
+ * 3,
+ * 8,
+ * ]]
+ *
+ * score tensor's data
 * [[
 * 0.5,
 * 0.3,
@@ -137,15 +145,20 @@ class BeamSearch {
    Item() {}
    Item(size_t offset, size_t id, float score)
        : offset(offset), id(id), score(score) {}
-    // offset in the lod_level_+1
+    // offset in the higher lod level.
    size_t offset;
+    // // prefix id in the lower lod level.
+    // size_t prefix;
    // the candidate id
    id_t id;
    // the corresponding score
    score_t score;
  };
-  void PruneEndidCandidates(const framework::LoDTensor& pre_ids,
+  /*
+   * Delete all the records that follow the end token.
+   */
+  int PruneEndidCandidates(const framework::LoDTensor& pre_ids,
                           std::vector<std::vector<Item>>* items);
  /*

--- a/paddle/operators/beam_search_op_test.cc
+++ b/paddle/operators/beam_search_op_test.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+   http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License. */
+#include "paddle/operators/beam_search_op.h"
+#include <gtest/gtest.h>
+#include <vector>
+namespace paddle {
+namespace test {
+using std::vector;
+using framework::LoDTensor;
+using framework::LoD;
+using operators::BeamSearch;
+using paddle::platform::CPUPlace;
+using std::cout;
+using std::endl;
+void CreateInput(LoDTensor* ids, LoDTensor* scores) {
+  LoD lod;
+  vector<size_t> level0({0, 1, 4});
+  vector<size_t> level1({0, 1, 2, 3, 4});
+  lod.push_back(level0);
+  lod.push_back(level1);
+  ids->set_lod(lod);
+  scores->set_lod(lod);
+  auto dims = framework::make_ddim(vector<int64_t>({4, 3}));
+  ids->Resize(dims);
+  scores->Resize(dims);
+  CPUPlace place;
+  auto* ids_data = ids->mutable_data<int64_t>(place);
+  auto* scores_data = scores->mutable_data<float>(place);
+  vector<int64_t> _ids({4, 2, 5, 2, 1, 3, 3, 5, 2, 8, 2, 1});
+  vector<float> _scores(
+      {0.5, 0.3, 0.2, 0.6, 0.3, 0.1, 0.9, 0.5, 0.1, 0.7, 0.5, 0.1});
+  for (int i = 0; i < 12; i++) {
+    ids_data[i] = _ids[i];
+    scores_data[i] = _scores[i];
+  }
+}
+TEST(beam_search_op, run) {
+  CPUPlace place;
+  LoDTensor ids, scores;
+  CreateInput(&ids, &scores);
+  LoDTensor pre_ids;
+  pre_ids.Resize(framework::make_ddim(vector<int64_t>({4, 1})));
+  for (int i = 0; i < 4; i++) {
+    pre_ids.mutable_data<int64_t>(place)[i] = i + 1;
+  }
+  BeamSearch beamsearch(ids, scores, (int64_t)0, (int64_t)2, 0);
+  LoDTensor sids, sscores;
+  beamsearch(pre_ids, &sids, &sscores);
+  LOG(INFO) << "score: " << sscores << endl;
+  ASSERT_EQ(sids.lod(), sscores.lod());
+  vector<int> tids({2, 4, 3, 8});
+  vector<float> tscores({0.3, 0.5, 0.9, 0.7});
+  for (int i = 0; i < 4; i++) {
+    ASSERT_EQ(tids[i], sids.data<int64_t>()[i]);
+    ASSERT_EQ(tscores[i], sscores.data<float>()[i]);
+  }
+}
+}  // namespace test
+}  // namespace paddle
--- a/paddle/operators/clip_op.cc
+++ b/paddle/operators/clip_op.cc
@@ -51,8 +51,8 @@ class ClipOpMaker : public framework::OpProtoAndCheckerMaker {
    AddComment(R"DOC(
 Clip Operator.
-The clip operator limits the value of given input within an interval. The interval is
+The clip operator limits the value of given input within an interval. The
-specified with arguments 'min' and 'max':
+interval is specified with arguments 'min' and 'max':
 $$
 Out = \min(\max(X, min), max)

--- a/paddle/operators/conv_op.cc
+++ b/paddle/operators/conv_op.cc
@@ -70,6 +70,13 @@ void ConvOp::InferShape(framework::InferShapeContext* ctx) const {
 framework::OpKernelType ConvOp::GetExpectedKernelType(
    const framework::ExecutionContext& ctx) const {
  bool use_cudnn = ctx.Attr<bool>("use_cudnn");
+  use_cudnn &= platform::is_gpu_place(ctx.GetPlace());
+#ifdef PADDLE_WITH_CUDA
+  if (platform::is_gpu_place(ctx.GetPlace())) {
+    auto& dev_ctx = ctx.template device_context<platform::CUDADeviceContext>();
+    use_cudnn &= dev_ctx.cudnn_handle() != nullptr;
+  }
+#endif
  framework::LibraryType library_;
  if (use_cudnn) {
    library_ = framework::LibraryType::kCUDNN;
@@ -283,6 +290,14 @@ void ConvOpGrad::InferShape(framework::InferShapeContext* ctx) const {
 framework::OpKernelType ConvOpGrad::GetExpectedKernelType(
    const framework::ExecutionContext& ctx) const {
  bool use_cudnn = ctx.Attr<bool>("use_cudnn");
+  use_cudnn &= platform::is_gpu_place(ctx.GetPlace());
+#ifdef PADDLE_WITH_CUDA
+  if (platform::is_gpu_place(ctx.GetPlace())) {
+    auto& dev_ctx = ctx.template device_context<platform::CUDADeviceContext>();
+    use_cudnn &= dev_ctx.cudnn_handle() != nullptr;
+  }
+#endif
  framework::LibraryType library_;
  if (use_cudnn) {
    library_ = framework::LibraryType::kCUDNN;
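
Review note on the two hunks above: the added lines narrow the requested use_cudnn attribute in two steps, first by whether the op runs on a GPU place and then by whether the device context actually holds a cuDNN handle. The decision can be sketched in isolation as below. The LibraryType enum, its kPlain member and the ChooseLibrary function are hypothetical names used only for this illustration, and the handle is modelled as an opaque pointer.

```cpp
// Hypothetical enum standing in for framework::LibraryType.
enum class LibraryType { kPlain, kCUDNN };

// requested:    value of the op's "use_cudnn" attribute.
// on_gpu:       whether the execution place is a GPU place.
// cudnn_handle: handle held by the device context, null if cuDNN is missing.
inline LibraryType ChooseLibrary(bool requested, bool on_gpu,
                                 const void *cudnn_handle) {
  bool use_cudnn = requested;
  // cuDNN kernels only exist for GPU places.
  use_cudnn = use_cudnn && on_gpu;
  // On a GPU place, additionally require a usable cuDNN handle.
  if (on_gpu) {
    use_cudnn = use_cudnn && (cudnn_handle != nullptr);
  }
  return use_cudnn ? LibraryType::kCUDNN : LibraryType::kPlain;
}
```

The effect is that a model configured with use_cudnn set to true still runs on a CPU place, or on a GPU without cuDNN, by falling back to the non-cuDNN kernel.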

--- a/paddle/operators/conv_transpose_op.cc
+++ b/paddle/operators/conv_transpose_op.cc
@@ -61,6 +61,13 @@ void ConvTransposeOp::InferShape(framework::InferShapeContext* ctx) const {
 framework::OpKernelType ConvTransposeOp::GetExpectedKernelType(
    const framework::ExecutionContext& ctx) const {
  bool use_cudnn = ctx.Attr<bool>("use_cudnn");
+  use_cudnn &= platform::is_gpu_place(ctx.GetPlace());
+#ifdef PADDLE_WITH_CUDA
+  if (platform::is_gpu_place(ctx.GetPlace())) {
+    auto& dev_ctx = ctx.template device_context<platform::CUDADeviceContext>();
+    use_cudnn &= dev_ctx.cudnn_handle() != nullptr;
+  }
+#endif
  framework::LibraryType library_;
  if (use_cudnn) {
    library_ = framework::LibraryType::kCUDNN;
@@ -263,6 +270,13 @@ void ConvTransposeOpGrad::InferShape(framework::InferShapeContext* ctx) const {
 framework::OpKernelType ConvTransposeOpGrad::GetExpectedKernelType(
    const framework::ExecutionContext& ctx) const {
  bool use_cudnn = ctx.Attr<bool>("use_cudnn");
+  use_cudnn &= platform::is_gpu_place(ctx.GetPlace());
+#ifdef PADDLE_WITH_CUDA
+  if (platform::is_gpu_place(ctx.GetPlace())) {
+    auto& dev_ctx = ctx.template device_context<platform::CUDADeviceContext>();
+    use_cudnn &= dev_ctx.cudnn_handle() != nullptr;
+  }
+#endif
  framework::LibraryType library_;
  if (use_cudnn) {
    library_ = framework::LibraryType::kCUDNN;

--- a/paddle/operators/elementwise_add_op.h
+++ b/paddle/operators/elementwise_add_op.h
@@ -28,39 +28,7 @@ template <typename DeviceContext, typename T>
 class ElementwiseAddKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& ctx) const override {
-    using Tensor = framework::Tensor;
+    ElementwiseComputeEx<AddFunctor<T>, DeviceContext, T>(ctx);
-    auto* x = ctx.Input<Tensor>("X");
-    auto* y = ctx.Input<Tensor>("Y");
-    auto* z = ctx.Output<Tensor>("Out");
-    z->mutable_data<T>(ctx.GetPlace());
-    TransformFunctor<AddFunctor<T>, T, DeviceContext> functor(
-        x, y, z, ctx.template device_context<DeviceContext>(), AddFunctor<T>());
-    auto x_dims = x->dims();
-    auto y_dims = y->dims();
-    PADDLE_ENFORCE_GE(x_dims.size(), y_dims.size(),
-                      "Rank of first input must >= rank of second input.");
-    if (x_dims == y_dims) {
-      functor.Run();
-      return;
-    }
-    int axis = ctx.Attr<int>("axis");
-    axis = (axis == -1 ? x_dims.size() - y_dims.size() : axis);
-    PADDLE_ENFORCE(axis >= 0 && axis < x_dims.size(),
-                   "Axis should be in range [0, x_dims)");
-    int pre, n, post;
-    get_mid_dims(x_dims, y_dims, axis, pre, n, post);
-    if (post == 1) {
-      functor.RunRowWise(n, pre);
-      return;
-    } else {
-      functor.RunMidWise(n, pre, post);
-      return;
-    }
  }
 };
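
Review note on the hunk above: the removed body viewed X as [pre, n, post] and Y as [n], then applied the functor row-wise when post == 1 and mid-wise otherwise. A minimal CPU-only sketch of that broadcast is given below, assuming flat row-major storage. BroadcastMidWise and the AddFunctor defined here are illustrative stand-ins, not the ElementwiseComputeEx or TransformFunctor implementations from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in binary functor, in the style of the functors used by the kernels.
template <typename T>
struct AddFunctor {
  T operator()(T a, T b) const { return a + b; }
};

// x is viewed as [pre, n, post] and y as [n]; the result has x's shape.
// Element (i, j, k) of x is combined with y[j]. The row-wise case is
// the same loop with post == 1.
template <typename Functor, typename T>
std::vector<T> BroadcastMidWise(const std::vector<T> &x,
                                const std::vector<T> &y, std::size_t pre,
                                std::size_t n, std::size_t post,
                                Functor func) {
  assert(x.size() == pre * n * post);
  assert(y.size() == n);
  std::vector<T> out(x.size());
  for (std::size_t i = 0; i < pre; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      for (std::size_t k = 0; k < post; ++k) {
        const std::size_t idx = (i * n + j) * post + k;
        out[idx] = func(x[idx], y[j]);
      }
    }
  }
  return out;
}
```

Passing the functor as a template parameter is what lets one routine serve add, div, max and min, which is the consolidation this merge makes.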

--- a/paddle/operators/elementwise_div_op.h
+++ b/paddle/operators/elementwise_div_op.h
@@ -19,11 +19,16 @@ limitations under the License. */
 namespace paddle {
 namespace operators {
+template <typename T>
+struct DivFunctor {
+  inline HOSTDEVICE T operator()(T a, T b) const { return a / b; }
+};
 template <typename DeviceContext, typename T>
 class ElementwiseDivKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& ctx) const override {
-    ElementwiseCompute<EigenDivFunctor, DeviceContext, T>(ctx);
+    ElementwiseComputeEx<DivFunctor<T>, DeviceContext, T>(ctx);
  }
 };

--- a/paddle/operators/elementwise_max_op.cc
+++ b/paddle/operators/elementwise_max_op.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+#include "paddle/operators/elementwise_max_op.h"
+#include "paddle/operators/elementwise_op.h"
+namespace paddle {
+namespace operators {
+class ElementwiseMaxOpMaker : public ElementwiseOpMaker {
+ public:
+  ElementwiseMaxOpMaker(OpProto* proto, OpAttrChecker* op_checker)
+      : ElementwiseOpMaker(proto, op_checker) {
+    SetComment("Max", "Out = max(X, Y)");
+    AddComment(comment_);
+  }
+};
+}  // namespace operators
+}  // namespace paddle
+namespace ops = paddle::operators;
+REGISTER_OP(elementwise_max, ops::ElementwiseOp, ops::ElementwiseMaxOpMaker,
+            elementwise_max_grad, ops::ElementwiseOpGrad);
+REGISTER_OP_CPU_KERNEL(
+    elementwise_max,
+    ops::ElementwiseMaxKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::ElementwiseMaxKernel<paddle::platform::CPUDeviceContext, double>,
+    ops::ElementwiseMaxKernel<paddle::platform::CPUDeviceContext, int>,
+    ops::ElementwiseMaxKernel<paddle::platform::CPUDeviceContext, int64_t>);
+REGISTER_OP_CPU_KERNEL(
+    elementwise_max_grad,
+    ops::ElementwiseMaxGradKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::ElementwiseMaxGradKernel<paddle::platform::CPUDeviceContext, double>,
+    ops::ElementwiseMaxGradKernel<paddle::platform::CPUDeviceContext, int>,
+    ops::ElementwiseMaxGradKernel<paddle::platform::CPUDeviceContext, int64_t>);
--- a/paddle/operators/elementwise_max_op.cu
+++ b/paddle/operators/elementwise_max_op.cu
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+#define EIGEN_USE_GPU
+#include "paddle/operators/elementwise_max_op.h"
+namespace ops = paddle::operators;
+REGISTER_OP_CUDA_KERNEL(
+    elementwise_max,
+    ops::ElementwiseMaxKernel<paddle::platform::CUDADeviceContext, float>,
+    ops::ElementwiseMaxKernel<paddle::platform::CUDADeviceContext, double>,
+    ops::ElementwiseMaxKernel<paddle::platform::CUDADeviceContext, int>,
+    ops::ElementwiseMaxKernel<paddle::platform::CUDADeviceContext, int64_t>);
+REGISTER_OP_CUDA_KERNEL(
+    elementwise_max_grad,
+    ops::ElementwiseMaxGradKernel<paddle::platform::CUDADeviceContext, float>,
+    ops::ElementwiseMaxGradKernel<paddle::platform::CUDADeviceContext, double>,
+    ops::ElementwiseMaxGradKernel<paddle::platform::CUDADeviceContext, int>,
+    ops::ElementwiseMaxGradKernel<paddle::platform::CUDADeviceContext,
+                                  int64_t>);
--- a/paddle/operators/elementwise_max_op.h
+++ b/paddle/operators/elementwise_max_op.h
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+#pragma once
+#include "paddle/operators/elementwise_op_function.h"
+namespace paddle {
+namespace operators {
+template <typename T>
+struct MaxFunctor {
+  inline HOSTDEVICE T operator()(T a, T b) const { return a > b ? a : b; }
+};
+template <typename DeviceContext, typename T>
+class ElementwiseMaxKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    ElementwiseComputeEx<MaxFunctor<T>, DeviceContext, T>(ctx);
+  }
+};
+template <typename T>
+struct ElementwiseMaxGradFunctor {
+  template <typename Device, typename X, typename Y, typename Z, typename dX,
+            typename dY, typename dZ>
+  void operator()(Device d, X x, Y y, Z z, dX dx, dY dy, dZ dz) {
+    auto x_e = framework::EigenVector<T>::Flatten(*x);
+    auto y_e = framework::EigenVector<T>::Flatten(*y);
+    auto dz_e = framework::EigenVector<T>::Flatten(*dz);
+    if (dx) {
+      auto dx_e = framework::EigenVector<T>::Flatten(*dx);
+      dx_e.device(d) = (x_e > y_e).template cast<T>() * dz_e;
+    }
+    if (dy) {
+      auto dy_e = framework::EigenVector<T>::Flatten(*dy);
+      dy_e.device(d) = (x_e <= y_e).template cast<T>() * dz_e;
+    }
+  }
+};
+template <typename T>
+struct ElementwiseMaxBroadCastGradFunctor {
+  template <typename Device, typename X, typename Y, typename Z, typename dX,
+            typename dY, typename dZ, typename Pre, typename N>
+  void operator()(Device d, X x, Y y, Z z, dX dx, dY dy, dZ dz, Pre pre, N n) {
+    auto x_e = framework::EigenVector<T>::Flatten(*x);
+    auto y_e = framework::EigenVector<T>::Flatten(*y);
+    auto dz_e = framework::EigenVector<T>::Flatten(*dz);
+    auto y_e_bcast = y_e.reshape(Eigen::DSizes<int, 2>(1, n))
+                         .broadcast(Eigen::DSizes<int, 2>(pre, 1))
+                         .reshape(Eigen::DSizes<int, 1>(x_e.size()));
+    if (dx) {
+      auto dx_e = framework::EigenVector<T>::Flatten(*dx);
+      dx_e.device(d) = (x_e > y_e_bcast).template cast<T>() * dz_e;
+    }
+    if (dy) {
+      auto dy_e = framework::EigenVector<T>::Flatten(*dy);
+      dy_e.device(d) = ((x_e <= y_e_bcast).template cast<T>() * dz_e)
+                           .reshape(Eigen::DSizes<int, 2>(pre, n))
+                           .sum(Eigen::array<int, 1>{{0}});
+    }
+  }
+};
+template <typename T>
+struct ElementwiseMaxBroadCast2GradFunctor {
+  template <typename Device, typename X, typename Y, typename Z, typename dX,
+            typename dY, typename dZ, typename Pre, typename N, typename Post>
+  void operator()(Device d, X x, Y y, Z z, dX dx, dY dy, dZ dz, Pre pre, N n,
+                  Post post) {
+    auto x_e = framework::EigenVector<T>::Flatten(*x);
+    auto y_e = framework::EigenVector<T>::Flatten(*y);
+    auto dz_e = framework::EigenVector<T>::Flatten(*dz);
+    auto y_e_bcast = y_e.reshape(Eigen::DSizes<int, 3>(1, n, 1))
+                         .broadcast(Eigen::DSizes<int, 3>(pre, 1, post))
+                         .reshape(Eigen::DSizes<int, 1>(x_e.size()));
+    if (dx) {
+      auto dx_e = framework::EigenVector<T>::Flatten(*dx);
+      dx_e.device(d) = (x_e > y_e_bcast).template cast<T>() * dz_e;
+    }
+    if (dy) {
+      auto dy_e = framework::EigenVector<T>::Flatten(*dy);
+      dy_e.device(d) = ((x_e <= y_e_bcast).template cast<T>() * dz_e)
+                           .reshape(Eigen::DSizes<int, 3>(pre, n, post))
+                           .sum(Eigen::array<int, 2>{{0, 2}});
+    }
+  }
+};
+template <typename DeviceContext, typename T>
+class ElementwiseMaxGradKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    ElementwiseGradCompute<DeviceContext, T, ElementwiseMaxGradFunctor<T>,
+                           ElementwiseMaxBroadCastGradFunctor<T>,
+                           ElementwiseMaxBroadCast2GradFunctor<T>>(ctx);
+  }
+};
+}  // namespace operators
+}  // namespace paddle
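
Review note on the gradient functors in the file above: the upstream gradient dz flows to x where x > y and to y where x <= y, and in the broadcast case the contributions to dy are summed over the broadcast dimension. The sketch below restates the [pre, n] case with plain loops instead of Eigen expressions. MaxGradRowWise is a hypothetical name, and flat row-major vectors are assumed in place of tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// x and dz are viewed as [pre, n], y as [n].
// dx[i, j] = dz[i, j] if x[i, j] > y[j], else 0.
// dy[j]    = sum over i of dz[i, j] where x[i, j] <= y[j].
template <typename T>
void MaxGradRowWise(const std::vector<T> &x, const std::vector<T> &y,
                    const std::vector<T> &dz, std::size_t pre, std::size_t n,
                    std::vector<T> *dx, std::vector<T> *dy) {
  assert(x.size() == pre * n);
  assert(dz.size() == x.size());
  assert(y.size() == n);
  if (dx) dx->assign(x.size(), T(0));
  if (dy) dy->assign(n, T(0));
  for (std::size_t i = 0; i < pre; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      const std::size_t idx = i * n + j;
      if (x[idx] > y[j]) {
        if (dx) (*dx)[idx] = dz[idx];
      } else {
        // Ties go to y, matching the (x <= y) mask used for dy.
        if (dy) (*dy)[j] += dz[idx];
      }
    }
  }
}
```

Because the masks x > y and x <= y are complementary, every element of dz is routed to exactly one of the two inputs.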
--- a/paddle/operators/elementwise_min_op.cc
+++ b/paddle/operators/elementwise_min_op.cc
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+#include "paddle/operators/elementwise_min_op.h"
+#include "paddle/operators/elementwise_op.h"
+namespace paddle {
+namespace operators {
+class ElementwiseMinOpMaker : public ElementwiseOpMaker {
+ public:
+  ElementwiseMinOpMaker(OpProto* proto, OpAttrChecker* op_checker)
+      : ElementwiseOpMaker(proto, op_checker) {
+    SetComment("Min", "Out = min(X, Y)");
+    AddComment(comment_);
+  }
+};
+}  // namespace operators
+}  // namespace paddle
+namespace ops = paddle::operators;
+REGISTER_OP(elementwise_min, ops::ElementwiseOp, ops::ElementwiseMinOpMaker,
+            elementwise_min_grad, ops::ElementwiseOpGrad);
+REGISTER_OP_CPU_KERNEL(
+    elementwise_min,
+    ops::ElementwiseMinKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::ElementwiseMinKernel<paddle::platform::CPUDeviceContext, double>,
+    ops::ElementwiseMinKernel<paddle::platform::CPUDeviceContext, int>,
+    ops::ElementwiseMinKernel<paddle::platform::CPUDeviceContext, int64_t>);
+REGISTER_OP_CPU_KERNEL(
+    elementwise_min_grad,
+    ops::ElementwiseMinGradKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::ElementwiseMinGradKernel<paddle::platform::CPUDeviceContext, double>,
+    ops::ElementwiseMinGradKernel<paddle::platform::CPUDeviceContext, int>,
+    ops::ElementwiseMinGradKernel<paddle::platform::CPUDeviceContext, int64_t>);
--- a/paddle/operators/elementwise_min_op.cu
+++ b/paddle/operators/elementwise_min_op.cu
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+#define EIGEN_USE_GPU
+#include "paddle/operators/elementwise_min_op.h"
+namespace ops = paddle::operators;
+REGISTER_OP_CUDA_KERNEL(
+    elementwise_min,
+    ops::ElementwiseMinKernel<paddle::platform::CUDADeviceContext, float>,
+    ops::ElementwiseMinKernel<paddle::platform::CUDADeviceContext, double>,
+    ops::ElementwiseMinKernel<paddle::platform::CUDADeviceContext, int>,
+    ops::ElementwiseMinKernel<paddle::platform::CUDADeviceContext, int64_t>);
+REGISTER_OP_CUDA_KERNEL(
+    elementwise_min_grad,
+    ops::ElementwiseMinGradKernel<paddle::platform::CUDADeviceContext, float>,
+    ops::ElementwiseMinGradKernel<paddle::platform::CUDADeviceContext, double>,
+    ops::ElementwiseMinGradKernel<paddle::platform::CUDADeviceContext, int>,
+    ops::ElementwiseMinGradKernel<paddle::platform::CUDADeviceContext,
+                                  int64_t>);
--- a/paddle/operators/elementwise_min_op.h
+++ b/paddle/operators/elementwise_min_op.h
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+#pragma once
+#include "paddle/operators/elementwise_op_function.h"
+namespace paddle {
+namespace operators {
+template <typename T>
+struct MinFunctor {
+  inline HOSTDEVICE T operator()(T a, T b) const { return a < b ? a : b; }
+};
+template <typename DeviceContext, typename T>
+class ElementwiseMinKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    ElementwiseComputeEx<MinFunctor<T>, DeviceContext, T>(ctx);
+  }
+};
+template <typename T>
+struct ElementwiseMinGradFunctor {
+  template <typename Device, typename X, typename Y, typename Z, typename dX,
+            typename dY, typename dZ>
+  void operator()(Device d, X x, Y y, Z z, dX dx, dY dy, dZ dz) {
+    auto x_e = framework::EigenVector<T>::Flatten(*x);
+    auto y_e = framework::EigenVector<T>::Flatten(*y);
+    auto dz_e = framework::EigenVector<T>::Flatten(*dz);
+    if (dx) {
+      auto dx_e = framework::EigenVector<T>::Flatten(*dx);
+      dx_e.device(d) = (x_e < y_e).template cast<T>() * dz_e;
+    }
+    if (dy) {
+      auto dy_e = framework::EigenVector<T>::Flatten(*dy);
+      dy_e.device(d) = (x_e >= y_e).template cast<T>() * dz_e;
+    }
+  }
+};
+template <typename T>
+struct ElementwiseMinBroadCastGradFunctor {
+  template <typename Device, typename X, typename Y, typename Z, typename dX,
+            typename dY, typename dZ, typename Pre, typename N>
+  void operator()(Device d, X x, Y y, Z z, dX dx, dY dy, dZ dz, Pre pre, N n) {
+    auto x_e = framework::EigenVector<T>::Flatten(*x);
+    auto y_e = framework::EigenVector<T>::Flatten(*y);
+    auto dz_e = framework::EigenVector<T>::Flatten(*dz);
+    auto y_e_bcast = y_e.reshape(Eigen::DSizes<int, 2>(1, n))
+                         .broadcast(Eigen::DSizes<int, 2>(pre, 1))
+                         .reshape(Eigen::DSizes<int, 1>(x_e.size()));
+    if (dx) {
+      auto dx_e = framework::EigenVector<T>::Flatten(*dx);
+      dx_e.device(d) = (x_e < y_e_bcast).template cast<T>() * dz_e;
+    }
+    if (dy) {
+      auto dy_e = framework::EigenVector<T>::Flatten(*dy);
+      dy_e.device(d) = ((x_e >= y_e_bcast).template cast<T>() * dz_e)
+                           .reshape(Eigen::DSizes<int, 2>(pre, n))
+                           .sum(Eigen::array<int, 1>{{0}});
+    }
+  }
+};
+template <typename T>
+struct ElementwiseMinBroadCast2GradFunctor {
+  template <typename Device, typename X, typename Y, typename Z, typename dX,
+            typename dY, typename dZ, typename Pre, typename N, typename Post>
+  void operator()(Device d, X x, Y y, Z z, dX dx, dY dy, dZ dz, Pre pre, N n,
+                  Post post) {
+    auto x_e = framework::EigenVector<T>::Flatten(*x);
+    auto y_e = framework::EigenVector<T>::Flatten(*y);
+    auto dz_e = framework::EigenVector<T>::Flatten(*dz);
+    auto y_e_bcast = y_e.reshape(Eigen::DSizes<int, 3>(1, n, 1))
+                         .broadcast(Eigen::DSizes<int, 3>(pre, 1, post))
+                         .reshape(Eigen::DSizes<int, 1>(x_e.size()));
+    if (dx) {
+      auto dx_e = framework::EigenVector<T>::Flatten(*dx);
+      dx_e.device(d) = (x_e < y_e_bcast).template cast<T>() * dz_e;
+    }
+    if (dy) {
+      auto dy_e = framework::EigenVector<T>::Flatten(*dy);
+      dy_e.device(d) = ((x_e >= y_e_bcast).template cast<T>() * dz_e)
+                           .reshape(Eigen::DSizes<int, 3>(pre, n, post))
+                           .sum(Eigen::array<int, 2>{{0, 2}});
+    }
+  }
+};
+template <typename DeviceContext, typename T>
+class ElementwiseMinGradKernel : public framework::OpKernel<T> {
+ public:
+  void Compute(const framework::ExecutionContext& ctx) const override {
+    ElementwiseGradCompute<DeviceContext, T, ElementwiseMinGradFunctor<T>,
+                           ElementwiseMinBroadCastGradFunctor<T>,
+                           ElementwiseMinBroadCast2GradFunctor<T>>(ctx);
+  }
+};
+}  // namespace operators
+}  // namespace paddle
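
Reviewer sketch (not part of the patch): the broadcast gradient functors above reshape the flattened data to `(pre, n, post)`, route `dz` to `dx` where `x < y`, and sum the remaining `dz` over the `pre` and `post` axes to obtain `dy`. The plain-Python version below mirrors that rule under the assumption of row-major flattening; the function name `min_grad_broadcast` is hypothetical and exists only for illustration.

```python
def min_grad_broadcast(x, y, dz, pre, n, post):
    """Gradient of out = min(x, y) with y broadcast along the middle axis.

    x and dz are flat lists of length pre * n * post (row-major assumed);
    y is a flat list of length n.
    """
    dx = [0.0] * len(x)
    dy = [0.0] * n
    for i in range(pre):
        for j in range(n):
            for k in range(post):
                idx = (i * n + j) * post + k
                if x[idx] < y[j]:
                    # x is the strict minimum: gradient flows to x.
                    dx[idx] = dz[idx]
                else:
                    # Ties and y < x: gradient flows to y, summed over
                    # the broadcast axes (pre and post).
                    dy[j] += dz[idx]
    return dx, dy
```

Ties (`x == y`) send the whole gradient to `y`, matching the `x_e >= y_e_bcast` mask in the patch. With `post == 1` the same loop covers the two-dimensional `(pre, n)` case handled by `ElementwiseMinBroadCastGradFunctor`.
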
--- a/paddle/operators/elementwise_mul_op.h
+++ b/paddle/operators/elementwise_mul_op.h
@@ -18,11 +18,16 @@ limitations under the License. */
 namespace paddle {
 namespace operators {
+template <typename T>
+struct MulFunctor {
+  inline HOSTDEVICE T operator()(T a, T b) const { return a * b; }
+};
 template <typename DeviceContext, typename T>
 class ElementwiseMulKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& ctx) const override {
-    ElementwiseCompute<EigenMulFunctor, DeviceContext, T>(ctx);
+    ElementwiseComputeEx<MulFunctor<T>, DeviceContext, T>(ctx);
  }
 };

--- a/paddle/operators/elementwise_op.h
+++ b/paddle/operators/elementwise_op.h
@@ -26,9 +26,9 @@ class ElementwiseOp : public framework::OperatorWithKernel {
  using Tensor = framework::Tensor;
  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"),
-                   "Input(X) of elementwise op should not be null");
+                   "Input(X) of elementwise op should not be null.");
    PADDLE_ENFORCE(ctx->HasInput("Y"),
-                   "Input(Y) of elementwise op should not be null");
+                   "Input(Y) of elementwise op should not be null.");
    PADDLE_ENFORCE(ctx->HasOutput("Out"),
                   "Output(Out) of elementwise op should not be null.");
@@ -45,12 +45,12 @@ class ElementwiseOpMaker : public framework::OpProtoAndCheckerMaker {
 public:
  ElementwiseOpMaker(OpProto* proto, OpAttrChecker* op_checker)
      : OpProtoAndCheckerMaker(proto, op_checker) {
-    AddInput("X", "(Tensor) The first input tensor of elementwise op");
+    AddInput("X", "(Tensor), The first input tensor of elementwise op.");
-    AddInput("Y", "(Tensor) The second input tensor of elementwise op");
+    AddInput("Y", "(Tensor), The second input tensor of elementwise op.");
-    AddOutput("Out", "The output of elementwise op");
+    AddOutput("Out", "The output of elementwise op.");
    AddAttr<int>("axis",
-                 "(int, default -1) The starting dimension index "
+                 "(int, default -1). The start dimension index "
-                 "for broadcasting Y onto X")
+                 "for broadcasting Y onto X.")
        .SetDefault(-1)
        .EqualGreaterThan(-1);
    comment_ = R"DOC(
@@ -58,19 +58,18 @@ Limited Elementwise {name} Operator.
 The equation is:
-.. math::
+$${equation}$$
-  {equation}
-X is a tensor of any dimension and the dimensions of tensor Y must be smaller than
+$X$ is a tensor of any dimension and the dimensions of tensor $Y$ must be
-or equal to the dimensions of X. 
+smaller than or equal to the dimensions of $X$.
 There are two cases for this operator:
-1. The shape of Y is same with X;
+1. The shape of $Y$ is the same as that of $X$;
-2. The shape of Y is a subset of X.
+2. The shape of $Y$ is a subset of $X$.
 For case 2:
-Y will be broadcasted to match the shape of X and axis should be 
+$Y$ will be broadcast to match the shape of $X$ and axis should be
-the starting dimension index for broadcasting Y onto X.
+set to the index of the start dimension for broadcasting $Y$ onto $X$.
 For example
  .. code-block:: python
@@ -81,7 +80,8 @@ For example
    shape(X) = (2, 3, 4, 5), shape(Y) = (3, 4), with axis=1
    shape(X) = (2, 3, 4, 5), shape(Y) = (2), with axis=0
-Either of the inputs X and Y or none can carry the LoD (Level of Details) information. However, the output only shares the LoD information with input X.
+Either of the inputs $X$ and $Y$ or none can carry the LoD (Level of Details)
+information. However, the output only shares the LoD information with input $X$.
 )DOC";
    AddComment(comment_);

--- a/paddle/operators/elementwise_op_function.h
+++ b/paddle/operators/elementwise_op_function.h
@@ -340,6 +340,13 @@ void ElementwiseGradCompute(const framework::ExecutionContext& ctx) {
    return;
  }
+  if (y_dims.size() == 1 && y_dims[0] == 1) {
+    // y is a scalar
+    auto extended_dims = framework::vectorize(x_dims);
+    extended_dims.push_back(1);
+    x_dims = framework::make_ddim(extended_dims);
+  }
  int axis = ctx.Attr<int>("axis");
  axis = (axis == -1 ? x_dims.size() - y_dims.size() : axis);
@@ -356,5 +363,50 @@ void ElementwiseGradCompute(const framework::ExecutionContext& ctx) {
    return;
  }
 }
+template <typename Functor, typename DeviceContext, typename T>
+void ElementwiseComputeEx(const framework::ExecutionContext& ctx) {
+  using Tensor = framework::Tensor;
+  auto* x = ctx.Input<Tensor>("X");
+  auto* y = ctx.Input<Tensor>("Y");
+  auto* z = ctx.Output<Tensor>("Out");
+  z->mutable_data<T>(ctx.GetPlace());
+  TransformFunctor<Functor, T, DeviceContext> functor(
+      x, y, z, ctx.template device_context<DeviceContext>(), Functor());
+  auto x_dims = x->dims();
+  auto y_dims = y->dims();
+  PADDLE_ENFORCE_GE(x_dims.size(), y_dims.size(),
+                    "Rank of first input must be >= rank of second input.");
+  if (x_dims == y_dims) {
+    functor.Run();
+    return;
+  }
+  if (y_dims.size() == 1 && y_dims[0] == 1) {
+    // y is a scalar
+    auto extended_dims = framework::vectorize(x_dims);
+    extended_dims.push_back(1);
+    x_dims = framework::make_ddim(extended_dims);
+  }
+  int axis = ctx.Attr<int>("axis");
+  axis = (axis == -1 ? x_dims.size() - y_dims.size() : axis);
+  PADDLE_ENFORCE(axis >= 0 && axis < x_dims.size(),
+                 "Axis should be in range [0, x_dims.size()).");
+  int pre, n, post;
+  get_mid_dims(x_dims, y_dims, axis, pre, n, post);
+  if (post == 1) {
+    functor.RunRowWise(n, pre);
+    return;
+  } else {
+    functor.RunMidWise(n, pre, post);
+    return;
+  }
+}
 }  // namespace operators
 }  // namespace paddle
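
Reviewer sketch (not part of the patch): `ElementwiseComputeEx` relies on `get_mid_dims`, whose body is not shown in this diff. The Python function below is an assumption about what it computes, based on the shape examples in the operator documentation: `pre` is the product of the dimensions of X before `axis`, `n` is the number of elements of Y, and `post` is the product of the remaining dimensions of X. The names `prod` and `mid_dims` are hypothetical, and the equal-shape fast path is omitted.

```python
from functools import reduce
from operator import mul


def prod(dims):
    """Product of a sequence of dimension sizes (1 for an empty sequence)."""
    return reduce(mul, dims, 1)


def mid_dims(x_dims, y_dims, axis=-1):
    """Return (pre, n, post) for broadcasting Y onto X starting at axis."""
    x_dims, y_dims = list(x_dims), list(y_dims)
    if len(x_dims) < len(y_dims):
        raise ValueError("rank of X must be >= rank of Y")
    if y_dims == [1]:
        # Scalar Y: append a trailing 1 to X so that Y lines up with it,
        # as the patch does with extended_dims.
        x_dims = x_dims + [1]
    if axis == -1:
        axis = len(x_dims) - len(y_dims)
    if not 0 <= axis < len(x_dims):
        raise ValueError("axis out of range")
    if x_dims[axis:axis + len(y_dims)] != y_dims:
        raise ValueError("Y does not match X at the given axis")
    pre = prod(x_dims[:axis])
    n = prod(y_dims)
    post = prod(x_dims[axis + len(y_dims):])
    return pre, n, post
```

When `post == 1` the patch dispatches to `RunRowWise(n, pre)`; otherwise it calls `RunMidWise(n, pre, post)`. A scalar Y always ends up in the row-wise branch with `n == 1`.
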
--- a/paddle/operators/elementwise_sub_op.h
+++ b/paddle/operators/elementwise_sub_op.h
@@ -18,11 +18,16 @@ limitations under the License. */
 namespace paddle {
 namespace operators {
+template <typename T>
+struct SubFunctor {
+  inline HOSTDEVICE T operator()(T a, T b) const { return a - b; }
+};
 template <typename DeviceContext, typename T>
 class ElementwiseSubKernel : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& ctx) const override {
-    ElementwiseCompute<EigenSubFunctor, DeviceContext, T>(ctx);
+    ElementwiseComputeEx<SubFunctor<T>, DeviceContext, T>(ctx);
  }
 };

--- a/paddle/operators/expand_op.cc
+++ b/paddle/operators/expand_op.cc
@@ -58,21 +58,21 @@ class ExpandOpMaker : public framework::OpProtoAndCheckerMaker {
  ExpandOpMaker(OpProto* proto, OpAttrChecker* op_checker)
      : OpProtoAndCheckerMaker(proto, op_checker) {
    AddInput("X",
-             "(Tensor, default Tensor<float>) A tensor with rank in [1, 6]."
+             "(Tensor, default Tensor<float>). A tensor with rank in [1, 6]. "
-             "X is the input tensor to be expanded.");
+             "X is the input to be expanded.");
    AddOutput("Out",
-              "(Tensor, default Tensor<float>) A tensor with rank in [1, 6]."
+              "(Tensor, default Tensor<float>). A tensor with rank in [1, 6]. "
-              "The rank of Output(Out) is same as Input(X) except that each "
+              "The rank of Output(Out) is the same as that of Input(X). "
-              "dimension size of Output(Out) is equal to corresponding "
+              "After expanding, size of each dimension of Output(Out) is equal "
-              "dimension size of Input(X) multiplying corresponding value of "
+              "to the size of the corresponding dimension of Input(X) multiplied by "
-              "Attr(expand_times).");
+              "the corresponding value given by Attr(expand_times).");
    AddAttr<std::vector<int>>("expand_times",
                              "Expand times number for each dimension.");
    AddComment(R"DOC(
 Expand operator tiles the input by given times number. You should set times
 number for each dimension by providing attribute 'expand_times'. The rank of X
-should be in [1, 6]. Please notice that size of 'expand_times' must be same with
+should be in [1, 6]. Please note that size of 'expand_times' must be the same
-X's rank. Following is a using case:
+as X's rank. The following is a use case:
 Input(X) is a 3-D tensor with shape [2, 3, 1]:

--- a/paddle/operators/op_documentation/batch_norm_op.md
+++ b/paddle/operators/op_documentation/batch_norm_op.md
@@ -66,7 +66,7 @@ As most C++ operators do, `batch_norm_op` is defined by inputs, outputs, attribu
 The following graph showes the training computational process of `batch_norm_op`:
-<img src="./images/batch_norm_op_kernel.png" width="800"/>
+<img src="../images/batch_norm_op_kernel.png" width="800"/>
 cudnn provides APIs to finish the whole series of computation, we can use them in our GPU kernel.
@@ -124,7 +124,7 @@ for pass_id in range(PASS_NUM):
 `is_infer` is an attribute. Once an operator is created, its attributes can not be changed. It suggests us that we shall maintain two `batch_norm_op` in the model, one's `is_infer` is `True`(we call it `infer_batch_norm_op`) and the other one's is `False`(we call it `train_batch_norm_op`). They share all parameters and variables, but be placed in two different branches. That is to say, if a network contains a `batch_norm_op`, it will fork into two branches, one go through `train_batch_norm_op` and the other one go through `infer_batch_norm_op`:
 <div align=center>
-<img src="./images/batch_norm_fork.png" width="500"/>
+<img src="../images/batch_norm_fork.png" width="500"/>
 </div>
 Just like what is shown in the above graph, the net forks before `batch_norm_op` and will never merge again. All the operators after `batch_norm_op` will duplicate. 

--- a/paddle/operators/parallel_do_op.cc
+++ b/paddle/operators/parallel_do_op.cc
@@ -30,16 +30,13 @@ static constexpr char kParallelScopes[] = "parallel_scopes";
 static constexpr char kParallelBlock[] = "sub_block";
-// using ParallelScopeVar = std::vector<framework::Scope *>;
 using LoDTensor = framework::LoDTensor;
-using OperatorBase = framework::OperatorBase;
-void SplitTensorAndMoveTensorToScopes(
+static void SplitTensorAndMoveTensorToScopes(
-    const framework::Scope &scope,
+    const framework::Scope &scope, std::vector<framework::Scope *> *sub_scopes,
-    const std::vector<framework::Scope *> &sub_scopes,
    const std::vector<platform::Place> &places,
    const std::vector<std::string> &names) {
-  PADDLE_ENFORCE_EQ(sub_scopes.size(), places.size());
+  size_t num_sub_scopes = 0;
  for (auto &argu : names) {
    auto *var = scope.FindVar(argu);
    const auto &tensor = var->Get<LoDTensor>();
@@ -48,9 +45,21 @@ void SplitTensorAndMoveTensorToScopes(
    for (auto &lod : lod_tensors) {
      VLOG(3) << lod.dims();
    }
+    if (num_sub_scopes == 0) {
+      num_sub_scopes = lod_tensors.size();
+    } else {
+      PADDLE_ENFORCE_EQ(num_sub_scopes, lod_tensors.size());
+    }
+    PADDLE_ENFORCE_NE(num_sub_scopes, 0);
+    if (sub_scopes->size() == 0) {
+      sub_scopes->reserve(num_sub_scopes);
+      for (size_t i = 0; i < num_sub_scopes; ++i) {
+        sub_scopes->emplace_back(&scope.NewScope());
+      }
+    }
-    for (size_t i = 0; i < sub_scopes.size(); ++i) {
+    for (size_t i = 0; i < lod_tensors.size(); ++i) {
-      *sub_scopes[i]->Var(argu)->GetMutable<LoDTensor>() = lod_tensors[i];
+      *(*sub_scopes)[i]->Var(argu)->GetMutable<LoDTensor>() = lod_tensors[i];
    }
  }
 }
@@ -70,7 +79,7 @@ class ParallelDoOp : public framework::OperatorBase {
               const framework::VariableNameMap &inputs,
               const framework::VariableNameMap &outputs,
               const framework::AttributeMap &attrs)
-      : OperatorBase(type, inputs, outputs, attrs) {}
+      : framework::OperatorBase(type, inputs, outputs, attrs) {}
  void Run(const framework::Scope &scope,
           const platform::Place &place) const override {
@@ -85,19 +94,17 @@ class ParallelDoOp : public framework::OperatorBase {
    auto &sub_scopes = *scope.FindVar(Output(kParallelScopes))
                            ->GetMutable<std::vector<framework::Scope *>>();
-    for (size_t place_idx = 0; place_idx < places.size(); ++place_idx) {
-      sub_scopes.push_back(&scope.NewScope());
-    }
    // split input
-    SplitTensorAndMoveTensorToScopes(scope, sub_scopes, places,
+    SplitTensorAndMoveTensorToScopes(scope, &sub_scopes, places,
                                     Inputs(kInputs));
    // copy parameter
    for (auto &param : Inputs(kParameters)) {
      PADDLE_ENFORCE(scope.FindVar(param)->IsType<LoDTensor>(),
                     "Only support parameter type as LoDTensor");
      auto &src = scope.FindVar(param)->Get<LoDTensor>();
-      for (size_t i = 0; i < places.size(); ++i) {
+      for (size_t i = 0; i < sub_scopes.size(); ++i) {
        auto &place = places[i];
        auto *sub_scope = sub_scopes[i];
        auto *dst = sub_scope->Var(param)->GetMutable<LoDTensor>();
@@ -108,9 +115,7 @@ class ParallelDoOp : public framework::OperatorBase {
    std::vector<std::future<void>> workers;
    workers.reserve(places.size());
-    for (size_t place_idx = 0; place_idx < places.size(); ++place_idx) {
+    for (size_t place_idx = 0; place_idx < sub_scopes.size(); ++place_idx) {
-      VLOG(3) << "Run " << place_idx;
      auto &place = places[place_idx];
      auto *cur_scope = sub_scopes[place_idx];
@@ -157,21 +162,16 @@ ParallelDo Operator.
  }
 };
-class ParallelDoGradOp : public OperatorBase {
+class ParallelDoGradOp : public framework::OperatorBase {
 public:
  ParallelDoGradOp(const std::string &type,
                   const framework::VariableNameMap &inputs,
                   const framework::VariableNameMap &outputs,
                   const framework::AttributeMap &attrs)
-      : OperatorBase(type, inputs, outputs, attrs) {}
+      : framework::OperatorBase(type, inputs, outputs, attrs) {}
  void Run(const framework::Scope &scope,
           const platform::Place &place) const override {
-    // // get device context from pool
-    // platform::DeviceContextPool &pool =
-    //        platform::DeviceContextPool::Instance();
-    // auto &dev_ctx = *pool.Get(place);
    auto *block = Attr<framework::BlockDesc *>(kParallelBlock);
    auto *program = block->Program();
@@ -181,26 +181,16 @@ class ParallelDoGradOp : public OperatorBase {
    auto &places = scope.FindVar(Input(kPlaces))->Get<platform::PlaceList>();
    // feed output@grad
-    SplitTensorAndMoveTensorToScopes(scope, sub_scopes, places,
+    SplitTensorAndMoveTensorToScopes(
-                                     Inputs(framework::GradVarName(kOutputs)));
+        scope, const_cast<std::vector<framework::Scope *> *>(&sub_scopes),
+        places, Inputs(framework::GradVarName(kOutputs)));
    WaitOnPlaces(places);
-    // for debugging
-    for (auto &s : Inputs(framework::GradVarName(kOutputs))) {
-      VLOG(3) << s;
-      VLOG(3) << scope.FindVar(s)->Get<LoDTensor>();
-      for (auto *sub_scope : sub_scopes) {
-        VLOG(3) << sub_scope->FindVar(s)->Get<LoDTensor>();
-      }
-    }
    // exe run
    std::vector<std::future<void>> workers;
-    for (size_t place_idx = 0; place_idx < places.size(); ++place_idx) {
+    for (size_t i = 0; i < sub_scopes.size(); ++i) {
-      VLOG(3) << "Run " << place_idx;
+      auto &place = places[i];
+      auto *cur_scope = sub_scopes[i];
-      auto &place = places[place_idx];
-      auto *cur_scope = sub_scopes[place_idx];
      // execute
      workers.emplace_back(framework::Async([program, cur_scope, place, block] {
@@ -216,33 +206,38 @@ class ParallelDoGradOp : public OperatorBase {
    // merge grad
    for (auto &s : Outputs(framework::GradVarName(kParameters))) {
-      VLOG(3) << "merge grad " << s;
+      auto &result = sub_scopes[0]->FindVar(s)->Get<LoDTensor>();
+      std::string tmp_name;
-      auto &t = sub_scopes[0]->FindVar(s)->Get<LoDTensor>();
+      auto *tmp = sub_scopes[0]->Var(&tmp_name)->GetMutable<LoDTensor>();
-      VLOG(3) << t;
+      for (size_t i = 1; i < sub_scopes.size(); ++i) {
-      std::string s_buf = s + "@BUF";
+        auto &tensor_to_merge = sub_scopes[i]->FindVar(s)->Get<LoDTensor>();
-      auto *t_buf = sub_scopes[0]->Var(s_buf)->GetMutable<LoDTensor>();
+        if (!(places[i] == places[0])) {
+          framework::Copy(tensor_to_merge, places[0], tmp);
-      for (size_t place_idx = 1; place_idx < places.size(); ++place_idx) {
+        } else {
-        auto &tt = sub_scopes[place_idx]->FindVar(s)->Get<LoDTensor>();
+          tmp->ShareDataWith(tensor_to_merge);
-        VLOG(3) << place_idx;
+        }
-        VLOG(3) << tt;
-        framework::Copy(tt, places[0], t_buf);
        auto sum_op = framework::OpRegistry::CreateOp(
-            "sum", {{"X", {s, s_buf}}}, {{"Out", {s}}},
+            "sum", {{"X", {s, tmp_name}}}, {{"Out", {s}}},
            framework::AttributeMap{});
        sum_op->Run(*sub_scopes[0], places[0]);
        WaitOnPlaces(places);
      }
-      VLOG(3) << t;
+      VLOG(3) << result;
-      framework::Copy(t, place, scope.FindVar(s)->GetMutable<LoDTensor>());
+      framework::Copy(result, place, scope.FindVar(s)->GetMutable<LoDTensor>());
    }
  }
 };
+std::ostream &operator<<(std::ostream &sout,
+                         const std::vector<std::string> &strs) {
+  std::copy(strs.begin(), strs.end(),
+            std::ostream_iterator<std::string>(sout, ","));
+  return sout;
+}
 class ParallelDoGradOpDescMaker : public framework::SingleGradOpDescMaker {
 public:
  using framework::SingleGradOpDescMaker::SingleGradOpDescMaker;
@@ -283,18 +278,30 @@ class ParallelDoGradOpShapeInference : public framework::InferShapeBase {
  void operator()(framework::InferShapeContext *ctx) const override {
    std::vector<std::string> input{kParameters, kInputs};
    std::vector<std::string> output{kOutputs};
-    for (auto &s : input) {
-      PADDLE_ENFORCE(ctx->HasInputs(s));
+    PADDLE_ENFORCE(ctx->HasInputs(kParameters));
-      PADDLE_ENFORCE(ctx->HasOutputs(framework::GradVarName(s)),
+    PADDLE_ENFORCE(ctx->HasOutputs(framework::GradVarName(kParameters)));
-                     "Cannot find the gradient variable %s",
+    PADDLE_ENFORCE(ctx->HasInput(kInputs));
-                     framework::GradVarName(s));
-    }
    for (auto &s : output) {
      PADDLE_ENFORCE(ctx->HasInputs(s));
    }
-    for (auto &s : input) {
-      ctx->SetOutputsDim(framework::GradVarName(s), ctx->GetInputsDim(s));
+    ctx->SetOutputsDim(framework::GradVarName(kParameters),
+                       ctx->GetInputsDim(kParameters));
+    auto i_dims = ctx->GetInputsDim(kInputs);
+    auto ig_names = ctx->Outputs(framework::GradVarName(kInputs));
+    for (size_t i = 0; i < ig_names.size(); ++i) {
+      auto &ig_name = ig_names[i];
+      if (ig_name == framework::kEmptyVarName) {
+        continue;
+      }
+      ctx->SetDims({ig_name}, {i_dims[i]});
    }
    if (ctx->HasInputs(kParameters)) {
      PADDLE_ENFORCE(ctx->HasOutputs(framework::GradVarName(kParameters)));
      ctx->SetOutputsDim(framework::GradVarName(kParameters),

--- a/paddle/operators/pool_op.cc
+++ b/paddle/operators/pool_op.cc
@@ -64,6 +64,13 @@ void PoolOp::InferShape(framework::InferShapeContext *ctx) const {
 framework::OpKernelType PoolOp::GetExpectedKernelType(
    const framework::ExecutionContext &ctx) const {
  bool use_cudnn = ctx.Attr<bool>("use_cudnn");
+  use_cudnn &= platform::is_gpu_place(ctx.GetPlace());
+#ifdef PADDLE_WITH_CUDA
+  if (platform::is_gpu_place(ctx.GetPlace())) {
+    auto &dev_ctx = ctx.template device_context<platform::CUDADeviceContext>();
+    use_cudnn &= dev_ctx.cudnn_handle() != nullptr;
+  }
+#endif
  framework::LibraryType library_;
  if (use_cudnn) {
    library_ = framework::LibraryType::kCUDNN;
@@ -88,6 +95,13 @@ void PoolOpGrad::InferShape(framework::InferShapeContext *ctx) const {
 framework::OpKernelType PoolOpGrad::GetExpectedKernelType(
    const framework::ExecutionContext &ctx) const {
  bool use_cudnn = ctx.Attr<bool>("use_cudnn");
+  use_cudnn &= platform::is_gpu_place(ctx.GetPlace());
+#ifdef PADDLE_WITH_CUDA
+  if (platform::is_gpu_place(ctx.GetPlace())) {
+    auto &dev_ctx = ctx.template device_context<platform::CUDADeviceContext>();
+    use_cudnn &= dev_ctx.cudnn_handle() != nullptr;
+  }
+#endif
  framework::LibraryType library_;
  if (use_cudnn) {
    library_ = framework::LibraryType::kCUDNN;
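
Reviewer sketch (not part of the patch): both `GetExpectedKernelType` hunks above apply the same rule, namely that cuDNN is requested only when the attribute asks for it, the op runs on a GPU place, and the device context actually holds a cuDNN handle. The Python function below restates that rule; the function name, its parameters, and the `"plain"` fallback label are assumptions, since the `else` branch lies outside the hunk.

```python
def select_pool_library(use_cudnn_attr, on_gpu, compiled_with_cuda,
                        has_cudnn_handle):
    """Mirror the use_cudnn narrowing performed in the patch."""
    # use_cudnn &= is_gpu_place(place)
    use_cudnn = use_cudnn_attr and on_gpu
    # Only checked when built with CUDA and running on a GPU place.
    if compiled_with_cuda and on_gpu:
        use_cudnn = use_cudnn and has_cudnn_handle
    return "cudnn" if use_cudnn else "plain"
```

The practical effect is that a model configured with `use_cudnn=True` can still run on CPU, or on a GPU without a cuDNN handle, without requesting a kernel that does not exist.
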

--- a/paddle/operators/reduce_op.cc
+++ b/paddle/operators/reduce_op.cc
@@ -129,7 +129,7 @@ If reduce_all is true, just reduce along all dimensions and output a scalar.
  }
  void SetComment(std::string name, std::string op) {
-    Replace(comment_, "{ReduceOP}", name);
+    Replace(comment_, "{ReduceOp}", name);
    Replace(comment_, "{reduce}", op);
  }
 };

--- a/paddle/string/to_string.h
+++ b/paddle/string/to_string.h
@@ -15,9 +15,15 @@ limitations under the License. */
 #pragma once
 #include <sstream>
 #include <string>
+#include <typeindex>
 namespace paddle {
 namespace string {
+inline std::ostream& operator<<(std::ostream& s, const std::type_index& t) {
+  s << t.name();
+  return s;
+}
 template <typename T>
 inline std::string to_string(T v) {
  std::ostringstream sout;
@@ -25,6 +31,11 @@ inline std::string to_string(T v) {
  return sout.str();
 }
+template <>
+inline std::string to_string(std::type_index t) {
+  return t.name();
+}
 // Faster std::string/const char* type
 template <>
 inline std::string to_string(std::string v) {

--- a/python/paddle/trainer_config_helpers/evaluators.py
+++ b/python/paddle/trainer_config_helpers/evaluators.py
@@ -16,13 +16,22 @@ from paddle.trainer.config_parser import *
 from default_decorators import *
 __all__ = [
-    "evaluator_base", "classification_error_evaluator", "auc_evaluator",
+    "evaluator_base",
-    "pnpair_evaluator", "precision_recall_evaluator", "ctc_error_evaluator",
+    "classification_error_evaluator",
-    "chunk_evaluator", "sum_evaluator", "column_sum_evaluator",
+    "auc_evaluator",
-    "value_printer_evaluator", "gradient_printer_evaluator",
+    "pnpair_evaluator",
-    "maxid_printer_evaluator", "maxframe_printer_evaluator",
+    "precision_recall_evaluator",
-    "seqtext_printer_evaluator", "classification_error_printer_evaluator",
+    "ctc_error_evaluator",
-    "detection_map_evaluator"
+    "chunk_evaluator",
+    "sum_evaluator",
+    "column_sum_evaluator",
+    "value_printer_evaluator",
+    "gradient_printer_evaluator",
+    "maxid_printer_evaluator",
+    "maxframe_printer_evaluator",
+    "seqtext_printer_evaluator",
+    "classification_error_printer_evaluator",
+    "detection_map_evaluator",
 ]

--- a/python/paddle/v2/fluid/framework.py
+++ b/python/paddle/v2/fluid/framework.py
@@ -116,8 +116,8 @@ def _debug_string_(proto, throw_on_error=True):
    """
    error_fields = list()
    if not proto.IsInitialized(error_fields) and throw_on_error:
-        raise ValueError("{0} are not initialized\nThe message is {1}".format(
+        raise ValueError("{0} are not initialized.\nThe message is {1}:\n".
-            error_fields, proto))
+                         format(error_fields, proto))
    return proto.__str__()
@@ -374,12 +374,13 @@ class Operator(object):
        >>>                     outputs={"Out": [var1]})
        Args:
-            block(Block): The block has the current operator
+            block(Block): The block has the current operator.
-            desc(core.OpDesc): The protobuf description
+            desc(core.OpDesc): The protobuf description.
            type(str): The type of operator.
            inputs(dict): The input dictionary. Key is the input parameter name.
                Value is a list of variables.
-            outputs(dict): The output dictionary. Has same format with inputs
+            outputs(dict): The output dictionary, which has the same format as
+                           inputs.
            attrs(dict): The attributes dictionary. Key is attribute name. Value
                is the attribute value. The attribute type should be as same as
                the type registered in C++
@@ -436,9 +437,10 @@ class Operator(object):
            for m in proto.outputs:
                need.add(m.name)
            if not given == need:
-                raise ValueError(
+                raise ValueError(("Incorrect setting for output(s) of "
-                    "Incorrect setting for output(s) of operator \"%s\". Need: [%s] Given: [%s]"
+                                  "operator \"%s\". Need: [%s] Given: [%s]") %
-                    % (type, ", ".join(str(e) for e in need), ", ".join(
+                                 (type, ", ".join(str(e)
+                                                  for e in need), ", ".join(
                                                      str(e) for e in given)))
            for out_proto in proto.outputs:
@@ -818,9 +820,8 @@ class Program(object):
                if isinstance(t, Variable):
                    t = t.op
                else:
-                    raise ValueError(
+                    raise ValueError(("All targets of prune() can only be "
-                        "All targets of prune() can only be Variable or Operator."
+                                      "Variable or Operator."))
-                    )
            targets_idx.append([t.block.idx, t.idx])
        res = Program()

--- a/python/paddle/v2/fluid/layers/io.py
+++ b/python/paddle/v2/fluid/layers/io.py
@@ -28,9 +28,9 @@ def data(name,
    **Data Layer**
    This function takes in the input and based on whether data has
-    to be returned back as a minibatch, it creates the global variable using
+    to be returned back as a minibatch, it creates the global variable by using
    the helper functions. The global variables can be accessed by all the
-    following operations and layers in the graph.
+    following operators in the graph.
    All the input variables of this function are passed in as local variables
    to the LayerHelper constructor.

--- a/python/paddle/v2/fluid/layers/nn.py
+++ b/python/paddle/v2/fluid/layers/nn.py
@@ -50,6 +50,8 @@ __all__ = [
    'sequence_last_step',
    'dropout',
    'split',
+    'l2_normalize',
+    'matmul',
 ]
@@ -674,6 +676,7 @@ def conv2d(input,
           groups=None,
           param_attr=None,
           bias_attr=None,
+           use_cudnn=True,
           act=None):
    """
    **Convlution2D Layer**
@@ -737,6 +740,8 @@ def conv2d(input,
            connected to the second half of the input channels. Default: groups=1
        param_attr(ParamAttr): The parameters to the Conv2d Layer. Default: None
        bias_attr(ParamAttr): Bias parameter for the Conv2d layer. Default: None
+        use_cudnn(bool): Whether to use the cudnn kernel. This only takes effect
+            when the cudnn library is installed. Default: True
        act(str): Activation type. Default: None
    Returns:
@@ -772,6 +777,8 @@ def conv2d(input,
        stride = [stride, stride]
    if isinstance(padding, int):
        padding = [padding, padding]
+    if not isinstance(use_cudnn, bool):
+        raise ValueError("use_cudnn should be True or False")
    input_shape = input.shape
    filter_shape = [num_filters, num_filter_channels] + filter_size
@@ -795,9 +802,12 @@ def conv2d(input,
            'Filter': filter_param,
        },
        outputs={"Output": pre_bias},
-        attrs={'strides': stride,
+        attrs={
+            'strides': stride,
            'paddings': padding,
-               'groups': groups})
+            'groups': groups,
+            'use_cudnn': use_cudnn
+        })
    pre_act = helper.append_bias_op(pre_bias, dim_start=1, dim_end=2)
@@ -945,7 +955,9 @@ def pool2d(input,
           pool_type,
           pool_stride=None,
           pool_padding=None,
-           global_pooling=False):
+           global_pooling=False,
+           use_cudnn=True,
+           name=None):
    """
    This function adds the operator for pooling in 2 dimensions, using the
    pooling configurations mentioned in input parameters.
@@ -964,6 +976,8 @@ def pool2d(input,
        pool_stride = [pool_stride, pool_stride]
    if isinstance(pool_padding, int):
        pool_padding = [pool_padding, pool_padding]
+    if not isinstance(use_cudnn, bool):
+        raise ValueError("use_cudnn should be True or False")
    helper = LayerHelper('pool2d', **locals())
    dtype = helper.input_dtype()
@@ -978,7 +992,8 @@ def pool2d(input,
            "ksize": pool_size,
            "global_pooling": global_pooling,
            "strides": pool_stride,
-            "paddings": pool_padding
+            "paddings": pool_padding,
+            "use_cudnn": use_cudnn
        })
    return pool_out
@@ -991,7 +1006,8 @@ def batch_norm(input,
               epsilon=1e-05,
               param_attr=None,
               bias_attr=None,
-               data_layout='NCHW'):
+               data_layout='NCHW',
+               name=None):
    """
    This function helps create an operator to implement
    the BatchNorm layer using the configurations from the input parameters.
@@ -1067,7 +1083,7 @@ def batch_norm(input,
    return helper.append_activation(batch_norm_out)
-def beam_search_decode(ids, scores):
+def beam_search_decode(ids, scores, name=None):
    helper = LayerHelper('beam_search_decode', **locals())
    sentence_ids = helper.create_tmp_variable(dtype=ids.dtype)
    sentence_scores = helper.create_tmp_variable(dtype=ids.dtype)
@@ -1091,7 +1107,9 @@ def conv2d_transpose(input,
                     padding=None,
                     stride=None,
                     dilation=None,
-                     param_attr=None):
+                     param_attr=None,
+                     use_cudnn=True,
+                     name=None):
    """
    The transpose of conv2d layer.
@@ -1118,8 +1136,10 @@ def conv2d_transpose(input,
            contain two integers, (dilation_H, dilation_W). Otherwise, the
            dilation_H = dilation_W = dilation.
        param_attr: Parameter Attribute.
-        main_program(Program): the main program
+        use_cudnn(bool): Whether to use the cudnn kernel. This only takes effect
-        startup_program(Program): the startup program
+            when the cudnn library is installed. Default: True
+        name(str|None): A name for this layer(optional). If set None, the layer
+                       will be named automatically.
    Returns:
        Variable: Output image.
@@ -1146,6 +1166,10 @@ def conv2d_transpose(input,
    elif dilation is not None:
        op_attr['dilations'] = dilation
+    if not isinstance(use_cudnn, bool):
+        raise ValueError("use_cudnn should be True or False")
+    op_attr['use_cudnn'] = use_cudnn
    if filter_size is None:
        if output_size is None:
            raise ValueError("output_size must be set when filter_size is None")
@@ -1183,7 +1207,7 @@ def conv2d_transpose(input,
    return out
-def sequence_expand(x, y):
+def sequence_expand(x, y, name=None):
    """Sequence Expand Layer. This layer will expand the input variable **x**
    according to LoD information of **y**. And the following examples will
    explain how sequence_expand works:
@@ -1227,6 +1251,8 @@ def sequence_expand(x, y):
    Args:
        x (Variable): The input variable which is a Tensor or LoDTensor.
        y (Variable): The input variable which is a LoDTensor.
+        name(str|None): A name for this layer(optional). If set None, the layer
+                       will be named automatically.
    Returns:
        Variable: The expanded variable which is a LoDTensor.
@@ -1253,7 +1279,8 @@ def lstm_unit(x_t,
              cell_t_prev,
              forget_bias=0.0,
              param_attr=None,
-              bias_attr=None):
+              bias_attr=None,
+              name=None):
    """Lstm unit layer. The equation of a lstm step is:
        .. math::
@@ -1300,6 +1327,8 @@ def lstm_unit(x_t,
            initializer, name etc.
        bias_attr (ParamAttr): The attributes of bias weights, if not False,
            bias weights will be created and be set to default value.
+        name(str|None): A name for this layer(optional). If set None, the layer
+                       will be named automatically.
    Returns:
        tuple: The hidden value and cell value of lstm unit.
@@ -1365,7 +1394,7 @@ def lstm_unit(x_t,
    return h, c
-def reduce_sum(input, dim=None, keep_dim=False):
+def reduce_sum(input, dim=None, keep_dim=False, name=None):
    """
    Computes the sum of tensor elements over the given dimension.
@@ -1379,6 +1408,8 @@ def reduce_sum(input, dim=None, keep_dim=False):
        keep_dim (bool): Whether to reserve the reduced dimension in the
            output Tensor. The result tensor will have one fewer dimension
            than the :attr:`input` unless :attr:`keep_dim` is true.
+        name(str|None): A name for this layer(optional). If set None, the layer
+                       will be named automatically.
    Returns:
        Variable: The reduced Tensor variable.
@@ -1409,7 +1440,7 @@ def reduce_sum(input, dim=None, keep_dim=False):
    return out
-def reduce_mean(input, dim=None, keep_dim=False):
+def reduce_mean(input, dim=None, keep_dim=False, name=None):
    """
    Computes the mean of tensor elements over the given dimension.
@@ -1423,6 +1454,8 @@ def reduce_mean(input, dim=None, keep_dim=False):
        keep_dim (bool): Whether to reserve the reduced dimension in the
            output Tensor. The result tensor will have one fewer dimension
            than the :attr:`input` unless :attr:`keep_dim` is true.
+        name(str|None): A name for this layer(optional). If set None, the layer
+                       will be named automatically.
    Returns:
        Variable: The reduced Tensor variable.
@@ -1453,7 +1486,7 @@ def reduce_mean(input, dim=None, keep_dim=False):
    return out
-def reduce_max(input, dim=None, keep_dim=False):
+def reduce_max(input, dim=None, keep_dim=False, name=None):
    """
    Computes the maximum of tensor elements over the given dimension.
@@ -1467,6 +1500,8 @@ def reduce_max(input, dim=None, keep_dim=False):
        keep_dim (bool): Whether to reserve the reduced dimension in the
            output Tensor. The result tensor will have one fewer dimension
            than the :attr:`input` unless :attr:`keep_dim` is true.
+        name(str|None): A name for this layer(optional). If set None, the layer
+                       will be named automatically.
    Returns:
        Variable: The reduced Tensor variable.
@@ -1497,7 +1532,7 @@ def reduce_max(input, dim=None, keep_dim=False):
    return out
-def reduce_min(input, dim=None, keep_dim=False):
+def reduce_min(input, dim=None, keep_dim=False, name=None):
    """
    Computes the minimum of tensor elements over the given dimension.
@@ -1511,6 +1546,8 @@ def reduce_min(input, dim=None, keep_dim=False):
        keep_dim (bool): Whether to reserve the reduced dimension in the
            output Tensor. The result tensor will have one fewer dimension
            than the :attr:`input` unless :attr:`keep_dim` is true.
+        name(str|None): A name for this layer(optional). If set None, the layer
+                       will be named automatically.
    Returns:
        Variable: The reduced Tensor variable.
@@ -1541,9 +1578,9 @@ def reduce_min(input, dim=None, keep_dim=False):
    return out
-def split(input, num_or_sections, dim=-1):
+def split(input, num_or_sections, dim=-1, name=None):
    """
-    Splits the tensor into multiple sub-tensors.
+    Split the input tensor into multiple sub-tensors.
    Args:
        input (Variable): The input variable which is a Tensor or LoDTensor.
@@ -1555,6 +1592,8 @@ def split(input, num_or_sections, dim=-1):
            :attr:`dim` dimension orderly.
        dim (int): The dimension along which to split. If :math:`dim < 0`, the
            dimension to split along is :math:`rank(input) + dim`.
+        name(str|None): A name for this layer(optional). If set None, the layer
+                       will be named automatically.
    Returns:
        List: The list of segmented tensor variables.
@@ -1597,3 +1636,155 @@ def split(input, num_or_sections, dim=-1):
            'axis': dim
        })
    return outs
+def l2_normalize(x, axis, epsilon=1e-12, name=None):
+    """
+    **L2 normalize Layer**
+    The l2 normalize layer normalizes `x` along dimension `axis` using an L2
+    norm. For a 1-D tensor (`axis` is fixed to 0), this layer computes
+    output = x / sqrt(max(sum(x**2), epsilon))
+    For `x` with more dimensions, this layer independently normalizes each 1-D
+    slice along dimension `axis`.
+    Args:
+       x(Variable|list): The input tensor to l2_normalize layer.
+       axis(int): Dimension along which to normalize the input.
+       epsilon(float): A lower bound value for `x`'s l2 norm. sqrt(epsilon) will
+                       be used as the divisor if the l2 norm of `x` is less than
+                       sqrt(epsilon).
+       name(str|None): A name for this layer(optional). If set None, the layer
+                       will be named automatically.
+    Returns:
+        Variable: The output tensor variable.
+    Examples:
+        .. code-block:: python
+          data = fluid.layers.data(name="data",
+                                   shape=(3, 17, 13),
+                                   dtype="float32")
+          fc = fluid.layers.l2_normalize(x=data, axis=1)
+    """
+    if len(x.shape) == 1: axis = 0
+    helper = LayerHelper("l2_normalize", **locals())
+    square = helper.create_tmp_variable(dtype=x.dtype)
+    helper.append_op(type="square", inputs={"X": x}, outputs={"Out": square})
+    reduced_sum = helper.create_tmp_variable(dtype=x.dtype)
+    helper.append_op(
+        type="reduce_sum",
+        inputs={"X": square},
+        outputs={"Out": reduced_sum},
+        attrs={
+            "dim": 1 if axis is None else axis,
+            "keep_dim": True,
+            "reduce_all": False
+        })
+    # TODO(caoying) A lower bound value epsilon for the norm is needed to
+    # improve the numeric stability of reciprocal. This requires a maximum_op.
+    rsquare = helper.create_tmp_variable(dtype=x.dtype)
+    helper.append_op(
+        type="reciprocal", inputs={"X": reduced_sum}, outputs={"Out": rsquare})
+    # TODO(caoying) the current elementwise_mul operator does not support a
+    # general broadcast rule which broadcasts input(Y) to have the same
+    # dimension with Input(X) starting from a specified dimension. So this
+    # expansion is required. Once a general broadcast rule is supported, this
+    # expansion can be removed.
+    rsquare_expanded = helper.create_tmp_variable(dtype=x.dtype)
+    expand_times = [1] * len(x.shape)
+    expand_times[axis] = int(x.shape[axis])
+    helper.append_op(
+        type="expand",
+        inputs={"X": rsquare},
+        outputs={"Out": rsquare_expanded},
+        attrs={"expand_times": expand_times})
+    out = helper.create_tmp_variable(dtype=x.dtype)
+    helper.append_op(
+        type="elementwise_mul",
+        inputs={"X": x,
+                "Y": rsquare_expanded},
+        outputs={"Out": out})
+    return out
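
The formula in the `l2_normalize` docstring can be sketched in plain Python as follows. This is only an illustration under assumptions: `l2_normalize_1d` and `l2_normalize_rows` are hypothetical helpers that work on Python lists, not fluid operators, and the epsilon bound is applied as the docstring describes even though the graph in this hunk still lists it as a TODO.

```python
import math


def l2_normalize_1d(x, epsilon=1e-12):
    """Normalize a 1-D sequence by its L2 norm, following the docstring formula.

    Hypothetical helper for illustration; not part of fluid.layers.
    """
    sum_sq = sum(v * v for v in x)
    # max(...) keeps the divisor away from zero for all-zero inputs.
    divisor = math.sqrt(max(sum_sq, epsilon))
    return [v / divisor for v in x]


def l2_normalize_rows(matrix, epsilon=1e-12):
    """Normalize each row independently (axis=1 of a 2-D input)."""
    return [l2_normalize_1d(row, epsilon) for row in matrix]


print(l2_normalize_1d([3.0, 4.0]))  # [0.6, 0.8]
```

Each 1-D slice is normalized on its own, which is what "independently normalizes each 1-D slice along dimension `axis`" means for a higher-rank input.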
+def matmul(x, y, transpose_x=False, transpose_y=False, name=None):
+    """
+    Applies matrix multiplication to two tensors. Currently only rank 1 to rank
+    3 input tensors are supported.
+    The actual behavior depends on the shapes of :math:`x`, :math:`y` and the 
+    flag values of :attr:`transpose_x`, :attr:`transpose_y`. Specifically:
+    - If a transpose flag is specified, the last two dimensions of the tensor 
+      are transposed. If the tensor is rank-1 of shape :math:`[D]`, then for 
+      :math:`x` it is treated as :math:`[1, D]` in nontransposed form and as 
+      :math:`[D, 1]` in transposed form, whereas for :math:`y` it is the 
+      opposite: It is treated as :math:`[D, 1]` in nontransposed form and as 
+      :math:`[1, D]` in transposed form.
+    - After transpose, the two tensors are 2-D or 3-D and matrix multiplication
+      is performed in the following way.
+      - If both are 2-D, they are multiplied like conventional matrices.
+      - If either is 3-D, it is treated as a stack of matrices residing in the 
+        last two dimensions and a batched matrix multiply supporting broadcast 
+        applies on the two tensors.
+    Also note that if the raw tensor :math:`x` or :math:`y` is rank-1 and
+    nontransposed, the prepended or appended dimension :math:`1` will be
+    removed after matrix multiplication.
+    Args:
+        x (Variable): The input variable which is a Tensor or LoDTensor.
+        y (Variable): The input variable which is a Tensor or LoDTensor.
+        transpose_x (bool): Whether to transpose :math:`x` before multiplication.
+        transpose_y (bool): Whether to transpose :math:`y` before multiplication.
+        name(str|None): A name for this layer(optional). If set None, the layer 
+            will be named automatically.
+    Returns:
+        Variable: The product Tensor variable.
+    Examples:
+        .. code-block:: python
+            # Examples to clarify shapes of the inputs and output
+            # x: [B, M, K], y: [B, K, N]
+            fluid.layers.matmul(x, y)  # out: [B, M, N]
+            # x: [B, M, K], y: [K, N]
+            fluid.layers.matmul(x, y)  # out: [B, M, N]
+            # x: [B, M, K], y: [K]
+            fluid.layers.matmul(x, y)  # out: [B, M]
+            # x: [M, K], y: [K, N]
+            fluid.layers.matmul(x, y)  # out: [M, N]
+            # x: [K], y: [K]
+            fluid.layers.matmul(x, y)  # out: [1]
+            # x: [M], y: [N]
+            fluid.layers.matmul(x, y, True, True)  # out: [M, N]
+    """
+    helper = LayerHelper('matmul', **locals())
+    assert max(
+        len(x.shape), len(y.shape)
+    ) <= 3, 'Currently only rank 1 to rank 3 input tensors are supported.'
+    out = helper.create_tmp_variable(dtype=helper.input_dtype())
+    helper.append_op(
+        type='matmul',
+        inputs={'X': x,
+                'Y': y},
+        outputs={'Out': out},
+        attrs={'transpose_X': transpose_x,
+               'transpose_Y': transpose_y})
+    return out
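
The shape rules in the `matmul` docstring above can be sketched as a small shape-inference function. This is a rough illustration, not the operator's actual implementation: `matmul_out_shape` is a hypothetical name, batch broadcasting is simplified to "equal batch dimension or none", and returning `[1]` for the vector-times-vector case is an assumption taken from the docstring example.

```python
def matmul_out_shape(x_shape, y_shape, transpose_x=False, transpose_y=False):
    """Infer the output shape from the docstring rules (hypothetical helper)."""
    x, y = list(x_shape), list(y_shape)
    if max(len(x), len(y)) > 3:
        raise ValueError("only rank 1 to rank 3 inputs are supported")
    x_was_vec, y_was_vec = len(x) == 1, len(y) == 1
    # Rank-1 inputs: x is treated as a row [1, D], y as a column [D, 1].
    if x_was_vec:
        x = [1, x[0]]
    if y_was_vec:
        y = [y[0], 1]
    # A transpose flag swaps the last two dimensions.
    if transpose_x:
        x[-2], x[-1] = x[-1], x[-2]
    if transpose_y:
        y[-2], y[-1] = y[-1], y[-2]
    if x[-1] != y[-2]:
        raise ValueError("inner dimensions do not match")
    batch_x, batch_y = x[:-2], y[:-2]
    # Simplification: a batch dimension must match or be absent on one side.
    if batch_x and batch_y and batch_x != batch_y:
        raise ValueError("batch dimensions do not match")
    out = (batch_x or batch_y) + [x[-2], y[-1]]
    # Drop the dimension that was added for a nontransposed rank-1 input.
    if x_was_vec and not transpose_x:
        del out[-2]
    if y_was_vec and not transpose_y:
        del out[-1]
    return out or [1]


print(matmul_out_shape([2, 3, 4], [4]))         # [2, 3]
print(matmul_out_shape([3], [5], True, True))   # [3, 5]
```

With B=2, M=3, K=4, N=5 the function reproduces every shape listed in the docstring examples.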
--- a/python/paddle/v2/fluid/layers/ops.py
+++ b/python/paddle/v2/fluid/layers/ops.py
@@ -55,6 +55,8 @@ __all__ = [
    'elementwise_div',
    'elementwise_sub',
    'elementwise_mul',
+    'elementwise_max',
+    'elementwise_min',
    'clip',
    'sequence_softmax',
 ] + __activations__

--- a/python/paddle/v2/fluid/nets.py
+++ b/python/paddle/v2/fluid/nets.py
@@ -17,6 +17,7 @@ __all__ = [
    "simple_img_conv_pool",
    "sequence_conv_pool",
    "glu",
+    "dot_product_attention",
 ]
@@ -27,19 +28,22 @@ def simple_img_conv_pool(input,
                         pool_stride,
                         act,
                         param_attr=None,
-                         pool_type='max'):
+                         pool_type='max',
+                         use_cudnn=True):
    conv_out = layers.conv2d(
        input=input,
        num_filters=num_filters,
        filter_size=filter_size,
        param_attr=param_attr,
-        act=act)
+        act=act,
+        use_cudnn=use_cudnn)
    pool_out = layers.pool2d(
        input=conv_out,
        pool_size=pool_size,
        pool_type=pool_type,
-        pool_stride=pool_stride)
+        pool_stride=pool_stride,
+        use_cudnn=use_cudnn)
    return pool_out
@@ -53,7 +57,8 @@ def img_conv_group(input,
                   conv_with_batchnorm=False,
                   conv_batchnorm_drop_rate=None,
                   pool_stride=1,
-                   pool_type=None):
+                   pool_type=None,
+                   use_cudnn=True):
    """
    Image Convolution Group, Used for vgg net.
    """
@@ -84,7 +89,8 @@ def img_conv_group(input,
            filter_size=conv_filter_size[i],
            padding=conv_padding[i],
            param_attr=param_attr[i],
-            act=local_conv_act)
+            act=local_conv_act,
+            use_cudnn=use_cudnn)
        if conv_with_batchnorm[i]:
            tmp = layers.batch_norm(input=tmp, act=conv_act)
@@ -96,7 +102,8 @@ def img_conv_group(input,
        input=tmp,
        pool_size=pool_size,
        pool_type=pool_type,
-        pool_stride=pool_stride)
+        pool_stride=pool_stride,
+        use_cudnn=use_cudnn)
    return pool_out
@@ -150,3 +157,55 @@ def glu(input, dim=-1):
    act_b = layers.sigmoid(x=b)
    out = layers.elementwise_mul(x=a, y=act_b)
    return out
+def dot_product_attention(querys, keys, values):
+    """
+    The dot-product attention.
+    Attention mechanism can be seen as mapping a query and a set of key-value 
+    pairs to an output. The output is computed as a weighted sum of the values, 
+    where the weight assigned to each value is computed by a compatibility 
+    function (dot-product here) of the query with the corresponding key.
+    The dot-product attention can be implemented through (batch) matrix
+    multiplication as follows:
+        .. math::
+            Attention(Q, K, V)= softmax(QK^\mathrm{T})V
+    Refer to `Attention Is All You Need 
+    <https://arxiv.org/pdf/1706.03762.pdf>`_.
+    Note that batch data containing sequences with different lengths is not
+    supported by this function because of the (batch) matrix multiplication.
+    Args:
+        querys (Variable): The input variable which is a Tensor or LoDTensor.
+        keys (Variable): The input variable which is a Tensor or LoDTensor.
+        values (Variable): The input variable which is a Tensor or LoDTensor.
+    Returns:
+        tuple: The Tensor variables representing the output and attention scores.
+    Examples:
+        .. code-block:: python
+            # Suppose q, k, v are tensor variables with the following shape:
+            # q: [3, 5, 9], k: [3, 6, 9], v: [3, 6, 10]
+            out, attn_scores = fluid.nets.dot_product_attention(q, k, v)
+            out.shape  # [3, 5, 10]
+            attn_scores.shape  # [3, 5, 6]
+    """
+    assert keys.shape[-2] == values.shape[
+        -2], 'The shapes of keys and values mismatch.'
+    assert querys.shape[-1] == keys.shape[
+        -1], 'The shapes of querys and keys mismatch.'
+    product = layers.matmul(x=querys, y=keys, transpose_y=True)
+    attn_scores = layers.reshape(
+        x=layers.reshape(
+            x=product, shape=[-1, product.shape[-1]], act='softmax'),
+        shape=product.shape)
+    out = layers.matmul(attn_scores, values)
+    return out, attn_scores
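
The computation `softmax(QK^T)V` described in the docstring above can be sketched in plain Python on nested lists. This is a minimal illustration under assumptions, not the fluid implementation: the helper names are hypothetical, inputs are lists shaped `[batch, length, dim]`, and the softmax is taken over the last dimension of the score matrix, matching the reshape-then-softmax step in the hunk.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(m):
    """Transpose a 2-D list."""
    return [list(col) for col in zip(*m)]


def matmul2d(a, b):
    """Multiply a [M, K] list by a [K, N] list."""
    cols = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def dot_product_attention(queries, keys, values):
    """Return (outputs, attention scores) for batched nested-list inputs."""
    outs, scores = [], []
    for q, k, v in zip(queries, keys, values):
        product = matmul2d(q, transpose(k))      # [len_q, len_k]
        attn = [softmax(r) for r in product]     # each row sums to 1
        scores.append(attn)
        outs.append(matmul2d(attn, v))           # [len_q, dim_v]
    return outs, scores


# All-zero keys give uniform weights, so the output is the mean of the values.
out, attn = dot_product_attention([[[1.0, 0.0]]],
                                  [[[0.0, 0.0], [0.0, 0.0]]],
                                  [[[2.0], [4.0]]])
print(attn)  # [[[0.5, 0.5]]]
print(out)   # [[[3.0]]]
```

Because every batch item is multiplied as a dense matrix, all sequences in a batch must share the same length, which is the restriction the docstring notes.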
--- a/python/paddle/v2/fluid/tests/test_dynrnn_static_input.py
+++ b/python/paddle/v2/fluid/tests/test_dynrnn_static_input.py
+#  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
+#
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
 import unittest
 import paddle.v2 as paddle
 import paddle.v2.fluid.core as core

--- a/python/paddle/v2/fluid/tests/test_edit_distance_op.py
+++ b/python/paddle/v2/fluid/tests/test_edit_distance_op.py
+#  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
+#
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
 import unittest
 import numpy as np
 from op_test import OpTest

--- a/python/paddle/v2/fluid/tests/test_elementwise_add_op.py
+++ b/python/paddle/v2/fluid/tests/test_elementwise_add_op.py
 #  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
 #
-#Licensed under the Apache License, Version 2.0 (the "License");
+# Licensed under the Apache License, Version 2.0 (the "License");
-#you may not use this file except in compliance with the License.
+# you may not use this file except in compliance with the License.
-#You may obtain a copy of the License at
+# You may obtain a copy of the License at
 #
 #    http://www.apache.org/licenses/LICENSE-2.0
 #
-#Unless required by applicable law or agreed to in writing, software
+# Unless required by applicable law or agreed to in writing, software
-#distributed under the License is distributed on an "AS IS" BASIS,
+# distributed under the License is distributed on an "AS IS" BASIS,
-#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-#See the License for the specific language governing permissions and
+# See the License for the specific language governing permissions and
-#limitations under the License.
+# limitations under the License.
 import unittest
 import numpy as np
 from op_test import OpTest
@@ -40,6 +40,16 @@ class TestElementwiseOp(OpTest):
            ['X'], 'Out', max_relative_error=0.005, no_grad_set=set('Y'))
+class TestElementwiseAddOp_scalar(TestElementwiseOp):
+    def setUp(self):
+        self.op_type = "elementwise_add"
+        self.inputs = {
+            'X': np.random.rand(2, 3, 4).astype(np.float32),
+            'Y': np.random.rand(1).astype(np.float32)
+        }
+        self.outputs = {'Out': self.inputs['X'] + self.inputs['Y']}
 class TestElementwiseAddOp_Vector(TestElementwiseOp):
    def setUp(self):
        self.op_type = "elementwise_add"

--- a/python/paddle/v2/fluid/tests/test_elementwise_div_op.py
+++ b/python/paddle/v2/fluid/tests/test_elementwise_div_op.py
 #  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
 #
-#Licensed under the Apache License, Version 2.0 (the "License");
+# Licensed under the Apache License, Version 2.0 (the "License");
-#you may not use this file except in compliance with the License.
+# you may not use this file except in compliance with the License.
-#You may obtain a copy of the License at
+# You may obtain a copy of the License at
 #
 #    http://www.apache.org/licenses/LICENSE-2.0
 #
-#Unless required by applicable law or agreed to in writing, software
+# Unless required by applicable law or agreed to in writing, software
-#distributed under the License is distributed on an "AS IS" BASIS,
+# distributed under the License is distributed on an "AS IS" BASIS,
-#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-#See the License for the specific language governing permissions and
+# See the License for the specific language governing permissions and
-#limitations under the License.
+# limitations under the License.
 import unittest
 import numpy as np
 from op_test import OpTest
@@ -45,6 +45,16 @@ class ElementwiseDivOp(OpTest):
            ['X'], 'Out', max_relative_error=0.05, no_grad_set=set('Y'))
+class TestElementwiseDivOp_scalar(ElementwiseDivOp):
+    def setUp(self):
+        self.op_type = "elementwise_div"
+        self.inputs = {
+            'X': np.random.uniform(0.1, 1, [2, 3, 4]).astype(np.float32),
+            'Y': np.random.uniform(0.1, 1, [1]).astype(np.float32)
+        }
+        self.outputs = {'Out': self.inputs['X'] / self.inputs['Y']}
 class TestElementwiseDivOp_Vector(ElementwiseDivOp):
    def setUp(self):
        self.op_type = "elementwise_div"

--- a/python/paddle/v2/fluid/tests/test_elementwise_max_op.py
+++ b/python/paddle/v2/fluid/tests/test_elementwise_max_op.py
+#  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
+#
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
+import unittest
+import numpy as np
+from op_test import OpTest
+class TestElementwiseOp(OpTest):
+    def setUp(self):
+        self.op_type = "elementwise_max"
+        # If x and y have the same value, the max() is not differentiable.
+        # So we generate test data by the following method
+        # to avoid them being too close to each other.
+        x = np.random.uniform(0.1, 1, [13, 17]).astype("float32")
+        sgn = np.random.choice([-1, 1], [13, 17]).astype("float32")
+        y = x + sgn * np.random.uniform(0.1, 1, [13, 17]).astype("float32")
+        self.inputs = {'X': x, 'Y': y}
+        self.outputs = {'Out': np.maximum(self.inputs['X'], self.inputs['Y'])}
+    def test_check_output(self):
+        self.check_output()
+    def test_check_grad_normal(self):
+        self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.005)
+    def test_check_grad_ingore_x(self):
+        self.check_grad(
+            ['Y'], 'Out', max_relative_error=0.005, no_grad_set=set("X"))
+    def test_check_grad_ingore_y(self):
+        self.check_grad(
+            ['X'], 'Out', max_relative_error=0.005, no_grad_set=set('Y'))
+class TestElementwiseMaxOp_scalar(TestElementwiseOp):
+    def setUp(self):
+        self.op_type = "elementwise_max"
+        x = np.random.random_integers(-5, 5, [2, 3, 4]).astype("float32")
+        y = np.array([0.5]).astype("float32")
+        self.inputs = {'X': x, 'Y': y}
+        self.outputs = {'Out': np.maximum(self.inputs['X'], self.inputs['Y'])}
+class TestElementwiseMaxOp_Vector(TestElementwiseOp):
+    def setUp(self):
+        self.op_type = "elementwise_max"
+        x = np.random.random((32, )).astype("float32")
+        sgn = np.random.choice([-1, 1], (32, )).astype("float32")
+        y = x + sgn * np.random.uniform(0.1, 1, (32, )).astype("float32")
+        self.inputs = {'X': x, 'Y': y}
+        self.outputs = {'Out': np.maximum(self.inputs['X'], self.inputs['Y'])}
+class TestElementwiseMaxOp_broadcast_0(TestElementwiseOp):
+    def setUp(self):
+        self.op_type = "elementwise_max"
+        x = np.random.uniform(0.5, 1, (2, 3, 4)).astype(np.float32)
+        sgn = np.random.choice([-1, 1], (2, )).astype(np.float32)
+        y = x[:, 0, 0] + sgn * \
+            np.random.uniform(1, 2, (2, )).astype(np.float32)
+        self.inputs = {'X': x, 'Y': y}
+        self.attrs = {'axis': 0}
+        self.outputs = {
+            'Out':
+            np.maximum(self.inputs['X'], self.inputs['Y'].reshape(2, 1, 1))
+        }
+class TestElementwiseMaxOp_broadcast_1(TestElementwiseOp):
+    def setUp(self):
+        self.op_type = "elementwise_max"
+        x = np.random.uniform(0.5, 1, (2, 3, 4)).astype(np.float32)
+        sgn = np.random.choice([-1, 1], (3, )).astype(np.float32)
+        y = x[0, :, 0] + sgn * \
+            np.random.uniform(1, 2, (3, )).astype(np.float32)
+        self.inputs = {'X': x, 'Y': y}
+        self.attrs = {'axis': 1}
+        self.outputs = {
+            'Out':
+            np.maximum(self.inputs['X'], self.inputs['Y'].reshape(1, 3, 1))
+        }
+class TestElementwiseMaxOp_broadcast_2(TestElementwiseOp):
+    def setUp(self):
+        self.op_type = "elementwise_max"
+        x = np.random.uniform(0.5, 1, (2, 3, 4)).astype(np.float32)
+        sgn = np.random.choice([-1, 1], (4, )).astype(np.float32)
+        y = x[0, 0, :] + sgn * \
+            np.random.uniform(1, 2, (4, )).astype(np.float32)
+        self.inputs = {'X': x, 'Y': y}
+        self.outputs = {
+            'Out':
+            np.maximum(self.inputs['X'], self.inputs['Y'].reshape(1, 1, 4))
+        }
+class TestElementwiseMaxOp_broadcast_3(TestElementwiseOp):
+    def setUp(self):
+        self.op_type = "elementwise_max"
+        x = np.random.uniform(0.5, 1, (2, 3, 4, 5)).astype(np.float32)
+        sgn = np.random.choice([-1, 1], (3, 4)).astype(np.float32)
+        y = x[0, :, :, 0] + sgn * \
+            np.random.uniform(1, 2, (3, 4)).astype(np.float32)
+        self.inputs = {'X': x, 'Y': y}
+        self.attrs = {'axis': 1}
+        self.outputs = {
+            'Out':
+            np.maximum(self.inputs['X'], self.inputs['Y'].reshape(1, 3, 4, 1))
+        }
+if __name__ == '__main__':
+    unittest.main()
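Note on the test data above: `y` is built as `x` plus a signed offset so that no pair of elements is close to a tie. At a tie the gradient of max/min is undefined, which makes a numeric gradient check unreliable. A minimal sketch of that construction follows, assuming a seeded NumPy generator; the `make_separated_pair` helper and the seed are illustrative and not part of this change.

```python
import numpy as np


def make_separated_pair(shape, low=0.1, high=1.0, seed=0):
    """Return float32 arrays x, y with |x - y| >= low, up to float32 rounding."""
    rng = np.random.default_rng(seed)  # the seed is an assumption, for repeatability
    x = rng.uniform(0.1, 1.0, shape).astype("float32")
    # A random sign decides whether y lies above or below x.
    sgn = rng.choice([-1, 1], shape).astype("float32")
    y = x + sgn * rng.uniform(low, high, shape).astype("float32")
    return x, y


x, y = make_separated_pair((13, 17))
# Measure the element-wise gap in float64 to avoid extra rounding.
gap = np.abs(x.astype("float64") - y.astype("float64"))
```

Because the offset is drawn from `[low, high)`, the smallest entry of `gap` stays at roughly `low`, so a finite-difference step much smaller than `low` cannot flip which operand is selected.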
--- a/python/paddle/v2/fluid/tests/test_elementwise_min_op.py
+++ b/python/paddle/v2/fluid/tests/test_elementwise_min_op.py
+#  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
+#
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
+import unittest
+import numpy as np
+from op_test import OpTest
+class TestElementwiseOp(OpTest):
+    def setUp(self):
+        self.op_type = "elementwise_min"
+        # If x and y have the same value, the min() is not differentiable.
+        # So we generate test data by the following method
+        # to avoid them being too close to each other.
+        x = np.random.uniform(0.1, 1, [13, 17]).astype("float32")
+        sgn = np.random.choice([-1, 1], [13, 17]).astype("float32")
+        y = x + sgn * np.random.uniform(0.1, 1, [13, 17]).astype("float32")
+        self.inputs = {'X': x, 'Y': y}
+        self.outputs = {'Out': np.minimum(self.inputs['X'], self.inputs['Y'])}
+    def test_check_output(self):
+        self.check_output()
+    def test_check_grad_normal(self):
+        self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.005)
+    def test_check_grad_ignore_x(self):
+        self.check_grad(
+            ['Y'], 'Out', max_relative_error=0.005, no_grad_set=set("X"))
+    def test_check_grad_ignore_y(self):
+        self.check_grad(
+            ['X'], 'Out', max_relative_error=0.005, no_grad_set=set('Y'))
+class TestElementwiseMinOp_scalar(TestElementwiseOp):
+    def setUp(self):
+        self.op_type = "elementwise_min"
+        x = np.random.randint(-5, 6, [2, 3, 4]).astype("float32")
+        y = np.array([0.5]).astype("float32")
+        self.inputs = {'X': x, 'Y': y}
+        self.outputs = {'Out': np.minimum(self.inputs['X'], self.inputs['Y'])}
+class TestElementwiseMinOp_Vector(TestElementwiseOp):
+    def setUp(self):
+        self.op_type = "elementwise_min"
+        x = np.random.random((32, )).astype("float32")
+        sgn = np.random.choice([-1, 1], (32, )).astype("float32")
+        y = x + sgn * np.random.uniform(0.1, 1, (32, )).astype("float32")
+        self.inputs = {'X': x, 'Y': y}
+        self.outputs = {'Out': np.minimum(self.inputs['X'], self.inputs['Y'])}
+class TestElementwiseMinOp_broadcast_0(TestElementwiseOp):
+    def setUp(self):
+        self.op_type = "elementwise_min"
+        x = np.random.uniform(0.5, 1, (2, 3, 4)).astype(np.float32)
+        sgn = np.random.choice([-1, 1], (2, )).astype(np.float32)
+        y = x[:, 0, 0] + sgn * \
+            np.random.uniform(1, 2, (2, )).astype(np.float32)
+        self.inputs = {'X': x, 'Y': y}
+        self.attrs = {'axis': 0}
+        self.outputs = {
+            'Out':
+            np.minimum(self.inputs['X'], self.inputs['Y'].reshape(2, 1, 1))
+        }
+class TestElementwiseMinOp_broadcast_1(TestElementwiseOp):
+    def setUp(self):
+        self.op_type = "elementwise_min"
+        x = np.random.uniform(0.5, 1, (2, 3, 4)).astype(np.float32)
+        sgn = np.random.choice([-1, 1], (3, )).astype(np.float32)
+        y = x[0, :, 0] + sgn * \
+            np.random.uniform(1, 2, (3, )).astype(np.float32)
+        self.inputs = {'X': x, 'Y': y}
+        self.attrs = {'axis': 1}
+        self.outputs = {
+            'Out':
+            np.minimum(self.inputs['X'], self.inputs['Y'].reshape(1, 3, 1))
+        }
+class TestElementwiseMinOp_broadcast_2(TestElementwiseOp):
+    def setUp(self):
+        self.op_type = "elementwise_min"
+        x = np.random.uniform(0.5, 1, (2, 3, 4)).astype(np.float32)
+        sgn = np.random.choice([-1, 1], (4, )).astype(np.float32)
+        y = x[0, 0, :] + sgn * \
+            np.random.uniform(1, 2, (4, )).astype(np.float32)
+        self.inputs = {'X': x, 'Y': y}
+        self.outputs = {
+            'Out':
+            np.minimum(self.inputs['X'], self.inputs['Y'].reshape(1, 1, 4))
+        }
+class TestElementwiseMinOp_broadcast_3(TestElementwiseOp):
+    def setUp(self):
+        self.op_type = "elementwise_min"
+        x = np.random.uniform(0.5, 1, (2, 3, 4, 5)).astype(np.float32)
+        sgn = np.random.choice([-1, 1], (3, 4)).astype(np.float32)
+        y = x[0, :, :, 0] + sgn * \
+            np.random.uniform(1, 2, (3, 4)).astype(np.float32)
+        self.inputs = {'X': x, 'Y': y}
+        self.attrs = {'axis': 1}
+        self.outputs = {
+            'Out':
+            np.minimum(self.inputs['X'], self.inputs['Y'].reshape(1, 3, 4, 1))
+        }
+if __name__ == '__main__':
+    unittest.main()
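Note on the broadcast cases above: each expected output reshapes `Y` so that its dimensions line up with `X` starting at `axis`, with size-1 dimensions on either side. The sketch below reproduces those reshapes in plain NumPy. The helper name `expand_y_like_axis` is hypothetical, and treating `axis == -1` as "align with the trailing dimensions" is an assumption inferred from the `broadcast_2` case, which sets no `axis` attribute.

```python
import numpy as np


def expand_y_like_axis(x, y, axis=-1):
    """Reshape y so NumPy broadcasting matches it to x starting at `axis`.

    Assumption: axis == -1 aligns y with the trailing dimensions of x.
    """
    if axis == -1:
        axis = x.ndim - y.ndim
    # Pad with size-1 dimensions before and after the span covered by y.
    new_shape = (1,) * axis + y.shape + (1,) * (x.ndim - axis - y.ndim)
    return y.reshape(new_shape)


x = np.ones((2, 3, 4, 5), dtype=np.float32)
y = np.zeros((3, 4), dtype=np.float32)
out = np.minimum(x, expand_y_like_axis(x, y, axis=1))
```

With `X` of shape `(2, 3, 4, 5)`, `Y` of shape `(3, 4)` and `axis=1`, the helper yields shape `(1, 3, 4, 1)`, matching the `reshape(1, 3, 4, 1)` call in the `broadcast_3` case.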
--- a/python/paddle/v2/fluid/tests/test_elementwise_mul_op.py
+++ b/python/paddle/v2/fluid/tests/test_elementwise_mul_op.py
 #  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
 #
-#Licensed under the Apache License, Version 2.0 (the "License");
+# Licensed under the Apache License, Version 2.0 (the "License");
-#you may not use this file except in compliance with the License.
+# you may not use this file except in compliance with the License.
-#You may obtain a copy of the License at
+# You may obtain a copy of the License at
 #
 #    http://www.apache.org/licenses/LICENSE-2.0
 #
-#Unless required by applicable law or agreed to in writing, software
+# Unless required by applicable law or agreed to in writing, software
-#distributed under the License is distributed on an "AS IS" BASIS,
+# distributed under the License is distributed on an "AS IS" BASIS,
-#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-#See the License for the specific language governing permissions and
+# See the License for the specific language governing permissions and
-#limitations under the License.
+# limitations under the License.
 import unittest
 import numpy as np
 from op_test import OpTest
@@ -38,6 +38,16 @@ class ElementwiseMulOp(OpTest):
        self.check_grad(['X'], 'Out', no_grad_set=set('Y'))
+class TestElementwiseMulOp_scalar(ElementwiseMulOp):
+    def setUp(self):
+        self.op_type = "elementwise_mul"
+        self.inputs = {
+            'X': np.random.rand(2, 3, 4).astype(np.float32),
+            'Y': np.random.rand(1).astype(np.float32)
+        }
+        self.outputs = {'Out': self.inputs['X'] * self.inputs['Y']}
 class TestElementwiseMulOp_Vector(ElementwiseMulOp):
    def setUp(self):
        self.op_type = "elementwise_mul"

--- a/python/paddle/v2/fluid/tests/test_elementwise_sub_op.py
+++ b/python/paddle/v2/fluid/tests/test_elementwise_sub_op.py
 #  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
 #
-#Licensed under the Apache License, Version 2.0 (the "License");
+# Licensed under the Apache License, Version 2.0 (the "License");
-#you may not use this file except in compliance with the License.
+# you may not use this file except in compliance with the License.
-#You may obtain a copy of the License at
+# You may obtain a copy of the License at
 #
 #    http://www.apache.org/licenses/LICENSE-2.0
 #
-#Unless required by applicable law or agreed to in writing, software
+# Unless required by applicable law or agreed to in writing, software
-#distributed under the License is distributed on an "AS IS" BASIS,
+# distributed under the License is distributed on an "AS IS" BASIS,
-#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-#See the License for the specific language governing permissions and
+# See the License for the specific language governing permissions and
-#limitations under the License.
+# limitations under the License.
 import unittest
 import numpy as np
 from op_test import OpTest
@@ -40,6 +40,16 @@ class TestElementwiseOp(OpTest):
            ['X'], 'Out', max_relative_error=0.005, no_grad_set=set('Y'))
+class TestElementwiseSubOp_scalar(TestElementwiseOp):
+    def setUp(self):
+        self.op_type = "elementwise_sub"
+        self.inputs = {
+            'X': np.random.rand(2, 3, 4).astype(np.float32),
+            'Y': np.random.rand(1).astype(np.float32)
+        }
+        self.outputs = {'Out': self.inputs['X'] - self.inputs['Y']}
 class TestElementwiseSubOp_Vector(TestElementwiseOp):
    def setUp(self):
        self.op_type = "elementwise_sub"

--- a/python/paddle/v2/fluid/tests/test_matmul_op.py
+++ b/python/paddle/v2/fluid/tests/test_matmul_op.py
@@ -96,18 +96,18 @@ class Generator(object):
        self.outputs = {'Out': Out}
    def test_check_output(self):
-        self.check_output(atol=1e-2)
+        self.check_output(atol=1e-3)
    def test_check_grad_normal(self):
-        self.check_grad(['X', 'Y'], 'Out', max_relative_error=0.5)
+        self.check_grad(['X', 'Y'], 'Out', max_relative_error=1e-3)
    def test_check_grad_ignore_x(self):
        self.check_grad(
-            ['Y'], 'Out', max_relative_error=0.5, no_grad_set=set("X"))
+            ['Y'], 'Out', max_relative_error=1e-3, no_grad_set=set("X"))
    def test_check_grad_ignore_y(self):
        self.check_grad(
-            ['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y'))
+            ['X'], 'Out', max_relative_error=1e-3, no_grad_set=set('Y'))
 # Generate test cases for all possibilities
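Note on the tightened tolerances above: a relative-error bound of 1e-3 is a far stricter check than 0.5. The sketch below shows one generic way to compare an analytic matmul gradient against central differences. It is not the `OpTest.check_grad` implementation, which is not shown in this diff; `numeric_grad`, the shapes, the step size and the seed are all illustrative assumptions.

```python
import numpy as np


def numeric_grad(f, x, delta=1e-5):
    """Central-difference gradient of the scalar f() with respect to array x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + delta
        hi = f()
        x[idx] = orig - delta
        lo = f()
        x[idx] = orig  # restore the perturbed entry
        grad[idx] = (hi - lo) / (2 * delta)
    return grad


rng = np.random.default_rng(0)
X = rng.random((3, 4))
Y = rng.random((4, 2))


def loss():
    return (X @ Y).sum()


# d sum(X @ Y) / dX is the upstream gradient (all ones) times Y transposed.
analytic = np.ones((3, 2)) @ Y.T
numeric = numeric_grad(loss, X)
rel_err = np.abs(analytic - numeric).max() / np.abs(numeric).max()
```

In float64 the two gradients agree far below the 1e-3 bound, since the loss is linear in `X`. Float32 kernels leave much less headroom.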

--- a/python/paddle/v2/fluid/tests/test_normalization_wrapper.py
+++ b/python/paddle/v2/fluid/tests/test_normalization_wrapper.py
+#  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
+#
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
+import unittest
+import paddle.v2.fluid as fluid
+import paddle.v2.fluid.core as core
+import numpy as np
+class TestNormalization(unittest.TestCase):
+    data_desc = {"name": "input", "shape": (2, 3, 7)}
+    def gen_random_input(self):
+        """Generate random input data.
+        """
+        self.data = np.random.random(
+            size=self.data_desc["shape"]).astype("float32")
+    def set_program(self, axis, epsilon):
+        """Build the test program.
+        """
+        data = fluid.layers.data(
+            name=self.data_desc["name"],
+            shape=self.data_desc["shape"],
+            dtype="float32",
+            append_batch_size=False)
+        data.stop_gradient = False
+        l2_norm = fluid.layers.l2_normalize(x=data, axis=axis, epsilon=epsilon)
+        out = fluid.layers.reduce_sum(l2_norm, dim=None)
+        fluid.backward.append_backward(loss=out)
+        self.fetch_list = [l2_norm]
+    def run_program(self):
+        """Run the test program.
+        """
+        places = [core.CPUPlace()]
+        if core.is_compile_gpu():
+            places.append(core.CUDAPlace(0))
+        for place in places:
+            self.set_inputs(place)
+            exe = fluid.Executor(place)
+            output = exe.run(fluid.default_main_program(),
+                             feed=self.inputs,
+                             fetch_list=self.fetch_list,
+                             return_numpy=True)
+            self.op_output = output
+    def set_inputs(self, place):
+        """Set the randomly generated data to the test program.
+        """
+        self.inputs = {}
+        tensor = fluid.Tensor()
+        tensor.set(self.data, place)
+        self.inputs[self.data_desc["name"]] = tensor
+    def l2_normalize(self, data, axis, epsilon):
+        """ Compute the groundtruth.
+        """
+        output = data * np.reciprocal(
+            np.sum(np.square(data), axis=axis, keepdims=True))
+        return output
+    def test_l2_normalize(self):
+        """ Test the python wrapper for l2_normalize.
+        """
+        axis = 1
+        #TODO(caoying) epsilon is not supported due to lack of a maximum_op.
+        epsilon = 1e-6
+        self.gen_random_input()
+        self.set_program(axis, epsilon)
+        self.run_program()
+        expect_output = self.l2_normalize(self.data, axis, epsilon)
+        # check output
+        self.assertTrue(np.allclose(self.op_output, expect_output, atol=0.001))
+if __name__ == '__main__':
+    unittest.main()
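Note on the ground truth above: the `l2_normalize` helper in this test scales by the reciprocal of the sum of squares. The conventional L2 normalization divides by the square root of that sum, clamped from below by `epsilon`. The sketch below shows the conventional form for comparison only. The function name and the use of `np.maximum` for the clamp are assumptions, and nothing here claims which behavior the operator is meant to have.

```python
import numpy as np


def l2_normalize_conventional(data, axis, epsilon=1e-6):
    """Divide data by its L2 norm along `axis`, clamping the norm at epsilon."""
    norm = np.sqrt(np.sum(np.square(data), axis=axis, keepdims=True))
    return data / np.maximum(norm, epsilon)


rng = np.random.default_rng(0)  # the seed is an assumption, for repeatability
data = rng.random((2, 3, 7)).astype("float32")
out = l2_normalize_conventional(data, axis=1)
norms = np.linalg.norm(out, axis=1)
```

Each slice along `axis` then has unit L2 norm, and an all-zero input stays finite because of the clamp.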
--- a/python/paddle/v2/fluid/tests/test_parallel_op.py
+++ b/python/paddle/v2/fluid/tests/test_parallel_op.py
@@ -151,24 +151,28 @@ class BaseParallelForTest(unittest.TestCase):
 class ParallelOpTest(BaseParallelForTest):
-    def test_simple_fc(self):
+    @staticmethod
    def __network__():
        x = fluid.layers.data(shape=[784], dtype='float32', name='img')
-            # FIXME: This is a bug of parallel.do
-            x.stop_gradient = False
        x = yield x
        hidden = fluid.layers.fc(input=x, size=200, param_attr='fc1.w')
        loss = fluid.layers.mean(x=hidden)
        yield loss
+    def test_simple_fc(self):
        self.run_test(
-            callback=__network__,
+            callback=ParallelOpTest.__network__,
            feed={
-                'img':
+                'img': numpy.random.random(size=(51, 784)).astype('float32')
-                numpy.random.random(size=(128 * 3, 784)).astype('float32')
            },
            fetch='fc1.w@GRAD')
+    def test_fc_with_tiny_data(self):
+        self.run_test(
+            callback=ParallelOpTest.__network__,
+            feed={'img': numpy.random.random(size=(1, 784)).astype('float32')},
+            fetch='fc1.w@GRAD')
 if __name__ == '__main__':
    unittest.main()
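Note on `__network__` above: it is a generator. It first yields the input variable, receives a (possibly transformed) replacement through `send`, and then yields the loss. The sketch below shows how a caller can drive such a callback, using plain strings in place of fluid variables. `toy_network`, `drive` and `wrap` are hypothetical names; the real `run_test` helper is not shown in this diff.

```python
def toy_network():
    """Stand-in for __network__: yield the input, then yield the loss."""
    x = "img"
    x = yield x  # the caller may hand back a transformed input
    loss = "mean(fc(%s))" % x
    yield loss


def drive(callback, wrap):
    """Run a two-step generator callback and return (input, loss)."""
    gen = callback()
    data = next(gen)  # first yield: the input variable
    loss = gen.send(wrap(data))  # second yield: the loss built from the new input
    return data, loss


plain = drive(toy_network, lambda d: d)
wrapped = drive(toy_network, lambda d: "parallel(%s)" % d)
```

Making the callback a `@staticmethod` lets several test methods share the same network definition, which is what `test_simple_fc` and `test_fc_with_tiny_data` do.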
--- a/v1_api_demo/README.md
+++ b/v1_api_demo/README.md
-The examples in v1_api_demo are using v1_api currently, and will be upgraded to v2_api later.
-Thus, v1_api_demo is a temporary directory. We decide not to maintain it and will delete it in future.
-Please go to [PaddlePaddle/book](https://github.com/PaddlePaddle/book) and 
-[PaddlePaddle/models](https://github.com/PaddlePaddle/models) to learn PaddlePaddle.
--- a/v1_api_demo/gan/.gitignore
+++ b/v1_api_demo/gan/.gitignore
-output/
-uniform_params/
-cifar_params/
-mnist_params/
-*.png
-.pydevproject
-.project
-*.log
-*.pyc
-data/mnist_data/
-data/cifar-10-batches-py/
--- a/v1_api_demo/gan/README.md
+++ b/v1_api_demo/gan/README.md
-# Generative Adversarial Networks (GAN) 
-This demo implements GAN training described in the original GAN paper (https://arxiv.org/abs/1406.2661) and DCGAN (https://arxiv.org/abs/1511.06434).
-The general training procedures are implemented in gan_trainer.py. The neural network configurations are specified in gan_conf.py (for synthetic data) and gan_conf_image.py (for image data).
-In order to run the model, first download the corresponding data by running the shell script in ./data.
-Then you can run the command below. The flag -d specifies the training data (cifar, mnist or uniform) and the flag --use_gpu specifies whether to use the GPU for training (0 is CPU, 1 is GPU).
-$python gan_trainer.py -d cifar --use_gpu 1
-The generated images will be stored in ./cifar_samples/
-The corresponding models will be stored in ./cifar_params/
--- a/v1_api_demo/gan/data/download_cifar.sh
+++ b/v1_api_demo/gan/data/download_cifar.sh
-#!/bin/bash
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-set -e
-wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
-tar zxf cifar-10-python.tar.gz
-rm cifar-10-python.tar.gz
--- a/v1_api_demo/gan/data/get_mnist_data.sh
+++ b/v1_api_demo/gan/data/get_mnist_data.sh
-#!/usr/bin/env sh
-# This script downloads the mnist data and unzips it.
-set -e
-DIR="$( cd "$(dirname "$0")" ; pwd -P )"
-rm -rf "$DIR/mnist_data"
-mkdir "$DIR/mnist_data"
-cd "$DIR/mnist_data"
-echo "Downloading..."
-for fname in train-images-idx3-ubyte train-labels-idx1-ubyte t10k-images-idx3-ubyte t10k-labels-idx1-ubyte
-do
-    if [ ! -e $fname ]; then
-        wget --no-check-certificate http://yann.lecun.com/exdb/mnist/${fname}.gz
-        gunzip ${fname}.gz
-    fi
-done
--- a/v1_api_demo/gan/gan_conf.py
+++ b/v1_api_demo/gan/gan_conf.py
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from paddle.trainer_config_helpers import *
-mode = get_config_arg("mode", str, "generator")
-assert mode in set([
-    "generator", "discriminator", "generator_training", "discriminator_training"
-])
-is_generator_training = mode == "generator_training"
-is_discriminator_training = mode == "discriminator_training"
-is_generator = mode == "generator"
-is_discriminator = mode == "discriminator"
-# The network structure below follows the ref https://arxiv.org/abs/1406.2661
-# Here we used two hidden layers and batch_norm
-print('mode=%s' % mode)
-# the dim of the noise (z) as the input of the generator network
-noise_dim = 10
-# the dim of the hidden layer
-hidden_dim = 10
-# the dim of the generated sample
-sample_dim = 2
-settings(
-    batch_size=128,
-    learning_rate=1e-4,
-    learning_method=AdamOptimizer(beta1=0.5))
-def discriminator(sample):
-    """
-    discriminator outputs the probability that a sample is from the generator
-    or from real data.
-    The output is two-dimensional: dimension 0 is the probability
-    that the sample is from the generator and dimension 1 is the probability
-    that the sample is from real data.
-    """
-    param_attr = ParamAttr(is_static=is_generator_training)
-    bias_attr = ParamAttr(
-        is_static=is_generator_training, initial_mean=1.0, initial_std=0)
-    hidden = fc_layer(
-        input=sample,
-        name="dis_hidden",
-        size=hidden_dim,
-        bias_attr=bias_attr,
-        param_attr=param_attr,
-        act=ReluActivation())
-    hidden2 = fc_layer(
-        input=hidden,
-        name="dis_hidden2",
-        size=hidden_dim,
-        bias_attr=bias_attr,
-        param_attr=param_attr,
-        act=LinearActivation())
-    hidden_bn = batch_norm_layer(
-        hidden2,
-        act=ReluActivation(),
-        name="dis_hidden_bn",
-        bias_attr=bias_attr,
-        param_attr=ParamAttr(
-            is_static=is_generator_training, initial_mean=1.0,
-            initial_std=0.02),
-        use_global_stats=False)
-    return fc_layer(
-        input=hidden_bn,
-        name="dis_prob",
-        size=2,
-        bias_attr=bias_attr,
-        param_attr=param_attr,
-        act=SoftmaxActivation())
-def generator(noise):
-    """
-    generator generates a sample given noise
-    """
-    param_attr = ParamAttr(is_static=is_discriminator_training)
-    bias_attr = ParamAttr(
-        is_static=is_discriminator_training, initial_mean=1.0, initial_std=0)
-    hidden = fc_layer(
-        input=noise,
-        name="gen_layer_hidden",
-        size=hidden_dim,
-        bias_attr=bias_attr,
-        param_attr=param_attr,
-        act=ReluActivation())
-    hidden2 = fc_layer(
-        input=hidden,
-        name="gen_hidden2",
-        size=hidden_dim,
-        bias_attr=bias_attr,
-        param_attr=param_attr,
-        act=LinearActivation())
-    hidden_bn = batch_norm_layer(
-        hidden2,
-        act=ReluActivation(),
-        name="gen_layer_hidden_bn",
-        bias_attr=bias_attr,
-        param_attr=ParamAttr(
-            is_static=is_discriminator_training,
-            initial_mean=1.0,
-            initial_std=0.02),
-        use_global_stats=False)
-    return fc_layer(
-        input=hidden_bn,
-        name="gen_layer1",
-        size=sample_dim,
-        bias_attr=bias_attr,
-        param_attr=param_attr,
-        act=LinearActivation())
-if is_generator_training:
-    noise = data_layer(name="noise", size=noise_dim)
-    sample = generator(noise)
-if is_discriminator_training:
-    sample = data_layer(name="sample", size=sample_dim)
-if is_generator_training or is_discriminator_training:
-    label = data_layer(name="label", size=1)
-    prob = discriminator(sample)
-    cost = cross_entropy(input=prob, label=label)
-    classification_error_evaluator(
-        input=prob, label=label, name=mode + '_error')
-    outputs(cost)
-if is_generator:
-    noise = data_layer(name="noise", size=noise_dim)
-    outputs(generator(noise))
--- a/v1_api_demo/gan/gan_conf_image.py
+++ b/v1_api_demo/gan/gan_conf_image.py
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from paddle.trainer_config_helpers import *
-mode = get_config_arg("mode", str, "generator")
-dataSource = get_config_arg("data", str, "mnist")
-assert mode in set([
-    "generator", "discriminator", "generator_training", "discriminator_training"
-])
-is_generator_training = mode == "generator_training"
-is_discriminator_training = mode == "discriminator_training"
-is_generator = mode == "generator"
-is_discriminator = mode == "discriminator"
-# The network structure below follows the dcgan paper 
-# (https://arxiv.org/abs/1511.06434)
-print('mode=%s' % mode)
-# the dim of the noise (z) as the input of the generator network
-noise_dim = 100
-# the number of filters in the generator/discriminator layer that is
-# closest to the image
-gf_dim = 64
-df_dim = 64
-if dataSource == "mnist":
-    sample_dim = 28  # image dim
-    c_dim = 1  # image color
-else:
-    sample_dim = 32
-    c_dim = 3
-s2, s4 = int(sample_dim / 2), int(sample_dim / 4),
-s8, s16 = int(sample_dim / 8), int(sample_dim / 16)
-settings(
-    batch_size=128,
-    learning_rate=2e-4,
-    learning_method=AdamOptimizer(beta1=0.5))
-def conv_bn(input,
-            channels,
-            imgSize,
-            num_filters,
-            output_x,
-            stride,
-            name,
-            param_attr,
-            bias_attr,
-            param_attr_bn,
-            bn,
-            trans=False,
-            act=ReluActivation()):
-    """
-    conv_bn is a utility function that constructs a convolution/deconv layer 
-    with an optional batch_norm layer
-    :param bn: whether to use batch_norm_layer
-    :type bn: bool
-    :param trans: whether to use conv (False) or deconv (True)
-    :type trans: bool
-    """
-    # calculate the filter_size and padding size based on the given
-    # imgSize and output size
-    tmp = imgSize - (output_x - 1) * stride
-    if tmp <= 1 or tmp > 5:
-        raise ValueError("conv input-output dimension does not fit")
-    elif tmp <= 3:
-        filter_size = tmp + 2
-        padding = 1
-    else:
-        filter_size = tmp
-        padding = 0
-    print(imgSize, output_x, stride, filter_size, padding)
-    if trans:
-        nameApx = "_convt"
-    else:
-        nameApx = "_conv"
-    if bn:
-        conv = img_conv_layer(
-            input,
-            filter_size=filter_size,
-            num_filters=num_filters,
-            name=name + nameApx,
-            num_channels=channels,
-            act=LinearActivation(),
-            groups=1,
-            stride=stride,
-            padding=padding,
-            bias_attr=bias_attr,
-            param_attr=param_attr,
-            shared_biases=True,
-            layer_attr=None,
-            filter_size_y=None,
-            stride_y=None,
-            padding_y=None,
-            trans=trans)
-        conv_bn = batch_norm_layer(
-            conv,
-            act=act,
-            name=name + nameApx + "_bn",
-            bias_attr=bias_attr,
-            param_attr=param_attr_bn,
-            use_global_stats=False)
-        return conv_bn
-    else:
-        conv = img_conv_layer(
-            input,
-            filter_size=filter_size,
-            num_filters=num_filters,
-            name=name + nameApx,
-            num_channels=channels,
-            act=act,
-            groups=1,
-            stride=stride,
-            padding=padding,
-            bias_attr=bias_attr,
-            param_attr=param_attr,
-            shared_biases=True,
-            layer_attr=None,
-            filter_size_y=None,
-            stride_y=None,
-            padding_y=None,
-            trans=trans)
-        return conv
-def generator(noise):
-    """
-    generator generates a sample given noise
-    """
-    param_attr = ParamAttr(
-        is_static=is_discriminator_training, initial_mean=0.0, initial_std=0.02)
-    bias_attr = ParamAttr(
-        is_static=is_discriminator_training, initial_mean=0.0, initial_std=0.0)
-    param_attr_bn = ParamAttr(
-        is_static=is_discriminator_training, initial_mean=1.0, initial_std=0.02)
-    h1 = fc_layer(
-        input=noise,
-        name="gen_layer_h1",
-        size=s8 * s8 * gf_dim * 4,
-        bias_attr=bias_attr,
-        param_attr=param_attr,
-        act=LinearActivation())
-    h1_bn = batch_norm_layer(
-        h1,
-        act=ReluActivation(),
-        name="gen_layer_h1_bn",
-        bias_attr=bias_attr,
-        param_attr=param_attr_bn,
-        use_global_stats=False)
-    h2_bn = conv_bn(
-        h1_bn,
-        channels=gf_dim * 4,
-        output_x=s8,
-        num_filters=gf_dim * 2,
-        imgSize=s4,
-        stride=2,
-        name="gen_layer_h2",
-        param_attr=param_attr,
-        bias_attr=bias_attr,
-        param_attr_bn=param_attr_bn,
-        bn=True,
-        trans=True)
-    h3_bn = conv_bn(
-        h2_bn,
-        channels=gf_dim * 2,
-        output_x=s4,
-        num_filters=gf_dim,
-        imgSize=s2,
-        stride=2,
-        name="gen_layer_h3",
-        param_attr=param_attr,
-        bias_attr=bias_attr,
-        param_attr_bn=param_attr_bn,
-        bn=True,
-        trans=True)
-    return conv_bn(
-        h3_bn,
-        channels=gf_dim,
-        output_x=s2,
-        num_filters=c_dim,
-        imgSize=sample_dim,
-        stride=2,
-        name="gen_layer_h4",
-        param_attr=param_attr,
-        bias_attr=bias_attr,
-        param_attr_bn=param_attr_bn,
-        bn=False,
-        trans=True,
-        act=TanhActivation())
-def discriminator(sample):
-    """
-    discriminator outputs the probability that a sample is from the generator
-    or from real data.
-    The output is two-dimensional: dimension 0 is the probability
-    that the sample is from the generator and dimension 1 is the probability
-    that the sample is from real data.
-    """
-    param_attr = ParamAttr(
-        is_static=is_generator_training, initial_mean=0.0, initial_std=0.02)
-    bias_attr = ParamAttr(
-        is_static=is_generator_training, initial_mean=0.0, initial_std=0.0)
-    param_attr_bn = ParamAttr(
-        is_static=is_generator_training, initial_mean=1.0, initial_std=0.02)
-    h0 = conv_bn(
-        sample,
-        channels=c_dim,
-        imgSize=sample_dim,
-        num_filters=df_dim,
-        output_x=s2,
-        stride=2,
-        name="dis_h0",
-        param_attr=param_attr,
-        bias_attr=bias_attr,
-        param_attr_bn=param_attr_bn,
-        bn=False)
-    h1_bn = conv_bn(
-        h0,
-        channels=df_dim,
-        imgSize=s2,
-        num_filters=df_dim * 2,
-        output_x=s4,
-        stride=2,
-        name="dis_h1",
-        param_attr=param_attr,
-        bias_attr=bias_attr,
-        param_attr_bn=param_attr_bn,
-        bn=True)
-    h2_bn = conv_bn(
-        h1_bn,
-        channels=df_dim * 2,
-        imgSize=s4,
-        num_filters=df_dim * 4,
-        output_x=s8,
-        stride=2,
-        name="dis_h2",
-        param_attr=param_attr,
-        bias_attr=bias_attr,
-        param_attr_bn=param_attr_bn,
-        bn=True)
-    return fc_layer(
-        input=h2_bn,
-        name="dis_prob",
-        size=2,
-        bias_attr=bias_attr,
-        param_attr=param_attr,
-        act=SoftmaxActivation())
-if is_generator_training:
-    noise = data_layer(name="noise", size=noise_dim)
-    sample = generator(noise)
-if is_discriminator_training:
-    sample = data_layer(name="sample", size=sample_dim * sample_dim * c_dim)
-if is_generator_training or is_discriminator_training:
-    label = data_layer(name="label", size=1)
-    prob = discriminator(sample)
-    cost = cross_entropy(input=prob, label=label)
-    classification_error_evaluator(
-        input=prob, label=label, name=mode + '_error')
-    outputs(cost)
-if is_generator:
-    noise = data_layer(name="noise", size=noise_dim)
-    outputs(generator(noise))
--- a/v1_api_demo/gan/gan_trainer.py
+++ b/v1_api_demo/gan/gan_trainer.py
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-import argparse
-import random
-import numpy
-import cPickle
-import sys, os
-from PIL import Image
-from paddle.trainer.config_parser import parse_config
-from paddle.trainer.config_parser import logger
-import py_paddle.swig_paddle as api
-import matplotlib.pyplot as plt
-def plot2DScatter(data, outputfile):
-    '''
-    Plot the data as a 2D scatter plot and save to outputfile
-    data needs to be two-dimensional
-    '''
-    x = data[:, 0]
-    y = data[:, 1]
-    logger.info("The mean vector is %s" % numpy.mean(data, 0))
-    logger.info("The std vector is %s" % numpy.std(data, 0))
-    heatmap, xedges, yedges = numpy.histogram2d(x, y, bins=50)
-    extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
-    plt.clf()
-    plt.scatter(x, y)
-    plt.savefig(outputfile, bbox_inches='tight')
-def CHECK_EQ(a, b):
-    assert a == b, "a=%s, b=%s" % (a, b)
-def copy_shared_parameters(src, dst):
-    '''
-    copy the parameters from src to dst
-    :param src: the source of the parameters
-    :type src: GradientMachine
-    :param dst: the destination of the parameters
-    :type dst: GradientMachine
-    '''
-    src_params = [src.getParameter(i) for i in xrange(src.getParameterSize())]
-    src_params = dict([(p.getName(), p) for p in src_params])
-    for i in xrange(dst.getParameterSize()):
-        dst_param = dst.getParameter(i)
-        src_param = src_params.get(dst_param.getName(), None)
-        if src_param is None:
-            continue
-        src_value = src_param.getBuf(api.PARAMETER_VALUE)
-        dst_value = dst_param.getBuf(api.PARAMETER_VALUE)
-        CHECK_EQ(len(src_value), len(dst_value))
-        dst_value.copyFrom(src_value)
-        dst_param.setValueUpdated()
-def print_parameters(src):
-    src_params = [src.getParameter(i) for i in xrange(src.getParameterSize())]
-    print "***************"
-    for p in src_params:
-        print "Name is %s" % p.getName()
-        print "value is %s \n" % p.getBuf(api.PARAMETER_VALUE).copyToNumpyArray(
-        )
-def load_mnist_data(imageFile):
-    f = open(imageFile, "rb")
-    f.read(16)
-    # Define number of samples for train/test
-    if "train" in imageFile:
-        n = 60000
-    else:
-        n = 10000
-    data = numpy.fromfile(f, 'ubyte', count=n * 28 * 28).reshape((n, 28 * 28))
-    data = data / 255.0 * 2.0 - 1.0
-    f.close()
-    return data.astype('float32')
-def load_cifar_data(cifar_path):
-    batch_size = 10000
-    data = numpy.zeros((5 * batch_size, 32 * 32 * 3), dtype="float32")
-    for i in range(1, 6):
-        file = cifar_path + "/data_batch_" + str(i)
-        fo = open(file, 'rb')
-        dict = cPickle.load(fo)
-        fo.close()
-        data[(i - 1) * batch_size:(i * batch_size), :] = dict["data"]
-    data = data / 255.0 * 2.0 - 1.0
-    return data
-# synthesize 2-D uniform data
-def load_uniform_data():
-    data = numpy.random.rand(1000000, 2).astype('float32')
-    return data
-def merge(images, size):
-    if images.shape[1] == 28 * 28:
-        h, w, c = 28, 28, 1
-    else:
-        h, w, c = 32, 32, 3
-    img = numpy.zeros((h * size[0], w * size[1], c))
-    for idx in xrange(size[0] * size[1]):
-        i = idx % size[1]
-        j = idx // size[1]
-        img[j*h:j*h+h, i*w:i*w+w, :] = \
-          ((images[idx, :].reshape((h, w, c), order="F").transpose(1, 0, 2) + 1.0) / 2.0 * 255.0)
-    return img.astype('uint8')
-def save_images(images, path):
-    merged_img = merge(images, [8, 8])
-    if merged_img.shape[2] == 1:
-        im = Image.fromarray(numpy.squeeze(merged_img)).convert('RGB')
-    else:
-        im = Image.fromarray(merged_img, mode="RGB")
-    im.save(path)
-def get_real_samples(batch_size, data_np):
-    return data_np[numpy.random.choice(
-        data_np.shape[0], batch_size, replace=False), :]
-def get_noise(batch_size, noise_dim):
-    return numpy.random.normal(size=(batch_size, noise_dim)).astype('float32')
-def get_fake_samples(generator_machine, batch_size, noise):
-    gen_inputs = api.Arguments.createArguments(1)
-    gen_inputs.setSlotValue(0, api.Matrix.createDenseFromNumpy(noise))
-    gen_outputs = api.Arguments.createArguments(0)
-    generator_machine.forward(gen_inputs, gen_outputs, api.PASS_TEST)
-    fake_samples = gen_outputs.getSlotValue(0).copyToNumpyMat()
-    return fake_samples
-def get_training_loss(training_machine, inputs):
-    outputs = api.Arguments.createArguments(0)
-    training_machine.forward(inputs, outputs, api.PASS_TEST)
-    loss = outputs.getSlotValue(0).copyToNumpyMat()
-    return numpy.mean(loss)
-def prepare_discriminator_data_batch_pos(batch_size, data_np):
-    real_samples = get_real_samples(batch_size, data_np)
-    labels = numpy.ones(batch_size, dtype='int32')
-    inputs = api.Arguments.createArguments(2)
-    inputs.setSlotValue(0, api.Matrix.createDenseFromNumpy(real_samples))
-    inputs.setSlotIds(1, api.IVector.createVectorFromNumpy(labels))
-    return inputs
-def prepare_discriminator_data_batch_neg(generator_machine, batch_size, noise):
-    fake_samples = get_fake_samples(generator_machine, batch_size, noise)
-    labels = numpy.zeros(batch_size, dtype='int32')
-    inputs = api.Arguments.createArguments(2)
-    inputs.setSlotValue(0, api.Matrix.createDenseFromNumpy(fake_samples))
-    inputs.setSlotIds(1, api.IVector.createVectorFromNumpy(labels))
-    return inputs
-def prepare_generator_data_batch(batch_size, noise):
-    label = numpy.ones(batch_size, dtype='int32')
-    inputs = api.Arguments.createArguments(2)
-    inputs.setSlotValue(0, api.Matrix.createDenseFromNumpy(noise))
-    inputs.setSlotIds(1, api.IVector.createVectorFromNumpy(label))
-    return inputs
-def find(iterable, cond):
-    for item in iterable:
-        if cond(item):
-            return item
-    return None
-def get_layer_size(model_conf, layer_name):
-    layer_conf = find(model_conf.layers, lambda x: x.name == layer_name)
-    assert layer_conf is not None, "Cannot find '%s' layer" % layer_name
-    return layer_conf.size
-def main():
-    parser = argparse.ArgumentParser()
-    parser.add_argument("-d", "--data_source", help="mnist or cifar or uniform")
-    parser.add_argument(
-        "--use_gpu", default="1", help="1 means use gpu for training")
-    parser.add_argument("--gpu_id", default="0", help="the gpu_id parameter")
-    args = parser.parse_args()
-    data_source = args.data_source
-    use_gpu = args.use_gpu
-    assert data_source in ["mnist", "cifar", "uniform"]
-    assert use_gpu in ["0", "1"]
-    if not os.path.exists("./%s_samples/" % data_source):
-        os.makedirs("./%s_samples/" % data_source)
-    if not os.path.exists("./%s_params/" % data_source):
-        os.makedirs("./%s_params/" % data_source)
-    api.initPaddle('--use_gpu=' + use_gpu, '--dot_period=10',
-                   '--log_period=100', '--gpu_id=' + args.gpu_id,
-                   '--save_dir=' + "./%s_params/" % data_source)
-    if data_source == "uniform":
-        conf = "gan_conf.py"
-        num_iter = 10000
-    else:
-        conf = "gan_conf_image.py"
-        num_iter = 1000
-    gen_conf = parse_config(conf, "mode=generator_training,data=" + data_source)
-    dis_conf = parse_config(conf,
-                            "mode=discriminator_training,data=" + data_source)
-    generator_conf = parse_config(conf, "mode=generator,data=" + data_source)
-    batch_size = dis_conf.opt_config.batch_size
-    noise_dim = get_layer_size(gen_conf.model_config, "noise")
-    if data_source == "mnist":
-        data_np = load_mnist_data("./data/mnist_data/train-images-idx3-ubyte")
-    elif data_source == "cifar":
-        data_np = load_cifar_data("./data/cifar-10-batches-py/")
-    else:
-        data_np = load_uniform_data()
-    # this creates a gradient machine for discriminator
-    dis_training_machine = api.GradientMachine.createFromConfigProto(
-        dis_conf.model_config)
-    # this creates a gradient machine for generator
-    gen_training_machine = api.GradientMachine.createFromConfigProto(
-        gen_conf.model_config)
-    # generator_machine is used to generate data only, which is used for
-    # training discriminator
-    logger.info(str(generator_conf.model_config))
-    generator_machine = api.GradientMachine.createFromConfigProto(
-        generator_conf.model_config)
-    dis_trainer = api.Trainer.create(dis_conf, dis_training_machine)
-    gen_trainer = api.Trainer.create(gen_conf, gen_training_machine)
-    dis_trainer.startTrain()
-    gen_trainer.startTrain()
-    # Sync parameters between networks (GradientMachine) at the beginning
-    copy_shared_parameters(gen_training_machine, dis_training_machine)
-    copy_shared_parameters(gen_training_machine, generator_machine)
-    # Constraint: neither the discriminator nor the generator may be trained
-    # more than MAX_strike times in a row
-    curr_train = "dis"
-    curr_strike = 0
-    MAX_strike = 5
-    for train_pass in xrange(100):
-        dis_trainer.startTrainPass()
-        gen_trainer.startTrainPass()
-        for i in xrange(num_iter):
-            # Do forward pass in discriminator to get the dis_loss
-            noise = get_noise(batch_size, noise_dim)
-            data_batch_dis_pos = prepare_discriminator_data_batch_pos(
-                batch_size, data_np)
-            dis_loss_pos = get_training_loss(dis_training_machine,
-                                             data_batch_dis_pos)
-            data_batch_dis_neg = prepare_discriminator_data_batch_neg(
-                generator_machine, batch_size, noise)
-            dis_loss_neg = get_training_loss(dis_training_machine,
-                                             data_batch_dis_neg)
-            dis_loss = (dis_loss_pos + dis_loss_neg) / 2.0
-            # Do forward pass in generator to get the gen_loss
-            data_batch_gen = prepare_generator_data_batch(batch_size, noise)
-            gen_loss = get_training_loss(gen_training_machine, data_batch_gen)
-            if i % 100 == 0:
-                print "d_pos_loss is %s     d_neg_loss is %s" % (dis_loss_pos,
-                                                                 dis_loss_neg)
-                print "d_loss is %s    g_loss is %s" % (dis_loss, gen_loss)
-            # Decide which network to train based on the training history
-            # and the relative size of the two losses
-            if (not (curr_train == "dis" and curr_strike == MAX_strike)) and \
-               ((curr_train == "gen" and curr_strike == MAX_strike) or dis_loss > gen_loss):
-                if curr_train == "dis":
-                    curr_strike += 1
-                else:
-                    curr_train = "dis"
-                    curr_strike = 1
-                dis_trainer.trainOneDataBatch(batch_size, data_batch_dis_neg)
-                dis_trainer.trainOneDataBatch(batch_size, data_batch_dis_pos)
-                copy_shared_parameters(dis_training_machine,
-                                       gen_training_machine)
-            else:
-                if curr_train == "gen":
-                    curr_strike += 1
-                else:
-                    curr_train = "gen"
-                    curr_strike = 1
-                gen_trainer.trainOneDataBatch(batch_size, data_batch_gen)
-                # TODO: add API for paddle to allow true parameter sharing between different GradientMachines 
-                # so that we do not need to copy shared parameters. 
-                copy_shared_parameters(gen_training_machine,
-                                       dis_training_machine)
-                copy_shared_parameters(gen_training_machine, generator_machine)
-        dis_trainer.finishTrainPass()
-        gen_trainer.finishTrainPass()
-        # At the end of each pass, save the generated samples/images
-        fake_samples = get_fake_samples(generator_machine, batch_size, noise)
-        if data_source == "uniform":
-            plot2DScatter(fake_samples, "./%s_samples/train_pass%s.png" %
-                          (data_source, train_pass))
-        else:
-            save_images(fake_samples, "./%s_samples/train_pass%s.png" %
-                        (data_source, train_pass))
-    dis_trainer.finishTrain()
-    gen_trainer.finishTrain()
-if __name__ == '__main__':
-    main()
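Reviewer note: the scheduling rule in the removed training loop (train whichever network has the larger loss, but never the same one more than MAX_strike times in a row) can be isolated as a pure function. The following is a minimal Python 3 sketch of that rule only; the name `choose_network` and its signature are hypothetical and do not appear in the removed file.

```python
MAX_STRIKE = 5  # same limit as MAX_strike in the removed script


def choose_network(curr_train, curr_strike, dis_loss, gen_loss,
                   max_strike=MAX_STRIKE):
    """Return (network_to_train, new_strike_count).

    Hypothetical helper: mirrors the if/else condition of the removed loop.
    """
    dis_exhausted = curr_train == "dis" and curr_strike == max_strike
    gen_exhausted = curr_train == "gen" and curr_strike == max_strike
    # Train the discriminator when it is allowed to run and either the
    # generator has hit its limit or the discriminator loss is larger.
    if not dis_exhausted and (gen_exhausted or dis_loss > gen_loss):
        chosen = "dis"
    else:
        chosen = "gen"
    # Extend the current streak or start a new one.
    strike = curr_strike + 1 if chosen == curr_train else 1
    return chosen, strike


# Start from the same state as the removed script: ("dis", 0).
state = ("dis", 0)
schedule = []
for _ in range(7):
    state = choose_network(state[0], state[1], dis_loss=1.0, gen_loss=0.5)
    schedule.append(state[0])
```

With the discriminator loss always larger, the discriminator is chosen five times, the generator is forced once, and the discriminator is chosen again.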
--- a/v1_api_demo/mnist/.gitignore
+++ b/v1_api_demo/mnist/.gitignore
-data/raw_data
-data/*.list
-mnist_vgg_model
-plot.png
-train.log
-*pyc
-.ipynb_checkpoints
-params.pkl
-params.tar
-params.tar.gz
--- a/v1_api_demo/mnist/api_train.py
+++ b/v1_api_demo/mnist/api_train.py
-#  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
-#
-#Licensed under the Apache License, Version 2.0 (the "License");
-#you may not use this file except in compliance with the License.
-#You may obtain a copy of the License at
-#
-#    http://www.apache.org/licenses/LICENSE-2.0
-#
-#Unless required by applicable law or agreed to in writing, software
-#distributed under the License is distributed on an "AS IS" BASIS,
-#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-#See the License for the specific language governing permissions and
-#limitations under the License.
-"""
-A very basic example of how to use the current raw SWIG API to train an MNIST network.
-The current implementation uses raw SWIG, which means each API call is passed \
-directly to the C++ side of Paddle.
-The user API could be simpler and more carefully designed.
-"""
-import random
-import numpy as np
-import paddle.v2 as paddle_v2
-import py_paddle.swig_paddle as api
-from paddle.trainer_config_helpers import *
-from py_paddle import DataProviderConverter
-from mnist_util import read_from_mnist
-def init_parameter(network):
-    assert isinstance(network, api.GradientMachine)
-    for each_param in network.getParameters():
-        assert isinstance(each_param, api.Parameter)
-        array_size = len(each_param)
-        array = np.random.uniform(-1.0, 1.0, array_size).astype('float32')
-        each_param.getBuf(api.PARAMETER_VALUE).copyFromNumpyArray(array)
-def generator_to_batch(generator, batch_size):
-    ret_val = list()
-    for each_item in generator:
-        ret_val.append(each_item)
-        if len(ret_val) == batch_size:
-            yield ret_val
-            ret_val = list()
-    if len(ret_val) != 0:
-        yield ret_val
-class BatchPool(object):
-    def __init__(self, generator, batch_size):
-        self.data = list(generator)
-        self.batch_size = batch_size
-    def __call__(self):
-        random.shuffle(self.data)
-        for offset in xrange(0, len(self.data), self.batch_size):
-            limit = min(offset + self.batch_size, len(self.data))
-            yield self.data[offset:limit]
-def input_order_converter(generator):
-    for each_item in generator:
-        yield each_item['pixel'], each_item['label']
-def main():
-    api.initPaddle("-use_gpu=false", "-trainer_count=4")  # use 4 cpu cores
-    optimizer = paddle_v2.optimizer.Adam(
-        learning_rate=1e-4,
-        batch_size=1000,
-        model_average=ModelAverage(average_window=0.5),
-        regularization=L2Regularization(rate=0.5))
-    # Create Local Updater. Local means not run in cluster.
-    # For a cluster training, here we can change to createRemoteUpdater
-    # in future.
-    updater = optimizer.create_local_updater()
-    assert isinstance(updater, api.ParameterUpdater)
-    # define network
-    images = paddle_v2.layer.data(
-        name='pixel', type=paddle_v2.data_type.dense_vector(784))
-    label = paddle_v2.layer.data(
-        name='label', type=paddle_v2.data_type.integer_value(10))
-    hidden1 = paddle_v2.layer.fc(input=images, size=200)
-    hidden2 = paddle_v2.layer.fc(input=hidden1, size=200)
-    inference = paddle_v2.layer.fc(input=hidden2,
-                                   size=10,
-                                   act=paddle_v2.activation.Softmax())
-    cost = paddle_v2.layer.classification_cost(input=inference, label=label)
-    # Create Simple Gradient Machine.
-    model_config = paddle_v2.layer.parse_network(cost)
-    m = api.GradientMachine.createFromConfigProto(model_config,
-                                                  api.CREATE_MODE_NORMAL,
-                                                  optimizer.enable_types())
-    # This type check has no runtime purpose; it only enables type hints in
-    # IDEs such as PyCharm.
-    assert isinstance(m, api.GradientMachine)
-    # Initialize Parameter by numpy.
-    init_parameter(network=m)
-    # Initialize ParameterUpdater.
-    updater.init(m)
-    # DataProviderConverter is a utility that converts Python objects to Paddle
-    # C++ input. The input format is the same as Paddle's DataProvider.
-    converter = DataProviderConverter(input_types=[images.type, label.type])
-    train_file = './data/raw_data/train'
-    test_file = './data/raw_data/t10k'
-    # Start the gradient machine.
-    # The gradient machine must be started before invoking forward/backward,
-    # not just for training, but also for inference.
-    m.start()
-    # evaluator can print error rate, etc. It is a C++ class.
-    batch_evaluator = m.makeEvaluator()
-    test_evaluator = m.makeEvaluator()
-    # Get Train Data.
-    # The training data is stored in a data pool. The current implementation
-    # does not optimize for memory or speed; it is a very naive implementation.
-    train_data_generator = input_order_converter(read_from_mnist(train_file))
-    train_data = BatchPool(train_data_generator, 512)
-    # outArgs holds the neural network's forward result. It is not needed at
-    # this point; it is just passed to gradient_machine.forward
-    outArgs = api.Arguments.createArguments(0)
-    for pass_id in xrange(2):  # we train 2 passes.
-        updater.startPass()
-        for batch_id, data_batch in enumerate(train_data()):
-            # data_batch is input images.
-            # here, for online learning, we could get data_batch from network.
-            # Start update one batch.
-            pass_type = updater.startBatch(len(data_batch))
-            # Start BatchEvaluator.
-            # batch_evaluator can be used between start/finish.
-            batch_evaluator.start()
-            # forwardBackward is a shortcut for forward and backward.
-            # It is sometimes faster than invoking forward/backward separately,
-            # because in GradientMachine, it may be async.
-            m.forwardBackward(converter(data_batch), outArgs, pass_type)
-            for each_param in m.getParameters():
-                updater.update(each_param)
-            # Get cost. We use numpy to calculate total cost for this batch.
-            cost_vec = outArgs.getSlotValue(0)
-            cost_vec = cost_vec.copyToNumpyMat()
-            cost = cost_vec.sum() / len(data_batch)
-            # Run the evaluator on this batch.
-            m.eval(batch_evaluator)
-            # Print logs.
-            print 'Pass id', pass_id, 'Batch id', batch_id, 'with cost=', \
-                cost, batch_evaluator
-            batch_evaluator.finish()
-            # Finish batch.
-            #  * will clear gradient.
-            #  * ensure all values should be updated.
-            updater.finishBatch(cost)
-        # testing stage. use test data set to test current network.
-        updater.apply()
-        test_evaluator.start()
-        test_data_generator = input_order_converter(read_from_mnist(test_file))
-        for data_batch in generator_to_batch(test_data_generator, 512):
-            # in testing stage, only forward is needed.
-            m.forward(converter(data_batch), outArgs, api.PASS_TEST)
-            m.eval(test_evaluator)
-        # print error rate for test data set
-        print 'Pass', pass_id, ' test evaluator: ', test_evaluator
-        test_evaluator.finish()
-        updater.restore()
-        updater.catchUpWith()
-        params = m.getParameters()
-        for each_param in params:
-            assert isinstance(each_param, api.Parameter)
-            value = each_param.getBuf(api.PARAMETER_VALUE)
-            value = value.copyToNumpyArray()
-            # Here, the parameter could be saved anywhere you want
-            print each_param.getName(), value
-        updater.finishPass()
-    m.finish()
-if __name__ == '__main__':
-    main()
--- a/v1_api_demo/mnist/data/generate_list.py
+++ b/v1_api_demo/mnist/data/generate_list.py
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-o = open("./" + "train.list", "w")
-o.write("./data/raw_data/train" + "\n")
-o.close()
-o = open("./" + "test.list", "w")
-o.write("./data/raw_data/t10k" + "\n")
-o.close()
--- a/v1_api_demo/mnist/data/get_mnist_data.sh
+++ b/v1_api_demo/mnist/data/get_mnist_data.sh
-#!/usr/bin/env sh
-# This script downloads the MNIST data and unzips it.
-set -e
-DIR="$( cd "$(dirname "$0")" ; pwd -P )"
-rm -rf "$DIR/raw_data"
-mkdir "$DIR/raw_data"
-cd "$DIR/raw_data"
-echo "Downloading..."
-for fname in train-images-idx3-ubyte train-labels-idx1-ubyte t10k-images-idx3-ubyte t10k-labels-idx1-ubyte
-do
-    if [ ! -e $fname ]; then
-        wget --no-check-certificate http://yann.lecun.com/exdb/mnist/${fname}.gz
-        gunzip ${fname}.gz
-    fi
-done
-cd $DIR
-rm -f *.list
-python generate_list.py
--- a/v1_api_demo/mnist/light_mnist.py
+++ b/v1_api_demo/mnist/light_mnist.py
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from paddle.trainer_config_helpers import *
-is_predict = get_config_arg("is_predict", bool, False)
-####################Data Configuration ##################
-if not is_predict:
-    data_dir = './data/'
-    define_py_data_sources2(
-        train_list=data_dir + 'train.list',
-        test_list=data_dir + 'test.list',
-        module='mnist_provider',
-        obj='process')
-######################Algorithm Configuration #############
-settings(batch_size=50, learning_rate=0.001, learning_method=AdamOptimizer())
-#######################Network Configuration #############
-data_size = 1 * 28 * 28
-label_size = 10
-img = data_layer(name='pixel', size=data_size)
-# light cnn
-# A shallower cnn model: [CNN, BN, ReLU, Max-Pooling] x4 + FC x1
-# Easier to train for mnist dataset and quite efficient
-# Final performance is close to deeper ones on tasks such as digit and character classification
-def light_cnn(input_image, num_channels, num_classes):
-    def __light__(ipt,
-                  num_filter=128,
-                  times=1,
-                  conv_filter_size=3,
-                  dropouts=0,
-                  num_channels_=None):
-        return img_conv_group(
-            input=ipt,
-            num_channels=num_channels_,
-            pool_size=2,
-            pool_stride=2,
-            conv_padding=0,
-            conv_num_filter=[num_filter] * times,
-            conv_filter_size=conv_filter_size,
-            conv_act=ReluActivation(),
-            conv_with_batchnorm=True,
-            conv_batchnorm_drop_rate=dropouts,
-            pool_type=MaxPooling())
-    tmp = __light__(input_image, num_filter=128, num_channels_=num_channels)
-    tmp = __light__(tmp, num_filter=128)
-    tmp = __light__(tmp, num_filter=128)
-    tmp = __light__(tmp, num_filter=128, conv_filter_size=1)
-    tmp = fc_layer(input=tmp, size=num_classes, act=SoftmaxActivation())
-    return tmp
-predict = light_cnn(input_image=img, num_channels=1, num_classes=label_size)
-if not is_predict:
-    lbl = data_layer(name="label", size=label_size)
-    inputs(img, lbl)
-    outputs(classification_cost(input=predict, label=lbl))
-else:
-    outputs(predict)
--- a/v1_api_demo/mnist/mnist_provider.py
+++ b/v1_api_demo/mnist/mnist_provider.py
-#  Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.
-#
-#Licensed under the Apache License, Version 2.0 (the "License");
-#you may not use this file except in compliance with the License.
-#You may obtain a copy of the License at
-#
-#    http://www.apache.org/licenses/LICENSE-2.0
-#
-#Unless required by applicable law or agreed to in writing, software
-#distributed under the License is distributed on an "AS IS" BASIS,
-#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-#See the License for the specific language governing permissions and
-#limitations under the License.
-from paddle.trainer.PyDataProvider2 import *
-from mnist_util import read_from_mnist
-# Define a py data provider
-@provider(
-    input_types={'pixel': dense_vector(28 * 28),
-                 'label': integer_value(10)},
-    cache=CacheType.CACHE_PASS_IN_MEM)
-def process(settings, filename):  # settings is not used currently.
-    for each in read_from_mnist(filename):
-        yield each
--- a/v1_api_demo/mnist/mnist_util.py
+++ b/v1_api_demo/mnist/mnist_util.py
-import numpy
-__all__ = ['read_from_mnist']
-def read_from_mnist(filename):
-    imgf = filename + "-images-idx3-ubyte"
-    labelf = filename + "-labels-idx1-ubyte"
-    f = open(imgf, "rb")
-    l = open(labelf, "rb")
-    f.read(16)
-    l.read(8)
-    # Define number of samples for train/test
-    if "train" in filename:
-        n = 60000
-    else:
-        n = 10000
-    images = numpy.fromfile(
-        f, 'ubyte', count=n * 28 * 28).reshape((n, 28 * 28)).astype('float32')
-    images = images / 255.0 * 2.0 - 1.0
-    labels = numpy.fromfile(l, 'ubyte', count=n).astype("int")
-    for i in xrange(n):
-        yield {"pixel": images[i, :], 'label': labels[i]}
-    f.close()
-    l.close()
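Reviewer note: the removed reader skips the 16-byte image header and the 8-byte label header and hard-codes the sample count from the file name. The IDX headers already carry that information as big-endian 32-bit integers, so the counts could be read instead. The following is a minimal Python 3 sketch under that assumption; `read_image_header` and `read_label_header` are hypothetical names, and the in-memory streams stand in for the real files.

```python
import io
import struct

IMAGE_MAGIC = 2051  # 0x00000803: unsigned bytes, 3 dimensions
LABEL_MAGIC = 2049  # 0x00000801: unsigned bytes, 1 dimension


def read_image_header(stream):
    """Consume the 16-byte image header and return (count, rows, cols)."""
    magic, count, rows, cols = struct.unpack(">IIII", stream.read(16))
    if magic != IMAGE_MAGIC:
        raise ValueError("unexpected image magic number: %d" % magic)
    return count, rows, cols


def read_label_header(stream):
    """Consume the 8-byte label header and return the label count."""
    magic, count = struct.unpack(">II", stream.read(8))
    if magic != LABEL_MAGIC:
        raise ValueError("unexpected label magic number: %d" % magic)
    return count


# Synthetic headers for illustration only; real files would be opened in "rb" mode.
image_file = io.BytesIO(struct.pack(">IIII", IMAGE_MAGIC, 60000, 28, 28))
label_file = io.BytesIO(struct.pack(">II", LABEL_MAGIC, 60000))
image_header = read_image_header(image_file)
label_count = read_label_header(label_file)
```

After either call the stream is positioned at the first data byte, which is the same position the removed code reaches with `f.read(16)` and `l.read(8)`.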
--- a/v1_api_demo/mnist/train.sh
+++ b/v1_api_demo/mnist/train.sh
-#!/bin/bash
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-set -e
-config=vgg_16_mnist.py
-output=./mnist_vgg_model
-log=train.log
-paddle train \
--config=$config \
--dot_period=10 \
--log_period=100 \
--test_all_data_in_one_period=1 \
--use_gpu=0 \
--trainer_count=1 \
--num_passes=100 \
--save_dir=$output \
-2>&1 | tee $log
-paddle usage -l $log -e $? -n "mnist_train" >/dev/null 2>&1
-python -m paddle.utils.plotcurve -i $log > plot.png
--- a/v1_api_demo/mnist/vgg_16_mnist.py
+++ b/v1_api_demo/mnist/vgg_16_mnist.py
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from paddle.trainer_config_helpers import *
-is_predict = get_config_arg("is_predict", bool, False)
-####################Data Configuration ##################
-if not is_predict:
-    data_dir = './data/'
-    define_py_data_sources2(
-        train_list=data_dir + 'train.list',
-        test_list=data_dir + 'test.list',
-        module='mnist_provider',
-        obj='process')
-######################Algorithm Configuration #############
-settings(
-    batch_size=128,
-    learning_rate=0.1 / 128.0,
-    learning_method=MomentumOptimizer(0.9),
-    regularization=L2Regularization(0.0005 * 128))
-#######################Network Configuration #############
-data_size = 1 * 28 * 28
-label_size = 10
-img = data_layer(name='pixel', size=data_size)
-# small_vgg is predefined in trainer_config_helpers.network
-predict = small_vgg(input_image=img, num_channels=1, num_classes=label_size)
-if not is_predict:
-    lbl = data_layer(name="label", size=label_size)
-    inputs(img, lbl)
-    outputs(classification_cost(input=predict, label=lbl))
-else:
-    outputs(predict)
--- a/v1_api_demo/model_zoo/embedding/.gitignore
+++ b/v1_api_demo/model_zoo/embedding/.gitignore
-baidu.dict
-model_*.emb
--- a/v1_api_demo/model_zoo/embedding/extract_para.py
+++ b/v1_api_demo/model_zoo/embedding/extract_para.py
-#!/bin/env python
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""
-Example:
-    python extract_para.py --preModel PREMODEL --preDict PREDICT \
-                            --usrModel USRMODEL --usrDict USRDICT -d DIM
-Options:
-    -h, --help          show this help message and exit
-    --preModel PREMODEL the name of pretrained embedding model
-    --preDict PREDICT   the name of pretrained dictionary
-    --usrModel usrModel the name of output usr embedding model
-    --usrDict usrDict   the name of user specified dictionary
-    -d DIM              dimension of parameter
-"""
-from optparse import OptionParser
-import struct
-def get_row_index(preDict, usrDict):
-    """
-    Get the row positions for all words in user dictionary from pre-trained dictionary.
-    return: a list of row positions
-    Example: preDict='a\nb\nc\n', usrDict='a\nc\n', then return [0,2]
-    """
-    pos = []
-    index = dict()
-    with open(preDict, "r") as f:
-        for line_index, line in enumerate(f):
-            word = line.strip().split()[0]
-            index[word] = line_index
-    with open(usrDict, "r") as f:
-        for line in f:
-            word = line.strip().split()[0]
-            pos.append(index[word])
-    return pos
-def extract_parameters_by_usrDict(preModel, preDict, usrModel, usrDict,
-                                  paraDim):
-    """
-    Extract desired parameters from a pretrained embedding model based on user dictionary
-    """
-    if paraDim not in [32, 64, 128, 256]:
-        raise RuntimeError("We only support 32, 64, 128, 256 dimensions now")
-    fi = open(preModel, "rb")
-    fo = open(usrModel, "wb")
-    # write filehead
-    rowIndex = get_row_index(preDict, usrDict)
-    newHead = struct.pack("iil", 0, 4, len(rowIndex) * paraDim)
-    fo.write(newHead)
-    bytes = 4 * paraDim
-    for i in range(0, len(rowIndex)):
-        # find the absolute position of input file
-        fi.seek(rowIndex[i] * bytes + 16, 0)
-        fo.write(fi.read(bytes))
-    print "extract parameters finish, total", len(rowIndex), "lines"
-    fi.close()
-def main():
-    """
-    Main entry for running extract_para.py
-    """
-    usage = "usage: \n" \
-            "python %prog --preModel PREMODEL --preDict PREDICT" \
-            " --usrModel USRMODEL --usrDict USRDICT -d DIM"
-    parser = OptionParser(usage)
-    parser.add_option(
-        "--preModel",
-        action="store",
-        dest="preModel",
-        help="the name of pretrained embedding model")
-    parser.add_option(
-        "--preDict",
-        action="store",
-        dest="preDict",
-        help="the name of pretrained dictionary")
-    parser.add_option(
-        "--usrModel",
-        action="store",
-        dest="usrModel",
-        help="the name of output usr embedding model")
-    parser.add_option(
-        "--usrDict",
-        action="store",
-        dest="usrDict",
-        help="the name of user specified dictionary")
-    parser.add_option(
-        "-d", action="store", dest="dim", help="dimension of parameter")
-    (options, args) = parser.parse_args()
-    extract_parameters_by_usrDict(options.preModel, options.preDict,
-                                  options.usrModel, options.usrDict,
-                                  int(options.dim))
-if __name__ == '__main__':
-    main()
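The removed extraction script maps each word of the user dictionary to its row in the pretrained dictionary, then copies those rows out of the binary model after a 16-byte header. The following is a minimal Python 3 sketch of that idea, not the original code. The names `row_positions` and `extract_rows` are hypothetical, the header format `"<iiq"` is an assumption (little-endian with an 8-byte count, which is what the native `"iil"` format yields on typical 64-bit Linux), and the dimension of 2 is used only to keep the demo small, whereas the original accepts 32, 64, 128 or 256.

```python
import io
import struct

# Assumption: little-endian header of version, floatSize, paraCount (4 + 4 + 8 bytes).
HEADER_FORMAT = "<iiq"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def row_positions(pre_words, usr_words):
    """Return the row of each user word within the pretrained word list."""
    index = {word: i for i, word in enumerate(pre_words)}
    return [index[word] for word in usr_words]


def extract_rows(src, dst, positions, para_dim):
    """Copy the selected rows from src to dst, writing a fresh header first."""
    row_bytes = 4 * para_dim
    dst.write(struct.pack(HEADER_FORMAT, 0, 4, len(positions) * para_dim))
    for pos in positions:
        # Rows start right after the header; each row holds para_dim floats.
        src.seek(HEADER_SIZE + pos * row_bytes)
        dst.write(src.read(row_bytes))


# Build a tiny in-memory "pretrained" model: three words, two floats each.
src = io.BytesIO()
src.write(struct.pack(HEADER_FORMAT, 0, 4, 6))
src.write(struct.pack("<6f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

positions = row_positions(["a", "b", "c"], ["a", "c"])
dst = io.BytesIO()
extract_rows(src, dst, positions, para_dim=2)
out = dst.getvalue()
```

With the pretrained words `a`, `b`, `c` and the user words `a`, `c`, the selected rows are 0 and 2, matching the example in the `get_row_index` docstring.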
--- a/v1_api_demo/model_zoo/embedding/paraconvert.py
+++ b/v1_api_demo/model_zoo/embedding/paraconvert.py
-#!/bin/env python
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""
-Example:
-    python paraconvert.py --b2t -i INPUT -o OUTPUT -d DIM
-    python paraconvert.py --t2b -i INPUT -o OUTPUT
-Options:
-    -h, --help  show this help message and exit
-    --b2t       convert parameter file of embedding model from binary to text
-    --t2b       convert parameter file of embedding model from text to binary
-    -i INPUT    input parameter file name
-    -o OUTPUT   output parameter file name
-    -d DIM      dimension of parameter
-"""
-from optparse import OptionParser
-import struct
-def binary2text(input, output, paraDim):
-    """
-    Convert a binary parameter file of an embedding model to a text file.
-    input: the name of input binary parameter file, the format is:
-           1) the first 16 bytes is filehead:
-                version(4 bytes): version of paddle, default = 0
-                floatSize(4 bytes): sizeof(float) = 4
-                paraCount(8 bytes): total number of parameter
-           2) the next (paraCount * 4) bytes is parameters, each has 4 bytes 
-    output: the name of output text parameter file, for example:
-           0,4,32156096
-           -0.7845433,1.1937413,-0.1704215,...
-           0.0000909,0.0009465,-0.0008813,...
-           ...
-           the format is:
-           1) the first line is filehead: 
-              version=0, floatSize=4, paraCount=32156096
-           2) other lines print the parameters
-              a) each line prints paraDim parameters separated by ','
-              b) there are paraCount/paraDim lines (embedding words)
-    paraDim: dimension of parameters 
-    """
-    fi = open(input, "rb")
-    fo = open(output, "w")
-    """
-    """
-    version, floatSize, paraCount = struct.unpack("iil", fi.read(16))
-    newHead = ','.join([str(version), str(floatSize), str(paraCount)])
-    print >> fo, newHead
-    bytes = 4 * int(paraDim)
-    format = "%df" % int(paraDim)
-    context = fi.read(bytes)
-    line = 0
-    while context:
-        numbers = struct.unpack(format, context)
-        lst = []
-        for i in numbers:
-            lst.append('%8.7f' % i)
-        print >> fo, ','.join(lst)
-        context = fi.read(bytes)
-        line += 1
-    fi.close()
-    fo.close()
-    print "binary2text finish, total", line, "lines"
-def get_para_count(input):
-    """
-    Compute the total number of embedding parameters in input text file. 
-    input: the name of input text file
-    """
-    numRows = 1
-    paraDim = 0
-    with open(input) as f:
-        line = f.readline()
-        paraDim = len(line.split(","))
-        for line in f:
-            numRows += 1
-    return numRows * paraDim
-def text2binary(input, output, paddle_head=True):
-    """
-    Convert a text parameter file of an embedding model to a binary file.
-    input: the name of input text parameter file, for example:
-           -0.7845433,1.1937413,-0.1704215,...
-           0.0000909,0.0009465,-0.0008813,... 
-           ...
-           the format is:
-           1) it doesn't have filehead
-           2) each line stores the same number of parameters,
-              separated by commas ','
-    output: the name of output binary parameter file, the format is:
-           1) the first 16 bytes is filehead: 
-             version(4 bytes), floatSize(4 bytes), paraCount(8 bytes)
-           2) the next (paraCount * 4) bytes is parameters, each has 4 bytes
-    """
-    fi = open(input, "r")
-    fo = open(output, "wb")
-    newHead = struct.pack("iil", 0, 4, get_para_count(input))
-    fo.write(newHead)
-    count = 0
-    for line in fi:
-        line = line.strip().split(",")
-        for i in range(0, len(line)):
-            binary_data = struct.pack("f", float(line[i]))
-            fo.write(binary_data)
-        count += 1
-    fi.close()
-    fo.close()
-    print "text2binary finish, total", count, "lines"
-def main():
-    """
-    Main entry for running paraconvert.py 
-    """
-    usage = "usage: \n" \
-            "python %prog --b2t -i INPUT -o OUTPUT -d DIM \n" \
-            "python %prog --t2b -i INPUT -o OUTPUT"
-    parser = OptionParser(usage)
-    parser.add_option(
-        "--b2t",
-        action="store_true",
-        help="convert parameter file of embedding model from binary to text")
-    parser.add_option(
-        "--t2b",
-        action="store_true",
-        help="convert parameter file of embedding model from text to binary")
-    parser.add_option(
-        "-i", action="store", dest="input", help="input parameter file name")
-    parser.add_option(
-        "-o", action="store", dest="output", help="output parameter file name")
-    parser.add_option(
-        "-d", action="store", dest="dim", help="dimension of parameter")
-    (options, args) = parser.parse_args()
-    if options.b2t:
-        binary2text(options.input, options.output, options.dim)
-    if options.t2b:
-        text2binary(options.input, options.output)
-if __name__ == '__main__':
-    main()
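The docstrings of the removed converter describe the binary layout as a 16-byte header (version, floatSize, paraCount) followed by paraCount 4-byte floats. The following is a minimal Python 3 sketch of writing and reading that layout, not the original implementation. The names `write_params` and `read_params` are hypothetical, and the explicit `"<iiq"` format is an assumption: the original uses the native `"iil"` format, whose size depends on the platform, while `"<iiq"` fixes the header at 16 little-endian bytes.

```python
import io
import struct

# Assumption: little-endian header of version, floatSize, paraCount (4 + 4 + 8 bytes).
HEADER_FORMAT = "<iiq"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def write_params(stream, rows):
    """Write the header, then every value as a 4-byte float."""
    values = [v for row in rows for v in row]
    stream.write(struct.pack(HEADER_FORMAT, 0, 4, len(values)))
    stream.write(struct.pack("<%df" % len(values), *values))


def read_params(stream, para_dim):
    """Return (version, float_size, para_count, rows) read from the stream."""
    version, float_size, para_count = struct.unpack(
        HEADER_FORMAT, stream.read(HEADER_SIZE))
    row_bytes = 4 * para_dim
    rows = []
    chunk = stream.read(row_bytes)
    # Stop at end of data; a short trailing chunk is ignored.
    while len(chunk) == row_bytes:
        rows.append(struct.unpack("<%df" % para_dim, chunk))
        chunk = stream.read(row_bytes)
    return version, float_size, para_count, rows


# Round trip through an in-memory buffer with two rows of dimension 2.
buf = io.BytesIO()
write_params(buf, [[0.5, -1.25], [2.0, 0.0]])
buf.seek(0)
result = read_params(buf, para_dim=2)
```

The demo values are exactly representable as 32-bit floats, so the round trip returns them unchanged. Arbitrary decimals would come back with single-precision rounding.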
--- a/v1_api_demo/model_zoo/embedding/pre_DictAndModel.sh
+++ b/v1_api_demo/model_zoo/embedding/pre_DictAndModel.sh
--- a/v1_api_demo/model_zoo/resnet/.gitignore
+++ b/v1_api_demo/model_zoo/resnet/.gitignore
--- a/v1_api_demo/model_zoo/resnet/classify.py
+++ b/v1_api_demo/model_zoo/resnet/classify.py
--- a/v1_api_demo/model_zoo/resnet/example/.gitignore
+++ b/v1_api_demo/model_zoo/resnet/example/.gitignore
--- a/v1_api_demo/model_zoo/resnet/example/__init__.py
+++ b/v1_api_demo/model_zoo/resnet/example/__init__.py
--- a/v1_api_demo/model_zoo/resnet/example/cat.jpg
+++ b/v1_api_demo/model_zoo/resnet/example/cat.jpg
--- a/v1_api_demo/model_zoo/resnet/example/dog.jpg
+++ b/v1_api_demo/model_zoo/resnet/example/dog.jpg
--- a/v1_api_demo/model_zoo/resnet/example/image_list_provider.py
+++ b/v1_api_demo/model_zoo/resnet/example/image_list_provider.py
--- a/v1_api_demo/model_zoo/resnet/example/test.list
+++ b/v1_api_demo/model_zoo/resnet/example/test.list
--- a/v1_api_demo/model_zoo/resnet/extract_fea_c++.sh
+++ b/v1_api_demo/model_zoo/resnet/extract_fea_c++.sh
--- a/v1_api_demo/model_zoo/resnet/extract_fea_py.sh
+++ b/v1_api_demo/model_zoo/resnet/extract_fea_py.sh
--- a/v1_api_demo/model_zoo/resnet/get_model.sh
+++ b/v1_api_demo/model_zoo/resnet/get_model.sh
--- a/v1_api_demo/model_zoo/resnet/load_feature.py
+++ b/v1_api_demo/model_zoo/resnet/load_feature.py
--- a/v1_api_demo/model_zoo/resnet/net_diagram.sh
+++ b/v1_api_demo/model_zoo/resnet/net_diagram.sh
--- a/v1_api_demo/model_zoo/resnet/predict.sh
+++ b/v1_api_demo/model_zoo/resnet/predict.sh
--- a/v1_api_demo/model_zoo/resnet/resnet.py
+++ b/v1_api_demo/model_zoo/resnet/resnet.py
--- a/v1_api_demo/quick_start/.gitignore
+++ b/v1_api_demo/quick_start/.gitignore
--- a/v1_api_demo/quick_start/api_predict.py
+++ b/v1_api_demo/quick_start/api_predict.py
--- a/v1_api_demo/quick_start/api_predict.sh
+++ b/v1_api_demo/quick_start/api_predict.sh
--- a/v1_api_demo/quick_start/api_train.py
+++ b/v1_api_demo/quick_start/api_train.py
--- a/v1_api_demo/quick_start/api_train.sh
+++ b/v1_api_demo/quick_start/api_train.sh
--- a/v1_api_demo/quick_start/cluster/cluster_train.sh
+++ b/v1_api_demo/quick_start/cluster/cluster_train.sh
--- a/v1_api_demo/quick_start/cluster/env.sh
+++ b/v1_api_demo/quick_start/cluster/env.sh
--- a/v1_api_demo/quick_start/cluster/pserver.sh
+++ b/v1_api_demo/quick_start/cluster/pserver.sh
--- a/v1_api_demo/quick_start/data/README.md
+++ b/v1_api_demo/quick_start/data/README.md
--- a/v1_api_demo/quick_start/data/get_data.sh
+++ b/v1_api_demo/quick_start/data/get_data.sh
--- a/v1_api_demo/quick_start/data/proc_from_raw_data/get_data.sh
+++ b/v1_api_demo/quick_start/data/proc_from_raw_data/get_data.sh
--- a/v1_api_demo/quick_start/data/proc_from_raw_data/preprocess.py
+++ b/v1_api_demo/quick_start/data/proc_from_raw_data/preprocess.py
--- a/v1_api_demo/quick_start/dataprovider_bow.py
+++ b/v1_api_demo/quick_start/dataprovider_bow.py
--- a/v1_api_demo/quick_start/dataprovider_emb.py
+++ b/v1_api_demo/quick_start/dataprovider_emb.py
--- a/v1_api_demo/quick_start/predict.sh
+++ b/v1_api_demo/quick_start/predict.sh
--- a/v1_api_demo/quick_start/train.sh
+++ b/v1_api_demo/quick_start/train.sh
--- a/v1_api_demo/quick_start/trainer_config.bidi-lstm.py
+++ b/v1_api_demo/quick_start/trainer_config.bidi-lstm.py
--- a/v1_api_demo/quick_start/trainer_config.cnn.py
+++ b/v1_api_demo/quick_start/trainer_config.cnn.py
--- a/v1_api_demo/quick_start/trainer_config.db-lstm.py
+++ b/v1_api_demo/quick_start/trainer_config.db-lstm.py
--- a/v1_api_demo/quick_start/trainer_config.emb.py
+++ b/v1_api_demo/quick_start/trainer_config.emb.py
--- a/v1_api_demo/quick_start/trainer_config.lr.py
+++ b/v1_api_demo/quick_start/trainer_config.lr.py
--- a/v1_api_demo/quick_start/trainer_config.lstm.py
+++ b/v1_api_demo/quick_start/trainer_config.lstm.py
--- a/v1_api_demo/quick_start/trainer_config.resnet-lstm.py
+++ b/v1_api_demo/quick_start/trainer_config.resnet-lstm.py
--- a/v1_api_demo/sequence_tagging/data/get_data.sh
+++ b/v1_api_demo/sequence_tagging/data/get_data.sh
--- a/v1_api_demo/sequence_tagging/data/test.list
+++ b/v1_api_demo/sequence_tagging/data/test.list
--- a/v1_api_demo/sequence_tagging/data/train.list
+++ b/v1_api_demo/sequence_tagging/data/train.list
--- a/v1_api_demo/sequence_tagging/dataprovider.py
+++ b/v1_api_demo/sequence_tagging/dataprovider.py
--- a/v1_api_demo/sequence_tagging/linear_crf.py
+++ b/v1_api_demo/sequence_tagging/linear_crf.py
--- a/v1_api_demo/sequence_tagging/readme.md
+++ b/v1_api_demo/sequence_tagging/readme.md
--- a/v1_api_demo/sequence_tagging/rnn_crf.py
+++ b/v1_api_demo/sequence_tagging/rnn_crf.py
--- a/v1_api_demo/sequence_tagging/train.sh
+++ b/v1_api_demo/sequence_tagging/train.sh
--- a/v1_api_demo/sequence_tagging/train_linear.sh
+++ b/v1_api_demo/sequence_tagging/train_linear.sh
--- a/v1_api_demo/traffic_prediction/README
+++ b/v1_api_demo/traffic_prediction/README
--- a/v1_api_demo/traffic_prediction/data/get_data.sh
+++ b/v1_api_demo/traffic_prediction/data/get_data.sh
--- a/v1_api_demo/traffic_prediction/dataprovider.py
+++ b/v1_api_demo/traffic_prediction/dataprovider.py
--- a/v1_api_demo/traffic_prediction/gen_result.py
+++ b/v1_api_demo/traffic_prediction/gen_result.py
--- a/v1_api_demo/traffic_prediction/predict.sh
+++ b/v1_api_demo/traffic_prediction/predict.sh
--- a/v1_api_demo/traffic_prediction/train.sh
+++ b/v1_api_demo/traffic_prediction/train.sh
--- a/v1_api_demo/traffic_prediction/trainer_config.py
+++ b/v1_api_demo/traffic_prediction/trainer_config.py
--- a/v1_api_demo/vae/README.md
+++ b/v1_api_demo/vae/README.md
--- a/v1_api_demo/vae/data/get_mnist_data.sh
+++ b/v1_api_demo/vae/data/get_mnist_data.sh
--- a/v1_api_demo/vae/dataloader.py
+++ b/v1_api_demo/vae/dataloader.py
--- a/v1_api_demo/vae/vae_conf.py
+++ b/v1_api_demo/vae/vae_conf.py
--- a/v1_api_demo/vae/vae_train.py
+++ b/v1_api_demo/vae/vae_train.py