Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into dockerfile

Merge branch develop

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into dockerfile
Merge branch develop
a68290fc · _青葱 · 45eb94e6 · e26f1123 · a68290fc · a68290fc
42 changed file
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -53,7 +53,7 @@ option(COVERALLS_UPLOAD "Package code coverage data to coveralls"       OFF)
 option(ON_TRAVIS        "Exclude special unit test on Travis CI"        OFF)
 option(WITH_C_API       "Compile PaddlePaddle with C-API(Prediction)"   OFF)
 # TODO: Only compile PaddlePaddle fluid version by WITH_FLUID option. 
-option(WITH_FLUID       "Compile PaddlePaddle fluid only(TODO)"         ON)
+option(WITH_FLUID       "Compile PaddlePaddle fluid only(TODO)"         OFF)
 option(WITH_GOLANG      "Compile PaddlePaddle with GOLANG"              OFF)
 option(GLIDE_INSTALL    "Download and install go dependencies "         ON)
 option(USE_NNPACK       "Compile PaddlePaddle with NNPACK library"      OFF)

--- a/doc/fluid/design/dist_train/distributed_architecture.md
+++ b/doc/fluid/design/dist_train/distributed_architecture.md
@@ -155,7 +155,7 @@ Cluster environment.
 <img src="src/remote_executor.png" width="500" align="center" />

 `RemoteExecutor.run` sends the `ProgramDesc` and
-[TrainingJob](https://github.com/PaddlePaddle/cloud/blob/develop/doc/autoscale/README.md#training-job-resource)
+[TrainingJob](https://github.com/PaddlePaddle/cloud/blob/unreleased-tpr/doc/autoscale/README.md#training-job-resource)
 to a server in the cluster which executes `RemoteExecutor.listen`. This server is responsible
 to start the final Kubernetes Jobs to run the different role of `ProgramDesc` from `ConfigMap`.


--- a/doc/fluid/dev/api_doc_std_cn.md
+++ b/doc/fluid/dev/api_doc_std_cn.md
+# API注释撰写标准
+
+- [API注释模块](#API注释模块)
+- [格式及示例](#格式及示例)
+- [完整示例](#完整示例)
+
+
+## API注释模块
+
+API文档须包含以下几个模块（排列顺序为文档撰写顺序）：
+
+- Python API Definition
+
+  API的代码定义。
+
+- Function Description
+
+  API的功能描述。描述该API的含义、作用或对输入所做的操作，及参考文献和对应链接（如果有），必要时给出公式，并解释公式中关键变量的含义。
+
+- Args Description
+
+  API参数介绍。按代码定义中的参数顺序逐个介绍，介绍内容包含数据类型、默认值（如果有）、含义等。
+
+- Returns
+
+  API返回值介绍。介绍返回值含义，必要时给出对应的形状。若返回值为包含多个参数的tuple，则按顺序逐个介绍各参数。
+
+- Raises（如果有）
+
+  可能抛出的异常或错误及可能的产生原因，当可能抛出多种异常或错误时应分条列出。
+
+- Note（如果有）
+
+  注意事项。当有多条注意事项时，应分条列出。
+
+- Examples
+
+  API的使用示例。
+
+
+## 格式及示例
+
+API文档须使用reStructuredText格式撰写，该格式详情请参考[链接](http://sphinx-doc-zh.readthedocs.io/en/latest/rest.html)。API文档各模块的内容格式及示例如下（以下以fc为例进行说明）：
+
+- Python API Definition
+
+  - 格式：
+    
+      [Python API Definition]
+    
+  - 示例
+  
+      ```
+      fc(input,
+         size,
+         num_flatten_dims=1,
+         param_attr=None,
+         bias_attr=None,
+         act=None,
+         name=None,
+         main_program=None,
+         startup_program=None)
+      ```
+
+- Function Description
+  
+  - 格式
+
+      本模块应包含以下内容（排列顺序为文档撰写顺序）：
+
+      [Function Description]
+  
+      [Formula]
+    
+      [Symbols' Descriptions if necessary]
+    
+      [References if necessary]
+ 
+  - 示例
+
+      [Function Description]
+
+       ```
+       **Fully Connected Layer**
+
+       The fully connected layer can take multiple tensors as its inputs. It
+       creates a variable called weights for each input tensor, which represents
+       a fully connected weight matrix from each input unit to each output unit.
+       The fully connected layer multiplies each input tensor with its coresponding
+       weight to produce an output Tensor. If multiple input tensors are given,
+       the results of multiple multiplications will be sumed up. If bias_attr is
+       not None, a bias variable will be created and added to the output. Finally,
+       if activation is not None, it will be applied to the output as well.
+       ```
+
+      [Formula]
+
+      ```
+      This process can be formulated as follows:
+
+      .. math::
+
+           Out = Act({\sum_{i=0}^{N-1}X_iW_i + b})
+      ```
+
+      [Symbols' Descriptions if necessary]
+
+      ```
+      In the above equation:
+
+      * :math:`N`: Number of the input.
+      * :math:`X_i`: The input tensor.
+      * :math:`W`: The weights created by this layer.
+      * :math:`b`: The bias parameter created by this layer (if needed).
+      * :math:`Act`: The activation function.
+      * :math:`Out`: The output tensor.
+      ```
+
+      [References if necessary]
+
+      因fc没有必要列出的参考文献，故该内容省略。其他情况下需明确给出对应的参考文献和对应连接，以 layer_norm 为例：
+      
+      ```
+      Refer to `Layer Normalization <https://arxiv.org/pdf/1607.06450v1.pdf>`_ for more details.
+      ```
+  
+
+- Args Description
+  
+  - 格式
+  
+      \[Arg's Name\][(Data Type, Default Value)][Description]
+  
+  - 示例
+
+      fc的部分参数注释如下：
+
+      ```
+      Args:
+          input (Variable|list of Variable): The input tensor(s) of this layer, and the dimension of
+              the input tensor(s) is at least 2.
+          param_attr (ParamAttr|list of ParamAttr, default None): The parameter attribute for learnable
+              parameters/weights of this layer.
+          name (str, default None): The name of this layer.
+      ```
+
+- Returns
+  
+  - 格式
+  
+      [Name][Shape]
+  
+  - 示例
+  
+      ```
+      Returns:
+          A tensor variable storing the transformation result.
+      ```
+  
+      当返回值为包含多个参数的tuple时，应按顺序逐个介绍各参数，以dynamic_lstm为例：
+  
+      ```
+      Returns:
+          A tuple containing:
+            The hidden state of LSTM whose shape is (T X D).
+            The cell state of LSTM whose shape is (T X D).
+      ```
+  
+- Raises
+
+  - 格式
+  
+      [Exception Type][Condition]
+
+  - 示例
+  
+      ```
+      Raises:
+          ValueError: If the rank of the input is less than 2.
+      ```
+
+- Note
+
+  - 格式
+  
+     [Note]
+
+  - 示例
+
+      fc没有注意事项，故该模块省略不写。如有注意事项应明确给出，当有多条注意事项，须分条列出，以scaled\_dot\_product\_attention为例：
+
+      ```
+      Note:
+          1. When num_heads > 1, three linear projections are learned respectively
+             to map input queries, keys and values into queries', keys' and values'.
+             queries', keys' and values' have the same shapes with queries, keys
+             and values.
+          2. When num_heads == 1, scaled_dot_product_attention has no learnable
+             parameters.
+      ```
+  
+- Examples
+
+  - 格式
+
+      \[Python Code Snipper]
+  
+  - 示例
+  
+      ```
+      Examples:
+          .. code-block:: python
+
+            data = fluid.layers.data(name="data", shape=[32, 32], dtype="float32")
+            fc = fluid.layers.fc(input=data, size=1000, act="tanh")
+      ```
+
+## 完整示例
+
+fc 的完整注释见[示例](src/fc.py)。
--- a/doc/fluid/dev/src/fc.py
+++ b/doc/fluid/dev/src/fc.py
+#   Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+def fc(input,
+       size,
+       num_flatten_dims=1,
+       param_attr=None,
+       bias_attr=None,
+       act=None,
+       name=None):
+    """
+    **Fully Connected Layer**
+
+    The fully connected layer can take multiple tensors as its inputs. It
+    creates a variable called weights for each input tensor, which represents
+    a fully connected weight matrix from each input unit to each output unit.
+    The fully connected layer multiplies each input tensor with its coresponding
+    weight to produce an output Tensor. If multiple input tensors are given,
+    the results of multiple multiplications will be sumed up. If bias_attr is
+    not None, a bias variable will be created and added to the output. Finally,
+    if activation is not None, it will be applied to the output as well.
+
+    This process can be formulated as follows:
+
+    .. math::
+
+        Out = Act({\sum_{i=0}^{N-1}X_iW_i + b})
+
+    In the above equation:
+
+    * :math:`N`: Number of the input.
+    * :math:`X_i`: The input tensor.
+    * :math:`W`: The weights created by this layer.
+    * :math:`b`: The bias parameter created by this layer (if needed).
+    * :math:`Act`: The activation function.
+    * :math:`Out`: The output tensor.
+
+    Args:
+        input (Variable|list of Variable): The input tensor(s) of this layer, and the dimension of
+            the input tensor(s) is at least 2.
+        size(int): The number of output units in this layer.
+        num_flatten_dims (int, default 1): The fc layer can accept an input tensor with more than
+            two dimensions. If this happens, the multidimensional tensor will first be flattened
+            into a 2-dimensional matrix. The parameter `num_flatten_dims` determines how the input
+            tensor is flattened: the first `num_flatten_dims` (inclusive, index starts from 1)
+            dimensions will be flatten to form the first dimension of the final matrix (height of
+            the matrix), and the rest `rank(X) - num_flatten_dims` dimensions are flattened to
+            form the second dimension of the final matrix (width of the matrix). For example, suppose
+            `X` is a 6-dimensional tensor with a shape [2, 3, 4, 5, 6], and `num_flatten_dims` = 3.
+            Then, the flattened matrix will have a shape [2 x 3 x 4, 5 x 6] = [24, 30].
+        param_attr (ParamAttr|list of ParamAttr, default None): The parameter attribute for learnable
+            parameters/weights of this layer.
+        bias_attr (ParamAttr|list of ParamAttr, default None): The parameter attribute for the bias
+            of this layer. If it is set to None, no bias will be added to the output units.
+        act (str, default None): Activation to be applied to the output of this layer.
+        name (str, default None): The name of this layer.
+
+    Returns:
+        A tensor variable storing the transformation result.
+
+    Raises:
+        ValueError: If rank of the input tensor is less than 2.
+
+    Examples:
+        .. code-block:: python
+
+          data = fluid.layers.data(name="data", shape=[32, 32], dtype="float32")
+          fc = fluid.layers.fc(input=data, size=1000, act="tanh")
+    """
--- a/paddle/CMakeLists.txt
+++ b/paddle/CMakeLists.txt
-add_subdirectory(cuda)
-add_subdirectory(function)
-add_subdirectory(utils)
-add_subdirectory(math)
-add_subdirectory(gserver)
-add_subdirectory(parameter)
-add_subdirectory(testing)
-
-if(MOBILE_INFERENCE)
-  add_subdirectory(capi)
-else()
-  add_subdirectory(pserver)
-  add_subdirectory(trainer)
-  add_subdirectory(scripts)
+if(NOT WITH_FLUID)
+  add_subdirectory(cuda)
+  add_subdirectory(function)
+  add_subdirectory(utils)
+  add_subdirectory(math)
+  add_subdirectory(gserver)
+  add_subdirectory(parameter)

-  if(WITH_C_API)
+  if(MOBILE_INFERENCE)
    add_subdirectory(capi)
-  endif()
+  else()
+    add_subdirectory(pserver)
+    add_subdirectory(trainer)
+    add_subdirectory(scripts)

-  if(NOT ANDROID AND NOT IOS)
-    add_subdirectory(fluid)
-  endif()
+    if(WITH_C_API)
+      add_subdirectory(capi)
+    endif()

-  if(WITH_SWIG_PY)
-    add_subdirectory(api)
+    if(WITH_SWIG_PY)
+      add_subdirectory(api)
+    endif()
  endif()
 endif()
+
+add_subdirectory(testing)
+if(NOT MOBILE_INFERENCE AND NOT ANDROID AND NOT IOS)
+  add_subdirectory(fluid)
+endif()
--- a/paddle/fluid/framework/channel.h
+++ b/paddle/fluid/framework/channel.h
@@ -15,23 +15,43 @@ limitations under the License. */
 #pragma once

 #include <stddef.h>  // for size_t
+#include <condition_variable>
 #include <typeindex>
 #include "paddle/fluid/platform/enforce.h"

 namespace paddle {
 namespace framework {

+enum class ChannelAction {
+  SEND = 0,
+  RECEIVE = 1,
+  CLOSE = 2,
+};
+
 // Channel is the abstract class of buffered and un-buffered channels.
 template <typename T>
 class Channel {
 public:
+  virtual bool CanSend() = 0;
+  virtual bool CanReceive() = 0;
  virtual bool Send(T*) = 0;
  virtual bool Receive(T*) = 0;
  virtual size_t Cap() = 0;
  virtual void Lock() = 0;
+
  virtual void Unlock() = 0;
+  virtual bool IsClosed() = 0;
  virtual void Close() = 0;
  virtual ~Channel() {}
+
+  virtual void AddToSendQ(const void* referrer, T* data,
+                          std::shared_ptr<std::condition_variable_any> cond,
+                          std::function<bool(ChannelAction)> cb) = 0;
+  virtual void AddToReceiveQ(const void* referrer, T* data,
+                             std::shared_ptr<std::condition_variable_any> cond,
+                             std::function<bool(ChannelAction)> cb) = 0;
+  virtual void RemoveFromSendQ(const void* referrer) = 0;
+  virtual void RemoveFromReceiveQ(const void* referrer) = 0;
 };

 // Forward declaration of channel implementations.
@@ -80,6 +100,27 @@ class ChannelHolder {
    return channel != nullptr ? channel->Receive(data) : false;
  }

+  bool IsClosed() {
+    if (IsInitialized()) {
+      return holder_->IsClosed();
+    }
+    return false;
+  }
+
+  bool CanSend() {
+    if (IsInitialized()) {
+      return holder_->CanSend();
+    }
+    return false;
+  }
+
+  bool CanReceive() {
+    if (IsInitialized()) {
+      return holder_->CanReceive();
+    }
+    return false;
+  }
+
  void close() {
    if (IsInitialized()) holder_->Close();
  }
@@ -97,6 +138,50 @@ class ChannelHolder {
    if (IsInitialized()) holder_->Unlock();
  }

+  template <typename T>
+  void AddToSendQ(const void* referrer, T* data,
+                  std::shared_ptr<std::condition_variable_any> cond,
+                  std::function<bool(ChannelAction)> cb) {
+    if (IsInitialized()) {
+      Channel<T>* channel = static_cast<Channel<T>*>(holder_->Ptr());
+      if (channel != nullptr) {
+        channel->AddToSendQ(referrer, data, cond, cb);
+      }
+    }
+  }
+
+  template <typename T>
+  void AddToReceiveQ(const void* referrer, T* data,
+                     std::shared_ptr<std::condition_variable_any> cond,
+                     std::function<bool(ChannelAction)> cb) {
+    if (IsInitialized()) {
+      Channel<T>* channel = static_cast<Channel<T>*>(holder_->Ptr());
+      if (channel != nullptr) {
+        channel->AddToReceiveQ(referrer, data, cond, cb);
+      }
+    }
+  }
+
+  template <typename T>
+  void RemoveFromSendQ(const void* referrer) {
+    if (IsInitialized()) {
+      Channel<T>* channel = static_cast<Channel<T>*>(holder_->Ptr());
+      if (channel != nullptr) {
+        channel->RemoveFromSendQ(referrer);
+      }
+    }
+  }
+
+  template <typename T>
+  void RemoveFromReceiveQ(const void* referrer) {
+    if (IsInitialized()) {
+      Channel<T>* channel = static_cast<Channel<T>*>(holder_->Ptr());
+      if (channel != nullptr) {
+        channel->RemoveFromReceiveQ(referrer);
+      }
+    }
+  }
+
  inline bool IsInitialized() const { return holder_ != nullptr; }

  inline const std::type_index Type() {
@@ -113,6 +198,9 @@ class ChannelHolder {
    virtual ~Placeholder() {}
    virtual const std::type_index Type() const = 0;
    virtual void* Ptr() const = 0;
+    virtual bool IsClosed() = 0;
+    virtual bool CanSend() = 0;
+    virtual bool CanReceive() = 0;
    virtual void Close() = 0;
    virtual void Lock() = 0;
    virtual void Unlock() = 0;
@@ -129,6 +217,27 @@ class ChannelHolder {

    virtual void* Ptr() const { return static_cast<void*>(channel_.get()); }

+    virtual bool IsClosed() {
+      if (channel_) {
+        return channel_->IsClosed();
+      }
+      return false;
+    }
+
+    virtual bool CanSend() {
+      if (channel_) {
+        return channel_->CanSend();
+      }
+      return false;
+    }
+
+    virtual bool CanReceive() {
+      if (channel_) {
+        return channel_->CanReceive();
+      }
+      return false;
+    }
+
    virtual void Close() {
      if (channel_) channel_->Close();
    }

--- a/paddle/fluid/framework/channel_impl.h
+++ b/paddle/fluid/framework/channel_impl.h
@@ -29,32 +29,50 @@ class ChannelImpl : public paddle::framework::Channel<T> {
  friend void paddle::framework::CloseChannel<T>(Channel<T> *);

 public:
+  virtual bool CanSend();
+  virtual bool CanReceive();
  virtual bool Send(T *);
  virtual bool Receive(T *);
  virtual size_t Cap() { return cap_; }
  virtual void Lock();
  virtual void Unlock();
+  virtual bool IsClosed();
  virtual void Close();
-
  ChannelImpl(size_t);
  virtual ~ChannelImpl();

+  virtual void AddToSendQ(const void *referrer, T *data,
+                          std::shared_ptr<std::condition_variable_any> cond,
+                          std::function<bool(ChannelAction)> cb);
+  virtual void AddToReceiveQ(const void *referrer, T *data,
+                             std::shared_ptr<std::condition_variable_any> cond,
+                             std::function<bool(ChannelAction)> cb);
+
+  virtual void RemoveFromSendQ(const void *referrer);
+  virtual void RemoveFromReceiveQ(const void *referrer);
+
 private:
  struct QueueMessage {
    T *data;
-    std::condition_variable_any cond;
+    std::shared_ptr<std::condition_variable_any> cond;
    bool chan_closed = false;
    bool completed = false;
+    const void *referrer;  // TODO(thuan): figure out better way to do this
+    std::function<bool(ChannelAction)> callback;

-    QueueMessage(T *item) : data(item) {}
+    QueueMessage(T *item)
+        : data(item), cond(std::make_shared<std::condition_variable_any>()) {}
+
+    QueueMessage(T *item, std::shared_ptr<std::condition_variable_any> cond)
+        : data(item), cond(cond) {}

    void Wait(std::unique_lock<std::recursive_mutex> &lock) {
-      cond.wait(lock, [this]() { return completed; });
+      cond->wait(lock, [this]() { return completed; });
    }

    void Notify() {
      completed = true;
-      cond.notify_all();
+      cond->notify_all();
    }
  };

@@ -87,6 +105,18 @@ ChannelImpl<T>::ChannelImpl(size_t capacity)
  PADDLE_ENFORCE_GE(capacity, 0);
 }

+template <typename T>
+bool ChannelImpl<T>::CanSend() {
+  std::lock_guard<std::recursive_mutex> lock{mu_};
+  return !closed_ && (!recvq.empty() || buf_.size() < cap_);
+}
+
+template <typename T>
+bool ChannelImpl<T>::CanReceive() {
+  std::lock_guard<std::recursive_mutex> lock{mu_};
+  return !(closed_ && buf_.empty()) && (!sendq.empty() || buf_.size() > 0);
+}
+
 template <typename T>
 bool ChannelImpl<T>::Send(T *item) {
  send_ctr++;
@@ -105,7 +135,24 @@ bool ChannelImpl<T>::Send(T *item) {
    std::shared_ptr<QueueMessage> m = recvq.front();
    recvq.pop_front();
    // Do the data transfer
-    *(m->data) = std::move(*item);
+    // We will do this data transfer if either of the following
+    // cases are true
+    // 1. callback == nullptr // This means it was a regular channel send
+    // 2. callback returns true
+    bool do_send = true;
+    if (m->callback != nullptr) do_send = m->callback(ChannelAction::SEND);
+    if (do_send)
+      *(m->data) = std::move(*item);
+    else
+      // We cannot do the data transfer because
+      // this QueueMessage was added by Select
+      // and some other case was executed.
+      // So call the Send function again.
+      // We do not care about notifying other
+      // because they would have been notified
+      // by the executed select case.
+      return Send(item);
+
    // Wake up the blocked process and unlock
    m->Notify();
    lock.unlock();
@@ -150,7 +197,25 @@ bool ChannelImpl<T>::Receive(T *item) {
    std::shared_ptr<QueueMessage> m = sendq.front();
    sendq.pop_front();
    // Do the data transfer
-    *item = std::move(*(m->data));
+    // We will do this data transfer if either of the following
+    // cases are true
+    // 1. callback == nullptr // This means it was a regular channel send
+    // 2. callback returns true
+    bool do_receive = true;
+    if (m->callback != nullptr)
+      do_receive = m->callback(ChannelAction::RECEIVE);
+    if (do_receive)
+      *item = std::move(*(m->data));
+    else
+      // We cannot do the data transfer because
+      // this QueueMessage was added by Select
+      // and some other case was executed.
+      // So call the Receive function again.
+      // We do not care about notifying other
+      // because they would have been notified
+      // by the executed select case.
+      return Receive(item);
+
    // Wake up the blocked process and unlock
    m->Notify();
    lock.unlock();
@@ -186,6 +251,12 @@ void ChannelImpl<T>::Unlock() {
  mu_.unlock();
 }

+template <typename T>
+bool ChannelImpl<T>::IsClosed() {
+  std::lock_guard<std::recursive_mutex> lock{mu_};
+  return closed_;
+}
+
 template <typename T>
 void ChannelImpl<T>::Close() {
  std::unique_lock<std::recursive_mutex> lock{mu_};
@@ -203,6 +274,12 @@ void ChannelImpl<T>::Close() {
    std::shared_ptr<QueueMessage> m = recvq.front();
    recvq.pop_front();
    m->chan_closed = true;
+
+    // Execute callback function (if any)
+    if (m->callback != nullptr) {
+      m->callback(ChannelAction::CLOSE);
+    }
+
    m->Notify();
  }

@@ -211,10 +288,72 @@ void ChannelImpl<T>::Close() {
    std::shared_ptr<QueueMessage> m = sendq.front();
    sendq.pop_front();
    m->chan_closed = true;
+
+    // Execute callback function (if any)
+    if (m->callback != nullptr) {
+      m->callback(ChannelAction::CLOSE);
+    }
+
    m->Notify();
  }
 }

+template <typename T>
+void ChannelImpl<T>::AddToSendQ(
+    const void *referrer, T *data,
+    std::shared_ptr<std::condition_variable_any> cond,
+    std::function<bool(ChannelAction)> cb) {
+  std::lock_guard<std::recursive_mutex> lock{mu_};
+  auto m = std::make_shared<QueueMessage>(data, cond);
+  m->referrer = referrer;
+  m->callback = cb;
+  sendq.push_back(m);
+}
+
+template <typename T>
+void ChannelImpl<T>::AddToReceiveQ(
+    const void *referrer, T *data,
+    std::shared_ptr<std::condition_variable_any> cond,
+    std::function<bool(ChannelAction)> cb) {
+  std::lock_guard<std::recursive_mutex> lock{mu_};
+  auto m = std::make_shared<QueueMessage>(data, cond);
+  m->referrer = referrer;
+  m->callback = cb;
+  recvq.push_back(m);
+}
+
+template <typename T>
+void ChannelImpl<T>::RemoveFromSendQ(const void *referrer) {
+  std::lock_guard<std::recursive_mutex> lock{mu_};
+
+  for (auto it = sendq.begin(); it != sendq.end();) {
+    std::shared_ptr<QueueMessage> sendMsg = (std::shared_ptr<QueueMessage>)*it;
+
+    if (sendMsg->referrer == referrer) {
+      it = sendq.erase(it);
+      send_ctr--;
+    } else {
+      ++it;
+    }
+  }
+}
+
+template <typename T>
+void ChannelImpl<T>::RemoveFromReceiveQ(const void *referrer) {
+  std::lock_guard<std::recursive_mutex> lock{mu_};
+
+  for (auto it = recvq.begin(); it != recvq.end();) {
+    std::shared_ptr<QueueMessage> recvMsg = (std::shared_ptr<QueueMessage>)*it;
+
+    if (recvMsg->referrer == referrer) {
+      it = recvq.erase(it);
+      recv_ctr--;
+    } else {
+      ++it;
+    }
+  }
+}
+
 template <typename T>
 ChannelImpl<T>::~ChannelImpl() {
  Close();

--- a/paddle/fluid/framework/operator.cc
+++ b/paddle/fluid/framework/operator.cc
@@ -74,9 +74,6 @@ void OperatorBase::Run(const Scope& scope, const platform::Place& place) {
    platform::SetDeviceId(dev_id);
 #endif
  }
-  // profile
-  auto* dev_ctx = platform::DeviceContextPool::Instance().Get(place);
-  platform::RecordEvent record_event(Type(), dev_ctx);
  RunImpl(scope, place);
 }

@@ -445,15 +442,7 @@ class RuntimeInferShapeContext : public InferShapeContext {
  }

  std::vector<DDim> GetRepeatedDims(const std::string& name) const override {
-    Variable* var = scope_.FindVar(name);
-    if (var->IsType<ReaderHolder>()) {
-      return var->Get<ReaderHolder>().shapes();
-    } else {
-      PADDLE_THROW(
-          "Only ReaderHolder support 'GetRepeatedDims', but Variable %s's "
-          "type_id is %s.",
-          name, var->Type().name());
-    }
+    PADDLE_THROW("Only compile time support this method");
  }

  void SetDim(const std::string& name, const DDim& dim) override {
@@ -470,15 +459,7 @@ class RuntimeInferShapeContext : public InferShapeContext {

  void SetRepeatedDims(const std::string& name,
                       const std::vector<DDim>& dims) override {
-    Variable* var = scope_.FindVar(name);
-    if (var->IsType<ReaderHolder>()) {
-      var->GetMutable<ReaderHolder>()->set_shapes(dims);
-    } else {
-      PADDLE_THROW(
-          "Only ReaderHolder support 'SetRepeatedDims', but Variable %s's "
-          "type_id is %s.",
-          name, var->Type().name());
-    }
+    PADDLE_THROW("Only compile time support this method");
  }

  proto::VarType::Type GetVarType(const std::string& name) const override {
@@ -501,6 +482,10 @@ void OperatorWithKernel::RunImpl(const Scope& scope,
  this->InferShape(&infer_shape_ctx);
  platform::DeviceContextPool& pool = platform::DeviceContextPool::Instance();
  auto* dev_ctx = pool.Get(place);
+
+  // For profiling, don't move out of this function because that will result
+  // in the failure of multi-GPU profiling.
+  platform::RecordEvent record_event(Type(), dev_ctx);
  // check if op[type] has kernel registered.
  auto& all_op_kernels = AllOpKernels();
  auto kernels_iter = all_op_kernels.find(type_);

--- a/paddle/fluid/framework/reader.cc
+++ b/paddle/fluid/framework/reader.cc
@@ -16,14 +16,22 @@

 namespace paddle {
 namespace framework {
+ReaderBase::~ReaderBase() {}

-DDim ReaderBase::shape(size_t idx) const {
-  PADDLE_ENFORCE_LT(
-      idx, shapes_.size(),
-      "Cannot get the %d'th shape, 'shapes_' only has %d elements.", idx,
-      shapes_.size());
-  return shapes_[idx];
-}
+FileReader::FileReader(const std::vector<DDim> &dims) : dims_(dims) {}
+
+void FileReader::ReadNext(std::vector<LoDTensor> *out) {
+  ReadNextImpl(out);
+  PADDLE_ENFORCE_EQ(out->size(), dims_.size());
+  for (size_t i = 0; i < dims_.size(); ++i) {
+    auto &actual = out->at(i).dims();
+    auto &expect = dims_[i];

+    PADDLE_ENFORCE_EQ(actual.size(), expect.size());
+    for (int j = 0; j < actual.size(); ++j) {
+      PADDLE_ENFORCE(actual[i] == expect[i] || expect[i] == -1);
+    }
+  }
+}
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/fluid/framework/reader.h
+++ b/paddle/fluid/framework/reader.h
@@ -16,40 +16,29 @@

 #include "paddle/fluid/framework/ddim.h"
 #include "paddle/fluid/framework/lod_tensor_array.h"
+#include "paddle/fluid/platform/place.h"
+
+#include <memory>
+#include <thread>
+#include <vector>

 namespace paddle {
 namespace framework {

 class ReaderBase {
 public:
-  explicit ReaderBase(const std::vector<DDim>& shapes) : shapes_(shapes) {
-    PADDLE_ENFORCE(!shapes_.empty());
-  }
  virtual void ReadNext(std::vector<LoDTensor>* out) = 0;

  virtual void ReInit() = 0;

-  DDim shape(size_t idx) const;
-  std::vector<DDim> shapes() const { return shapes_; }
-  void set_shapes(const std::vector<DDim>& shapes) { shapes_ = shapes; }
-
  virtual bool HasNext() const = 0;

-  virtual ~ReaderBase() {}
-
- protected:
-  std::vector<DDim> shapes_;
-};
-
-class FileReader : public ReaderBase {
- public:
-  explicit FileReader(const std::vector<DDim>& shapes) : ReaderBase(shapes) {}
+  virtual ~ReaderBase();
 };

 class DecoratedReader : public ReaderBase {
 public:
-  explicit DecoratedReader(ReaderBase* reader)
-      : ReaderBase(reader->shapes()), reader_(reader) {
+  explicit DecoratedReader(ReaderBase* reader) : ReaderBase(), reader_(reader) {
    PADDLE_ENFORCE_NOT_NULL(reader_);
  }

@@ -61,6 +50,19 @@ class DecoratedReader : public ReaderBase {
  ReaderBase* reader_;
 };

+class FileReader : public ReaderBase {
+ public:
+  explicit FileReader(const std::vector<DDim>& dims);
+
+  void ReadNext(std::vector<LoDTensor>* out) override;
+
+ protected:
+  virtual void ReadNextImpl(std::vector<LoDTensor>* out) = 0;
+
+ private:
+  std::vector<DDim> dims_;
+};
+
 // The ReaderHolder is used as reader' unified wrapper,
 // making it easier to access different type reader in Variables.
 class ReaderHolder {
@@ -78,19 +80,6 @@ class ReaderHolder {
    reader_->ReInit();
  }

-  DDim shape(size_t idx) const {
-    PADDLE_ENFORCE_NOT_NULL(reader_);
-    return reader_->shape(idx);
-  }
-  std::vector<DDim> shapes() const {
-    PADDLE_ENFORCE_NOT_NULL(reader_);
-    return reader_->shapes();
-  }
-  void set_shapes(const std::vector<DDim>& shapes) {
-    PADDLE_ENFORCE_NOT_NULL(reader_);
-    reader_->set_shapes(shapes);
-  }
-
  bool HasNext() const { return reader_->HasNext(); }

 private:

--- a/paddle/fluid/operators/assign_op.cc
+++ b/paddle/fluid/operators/assign_op.cc
@@ -56,6 +56,7 @@ class AssignFunctor {
 private:
  void copy_tensor(const framework::LoDTensor &lod_tensor,
                   framework::LoDTensor *out) const {
+    if (lod_tensor.numel() == 0) return;
    auto &out_tensor = *out;
    TensorCopy(lod_tensor, lod_tensor.place(), dev_ctx_, &out_tensor);
    out_tensor.set_lod(lod_tensor.lod());

--- a/paddle/fluid/operators/mul_op.cc
+++ b/paddle/fluid/operators/mul_op.cc
@@ -17,11 +17,14 @@ limitations under the License. */
 namespace paddle {
 namespace operators {

+using framework::OpKernelType;
 using framework::Tensor;

-class MulOpShapeInference : public framework::InferShapeBase {
+class MulOp : public framework::OperatorWithKernel {
 public:
-  void operator()(framework::InferShapeContext* ctx) const override {
+  using framework::OperatorWithKernel::OperatorWithKernel;
+
+  void InferShape(framework::InferShapeContext* ctx) const override {
    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) of MulOp should not be null.");
    PADDLE_ENFORCE(ctx->HasInput("Y"), "Input(Y) of MulOp should not be null.");
    PADDLE_ENFORCE(ctx->HasOutput("Out"),
@@ -122,7 +125,7 @@ or not. But the output only shares the LoD information with input $X$.
  }
 };

-class MulOpGrad : public framework::OperatorWithKernel {
+class MulGradOp : public framework::OperatorWithKernel {
 public:
  using framework::OperatorWithKernel::OperatorWithKernel;

@@ -156,10 +159,7 @@ class MulOpGrad : public framework::OperatorWithKernel {
 }  // namespace paddle

 namespace ops = paddle::operators;
-REGISTER_OPERATOR(mul, paddle::framework::OperatorWithKernel, ops::MulOpMaker,
-                  ops::MulOpShapeInference,
-                  paddle::framework::DefaultGradOpDescMaker<true>);
-REGISTER_OPERATOR(mul_grad, ops::MulOpGrad);
+REGISTER_OP(mul, ops::MulOp, ops::MulOpMaker, mul_grad, ops::MulGradOp);
 REGISTER_OP_CPU_KERNEL(
    mul, ops::MulKernel<paddle::platform::CPUDeviceContext, float>);
 REGISTER_OP_CPU_KERNEL(

--- a/paddle/fluid/operators/mul_op.cu.cc
+++ b/paddle/fluid/operators/mul_op.cu.cc
@@ -13,9 +13,11 @@ See the License for the specific language governing permissions and
 limitations under the License. */

 #include "paddle/fluid/operators/mul_op.h"
+#include "paddle/fluid/platform/float16.h"

 namespace ops = paddle::operators;
-REGISTER_OP_CUDA_KERNEL(
-    mul, ops::MulKernel<paddle::platform::CUDADeviceContext, float>);
-REGISTER_OP_CUDA_KERNEL(
-    mul_grad, ops::MulGradKernel<paddle::platform::CUDADeviceContext, float>);
+namespace plat = paddle::platform;
+REGISTER_OP_CUDA_KERNEL(mul, ops::MulKernel<plat::CUDADeviceContext, float>,
+                        ops::MulKernel<plat::CUDADeviceContext, plat::float16>);
+REGISTER_OP_CUDA_KERNEL(mul_grad,
+                        ops::MulGradKernel<plat::CUDADeviceContext, float>);
--- a/paddle/fluid/operators/mul_op.h
+++ b/paddle/fluid/operators/mul_op.h
@@ -48,7 +48,7 @@ class MulKernel : public framework::OpKernel<T> {
    }
    math::matmul<DeviceContext, T>(
        context.template device_context<DeviceContext>(), x_matrix, false,
-        y_matrix, false, 1, z, 0);
+        y_matrix, false, static_cast<T>(1), z, static_cast<T>(0));
    if (z_dim.size() != 2) {
      z->Resize(z_dim);
    }

--- a/paddle/fluid/operators/nccl_op.cu.cc
+++ b/paddle/fluid/operators/nccl_op.cu.cc
@@ -106,6 +106,8 @@ class NCCLReduceKernel : public framework::OpKernel<T> {
    T* recvbuffer = nullptr;
    if (root == gpu_id) {
      recvbuffer = out->mutable_data<T>(ctx.GetPlace());
+    } else {
+      out->Resize(framework::make_ddim({0}));
    }
    VLOG(3) << "gpu : " << gpu_id << " invoke reduce. send " << x->numel()
            << " recv " << out->numel();

--- a/paddle/fluid/operators/reader/create_double_buffer_reader_op.cc
+++ b/paddle/fluid/operators/reader/create_double_buffer_reader_op.cc
@@ -24,11 +24,31 @@ static constexpr size_t kDoubleBufferSize = 2;

 class DoubleBufferReader : public framework::DecoratedReader {
 public:
-  explicit DoubleBufferReader(ReaderBase* reader)
-      : DecoratedReader(reader),
-        buffer_(framework::MakeChannel<std::vector<framework::LoDTensor>>(
-            kDoubleBufferSize)) {
-    std::thread prefetch(&DoubleBufferReader::PrefetchThreadFunc, this);
+  struct Item {
+    Item() : ctx_(nullptr) {}
+
+    std::vector<framework::LoDTensor> payloads_;
+    platform::DeviceContext* ctx_;
+  };
+
+  explicit DoubleBufferReader(
+      ReaderBase* reader, platform::Place target_place = platform::CPUPlace())
+      : DecoratedReader(reader), place_(target_place) {
+    for (size_t i = 0; i < kDoubleBufferSize; ++i) {
+      if (platform::is_gpu_place(place_)) {
+#ifdef PADDLE_WITH_CUDA
+        ctxs_.emplace_back(new platform::CUDADeviceContext(
+            boost::get<platform::CUDAPlace>(place_)));
+#endif
+      }
+    }
+
+    start_thread();
+  }
+
+  void start_thread() {
+    buffer_ = framework::MakeChannel<Item>(kDoubleBufferSize);
+    std::thread prefetch([this] { PrefetchThreadFunc(); });
    prefetch.detach();
  }

@@ -42,7 +62,10 @@ class DoubleBufferReader : public framework::DecoratedReader {
 private:
  void PrefetchThreadFunc();

-  framework::Channel<std::vector<framework::LoDTensor>>* buffer_;
+  framework::Channel<Item>* buffer_;
+  platform::Place place_;
+  std::vector<std::unique_ptr<platform::DeviceContext>> ctxs_;
+  mutable Item local_buffer_;
 };

 class CreateDoubleBufferReaderOp : public framework::OperatorBase {
@@ -56,7 +79,20 @@ class CreateDoubleBufferReaderOp : public framework::OperatorBase {
                                        ->Get<framework::ReaderHolder>();
    auto* out = scope.FindVar(Output("Out"))
                    ->template GetMutable<framework::ReaderHolder>();
-    out->Reset(new DoubleBufferReader(underlying_reader.Get()));
+
+    auto place_str = Attr<std::string>("place");
+    platform::Place place;
+    if (place_str == "CPU") {
+      place = platform::CPUPlace();
+    } else {
+      std::istringstream sin(place_str);
+      sin.seekg(std::string("CUDA:").size(), std::ios::beg);
+      size_t num;
+      sin >> num;
+      place = platform::CUDAPlace(static_cast<int>(num));
+    }
+
+    out->Reset(new DoubleBufferReader(underlying_reader.Get(), place));
  }
 };

@@ -71,44 +107,73 @@ class CreateDoubleBufferReaderOpMaker : public DecoratedReaderMakerBase {
      It launches another thread to execute the 'underlying reader' asynchronously, 
      which prevents reading process from blocking subsequent training.
    )DOC");
+    std::unordered_set<std::string> enum_range;
+    constexpr size_t kMaxCUDADevs = 128;
+    for (size_t i = 0; i < kMaxCUDADevs; ++i) {
+      enum_range.insert(string::Sprintf("CUDA:%d", i));
+    }
+    enum_range.insert("CPU");
+    AddAttr<std::string>("place", "The double buffer place, default is CPU")
+        .SetDefault("CPU")
+        .InEnum({enum_range});
  }
 };

 void DoubleBufferReader::ReadNext(std::vector<framework::LoDTensor>* out) {
-  out->clear();
-  buffer_->Receive(out);
+  if (local_buffer_.payloads_.empty()) {
+    buffer_->Receive(&local_buffer_);
+  }
+
+  *out = local_buffer_.payloads_;
+  local_buffer_.payloads_.clear();
+  if (local_buffer_.ctx_) {
+    local_buffer_.ctx_->Wait();
+  }
 }

 void DoubleBufferReader::ReInit() {
  reader_->ReInit();
  buffer_->Close();
-  // The existing prefetch thread will terminate for the buffer_ is closed.
-  buffer_ = framework::MakeChannel<std::vector<framework::LoDTensor>>(
-      kDoubleBufferSize);
-  std::thread prefetch(&DoubleBufferReader::PrefetchThreadFunc, this);
-  prefetch.detach();
+  start_thread();
 }

 void DoubleBufferReader::PrefetchThreadFunc() {
  VLOG(5) << "A new prefetch thread starts.";
-  while (true) {
-    std::vector<framework::LoDTensor> batch;
-    reader_->ReadNext(&batch);
-    if (batch.empty()) {
-      // EOF
-      buffer_->Close();
-      VLOG(5) << "Reached the end of the file. The prefetch thread terminates.";
-      break;
+  size_t gpu_ctx_offset = 0;
+  while (reader_->HasNext()) {
+    Item batch;
+    reader_->ReadNext(&batch.payloads_);
+    if (platform::is_gpu_place(place_)) {
+      std::vector<framework::LoDTensor> gpu_batch;
+      auto& gpu_ctx = this->ctxs_[gpu_ctx_offset++];
+      gpu_ctx_offset %= this->ctxs_.size();
+      gpu_batch.resize(batch.payloads_.size());
+      for (size_t i = 0; i < batch.payloads_.size(); ++i) {
+        framework::TensorCopy(batch.payloads_[i], place_, *gpu_ctx,
+                              &gpu_batch[i]);
+        gpu_batch[i].set_lod(batch.payloads_[i].lod());
+      }
+      batch.ctx_ = gpu_ctx.get();
+      std::swap(gpu_batch, batch.payloads_);
    }
+
    if (!buffer_->Send(&batch)) {
      VLOG(5) << "WARNING: The double buffer channel has been closed. The "
                 "prefetch thread terminates.";
      break;
    }
  }
+  buffer_->Close();
 }

-bool DoubleBufferReader::HasNext() const { PADDLE_THROW("Not Implemented"); }
+bool DoubleBufferReader::HasNext() const {
+  if (local_buffer_.payloads_.empty()) {
+    bool ok = buffer_->Receive(&local_buffer_);
+    return ok;
+  } else {
+    return true;
+  }
+}

 }  // namespace reader
 }  // namespace operators

--- a/paddle/fluid/operators/reader/create_random_data_generator_op.cc
+++ b/paddle/fluid/operators/reader/create_random_data_generator_op.cc
@@ -19,11 +19,11 @@ namespace operators {
 namespace reader {

 template <typename T>
-class RandomDataGenerator : public framework::FileReader {
+class RandomDataGenerator : public framework::ReaderBase {
 public:
  RandomDataGenerator(const std::vector<framework::DDim>& shapes, float min,
                      float max)
-      : FileReader(shapes), min_(min), max_(max) {
+      : framework::ReaderBase(), min_(min), max_(max), shapes_(shapes) {
    PADDLE_ENFORCE_LE(
        min, max, "'min' shouldn't be greater than 'max'.(%f vs %f)", min, max);
    unsigned int seed = std::random_device()();
@@ -59,6 +59,7 @@ class RandomDataGenerator : public framework::FileReader {
  float max_;
  std::minstd_rand engine_;
  std::uniform_real_distribution<float> dist_;
+  std::vector<framework::DDim> shapes_;
 };

 template <typename T>

--- a/paddle/fluid/operators/reader/create_recordio_file_reader_op.cc
+++ b/paddle/fluid/operators/reader/create_recordio_file_reader_op.cc
@@ -20,21 +20,22 @@ namespace operators {
 namespace reader {
 class RecordIOFileReader : public framework::FileReader {
 public:
-  RecordIOFileReader(const std::string& filename,
-                     const std::vector<framework::DDim>& shapes)
-      : FileReader(shapes),
+  explicit RecordIOFileReader(const std::string& filename,
+                              const std::vector<framework::DDim>& dims)
+      : FileReader(dims),
        scanner_(filename),
        dev_ctx_(*platform::DeviceContextPool::Instance().Get(
            platform::CPUPlace())) {}

-  void ReadNext(std::vector<framework::LoDTensor>* out) override {
-    *out = framework::ReadFromRecordIO(scanner_, dev_ctx_);
-  }
-
  bool HasNext() const override { return scanner_.HasNext(); }

  void ReInit() override { scanner_.Reset(); }

+ protected:
+  void ReadNextImpl(std::vector<framework::LoDTensor>* out) override {
+    *out = framework::ReadFromRecordIO(scanner_, dev_ctx_);
+  }
+
 private:
  recordio::Scanner scanner_;
  const platform::DeviceContext& dev_ctx_;
@@ -54,12 +55,12 @@ class CreateRecordIOReaderOp : public framework::OperatorBase {
                      int(shape_concat.size()),
                      "The accumulate of all ranks should be equal to the "
                      "shape concat's length.");
-    std::vector<framework::DDim> shapes = RestoreShapes(shape_concat, ranks);
    std::string filename = Attr<std::string>("filename");

    auto* out = scope.FindVar(Output("Out"))
                    ->template GetMutable<framework::ReaderHolder>();
-    out->Reset(new RecordIOFileReader(filename, shapes));
+    out->Reset(
+        new RecordIOFileReader(filename, RestoreShapes(shape_concat, ranks)));
  }
 };

@@ -85,3 +86,5 @@ namespace reader = paddle::operators::reader;
 REGISTER_FILE_READER_OPERATOR(create_recordio_file_reader,
                              reader::CreateRecordIOReaderOp,
                              reader::CreateRecordIOReaderOpMaker);
+
+REGISTER_FILE_READER(recordio, reader::RecordIOFileReader);
--- a/paddle/fluid/operators/reader/create_shuffle_reader_op.cc
+++ b/paddle/fluid/operators/reader/create_shuffle_reader_op.cc
@@ -12,6 +12,9 @@
 // See the License for the specific language governing permissions and
 // limitations under the License.

+#include <random>
+#include "glog/logging.h"
+#include "paddle/fluid/operators/detail/safe_ref.h"
 #include "paddle/fluid/operators/reader/reader_op_registry.h"

 namespace paddle {
@@ -20,43 +23,53 @@ namespace reader {

 class ShuffleReader : public framework::DecoratedReader {
 public:
-  ShuffleReader(ReaderBase* reader, int buffer_size)
-      : DecoratedReader(reader), buffer_size_(buffer_size), iteration_pos_(0) {
-    buffer_.reserve(buffer_size);
+  ShuffleReader(ReaderBase* reader, size_t buffer_size, size_t seed = 0)
+      : DecoratedReader(reader), buffer_size_(buffer_size), seed_(seed) {
+    VLOG(10) << "Create shuffle reader of " << reader_;
+    if (seed_ == 0) {
+      std::random_device device;
+      seed_ = device();
+    }
+    ReadIntoBuffers();
  }

-  void ReadNext(std::vector<framework::LoDTensor>* out) override;
+  void ReadNext(std::vector<framework::LoDTensor>* out) override {
+    if (iteration_pos_ >= buffer_.size()) {
+      VLOG(10) << "Resetting shuffle buffer";
+      ReadIntoBuffers();
+    }
+    *out = buffer_[iteration_pos_++];
+  }

- private:
-  int buffer_size_;
-  std::vector<std::vector<framework::LoDTensor>> buffer_;
-  size_t iteration_pos_;
-};
+  bool HasNext() const override {
+    return iteration_pos_ < buffer_.size() || reader_->HasNext();
+  }

-void ShuffleReader::ReadNext(std::vector<framework::LoDTensor>* out) {
-  if (iteration_pos_ >= buffer_.size()) {
-    // Reload buffer with new data
+ private:
+  void ReadIntoBuffers() {
    buffer_.clear();
    buffer_.reserve(buffer_size_);
-    for (int i = 0; i < buffer_size_; ++i) {
-      buffer_.push_back(std::vector<framework::LoDTensor>());
-      reader_->ReadNext(&buffer_.back());
-      if (buffer_.back().empty()) {
-        buffer_.pop_back();
+    iteration_pos_ = 0;
+    PADDLE_ENFORCE(reader_->HasNext());
+    for (size_t i = 0; i < buffer_size_; ++i) {
+      if (!reader_->HasNext()) {
        break;
      }
+      buffer_.emplace_back();
+      reader_->ReadNext(&buffer_.back());
    }
-    // TODO(fengjiayi): 'std::random_shuffle' can be very slow. It needs to be
-    // optimize.
-    std::random_shuffle(buffer_.begin(), buffer_.end());
-    iteration_pos_ = 0;
+    std::mt19937 g(seed_);
+    std::shuffle(buffer_.begin(), buffer_.end(), g);
+    seed_ = g();  // update seed_;
+    VLOG(10) << "random buffer size = " << buffer_.size();
  }
-  out->clear();
-  if (!buffer_.empty()) {
-    std::swap(*out, buffer_[iteration_pos_++]);
-  }
-  // if buffer_ is empty, the 'out' will return as an empty vector.
-}
+
+  size_t buffer_size_;
+  std::vector<std::vector<framework::LoDTensor>> buffer_;
+
+  size_t iteration_pos_;
+  size_t seed_;
+};

 class CreateShuffleReaderOp : public framework::OperatorBase {
 public:
@@ -67,10 +80,10 @@ class CreateShuffleReaderOp : public framework::OperatorBase {
               const platform::Place& dev_place) const override {
    const auto& underlying_reader = scope.FindVar(Input("UnderlyingReader"))
                                        ->Get<framework::ReaderHolder>();
-    auto* out = scope.FindVar(Output("Out"))
-                    ->template GetMutable<framework::ReaderHolder>();
-    out->Reset(
-        new ShuffleReader(underlying_reader.Get(), Attr<int>("buffer_size")));
+    auto& var = detail::Ref(scope.FindVar(Output("Out")));
+    var.GetMutable<framework::ReaderHolder>()->Reset(
+        new ShuffleReader(underlying_reader.Get(),
+                          static_cast<size_t>(Attr<int>("buffer_size"))));
  }
 };


--- a/paddle/fluid/operators/reader/reader_op_registry.cc
+++ b/paddle/fluid/operators/reader/reader_op_registry.cc
@@ -31,6 +31,11 @@ std::vector<framework::DDim> RestoreShapes(const std::vector<int>& shape_concat,
  return res;
 }

+std::unordered_map<std::string, FileReaderCreator>& FileReaderRegistry() {
+  static std::unordered_map<std::string, FileReaderCreator> regs;
+  return regs;
+}
+
 FileReaderMakerBase::FileReaderMakerBase(
    framework::OpProtoAndCheckerMaker::OpProto* op_proto,
    framework::OpAttrChecker* op_checker)

--- a/paddle/fluid/operators/reader/reader_op_registry.h
+++ b/paddle/fluid/operators/reader/reader_op_registry.h
@@ -21,6 +21,20 @@ namespace paddle {
 namespace operators {
 namespace reader {

+using FileReaderCreator = std::function<framework::ReaderBase*(
+    const std::string&, const std::vector<framework::DDim>&)>;
+
+std::unordered_map<std::string, FileReaderCreator>& FileReaderRegistry();
+
+template <typename Reader>
+int RegisterFileReader(const std::string& filetype) {
+  FileReaderRegistry()[filetype] = [](
+      const std::string& fn, const std::vector<paddle::framework::DDim>& dim) {
+    return new Reader(fn, dim);
+  };
+  return 0;
+}
+
 extern std::vector<framework::DDim> RestoreShapes(
    const std::vector<int>& shape_concat, const std::vector<int>& ranks);

@@ -73,3 +87,15 @@ class DecoratedReaderMakerBase : public framework::OpProtoAndCheckerMaker {
                    paddle::operators::reader::DecoratedReaderInferShape, \
                    paddle::framework::EmptyGradOpMaker,                  \
                    paddle::operators::reader::DecoratedReaderInferVarType)
+
+#define REGISTER_FILE_READER(_filetype, _reader)            \
+  STATIC_ASSERT_GLOBAL_NAMESPACE(                           \
+      _reg_file_reader_##_filetype,                         \
+      "Must use REGISTER_FILE_READER in global namespace"); \
+  int TouchFileReader##_filetype() { return 0; }            \
+  int _reg_file_reader_entry_##filetype =                   \
+      paddle::operators::reader::RegisterFileReader<_reader>(#_filetype)
+
+#define USE_FILE_READER(filetype)         \
+  extern int TouchFileReader##filetype(); \
+  static int _use_##filetype = TouchFileReader##filetype()
--- a/paddle/fluid/operators/reduce_op.cc
+++ b/paddle/fluid/operators/reduce_op.cc
@@ -173,6 +173,15 @@ class ReduceMinOpMaker : public ReduceOpMaker {
  }
 };

+class ReduceProdOpMaker : public ReduceOpMaker {
+ public:
+  ReduceProdOpMaker(OpProto *proto, OpAttrChecker *op_checker)
+      : ReduceOpMaker(proto, op_checker) {
+    SetComment("ReduceProd", "production");
+    AddComment(comment_);
+  }
+};
+
 }  // namespace operators
 }  // namespace paddle

@@ -190,6 +199,9 @@ REGISTER_OP(reduce_max, ops::ReduceOp, ops::ReduceMaxOpMaker, reduce_max_grad,
 REGISTER_OP(reduce_min, ops::ReduceOp, ops::ReduceMinOpMaker, reduce_min_grad,
            ops::ReduceGradOp);

+REGISTER_OP(reduce_prod, ops::ReduceOp, ops::ReduceProdOpMaker,
+            reduce_prod_grad, ops::ReduceGradOp);
+
 #define REGISTER_REDUCE_CPU_KERNEL(reduce_type, functor, grad_functor)         \
  REGISTER_OP_CPU_KERNEL(reduce_type,                                          \
                         ops::ReduceKernel<paddle::platform::CPUDeviceContext, \

--- a/paddle/fluid/operators/reduce_op.h
+++ b/paddle/fluid/operators/reduce_op.h
@@ -93,6 +93,22 @@ struct MaxOrMinGradFunctor {
  }
 };

+struct ProdFunctor {
+  template <typename DeviceContext, typename X, typename Y, typename Dim>
+  void operator()(const DeviceContext& place, X& x, Y& y, const Dim& dim) {
+    y.device(place) = x.prod(dim);
+  }
+};
+
+struct ProdGradFunctor {
+  template <typename DeviceContext, typename X, typename Y, typename DX,
+            typename DY, typename Dim>
+  void operator()(const DeviceContext& place, X& x, Y& y, DX& dx, DY& dy,
+                  const Dim& dim, int size) {
+    dx.device(place) = dy.broadcast(dim) * y.broadcast(dim) * x.inverse();
+  }
+};
+
 template <typename DeviceContext, typename T, typename Functor>
 class ReduceKernel : public framework::OpKernel<T> {
 public:
@@ -254,4 +270,5 @@ class ReduceGradKernel : public framework::OpKernel<T> {
  __macro(reduce_sum, SumFunctor, SumGradFunctor);      \
  __macro(reduce_mean, MeanFunctor, MeanGradFunctor);   \
  __macro(reduce_max, MaxFunctor, MaxOrMinGradFunctor); \
-  __macro(reduce_min, MinFunctor, MaxOrMinGradFunctor);
+  __macro(reduce_min, MinFunctor, MaxOrMinGradFunctor); \
+  __macro(reduce_prod, ProdFunctor, ProdGradFunctor);
--- a/paddle/fluid/operators/scatter_op.cc
+++ b/paddle/fluid/operators/scatter_op.cc
@@ -23,24 +23,24 @@ class ScatterOp : public framework::OperatorWithKernel {
  using framework::OperatorWithKernel::OperatorWithKernel;

  void InferShape(framework::InferShapeContext* ctx) const override {
-    PADDLE_ENFORCE(ctx->HasInput("Ref"),
-                   "Input(Ref) of ScatterOp should not be null.");
-    PADDLE_ENFORCE(ctx->HasInput("Index"),
-                   "Input(Index) of ScatterOp should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("X"),
+                   "Input(X) of ScatterOp should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Ids"),
+                   "Input(Ids) of ScatterOp should not be null.");
    PADDLE_ENFORCE(ctx->HasInput("Updates"),
                   "Input(Updates) of ScatterOp should not be null.");
    PADDLE_ENFORCE(ctx->HasOutput("Out"),
                   "Output(Out) of ScatterOp should not be null.");

    auto updates_dims = ctx->GetInputDim("Updates");
-    auto ref_dims = ctx->GetInputDim("Ref");
-    PADDLE_ENFORCE_EQ(ctx->GetInputDim("Index").size(), 1,
-                      "Update Index should be 1-D.");
+    auto ref_dims = ctx->GetInputDim("X");
+    PADDLE_ENFORCE_EQ(ctx->GetInputDim("Ids").size(), 1,
+                      "Update Ids should be 1-D.");
    PADDLE_ENFORCE_EQ(ref_dims.size(), updates_dims.size(),
-                      "Reference and Updates should have the same shape size");
+                      "Xerence and Updates should have the same shape size");
    PADDLE_ENFORCE_EQ(ctx->GetInputDim("Updates")[0],
-                      ctx->GetInputDim("Index")[0],
-                      "Updates and Index should have same batch-size.");
+                      ctx->GetInputDim("Ids")[0],
+                      "Updates and Ids should have same batch-size.");
    framework::DDim data_dim(updates_dims);
    for (int i = 1; i < data_dim.size(); ++i) {
      PADDLE_ENFORCE_EQ(data_dim[i], updates_dims[i]);
@@ -52,7 +52,7 @@ class ScatterOp : public framework::OperatorWithKernel {
  framework::OpKernelType GetExpectedKernelType(
      const framework::ExecutionContext& ctx) const override {
    return framework::OpKernelType(
-        framework::ToDataType(ctx.Input<Tensor>("Ref")->type()),
+        framework::ToDataType(ctx.Input<Tensor>("X")->type()),
        ctx.device_context());
  }
 };
@@ -64,14 +64,14 @@ class ScatterGradOp : public framework::OperatorWithKernel {
  void InferShape(framework::InferShapeContext* ctx) const override {
    ctx->SetOutputDim(framework::GradVarName("Updates"),
                      ctx->GetInputDim("Updates"));
-    ctx->SetOutputDim(framework::GradVarName("Ref"), ctx->GetInputDim("Ref"));
+    ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("X"));
  }

 protected:
  framework::OpKernelType GetExpectedKernelType(
      const framework::ExecutionContext& ctx) const override {
    return framework::OpKernelType(
-        framework::ToDataType(ctx.Input<Tensor>("Ref")->type()),
+        framework::ToDataType(ctx.Input<Tensor>("X")->type()),
        ctx.device_context());
  }
 };
@@ -80,9 +80,8 @@ class ScatterOpMaker : public framework::OpProtoAndCheckerMaker {
 public:
  ScatterOpMaker(OpProto* proto, OpAttrChecker* op_checker)
      : OpProtoAndCheckerMaker(proto, op_checker) {
-    AddInput("Ref", "The source input of scatter op");
-    AddInput("Index",
-             "The index input of scatter op where Ref will be updated");
+    AddInput("X", "The source input of scatter op");
+    AddInput("Ids", "The index input of scatter op where X will be updated");
    AddInput("Updates", "The updated value of updates op");
    AddOutput("Out", "The output of add op");
    AddComment(R"DOC(
@@ -91,8 +90,8 @@ Scatter Operator.
 This operator obtains output by updating the input on selected indices on the first axis:

 $$
-Out = Ref \\
-Out[Index] = Ref[Index] + Updates
+Out = X \\
+Out[Ids] = X[Ids] + Updates
 $$

 )DOC");

--- a/paddle/fluid/operators/scatter_op.cu
+++ b/paddle/fluid/operators/scatter_op.cu
@@ -25,14 +25,14 @@ class ScatterOpCUDAKernel : public framework::OpKernel<T> {
  void Compute(const framework::ExecutionContext &ctx) const override {
    PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()),
                   "This kernel only runs on GPU device.");
-    auto *Ref = ctx.Input<Tensor>("Ref");
-    auto *Index = ctx.Input<Tensor>("Index");
+    auto *X = ctx.Input<Tensor>("X");
+    auto *Ids = ctx.Input<Tensor>("Ids");
    auto *Updates = ctx.Input<Tensor>("Updates");
    auto *Out = ctx.Output<Tensor>("Out");

-    Out->ShareDataWith(*Ref);
+    Out->ShareDataWith(*X);

-    GPUScatterAssign<T>(ctx.device_context(), *Updates, *Index, Out);
+    GPUScatterAssign<T>(ctx.device_context(), *Updates, *Ids, Out);
  }
 };

@@ -42,16 +42,16 @@ class ScatterGradOpCUDAKernel : public framework::OpKernel<T> {
  void Compute(const framework::ExecutionContext &ctx) const override {
    PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()),
                   "This kernel only runs on GPU device.");
-    auto *dRef = ctx.Output<Tensor>(framework::GradVarName("Ref"));
+    auto *dX = ctx.Output<Tensor>(framework::GradVarName("X"));
    auto *dUpdates = ctx.Output<Tensor>(framework::GradVarName("Updates"));
-    auto *Index = ctx.Input<Tensor>("Index");
+    auto *Ids = ctx.Input<Tensor>("Ids");
    auto *dOut = ctx.Input<Tensor>(framework::GradVarName("Out"));

-    // In place gradient: dRef = dO
-    dRef->ShareDataWith(*dOut);
+    // In place gradient: dX = dO
+    dX->ShareDataWith(*dOut);
    dUpdates->mutable_data<T>(ctx.GetPlace());
-    // Gradient by Gather: dUpdates = dO[Index]
-    GPUGather<T>(ctx.device_context(), *dOut, *Index, dUpdates);
+    // Gradient by Gather: dUpdates = dO[Ids]
+    GPUGather<T>(ctx.device_context(), *dOut, *Ids, dUpdates);
  }
 };


--- a/paddle/fluid/operators/scatter_op.h
+++ b/paddle/fluid/operators/scatter_op.h
@@ -29,15 +29,15 @@ class ScatterOpKernel : public framework::OpKernel<T> {
  void Compute(const framework::ExecutionContext &ctx) const override {
    PADDLE_ENFORCE(platform::is_cpu_place(ctx.GetPlace()),
                   "This kernel only runs on CPU.");
-    auto *Ref = ctx.Input<Tensor>("Ref");
-    auto *Index = ctx.Input<Tensor>("Index");
+    auto *X = ctx.Input<Tensor>("X");
+    auto *Ids = ctx.Input<Tensor>("Ids");
    auto *Updates = ctx.Input<Tensor>("Updates");
    auto *Out = ctx.Output<Tensor>("Out");

-    // In place output: Out = Ref, Out[Index] += Updates
-    Out->ShareDataWith(*Ref);
+    // In place output: Out = X, Out[Ids] += Updates
+    Out->ShareDataWith(*X);
    // Apply ScatterUpdate: Out[index] += Updates[:]
-    ScatterAssign<T>(ctx.device_context(), *Updates, *Index, Out);
+    ScatterAssign<T>(ctx.device_context(), *Updates, *Ids, Out);
  }
 };

@@ -47,16 +47,16 @@ class ScatterGradientOpKernel : public framework::OpKernel<T> {
  void Compute(const framework::ExecutionContext &ctx) const override {
    PADDLE_ENFORCE(platform::is_cpu_place(ctx.GetPlace()),
                   "This kernel only runs on CPU.");
-    auto *dRef = ctx.Output<Tensor>(framework::GradVarName("Ref"));
+    auto *dX = ctx.Output<Tensor>(framework::GradVarName("X"));
    auto *dUpdates = ctx.Output<Tensor>(framework::GradVarName("Updates"));
-    auto *Index = ctx.Input<Tensor>("Index");
+    auto *Ids = ctx.Input<Tensor>("Ids");
    auto *dOut = ctx.Input<Tensor>(framework::GradVarName("Out"));

-    // In place gradient: dRef = dO
-    dRef->ShareDataWith(*dOut);
+    // In place gradient: dX = dO
+    dX->ShareDataWith(*dOut);
    dUpdates->mutable_data<T>(ctx.GetPlace());
-    // Gradient by Gather: dUpdates += dO[Index]
-    CPUGather<T>(ctx.device_context(), *dOut, *Index, dUpdates);
+    // Gradient by Gather: dUpdates += dO[Ids]
+    CPUGather<T>(ctx.device_context(), *dOut, *Ids, dUpdates);
  }
 };


--- a/paddle/fluid/pybind/pybind.cc
+++ b/paddle/fluid/pybind/pybind.cc
@@ -31,6 +31,7 @@ limitations under the License. */
 #include "paddle/fluid/operators/cond_op.h"
 #include "paddle/fluid/operators/net_op.h"
 #include "paddle/fluid/platform/enforce.h"
+#include "paddle/fluid/platform/gpu_info.h"
 #include "paddle/fluid/platform/place.h"
 #include "paddle/fluid/platform/profiler.h"
 #include "paddle/fluid/pybind/const_value.h"
@@ -103,12 +104,14 @@ PYBIND11_PLUGIN(core) {
      .def("set", PyCPUTensorSetFromArray<double>)
      .def("set", PyCPUTensorSetFromArray<int64_t>)
      .def("set", PyCPUTensorSetFromArray<bool>)
+      .def("set", PyCPUTensorSetFromArray<uint16_t>)
 #ifdef PADDLE_WITH_CUDA
      .def("set", PyCUDATensorSetFromArray<float>)
      .def("set", PyCUDATensorSetFromArray<int>)
      .def("set", PyCUDATensorSetFromArray<double>)
      .def("set", PyCUDATensorSetFromArray<int64_t>)
      .def("set", PyCUDATensorSetFromArray<bool>)
+      .def("set", PyCUDATensorSetFromArray<uint16_t>)
 #endif
      .def("shape", [](Tensor &self) { return vectorize(self.dims()); })
      .def("set_float_element", TensorSetElement<float>)
@@ -315,7 +318,6 @@ All parameter, weight, gradient are variables in Paddle.
 #endif
                  });
 // clang-format on
-
 #ifdef PADDLE_WITH_CUDA
  py::class_<platform::Communicator>(m, "Communicator").def(py::init<>());
 #endif
@@ -423,6 +425,12 @@ All parameter, weight, gradient are variables in Paddle.
  m.def("init_devices", &framework::InitDevices);

  m.def("is_compiled_with_cuda", IsCompiledWithCUDA);
+#ifdef PADDLE_WITH_CUDA
+  m.def("is_float16_supported", [](const platform::CUDAPlace &place) -> bool {
+    // Only GPUs with Compute Capability >= 53 support float16
+    return platform::GetCUDAComputeCapability(place.device) >= 53;
+  });
+#endif

  m.def("set_feed_variable", framework::SetFeedVariable);
  m.def("get_fetch_variable", framework::GetFetchVariable);

--- a/paddle/fluid/pybind/tensor_py.h
+++ b/paddle/fluid/pybind/tensor_py.h
@@ -17,6 +17,7 @@ limitations under the License. */
 #include "paddle/fluid/framework/lod_tensor.h"
 #include "paddle/fluid/memory/memcpy.h"
 #include "paddle/fluid/platform/device_context.h"
+#include "paddle/fluid/platform/float16.h"
 #include "pybind11/numpy.h"
 #include "pybind11/pybind11.h"

@@ -77,21 +78,32 @@ struct CastToPyBufferImpl<true, I, ARGS...> {
      } else if (paddle::platform::is_cpu_place(tensor.place())) {
        dst_tensor = tensor;
      }
-      return py::buffer_info(dst_tensor.data<CUR_TYPE>(), sizeof(CUR_TYPE),
-                             py::format_descriptor<CUR_TYPE>::format(),
-                             (size_t)framework::arity(dst_tensor.dims()),
-                             dims_outside, strides);
+
+      if (std::type_index(typeid(CUR_TYPE)) ==
+          std::type_index(typeid(platform::float16))) {
+        return py::buffer_info(dst_tensor.data<CUR_TYPE>(), sizeof(CUR_TYPE),
+                               "e", /* np.dtype('e') == np.float16 */
+                               (size_t)framework::arity(dst_tensor.dims()),
+                               dims_outside, strides);
+      } else {
+        return py::buffer_info(dst_tensor.data<CUR_TYPE>(), sizeof(CUR_TYPE),
+                               py::format_descriptor<CUR_TYPE>::format(),
+                               (size_t)framework::arity(dst_tensor.dims()),
+                               dims_outside, strides);
+      }
    } else {
      constexpr bool less = I + 1 < std::tuple_size<std::tuple<ARGS...>>::value;
      return CastToPyBufferImpl<less, I + 1, ARGS...>()(tensor);
    }
  }
 };
+
 }  // namespace details
+
 inline py::buffer_info CastToPyBuffer(framework::Tensor &tensor) {
  auto buffer_info =
-      details::CastToPyBufferImpl<true, 0, float, int, double, int64_t, bool>()(
-          tensor);
+      details::CastToPyBufferImpl<true, 0, float, int, double, int64_t, bool,
+                                  platform::float16>()(tensor);
  return buffer_info;
 }

@@ -136,6 +148,22 @@ void PyCPUTensorSetFromArray(
  std::memcpy(dst, array.data(), sizeof(T) * array.size());
 }

+template <>
+void PyCPUTensorSetFromArray(
+    framework::Tensor &self,
+    py::array_t<uint16_t, py::array::c_style | py::array::forcecast> array,
+    paddle::platform::CPUPlace &place) {
+  std::vector<int64_t> dims;
+  dims.reserve(array.ndim());
+  for (size_t i = 0; i < array.ndim(); ++i) {
+    dims.push_back((int)array.shape()[i]);
+  }
+
+  self.Resize(framework::make_ddim(dims));
+  auto *dst = self.mutable_data<platform::float16>(place);
+  std::memcpy(dst, array.data(), sizeof(uint16_t) * array.size());
+}
+
 #ifdef PADDLE_WITH_CUDA
 template <typename T>
 void PyCUDATensorSetFromArray(
@@ -157,6 +185,28 @@ void PyCUDATensorSetFromArray(
  paddle::platform::GpuMemcpyAsync(dst, array.data(), sizeof(T) * array.size(),
                                   cudaMemcpyHostToDevice, dev_ctx->stream());
 }
+
+template <>
+void PyCUDATensorSetFromArray(
+    framework::Tensor &self,
+    py::array_t<uint16_t, py::array::c_style | py::array::forcecast> array,
+    paddle::platform::CUDAPlace &place) {
+  std::vector<int64_t> dims;
+  dims.reserve(array.ndim());
+  for (size_t i = 0; i < array.ndim(); ++i) {
+    dims.push_back((int)array.shape()[i]);
+  }
+
+  self.Resize(framework::make_ddim(dims));
+  auto *dst = self.mutable_data<platform::float16>(place);
+
+  platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance();
+  auto dev_ctx =
+      static_cast<const platform::CUDADeviceContext *>(pool.Get(place));
+  paddle::platform::GpuMemcpyAsync(dst, array.data(),
+                                   sizeof(uint16_t) * array.size(),
+                                   cudaMemcpyHostToDevice, dev_ctx->stream());
+}
 #endif

 }  // namespace pybind

--- a/python/CMakeLists.txt
+++ b/python/CMakeLists.txt
-
-file(GLOB TRAINER_PY_FILES . ./paddle/trainer/*.py)
-file(GLOB HELPERS_PY_FILES . ./paddle/trainer_config_helpers/*.py)
 file(GLOB UTILS_PY_FILES . ./paddle/utils/*.py)
-file(GLOB_RECURSE V2_PY_FILES ./paddle/v2/ *.py)
 file(GLOB_RECURSE FLUID_PY_FILES ./paddle/fluid/ *.py)
-
 set(PY_FILES paddle/__init__.py
-  ${TRAINER_PY_FILES}
-  ${HELPERS_PY_FILES}
  ${UTILS_PY_FILES}
-  ${V2_PY_FILES}
  ${FLUID_PY_FILES})

-add_custom_target(copy_paddle_master)
+if(NOT WITH_FLUID)
+  file(GLOB TRAINER_PY_FILES . ./paddle/trainer/*.py)
+  file(GLOB HELPERS_PY_FILES . ./paddle/trainer_config_helpers/*.py)
+  file(GLOB_RECURSE V2_PY_FILES ./paddle/v2/ *.py)
+  set(PY_FILES ${PY_FILES}
+    ${TRAINER_PY_FILES}
+    ${HELPERS_PY_FILES}
+    ${V2_PY_FILES})

-SET(COPY_PADDLE_MASTER "")
-if(WITH_GOLANG)
-  SET(COPY_PADDLE_MASTER "copy_paddle_master")
-  add_custom_command(TARGET ${COPY_PADDLE_MASTER}
-    COMMAND cp ${paddle_master_LIB_PATH} ${PADDLE_SOURCE_DIR}/python/paddle/v2/master/
-    )
-  add_dependencies(copy_paddle_master paddle_master)
-endif(WITH_GOLANG)
+  add_custom_target(copy_paddle_master)
+
+  SET(COPY_PADDLE_MASTER "")
+  if(WITH_GOLANG)
+    SET(COPY_PADDLE_MASTER "copy_paddle_master")
+    add_custom_command(TARGET ${COPY_PADDLE_MASTER}
+      COMMAND cp ${paddle_master_LIB_PATH} ${PADDLE_SOURCE_DIR}/python/paddle/v2/master/
+      )
+    add_dependencies(copy_paddle_master paddle_master)
+  endif(WITH_GOLANG)
+endif()

 set(MKL_SHARED_LIBS "")
 set(MKL_DEPENDS "")
@@ -59,23 +61,28 @@ add_custom_command(OUTPUT ${PADDLE_PYTHON_BUILD_DIR}/.timestamp
    COMMAND ${CMAKE_COMMAND} -E copy_directory ${PADDLE_PYTHON_BUILD_DIR}/lib* ${PADDLE_PYTHON_BUILD_DIR}/lib-python
    DEPENDS gen_proto_py copy_paddle_pybind framework_py_proto profiler_py_proto ${PY_FILES} ${external_project_dependencies} ${COPY_PADDLE_MASTER})

-set(paddle_python_deps ${PADDLE_PYTHON_BUILD_DIR}/.timestamp paddle_pserver_main paddle_trainer paddle_merge_model ${MKL_DEPENDS})
-if(WITH_SWIG_PY)
-    list(APPEND paddle_python_deps python_api_wheel)
+set(paddle_python_deps ${PADDLE_PYTHON_BUILD_DIR}/.timestamp ${MKL_DEPENDS})
+if(NOT WITH_FLUID)
+    set(paddle_python_deps ${paddle_python_deps} paddle_pserver_main paddle_trainer paddle_merge_model)
+    if(WITH_SWIG_PY)
+        list(APPEND paddle_python_deps python_api_wheel)
+    endif()
 endif()
 add_custom_target(paddle_python ALL DEPENDS ${paddle_python_deps})

 set(PADDLE_PYTHON_PACKAGE_DIR ${CMAKE_CURRENT_BINARY_DIR}/dist/)

 if (WITH_TESTING)
-  add_subdirectory(paddle/trainer_config_helpers/tests)
-  if (WITH_SWIG_PY)
-    # enable v2 API unittest only when paddle swig api is compiled
-    add_subdirectory(paddle/v2/tests)
-    add_subdirectory(paddle/v2/reader/tests)
-    add_subdirectory(paddle/v2/plot/tests)
-    add_subdirectory(paddle/fluid/tests)
+  if(NOT WITH_FLUID)
+    add_subdirectory(paddle/trainer_config_helpers/tests)
+    if (WITH_SWIG_PY)
+      # enable v2 API unittest only when paddle swig api is compiled
+      add_subdirectory(paddle/v2/tests)
+      add_subdirectory(paddle/v2/reader/tests)
+      add_subdirectory(paddle/v2/plot/tests)
+    endif()
  endif()
+  add_subdirectory(paddle/fluid/tests)
 endif()
 install(DIRECTORY ${PADDLE_PYTHON_PACKAGE_DIR}
    DESTINATION opt/paddle/share/wheels

--- a/python/paddle/fluid/backward.py
+++ b/python/paddle/fluid/backward.py
@@ -248,12 +248,15 @@ def _callback_lookup_(op):
                        if o_argu in self.param_grad_names:
                            allreduce_out_name = o_argu + "__nccl_all_reduce__"
                            op_desc = _create_op_desc_(
-                                "ncclAllReduce", {
+                                "ncclReduce",
+                                {
                                    "X": [o_argu],
                                    "Communicator":
                                    ['nccl_com__do_not_change_']
-                                }, {"Out": [allreduce_out_name]},
-                                {"reduction": "ncclSum"})
+                                },
+                                {"Out": [allreduce_out_name]},
+                                {"reduction": "ncclSum",
+                                 "root": 0}, )
                            block.desc.append_op().copy_from(op_desc)

                            op_desc = _create_op_desc_(

--- a/python/paddle/fluid/layers/io.py
+++ b/python/paddle/fluid/layers/io.py
@@ -21,7 +21,7 @@ from ..executor import global_scope

 __all__ = [
    'data', 'BlockGuardServ', 'ListenAndServ', 'Send', 'open_recordio_file',
-    'read_file'
+    'read_file', 'create_shuffle_reader', 'create_double_buffer_reader'
 ]


@@ -245,6 +245,8 @@ def monkey_patch_reader_methods(reader):

    reader.eof = eof
    reader.reset = reset
+    reader.stop_gradient = True
+    reader.persistable = True
    return reader


@@ -285,6 +287,33 @@ def open_recordio_file(filename, shapes, lod_levels, dtypes):
                             startup_var)


+def __create_decorated_reader__(op_type, reader, attrs):
+    var_name = unique_name(op_type)
+    startup_blk = default_startup_program().current_block()
+    startup_var = startup_blk.create_var(name=var_name)
+    startup_blk.append_op(
+        type=op_type,
+        inputs={'UnderlyingReader': reader},
+        outputs={'Out': [startup_var]},
+        attrs=attrs)
+    startup_var.persistable = True
+    return _copy_reader_var_(default_main_program().current_block(),
+                             startup_var)
+
+
+def create_shuffle_reader(reader, buffer_size):
+    return __create_decorated_reader__('create_shuffle_reader', reader,
+                                       {'buffer_size': int(buffer_size)})
+
+
+def create_double_buffer_reader(reader, place=None):
+    attrs = dict()
+    if place is not None:
+        attrs['place'] = str(place).upper()
+    return __create_decorated_reader__('create_double_buffer_reader', reader,
+                                       attrs)
+
+
 def read_file(file_obj):
    helper = LayerHelper('read_file')
    out = [

--- a/python/paddle/fluid/layers/nn.py
+++ b/python/paddle/fluid/layers/nn.py
@@ -49,6 +49,7 @@ __all__ = [
    'reduce_mean',
    'reduce_max',
    'reduce_min',
+    'reduce_prod',
    'sequence_first_step',
    'sequence_last_step',
    'dropout',
@@ -84,13 +85,12 @@ def fc(input,
    **Fully Connected Layer**

    The fully connected layer can take multiple tensors as its inputs. It
-    creates a variable (one for each input tensor) called weights for each
-    input tensor, which represents a fully connected weight matrix from
-    each input unit to each output unit. The fully connected layer
-    multiplies each input tensor with its coresponding weight to produce
-    an output Tensor. If multiple input tensors are given, the results of
-    multiple multiplications will be sumed up. If bias_attr is not None,
-    a biases variable will be created and added to the output. Finally,
+    creates a variable called weights for each input tensor, which represents
+    a fully connected weight matrix from each input unit to each output unit.
+    The fully connected layer multiplies each input tensor with its coresponding
+    weight to produce an output Tensor. If multiple input tensors are given,
+    the results of multiple multiplications will be sumed up. If bias_attr is
+    not None, a bias variable will be created and added to the output. Finally,
    if activation is not None, it will be applied to the output as well.

    This process can be formulated as follows:
@@ -109,44 +109,27 @@ def fc(input,
    * :math:`Out`: The output tensor.

    Args:
-       input(Variable|list): The input tensor(s) to the fully connected layer.
-       size(int): The number of output units in the fully connected layer.
-       num_flatten_dims(int): The fc layer can accept an input tensor with more
-                              than two dimensions. If this happens, the
-                              multidimensional tensor will first be flattened
-                              into a 2-dimensional matrix. The parameter
-                              `num_flatten_dims` determines how the input tensor
-                              is flattened: the first `num_flatten_dims`
-                              (inclusive, index starts from 1) dimensions will
-                              be flatten to form the first dimension of the
-                              final matrix (height of the matrix), and the rest
-                              `rank(X) - num_flatten_dims` dimensions are
-                              flattened to form the second dimension of the
-                              final matrix (width of the matrix). For example,
-                              suppose `X` is a 6-dimensional tensor with a shape
-                              [2, 3, 4, 5, 6], and `num_flatten_dims` = 3. Then,
-                              the flattened matrix will have a shape
-                              [2 x 3 x 4, 5 x 6] = [24, 30]. By default,
-                              `num_flatten_dims` is set to 1.
-       param_attr(ParamAttr|list): The parameter attribute for learnable
-                                   parameters/weights of the fully connected
-                                   layer.
-       param_initializer(ParamAttr|list): The initializer used for the
-                                          weight/parameter. If set None,
-                                          XavierInitializer() will be used.
-       bias_attr(ParamAttr|list): The parameter attribute for the bias parameter
-                                  for this layer. If set None, no bias will be
-                                  added to the output units.
-       bias_initializer(ParamAttr|list): The initializer used for the bias.
-                                        If set None, then ConstantInitializer()
-                                        will be used.
-       act(str): Activation to be applied to the output of the fully connected
-                 layer.
-       name(str): Name/alias of the fully connected layer.
-
+        input (Variable|list of Variable): The input tensor(s) of this layer, and the dimension of
+            the input tensor(s) is at least 2.
+        size(int): The number of output units in this layer.
+        num_flatten_dims (int, default 1): The fc layer can accept an input tensor with more than
+            two dimensions. If this happens, the multidimensional tensor will first be flattened
+            into a 2-dimensional matrix. The parameter `num_flatten_dims` determines how the input
+            tensor is flattened: the first `num_flatten_dims` (inclusive, index starts from 1)
+            dimensions will be flatten to form the first dimension of the final matrix (height of
+            the matrix), and the rest `rank(X) - num_flatten_dims` dimensions are flattened to
+            form the second dimension of the final matrix (width of the matrix). For example, suppose
+            `X` is a 6-dimensional tensor with a shape [2, 3, 4, 5, 6], and `num_flatten_dims` = 3.
+            Then, the flattened matrix will have a shape [2 x 3 x 4, 5 x 6] = [24, 30].
+        param_attr (ParamAttr|list of ParamAttr, default None): The parameter attribute for learnable
+            parameters/weights of this layer.
+        bias_attr (ParamAttr|list of ParamAttr, default None): The parameter attribute for the bias
+            of this layer. If it is set to None, no bias will be added to the output units.
+        act (str, default None): Activation to be applied to the output of this layer.
+        name (str, default None): The name of this layer.

    Returns:
-        Variable: The output tensor variable.
+        A tensor variable storing the transformation result.

    Raises:
        ValueError: If rank of the input tensor is less than 2.
@@ -2202,6 +2185,53 @@ def reduce_min(input, dim=None, keep_dim=False, name=None):
    return out


+def reduce_prod(input, dim=None, keep_dim=False, name=None):
+    """
+    Computes the product of tensor elements over the given dimension.
+
+    Args:
+        input (Variable): The input variable which is a Tensor or LoDTensor.
+        dim (int|None): The dimension along which the product is performed. If
+            :attr:`None`, multipy all elements of :attr:`input` and return a
+            Tensor variable with a single element, otherwise must be in the
+            range :math:`[-rank(input), rank(input))`. If :math:`dim < 0`,
+            the dimension to reduce is :math:`rank + dim`.
+        keep_dim (bool|False): Whether to reserve the reduced dimension in the
+            output Tensor. The result tensor will have one fewer dimension
+            than the :attr:`input` unless :attr:`keep_dim` is true.
+        name(str|None): A name for this layer(optional). If set None, the 
+            layer will be named automatically.
+
+    Returns:
+        Variable: The reduced Tensor variable.
+
+    Examples:
+        .. code-block:: python
+
+            # x is a Tensor variable with following elements:
+            #    [[0.2, 0.3, 0.5, 0.9]
+            #     [0.1, 0.2, 0.6, 0.7]]
+            # Each example is followed by the correspending output tensor.
+            fluid.layers.reduce_prod(x)  # [0.0002268]
+            fluid.layers.reduce_prod(x, dim=0)  # [0.02, 0.06, 0.3, 0.63]
+            fluid.layers.reduce_prod(x, dim=-1)  # [0.027, 0.0084]
+            fluid.layers.reduce_prod(x, dim=1, 
+                                     keep_dim=True)  # [[0.027], [0.0084]]
+    """
+    helper = LayerHelper('reduce_prod', **locals())
+    out = helper.create_tmp_variable(dtype=helper.input_dtype())
+    helper.append_op(
+        type='reduce_prod',
+        inputs={'X': input},
+        outputs={'Out': out},
+        attrs={
+            'dim': dim if dim != None else 0,
+            'keep_dim': keep_dim,
+            'reduce_all': True if dim == None else False
+        })
+    return out
+
+
 def split(input, num_or_sections, dim=-1, name=None):
    """
    Split the input tensor into multiple sub-tensors.

--- a/python/paddle/fluid/layers/ops.py
+++ b/python/paddle/fluid/layers/ops.py
@@ -45,31 +45,13 @@ __activations__ = [
 ]

 __all__ = [
-    'mean',
-    'mul',
-    'reshape',
-    'scale',
-    'sigmoid_cross_entropy_with_logits',
-    'elementwise_add',
-    'elementwise_div',
-    'elementwise_sub',
-    'elementwise_mul',
-    'elementwise_max',
-    'elementwise_min',
-    'elementwise_pow',
-    'clip',
-    'clip_by_norm',
-    'softmax',
-    'sequence_softmax',
-    'logical_and',
-    'logical_or',
-    'logical_xor',
-    'logical_not',
-    'uniform_random',
-    'uniform_random_batch_size_like',
-    'gaussian_random',
-    'gaussian_random_batch_size_like',
-    'cumsum',
+    'mean', 'mul', 'reshape', 'scale', 'sigmoid_cross_entropy_with_logits',
+    'elementwise_add', 'elementwise_div', 'elementwise_sub', 'elementwise_mul',
+    'elementwise_max', 'elementwise_min', 'elementwise_pow', 'clip',
+    'clip_by_norm', 'softmax', 'sequence_softmax', 'logical_and', 'logical_or',
+    'logical_xor', 'logical_not', 'uniform_random',
+    'uniform_random_batch_size_like', 'gaussian_random',
+    'gaussian_random_batch_size_like', 'cumsum', 'scatter'
 ] + __activations__

 for _OP in set(__all__):

--- a/python/paddle/fluid/recordio_writer.py
+++ b/python/paddle/fluid/recordio_writer.py
@@ -36,6 +36,7 @@ def convert_reader_to_recordio_file(
        feed_order=None):
    if feed_order is None:
        feed_order = feeder.feed_names
+    counter = 0
    with create_recordio_writer(filename, compressor,
                                max_num_records) as writer:
        for batch in reader_creator():
@@ -43,3 +44,5 @@ def convert_reader_to_recordio_file(
            for each in feed_order:
                writer.append_tensor(res[each])
            writer.complete_append_tensor()
+            counter += 1
+    return counter
--- a/python/paddle/fluid/regularizer.py
+++ b/python/paddle/fluid/regularizer.py
@@ -13,6 +13,7 @@
 # limitations under the License.

 import framework
+from . import core

 __all__ = [
    'append_regularization_ops',
@@ -46,9 +47,9 @@ def append_regularization_ops(parameters_and_grads, regularization=None):
        regularization_term = None
        if param.regularizer is not None:
            # Add variable for regularization term in grad block
-            regularization_term = param.regularizer(param, grad.block)
+            regularization_term = param.regularizer(param, grad, grad.block)
        elif regularization is not None:
-            regularization_term = regularization(param, grad.block)
+            regularization_term = regularization(param, grad, grad.block)

        # If no gradient or no regularization specified,
        # then we don't need to do anything
@@ -82,7 +83,7 @@ class WeightDecayRegularizer(object):
    def __init__(self):
        pass

-    def __call__(self, param, block):
+    def __call__(self, param, grad, block):
        """Add corresponding weight decay operations to the network
        """
        raise NotImplementedError()
@@ -102,7 +103,7 @@ class L2DecayRegularizer(WeightDecayRegularizer):
        super(L2DecayRegularizer, self).__init__()
        self._regularization_coeff = regularization_coeff

-    def __call__(self, param, block):
+    def __call__(self, param, grad, block):
        """Add L2 weight decay ops to network

        Adds L2 weight decay ops.
@@ -117,8 +118,23 @@ class L2DecayRegularizer(WeightDecayRegularizer):
        """
        assert isinstance(param, framework.Parameter)
        assert isinstance(block, framework.Block)
+
        decay = block.create_var(
            dtype="float32", shape=param.shape, lod_level=param.lod_level)
+
+        if grad.type == core.VarDesc.VarType.SELECTED_ROWS:
+            decay = block.create_var(
+                dtype="float32",
+                shape=param.shape,
+                type=core.VarDesc.VarType.SELECTED_ROWS)
+            block.append_op(
+                type='lookup_table',
+                inputs={'W': param,
+                        'Ids': grad},
+                outputs={'Out': decay},
+                attrs={'is_sparse': True})
+            param = decay
+
        # Append Op to calculate decay
        block.append_op(
            type='scale',
@@ -141,7 +157,7 @@ class L1DecayRegularizer(WeightDecayRegularizer):
        super(L1DecayRegularizer, self).__init__()
        self._regularization_coeff = regularization_coeff

-    def __call__(self, param, block):
+    def __call__(self, param, grad, block):
        """Add L1 weight decay ops to network

        Adds L1 weight decay ops.
@@ -158,6 +174,19 @@ class L1DecayRegularizer(WeightDecayRegularizer):
        assert isinstance(block, framework.Block)
        decay = block.create_var(
            dtype="float32", shape=param.shape, lod_level=param.lod_level)
+
+        if grad.type == core.VarDesc.VarType.SELECTED_ROWS:
+            decay = block.create_var(
+                dtype="float32",
+                shape=param.shape,
+                type=core.VarDesc.VarType.SELECTED_ROWS)
+            block.append_op(
+                type='lookup_table',
+                inputs={'W': param,
+                        'Ids': grad},
+                outputs={'Out': decay},
+                attrs={'is_sparse': True})
+
        # Append sign op
        block.append_op(
            type='sign', inputs={"X": param}, outputs={"Out": decay})

--- a/python/paddle/fluid/tests/book/test_machine_translation.py
+++ b/python/paddle/fluid/tests/book/test_machine_translation.py
@@ -181,7 +181,10 @@ def train_main(use_cuda, is_sparse, is_local=True):
    cost = pd.cross_entropy(input=rnn_out, label=label)
    avg_cost = pd.mean(cost)

-    optimizer = fluid.optimizer.Adagrad(learning_rate=1e-4)
+    optimizer = fluid.optimizer.Adagrad(
+        learning_rate=1e-4,
+        regularization=fluid.regularizer.L2DecayRegularizer(
+            regularization_coeff=0.1))
    optimize_ops, params_grads = optimizer.minimize(avg_cost)

    train_data = paddle.batch(

--- a/python/paddle/fluid/tests/unittests/test_mul_op.py
+++ b/python/paddle/fluid/tests/unittests/test_mul_op.py
@@ -14,6 +14,7 @@

 import unittest
 import numpy as np
+import paddle.fluid.core as core
 from op_test import OpTest


@@ -69,5 +70,42 @@ class TestMulOp2(OpTest):
            ['X'], 'Out', max_relative_error=0.5, no_grad_set=set('Y'))


+class TestFP16MulOp1(OpTest):
+    def setUp(self):
+        self.op_type = "mul"
+        x = np.random.random((32, 84)).astype("float16")
+        y = np.random.random((84, 100)).astype("float16")
+        self.inputs = {'X': x.view(np.uint16), 'Y': y.view(np.uint16)}
+        self.outputs = {'Out': np.dot(x, y)}
+
+    def test_check_output(self):
+        if core.is_compiled_with_cuda():
+            place = core.CUDAPlace(0)
+            if core.is_float16_supported(place):
+                self.check_output_with_place(place, atol=1e-1)
+
+
+class TestFP16MulOp2(OpTest):
+    def setUp(self):
+        self.op_type = "mul"
+        x = np.random.random((15, 4, 12, 10)).astype("float16")
+        y = np.random.random((4, 30, 8, 2, 9)).astype("float16")
+        self.inputs = {'X': x.view(np.uint16), 'Y': y.view(np.uint16)}
+        self.attrs = {
+            'x_num_col_dims': 2,
+            'y_num_col_dims': 2,
+        }
+        result = np.dot(
+            x.reshape(15 * 4, 12 * 10), y.reshape(4 * 30, 8 * 2 * 9))
+        result = result.reshape(15, 4, 8, 2, 9)
+        self.outputs = {'Out': result}
+
+    def test_check_output(self):
+        if core.is_compiled_with_cuda():
+            place = core.CUDAPlace(0)
+            if core.is_float16_supported(place):
+                self.check_output_with_place(place, atol=2e-1)
+
+
 if __name__ == "__main__":
    unittest.main()
--- a/python/paddle/fluid/tests/unittests/test_parallel_op.py
+++ b/python/paddle/fluid/tests/unittests/test_parallel_op.py
@@ -15,6 +15,7 @@
 import unittest

 import paddle.fluid as fluid
+import paddle.fluid.profiler as profiler
 import numpy


@@ -60,20 +61,23 @@ class BaseParallelForTest(unittest.TestCase):
                feed=feed,
                fetch=fetch,
                place=gpu,
-                use_parallel=False)
+                use_parallel=False,
+                use_gpu=True)
            result_gpu_parallel = self._run_test_impl_(
                callback=callback,
                feed=feed,
                fetch=fetch,
                place=gpu,
-                use_parallel=True)
+                use_parallel=True,
+                use_gpu=True)
            result_gpu_nccl = self._run_test_impl_(
                callback=callback,
                feed=feed,
                fetch=fetch,
                place=gpu,
                use_parallel=True,
-                use_nccl=True)
+                use_nccl=True,
+                use_gpu=True)
            self._assert_same_(fetch, result_cpu, result_cpu_parallel,
                               result_gpu, result_gpu_parallel, result_gpu_nccl)
        else:
@@ -85,7 +89,8 @@ class BaseParallelForTest(unittest.TestCase):
                        fetch,
                        place,
                        use_parallel=False,
-                        use_nccl=False):
+                        use_nccl=False,
+                        use_gpu=False):
        """
        Run a single test, returns the fetch values
        Args:
@@ -132,7 +137,12 @@ class BaseParallelForTest(unittest.TestCase):

        exe = fluid.Executor(place)
        exe.run(startup)
-        return exe.run(main, feed=feed, fetch_list=fetch)
+        if use_gpu:
+            profile_type = 'GPU'
+        else:
+            profile_type = 'CPU'
+        with profiler.profiler(profile_type, 'total', '/tmp/profiler'):
+            return exe.run(main, feed=feed, fetch_list=fetch)

    def _assert_same_(self, fetch, *args):
        """

--- a/python/paddle/fluid/tests/unittests/test_recordio_reader.py
+++ b/python/paddle/fluid/tests/unittests/test_recordio_reader.py
@@ -13,9 +13,10 @@
 # limitations under the License.

 import unittest
+
 import paddle.fluid as fluid
-import paddle.v2.dataset.mnist as mnist
 import paddle.v2 as paddle
+import paddle.v2.dataset.mnist as mnist


 class TestRecordIO(unittest.TestCase):
@@ -31,10 +32,10 @@ class TestRecordIO(unittest.TestCase):
                        name='label', shape=[1], dtype='int64'),
                ],
                place=fluid.CPUPlace())
-            fluid.recordio_writer.convert_reader_to_recordio_file(
+            self.num_batches = fluid.recordio_writer.convert_reader_to_recordio_file(
                './mnist.recordio', reader, feeder)

-    def test_main(self):
+    def test_main(self, decorator_callback=None):
        # use new program
        with fluid.program_guard(fluid.Program(), fluid.Program()):
            data_file = fluid.layers.open_recordio_file(
@@ -42,6 +43,8 @@ class TestRecordIO(unittest.TestCase):
                shapes=[[-1, 784], [-1, 1]],
                lod_levels=[0, 0],
                dtypes=['float32', 'int64'])
+            if decorator_callback is not None:
+                data_file = decorator_callback(data_file)
            img, label = fluid.layers.read_file(data_file)

            hidden = fluid.layers.fc(input=img, size=100, act='tanh')
@@ -51,14 +54,28 @@ class TestRecordIO(unittest.TestCase):

            fluid.optimizer.Adam(learning_rate=1e-3).minimize(avg_loss)

-            exe = fluid.Executor(fluid.CPUPlace())
+            if fluid.core.is_compiled_with_cuda():
+                place = fluid.CUDAPlace(0)
+            else:
+                place = fluid.CPUPlace()
+
+            exe = fluid.Executor(place)
            exe.run(fluid.default_startup_program())
            avg_loss_np = []

            # train a pass
+            batch_id = 0
            while not data_file.eof():
                tmp, = exe.run(fetch_list=[avg_loss])
                avg_loss_np.append(tmp)
+                batch_id += 1
            data_file.reset()
-
+            self.assertEqual(batch_id, self.num_batches)
            self.assertLess(avg_loss_np[-1], avg_loss_np[0])
+
+    def test_shuffle_reader(self):
+        self.test_main(decorator_callback=lambda reader: fluid.layers.create_shuffle_reader(reader, buffer_size=200))
+
+    def test_double_buffer_reader(self):
+        self.test_main(decorator_callback=lambda reader: fluid.layers.create_double_buffer_reader(reader,
+                                                                                                  place='cuda:0' if fluid.core.is_compiled_with_cuda() else 'cpu'))
--- a/python/paddle/fluid/tests/unittests/test_reduce_op.py
+++ b/python/paddle/fluid/tests/unittests/test_reduce_op.py
@@ -70,6 +70,19 @@ class TestMinOp(OpTest):
        self.check_output()


+class TestProdOp(OpTest):
+    def setUp(self):
+        self.op_type = "reduce_prod"
+        self.inputs = {'X': np.random.random((5, 6, 10)).astype("float64")}
+        self.outputs = {'Out': self.inputs['X'].prod(axis=0)}
+
+    def test_check_output(self):
+        self.check_output()
+
+    def test_check_grad(self):
+        self.check_grad(['X'], 'Out')
+
+
 class TestKeepDimReduce(OpTest):
    def setUp(self):
        self.op_type = "reduce_sum"

--- a/python/paddle/fluid/tests/unittests/test_scatter_op.py
+++ b/python/paddle/fluid/tests/unittests/test_scatter_op.py
@@ -25,7 +25,7 @@ class TestScatterOp(OpTest):
        updates_np = np.random.random((2, 3)).astype("float32")
        output_np = np.copy(ref_np)
        output_np[index_np] = updates_np
-        self.inputs = {'Ref': ref_np, 'Index': index_np, 'Updates': updates_np}
+        self.inputs = {'X': ref_np, 'Ids': index_np, 'Updates': updates_np}
        self.outputs = {'Out': output_np}

    def test_check_output(self):

--- a/python/setup.py.in
+++ b/python/setup.py.in
@@ -62,20 +62,22 @@ write_version_py(filename='@PADDLE_SOURCE_DIR@/python/paddle/version.py')


 packages=['paddle',
-          'paddle.proto',
-          'paddle.trainer',
-          'paddle.trainer_config_helpers',
          'paddle.utils',
-          'paddle.v2',
-          'paddle.v2.dataset',
-          'paddle.v2.reader',
-          'paddle.v2.master',
-          'paddle.v2.plot',
          'paddle.fluid',
          'paddle.fluid.proto',
          'paddle.fluid.proto.profiler',
-          'paddle.fluid.layers',
-          'py_paddle']
+          'paddle.fluid.layers']
+
+if '${WITH_FLUID}'== 'OFF':
+    packages+=['paddle.proto',
+               'paddle.trainer',
+               'paddle.trainer_config_helpers',
+               'paddle.v2',
+               'paddle.v2.dataset',
+               'paddle.v2.reader',
+               'paddle.v2.master',
+               'paddle.v2.plot',
+               'py_paddle']

 with open('@PADDLE_SOURCE_DIR@/python/requirements.txt') as f:
    setup_requires = f.read().splitlines()
@@ -84,11 +86,29 @@ if '${CMAKE_SYSTEM_PROCESSOR}' not in ['arm', 'armv7-a', 'aarch64']:
    setup_requires+=['opencv-python']

 # the prefix is sys.prefix which should always be usr
-paddle_bin_dir = 'opt/paddle/bin'
-paddle_bins = ['${PADDLE_BINARY_DIR}/paddle/trainer/paddle_trainer',
-               '${PADDLE_BINARY_DIR}/paddle/trainer/paddle_merge_model',
-               '${PADDLE_BINARY_DIR}/paddle/pserver/paddle_pserver_main',
-               '${PADDLE_BINARY_DIR}/paddle/scripts/paddle']
+paddle_bins = ''
+if '${WITH_FLUID}'== 'OFF':
+    paddle_bin_dir = 'opt/paddle/bin'
+    paddle_bins = ['${PADDLE_BINARY_DIR}/paddle/trainer/paddle_trainer',
+                   '${PADDLE_BINARY_DIR}/paddle/trainer/paddle_merge_model',
+                   '${PADDLE_BINARY_DIR}/paddle/pserver/paddle_pserver_main',
+                   '${PADDLE_BINARY_DIR}/paddle/scripts/paddle']
+
+package_data={'paddle.fluid': ['core.so']}
+if '${WITH_FLUID}'== 'OFF':
+    package_data['paddle.v2.master']=['libpaddle_master.so']
+    package_data['py_paddle']=['*.py','_swig_paddle.so']
+
+package_dir={
+    '': '${CMAKE_CURRENT_SOURCE_DIR}',
+    # The paddle.fluid.proto will be generated while compiling.
+    # So that package points to other directory.
+    'paddle.fluid.proto.profiler': '${PADDLE_BINARY_DIR}/paddle/fluid/platform',
+    'paddle.fluid.proto': '${PADDLE_BINARY_DIR}/paddle/fluid/framework',
+}
+if '${WITH_FLUID}'== 'OFF':
+    package_dir['py_paddle']='${PADDLE_SOURCE_DIR}/paddle/py_paddle'
+    

 paddle_rt_lib_dir = 'lib'
 paddle_rt_libs = ['${WARPCTC_LIBRARIES}']
@@ -101,19 +121,8 @@ setup(name='${PACKAGE_NAME}',
      install_requires=setup_requires,
      packages=packages,
      ext_modules=[Extension('_foo', ['stub.cc'])],
-      package_data={
-        'paddle.v2.master': ['libpaddle_master.so'],
-        'paddle.fluid': ['core.so'],
-        'py_paddle':['*.py','_swig_paddle.so']
-      },
-      package_dir={
-          '': '${CMAKE_CURRENT_SOURCE_DIR}',
-          # The paddle.fluid.proto will be generated while compiling.
-          # So that package points to other directory.
-          'paddle.fluid.proto.profiler': '${PADDLE_BINARY_DIR}/paddle/fluid/platform',
-          'paddle.fluid.proto': '${PADDLE_BINARY_DIR}/paddle/fluid/framework',
-          'py_paddle': '${PADDLE_SOURCE_DIR}/paddle/py_paddle'
-      },
+      package_data=package_data,
+      package_dir=package_dir,
      scripts=paddle_bins,
      data_files=[(paddle_rt_lib_dir, paddle_rt_libs)]
 )