Commit e2d56832 authored by typhoonzero

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into split_byref_op

@@ -33,3 +33,45 @@ Xavier
    :members:
    :noindex:
MSRA
------

.. autoclass:: paddle.fluid.initializer.MSRA
    :members:
    :noindex:

ConstantInitializer
-------------------

.. autoclass:: paddle.fluid.initializer.ConstantInitializer
    :members:
    :noindex:

UniformInitializer
------------------

.. autoclass:: paddle.fluid.initializer.UniformInitializer
    :members:
    :noindex:

NormalInitializer
-----------------

.. autoclass:: paddle.fluid.initializer.NormalInitializer
    :members:
    :noindex:

XavierInitializer
-----------------

.. autoclass:: paddle.fluid.initializer.XavierInitializer
    :members:
    :noindex:

MSRAInitializer
-----------------

.. autoclass:: paddle.fluid.initializer.MSRAInitializer
    :members:
    :noindex:
# MPI-enabled PaddlePaddle Design Doc
# Background
When we do distributed multi-GPU training, the communication overhead between servers becomes the major bottleneck, for the following reasons:
1. Data must be copied from GPU to CPU memory at least once before it can be transferred, and on the pserver side, copying the received data from CPU back to GPU introduces more overhead.
2. GPU-to-CPU data transfer is about 10 times slower than data transfer between GPUs or between PCIe devices.
3. TCP connections cannot make full use of RDMA 100Gb devices.
We will add the OpenMPI API to PaddlePaddle, which brings two benefits:
1. RDMA support, giving PaddlePaddle a high-performance, low-latency network.
2. GPUDirect support, giving PaddlePaddle the highest-throughput, lowest-latency GPU reads and writes.
# Change list
* Compile args: add compile arguments to enable MPI support.
* Execute args: add execution arguments that specify when and how to use MPI operations.
* New ops: add new ops ```mpi_send_op``` and ```mpi_listenandserve_op``` to support MPI send and receive.
* Transpiler optimization: teach the transpiler to insert ```mpi_send_op``` and ```mpi_listenandserve_op``` into the running graph.
* MPI utils package: add an MPI utils package as the supporting low-level API.
## Compile args
Because MPI and CUDA require hardware support, we will add compile arguments to enable MPI support and control compilation. The ```WITH_MPI``` compile argument controls whether MPI is used. If ```WITH_MPI``` is ```ON```, the build system will look for the OpenMPI installation during configuration, so the OpenMPI environment must be prepared before compiling.
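A minimal sketch of how such a flag could reach the sources; the macro name ```PADDLE_WITH_MPI``` is an assumption here, mirroring how other compile flags are usually propagated:

```cpp
// Hypothetical sketch: when WITH_MPI=ON, the build system would define a
// macro (PADDLE_WITH_MPI is an assumed name) so MPI code can be compiled
// conditionally and the pure-gRPC path still builds without OpenMPI.
#ifdef PADDLE_WITH_MPI
#include <mpi.h>
#endif

void InitMPIIfEnabled(int* argc, char*** argv) {
#ifdef PADDLE_WITH_MPI
  // Request full thread support, since RPC handler threads may call MPI too.
  int provided = 0;
  MPI_Init_thread(argc, argv, MPI_THREAD_MULTIPLE, &provided);
#endif
}
```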
## Execute args
Launch the script with the ```mpirun``` launcher, for example: ```mpirun -np 3 -hosts node1,node2,node3 python train.py```. By doing this, we can number the actors (trainer/pserver/master) from 0 to (n-1). A node's number is the rank of the calling process within the communicator group, and MPI processes identify each other using these Rank IDs. We have to create a mapping between PaddlePaddle's nodes and their Rank IDs so that we can communicate with the correct destination when using MPI operations.
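A minimal sketch of rank discovery under this launch scheme (the helper names are hypothetical; the environment variable names are the ones OpenMPI exports, as noted in the transpiler section below):

```cpp
// Hypothetical helper that reads the rank and world size OpenMPI exports
// into every launched process's environment. A real implementation would
// then register the (endpoint -> Rank ID) mapping described later.
#include <cstdlib>
#include <string>

struct MPIIdentity {
  int rank = -1;  // this process's Rank ID within the MPI group
  int size = 0;   // total number of processes started by mpirun
};

MPIIdentity GetMPIIdentity() {
  MPIIdentity id;
  const char* rank = std::getenv("OMPI_COMM_WORLD_RANK");
  const char* size = std::getenv("OMPI_COMM_WORLD_SIZE");
  if (rank != nullptr && size != nullptr) {
    id.rank = std::stoi(rank);
    id.size = std::stoi(size);
  }
  return id;
}
```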
## New ops
We won't replace all gRPC requests with MPI requests; the standard gRPC library is still used for all administrative operations, while the MPI API is used to transfer tensors or SelectedRows to pservers. Based on this idea, we create two new operators to handle sends and receives: ```mpi_send_op``` and ```mpi_listenandserve_op```. They are similar to [send_op](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/send_op.cc) and [listen_and_serv_op](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/operators/listen_and_serv_op.cc). We will also build a new module that packages the MPI send and receive process.
### mpi_send_op
Very similar to ```send_op```: we will replace the gRPC code that sends the gradient with ```mpi_module``` and wrap it with ```framework::Async```.
### mpi_listenandserve_op
Very similar to ```listen_and_serv_op```: we will replace the gRPC code that receives the gradient with ```mpi_module``` and wrap it with ```framework::Async```.
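A minimal sketch of the send path these two ops might share; everything below is an assumption about the eventual ```mpi_module```, not Paddle's actual API, and only illustrates pairing a non-blocking ```MPI_Isend``` with an asynchronous wait:

```cpp
// Hypothetical mpi_module send path: MPI_Isend is non-blocking, so
// mpi_send_op could issue it and let framework::Async (or any thread pool)
// wait for completion, mirroring how the gRPC send is made asynchronous.
#include <mpi.h>

MPI_Request ISendBuffer(const void* buf, int byte_len, int dest_rank,
                        int tag) {
  MPI_Request request;
  // Non-blocking send of the serialized tensor bytes to the target rank.
  MPI_Isend(buf, byte_len, MPI_BYTE, dest_rank, tag, MPI_COMM_WORLD,
            &request);
  return request;
}

void WaitSend(MPI_Request* request) {
  // Blocks until the matching receive has consumed the data; in the design
  // above this call would run inside framework::Async.
  MPI_Wait(request, MPI_STATUS_IGNORE);
}
```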
## Transpiler optimized
**We can read the environment variables ```OMPI_COMM_WORLD_SIZE``` and ```OMPI_COMM_WORLD_RANK``` to determine whether to use MPI; if the job was launched with OpenMPI, these variables must exist in the environment.**
If MPI use is confirmed, we will rewrite ```send_op``` to ```mpi_send_op``` in distribute_transpiler, and likewise rewrite ```listenandserve_op``` to ```mpi_listenandserve_op```.
## MPI utils package
In this package, we will wrap the low-level OpenMPI API for the rest of PaddlePaddle.
The APIs included in this package are:
* MPI send and receive module. We will build a new module that packages the MPI send and receive process. MPI send and receive differ from gRPC: an MPI [receive](https://www.open-mpi.org/doc/v1.8/man3/MPI_Irecv.3.php) must know the receive buffer's size and element count in advance. For this reason, we have to communicate twice: the first communication sends metadata about the gradient through gRPC, and the second is the real communication through MPI, which sends the gradient data to ```mpi_listenandserve_op``` (see the receive sketch after this list).
The detailed flow is below:
![](https://github.com/seiriosPlus/Paddle/blob/mpi_enabled/doc/fluid/design/dist_train/src/mpi_module.png)
* MPI global configurations, which store the Rank IDs and the mapping in global variables, for example:
  gRPC client : MPI node : ```127.0.0.1:32004 : 3```
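A minimal sketch of the receive half of the two-phase protocol; the function is hypothetical and assumes the gRPC metadata message carried the element count:

```cpp
// Hypothetical receive path: the metadata that arrived first over gRPC
// (num_elements here) lets the receiver size its buffer before posting the
// non-blocking MPI_Irecv for the actual gradient payload.
#include <mpi.h>
#include <vector>

std::vector<float> RecvGradient(int src_rank, int tag, int num_elements) {
  std::vector<float> buffer(num_elements);
  MPI_Request request;
  MPI_Irecv(buffer.data(), num_elements, MPI_FLOAT, src_rank, tag,
            MPI_COMM_WORLD, &request);
  // Block until the payload has landed in the pre-sized buffer.
  MPI_Wait(&request, MPI_STATUS_IGNORE);
  return buffer;
}
```

And a minimal sketch of the global configuration described above, assuming a process-wide table keyed by gRPC endpoint (all names hypothetical):

```cpp
// Hypothetical global table mapping a pserver's gRPC endpoint to its MPI
// Rank ID, so "127.0.0.1:32004" can be resolved to rank 3 before any MPI
// call is made.
#include <map>
#include <mutex>
#include <string>

class MPIRankTable {
 public:
  static MPIRankTable& Instance() {
    static MPIRankTable table;  // process-wide singleton
    return table;
  }
  void Register(const std::string& endpoint, int rank) {
    std::lock_guard<std::mutex> guard(mu_);
    endpoint_to_rank_[endpoint] = rank;
  }
  int RankOf(const std::string& endpoint) {
    std::lock_guard<std::mutex> guard(mu_);
    auto it = endpoint_to_rank_.find(endpoint);
    return it == endpoint_to_rank_.end() ? -1 : it->second;
  }

 private:
  std::mutex mu_;
  std::map<std::string, int> endpoint_to_rank_;
};
```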
@@ -77,14 +77,9 @@ struct TestBroadcastOpHandle {
     local_scopes_[input_scope_idx]->Var("input");

     op_handle_.reset(new BroadcastOpHandle(local_scopes_, gpu_list_));

-    vars_.emplace_back(new VarHandle());
-    VarHandle* in_var_handle = static_cast<VarHandle*>(vars_.back().get());
-    in_var_handle->place_ = gpu_list_[input_scope_idx];
-    in_var_handle->name_ = "input";
-    in_var_handle->version_ = 1;
-    in_var_handle->scope_idx_ = input_scope_idx;
-    in_var_handle->generated_op_ = nullptr;
+    auto* in_var_handle =
+        new VarHandle(1, input_scope_idx, "input", gpu_list_[input_scope_idx]);
+    vars_.emplace_back(in_var_handle);
     op_handle_->AddInput(in_var_handle);

     // add dummy var
@@ -96,12 +91,8 @@ struct TestBroadcastOpHandle {
     for (size_t j = 0; j < gpu_list_.size(); ++j) {
       op_handle_->dev_ctxes_[gpu_list_[j]] = ctxs_[j].get();
-      vars_.emplace_back(new VarHandle());
-      VarHandle* out_var_handle = static_cast<VarHandle*>(vars_.back().get());
-      out_var_handle->place_ = gpu_list_[j];
-      out_var_handle->name_ = "out";
-      out_var_handle->version_ = 2;
-      out_var_handle->scope_idx_ = j;
+      VarHandle* out_var_handle = new VarHandle(2, j, "out", gpu_list_[j]);
+      vars_.emplace_back(out_var_handle);
       op_handle_->AddOutput(out_var_handle);
     }
...
@@ -79,13 +79,8 @@ struct TestGatherOpHandle {
     // add input
     for (size_t j = 0; j < gpu_list_.size(); ++j) {
       op_handle_->dev_ctxes_[gpu_list_[j]] = ctxs_[j].get();
-      vars_.emplace_back(new VarHandle());
-      VarHandle* in_var_handle = static_cast<VarHandle*>(vars_.back().get());
-      in_var_handle->place_ = gpu_list_[j];
-      in_var_handle->name_ = "input";
-      in_var_handle->version_ = 1;
-      in_var_handle->scope_idx_ = j;
-      in_var_handle->generated_op_ = nullptr;
+      auto* in_var_handle = new VarHandle(1, j, "input", gpu_list_[j]);
+      vars_.emplace_back(in_var_handle);
       op_handle_->AddInput(in_var_handle);
     }
@@ -97,12 +92,9 @@ struct TestGatherOpHandle {
     op_handle_->AddInput(in_dummy_var_handle);

     // add output
-    vars_.emplace_back(new VarHandle());
-    VarHandle* out_var_handle = static_cast<VarHandle*>(vars_.back().get());
-    out_var_handle->place_ = gpu_list_[input_scope_idx];
-    out_var_handle->name_ = "out";
-    out_var_handle->version_ = 2;
-    out_var_handle->scope_idx_ = input_scope_idx;
+    auto* out_var_handle =
+        new VarHandle(2, input_scope_idx, "out", gpu_list_[input_scope_idx]);
+    vars_.emplace_back(out_var_handle);
     op_handle_->AddOutput(out_var_handle);

     // add dummy var
...
@@ -177,13 +177,9 @@ std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
         auto &prev_grad = vars[vars.size() - 1];
         op_handle->AddInput(prev_grad.get());

-        vars.emplace_back(new VarHandle);
-        auto &var = vars.back();
-        var->place_ = p;
-        var->name_ = og;
-        var->version_ = vars.size() - 1;
-        op_handle->AddOutput(var.get());
+        auto var = new VarHandle(vars.size() - 1, i, og, p);
+        vars.emplace_back(var);
+        op_handle->AddOutput(var);
       }
 #else
       PADDLE_ENFORCE("Not implemented");
...
@@ -54,13 +54,8 @@ VarHandle *SSAGraphBuilder::CreateOrGetLatestVarHandle(
   auto &var_holder = var_holders[each_var_name];
   VarHandle *var = nullptr;
   if (var_holder.empty()) {
-    var_holder.emplace_back(new VarHandle);
-    auto &init_var = var_holder[0];
-    init_var->place_ = place;
-    init_var->name_ = each_var_name;
-    init_var->generated_op_ = nullptr;
-    init_var->version_ = 0;
-    var = init_var.get();
+    var = new VarHandle(0, place_offset, each_var_name, place);
+    var_holder.emplace_back(var);
   } else {
     var = var_holder.rbegin()->get();
   }
@@ -73,12 +68,9 @@ void SSAGraphBuilder::CreateOpOutput(SSAGraph *graph, OpHandleBase *op_handle,
                                      size_t place_offset) {
   auto &vars = graph->vars_[place_offset][each_var_name];
   size_t version = vars.size();
-  vars.emplace_back(new VarHandle());
-  auto &var = vars.back();
-  var->version_ = version;
-  var->name_ = each_var_name;
-  var->place_ = place;
-  op_handle->AddOutput(var.get());
+  auto var = new VarHandle(version, place_offset, each_var_name, place);
+  vars.emplace_back(var);
+  op_handle->AddOutput(var);
 }

 template <typename Callback>
...
@@ -16,6 +16,7 @@
 #include <sstream>
 #include <string>
 #include <unordered_set>
+#include <utility>

 #include "paddle/fluid/platform/place.h"
@@ -33,10 +34,10 @@ struct VarHandleBase {
   // The operator who generate this variable. nullptr if the variable
   // is a root node.
-  OpHandleBase *generated_op_;
+  OpHandleBase* generated_op_{nullptr};

   // Operators which depend on this variable ready.
-  std::unordered_set<OpHandleBase *> pending_ops_;
+  std::unordered_set<OpHandleBase*> pending_ops_;
 };

 // VarHandle is actually a single version of Runtime Variable.
@@ -47,6 +48,13 @@ struct VarHandleBase {
 struct VarHandle : public VarHandleBase {
   std::string DebugString() const override;

+  VarHandle(size_t version, size_t scope_index, std::string name,
+            platform::Place place)
+      : version_(version),
+        scope_idx_(scope_index),
+        name_(std::move(name)),
+        place_(std::move(place)) {}
+
   // version field currently is not used, however, just store the version to
   // debug easily.
   size_t version_;
...
@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License. */

 #include "paddle/fluid/operators/beam_search_decode_op.h"
+#include <string>

 #include "paddle/fluid/platform/device_context.h"

 namespace paddle {
...
@@ -14,6 +14,7 @@ limitations under the License. */

 #pragma once

+#include <vector>
 #include "paddle/fluid/framework/lod_tensor_array.h"
 #include "paddle/fluid/framework/op_registry.h"
@@ -87,7 +88,7 @@ struct BeamSearchDecoder {
    */
   std::vector<BeamNodeVector<T>> PackTwoSteps(
       const LoDTensor& cur_ids, const LoDTensor& cur_scores,
-      std::vector<BeamNodeVector<T>>& prefixes_list,
+      std::vector<BeamNodeVector<T>>* prefixes_list,
       std::vector<SentenceVector<T>>* sentence_vector_list) const;

   /**
@@ -140,7 +141,7 @@ Sentence<T> BeamSearchDecoder<T>::MakeSentence(const BeamNode<T>* node) const {
 template <typename T>
 std::vector<BeamNodeVector<T>> BeamSearchDecoder<T>::PackTwoSteps(
     const LoDTensor& cur_ids, const LoDTensor& cur_scores,
-    std::vector<BeamNodeVector<T>>& prefixes_list,
+    std::vector<BeamNodeVector<T>>* prefixes_list,
     std::vector<SentenceVector<T>>* sentence_vector_list) const {
   std::vector<BeamNodeVector<T>> result;
@@ -153,7 +154,7 @@ std::vector<BeamNodeVector<T>> BeamSearchDecoder<T>::PackTwoSteps(
     // if prefixes size is 0, it means this is the first step. In this step,
     // all candidate id is the start of candidate sentences.
-    if (prefixes_list.empty()) {
+    if (prefixes_list->empty()) {
       PADDLE_ENFORCE_EQ(cur_ids.lod().at(kSourceLevel).back(),
                         cur_ids.lod().at(kSentenceLevel).back(),
                         "in the first step");
@@ -162,7 +163,7 @@ std::vector<BeamNodeVector<T>> BeamSearchDecoder<T>::PackTwoSteps(
             cur_ids.data<int64_t>()[id_idx], cur_scores.data<T>()[id_idx])));
       }
     } else {
-      BeamNodeVector<T>& prefixes = prefixes_list[src_idx];
+      BeamNodeVector<T>& prefixes = prefixes_list->at(src_idx);
       SentenceVector<T>& sentence_vector = (*sentence_vector_list)[src_idx];

       PADDLE_ENFORCE_EQ(src_end - src_start, prefixes.size(),
@@ -262,7 +263,7 @@ void BeamSearchDecoder<T>::PackAllSteps(const LoDTensorArray& step_ids,
   for (size_t step_id = 0; step_id < step_num; ++step_id) {
     beamnode_vector_list =
         PackTwoSteps(step_ids.at(step_id), step_scores.at(step_id),
-                     beamnode_vector_list, &sentence_vector_list);
+                     &beamnode_vector_list, &sentence_vector_list);
   }
   // append last beam_node to result
   for (size_t src_idx = 0; src_idx < src_num; ++src_idx) {
...
@@ -125,7 +125,7 @@ TEST(BeamSearchDecodeOp, PackTwoStepsFistStep) {
   BeamSearchDecoder<float> helper;
   beamnode_vector_list = helper.PackTwoSteps(
-      ids[0], scores[0], beamnode_vector_list, &sentence_vector_list);
+      ids[0], scores[0], &beamnode_vector_list, &sentence_vector_list);
   ASSERT_EQ(beamnode_vector_list.size(), 2UL);
   ASSERT_EQ(beamnode_vector_list[0].size(), 2UL);
   ASSERT_EQ(beamnode_vector_list[1].size(), 4UL);
@@ -167,7 +167,7 @@ TEST(BeamSearchDecodeOp, PackTwoSteps) {
   BeamSearchDecoder<float> helper1;
   beamnode_vector_list = helper1.PackTwoSteps(
-      ids[0], scores[0], beamnode_vector_list, &sentence_vector_list);
+      ids[0], scores[0], &beamnode_vector_list, &sentence_vector_list);
   ASSERT_EQ(sentence_vector_list[0].size(), 1UL);
   ASSERT_EQ(sentence_vector_list[1].size(), 0UL);
...
@@ -14,7 +14,10 @@ limitations under the License. */

 #include "paddle/fluid/operators/beam_search_op.h"

+#include <algorithm>
 #include <map>
+#include <string>
+#include <vector>

 #include "paddle/fluid/framework/lod_tensor.h"
 #include "paddle/fluid/framework/op_registry.h"
...
@@ -18,6 +18,8 @@ limitations under the License. */
 #include "gtest/gtest.h"
 #endif

+#include <string>
+#include <vector>
 #include "paddle/fluid/framework/lod_tensor.h"
 #include "paddle/fluid/framework/operator.h"
...
@@ -13,6 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License. */

 #include "paddle/fluid/operators/chunk_eval_op.h"
+#include <string>
+#include <vector>

 namespace paddle {
 namespace operators {
...
@@ -14,6 +14,9 @@ limitations under the License. */

 #pragma once

 #include <set>
+#include <string>
+#include <vector>
 #include "paddle/fluid/framework/eigen.h"
 #include "paddle/fluid/framework/op_registry.h"
@@ -36,11 +39,11 @@ class ChunkEvalKernel : public framework::OpKernel<T> {
   };

   void GetSegments(const int64_t* label, int length,
-                   std::vector<Segment>& segments, int num_chunk_types,
+                   std::vector<Segment>* segments, int num_chunk_types,
                    int num_tag_types, int other_chunk_type, int tag_begin,
                    int tag_inside, int tag_end, int tag_single) const {
-    segments.clear();
-    segments.reserve(length);
+    segments->clear();
+    segments->reserve(length);
     int chunk_start = 0;
     bool in_chunk = false;
     int tag = -1;
@@ -58,7 +61,7 @@ class ChunkEvalKernel : public framework::OpKernel<T> {
             i - 1,  // end
             prev_type,
         };
-        segments.push_back(segment);
+        segments->push_back(segment);
         in_chunk = false;
       }
       if (ChunkBegin(prev_tag, prev_type, tag, type, other_chunk_type,
@@ -73,7 +76,7 @@ class ChunkEvalKernel : public framework::OpKernel<T> {
           length - 1,  // end
           type,
       };
-      segments.push_back(segment);
+      segments->push_back(segment);
     }
   }
@@ -177,8 +180,8 @@ class ChunkEvalKernel : public framework::OpKernel<T> {
     for (int i = 0; i < num_sequences; ++i) {
       int seq_length = lod[0][i + 1] - lod[0][i];
       EvalOneSeq(inference_data + lod[0][i], label_data + lod[0][i], seq_length,
-                 output_segments, label_segments, *num_infer_chunks_data,
-                 *num_label_chunks_data, *num_correct_chunks_data,
+                 &output_segments, &label_segments, num_infer_chunks_data,
+                 num_label_chunks_data, num_correct_chunks_data,
                  num_chunk_types, num_tag_types, other_chunk_type, tag_begin,
                  tag_inside, tag_end, tag_single, excluded_chunk_types);
     }
@@ -197,10 +200,10 @@ class ChunkEvalKernel : public framework::OpKernel<T> {
   }

   void EvalOneSeq(const int64_t* output, const int64_t* label, int length,
-                  std::vector<Segment>& output_segments,
-                  std::vector<Segment>& label_segments,
-                  int64_t& num_output_segments, int64_t& num_label_segments,
-                  int64_t& num_correct, int num_chunk_types, int num_tag_types,
+                  std::vector<Segment>* output_segments,
+                  std::vector<Segment>* label_segments,
+                  int64_t* num_output_segments, int64_t* num_label_segments,
+                  int64_t* num_correct, int num_chunk_types, int num_tag_types,
                   int other_chunk_type, int tag_begin, int tag_inside,
                   int tag_end, int tag_single,
                   const std::set<int>& excluded_chunk_types) const {
@@ -209,25 +212,29 @@ class ChunkEvalKernel : public framework::OpKernel<T> {
     GetSegments(label, length, label_segments, num_chunk_types, num_tag_types,
                 other_chunk_type, tag_begin, tag_inside, tag_end, tag_single);
     size_t i = 0, j = 0;
-    while (i < output_segments.size() && j < label_segments.size()) {
-      if (output_segments[i] == label_segments[j] &&
-          excluded_chunk_types.count(output_segments[i].type) != 1) {
-        ++num_correct;
+    while (i < output_segments->size() && j < label_segments->size()) {
+      if (output_segments->at(i) == label_segments->at(j) &&
+          excluded_chunk_types.count(output_segments->at(i).type) != 1) {
+        ++(*num_correct);
       }
-      if (output_segments[i].end < label_segments[j].end) {
+      if (output_segments->at(i).end < label_segments->at(j).end) {
         ++i;
-      } else if (output_segments[i].end > label_segments[j].end) {
+      } else if (output_segments->at(i).end > label_segments->at(j).end) {
         ++j;
       } else {
         ++i;
         ++j;
       }
     }
-    for (auto& segment : label_segments) {
-      if (excluded_chunk_types.count(segment.type) != 1) ++num_label_segments;
+    for (auto& segment : (*label_segments)) {
+      if (excluded_chunk_types.count(segment.type) != 1) {
+        ++(*num_label_segments);
+      }
     }
-    for (auto& segment : output_segments) {
-      if (excluded_chunk_types.count(segment.type) != 1) ++num_output_segments;
+    for (auto& segment : (*output_segments)) {
+      if (excluded_chunk_types.count(segment.type) != 1) {
+        ++(*num_output_segments);
+      }
     }
   }
 };
...
@@ -73,9 +73,11 @@ class ConvMKLDNNOpKernel : public paddle::framework::OpKernel<T> {
         dst_tz, mkldnn::memory::data_type::f32, mkldnn::memory::format::nchw);

     auto src_memory =
-        mkldnn::memory({src_md, mkldnn_engine}, (void*)input_data);
+        mkldnn::memory({src_md, mkldnn_engine},
+                       reinterpret_cast<void*>(const_cast<T*>(input_data)));
     auto weights_memory =
-        mkldnn::memory({weights_md, mkldnn_engine}, (void*)filter_data);
+        mkldnn::memory({weights_md, mkldnn_engine},
+                       reinterpret_cast<void*>(const_cast<T*>(filter_data)));
     auto dst_memory = mkldnn::memory({dst_md, mkldnn_engine}, output_data);

     std::shared_ptr<mkldnn::convolution_forward::primitive_desc> conv_pd =
@@ -180,8 +182,9 @@ class ConvMKLDNNGradOpKernel : public paddle::framework::OpKernel<T> {
         dst_tz, mkldnn::memory::data_type::f32, mkldnn::memory::format::nchw);

     // create memory
-    auto diff_dst_memory = mkldnn::memory({diff_weights_md, mkldnn_engine},
-                                          (void*)output_grad_data);
+    auto diff_dst_memory = mkldnn::memory(
+        {diff_weights_md, mkldnn_engine},
+        reinterpret_cast<void*>(const_cast<T*>(output_grad_data)));
     // Retrieve conv_pd from device context
     auto conv_pd =
         std::static_pointer_cast<mkldnn::convolution_forward::primitive_desc>(
@@ -198,10 +201,12 @@ class ConvMKLDNNGradOpKernel : public paddle::framework::OpKernel<T> {
           mkldnn_engine);

       // create memory
-      auto diff_weights_memory = mkldnn::memory(
-          {diff_weights_md, mkldnn_engine}, (void*)filter_grad_data);
+      auto diff_weights_memory =
+          mkldnn::memory({diff_weights_md, mkldnn_engine},
+                         reinterpret_cast<void*>(filter_grad_data));
       auto src_memory =
-          mkldnn::memory({src_md, mkldnn_engine}, (void*)input_data);
+          mkldnn::memory({src_md, mkldnn_engine},
+                         reinterpret_cast<void*>(const_cast<T*>(input_data)));

       // create backward conv primitive for weights
       auto conv_bwd_weights_prim = mkldnn::convolution_backward_weights(
@@ -220,10 +225,12 @@ class ConvMKLDNNGradOpKernel : public paddle::framework::OpKernel<T> {
           strides, paddings, *conv_pd, mkldnn_engine);

       // create memory
-      auto diff_src_memory =
-          mkldnn::memory({diff_src_md, mkldnn_engine}, (void*)input_grad_data);
+      auto diff_src_memory = mkldnn::memory(
+          {diff_src_md, mkldnn_engine},
+          reinterpret_cast<void*>(const_cast<T*>(input_grad_data)));
       auto weights_memory =
-          mkldnn::memory({weights_md, mkldnn_engine}, (void*)filter_data);
+          mkldnn::memory({weights_md, mkldnn_engine},
+                         reinterpret_cast<void*>(const_cast<T*>(filter_data)));

       // create backward conv primitive for data
       auto conv_bwd_data_prim = mkldnn::convolution_backward_data(
...
@@ -14,6 +14,7 @@ limitations under the License. */

 #pragma once

+#include <vector>
 #include "paddle/fluid/framework/eigen.h"
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/operators/math/depthwise_conv.h"
@@ -41,9 +42,10 @@ inline int ConvOutputSize(int input_size, int filter_size, int dilation,
   return output_size;
 }

-inline bool IsExpand(std::vector<int64_t>& filter_dim,
-                     std::vector<int>& strides, std::vector<int>& paddings,
-                     std::vector<int>& dilations) {
+inline bool IsExpand(const std::vector<int64_t>& filter_dim,
+                     const std::vector<int>& strides,
+                     const std::vector<int>& paddings,
+                     const std::vector<int>& dilations) {
   bool filter_1 = true, strides_1 = true, padding_0 = true, dilation_1 = true;
   for (size_t j = 0; j < strides.size(); ++j) {
     filter_1 = filter_1 && (static_cast<int>(filter_dim[j + 2]) == 1);
...
@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License. */

 #include "paddle/fluid/operators/detection_map_op.h"
+#include <string>

 namespace paddle {
 namespace operators {
...
@@ -13,6 +13,11 @@ See the License for the specific language governing permissions and
 limitations under the License. */

 #pragma once
+#include <algorithm>
+#include <map>
+#include <string>
+#include <utility>
+#include <vector>
 #include "paddle/fluid/framework/eigen.h"
 #include "paddle/fluid/framework/op_registry.h"
@@ -82,7 +87,7 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
     std::vector<std::map<int, std::vector<Box>>> gt_boxes;
     std::vector<std::map<int, std::vector<std::pair<T, Box>>>> detect_boxes;

-    GetBoxes(*in_label, *in_detect, gt_boxes, detect_boxes);
+    GetBoxes(*in_label, *in_detect, &gt_boxes, detect_boxes);

     std::map<int, int> label_pos_count;
     std::map<int, std::vector<std::pair<T, int>>> true_pos;
@@ -95,20 +100,20 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
     }

     if (in_pos_count != nullptr && state) {
-      GetInputPos(*in_pos_count, *in_true_pos, *in_false_pos, label_pos_count,
-                  true_pos, false_pos, class_num);
+      GetInputPos(*in_pos_count, *in_true_pos, *in_false_pos, &label_pos_count,
+                  &true_pos, &false_pos, class_num);
     }

     CalcTrueAndFalsePositive(gt_boxes, detect_boxes, evaluate_difficult,
-                             overlap_threshold, label_pos_count, true_pos,
-                             false_pos);
+                             overlap_threshold, &label_pos_count, &true_pos,
+                             &false_pos);

     int background_label = ctx.Attr<int>("background_label");
     T map = CalcMAP(ap_type, label_pos_count, true_pos, false_pos,
                     background_label);

-    GetOutputPos(ctx, label_pos_count, true_pos, false_pos, *out_pos_count,
-                 *out_true_pos, *out_false_pos, class_num);
+    GetOutputPos(ctx, label_pos_count, true_pos, false_pos, out_pos_count,
+                 out_true_pos, out_false_pos, class_num);

     T* map_data = out_map->mutable_data<T>(ctx.GetPlace());
     map_data[0] = map;
@@ -155,7 +160,7 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
   void GetBoxes(const framework::LoDTensor& input_label,
                 const framework::LoDTensor& input_detect,
-                std::vector<std::map<int, std::vector<Box>>>& gt_boxes,
+                std::vector<std::map<int, std::vector<Box>>>* gt_boxes,
                 std::vector<std::map<int, std::vector<std::pair<T, Box>>>>&
                     detect_boxes) const {
     auto labels = framework::EigenTensor<T, 2>::From(input_label);
@@ -179,7 +184,7 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
         box.is_difficult = true;
         boxes[label].push_back(box);
       }
-      gt_boxes.push_back(boxes);
+      gt_boxes->push_back(boxes);
     }

     auto detect_index = detect_lod[0];
@@ -200,9 +205,9 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
       const std::map<int, int>& label_pos_count,
       const std::map<int, std::vector<std::pair<T, int>>>& true_pos,
       const std::map<int, std::vector<std::pair<T, int>>>& false_pos,
-      framework::Tensor& output_pos_count,
-      framework::LoDTensor& output_true_pos,
-      framework::LoDTensor& output_false_pos, const int class_num) const {
+      framework::Tensor* output_pos_count,
+      framework::LoDTensor* output_true_pos,
+      framework::LoDTensor* output_false_pos, const int class_num) const {
     int true_pos_count = 0;
     int false_pos_count = 0;
     for (auto it = true_pos.begin(); it != true_pos.end(); ++it) {
@@ -214,12 +219,12 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
       false_pos_count += fp.size();
     }

-    int* pos_count_data = output_pos_count.mutable_data<int>(
+    int* pos_count_data = output_pos_count->mutable_data<int>(
         framework::make_ddim({class_num, 1}), ctx.GetPlace());
-    T* true_pos_data = output_true_pos.mutable_data<T>(
+    T* true_pos_data = output_true_pos->mutable_data<T>(
         framework::make_ddim({true_pos_count, 2}), ctx.GetPlace());
-    T* false_pos_data = output_false_pos.mutable_data<T>(
+    T* false_pos_data = output_false_pos->mutable_data<T>(
         framework::make_ddim({false_pos_count, 2}), ctx.GetPlace());
     true_pos_count = 0;
     false_pos_count = 0;
@@ -261,21 +266,21 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
     framework::LoD false_pos_lod;
     false_pos_lod.emplace_back(false_pos_starts);

-    output_true_pos.set_lod(true_pos_lod);
-    output_false_pos.set_lod(false_pos_lod);
+    output_true_pos->set_lod(true_pos_lod);
+    output_false_pos->set_lod(false_pos_lod);
     return;
   }

   void GetInputPos(const framework::Tensor& input_pos_count,
                    const framework::LoDTensor& input_true_pos,
                    const framework::LoDTensor& input_false_pos,
-                   std::map<int, int>& label_pos_count,
-                   std::map<int, std::vector<std::pair<T, int>>>& true_pos,
-                   std::map<int, std::vector<std::pair<T, int>>>& false_pos,
+                   std::map<int, int>* label_pos_count,
+                   std::map<int, std::vector<std::pair<T, int>>>* true_pos,
+                   std::map<int, std::vector<std::pair<T, int>>>* false_pos,
                    const int class_num) const {
     const int* pos_count_data = input_pos_count.data<int>();
     for (int i = 0; i < class_num; ++i) {
-      label_pos_count[i] = pos_count_data[i];
+      (*label_pos_count)[i] = pos_count_data[i];
     }

     auto SetData = [](const framework::LoDTensor& pos_tensor,
@@ -291,8 +296,8 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
       }
     };

-    SetData(input_true_pos, true_pos);
-    SetData(input_false_pos, false_pos);
+    SetData(input_true_pos, *true_pos);
+    SetData(input_false_pos, *false_pos);
     return;
   }
@@ -301,9 +306,9 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
       const std::vector<std::map<int, std::vector<std::pair<T, Box>>>>&
          detect_boxes,
       bool evaluate_difficult, float overlap_threshold,
-      std::map<int, int>& label_pos_count,
-      std::map<int, std::vector<std::pair<T, int>>>& true_pos,
-      std::map<int, std::vector<std::pair<T, int>>>& false_pos) const {
+      std::map<int, int>* label_pos_count,
+      std::map<int, std::vector<std::pair<T, int>>>* true_pos,
+      std::map<int, std::vector<std::pair<T, int>>>* false_pos) const {
     int batch_size = gt_boxes.size();
     for (int n = 0; n < batch_size; ++n) {
       auto image_gt_boxes = gt_boxes[n];
@@ -320,10 +325,10 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
           continue;
         }
         int label = it->first;
-        if (label_pos_count.find(label) == label_pos_count.end()) {
-          label_pos_count[label] = count;
+        if (label_pos_count->find(label) == label_pos_count->end()) {
+          (*label_pos_count)[label] = count;
         } else {
-          label_pos_count[label] += count;
+          (*label_pos_count)[label] += count;
         }
       }
     }
@@ -338,8 +343,8 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
         int label = it->first;
         for (size_t i = 0; i < pred_boxes.size(); ++i) {
           auto score = pred_boxes[i].first;
-          true_pos[label].push_back(std::make_pair(score, 0));
-          false_pos[label].push_back(std::make_pair(score, 1));
+          (*true_pos)[label].push_back(std::make_pair(score, 0));
+          (*false_pos)[label].push_back(std::make_pair(score, 1));
         }
       }
       continue;
@@ -351,8 +356,8 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
       if (image_gt_boxes.find(label) == image_gt_boxes.end()) {
         for (size_t i = 0; i < pred_boxes.size(); ++i) {
           auto score = pred_boxes[i].first;
-          true_pos[label].push_back(std::make_pair(score, 0));
-          false_pos[label].push_back(std::make_pair(score, 1));
+          (*true_pos)[label].push_back(std::make_pair(score, 0));
+          (*false_pos)[label].push_back(std::make_pair(score, 1));
         }
         continue;
       }
@@ -381,17 +386,17 @@ class DetectionMAPOpKernel : public framework::OpKernel<T> {
             (!evaluate_difficult && !matched_bboxes[max_idx].is_difficult);
         if (match_evaluate_difficult) {
           if (!visited[max_idx]) {
-            true_pos[label].push_back(std::make_pair(score, 1));
-            false_pos[label].push_back(std::make_pair(score, 0));
+            (*true_pos)[label].push_back(std::make_pair(score, 1));
+            (*false_pos)[label].push_back(std::make_pair(score, 0));
             visited[max_idx] = true;
           } else {
-            true_pos[label].push_back(std::make_pair(score, 0));
-            false_pos[label].push_back(std::make_pair(score, 1));
+            (*true_pos)[label].push_back(std::make_pair(score, 0));
+            (*false_pos)[label].push_back(std::make_pair(score, 1));
           }
         }
       } else {
-        true_pos[label].push_back(std::make_pair(score, 0));
-        false_pos[label].push_back(std::make_pair(score, 1));
+        (*true_pos)[label].push_back(std::make_pair(score, 0));
+        (*false_pos)[label].push_back(std::make_pair(score, 1));
       }
     }
   }
...
@@ -14,6 +14,7 @@ limitations under the License. */

 #include <algorithm>
 #include "paddle/fluid/framework/op_registry.h"
+#include "paddle/fluid/operators/edit_distance_op.h"
 #include "paddle/fluid/operators/math/math_function.h"
 #include "paddle/fluid/platform/cuda_helper.h"
 #include "paddle/fluid/platform/gpu_info.h"
...
@@ -12,6 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. */

+#include <fstream>
 #include <ostream>
 #include <thread>  // NOLINT
 #include <vector>
@@ -67,7 +68,7 @@ ListenAndServOp::ListenAndServOp(const std::string &type,
                                  const framework::AttributeMap &attrs)
     : OperatorBase(type, inputs, outputs, attrs) {}

-int ListenAndServOp::GetSelectedPort() {
+int ListenAndServOp::GetSelectedPort() const {
   return rpc_service_->GetSelectedPort();
 }
@@ -99,7 +100,7 @@ void ListenAndServOp::RunImpl(const framework::Scope &scope,
   framework::Executor executor(dev_place);
   std::vector<int> block_list;
   for (size_t blkid = 1; blkid < num_blocks; ++blkid) {
-    if (blkid != prefetch_block->ID()) {
+    if (blkid != static_cast<size_t>(prefetch_block->ID())) {
       block_list.push_back(blkid);
     }
   }
@@ -121,10 +122,14 @@ void ListenAndServOp::RunImpl(const framework::Scope &scope,
   rpc_service_->SetProgram(program);
   // start the server listening after all member initialized.
   server_thread_.reset(new std::thread(RunServer, rpc_service_));
-  // FIXME(typhoonzero): do we need to wait until the server port is ready?
+  VLOG(3) << "wait server thread to become ready...";
   sleep(5);
+  // Write to a file of server selected port for python use.
+  std::ofstream port_file;
+  port_file.open("/tmp/paddle.selected_port");
+  port_file << rpc_service_->GetSelectedPort();
+  port_file.close();

-  // TODO(typhoonzero): change this to a while_op for every cluster-batch.
   bool exit_flag = false;
   // Record received sparse variables, so that
   // we could reset those after execute optimize program
@@ -175,7 +180,7 @@ void ListenAndServOp::RunImpl(const framework::Scope &scope,
       parallel_blkids.push_back(1);
       double ts = detail::GetTimestamp();
       for (size_t blkid = 2; blkid < num_blocks; ++blkid) {
-        if (blkid != prefetch_block->ID()) {
+        if (blkid != static_cast<size_t>(prefetch_block->ID())) {
           if (program->Block(blkid).Parent() != last_parent_blkid) {
             ParallelExecuteBlocks(parallel_blkids, &executor, optimize_prepared,
                                   program, &recv_scope);
...
@@ -39,7 +39,7 @@ class ListenAndServOp : public framework::OperatorBase {
                   const framework::VariableNameMap &outputs,
                   const framework::AttributeMap &attrs);

-  int GetSelectedPort();
+  int GetSelectedPort() const;

   void Stop() override;
...
@@ -139,7 +139,6 @@ void StartServerNet(bool is_sparse) {
   attrs.insert({"PrefetchBlock", prefetch_block});
   listen_and_serv_op =
       f::OpRegistry::CreateOp("listen_and_serv", {{"X", {"x1"}}}, {}, attrs);
-  LOG(INFO) << "selected port before run " << selected_port;
   listen_and_serv_op->Run(scope, place);
   LOG(INFO) << "server exit";
 }
@@ -158,16 +157,13 @@ TEST(SendRecvOp, CPUDense) {
   selected_port = static_cast<paddle::operators::ListenAndServOp *>(
                       listen_and_serv_op.get())
                       ->GetSelectedPort();
-  LOG(INFO) << "selected port " << selected_port;
   std::string endpoint = paddle::string::Sprintf("127.0.0.1:%d", selected_port);
   attrs.insert({"endpoints", std::vector<std::string>({endpoint})});
   attrs.insert({"epmap", std::vector<std::string>({endpoint})});
   auto send_op = f::OpRegistry::CreateOp(
       "send", {{"X", {"x1"}}},
       {{"Out", {"Out"}}, {"RPCClient", {"RPC_CLIENT_VAR"}}}, attrs);
-  LOG(INFO) << "before run " << endpoint;
   send_op->Run(scope, place);
-  LOG(INFO) << "end run";

   auto in_var = scope.Var("x1");
   auto tensor = in_var->GetMutable<f::LoDTensor>();
@@ -180,7 +176,6 @@ TEST(SendRecvOp, CPUDense) {
   for (int64_t i = 0; i < target->numel(); ++i) {
     EXPECT_EQ(expected[i] * 2, actual[i]);
   }
-  LOG(INFO) << "before stop";
   listen_and_serv_op->Stop();
   server_thread.join();
   listen_and_serv_op.reset(nullptr);
@@ -199,7 +194,6 @@ TEST(SendRecvOp, CPUSparse) {
   selected_port = static_cast<paddle::operators::ListenAndServOp *>(
                       listen_and_serv_op.get())
                       ->GetSelectedPort();
-  LOG(INFO) << "selected port " << selected_port;
   std::string endpoint = paddle::string::Sprintf("127.0.0.1:%d", selected_port);
   attrs.insert({"endpoints", std::vector<std::string>({endpoint})});
   attrs.insert({"epmap", std::vector<std::string>({endpoint})});
...
@@ -13,7 +13,7 @@
 # limitations under the License.

 from .. import core
-from ..framework import convert_np_dtype_to_dtype_, default_main_program, default_startup_program
+from ..framework import convert_np_dtype_to_dtype_, default_main_program, default_startup_program, Program
 from ..unique_name import generate as unique_name
 from control_flow import BlockGuard
 from ..layer_helper import LayerHelper
@@ -158,6 +158,7 @@ class ListenAndServ(object):
         main_program = self.helper.main_program
         current_block = main_program.current_block()
         parent_block = self.parent_block()
+        empty_block = Program().global_block()

         parent_block.append_op(
             type='listen_and_serv',
@@ -166,11 +167,12 @@ class ListenAndServ(object):
             attrs={
                 'endpoint': self.endpoint,
                 'Fanin': self.fan_in,
-                'OptimizeBlock': current_block
+                'OptimizeBlock': current_block,
+                'PrefetchBlock': empty_block
             })


-def Send(endpoints, send_vars, get_vars):
+def Send(endpoints, send_vars, get_vars=None):
     """
     Send layer
@@ -184,7 +186,6 @@ def Send(endpoints, send_vars, get_vars):
        side when server have finished running server side program.
     """
     assert (type(send_vars) == list)
-    assert (type(get_vars) == list)

     epmap = endpoints.split(",")
     endpoints = list(set(epmap))
@@ -192,6 +193,11 @@ def Send(endpoints, send_vars, get_vars):
     helper = LayerHelper("Send", **locals())
     rpc_client_var = default_main_program().global_block().create_var(
         name="RPC_CLIENT_VAR", persistable=True, type=core.VarDesc.VarType.RAW)
+    if not get_vars:
+        get_vars = []
+        for s in send_vars:
+            v = helper.create_tmp_variable(dtype=s.dtype, stop_gradient=True)
+            get_vars.append(v)

     helper.append_op(
         type="send",
@@ -200,6 +206,7 @@ def Send(endpoints, send_vars, get_vars):
                  "RPCClient": rpc_client_var},
         attrs={"endpoints": endpoints,
                "epmap": epmap})
+    return get_vars


 def Recv(endpoints, get_vars):
...
 file(GLOB TEST_OPS RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}" "test_*.py")
 string(REPLACE ".py" "" TEST_OPS "${TEST_OPS}")

-# The fully connected test is removed whe the WITH_MKLDNN flag is OFF
-# Because the fully connected layer has only one kernel (MKLDNN)
+# The MKLDNN tests are skiped when the MKLDNN flag is OFF
 if(NOT WITH_MKLDNN)
-    list(REMOVE_ITEM TEST_OPS test_fc_op)
+    foreach(src ${TEST_OPS})
+        if(${src} MATCHES ".*_mkldnn_op$")
+            list(REMOVE_ITEM TEST_OPS ${src})
+        endif()
+    endforeach()
 endif(NOT WITH_MKLDNN)

 if(NOT WITH_DISTRIBUTE)
@@ -62,6 +65,7 @@ list(REMOVE_ITEM TEST_OPS test_registry)
 list(REMOVE_ITEM TEST_OPS test_fetch_var)
 list(REMOVE_ITEM TEST_OPS test_parallel_op)
 list(REMOVE_ITEM TEST_OPS test_dynrnn_static_input)
+list(REMOVE_ITEM TEST_OPS test_dist_train)

 # tests that can be bundled together in one python process for speed.
 if(WITH_FAST_BUNDLE_TEST)
@@ -100,3 +104,4 @@ py_test_modules(test_registry MODULES test_registry)
 py_test_modules(test_fetch_var MODULES test_fetch_var)
 py_test_modules(test_dynrnn_static_input MODULES test_dynrnn_static_input)
 py_test_modules(test_parallel_op MODULES test_parallel_op)
+py_test_modules(test_dist_train MODULES test_dist_train)
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import numpy as np
import paddle.fluid.core as core
from op_test import OpTest
from scipy.special import expit
from test_activation_op import TestRelu, TestTanh, TestSqrt, TestAbs

class TestMKLDNNReluDim2(TestRelu):
    def setUp(self):
        super(TestMKLDNNReluDim2, self).setUp()
        self.attrs = {"use_mkldnn": True}


class TestMKLDNNTanhDim2(TestTanh):
    def setUp(self):
        super(TestMKLDNNTanhDim2, self).setUp()
        self.attrs = {"use_mkldnn": True}


class TestMKLDNNSqrtDim2(TestSqrt):
    def setUp(self):
        super(TestMKLDNNSqrtDim2, self).setUp()
        self.attrs = {"use_mkldnn": True}


class TestMKLDNNAbsDim2(TestAbs):
    def setUp(self):
        super(TestMKLDNNAbsDim2, self).setUp()
        self.attrs = {"use_mkldnn": True}


class TestMKLDNNReluDim4(TestRelu):
    def setUp(self):
        super(TestMKLDNNReluDim4, self).setUp()
        x = np.random.uniform(-1, 1, [2, 4, 3, 5]).astype("float32")
        # Same reason as in TestAbs: keep inputs away from zero, where
        # the op is non-differentiable and the gradient check is unreliable.
        x[np.abs(x) < 0.005] = 0.02
        out = np.maximum(x, 0)
        self.inputs = {'X': OpTest.np_dtype_to_fluid_dtype(x)}
        self.outputs = {'Out': out}
        self.attrs = {"use_mkldnn": True}


class TestMKLDNNTanhDim4(TestTanh):
    def setUp(self):
        super(TestMKLDNNTanhDim4, self).setUp()
        self.inputs = {
            'X': np.random.uniform(0.1, 1, [2, 4, 3, 5]).astype("float32")
        }
        self.outputs = {'Out': np.tanh(self.inputs['X'])}
        self.attrs = {"use_mkldnn": True}


class TestMKLDNNSqrtDim4(TestSqrt):
    def setUp(self):
        super(TestMKLDNNSqrtDim4, self).setUp()
        self.inputs = {
            'X': np.random.uniform(0.1, 1, [2, 4, 3, 5]).astype("float32")
        }
        self.outputs = {'Out': np.sqrt(self.inputs['X'])}
        self.attrs = {"use_mkldnn": True}


class TestMKLDNNAbsDim4(TestAbs):
    def setUp(self):
        super(TestMKLDNNAbsDim4, self).setUp()
        x = np.random.uniform(-1, 1, [2, 4, 3, 5]).astype("float32")
        # Same reason as in TestAbs: keep inputs away from zero, where
        # the op is non-differentiable and the gradient check is unreliable.
        x[np.abs(x) < 0.005] = 0.02
        self.inputs = {'X': x}
        self.outputs = {'Out': np.abs(self.inputs['X'])}
        self.attrs = {"use_mkldnn": True}


if __name__ == '__main__':
    unittest.main()
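The new test module above adds no new checks of its own: each class inherits the reference inputs, outputs, and output/gradient checks from the plain CPU activation tests and only flips the `use_mkldnn` attribute, so kernel selection picks the MKLDNN implementation while the expected results stay identical. Extending coverage to another activation is a one-class change; a hypothetical sketch, which assumes `test_activation_op` also defines a `TestSigmoid` base (not shown in this diff):

```python
# Hypothetical extension following the pattern above; TestSigmoid is
# assumed to exist in test_activation_op and is not part of this diff.
from test_activation_op import TestSigmoid


class TestMKLDNNSigmoidDim2(TestSigmoid):
    def setUp(self):
        # Reuse the reference data and checks from the CPU test,
        # then switch only the kernel-selection attribute.
        super(TestMKLDNNSigmoidDim2, self).setUp()
        self.attrs = {"use_mkldnn": True}
```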
@@ -1098,82 +1098,5 @@ class TestFP16Swish(TestSwish):
         self.check_output_with_place(place, atol=1e-3)
-#--------------------test MKLDNN--------------------
-class TestMKLDNNReluDim2(TestRelu):
-    def setUp(self):
-        super(TestMKLDNNReluDim2, self).setUp()
-        self.attrs = {"use_mkldnn": True}
-class TestMKLDNNTanhDim2(TestTanh):
-    def setUp(self):
-        super(TestMKLDNNTanhDim2, self).setUp()
-        self.attrs = {"use_mkldnn": True}
-class TestMKLDNNSqrtDim2(TestSqrt):
-    def setUp(self):
-        super(TestMKLDNNSqrtDim2, self).setUp()
-        self.attrs = {"use_mkldnn": True}
-class TestMKLDNNAbsDim2(TestAbs):
-    def setUp(self):
-        super(TestMKLDNNAbsDim2, self).setUp()
-        self.attrs = {"use_mkldnn": True}
-class TestMKLDNNReluDim4(TestRelu):
-    def setUp(self):
-        super(TestMKLDNNReluDim4, self).setUp()
-        x = np.random.uniform(-1, 1, [2, 4, 3, 5]).astype("float32")
-        # The same reason with TestAbs
-        x[np.abs(x) < 0.005] = 0.02
-        out = np.maximum(x, 0)
-        self.inputs = {'X': OpTest.np_dtype_to_fluid_dtype(x)}
-        self.outputs = {'Out': out}
-        self.attrs = {"use_mkldnn": True}
-class TestMKLDNNTanhDim4(TestTanh):
-    def setUp(self):
-        super(TestMKLDNNTanhDim4, self).setUp()
-        self.inputs = {
-            'X': np.random.uniform(0.1, 1, [2, 4, 3, 5]).astype("float32")
-        }
-        self.outputs = {'Out': np.tanh(self.inputs['X'])}
-        self.attrs = {"use_mkldnn": True}
-class TestMKLDNNSqrtDim4(TestSqrt):
-    def setUp(self):
-        super(TestMKLDNNSqrtDim4, self).setUp()
-        self.inputs = {
-            'X': np.random.uniform(0.1, 1, [2, 4, 3, 5]).astype("float32")
-        }
-        self.outputs = {'Out': np.sqrt(self.inputs['X'])}
-        self.attrs = {"use_mkldnn": True}
-class TestMKLDNNAbsDim4(TestAbs):
-    def setUp(self):
-        super(TestMKLDNNAbsDim4, self).setUp()
-        x = np.random.uniform(-1, 1, [2, 4, 3, 5]).astype("float32")
-        # The same reason with TestAbs
-        x[np.abs(x) < 0.005] = 0.02
-        self.inputs = {'X': x}
-        self.outputs = {'Out': np.abs(self.inputs['X'])}
-        self.attrs = {"use_mkldnn": True}
 if __name__ == "__main__":
     unittest.main()
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
from test_conv2d_op import TestConv2dOp, TestWithPad, TestWithStride

class TestMKLDNN(TestConv2dOp):
    def init_kernel_type(self):
        self.use_mkldnn = True


class TestMKLDNNWithPad(TestWithPad):
    def init_kernel_type(self):
        self.use_mkldnn = True


class TestMKLDNNWithStride(TestWithStride):
    def init_kernel_type(self):
        self.use_mkldnn = True


if __name__ == '__main__':
    unittest.main()
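Unlike the activation tests, the conv2d tests use a dedicated hook: instead of overriding `setUp`, subclasses override `init_kernel_type()`, which the shared base class presumably calls while building the op attributes. A minimal self-contained sketch of that pattern; this is not Paddle's actual `OpTest`, only the shape of the hook:

```python
import unittest


class ConvTestBase(unittest.TestCase):
    """Stand-in for a shared op-test base class."""

    def setUp(self):
        self.use_mkldnn = False   # default kernel
        self.init_kernel_type()   # subclasses may flip the flag
        self.attrs = {"use_mkldnn": self.use_mkldnn}

    def init_kernel_type(self):
        pass  # overridden by kernel-specific subclasses

    def test_attrs(self):
        self.assertIn("use_mkldnn", self.attrs)


class ConvTestMKLDNN(ConvTestBase):
    def init_kernel_type(self):
        self.use_mkldnn = True


if __name__ == "__main__":
    unittest.main()
```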
@@ -373,22 +373,5 @@ class TestDepthwiseConv2(TestConv2dOp):
 # def init_op_type(self):
 #     self.op_type = "conv_cudnn"
-#----------------Conv2dMKLDNN----------------
-class TestMKLDNN(TestConv2dOp):
-    def init_kernel_type(self):
-        self.use_mkldnn = True
-class TestMKLDNNWithPad(TestWithPad):
-    def init_kernel_type(self):
-        self.use_mkldnn = True
-class TestMKLDNNWithStride(TestWithStride):
-    def init_kernel_type(self):
-        self.use_mkldnn = True
 if __name__ == '__main__':
     unittest.main()
@@ -15,31 +15,42 @@
 import unittest
 import paddle.fluid as fluid
+import paddle.fluid.core as core
 import paddle.fluid.layers as layers
 import numpy
 from multiprocessing import Process
+from threading import Thread
 import os, sys
 import time
-class TestRecvOp(unittest.TestCase):
-    def no_test_send(self):
+class TestSendOp(unittest.TestCase):
+    def test_send(self):
         # Run init_serv in a thread
         place = fluid.CPUPlace()
+        # NOTE: python thread will not work here due to GIL.
         p = Process(target=self.init_serv, args=(place, ))
         p.daemon = True
         p.start()
-        time.sleep(1)
-        self.init_client(place)
+        time.sleep(10)
+        with open("/tmp/paddle.selected_port", "r") as fn:
+            selected_port = int(fn.readlines()[0])
+        self.init_client(place, selected_port)
+        self.run_local(place)
+        self.assertTrue(numpy.allclose(self.local_out, self.dist_out))
         # FIXME(typhoonzero): find a way to gracefully shutdown the server.
         os.system("kill -9 %d" % p.pid)
         p.join()
     def init_serv(self, place):
         main = fluid.Program()
         with fluid.program_guard(main):
             serv = layers.ListenAndServ(
-                "127.0.0.1:6174", ["X"], optimizer_mode=False)
+                "127.0.0.1:0", ["X"], optimizer_mode=False)
             with serv.do():
                 x = layers.data(
                     shape=[32, 32],
@@ -50,10 +61,29 @@ class TestRecvOp(unittest.TestCase):
                 o = layers.scale(x=x, scale=10.0)
             main.global_block().create_var(
                 name=o.name, psersistable=False, dtype=o.dtype, shape=o.shape)
+        self.server_exe = fluid.Executor(place)
+        self.server_exe.run(main)
+    def init_client(self, place, port):
+        main = fluid.Program()
+        with fluid.program_guard(main):
+            x = layers.data(
+                shape=[32, 32],
+                dtype='float32',
+                name='X',
+                append_batch_size=False)
+            fluid.initializer.Constant(value=2.3)(x, main.global_block())
+            get_var = main.global_block().create_var(
+                name="scale_0.tmp_0",  # server side var
+                dtype="float32",
+                persistable=False,
+                shape=[32, 32])
+            o = layers.Send("127.0.0.1:%d" % port, [x], [get_var])
         exe = fluid.Executor(place)
-        exe.run(main)
-    def init_client(self, place):
+        self.dist_out = exe.run(main, fetch_list=o)  # o is a list
+    def run_local(self, place):
         main = fluid.Program()
         with fluid.program_guard(main):
             x = layers.data(
@@ -61,10 +91,10 @@ class TestRecvOp(unittest.TestCase):
                 dtype='float32',
                 name='X',
                 append_batch_size=False)
-            fluid.initializer.Constant(value=1.0)(x, main.global_block())
-            layers.Send("127.0.0.1:6174", [x], [x])
+            fluid.initializer.Constant(value=2.3)(x, main.global_block())
+            o = layers.scale(x=x, scale=10.0)
         exe = fluid.Executor(place)
-        exe.run(main)
+        self.local_out = exe.run(main, fetch_list=[o])
 if __name__ == "__main__":
...
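Besides actually asserting that the distributed result matches a local run, the reworked test changes how the client finds the server: the server now listens on `127.0.0.1:0`, letting the OS pick a free port, and the chosen port is published through `/tmp/paddle.selected_port` for the client to read. This removes the hard-coded port `6174` and the collision flakiness that came with it. A standalone sketch of that handshake with plain sockets; only the port-file name matches the test, everything else is illustrative:

```python
import socket
import time
from multiprocessing import Process

PORT_FILE = "/tmp/paddle.selected_port"  # same file the test uses


def serve():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port
    with open(PORT_FILE, "w") as f:
        f.write(str(s.getsockname()[1]))  # publish the chosen port
    s.listen(1)
    conn, _ = s.accept()
    conn.close()


if __name__ == "__main__":
    p = Process(target=serve)
    p.daemon = True
    p.start()
    time.sleep(1)  # crude, like the test; a retry loop would be more robust
    with open(PORT_FILE) as f:
        port = int(f.read())
    socket.create_connection(("127.0.0.1", port)).close()
    p.join()
```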
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
from test_lrn_op import TestLRNOp

class TestLRNMKLDNNOp(TestLRNOp):
    def get_attrs(self):
        attrs = TestLRNOp.get_attrs(self)
        attrs['use_mkldnn'] = True
        return attrs

    def test_check_output(self):
        self.check_output(atol=0.002)


class TestLRNMKLDNNOpWithIsTest(TestLRNMKLDNNOp):
    def get_attrs(self):
        attrs = TestLRNMKLDNNOp.get_attrs(self)
        attrs['is_test'] = True
        return attrs

    def test_check_grad_normal(self):
        def check_raise_is_test():
            try:
                self.check_grad(['X'], 'Out', max_relative_error=0.01)
            except Exception as e:
                t = "is_test attribute should be set to False in training phase."
                if t in str(e):
                    raise AttributeError
        self.assertRaises(AttributeError, check_raise_is_test)


if __name__ == "__main__":
    unittest.main()
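The `test_check_grad_normal` override above asserts that running a gradient check with `is_test=True` fails with the expected message: the inner helper converts a matching exception into an `AttributeError`, which `assertRaises` then expects. The same intent can be expressed more directly with unittest's message-matching helper; a sketch in which `fake_check_grad` is a stand-in that just raises, not Paddle's `OpTest.check_grad`:

```python
import unittest


class IsTestGradCheckSketch(unittest.TestCase):
    def fake_check_grad(self):
        # Stand-in for OpTest.check_grad failing under is_test=True.
        raise RuntimeError(
            "is_test attribute should be set to False in training phase.")

    def test_grad_is_rejected(self):
        # assertRaisesRegexp in Python 2; spelled assertRaisesRegex in Python 3.
        with self.assertRaisesRegexp(RuntimeError, "is_test attribute"):
            self.fake_check_grad()


if __name__ == "__main__":
    unittest.main()
```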
...@@ -87,34 +87,5 @@ class TestLRNOp(OpTest): ...@@ -87,34 +87,5 @@ class TestLRNOp(OpTest):
self.check_grad(['X'], 'Out', max_relative_error=0.01) self.check_grad(['X'], 'Out', max_relative_error=0.01)
class TestLRNMKLDNNOp(TestLRNOp):
def get_attrs(self):
attrs = TestLRNOp.get_attrs(self)
attrs['use_mkldnn'] = True
return attrs
def test_check_output(self):
self.check_output(atol=0.002)
class TestLRNMKLDNNOpWithIsTest(TestLRNMKLDNNOp):
def get_attrs(self):
attrs = TestLRNMKLDNNOp.get_attrs(self)
attrs['is_test'] = True
return attrs
def test_check_grad_normal(self):
def check_raise_is_test():
try:
self.check_grad(['X'], 'Out', max_relative_error=0.01)
except Exception as e:
t = \
"is_test attribute should be set to False in training phase."
if t in str(e):
raise AttributeError
self.assertRaises(AttributeError, check_raise_is_test)
if __name__ == "__main__": if __name__ == "__main__":
unittest.main() unittest.main()
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
from test_pool2d_op import TestPool2d_Op, TestCase1, TestCase2, TestCase3, TestCase4, TestCase5

class TestMKLDNNCase1(TestPool2d_Op):
    def init_kernel_type(self):
        self.use_mkldnn = True


class TestMKLDNNCase2(TestCase1):
    def init_kernel_type(self):
        self.use_mkldnn = True


class TestMKLDNNCase3(TestCase2):
    def init_kernel_type(self):
        self.use_mkldnn = True


class TestMKLDNNCase4(TestCase3):
    def init_kernel_type(self):
        self.use_mkldnn = True


class TestMKLDNNCase5(TestCase4):
    def init_kernel_type(self):
        self.use_mkldnn = True


class TestMKLDNNCase6(TestCase5):
    def init_kernel_type(self):
        self.use_mkldnn = True


if __name__ == '__main__':
    unittest.main()
@@ -317,36 +317,5 @@ class TestCeilModeCase4(TestCase2):
         self.ceil_mode = True
-#--------------------test pool2d MKLDNN--------------------
-class TestMKLDNNCase1(TestPool2d_Op):
-    def init_kernel_type(self):
-        self.use_mkldnn = True
-class TestMKLDNNCase2(TestCase1):
-    def init_kernel_type(self):
-        self.use_mkldnn = True
-class TestMKLDNNCase3(TestCase2):
-    def init_kernel_type(self):
-        self.use_mkldnn = True
-class TestMKLDNNCase4(TestCase3):
-    def init_kernel_type(self):
-        self.use_mkldnn = True
-class TestMKLDNNCase5(TestCase4):
-    def init_kernel_type(self):
-        self.use_mkldnn = True
-class TestMKLDNNCase6(TestCase5):
-    def init_kernel_type(self):
-        self.use_mkldnn = True
 if __name__ == '__main__':
     unittest.main()
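Since every MKLDNN-only module now follows the `test_*_mkldnn_op.py` naming convention, the whole group can also be run locally in one go with standard unittest discovery; a small illustrative runner, assuming the test files sit in the current directory:

```python
import unittest

if __name__ == "__main__":
    # Discover only the MKLDNN op tests in the current directory;
    # the pattern mirrors the CMake filter at the top of this diff.
    suite = unittest.defaultTestLoader.discover(
        ".", pattern="test_*_mkldnn_op.py")
    unittest.TextTestRunner(verbosity=2).run(suite)
```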