Commit d73aa874 authored by Yang Yang

merge develop

......@@ -53,7 +53,8 @@ if(NOT WITH_GPU)
list(APPEND CMAKE_CXX_SOURCE_FILE_EXTENSIONS cu)
else()
add_definitions(-DPADDLE_WITH_GPU)
add_definitions(-DPADDLE_WITH_CUDA)
FIND_PACKAGE(CUDA REQUIRED)
if(${CUDA_VERSION_MAJOR} VERSION_LESS 7)
......
# Design Doc: Session
## Abstract
The *session* object encapsulates the environment in which the
computation graph is executed.
We will have the *local* session and the *remote* session, which offer the
same [interface](#interface). The local session encapsulates the local
runtime environment, and the remote session encapsulates the cluster
runtime environment.
The local runtime environment contains:
1. handles to the computation devices (e.g., CPU, GPU), and
1. the [scope](../scope.md), which holds all variables.
The remote runtime environment contains:
1. the computation devices (e.g., CPU and GPU on nodes 0 and 1) in a
cluster, and
1. the distributed [scope](../scope.md) in a cluster, which holds all
variables.
The user can create a remote session on Paddle Cloud and evaluate the
computation graph with it. In this way, the user can control the
remote computation resources in a cluster from a local computer.
## Background
The current design has an implicit global session in which
`paddle.eval()` is executed. The pain point is:
Since the user is not able to explicitly switch between runtime
environments, the same topology cannot be run in two independent
environments.
For example, in reinforcement learning, the user may want to have a
stale model for inference and a fresh model for training, and only
replace the stale model with the fresh model periodically.
Furthermore, we have no concept that encapsulates a remote environment
that executes a computation graph.
We need the session object to address the above issues.
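For illustration, here is a minimal sketch of the reinforcement-learning case above, assuming the session interface proposed below; `train_op`, `policy`, and `sync_params` are illustrative names, not part of this proposal:
```python
sess_train = paddle.session()  # fresh model, updated every step
sess_infer = paddle.session()  # stale model, refreshed periodically

for step in range(10000):
    sess_train.eval(train_op)         # train in one environment
    action = sess_infer.eval(policy)  # infer with the stale model
    if step % 100 == 0:
        # periodically replace the stale model with the fresh one
        sync_params(sess_train, sess_infer)  # hypothetical helper
```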
## Session
A session is an object that owns the runtime environment. All
computations are executed through `session.eval()`.
### Interface
```python
eval(
targets,
feed_dict=None,
)
```
Evaluates the target Operations or Variables in `targets`.
- *targets*: the evaluation targets. Can be a single Operation or
Variable, or a list of Operations or Variables. The value returned by
`eval()` has the same shape as the `targets` argument.
The PaddlePaddle program is represented by
the [ProgramDesc](../design/program.md); `eval()` will infer the
ProgramDesc from the given targets and run the PaddlePaddle
program. Please see
[this graph](./distributed_architecture.md#local-training-architecture)
for a detailed illustration of the local session and
[this graph](./distributed_architecture.md#distributed-training-architecture)
for a detailed illustration of the remote session.
- *feed_dict*: a dictionary that contains the tensors which override
the edges of the computation graph.
`feed_dict` can not only provide the input data, it can also override any
OP's input:
```python
a = pd.constant(2.0, name="a")
b = pd.variable(name="b")
c = pd.mul(a, b)
sess.eval(targets=c, feed_dict={"b": 3.0})  # returns 6.0
```
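Likewise, `targets` may be a list, in which case `eval()` returns a list of the same shape. A sketch under the same assumptions (`pd.add` is illustrative):
```python
a = pd.constant(2.0, name="a")
b = pd.variable(name="b")
c = pd.mul(a, b)
d = pd.add(a, b)  # pd.add assumed for illustration
# a list of targets returns a list of results in the same order
sess.eval(targets=[c, d], feed_dict={"b": 3.0})  # returns [6.0, 5.0]
```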
```python
close()
```
Closes the session and releases the scope that the session owns.
### Create a Local Session
```python
session(
devices=None
)
```
Creates a new session. One session owns one global scope, so creating
multiple sessions will create different scopes.
- *devices*: a single `string` or a list of `string` of device names,
the corresponding devices will be the computation devices for
`eval()`. If not specified, all available devices (e.g., all GPUs)
will be used. The user doesn't need to specify the CPU device since
it will always be used. Multiple sessions can use the same device.
#### Example
```Python
a = paddle.constant(1.0)
b = paddle.constant(2.0)
c = a + b
sess = paddle.session(devices=["gpu:0", "gpu:1", "fpga:0"])
sess.eval(c)
sess.close()
```
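Since multiple sessions can use the same device while each owns its own scope, the same graph can be evaluated independently side by side. A minimal sketch:
```python
sess1 = paddle.session(devices=["gpu:0"])
sess2 = paddle.session(devices=["gpu:0"])  # same device, independent scope
sess1.eval(c)
sess2.eval(c)  # evaluated in a separate scope
sess1.close()
sess2.close()
```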
### Create a Remote Session
```python
create_cloud_job(
name,
num_trainer,
mem_per_trainer,
gpu_per_trainer,
cpu_per_trainer,
num_ps,
mem_per_ps,
cpu_per_ps,
)
```
Creates a Paddle Cloud job. Fails if the job name exists.
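For readability, the positional call in the example at the end of this section corresponds to the following keyword form (a sketch mirroring that call):
```python
job = paddle.create_cloud_job(
    name="test",
    num_trainer=3,
    mem_per_trainer="1G",
    gpu_per_trainer=1,
    cpu_per_trainer=1,
    num_ps=2,
    mem_per_ps="1G",
    cpu_per_ps=1,
)
```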
```python
get_cloud_job(
name
)
```
Gets a Paddle Cloud job.
```python
remote_session(
job
)
```
Creates a remote session backed by the given Paddle Cloud job.
- *job*: the Paddle Cloud job.
#### Example
```Python
reader = paddle.reader.recordio("/pfs/home/peter/mnist-train-*") # data stored on Paddle Cloud
image = reader.column(0)
label = reader.column(1)
fc1 = paddle.op.fc(image, size=256, act="sigmoid")
fc2 = paddle.op.fc(fc1, size=10, act="softmax")
cost = paddle.op.cross_entropy(fc2, label)
opt = paddle.optimizer.sgd(cost)
job = paddle.create_cloud_job("test", 3, "1G", 1, 1, 2, "1G", 1)
sess = paddle.remote_session(job)
for i in range(1000):
sess.eval(opt)
sess.close()
```
......@@ -47,7 +47,7 @@ bool isUsingGpu() { return FLAGS_use_gpu; }
void setUseGpu(bool useGpu) { FLAGS_use_gpu = useGpu; }
bool isGpuVersion() {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
return false;
#else
return true;
......
......@@ -46,7 +46,7 @@ paddle_error paddle_matrix_set_row(paddle_matrix mat,
if (rowID >= ptr->mat->getHeight()) return kPD_OUT_OF_RANGE;
paddle::real* buf = ptr->mat->getRowBuf(rowID);
size_t width = ptr->mat->getWidth();
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
hl_memcpy(buf, rowArray, sizeof(paddle::real) * width);
#else
std::copy(rowArray, rowArray + width, buf);
......
......@@ -34,7 +34,7 @@ Executor::Executor(const std::vector<platform::Place>& places) {
device_contexts_[i] = new platform::CPUDeviceContext(
boost::get<platform::CPUPlace>(places[i]));
} else if (platform::is_gpu_place(places[i])) {
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
device_contexts_[i] = new platform::CUDADeviceContext(
boost::get<platform::GPUPlace>(places[i]));
#else
......
......@@ -13,8 +13,6 @@ See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/framework/executor.h"
#include <memory> // for unique_ptr
#include <mutex> // for call_once
#include <vector>
#include "gtest/gtest.h"
#include "paddle/framework/attribute.h"
......@@ -25,6 +23,7 @@ limitations under the License. */
USE_OP(elementwise_add);
USE_OP(gaussian_random);
USE_OP(feed);
USE_OP(fetch);
using std::string;
using namespace paddle::platform;
......@@ -33,9 +32,8 @@ using namespace paddle::framework;
typedef paddle::framework::BlockDesc proto_block;
typedef paddle::framework::OpDesc proto_op;
void add_gaussian_random_op(string var_name, proto_block* block) {
std::vector<int> dim{2, 3};
void add_gaussian_random_op(string var_name, std::vector<int>& dim,
proto_block* block) {
// insert variable
auto a = block->add_vars();
a->set_name(var_name);
......@@ -59,9 +57,8 @@ void add_gaussian_random_op(string var_name, proto_block* block) {
Out->add_arguments(var_name);
}
void add_feed_op(string var_name, int index, proto_block* block) {
std::vector<int> dim{3};
void add_feed_op(string var_name, std::vector<int>& dim, int index,
proto_block* block) {
// insert variable
auto a = block->add_vars();
a->set_name(var_name);
......@@ -94,29 +91,75 @@ void add_feed_op(string var_name, int index, proto_block* block) {
Out->add_arguments(var_name);
}
void add_fetch_op(string var_name, std::vector<int>& dim, int index,
proto_block* block) {
// insert variable
auto a = block->add_vars();
a->set_name(var_name);
auto a_lt = a->mutable_lod_tensor();
a_lt->set_data_type(paddle::framework::DataType::FP32);
for (int i : dim) {
a_lt->add_dims(i);
}
// insert operation
auto op = block->add_ops();
op->set_type("fetch");
// set dims attr
auto dims = op->add_attrs();
dims->set_name("dims");
dims->set_type(paddle::framework::AttrType::INTS);
for (int i : dim) {
dims->add_ints(i);
}
// set col attr
auto col = op->add_attrs();
col->set_name("col");
col->set_type(paddle::framework::AttrType::INT);
col->set_i(index);
auto Out = op->add_inputs();
Out->set_parameter("Input");
Out->add_arguments(var_name);
}
std::once_flag set_variable_flag;
template <typename T>
void set_feed_variable(const std::vector<std::vector<T>>& inputs) {
typedef std::vector<paddle::framework::Tensor> FeedInputs;
// Tensors in feed value variable will only be in CPUPlace
Variable* g_feed_value = GetScope()->FindVar("feed_value");
FeedInputs& feed_inputs = *(g_feed_value->GetMutable<FeedInputs>());
auto size = inputs.size();
feed_inputs.resize(size);
for (size_t i = 0; i < size; i++) {
T* dst = feed_inputs[i].mutable_data<T>(
make_ddim({static_cast<int64_t>(inputs[i].size())}), CPUPlace());
memcpy(dst, inputs[i].data(), inputs[i].size() * sizeof(T));
}
}
std::call_once(set_variable_flag, [&]() {
feed_inputs.reserve(size);
for (size_t i = 0; i < size; i++) {
paddle::framework::Tensor tmp;
tmp.mutable_data<T>(make_ddim({static_cast<int64_t>(inputs[i].size())}),
CPUPlace());
feed_inputs.push_back(tmp);
}
});
template <typename T>
std::vector<std::vector<T>> get_fetch_variable() {
typedef std::vector<paddle::framework::Tensor> FetchOutputs;
// Tensors in fetch value variable will only be in CPUPlace
Variable* g_fetch_value = GetScope()->FindVar("fetch_value");
FetchOutputs& fetch_outputs = *(g_fetch_value->GetMutable<FetchOutputs>());
auto size = fetch_outputs.size();
std::vector<std::vector<T>> result;
result.reserve(size);
for (size_t i = 0; i < size; i++) {
memcpy(feed_inputs[i].data<T>(), inputs[i].data(),
inputs[i].size() * sizeof(T));
std::vector<T> tmp;
tmp.resize(fetch_outputs[i].numel());
memcpy(tmp.data(), fetch_outputs[i].data<T>(),
fetch_outputs[i].numel() * sizeof(T));
result.push_back(tmp);
}
return result;
}
class ExecutorTesterRandom : public ::testing::Test {
......@@ -126,8 +169,9 @@ class ExecutorTesterRandom : public ::testing::Test {
root_block->set_idx(0);
root_block->set_parent_idx(-1);
add_gaussian_random_op("a", root_block);
add_gaussian_random_op("b", root_block);
std::vector<int> dim{2, 3};
add_gaussian_random_op("a", dim, root_block);
add_gaussian_random_op("b", dim, root_block);
auto c = root_block->add_vars();
c->set_name("c");
......@@ -146,12 +190,11 @@ class ExecutorTesterRandom : public ::testing::Test {
Out->set_parameter("Out");
Out->add_arguments("c");
scope_ = GetScope();
add_fetch_op("c", dim, 0, root_block);
}
protected:
ProgramDesc pdesc_;
Scope* scope_;
};
class ExecutorTesterFeed : public ::testing::Test {
......@@ -161,8 +204,10 @@ class ExecutorTesterFeed : public ::testing::Test {
root_block->set_idx(0);
root_block->set_parent_idx(-1);
add_feed_op("a", 0, root_block);
add_feed_op("b", 1, root_block);
std::vector<int> dim{6};
add_feed_op("a", dim, 0, root_block);
add_feed_op("b", dim, 1, root_block);
auto c = root_block->add_vars();
c->set_name("c");
......@@ -181,8 +226,10 @@ class ExecutorTesterFeed : public ::testing::Test {
Out->set_parameter("Out");
Out->add_arguments("c");
std::vector<float> vec1 = {1.0, 2.0, 3.0};
std::vector<float> vec2 = {4.0, 5.0, 6.0};
add_fetch_op("c", dim, 0, root_block);
std::vector<float> vec1 = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
std::vector<float> vec2 = {4.0, 5.0, 6.0, 7.0, 8.0, 9.0};
inputs_.push_back(vec1);
inputs_.push_back(vec2);
}
......@@ -192,14 +239,27 @@ class ExecutorTesterFeed : public ::testing::Test {
std::vector<std::vector<float>> inputs_;
};
#ifndef PADDLE_WITH_CUDA
TEST_F(ExecutorTesterRandom, CPU) {
std::vector<Place> places;
CPUPlace cpu_place1, cpu_place2;
places.push_back(cpu_place1);
places.push_back(cpu_place2);
CPUPlace cpu_place;
places.push_back(cpu_place);
// We have a global Scope and BuddyAllocator, and we must ensure that the
// global BuddyAllocator is initialized before the global Scope. Thus, the
// global Scope will be destructed before the BuddyAllocator. Otherwise, a
// "pointer being freed was not allocated" error will appear.
paddle::memory::Used(cpu_place);
Executor* executor = new Executor(places);
executor->Run(pdesc_, scope_);
executor->Run(pdesc_, GetScope());
std::vector<std::vector<float>> result = get_fetch_variable<float>();
for (auto& vec : result) {
for (auto& num : vec) {
std::cout << num << " ";
}
std::cout << std::endl;
}
delete executor;
}
......@@ -208,26 +268,48 @@ TEST_F(ExecutorTesterFeed, CPU) {
CPUPlace cpu_place;
places.push_back(cpu_place);
// We have a global Scope and BuddyAllocator, and we must ensure that the
// global BuddyAllocator is initialized before the global Scope. Thus, the
// global Scope will be destructed before the BuddyAllocator. Otherwise, a
// "pointer being freed was not allocated" error will appear.
paddle::memory::Used(cpu_place);
Executor* executor = new Executor(places);
// 3 mini-batches
for (int i = 0; i < 3; i++) {
// need to set feed variable before Executor::Run
std::cout << "start mini-batch " << i << std::endl;
set_feed_variable<float>(inputs_);
executor->Run(pdesc_, GetScope());
std::vector<std::vector<float>> result = get_fetch_variable<float>();
for (auto& vec : result) {
for (auto& num : vec) {
std::cout << num << " ";
}
std::cout << std::endl;
}
}
delete executor;
}
#ifdef PADDLE_WITH_GPU
#else
TEST_F(ExecutorTesterRandom, GPU) {
std::vector<Place> places;
GPUPlace gpu_place(0);
places.push_back(gpu_place);
// We have a global Scope and BuddyAllocator, and we must ensure that the
// global BuddyAllocator is initialized before the global Scope. Thus, the
// global Scope will be destructed before the BuddyAllocator. Otherwise, a
// "pointer being freed was not allocated" error will appear.
// If paddle is compiled with GPU support, both the CPU and GPU
// BuddyAllocator need to be initialized first.
paddle::memory::Used(CPUPlace());
paddle::memory::Used(gpu_place);
Executor* executor = new Executor(places);
executor->Run(pdesc_, scope_);
executor->Run(pdesc_, GetScope());
delete executor;
}
......@@ -235,13 +317,31 @@ TEST_F(ExecutorTesterFeed, GPU) {
std::vector<Place> places;
GPUPlace gpu_place(0);
places.push_back(gpu_place);
// We have a global Scope and BuddyAllocator, and we must ensure that the
// global BuddyAllocator is initialized before the global Scope. Thus, the
// global Scope will be destructed before the BuddyAllocator. Otherwise, a
// "pointer being freed was not allocated" error will appear.
// If paddle is compiled with GPU support, both the CPU and GPU
// BuddyAllocator need to be initialized first.
paddle::memory::Used(CPUPlace());
paddle::memory::Used(gpu_place);
Executor* executor = new Executor(places);
// need to set feed variable before Executor::Run
set_feed_variable<float>(inputs_);
executor->Run(pdesc_, scope_);
// 3 mini-batches
for (int i = 0; i < 3; i++) {
// need to set feed variable before Executor::Run
std::cout << "start mini-batch " << i << std::endl;
set_feed_variable<float>(inputs_);
executor->Run(pdesc_, GetScope());
std::vector<std::vector<float>> result = get_fetch_variable<float>();
for (auto& vec : result) {
for (auto& num : vec) {
std::cout << num << " ";
}
std::cout << std::endl;
}
}
delete executor;
}
#endif
......@@ -183,4 +183,4 @@ TEST(GradOpDescBuilder, IOIgnoredInGradient) {
{f::GradVarName("in3_1"), f::GradVarName("in3_2")}));
delete forw_op;
delete grad_op;
}
\ No newline at end of file
}
......@@ -15,7 +15,7 @@
#pragma once
#include <memory>
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/system/cuda/experimental/pinned_allocator.h>
......@@ -29,7 +29,7 @@
namespace paddle {
namespace framework {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
template <typename T>
using Vector = std::vector<T>;
#else
......
......@@ -48,4 +48,4 @@ TEST(ProtoMaker, DuplicatedInOut) {
paddle::framework::OpAttrChecker op_checker;
auto proto_maker = TestInOutProtoMaker(&op_proto, &op_checker);
ASSERT_THROW(proto_maker.Validate(), paddle::platform::EnforceNotMet);
}
\ No newline at end of file
}
......@@ -211,7 +211,7 @@ class OpKernelRegistrar : public Registrar {
// TODO(fengjiayi): The following macros
// seem ugly; do we have a better method?
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
#define USE_OP_KERNEL(op_type) USE_OP_DEVICE_KERNEL(op_type, CPU)
#else
#define USE_OP_KERNEL(op_type) \
......
......@@ -183,4 +183,4 @@ class CosineOpComplete : public paddle::framework::CosineOp {
TEST(OperatorRegistrar, Test) {
using namespace paddle::framework;
OperatorRegistrar<CosineOpComplete, CosineOpProtoAndCheckerMaker> reg("cos");
}
\ No newline at end of file
}
......@@ -25,7 +25,7 @@ Eigen::DefaultDevice& ExecutionContext::GetEigenDevice<
return *device_context_.GetEigenDevice<platform::CPUPlace>();
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
template <>
Eigen::GpuDevice&
ExecutionContext::GetEigenDevice<platform::GPUPlace, Eigen::GpuDevice>() const {
......
......@@ -66,15 +66,13 @@ void Scope::DropKids() {
std::once_flag feed_variable_flag;
template <typename T, typename... Args>
std::unique_ptr<T> make_unique(Args&&... args) {
return std::unique_ptr<T>(new T(std::forward<Args>(args)...));
}
framework::Scope* GetScope() {
static std::unique_ptr<framework::Scope> g_scope =
make_unique<framework::Scope>();
std::call_once(feed_variable_flag, [&]() { g_scope->NewVar("feed_value"); });
static std::unique_ptr<framework::Scope> g_scope{nullptr};
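// Lazily initialize the global scope together with the global feed/fetch
// variables exactly once; every executor shares this scope.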
std::call_once(feed_variable_flag, [&]() {
g_scope.reset(new framework::Scope());
g_scope->NewVar("feed_value");
g_scope->NewVar("fetch_value");
});
return g_scope.get();
}
......
......@@ -65,7 +65,7 @@ inline T* Tensor::mutable_data(platform::Place place) {
holder_.reset(new PlaceholderImpl<T, platform::CPUPlace>(
boost::get<platform::CPUPlace>(place), size));
} else if (platform::is_gpu_place(place)) {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
PADDLE_THROW("'GPUPlace' is not supported in CPU only device.");
}
#else
......@@ -103,7 +103,7 @@ inline void Tensor::CopyFrom(const Tensor& src,
memory::Copy(boost::get<platform::CPUPlace>(dst_place), dst_ptr,
boost::get<platform::CPUPlace>(src_place), src_ptr, size);
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
else if (platform::is_gpu_place(src_place) &&
platform::is_cpu_place(dst_place)) {
memory::Copy(boost::get<platform::CPUPlace>(dst_place), dst_ptr,
......
......@@ -74,7 +74,7 @@ TEST(Tensor, MutableData) {
EXPECT_EQ(p1, p2);
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
{
Tensor src_tensor;
float* p1 = nullptr;
......@@ -126,7 +126,7 @@ TEST(Tensor, ShareDataWith) {
ASSERT_EQ(src_tensor.data<int>(), dst_tensor.data<int>());
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
{
Tensor src_tensor;
Tensor dst_tensor;
......@@ -163,7 +163,7 @@ TEST(Tensor, Slice) {
EXPECT_EQ(src_data_address + 3 * 4 * 1 * sizeof(int), slice_data_address);
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
{
Tensor src_tensor;
src_tensor.mutable_data<double>(make_ddim({6, 9}), GPUPlace());
......@@ -218,7 +218,7 @@ TEST(Tensor, CopyFrom) {
EXPECT_EQ(dst_ptr[i], slice_ptr[i]);
}
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
{
Tensor src_tensor;
Tensor gpu_tensor;
......
......@@ -194,7 +194,7 @@ public:
REGISTER_TYPED_FUNC(BlockExpand, CPU, BlockExpandForward);
REGISTER_TYPED_FUNC(BlockExpandGrad, CPU, BlockExpandBackward);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
REGISTER_TYPED_FUNC(BlockExpand, GPU, BlockExpandForward);
REGISTER_TYPED_FUNC(BlockExpandGrad, GPU, BlockExpandBackward);
#endif
......
......@@ -395,7 +395,7 @@ REGISTER_TYPED_FUNC(ContextProjectionForward,
REGISTER_TYPED_FUNC(ContextProjectionBackward,
CPU,
ContextProjectionBackwardFunc);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
REGISTER_TYPED_FUNC(ContextProjectionForward,
GPU,
ContextProjectionForwardFunc);
......
......@@ -233,7 +233,7 @@ private:
REGISTER_TYPED_FUNC(CosSimForward, CPU, CosSimForwardFunc);
REGISTER_TYPED_FUNC(CosSimBackward, CPU, CosSimBackwardFunc);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
REGISTER_TYPED_FUNC(CosSimForward, GPU, CosSimForwardFunc);
REGISTER_TYPED_FUNC(CosSimBackward, GPU, CosSimBackwardFunc);
#endif
......
......@@ -169,7 +169,7 @@ private:
REGISTER_TYPED_FUNC(Crop, CPU, CropFunc);
REGISTER_TYPED_FUNC(CropGrad, CPU, CropGradFunc);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
REGISTER_TYPED_FUNC(Crop, GPU, CropFunc);
REGISTER_TYPED_FUNC(CropGrad, GPU, CropGradFunc);
#endif
......
......@@ -336,7 +336,7 @@ private:
REGISTER_TYPED_FUNC(CrossMapNormal, CPU, CrossMapNormalFunc);
REGISTER_TYPED_FUNC(CrossMapNormalGrad, CPU, CrossMapNormalGradFunc);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
REGISTER_TYPED_FUNC(CrossMapNormal, GPU, CrossMapNormalFunc);
REGISTER_TYPED_FUNC(CrossMapNormalGrad, GPU, CrossMapNormalGradFunc);
#endif
......
......@@ -292,7 +292,7 @@ REGISTER_TYPED_FUNC(DepthwiseConvGradInput,
REGISTER_TYPED_FUNC(DepthwiseConvGradFilter,
CPU,
DepthwiseConvGradFilterFunction);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
REGISTER_TYPED_FUNC(DepthwiseConv, GPU, DepthwiseConvFunction);
REGISTER_TYPED_FUNC(DepthwiseConvGradInput,
GPU,
......
......@@ -17,7 +17,7 @@ limitations under the License. */
namespace paddle {
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
TEST(DepthwiseConv, Forward) {
DepthwiseConvolution<DEVICE_TYPE_CPU, DEVICE_TYPE_GPU>(
"GemmConv-CPU", "DepthwiseConv-GPU", forward);
......
......@@ -340,7 +340,7 @@ public:
REGISTER_TYPED_FUNC(GemmConv, CPU, GemmConvFunction);
REGISTER_TYPED_FUNC(GemmConvGradInput, CPU, GemmConvGradInputFunction);
REGISTER_TYPED_FUNC(GemmConvGradFilter, CPU, GemmConvGradFilterFunction);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
REGISTER_TYPED_FUNC(GemmConv, GPU, GemmConvFunction);
REGISTER_TYPED_FUNC(GemmConvGradInput, GPU, GemmConvGradInputFunction);
REGISTER_TYPED_FUNC(GemmConvGradFilter, GPU, GemmConvGradFilterFunction);
......
......@@ -24,7 +24,7 @@ TEST(GemmConv, NaiveConv) {
"NaiveConv-CPU", "GemmConv-CPU", forward);
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
TEST(GemmConv, Forward) {
Convolution<DEVICE_TYPE_CPU, DEVICE_TYPE_GPU>(
"GemmConv-CPU", "GemmConv-GPU", forward);
......
......@@ -116,7 +116,7 @@ void TestIm2ColFunctor() {
TEST(Im2ColFunctor, CPU) { TestIm2ColFunctor<DEVICE_TYPE_CPU, float>(); }
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
TEST(Im2ColFunctor, GPU) { TestIm2ColFunctor<DEVICE_TYPE_GPU, float>(); }
......
......@@ -341,7 +341,7 @@ private:
};
REGISTER_TYPED_FUNC(MulOp, CPU, MulFunc);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
REGISTER_TYPED_FUNC(MulOp, GPU, MulFunc);
#endif
} // namespace paddle
......@@ -207,7 +207,7 @@ private:
REGISTER_TYPED_FUNC(Pad, CPU, PadFunc);
REGISTER_TYPED_FUNC(PadGrad, CPU, PadGradFunc);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
REGISTER_TYPED_FUNC(Pad, GPU, PadFunc);
REGISTER_TYPED_FUNC(PadGrad, GPU, PadGradFunc);
#endif
......
......@@ -217,7 +217,7 @@ public:
REGISTER_TYPED_FUNC(RowConv, CPU, RowConvFunc);
REGISTER_TYPED_FUNC(RowConvGrad, CPU, RowConvGradFunc);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
REGISTER_TYPED_FUNC(RowConv, GPU, RowConvFunc);
REGISTER_TYPED_FUNC(RowConvGrad, GPU, RowConvGradFunc);
#endif
......
......@@ -132,7 +132,7 @@ public:
REGISTER_TYPED_FUNC(NCHW2NHWC, CPU, NCHW2NHWCFunc);
REGISTER_TYPED_FUNC(NHWC2NCHW, CPU, NHWC2NCHWFunc);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
REGISTER_TYPED_FUNC(NCHW2NHWC, GPU, NCHW2NHWCFunc);
REGISTER_TYPED_FUNC(NHWC2NCHW, GPU, NHWC2NCHWFunc);
#endif
......
......@@ -16,7 +16,7 @@ limitations under the License. */
#include "BatchNormalizationLayer.h"
#include "Layer.h"
#include "paddle/utils/Stat.h"
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
#include "CudnnBatchNormLayer.h"
#endif
......
......@@ -13,7 +13,7 @@ See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/utils/Stat.h"
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
#include "hl_batch_transpose.h"
#endif
#include "BatchNormalizationLayer.h"
......@@ -90,7 +90,7 @@ void BatchNormalizationLayer::expandMat(const MatrixPtr& in, MatrixPtr& out) {
size_t batchSize = in->getHeight();
CHECK_EQ(out->getHeight(), batchSize * imgPixels_);
if (useGpu_) {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
LOG(FATAL) << "paddle is compiled only for cpu";
#else
batchTranspose(
......@@ -127,7 +127,7 @@ void BatchNormalizationLayer::shrinkMat(const MatrixPtr& in, MatrixPtr& out) {
}
CHECK_EQ(in->getHeight(), static_cast<size_t>(batchSize * imgPixels_));
if (useGpu_) {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
LOG(FATAL) << "paddle is compiled only for cpu";
#else
batchTranspose(
......
......@@ -15,7 +15,7 @@ limitations under the License. */
#include "PoolLayer.h"
#include "PoolProjectionLayer.h"
#include "paddle/utils/Logging.h"
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
#include "CudnnPoolLayer.h"
#endif
namespace paddle {
......@@ -53,7 +53,7 @@ Layer* PoolLayer::create(const LayerConfig& config) {
const std::string& pool = config.inputs(0).pool_conf().pool_type();
if (pool == "max-projection" || pool == "avg-projection") {
return new PoolProjectionLayer(config);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
} else if (CudnnPoolLayer::typeCheck(pool)) {
return new CudnnPoolLayer(config);
#endif
......
......@@ -674,7 +674,7 @@ void testLayerGradKernel(TestConfig testConf,
bool useGpu,
bool useWeight,
float epsilon) {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
if (useGpu) return;
#endif
FLAGS_use_gpu = useGpu;
......
......@@ -119,7 +119,7 @@ TEST(Layer, batchNorm) {
CHECK_EQ(static_cast<int>(convLayer->getOutputValue()->getWidth()), 576);
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
void batchNormInference(int n, int c, int h, int w) {
MatrixPtr input = std::make_shared<GpuMatrix>(n, c * h * w);
MatrixPtr cudnnOut = std::make_shared<GpuMatrix>(n, c * h * w);
......
......@@ -117,7 +117,7 @@ MatrixPtr doOneConvTest(size_t imgSize,
}
TEST(Layer, convParaUnified) {
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
MatrixPtr input, resultCpu, resultGpu;
/// TEST1 for conv ///
......
......@@ -150,7 +150,7 @@ TEST(Layer, detectionOutputLayerFwd) {
useGpu,
result2);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
// GPU case 1.
useGpu = true;
inputLoc = Matrix::create(1, 16, false, useGpu);
......
......@@ -51,7 +51,7 @@ void testEvaluator(TestConfig testConf,
string testEvaluatorName,
size_t batchSize,
bool useGpu) {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
if (useGpu) return;
#endif
FLAGS_use_gpu = useGpu;
......
......@@ -97,7 +97,7 @@ TEST(Layer, kmaxSeqScoreLayer) {
Matrix::create(subSeqStartPosition.back(), 1, false, false);
std::vector<bool> mode = {false};
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
mode.push_back(true);
#endif
......
......@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
#include <cudnn.h>
#endif
#include <gtest/gtest.h>
......@@ -258,7 +258,7 @@ void testProjectionConv(size_t groups, bool isDeconv) {
true);
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
TEST(Projection, conv) {
/// test ConvProjection
testProjectionConv(1, false);
......@@ -422,7 +422,7 @@ TEST(Layer, depthwiseConvLayer) {
// 'depthwise_conv' is a special case of 'exconv' whose
// group number equals the number of input channels.
testDepthwiseConvLayer("exconv", /* useGpu= */ false);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
testDepthwiseConvLayer("exconv", /* useGpu= */ true);
#endif
}
......@@ -480,7 +480,7 @@ void testConvLayer(const string& type, bool trans, bool useGpu) {
TEST(Layer, convLayer) {
testConvLayer("exconv", /* trans= */ false, /* useGpu= */ false);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
testConvLayer("exconv", /* trans= */ false, /* useGpu= */ true);
testConvLayer("cudnn_conv", /* trans= */ false, /* useGpu= */ true);
#endif
......@@ -525,7 +525,7 @@ TEST(Layer, convTransLayer) {
for (auto useGpu : {false, true}) {
testConvTransLayer("exconvt", /* trans= */ false, /* useGpu= */ useGpu);
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
testConvTransLayer("cudnn_convt", /* trans= */ false, /* useGpu= */ true);
#endif
}
......@@ -638,7 +638,7 @@ TEST(Layer, SelectiveFullyConnectedLayer) {
/* trans= */ false,
/* useGup= */ false,
false);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
testLayerGrad(config,
"selective_fc",
100,
......@@ -1210,7 +1210,7 @@ void testPoolLayer(const string& poolType, bool trans, bool useGpu) {
testLayerGrad(config, "pool", 100, trans, useGpu);
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
void testPoolLayer2(const string& poolType, bool trans, bool useGpu) {
TestConfig config;
config.inputDefs.push_back({INPUT_DATA, "layer_0", 3200, 0});
......@@ -1236,7 +1236,7 @@ TEST(Layer, PoolLayer) {
testPoolLayer("avg-projection", /* trans= */ false, /* useGpu= */ false);
testPoolLayer("max-projection", /* trans= */ false, /* useGpu= */ false);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
testPoolLayer("avg-projection", /* trans= */ false, /* useGpu= */ true);
testPoolLayer("max-projection", /* trans= */ false, /* useGpu= */ true);
testPoolLayer("cudnn-max-pool", /* trans= */ false, /* useGpu= */ true);
......@@ -1309,7 +1309,7 @@ void testPool3DLayer(const string& poolType, bool trans, bool useGpu) {
TEST(Layer, Pool3DLayer) {
testPool3DLayer("avg", /* trans= */ false, /* useGpu= */ false);
testPool3DLayer("max", /* trans= */ false, /* useGpu= */ false);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
testPool3DLayer("avg", /* trans= */ false, /* useGpu= */ true);
testPool3DLayer("max", /* trans= */ false, /* useGpu= */ true);
#endif
......@@ -1695,7 +1695,7 @@ void testBatchNormLayer(const string& type, bool trans, bool useGpu) {
TEST(Layer, BatchNormalizationLayer) {
testBatchNormLayer("batch_norm", false, false);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
testBatchNormLayer("batch_norm", false, true);
if (hl_get_cudnn_lib_version() >= int(4000)) {
testBatchNormLayer("cudnn_batch_norm", false, true);
......@@ -1744,7 +1744,7 @@ void testBatchNorm3DLayer(const string& type, bool trans, bool useGpu) {
TEST(Layer, testBatchNorm3DLayer) {
testBatchNorm3DLayer("batch_norm", false, false);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
testBatchNorm3DLayer("batch_norm", false, true);
if (hl_get_cudnn_lib_version() >= int(4000)) {
testBatchNorm3DLayer("cudnn_batch_norm", false, true);
......@@ -2262,7 +2262,7 @@ void test3DConvLayer(const string& type, bool trans, bool useGpu) {
TEST(Layer, test3DConvLayer) {
test3DConvLayer("conv3d", /* trans= */ false, /* useGpu= */ false);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
test3DConvLayer("conv3d", /* trans= */ false, /* useGpu= */ true);
#endif
}
......@@ -2339,7 +2339,7 @@ void test3DDeConvLayer(const string& type, bool trans, bool useGpu) {
TEST(Layer, test3DDeConvLayer) {
test3DDeConvLayer("deconv3d", /* trans= */ false, /* useGpu= */ false);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
test3DDeConvLayer("deconv3d", /* trans= */ false, /* useGpu= */ true);
#endif
}
......
......@@ -243,7 +243,7 @@ TEST(Compare, concat_slice) {
compareNetwork(config_file_a, config_file_b);
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
TEST(Compare, img_pool) {
std::string config_file_a = "./gserver/tests/img_pool_a.conf";
std::string config_file_b = "./gserver/tests/img_pool_b.conf";
......
......@@ -151,7 +151,7 @@ TEST(Layer, priorBoxLayerFwd) {
useGpu,
result);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
// reset the input parameters
variance[1] = 0.1;
variance[3] = 0.2;
......
......@@ -485,7 +485,7 @@ TEST(ProtoDataProvider, test) {
// Currently in async mode, useGpu is not supported
continue;
}
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
if (useGpu) {
continue;
}
......@@ -525,7 +525,7 @@ TEST(ProtoDataProvider, constant_slots) {
for (int numConstantSlots : {1, 2}) {
for (int useGpu : numTwoArray) {
for (int dataCompression : numTwoArray) {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
if (useGpu) {
continue;
}
......@@ -708,7 +708,7 @@ TEST(ProtoSequenceDataProvider, test) {
// Currently in async mode, useGpu is not supported
continue;
}
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
if (useGpu) {
continue;
}
......
......@@ -37,7 +37,7 @@ TEST(PyDataProvider, py_fill_slots) {
config.clear_files();
std::string dataFile = "gserver/tests/pyDataProvider/pyDataProviderList";
config.set_files(dataFile);
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
bool useGpu = false;
#else
bool useGpu = true;
......@@ -71,7 +71,7 @@ TEST(PyDataProvider, py_fill_nest_slots) {
std::string dataFile = "gserver/tests/pyDataProvider/pyDataProviderList";
config.set_files(dataFile);
EXPECT_EQ(config.IsInitialized(), true);
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
bool useGpu = false;
#else
bool useGpu = true;
......
......@@ -321,7 +321,7 @@ TEST(Layer, SelectiveFcLayer_train_dense_mul) {
"filelist=gserver/tests/SelectiveFcTest/dense_mul_list";
for (auto useGpu : {false, true}) {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
if (useGpu) {
break;
}
......@@ -388,7 +388,7 @@ void testSelectiveFcLayerTrainSparseMul(const LayerConfig& config,
outMatSelfc->getWidth(),
outMatSelfc->getElementCnt()));
cpuOutMatSelfc->copyFrom(*outMatSelfc, HPPL_STREAM_DEFAULT);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
if (useGpu) {
hl_stream_synchronize(HPPL_STREAM_DEFAULT);
}
......@@ -418,7 +418,7 @@ void testSelectiveFcLayerTrainSparseMul(const LayerConfig& config,
MatrixPtr cpuOutMatFc(
new CpuMatrix(outMatFc->getHeight(), outMatFc->getWidth()));
cpuOutMatFc->copyFrom(*outMatFc, HPPL_STREAM_DEFAULT);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
if (useGpu) {
hl_stream_synchronize(HPPL_STREAM_DEFAULT);
}
......@@ -443,7 +443,7 @@ TEST(Layer, SelectiveFcLayer_train_sparse_mul) {
selLayerConfig.set_size(fcLayerWidth);
testSelectiveFcLayerTrainSparseMul(selLayerConfig, false);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
testSelectiveFcLayerTrainSparseMul(selLayerConfig, true);
#endif
}
......
......@@ -195,7 +195,7 @@ TEST(Layer, SeqSliceLayer) {
vector<vector<real>> ends;
std::vector<bool> mode = {false};
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
mode.push_back(true);
#endif
genSeqInfo(seqStartPos, subSeqStartPos);
......
......@@ -199,7 +199,7 @@ TEST(Layer, WarpCTCLayer) {
for (auto batchSize : {1, 10, 32}) {
for (auto normByTimes : {false, true}) {
for (auto useGpu : {false, true}) {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
if (useGpu) continue;
#endif
LOG(INFO) << "layerSize=" << layerSize << " batchSize=" << batchSize
......
......@@ -670,7 +670,7 @@ void GpuMatrix::leftMul(Matrix& a, real scaleAB, real scaleT) {
}
void GpuMatrix::selectRows(Matrix& table, IVector& ids) {
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
CHECK(dynamic_cast<GpuMatrix*>(&table));
CHECK(table.useGpu());
CHECK(ids.useGpu());
......@@ -694,7 +694,7 @@ void GpuMatrix::selectRows(Matrix& table, IVector& ids) {
}
void GpuMatrix::addToRows(Matrix& table, IVector& ids) {
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
CHECK(dynamic_cast<GpuMatrix*>(&table));
CHECK(table.useGpu());
CHECK(ids.useGpu());
......@@ -741,7 +741,7 @@ void GpuMatrix::rowMax(Matrix& max) {
}
void GpuMatrix::rowMax(IVector& maxIds, Matrix& maxVal) {
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
CHECK(maxIds.useGpu() && maxVal.useGpu()) << "Matrix types are not equal";
size_t numSamples = getHeight();
size_t beam = maxVal.getWidth();
......
......@@ -836,7 +836,7 @@ void GpuSparseMatrix::zeroMem() {
}
void GpuSparseMatrix::rowMax(IVector& maxIds, Matrix& maxVal) {
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
CHECK(maxIds.useGpu() && maxVal.useGpu()) << "Matrix types are not equal";
size_t numSamples = getHeight();
size_t beam = maxVal.getWidth();
......
......@@ -172,7 +172,7 @@ void GpuVectorT<T>::isEqualTo(const VectorT<T>& b, const T& value) {
template <class T>
void GpuVectorT<T>::selectFrom(const VectorT<T>& src, const VectorT<int>& ids) {
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
hl_vector_select_from<T>(this->getData(),
this->getSize(),
src.getData(),
......@@ -850,7 +850,7 @@ CpuGpuVectorT<T>::CpuGpuVectorT(CpuGpuVectorT<T>& src,
size_t size)
: sync_(nullptr) {
CHECK_LE(offset + size, static_cast<size_t>(src.getSize()));
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
SyncedFlag* flag = src.getSync();
if (*flag == DATA_AT_CPU) {
src.copyToGpu(); // will set synchronous data between CPU and GPU
......@@ -861,7 +861,7 @@ CpuGpuVectorT<T>::CpuGpuVectorT(CpuGpuVectorT<T>& src,
auto cMemHandle = (src.getVector(false))->getMemoryHandle();
cpuVectorT_ = std::make_shared<CpuVectorT<T>>(
size, std::dynamic_pointer_cast<CpuMemoryHandle>(cMemHandle), offset);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
auto gMemHandle = (src.getVector(true))->getMemoryHandle();
gpuVectorT_ = std::make_shared<GpuVectorT<T>>(
size, std::dynamic_pointer_cast<GpuMemoryHandle>(gMemHandle), offset);
......
......@@ -68,7 +68,7 @@ void testPoolAllocator() {
TEST(Allocator, Pool) {
testPoolAllocator<CpuAllocator>();
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
testPoolAllocator<GpuAllocator>();
#endif
}
......@@ -92,7 +92,7 @@ TEST(MemoryHandle, Cpu) {
EXPECT_EQ(ptr1, ptr2);
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
TEST(MemoryHandle, Gpu) {
int numGpu = hl_get_device_count();
......
......@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
/**
* This test file uses autotest::AutoCompare and cmpWithoutArg to compare the
* implementations of the CPU and GPU member functions in
......
......@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
#include <gtest/gtest.h>
#include "paddle/math/Vector.h"
......
......@@ -94,7 +94,7 @@ void testWrapper(F&& f) {
}
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
TEST(ExecViaCpu, test1) {
testWrapper(f);
testWrapper(&f);
......
......@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
#include <gtest/gtest.h>
#include "paddle/math/Matrix.h"
......
......@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
/**
* This test file uses autotest::AutoCompare and cmpWithArg to compare the
* implementations of the CPU and GPU member functions in Matrix.cpp.
......
......@@ -47,7 +47,7 @@ struct MatrixPara {
SparseFormat format;
};
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
void test_sparse_matrix_mul(MatrixPara paraA,
MatrixPara paraB,
MatrixPara paraC) {
......@@ -452,7 +452,7 @@ TEST(Matrix, SparseMatrixCSRFormatTrimFrom) {
matB->trimFrom(*mat);
checkSMatrixEqual2(matA, matB);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
GpuSparseMatrixPtr matC = std::make_shared<GpuSparseMatrix>(
height, trimedWidth, height, FLOAT_VALUE, SPARSE_CSR, true);
matC->trimFrom(*mat);
......@@ -546,7 +546,7 @@ TEST(Matrix, SparseMatrixCSCFormatTrimFrom) {
matB->trimFrom(*mat);
checkSMatrixEqual2(matA, matB);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
GpuSparseMatrixPtr matC = std::make_shared<GpuSparseMatrix>(
height, trimedWidth, height, FLOAT_VALUE, SPARSE_CSC, true);
matC->trimFrom(*mat);
......
......@@ -91,7 +91,7 @@ int VectorCheckErr(const VectorPtr& vector1, const VectorPtr& vector2) {
typedef std::function<void(size_t size, bool useGpu)> testMatrixFunc;
void testCase(testMatrixFunc matrixFunc) {
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
for (auto useGpu : {false, true}) {
#else
for (auto useGpu : {false}) {
......
......@@ -17,7 +17,7 @@ limitations under the License. */
using namespace paddle; // NOLINT
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
TEST(MatrixBatchTransTest, test_batch_matrix_transpose) {
const int nx = 100;
const int ny = 50;
......
......@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
/// This unittest checks that GpuMatrix/CpuMatrix get the same result, so it
/// is disabled in the CPU-only build.
......
......@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
#include <cuda_runtime.h>
#include <gtest/gtest.h>
......
......@@ -12,7 +12,7 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
/// This unittest checks that GpuSparseMatrix/CpuSparseMatrix get the same
/// result, so it is disabled in the CPU-only build.
......
......@@ -175,7 +175,7 @@ void* BuddyAllocator::SystemAlloc(size_t size) {
}
BuddyAllocator::PoolSet::iterator BuddyAllocator::RefillPool() {
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
if (system_allocator_->UseGpu()) {
if ((total_used_ + total_free_) == 0) {
// Compute the maximum allocation size for the first allocation.
......
......@@ -62,7 +62,7 @@ void CPUAllocator::Free(void* p, size_t size, size_t index) {
bool CPUAllocator::UseGpu() const { return false; }
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
void* GPUAllocator::Alloc(size_t& index, size_t size) {
// CUDA documentation doesn't explain if cudaMalloc returns nullptr
......
......@@ -40,7 +40,7 @@ class CPUAllocator : public SystemAllocator {
virtual bool UseGpu() const;
};
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
class GPUAllocator : public SystemAllocator {
public:
virtual void* Alloc(size_t& index, size_t size);
......
......@@ -56,7 +56,7 @@ TEST(CPUAllocator, LockMem) {
TestAllocator(a, 0);
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
TEST(GPUAllocator, Alloc) {
paddle::memory::detail::GPUAllocator a;
TestAllocator(a, 2048);
......
......@@ -26,7 +26,7 @@ void Copy<platform::CPUPlace, platform::CPUPlace>(platform::CPUPlace, void* dst,
std::memcpy(dst, src, num);
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
template <>
void Copy<platform::CPUPlace, platform::GPUPlace>(platform::CPUPlace dst_place,
void* dst,
......
......@@ -33,7 +33,7 @@ namespace memory {
template <typename DstPlace, typename SrcPlace>
void Copy(DstPlace, void* dst, SrcPlace, const void* src, size_t num);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
/**
* \brief Copy memory from one place to another place.
......
......@@ -62,7 +62,7 @@ size_t Used<platform::CPUPlace>(platform::CPUPlace place) {
return GetCPUBuddyAllocator()->Used();
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
BuddyAllocator* GetGPUBuddyAllocator(int gpu_id) {
using BuddyAllocVec = std::vector<BuddyAllocator*>;
......@@ -77,7 +77,7 @@ BuddyAllocator* GetGPUBuddyAllocator(int gpu_id) {
// GPU buddy allocator initialization
std::call_once(gpu_allocator_flag, [&]() {
int gpu_num = platform::GetDeviceCount();
int gpu_num = platform::GetCUDADeviceCount();
allocators.reserve(gpu_num);
for (int gpu = 0; gpu < gpu_num; gpu++) {
platform::SetDeviceId(gpu);
......
......@@ -80,7 +80,7 @@ TEST(BuddyAllocator, CPUMultAlloc) {
}
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
size_t align(size_t size, paddle::platform::GPUPlace place) {
size += sizeof(paddle::memory::detail::Metadata);
......
......@@ -69,6 +69,22 @@ class ReluOpMaker : public framework::OpProtoAndCheckerMaker {
}
};
template <typename AttrType>
class LeakyReluOpMaker : public framework::OpProtoAndCheckerMaker {
public:
LeakyReluOpMaker(framework::OpProto *proto,
framework::OpAttrChecker *op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("X", "Input of LeakyRelu operator");
AddOutput("Y", "Output of LeakyRelu operator");
AddComment(
"LeakyRelu activation operator, "
"leaky_relu = max(x, alpha * x)");
AddAttr<AttrType>("alpha", "The small negative slope")
.SetDefault(static_cast<AttrType>(0.02f));
}
};
class TanhOpMaker : public framework::OpProtoAndCheckerMaker {
public:
TanhOpMaker(framework::OpProto *proto, framework::OpAttrChecker *op_checker)
......@@ -240,6 +256,9 @@ REGISTER_OP(softsign, ops::ActivationOp, ops::SoftsignOpMaker, softsign_grad,
REGISTER_OP(brelu, ops::ActivationOp, ops::BReluOpMaker<float>, brelu_grad,
ops::ActivationOpGrad);
REGISTER_OP(leaky_relu, ops::ActivationOp, ops::LeakyReluOpMaker<float>,
leaky_relu_grad, ops::ActivationOpGrad);
REGISTER_OP(soft_relu, ops::ActivationOp, ops::SoftReluOpMaker<float>,
soft_relu_grad, ops::ActivationOpGrad);
......
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#define EIGEN_USE_GPU
#include "paddle/operators/activation_op.h"
......
......@@ -309,6 +309,33 @@ struct SoftReluGradFunctor : public BaseActivationFunctor<T> {
}
};
template <typename T>
struct LeakyReluFunctor : public BaseActivationFunctor<T> {
float alpha;
typename BaseActivationFunctor<T>::AttrPair GetAttrs() {
return {{"alpha", &alpha}};
}
template <typename Device, typename X, typename Y>
void operator()(Device d, X x, Y y) const {
y.device(d) = x.cwiseMax(alpha * x);
}
};
template <typename T>
struct LeakyReluGradFunctor : public BaseActivationFunctor<T> {
float alpha;
typename BaseActivationFunctor<T>::AttrPair GetAttrs() {
return {{"alpha", &alpha}};
}
template <typename Device, typename X, typename Y, typename dY, typename dX>
void operator()(Device d, X x, Y y, dY dy, dX dx) const {
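// d/dx leaky_relu(x) is alpha for x < 0 and 1 for x >= 0; temp1 and temp2
// select those branches so that dx = dy * (alpha if x < 0 else 1).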
auto temp1 = alpha * (x < static_cast<T>(0)).template cast<T>().eval();
auto temp2 = (x >= static_cast<T>(0)).template cast<T>().eval();
dx.device(d) = dy * (temp1 + temp2).template cast<T>();
}
};
template <typename T>
struct PowFunctor : public BaseActivationFunctor<T> {
float factor;
......@@ -379,4 +406,5 @@ struct STanhGradFunctor : public BaseActivationFunctor<T> {
__macro(soft_relu, SoftReluFunctor, SoftReluGradFunctor); \
__macro(pow, PowFunctor, PowGradFunctor); \
__macro(stanh, STanhFunctor, STanhGradFunctor); \
__macro(softsign, SoftsignFunctor, SoftsignGradFunctor)
__macro(softsign, SoftsignFunctor, SoftsignGradFunctor); \
__macro(leaky_relu, LeakyReluFunctor, LeakyReluGradFunctor)
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/operators/adadelta_op.h"
namespace paddle {
namespace operators {
class AdadeltaOp : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
protected:
void InferShape(framework::InferShapeContextBase *ctx) const override {
PADDLE_ENFORCE(ctx->HasInput("Param"),
"Input(Param) of AdadeltaOp should not be null.");
PADDLE_ENFORCE(ctx->HasInput("Grad"),
"Input(Grad) of AdadeltaOp should not be null.");
PADDLE_ENFORCE(ctx->HasInput("AvgSquaredGrad"),
"Input(AvgSquaredGrad) of AdadeltaOp should not be null.");
PADDLE_ENFORCE(ctx->HasInput("AvgSquaredUpdate"),
"Input(AvgSquaredUpdate) of AdadeltaOp should not be null.");
PADDLE_ENFORCE(ctx->HasOutput("ParamOut"),
"Output(ParamOut) of AdadeltaOp should not be null.");
PADDLE_ENFORCE(
ctx->HasOutput("AvgSquaredGradOut"),
"Output(AvgSquaredGradOut) of AdadeltaOp should not be null.");
PADDLE_ENFORCE(
ctx->HasOutput("AvgSquaredUpdateOut"),
"Output(AvgSquaredUpdateOut) of AdadeltaOp should not be null.");
auto param_dim = ctx->GetInputDim("Param");
PADDLE_ENFORCE_EQ(
param_dim, ctx->GetInputDim("Grad"),
"param and grad input of AdadeltaOp should have same dimension");
PADDLE_ENFORCE_EQ(param_dim, ctx->GetInputDim("AvgSquaredGrad"),
"Param and AvgSquaredGrad input of AdadeltaOp "
"should have same dimension");
PADDLE_ENFORCE_EQ(param_dim, ctx->GetInputDim("AvgSquaredUpdate"),
"Param and AvgSquaredUpdate input of AdadeltaOp "
"should have same dimension");
ctx->SetOutputDim("ParamOut", param_dim);
ctx->SetOutputDim("AvgSquaredGradOut", param_dim);
ctx->SetOutputDim("AvgSquaredUpdateOut", param_dim);
}
};
class AdadeltaOpMaker : public framework::OpProtoAndCheckerMaker {
public:
AdadeltaOpMaker(framework::OpProto *proto,
framework::OpAttrChecker *op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
AddInput("Param", "(Tensor) Input parameter");
AddInput("Grad", "(Tensor) Input gradient");
AddInput("AvgSquaredGrad",
"(Tensor) Input expectation of squared gradient");
AddInput("AvgSquaredUpdate",
"(Tensor) Input expectation of squared parameter updates");
AddOutput("ParamOut", "(Tensor) Output parameter");
AddOutput("AvgSquaredGradOut",
"(Tensor) Output expectation of squared gradient");
AddOutput("AvgSquaredUpdateOut",
"(Tensor) Output expectation of squared parameter updates");
AddAttr<float>("rho",
"(float, default 0.95) Exponential decay rate "
"for squared gradients.")
.SetDefault(0.95f);
AddAttr<float>("epsilon",
"(float, default 1.0e-6) Constant for "
"numerical stability")
.SetDefault(1.0e-6f);
AddComment(R"DOC(
Adadelta Updates Operator.
This implements the Adadelta optimizer[1]. Adadelta is a per-dimension
adaptive learning rate method for gradient descent.
Adadelta updates:
avg_squared_grad_out = rho * avg_squared_grad + (1 - rho) * grad * grad
param_update = - sqrt((avg_squared_update + epsilon) /
(avg_squared_grad_out + epsilon)) * grad
avg_squared_update_out = rho * avg_squared_update + (1 - rho) * param_update**2
param_out = param + param_update
References:
[1] ADADELTA: An Adaptive Learning Rate Method
https://arxiv.org/abs/1212.5701
)DOC");
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_WITHOUT_GRADIENT(adadelta, ops::AdadeltaOp, ops::AdadeltaOpMaker);
REGISTER_OP_CPU_KERNEL(
adadelta, ops::AdadeltaOpKernel<paddle::platform::CPUPlace, float>);
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#define EIGEN_USE_GPU
#include "paddle/operators/adadelta_op.h"
namespace ops = paddle::operators;
REGISTER_OP_GPU_KERNEL(
adadelta, ops::AdadeltaOpKernel<paddle::platform::GPUPlace, float>);
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include "paddle/framework/eigen.h"
#include "paddle/framework/op_registry.h"
namespace paddle {
namespace operators {
template <typename Place, typename T>
class AdadeltaOpKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto param_out_tensor = ctx.Output<framework::Tensor>("ParamOut");
auto avg_squared_grad_out_tensor =
ctx.Output<framework::Tensor>("AvgSquaredGradOut");
auto avg_squared_update_out_tensor =
ctx.Output<framework::Tensor>("AvgSquaredUpdateOut");
param_out_tensor->mutable_data<T>(ctx.GetPlace());
avg_squared_grad_out_tensor->mutable_data<T>(ctx.GetPlace());
avg_squared_update_out_tensor->mutable_data<T>(ctx.GetPlace());
float rho = ctx.Attr<float>("rho");
float epsilon = ctx.Attr<float>("epsilon");
auto param = framework::EigenVector<T>::Flatten(
*ctx.Input<framework::Tensor>("Param"));
auto grad = framework::EigenVector<T>::Flatten(
*ctx.Input<framework::Tensor>("Grad"));
// Squared gradient accumulator
auto avg_squared_grad = framework::EigenVector<T>::Flatten(
*ctx.Input<framework::Tensor>("AvgSquaredGrad"));
// Squared updates accumulator
auto avg_squared_update = framework::EigenVector<T>::Flatten(
*ctx.Input<framework::Tensor>("AvgSquaredUpdate"));
auto param_out = framework::EigenVector<T>::Flatten(*param_out_tensor);
auto avg_squared_grad_out =
framework::EigenVector<T>::Flatten(*avg_squared_grad_out_tensor);
auto avg_squared_update_out =
framework::EigenVector<T>::Flatten(*avg_squared_update_out_tensor);
auto place = ctx.GetEigenDevice<Place>();
avg_squared_grad_out.device(place) =
rho * avg_squared_grad + (1 - rho) * grad.square();
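// Adadelta update: delta_x = -sqrt((E[dx^2] + eps) / (E[g^2] + eps)) * g,
// using the freshly decayed squared-gradient average computed above.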
auto update =
-((avg_squared_update + epsilon) / (avg_squared_grad_out + epsilon))
.sqrt() *
grad;
avg_squared_update_out.device(place) =
rho * avg_squared_update + (1 - rho) * update.square();
param_out.device(place) = param + update;
}
};
} // namespace operators
} // namespace paddle
......@@ -34,7 +34,7 @@ struct StridedMemcpyFunctor<T, 1> {
auto& cpu_place = boost::get<platform::CPUPlace>(place);
memory::Copy(cpu_place, dst, cpu_place, src, sizeof(T) * dst_dim.head);
} else {
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
auto& gpu_place = boost::get<platform::GPUPlace>(place);
auto& cuda_ctx =
reinterpret_cast<const platform::CUDADeviceContext&>(dev_ctx);
......
......@@ -29,11 +29,11 @@ class FeedOp : public framework::OperatorWithKernel {
framework::Variable* g_feed_variable =
framework::GetScope()->FindVar("feed_value");
FeedInputs tensors = g_feed_variable->Get<FeedInputs>();
const FeedInputs& tensors = g_feed_variable->Get<FeedInputs>();
auto in_dim = tensors[col].dims();
ctx->SetOutputDim("Out", in_dim);
// need to handle LodTensor later
// TODO(qijun): need to handle LodTensor later
}
framework::DataType IndicateDataType(
......@@ -49,9 +49,9 @@ class FeedOpMaker : public framework::OpProtoAndCheckerMaker {
AddAttr<int>("data_type", "output data type")
.SetDefault(framework::DataType::FP32);
AddAttr<int>("col", "The col in global feed variable").SetDefault(0);
AddAttr<std::vector<int>>("dims", "The dimension of random tensor.");
AddOutput("Out", "The output of dropout op.");
AddComment(R"DOC(Feed data to global feed variable)DOC");
AddAttr<std::vector<int>>("dims", "The dimension of feed tensor.");
AddOutput("Out", "The output of feed op.");
AddComment(R"DOC(Feed data from global feed variable)DOC");
}
};
......
......@@ -31,7 +31,7 @@ class FeedKernel : public framework::OpKernel<T> {
framework::Variable* g_feed_variable =
framework::GetScope()->FindVar("feed_value");
int col = ctx.template Attr<int>("col");
FeedInputs tensors = g_feed_variable->Get<FeedInputs>();
const FeedInputs& tensors = g_feed_variable->Get<FeedInputs>();
out->CopyFrom<T>(tensors[col], ctx.GetPlace());
}
};
......
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/operators/fetch_op.h"
namespace paddle {
namespace operators {
class FetchOp : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
protected:
void InferShape(framework::InferShapeContextBase* ctx) const override {
typedef std::vector<framework::Tensor> FetchOutputs;
PADDLE_ENFORCE(ctx->HasInput("Input"), "Input should not be null.");
int col = ctx->Attrs().Get<int>("col");
framework::Variable* g_fetch_variable =
framework::GetScope()->FindVar("fetch_value");
FetchOutputs* tensors = g_fetch_variable->GetMutable<FetchOutputs>();
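    // Grow the global fetch vector on demand so that slot `col` exists.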
if (tensors->size() < static_cast<size_t>(col + 1)) {
tensors->resize(col + 1);
}
auto input_dim = ctx->GetInputDim("Input");
(*tensors)[col].Resize(input_dim);
// TODO(qijun): need to handle LodTensor later
}
framework::DataType IndicateDataType(
const framework::ExecutionContext& ctx) const override {
return static_cast<framework::DataType>(Attr<int>("data_type"));
}
};
class FetchOpMaker : public framework::OpProtoAndCheckerMaker {
public:
FetchOpMaker(framework::OpProto* proto, framework::OpAttrChecker* op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
AddAttr<int>("data_type", "output data type")
.SetDefault(framework::DataType::FP32);
AddAttr<int>("col", "The col in global fetch variable").SetDefault(0);
AddAttr<std::vector<int>>("dims", "The dimension of fetch tensor.");
AddInput("Input", "The output of fetch op.");
AddComment(R"DOC(Fetch data to global fetch variable)DOC");
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_WITHOUT_GRADIENT(fetch, ops::FetchOp, ops::FetchOpMaker);
REGISTER_OP_CPU_KERNEL(fetch, ops::FetchKernel<float>);
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "paddle/operators/fetch_op.h"
namespace ops = paddle::operators;
REGISTER_OP_GPU_KERNEL(fetch, ops::FetchKernel<float>);
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include "paddle/framework/eigen.h"
#include "paddle/framework/op_registry.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename T>
class FetchKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
typedef std::vector<framework::Tensor> FetchOutputs;
const Tensor* input = ctx.Input<Tensor>("Input");
int col = ctx.template Attr<int>("col");
framework::Variable* g_fetch_variable =
framework::GetScope()->FindVar("fetch_value");
FetchOutputs* tensors = g_fetch_variable->GetMutable<FetchOutputs>();
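    // Fetched results are always materialized on the CPU so the caller can
    // read them back regardless of the place the graph ran on.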
(*tensors)[col].mutable_data<T>(platform::CPUPlace());
(*tensors)[col].CopyFrom<T>(*input, platform::CPUPlace());
}
};
} // namespace operators
} // namespace paddle
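As the feed and fetch kernels above show, the two ops talk to the outside world through two well-known variables in the global scope: `feed_value` (a vector of tensors written by the caller and read by FeedOp at column `col`) and `fetch_value` (written by FetchOp at column `col` and read back by the caller). A stripped-down sketch of this column-indexed mailbox pattern, using plain standard-library types in place of Paddle's `Scope` and `Tensor`:

```cpp
#include <cassert>
#include <vector>

// Sketch only: std::vector<float> stands in for framework::Tensor and
// plain globals stand in for the scope variables.
static std::vector<std::vector<float>> feed_value;
static std::vector<std::vector<float>> fetch_value;

// What FeedOp does: read column `col` of the global feed variable.
std::vector<float> Feed(int col) { return feed_value.at(col); }

// What FetchOp does: store a result into column `col`, growing on demand.
void Fetch(int col, const std::vector<float>& t) {
  if (fetch_value.size() < static_cast<size_t>(col + 1)) {
    fetch_value.resize(col + 1);
  }
  fetch_value[col] = t;
}

int main() {
  feed_value = {{1.f, 2.f}};
  Fetch(0, Feed(0));
  assert(fetch_value[0].size() == 2);
  return 0;
}
```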
......@@ -71,7 +71,7 @@ void testIm2col() {
context =
new paddle::platform::CPUDeviceContext(paddle::platform::CPUPlace());
} else {
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
context =
new paddle::platform::CUDADeviceContext(paddle::platform::GPUPlace());
#else
......@@ -116,7 +116,7 @@ void testIm2col() {
TEST(math, im2col) {
testIm2col<paddle::platform::CPUPlace>();
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
testIm2col<paddle::platform::GPUPlace>();
#endif
}
#include "paddle/operators/math/math_function.h"
#include "gtest/gtest.h"
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
TEST(math_function, notrans_mul_trans) {
paddle::framework::Tensor input1;
paddle::framework::Tensor input1_gpu;
......
......@@ -19,28 +19,25 @@ limitations under the License. */
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
template <typename T, int MajorType = Eigen::RowMajor,
typename IndexType = Eigen::DenseIndex>
using EigenVector = framework::EigenVector<T, MajorType, IndexType>;
template <typename Place, typename T>
class SGDOpKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto param = ctx.Input<Tensor>("Param");
auto grad = ctx.Input<Tensor>("Grad");
auto param_out = ctx.Output<Tensor>("ParamOut");
float lr = ctx.Input<Tensor>("LearningRate")->data<float>()[0];
auto param = ctx.Input<framework::Tensor>("Param");
auto grad = ctx.Input<framework::Tensor>("Grad");
auto param_out = ctx.Output<framework::Tensor>("ParamOut");
auto learning_rate = ctx.Input<framework::Tensor>("LearningRate");
param_out->mutable_data<T>(ctx.GetPlace());
auto p = EigenVector<T>::Flatten(*param);
auto g = EigenVector<T>::Flatten(*grad);
auto o = EigenVector<T>::Flatten(*param_out);
auto p = framework::EigenVector<T>::Flatten(*param);
auto g = framework::EigenVector<T>::Flatten(*grad);
auto o = framework::EigenVector<T>::Flatten(*param_out);
auto lr = framework::EigenVector<T>::Flatten(*learning_rate);
auto place = ctx.GetEigenDevice<Place>();
o.device(place) = p - lr * g;
Eigen::DSizes<int, 1> grad_dsize(grad->numel());
o.device(place) = p - lr.broadcast(grad_dsize) * g;
}
};
......
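The SGD rewrite above stops pulling the scalar learning rate to the host (`data<float>()[0]`) and instead keeps it as a 1-element tensor that Eigen broadcasts to the gradient's length, so the update `o = p - lr * g` can be evaluated entirely on the compute device. A minimal standalone Eigen sketch of the broadcast, not Paddle code:

```cpp
#include <unsupported/Eigen/CXX11/Tensor>
#include <iostream>

int main() {
  Eigen::Tensor<float, 1> p(4), g(4), lr(1);
  p.setValues({1.f, 2.f, 3.f, 4.f});
  g.setConstant(1.f);
  lr.setConstant(0.1f);

  // Broadcast the 1-element learning rate to the gradient's size,
  // mirroring `lr.broadcast(grad_dsize)` in the kernel above.
  Eigen::DSizes<Eigen::DenseIndex, 1> bcast(4);
  Eigen::Tensor<float, 1> out = p - lr.broadcast(bcast) * g;

  std::cout << out(0) << "\n";  // 0.9
  return 0;
}
```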
......@@ -72,7 +72,7 @@ TEST(StridedMemcpy, CPUConcat) {
}
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
TEST(StridedMemcpy, GPUCrop) {
// clang-format off
int src[] = {
......@@ -157,4 +157,4 @@ TEST(StridedMemcpy, GPUConcat) {
#endif
} // namespace operators
} // namespace paddle
\ No newline at end of file
} // namespace paddle
......@@ -35,7 +35,7 @@ Eigen::DefaultDevice* CPUDeviceContext::eigen_device() const {
Place CPUDeviceContext::GetPlace() const { return CPUPlace(); }
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
template <>
Eigen::GpuDevice*
......
......@@ -14,7 +14,7 @@ limitations under the License. */
#include "paddle/platform/enforce.h"
#include "paddle/platform/place.h"
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
#include "paddle/platform/dynload/cublas.h"
#include "paddle/platform/dynload/cudnn.h"
#include "paddle/platform/gpu_info.h"
......@@ -61,7 +61,7 @@ class CPUDeviceContext : public DeviceContext {
std::unique_ptr<Eigen::DefaultDevice> eigen_device_;
};
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
template <>
struct EigenDeviceConverter<platform::GPUPlace> {
using EigenDeviceType = Eigen::GpuDevice;
......
......@@ -20,7 +20,7 @@ TEST(Device, Init) {
using paddle::platform::CUDADeviceContext;
using paddle::platform::GPUPlace;
int count = paddle::platform::GetDeviceCount();
int count = paddle::platform::GetCUDADeviceCount();
for (int i = 0; i < count; i++) {
DeviceContext* device_context = new CUDADeviceContext(GPUPlace(i));
Eigen::GpuDevice* gpu_device =
......@@ -34,7 +34,7 @@ TEST(Device, CUDADeviceContext) {
using paddle::platform::CUDADeviceContext;
using paddle::platform::GPUPlace;
int count = paddle::platform::GetDeviceCount();
int count = paddle::platform::GetCUDADeviceCount();
for (int i = 0; i < count; i++) {
CUDADeviceContext* device_context = new CUDADeviceContext(GPUPlace(i));
Eigen::GpuDevice* gpu_device = device_context->eigen_device();
......
......@@ -29,7 +29,7 @@ limitations under the License. */
#include <cxxabi.h> // for __cxa_demangle
#endif
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
#include "paddle/platform/dynload/cublas.h"
#include "paddle/platform/dynload/cudnn.h"
......@@ -113,7 +113,7 @@ inline typename std::enable_if<sizeof...(Args) != 0, void>::type throw_on_error(
}
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
template <typename... Args>
inline typename std::enable_if<sizeof...(Args) != 0, void>::type throw_on_error(
......
......@@ -213,4 +213,4 @@ TEST(ENFORCE_USER_DEFINED_CLASS, EQ) {
TEST(ENFORCE_USER_DEFINED_CLASS, NE) {
Dims a{{1, 2, 3, 4}}, b{{5, 6, 7, 8}};
ASSERT_THROW(PADDLE_ENFORCE_EQ(a, b), paddle::platform::EnforceNotMet);
}
\ No newline at end of file
}
......@@ -26,11 +26,11 @@ DEFINE_double(fraction_of_gpu_memory_to_use, 0.95,
namespace paddle {
namespace platform {
int GetDeviceCount() {
int GetCUDADeviceCount() {
int count;
PADDLE_ENFORCE(
cudaGetDeviceCount(&count),
"cudaGetDeviceCount failed in paddle::platform::GetDeviceCount");
"cudaGetDeviceCount failed in paddle::platform::GetCUDADeviceCount");
return count;
}
......@@ -43,7 +43,8 @@ int GetCurrentDeviceId() {
}
void SetDeviceId(int id) {
PADDLE_ENFORCE(id < GetDeviceCount(), "id must less than GPU count");
// TODO(qijun): find a better way to cache the cuda device count
PADDLE_ENFORCE(id < GetCUDADeviceCount(), "id must be less than GPU count");
PADDLE_ENFORCE(cudaSetDevice(id),
"cudaSetDevice failed in paddle::platform::SetDeviceId");
}
......
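The TODO above asks for a way to cache the device count instead of querying CUDA on every `SetDeviceId`. One possible approach (a sketch, not what this commit does) is a function-local static, which C++11 initializes exactly once and thread-safely:

```cpp
#include <cuda_runtime.h>

// Sketch of a caching strategy for the TODO above; the real
// GetCUDADeviceCount in this commit still queries CUDA on each call.
int CachedCUDADeviceCount() {
  static const int count = [] {
    int n = 0;
    cudaGetDeviceCount(&n);  // error handling elided for brevity
    return n;
  }();
  return count;
}
```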
......@@ -14,7 +14,7 @@ limitations under the License. */
#pragma once
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
#include <cuda_runtime.h>
#include <stddef.h>
......@@ -28,7 +28,7 @@ const std::string kEnvFractionGpuMemoryToUse =
"PADDLE_FRACTION_GPU_MEMORY_TO_USE";
//! Get the total number of GPU devices in the system.
int GetDeviceCount();
int GetCUDADeviceCount();
//! Get the current GPU device id in system.
int GetCurrentDeviceId();
......
......@@ -16,7 +16,7 @@
#include <boost/config.hpp>
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
// Because boost's variadic templates have a bug on nvcc, boost will disable
// variadic template support when GPU is enabled on nvcc.
......
......@@ -215,7 +215,7 @@ int main(int argc, char** argv) {
uint64_t dataSize = FLAGS_dim * sizeof(real);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
GpuVector gpuParam(FLAGS_dim);
GpuVector gpuGrad(FLAGS_dim);
#else
......
......@@ -99,7 +99,7 @@ TEST(ProtoServer, regular) {
}
TEST(ProtoServer, extended) {
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
ProtoClient* client;
if (FLAGS_rdma_tcp == "rdma")
client = new ProtoClient(FLAGS_server_addr, FLAGS_port, F_RDMA);
......
......@@ -34,7 +34,7 @@ static size_t UniqueIntegerGenerator() {
}
bool IsCompileGPU() {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
return false;
#else
return true;
......@@ -78,7 +78,7 @@ PYBIND11_PLUGIN(core) {
.def("set", PyCPUTensorSetFromArray<float>)
.def("set", PyCPUTensorSetFromArray<int>)
.def("set", PyCPUTensorSetFromArray<double>)
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
.def("set", PyCUDATensorSetFromArray<float>)
.def("set", PyCUDATensorSetFromArray<int>)
.def("set", PyCUDATensorSetFromArray<double>)
......@@ -96,7 +96,7 @@ PYBIND11_PLUGIN(core) {
.def(
"__init__",
[](LoDTensor &instance, const std::vector<std::vector<size_t>> &lod) {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
new (&instance) LoDTensor(lod);
#else
LoD new_lod;
......@@ -107,7 +107,7 @@ PYBIND11_PLUGIN(core) {
})
.def("set_lod",
[](LoDTensor &self, const std::vector<std::vector<size_t>> &lod) {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
self.set_lod(lod);
#else
LoD new_lod;
......@@ -117,7 +117,7 @@ PYBIND11_PLUGIN(core) {
#endif
})
.def("lod", [](LoDTensor &self) -> std::vector<std::vector<size_t>> {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
return self.lod();
#else
auto lod = self.lod();
......@@ -203,7 +203,7 @@ All parameter, weight, gradient are variables in Paddle.
.def_static("create",
[](paddle::platform::GPUPlace& place)
-> paddle::platform::DeviceContext* {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
PADDLE_THROW("GPUPlace is not supported in CPU device.");
#else
return new paddle::platform::CUDADeviceContext(place);
......
......@@ -106,7 +106,7 @@ void PyCPUTensorSetFromArray(
std::memcpy(dst, array.data(), sizeof(T) * array.size());
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
template <typename T>
void PyCUDATensorSetFromArray(
framework::Tensor &self,
......
......@@ -36,4 +36,4 @@ TEST(to_string, user_defined) {
using namespace paddle::string;
UserDefinedClass instance;
ASSERT_EQ(kOutputString, to_string(instance));
}
\ No newline at end of file
}
......@@ -29,7 +29,7 @@ int main(int argc, char** argv) {
initMain(argc, argv);
initPython(argc, argv);
string confFile = TrainerConfigHelper::getConfigNameFromPath(FLAGS_model_dir);
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
FLAGS_use_gpu = false;
#endif
auto config = std::make_shared<TrainerConfigHelper>(confFile);
......
......@@ -146,7 +146,7 @@ void compareGradient(comData& comDataCpu, comData& comDataGpu) {
}
int main(int argc, char** argv) {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
exit(0);
#endif
paddle::initMain(argc, argv);
......
......@@ -174,7 +174,7 @@ TEST(compareSparse, multiGradientMachine) {
FLAGS_local = local;
FLAGS_ports_num_for_sparse = 5;
for (bool useGpu : {false, true}) {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
if (useGpu) continue;
#endif
FLAGS_parallel_nn = useGpu;
......@@ -198,7 +198,7 @@ TEST(compareSparse, NeuralNetwork) {
FLAGS_local = local;
FLAGS_ports_num_for_sparse = 5;
for (bool useGpu : {false, true}) {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
if (useGpu) continue;
#endif
FLAGS_parallel_nn = useGpu;
......
......@@ -51,7 +51,7 @@ void checkGradientTest(const string& configFile,
TEST(checkGradient, cpu) { checkGradientTest(configFile1, false, false); }
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
TEST(checkGradient, gpu) { checkGradientTest(configFile1, true, false); }
TEST(checkGradient, multiGpu) {
......@@ -97,7 +97,7 @@ TEST(checkGradient, hsigmoid) { checkGradientTest(configFile2, false, false); }
TEST(checkGradient, chunk) {
checkGradientTest(configFile3, false, false);
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
checkGradientTest(configFile3, true, true);
#endif
}
......
......@@ -79,7 +79,7 @@ void trainerOnePassTest(const string& configFile,
// 1. test trainer (cpu, gpu).
TEST(trainerOnePass, cpu) { trainerOnePassTest(configFile1, false, false); }
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
TEST(trainerOnePass, gpu) { trainerOnePassTest(configFile1, true, false); }
TEST(trainerOnePass, gpu2) { trainerOnePassTest(configFile1, true, false, 2); }
......@@ -94,7 +94,7 @@ TEST(trainerOnePass, parallel) {
#endif
// 2. test average_window.
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
TEST(average_window, gpu) {
trainerOnePassTest(configFile1, true, false, 4, 0.01);
}
......@@ -266,7 +266,7 @@ TEST(checkRemoteUpdater, cpuTrainerOldUpdater) {
checkRemoteParameterUpdaterTest(configFile1, false, false, 1, true);
}
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
TEST(checkRemoteUpdater, gpuTrainer) {
checkRemoteParameterUpdaterTest(configFile1, true, false);
}
......
......@@ -113,7 +113,7 @@ void testGeneration(const string& configFile,
#ifndef PADDLE_TYPE_DOUBLE
TEST(RecurrentGradientMachine, test_generation) {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
const auto useGpuConfs = {false};
#else
const auto useGpuConfs = {true, false};
......
......@@ -14,7 +14,7 @@ limitations under the License. */
#include "Flags.h"
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
DEFINE_bool(use_gpu, false, "Only support CPU training");
#else
DEFINE_bool(use_gpu, true, "Whether to use GPU for training");
......
......@@ -218,7 +218,7 @@ protected:
* *d2* is the peer device that the *d1* device is enabled to access directly.
*/
inline void enablePeerAccess(int d1, int d2) {
#ifdef PADDLE_WITH_GPU
#ifdef PADDLE_WITH_CUDA
if (hl_device_can_access_peer(d1, d2)) {
SetDevice dev(d1);
hl_device_enable_peer_access(d2);
......
......@@ -48,7 +48,7 @@ void printVersion(std::ostream& os);
* @return return true if paddle compiled with GPU
*/
constexpr bool isWithGpu() {
#ifndef PADDLE_WITH_GPU
#ifndef PADDLE_WITH_CUDA
return false;
#else
return true;
......
......@@ -122,6 +122,23 @@ class TestBRelu(OpTest):
self.check_grad(['X'], 'Y', max_relative_error=0.02)
class TestLeakyRelu(OpTest):
def setUp(self):
self.op_type = "leaky_relu"
alpha = 0.02
self.attrs = {'alpha': alpha}
self.inputs = {'X': np.random.uniform(-3, 3, [4, 4]).astype("float32")}
self.outputs = {
'Y': np.maximum(self.inputs['X'], alpha * self.inputs['X'])
}
def test_check_output(self):
self.check_output()
def test_check_grad(self):
self.check_grad(['X'], 'Y', max_relative_error=0.007)
class TestSoftRelu(OpTest):
def setUp(self):
self.op_type = "soft_relu"
......
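The new TestLeakyRelu above checks the op against the NumPy reference `np.maximum(x, alpha * x)`; for `0 < alpha < 1` this is exactly the usual piecewise definition:

```latex
\mathrm{leaky\_relu}(x) =
\begin{cases}
x, & x \ge 0 \\
\alpha x, & x < 0
\end{cases}
\;=\; \max(x, \alpha x), \qquad 0 < \alpha < 1
```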
import unittest
import numpy as np
from op_test import OpTest
class TestAdadeltaOp1(OpTest):
def setUp(self):
self.op_type = "adadelta"
param = np.random.uniform(-1, 1, (102, 105)).astype("float32")
grad = np.random.uniform(-1, 1, (102, 105)).astype("float32")
# The squared gradient is positive
avg_squared_grad = np.random.random((102, 105)).astype("float32")
# The squared update is positive
avg_squared_update = np.random.random((102, 105)).astype("float32")
rho = 0.95
epsilon = 1e-6
self.inputs = {
'Param': param,
'Grad': grad,
'AvgSquaredGrad': avg_squared_grad,
'AvgSquaredUpdate': avg_squared_update
}
self.attrs = {'rho': rho, 'epsilon': epsilon}
avg_squared_grad_out = rho * avg_squared_grad + \
(1 - rho) * np.square(grad)
update = -np.multiply(
np.sqrt(
np.divide(avg_squared_update + epsilon, avg_squared_grad_out +
epsilon)), grad)
avg_squared_update_out = rho * avg_squared_update + \
(1 - rho) * np.square(update)
param_out = param + update
self.outputs = {
'ParamOut': param_out,
'AvgSquaredGradOut': avg_squared_grad_out,
'AvgSquaredUpdateOut': avg_squared_update_out
}
def test_check_output(self):
self.check_output()
class TestAdadeltaOp2(OpTest):
'''Test Adadelta op with default attribute values
'''
def setUp(self):
self.op_type = "adadelta"
param = np.random.uniform(-1, 1, (102, 105)).astype("float32")
grad = np.random.uniform(-1, 1, (102, 105)).astype("float32")
# The squared gradient is positive
avg_squared_grad = np.random.random((102, 105)).astype("float32")
# The squared update is positive
avg_squared_update = np.random.random((102, 105)).astype("float32")
rho = 0.95
epsilon = 1e-6
self.inputs = {
'Param': param,
'Grad': grad,
'AvgSquaredGrad': avg_squared_grad,
'AvgSquaredUpdate': avg_squared_update
}
avg_squared_grad_out = rho * avg_squared_grad + \
(1 - rho) * np.square(grad)
update = -np.multiply(
np.sqrt(
np.divide(avg_squared_update + epsilon, avg_squared_grad_out +
epsilon)), grad)
avg_squared_update_out = rho * avg_squared_update + \
(1 - rho) * np.square(update)
param_out = param + update
self.outputs = {
'ParamOut': param_out,
'AvgSquaredGradOut': avg_squared_grad_out,
'AvgSquaredUpdateOut': avg_squared_update_out
}
def test_check_output(self):
self.check_output()
if __name__ == "__main__":
unittest.main()