diff --git a/doc/design/cpp_data_feeding.md b/doc/design/cpp_data_feeding.md
deleted file mode 100644
index 2cbb0083e6b557d703ce180cb0a85050a777aa2f..0000000000000000000000000000000000000000
--- a/doc/design/cpp_data_feeding.md
+++ /dev/null
@@ -1,78 +0,0 @@
-# C++ Data Feeding
-
-While using Paddle V2 API for Training, data feeding completely depends on the Python code. To get rid of the Python environment and achieve the goal of "wrapping the whole training by a while loop op" in Paddle Fluid, a C++ data feeding mechanism is required. 
-
-In this document we show the fundamental design of a C++ data feeding process, which includes data reading, shuffling and batching.
-
-## Reader
-
-In order to handle the above mentioned problem, a new concept called 'Reader' is introduced. `Reader` is a series of inherited classes which can be held by our `Variable` and they are used to read or process file data.
-
-
-### `ReaderBase`
-
-`ReaderBase` is the abstract base class for all readers. It defines the interface for all readers.
-
-```cpp
-class ReaderBase {
- public:
-  explicit ReaderBase(const std::vector<DDim>& shapes) : shapes_(shapes) {
-    PADDLE_ENFORCE(!shapes_.empty());
-  }
-  // Read the next batch of data. (A 'batch' can be only one instance)
-  // If the next batch doesn't exist, '*out' will be an empty std::vector.
-  virtual void ReadNext(std::vector<LoDTensor>* out) = 0;
-  
-  // Reinitialize the reader and read the file from the beginning.
-  virtual void ReInit() = 0;
-  
-  // Get a certain read in data's shape.
-  DDim shape(size_t idx) const;
-  // Get shapes of all read in data.
-  std::vector<DDim> shapes() const { return shapes_; }
-  // Set shapes of read in data.
-  void set_shapes(const std::vector<DDim>& shapes) { shapes_ = shapes; }
-
-  virtual ~ReaderBase() {}
-
- protected:
-  std::vector<DDim> shapes_;
-};
-```
-
-### `FileReader` and `DecoratedReader`
-
-These two classes are derived from the `ReaderBase` and will further be derived by more specific readers. Thus, in our design, there are two kinds of readers: file readers and decorated readers. A file reader reads from a file of some specific format, and yield only one instance of data at a time. For example, RecordIO reader, jpg reader, .... A decorated reader takes another reader(both file reader and decorated reader are OK) as its 'underlying reader'. It gets data from its underlying reader, does some processing on them(shuffling, or batching), then yields processed data. The output data of a decorated reader can be a single instance or a batch. `ShuffleReader` and `BatchReader` are both decorated readers.
-
-All the readers share exactly the same interface as defined in `ReaderBase`. So they can be decorated for more than one time: We can **shuffle** a reader's outputs and then **batch** the shuffle outputs. The interface consistency also allows related ops use readers without knowing what they are exactly.
-
-
-### `ReaderHolder`
-
-Different readers belong to different class types. This leads to a problem: How can we drop them into `Variable`s and fetch them out by a unified method? For example, if a Variable holds a `BatchReader`, we can not get it by the following code:
-
-```cpp
-var->Get<ReaderBase>("batch_reader");
-```
-
-We would have to write:
-
-```cpp
-var->Get<BatchReader>("batch_reader");
-```
-
-This requires that in order to get a reader from a variable, every time, we must know the reader's type exactly. This is nearly impossible.
-
-To solve this problem, we introduce `ReaderHolder` as a wrapper. It acts as an empty decorator of `ReaderBase`, which hides reader's type. With `ReaderHolder` we are able to fetch all types of readers by `var->Get<ReaderHolder>("...")` and regard the obtained object as a reader.
-
-## Related Operators
-
-To create and invoke readers, some new ops are introduced:
-
-### `CreateReaderOp`
-
-Each reader has its creation op. File readers' creation ops have no input and yield the created file reader as its output. Decorated readers' creation ops take the underlying readers as inputs and then yield new decorated readers.
-
-### `ReadOp`
-
-A reader is only a Variable. It cannot trigger the reading process by itself. So we add the `ReadOp` to execute it. A `ReadOp` takes a reader Variable as its input. Each time it runs, it invokes the reader‘s `ReadNext()` function and gets a new batch of data(or only one instance of data, if we use file reader directly). The output data of a reader are in the form of `std::vector<LoDTenosr>`, so the `ReadOp` also needs to split the vector and move LoDTensors to their respective output Variables.
diff --git a/doc/design/images/duplicate_op.graffle b/doc/design/images/duplicate_op.graffle
deleted file mode 100644
index 5979f792e252f028a615729215529c2be42d9165..0000000000000000000000000000000000000000
Binary files a/doc/design/images/duplicate_op.graffle and /dev/null differ
diff --git a/doc/design/images/duplicate_op.png b/doc/design/images/duplicate_op.png
deleted file mode 100644
index f299c5d37f260a1bb0daec886f0a4ee1c1f31c92..0000000000000000000000000000000000000000
Binary files a/doc/design/images/duplicate_op.png and /dev/null differ
diff --git a/doc/design/images/duplicate_op2.graffle b/doc/design/images/duplicate_op2.graffle
deleted file mode 100644
index 5cec3bc64dbd44dc99e348485969f29bd128ceb1..0000000000000000000000000000000000000000
Binary files a/doc/design/images/duplicate_op2.graffle and /dev/null differ
diff --git a/doc/design/images/duplicate_op2.png b/doc/design/images/duplicate_op2.png
deleted file mode 100644
index 21cdd5cabf1b5203e1435a75b57770d2f702fa92..0000000000000000000000000000000000000000
Binary files a/doc/design/images/duplicate_op2.png and /dev/null differ
diff --git a/doc/design/images/replica.png b/doc/design/images/replica.png
deleted file mode 100644
index ef59e56b01d792a059279e6bb9a29f3db6a59a41..0000000000000000000000000000000000000000
Binary files a/doc/design/images/replica.png and /dev/null differ
diff --git a/doc/design/images/two_phase_commit.png b/doc/design/images/two_phase_commit.png
deleted file mode 100644
index ef6f7317bd440cc7d9fe08fcbbf2b7a542f99049..0000000000000000000000000000000000000000
Binary files a/doc/design/images/two_phase_commit.png and /dev/null differ
diff --git a/doc/design/images/asgd.gif b/doc/fluid/design/algorithm/images/asgd.gif
similarity index 100%
rename from doc/design/images/asgd.gif
rename to doc/fluid/design/algorithm/images/asgd.gif
diff --git a/doc/design/images/theta_star.gif b/doc/fluid/design/algorithm/images/theta_star.gif
similarity index 100%
rename from doc/design/images/theta_star.gif
rename to doc/fluid/design/algorithm/images/theta_star.gif
diff --git a/doc/design/parameter_average.md b/doc/fluid/design/algorithm/parameter_average.md
similarity index 100%
rename from doc/design/parameter_average.md
rename to doc/fluid/design/algorithm/parameter_average.md
diff --git a/doc/design/build_system/README.md b/doc/fluid/design/concepts/README.md
similarity index 100%
rename from doc/design/build_system/README.md
rename to doc/fluid/design/concepts/README.md
diff --git a/doc/design/block.md b/doc/fluid/design/concepts/block.md
similarity index 100%
rename from doc/design/block.md
rename to doc/fluid/design/concepts/block.md
diff --git a/doc/fluid/design/concepts/cpp_data_feeding.md b/doc/fluid/design/concepts/cpp_data_feeding.md
new file mode 100644
index 0000000000000000000000000000000000000000..8607b40ccbbe01db77afed72c1efa780b520744c
--- /dev/null
+++ b/doc/fluid/design/concepts/cpp_data_feeding.md
@@ -0,0 +1,171 @@
+# C++ Data Feeding
+
+While using Paddle V2 API for training, data feeding completely depends on the Python code. To get rid of the Python environment and achieve the goal of "wrapping the whole training by a while loop op" in Paddle Fluid, a C++ data feeding mechanism is required.
+
+In this document, we show the fundamental design of a C++ data feeding process, which includes data reading, shuffling and batching.
+
+## Overview
+
+![](images/readers.png)
+
+## Reader
+
+In order to handle the above-mentioned problem, a new concept called 'Reader' is introduced. `Reader` is a series of inherited classes which can be held by our `Variable` and they are used to read or process file data.
+
+
+### ReaderBase
+
+`ReaderBase` is the abstract base class for all readers. It defines the interface for all readers.
+
+```cpp
+class ReaderBase {
+ public:
+  // Reads the next batch of data. (A 'batch' can be only one instance)
+  // If the next batch doesn't exist, it throws an exception
+  virtual void ReadNext(std::vector<LoDTensor>* out) = 0;
+  
+  // Checks whether the next instance exists.
+  virtual bool HasNext() const = 0;
+  
+  // Reinitializes the reader and reads the file from the beginning.
+  virtual void ReInit() = 0;
+
+  virtual ~ReaderBase();
+};
+```
+
+### FileReader
+
+`FileReader` is derived from `ReaderBase`. It is still an abstract class, and readers for specific file formats derive from it.
+
+```cpp
+class FileReader : public ReaderBase {
+ public:
+  explicit FileReader(const std::vector<DDim>& dims);
+
+  void ReadNext(std::vector<LoDTensor>* out) override;
+
+ protected:
+  virtual void ReadNextImpl(std::vector<LoDTensor>* out) = 0;
+
+ private:
+  std::vector<DDim> dims_;
+};
+```
+
+A file reader is bound to a single file and reads one data instance at a time. Each type of file reader shall implement its own `ReadNextImpl()`, `HasNext()` and `ReInit()`.
+
+`ReadNextImpl()` is invoked by `ReadNext()`. Besides invoking `ReadNextImpl()`, `ReadNext()` is also responsible for checking the output, making sure that the shape of each `LoDTensor` in `*out` is consistent with the corresponding entry in `dims_`.
+
+### DecoratedReader
+
+A decorated reader takes another reader (either a file reader or a decorated reader) as its 'underlying reader'. It gets data from its underlying reader, processes it (shuffling, batching or something else), and then yields the processed data. The output data of a decorated reader can be a single instance or a batch. `ShuffleReader` and `BatchReader` are both decorated readers.
+
+```cpp
+class DecoratedReader : public ReaderBase {
+ public:
+  explicit DecoratedReader(ReaderBase* reader) : ReaderBase(), reader_(reader) {
+    PADDLE_ENFORCE_NOT_NULL(reader_);
+  }
+
+  void ReInit() override { reader_->ReInit(); }
+
+  bool HasNext() const override { return reader_->HasNext(); }
+
+ protected:
+  ReaderBase* reader_;
+};
+```
+
+Both `FileReader` and `DecoratedReader` share exactly the same interface as defined in `ReaderBase`, so readers can be decorated multiple times: we can **shuffle** a reader's outputs and then **batch** the shuffled outputs. The interface consistency also allows related ops to use readers without knowing their underlying type.
+
+### MultipleReader
+
+Every `FileReader` is bound to a single file and is single-threaded. However, sometimes we need to read data from more than one file. In this case, having only `FileReader` and `DecoratedReader` is not enough.
+
+So `MultipleReader` is introduced. It is also derived from `ReaderBase`. A `MultipleReader` holds several prefetching `FileReader`s, and these readers run concurrently. Another pivotal part of a `MultipleReader` is a buffer channel. The channel collects the data yielded by all prefetching readers, so that subsequent ops or decorated readers can fetch data without being concerned about how the multiple readers are scheduled.
+
+![](images/multiple_reader.png)
+
+This graph shows how a `MultipleReader` works with three prefetching file readers and two GPUs. There is a queue of files waiting to be read. Each time a prefetching file reader becomes free (it has finished reading one file), it fetches a new file from the queue. Each prefetching file reader runs in a separate prefetch thread and dumps its outputs to the same channel.
+
+To the two subsequent decorated readers, the `MultipleReader` is **a single reader**. They don't need to be concerned about how the prefetching readers are scheduled. They only need to invoke `MultipleReader::ReadNext()` to get the next data from the buffer channel.
+
+### ReaderHolder
+
+Different readers belong to different class types. This leads to a problem: how can we drop them into `Variable`s and fetch them out by a unified method? For example, if a Variable holds a `BatchReader`, we cannot get it by the following code:
+
+```cpp
+var->Get<ReaderBase>("batch_reader");
+```
+
+We would have to write:
+
+```cpp
+var->Get<BatchReader>("batch_reader");
+```
+
+This means that every time we get a reader from a variable, we must know the reader's exact type, which is nearly impossible.
+
+To solve this problem, we introduce `ReaderHolder` as a wrapper. It acts as an empty decorator of `ReaderBase`, which hides the reader's type. With `ReaderHolder`, we are able to fetch all types of readers by `var->Get<ReaderHolder>("...")` and regard the obtained object as a reader.
+
+## Related Operators
+
+To create and invoke readers, some new ops are introduced:
+
+### CreateReaderOp
+
+Each reader has its own creation op. File readers' creation ops have no input and yield the created file reader as their output. Decorated readers' creation ops take the underlying readers as inputs and then yield new decorated readers.
+
+However, direct usage of file readers' creation ops is not recommended because a file reader can only read one file via a single thread. Using `OpenFilesOp` is a better choice.
+
+### OpenFilesOp
+
+`OpenFilesOp` is the creation op of `MultipleReader`. It takes no input but requires a list of file names as one of its attributes. The newly created `MultipleReader` then creates its own prefetching readers according to the given file names.
+
+To make sure that created prefetching readers match file formats, we need a name prefix rule to append file format tags to file names, as well as a file reader registry mechanism to map file format tags to their corresponding file readers' constructors.
+
+### HasNextOp
+
+`HasNextOp` is used to check whether the next data batch exists via the reader's `HasNext()` interface.
+
+### ResetOp
+
+`ResetOp` is used to reset a reader via its `ReInit()` interface.
+
+### ReadOp
+
+A reader is only a Variable. It cannot trigger the reading process by itself. So we add the `ReadOp` to execute it. A `ReadOp` takes a reader Variable as its input. Each time it runs, it invokes the reader's `ReadNext()` function and gets a new batch of data (or only one instance of data, if we use a file reader directly). The output data of a reader are in the form of `std::vector<LoDTensor>`, so the `ReadOp` also needs to split the vector and move the LoDTensors to their respective output Variables.
+
+## Program with Readers
+
+A `Program` holds readers as its persistable variables. These variables are created by `CreateReaderOp` or `OpenFilesOp`. These ops shall run only once, so they shall be placed in the `startup_program`. `HasNextOp`, `ResetOp` and `ReadOp` are required by the training loop, so they shall be in the `main_program`.
+
+The ops of a `startup_program` with readers would be like this:
+
+```
+multiple_reader = open_files_op(...)
+batch_reader = create_batch_reader_op(multiple_reader)
+double_buffer_reader = create_double_buffer_op(batch_reader)
+... (other initializers)
+```
+
+The forwarding ops of the corresponding `main_program` would be like this:
+
+```
+while_op {
+    has_next = has_next_op(double_buffer_reader)
+    if_else_op(has_next) {
+        batch_data = read_op(double_buffer_reader)
+        ... (subsequent training ops)
+    } else {
+        reset_op(double_buffer_reader)
+    }
+}
+```
+
+Two important considerations for these programs are as follows:
+
+1. The multiple\_reader is the batch\_reader's underlying reader, and the batch\_reader is the double\_buffer\_reader's underlying reader. `read_op`, `has_next_op` and other reader-related ops will only invoke the top-most reader. In this case, it's the double\_buffer\_reader.
+
+2. All readers exist in both `startup_program` and `main_program`, and they are persistable.
diff --git a/doc/design/executor.md b/doc/fluid/design/concepts/executor.md
similarity index 100%
rename from doc/design/executor.md
rename to doc/fluid/design/concepts/executor.md
diff --git a/doc/design/functions_operators_layers.md b/doc/fluid/design/concepts/functions_operators_layers.md
similarity index 100%
rename from doc/design/functions_operators_layers.md
rename to doc/fluid/design/concepts/functions_operators_layers.md
diff --git a/doc/fluid/design/concepts/images/multiple_reader.png b/doc/fluid/design/concepts/images/multiple_reader.png
new file mode 100644
index 0000000000000000000000000000000000000000..b22126b31db4982c13fc3a0827805e6aaf955046
Binary files /dev/null and b/doc/fluid/design/concepts/images/multiple_reader.png differ
diff --git a/doc/fluid/design/concepts/images/readers.png b/doc/fluid/design/concepts/images/readers.png
new file mode 100644
index 0000000000000000000000000000000000000000..fd59168ce16c9e2a0ef45303c28c997cfd7740be
Binary files /dev/null and b/doc/fluid/design/concepts/images/readers.png differ
diff --git a/paddle/fluid/framework/lod_tensor.md b/doc/fluid/design/concepts/lod_tensor.md
similarity index 100%
rename from paddle/fluid/framework/lod_tensor.md
rename to doc/fluid/design/concepts/lod_tensor.md
diff --git a/doc/design/program.md b/doc/fluid/design/concepts/program.md
similarity index 100%
rename from doc/design/program.md
rename to doc/fluid/design/concepts/program.md
diff --git a/doc/design/scope.md b/doc/fluid/design/concepts/scope.md
similarity index 100%
rename from doc/design/scope.md
rename to doc/fluid/design/concepts/scope.md
diff --git a/paddle/fluid/framework/tensor.md b/doc/fluid/design/concepts/tensor.md
similarity index 100%
rename from paddle/fluid/framework/tensor.md
rename to doc/fluid/design/concepts/tensor.md
diff --git a/doc/design/tensor_array.md b/doc/fluid/design/concepts/tensor_array.md
similarity index 100%
rename from doc/design/tensor_array.md
rename to doc/fluid/design/concepts/tensor_array.md
diff --git a/doc/design/var_desc.md b/doc/fluid/design/concepts/var_desc.md
similarity index 100%
rename from doc/design/var_desc.md
rename to doc/fluid/design/concepts/var_desc.md
diff --git a/paddle/fluid/framework/variable.md b/doc/fluid/design/concepts/variable.md
similarity index 100%
rename from paddle/fluid/framework/variable.md
rename to doc/fluid/design/concepts/variable.md
diff --git a/doc/design/concurrent_programming.md b/doc/fluid/design/concurrent/concurrent_programming.md
similarity index 100%
rename from doc/design/concurrent_programming.md
rename to doc/fluid/design/concurrent/concurrent_programming.md
diff --git a/doc/design/csp.md b/doc/fluid/design/concurrent/csp.md
similarity index 100%
rename from doc/design/csp.md
rename to doc/fluid/design/concurrent/csp.md
diff --git a/doc/design/parallel_do.md b/doc/fluid/design/concurrent/parallel_do.md
similarity index 100%
rename from doc/design/parallel_do.md
rename to doc/fluid/design/concurrent/parallel_do.md
diff --git a/doc/design/float16.md b/doc/fluid/design/data_type/float16.md
similarity index 100%
rename from doc/design/float16.md
rename to doc/fluid/design/data_type/float16.md
diff --git a/doc/fluid/design/dist_train/distributed_lookup_table_design.md b/doc/fluid/design/dist_train/distributed_lookup_table_design.md
new file mode 100644
index 0000000000000000000000000000000000000000..e543adf0f97cc6b47415b807d7a1ed1effec9b22
--- /dev/null
+++ b/doc/fluid/design/dist_train/distributed_lookup_table_design.md
@@ -0,0 +1,128 @@
+## Design Doc: Distributed Lookup Table Operator
+
+A lookup table operator in PaddlePaddle where the table could be too
+large to fit in the memory of a single computer.
+
+## Background
+
+A lookup table operator is widely used in deep learning for learning the
+representation, or the
+[*embedding*](http://www.cs.toronto.edu/~fritz/absps/ieee-lre.pdf), of
+symbols.
+
+### The Forward Algorithm
+
+The forward algorithm of the lookup table is a multiplication of the
+input vector x and the lookup table matrix W:
+
+$$y = x * W$$
+
+When x is a sparse vector of symbols, the above multiplication
+simplifies into looking up rows in W that correspond to symbols in x,
+denoted by W(x).  Please be aware that W could be too large to fit in
+memory, so we'd need a distributed storage service that supports the
+lookup of rows.
+
+The following figure illustrates the multiplication of x with two
+non-zero elements, or say, two symbols, and a lookup table W:
+
+![lookup table](./src/lookup_table.png)
+
+### The Backward Algorithm
+
+The backward algorithm computes W'(x) using W(x).  W'(x) is of the same
+order of size as W(x) and is much smaller than W.
+
+To optimize W given W', we can apply a simple SGD update to W:
+
+$$\Delta W = f(W') = -\lambda * W'$$
+
+or a more sophisticated algorithm whose update relies on both W' and W:
+
+$$\Delta W = f(W, W')$$
+
+The following figure illustrates the backward pass of the lookup
+operator: ![lookup table training](./src/lookup_table_training.png)
+
+## Distributed Storage Service
+
+The forward algorithm requires a distributed storage service for W.
+The backward algorithm prefers that the storage system can apply the
+optimization algorithm on W.  The following two sections describe two
+solutions -- the former doesn't require that the storage service can
+do optimization, the latter does.
+
+### Storage Service Doesn't Optimize
+
+In this design, we use highly-optimized distributed storage, e.g.,
+memcached, as the storage service, and we run the optimization
+algorithm on parameter servers of PaddlePaddle.  The following figure
+illustrates the training process.
+
+<!--
+Note: please update the following URL when updating this digraph.
+<img src='https://g.gravizo.com/svg?
+digraph G {
+  rankdir="LR";
+  subgraph cluster1 {
+  P1 [label="pserver 1"];
+  P2 [label="pserver 2"];
+  T1 [label="trainer 1"];
+  T2 [label="trainer 2"];
+  T3 [label="trainer 3"];
+  }
+  KV [label="memcached"];
+  T1 -> P1;
+  T1 -> P2;
+  T2 -> P1;
+  T2 -> P2;
+  T3 -> P1;
+  T3 -> P2;
+  P1 -> KV [color=gray, weight=0.1];
+  KV -> P1 [color=gray, weight=0.1];
+  P2 -> KV [color=gray, weight=0.1];
+  KV -> P2 [color=gray, weight=0.1];
+  KV -> T1 [color=gray, weight=0.1];
+  KV -> T2 [color=gray, weight=0.1];
+  KV -> T3 [color=gray, weight=0.1];
+}
+)
+'/>
+-->
+
+<img src='https://g.gravizo.com/svg?%20digraph%20G%20{%20rankdir=%22LR%22;%20subgraph%20cluster1%20{%20P1%20[label=%22pserver%201%22];%20P2%20[label=%22pserver%202%22];%20T1%20[label=%22trainer%201%22];%20T2%20[label=%22trainer%202%22];%20T3%20[label=%22trainer%203%22];%20}%20KV%20[label=%22memcached%22];%20T1%20-%3E%20P1;%20T1%20-%3E%20P2;%20T2%20-%3E%20P1;%20T2%20-%3E%20P2;%20T3%20-%3E%20P1;%20T3%20-%3E%20P2;%20P1%20-%3E%20KV%20[color=gray,%20weight=0.1];%20KV%20-%3E%20P1%20[color=gray,%20weight=0.1];%20P2%20-%3E%20KV%20[color=gray,%20weight=0.1];%20KV%20-%3E%20P2%20[color=gray,%20weight=0.1];%20KV%20-%3E%20T1%20[color=gray,%20weight=0.1];%20KV%20-%3E%20T2%20[color=gray,%20weight=0.1];%20KV%20-%3E%20T3%20[color=gray,%20weight=0.1];%20}'/>
+
+Each trainer runs the forward and backward passes using its local
+data:
+
+1. In the forward pass, when a trainer runs the forward algorithm of a
+   lookup operator, it retrieves W(x) from the storage service.
+1. The trainer computes W'(x) in the backward pass using W(x).
+
+During the global update process:
+
+1. Each trainer uploads its W'(x) to parameter servers.
+1. The parameter server runs the optimization algorithm, e.g., the
+   Adam optimization algorithm, which requires that
+   1. The parameter server retrieves W(x) from memcached, and
+   1. The parameter server pushes $\Delta W(x)=f(W(x), \lambda \sum_j
+      W'(x))$ to memcached, where $f$ denotes the optimization
+      algorithm.
+
+### Storage Service Does Optimize
+
+This design is very similar to the above one, except that the
+optimization algorithm $f$ runs on the storage service.
+
+- Pro: parameter servers do not retrieve W(x) from the storage
+  service, which saves half of the network communication.
+- Con: the storage service needs to be able to run the optimization
+  algorithm.
+
+## Conclusion
+
+Let us do the "storage service does not optimize" solution first, as a
+baseline at least, because it is easier to use a well-optimized
+distributed storage service like memcached.  We can do the "storage
+service does optimize" solution later or at the same time, which, if
+implemented carefully, should have better performance than the former.
diff --git a/doc/fluid/design/dist_train/src/lookup_table.png b/doc/fluid/design/dist_train/src/lookup_table.png
new file mode 100644
index 0000000000000000000000000000000000000000..72dfe3547f731d0d090338afb206b0549dff472e
Binary files /dev/null and b/doc/fluid/design/dist_train/src/lookup_table.png differ
diff --git a/doc/fluid/design/dist_train/src/lookup_table_training.png b/doc/fluid/design/dist_train/src/lookup_table_training.png
new file mode 100644
index 0000000000000000000000000000000000000000..cc7cc4aeb3b885850fe2f70f19fb84d5873bed1e
Binary files /dev/null and b/doc/fluid/design/dist_train/src/lookup_table_training.png differ
diff --git a/doc/design/ops/images/2_level_rnn.dot b/doc/fluid/design/dynamic_rnn/2_level_rnn.dot
similarity index 100%
rename from doc/design/ops/images/2_level_rnn.dot
rename to doc/fluid/design/dynamic_rnn/2_level_rnn.dot
diff --git a/doc/design/ops/images/2_level_rnn.png b/doc/fluid/design/dynamic_rnn/2_level_rnn.png
similarity index 100%
rename from doc/design/ops/images/2_level_rnn.png
rename to doc/fluid/design/dynamic_rnn/2_level_rnn.png
diff --git a/doc/design/ops/images/rnn.dot b/doc/fluid/design/dynamic_rnn/rnn.dot
similarity index 100%
rename from doc/design/ops/images/rnn.dot
rename to doc/fluid/design/dynamic_rnn/rnn.dot
diff --git a/doc/design/ops/images/rnn.jpg b/doc/fluid/design/dynamic_rnn/rnn.jpg
similarity index 100%
rename from doc/design/ops/images/rnn.jpg
rename to doc/fluid/design/dynamic_rnn/rnn.jpg
diff --git a/doc/design/ops/rnn.md b/doc/fluid/design/dynamic_rnn/rnn.md
similarity index 100%
rename from doc/design/ops/rnn.md
rename to doc/fluid/design/dynamic_rnn/rnn.md
diff --git a/doc/design/ops/images/rnn.png b/doc/fluid/design/dynamic_rnn/rnn.png
similarity index 100%
rename from doc/design/ops/images/rnn.png
rename to doc/fluid/design/dynamic_rnn/rnn.png
diff --git a/doc/design/ops/images/rnn_2level_data.dot b/doc/fluid/design/dynamic_rnn/rnn_2level_data.dot
similarity index 100%
rename from doc/design/ops/images/rnn_2level_data.dot
rename to doc/fluid/design/dynamic_rnn/rnn_2level_data.dot
diff --git a/doc/design/ops/images/rnn_2level_data.png b/doc/fluid/design/dynamic_rnn/rnn_2level_data.png
similarity index 100%
rename from doc/design/ops/images/rnn_2level_data.png
rename to doc/fluid/design/dynamic_rnn/rnn_2level_data.png
diff --git a/paddle/fluid/operators/op_documentation/rnn_design.md b/doc/fluid/design/dynamic_rnn/rnn_design.md
similarity index 100%
rename from paddle/fluid/operators/op_documentation/rnn_design.md
rename to doc/fluid/design/dynamic_rnn/rnn_design.md
diff --git a/doc/design/if_else_op.md b/doc/fluid/design/execution/if_else_op.md
similarity index 100%
rename from doc/design/if_else_op.md
rename to doc/fluid/design/execution/if_else_op.md
diff --git a/doc/design/switch.md b/doc/fluid/design/execution/switch.md
similarity index 100%
rename from doc/design/switch.md
rename to doc/fluid/design/execution/switch.md
diff --git a/doc/design/multi_language_interface/00.why_plain_c.md b/doc/fluid/design/interface/00.why_plain_c.md
similarity index 100%
rename from doc/design/multi_language_interface/00.why_plain_c.md
rename to doc/fluid/design/interface/00.why_plain_c.md
diff --git a/doc/design/multi_language_interface/01.inference_implementation.md b/doc/fluid/design/interface/01.inference_implementation.md
similarity index 100%
rename from doc/design/multi_language_interface/01.inference_implementation.md
rename to doc/fluid/design/interface/01.inference_implementation.md
diff --git a/paddle/fluid/memory/README.md b/doc/fluid/design/memory/README.md
similarity index 100%
rename from paddle/fluid/memory/README.md
rename to doc/fluid/design/memory/README.md
diff --git a/doc/design/images/control_flow_graph.png b/doc/fluid/design/memory/images/control_flow_graph.png
similarity index 100%
rename from doc/design/images/control_flow_graph.png
rename to doc/fluid/design/memory/images/control_flow_graph.png
diff --git a/doc/design/images/dataflow_equations.png b/doc/fluid/design/memory/images/dataflow_equations.png
similarity index 100%
rename from doc/design/images/dataflow_equations.png
rename to doc/fluid/design/memory/images/dataflow_equations.png
diff --git a/doc/design/images/deep_learning.png b/doc/fluid/design/memory/images/deep_learning.png
similarity index 100%
rename from doc/design/images/deep_learning.png
rename to doc/fluid/design/memory/images/deep_learning.png
diff --git a/doc/design/memory_optimization.md b/doc/fluid/design/memory/memory_optimization.md
similarity index 100%
rename from doc/design/memory_optimization.md
rename to doc/fluid/design/memory/memory_optimization.md
diff --git a/doc/design/backward.md b/doc/fluid/design/modules/backward.md
similarity index 100%
rename from doc/design/backward.md
rename to doc/fluid/design/modules/backward.md
diff --git a/paddle/fluid/operators/op_documentation/batch_norm_op.md b/doc/fluid/design/modules/batch_norm_op.md
similarity index 100%
rename from paddle/fluid/operators/op_documentation/batch_norm_op.md
rename to doc/fluid/design/modules/batch_norm_op.md
diff --git a/doc/design/evaluator.md b/doc/fluid/design/modules/evaluator.md
similarity index 100%
rename from doc/design/evaluator.md
rename to doc/fluid/design/modules/evaluator.md
diff --git a/paddle/fluid/operators/images/batch_norm_fork.dot b/doc/fluid/design/modules/images/batch_norm_fork.dot
similarity index 100%
rename from paddle/fluid/operators/images/batch_norm_fork.dot
rename to doc/fluid/design/modules/images/batch_norm_fork.dot
diff --git a/paddle/fluid/operators/images/batch_norm_fork.png b/doc/fluid/design/modules/images/batch_norm_fork.png
similarity index 100%
rename from paddle/fluid/operators/images/batch_norm_fork.png
rename to doc/fluid/design/modules/images/batch_norm_fork.png
diff --git a/paddle/fluid/operators/images/batch_norm_op_kernel.png b/doc/fluid/design/modules/images/batch_norm_op_kernel.png
similarity index 100%
rename from paddle/fluid/operators/images/batch_norm_op_kernel.png
rename to doc/fluid/design/modules/images/batch_norm_op_kernel.png
diff --git a/doc/design/images/feed_forward.png b/doc/fluid/design/modules/images/feed_forward.png
similarity index 100%
rename from doc/design/images/feed_forward.png
rename to doc/fluid/design/modules/images/feed_forward.png
diff --git a/doc/design/images/feed_forward_regularized.png b/doc/fluid/design/modules/images/feed_forward_regularized.png
similarity index 100%
rename from doc/design/images/feed_forward_regularized.png
rename to doc/fluid/design/modules/images/feed_forward_regularized.png
diff --git a/doc/design/images/l1_regularization.png b/doc/fluid/design/modules/images/l1_regularization.png
similarity index 100%
rename from doc/design/images/l1_regularization.png
rename to doc/fluid/design/modules/images/l1_regularization.png
diff --git a/doc/design/images/l2_regularization.png b/doc/fluid/design/modules/images/l2_regularization.png
similarity index 100%
rename from doc/design/images/l2_regularization.png
rename to doc/fluid/design/modules/images/l2_regularization.png
diff --git a/doc/design/images/loss_equation.png b/doc/fluid/design/modules/images/loss_equation.png
similarity index 100%
rename from doc/design/images/loss_equation.png
rename to doc/fluid/design/modules/images/loss_equation.png
diff --git a/doc/design/infer_var_type.md b/doc/fluid/design/modules/infer_var_type.md
similarity index 100%
rename from doc/design/infer_var_type.md
rename to doc/fluid/design/modules/infer_var_type.md
diff --git a/paddle/fluid/operators/op_documentation/net_op_design.md b/doc/fluid/design/modules/net_op_design.md
similarity index 100%
rename from paddle/fluid/operators/op_documentation/net_op_design.md
rename to doc/fluid/design/modules/net_op_design.md
diff --git a/doc/design/optimizer.md b/doc/fluid/design/modules/optimizer.md
similarity index 100%
rename from doc/design/optimizer.md
rename to doc/fluid/design/modules/optimizer.md
diff --git a/doc/design/prune.md b/doc/fluid/design/modules/prune.md
similarity index 100%
rename from doc/design/prune.md
rename to doc/fluid/design/modules/prune.md
diff --git a/doc/design/python_api.md b/doc/fluid/design/modules/python_api.md
similarity index 100%
rename from doc/design/python_api.md
rename to doc/fluid/design/modules/python_api.md
diff --git a/doc/design/register_grad_op.md b/doc/fluid/design/modules/register_grad_op.md
similarity index 100%
rename from doc/design/register_grad_op.md
rename to doc/fluid/design/modules/register_grad_op.md
diff --git a/doc/design/regularization.md b/doc/fluid/design/modules/regularization.md
similarity index 100%
rename from doc/design/regularization.md
rename to doc/fluid/design/modules/regularization.md
diff --git a/doc/design/selected_rows.md b/doc/fluid/design/modules/selected_rows.md
similarity index 100%
rename from doc/design/selected_rows.md
rename to doc/fluid/design/modules/selected_rows.md
diff --git a/doc/design/api.md b/doc/fluid/design/motivation/api.md
similarity index 100%
rename from doc/design/api.md
rename to doc/fluid/design/motivation/api.md
diff --git a/doc/design/fluid-compiler.graffle b/doc/fluid/design/motivation/fluid-compiler.graffle
similarity index 100%
rename from doc/design/fluid-compiler.graffle
rename to doc/fluid/design/motivation/fluid-compiler.graffle
diff --git a/doc/design/fluid-compiler.png b/doc/fluid/design/motivation/fluid-compiler.png
similarity index 100%
rename from doc/design/fluid-compiler.png
rename to doc/fluid/design/motivation/fluid-compiler.png
diff --git a/doc/design/fluid.md b/doc/fluid/design/motivation/fluid.md
similarity index 98%
rename from doc/design/fluid.md
rename to doc/fluid/design/motivation/fluid.md
index f78fa8c1914124f33b9730f918c8887ced4f8d9d..110b7d78bf12ac8328fb3a913e4386e75d63c995 100644
--- a/doc/design/fluid.md
+++ b/doc/fluid/design/motivation/fluid.md
@@ -103,7 +103,7 @@ In computability theory, a system of data-manipulation rules, such as a programm
 
 There are two ways to execute a Fluid program.  When a program is executed, it creates a protobuf message [`ProgramDesc`](https://github.com/PaddlePaddle/Paddle/blob/a91efdde6910ce92a78e3aa7157412c4c88d9ee8/paddle/framework/framework.proto#L145) that describes the process and is conceptually like an [abstract syntax tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree).
 
-There is a C++ class [`Executor`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/executor.h), which runs a `ProgramDesc`, similar to how an interpreter runs a Python program.
+There is a C++ class [`Executor`](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/executor.h), which runs a `ProgramDesc`, similar to how an interpreter runs a Python program.
 
 Fluid is moving towards the direction of a compiler, which is explain in [fluid_compiler.md](fluid_compiler.md).
 
diff --git a/doc/design/fluid_compiler.md b/doc/fluid/design/motivation/fluid_compiler.md
similarity index 100%
rename from doc/design/fluid_compiler.md
rename to doc/fluid/design/motivation/fluid_compiler.md
diff --git a/doc/design/refactorization.md b/doc/fluid/design/motivation/refactorization.md
similarity index 100%
rename from doc/design/refactorization.md
rename to doc/fluid/design/motivation/refactorization.md
diff --git a/doc/design/kernel_hint_design.md b/doc/fluid/design/muti_devices/kernel_hint_design.md
similarity index 100%
rename from doc/design/kernel_hint_design.md
rename to doc/fluid/design/muti_devices/kernel_hint_design.md
diff --git a/doc/design/kernel_selection.md b/doc/fluid/design/muti_devices/kernel_selection.md
similarity index 100%
rename from doc/design/kernel_selection.md
rename to doc/fluid/design/muti_devices/kernel_selection.md
diff --git a/doc/design/operator_kernel_type.md b/doc/fluid/design/muti_devices/operator_kernel_type.md
similarity index 100%
rename from doc/design/operator_kernel_type.md
rename to doc/fluid/design/muti_devices/operator_kernel_type.md
diff --git a/doc/design/speech/deep_speech_2.md b/doc/fluid/design/network/deep_speech_2.md
similarity index 98%
rename from doc/design/speech/deep_speech_2.md
rename to doc/fluid/design/network/deep_speech_2.md
index cfdc4d6df04344c70d3334626bd38eca997c31ff..af0c6ef36feba9e0239e7a5f81a8dc9108b2471a 100644
--- a/doc/design/speech/deep_speech_2.md
+++ b/doc/fluid/design/network/deep_speech_2.md
@@ -94,7 +94,7 @@ The classical DS2 network contains 15 layers (from bottom to top):
 - **One** CTC-loss layer
 
 <div align="center">
-<img src="image/ds2_network.png" width=350><br/>
+<img src="images/ds2_network.png" width=350><br/>
 Figure 1. Archetecture of Deep Speech 2 Network.
 </div>
 
@@ -141,7 +141,7 @@ TODO by Assignees
 ### Beam Search with CTC and LM
 
 <div align="center">
-<img src="image/beam_search.png" width=600><br/>
+<img src="images/beam_search.png" width=600><br/>
 Figure 2. Algorithm for CTC Beam Search Decoder.
 </div>
 
diff --git a/doc/design/ops/images/LOD-and-shape-changes-during-decoding.jpg b/doc/fluid/design/network/images/LOD-and-shape-changes-during-decoding.jpg
similarity index 100%
rename from doc/design/ops/images/LOD-and-shape-changes-during-decoding.jpg
rename to doc/fluid/design/network/images/LOD-and-shape-changes-during-decoding.jpg
diff --git a/doc/design/speech/image/beam_search.png b/doc/fluid/design/network/images/beam_search.png
similarity index 100%
rename from doc/design/speech/image/beam_search.png
rename to doc/fluid/design/network/images/beam_search.png
diff --git a/doc/design/speech/image/ds2_network.png b/doc/fluid/design/network/images/ds2_network.png
similarity index 100%
rename from doc/design/speech/image/ds2_network.png
rename to doc/fluid/design/network/images/ds2_network.png
diff --git a/doc/design/ops/sequence_decoder.md b/doc/fluid/design/network/sequence_decoder.md
similarity index 100%
rename from doc/design/ops/sequence_decoder.md
rename to doc/fluid/design/network/sequence_decoder.md
diff --git a/doc/design/auto_gradient_check.md b/doc/fluid/design/others/auto_gradient_check.md
similarity index 100%
rename from doc/design/auto_gradient_check.md
rename to doc/fluid/design/others/auto_gradient_check.md
diff --git a/doc/design/dcgan.png b/doc/fluid/design/others/dcgan.png
similarity index 100%
rename from doc/design/dcgan.png
rename to doc/fluid/design/others/dcgan.png
diff --git a/doc/design/gan_api.md b/doc/fluid/design/others/gan_api.md
similarity index 100%
rename from doc/design/gan_api.md
rename to doc/fluid/design/others/gan_api.md
diff --git a/doc/design/graph.md b/doc/fluid/design/others/graph.md
similarity index 100%
rename from doc/design/graph.md
rename to doc/fluid/design/others/graph.md
diff --git a/doc/design/graph_survey.md b/doc/fluid/design/others/graph_survey.md
similarity index 100%
rename from doc/design/graph_survey.md
rename to doc/fluid/design/others/graph_survey.md
diff --git a/doc/design/images/graph_construction_example.bash b/doc/fluid/design/others/images/graph_construction_example.bash
similarity index 100%
rename from doc/design/images/graph_construction_example.bash
rename to doc/fluid/design/others/images/graph_construction_example.bash
diff --git a/doc/design/images/graph_construction_example.dot b/doc/fluid/design/others/images/graph_construction_example.dot
similarity index 100%
rename from doc/design/images/graph_construction_example.dot
rename to doc/fluid/design/others/images/graph_construction_example.dot
diff --git a/doc/design/images/graph_construction_example_all.png b/doc/fluid/design/others/images/graph_construction_example_all.png
similarity index 100%
rename from doc/design/images/graph_construction_example_all.png
rename to doc/fluid/design/others/images/graph_construction_example_all.png
diff --git a/doc/design/images/graph_construction_example_forward_backward.png b/doc/fluid/design/others/images/graph_construction_example_forward_backward.png
similarity index 100%
rename from doc/design/images/graph_construction_example_forward_backward.png
rename to doc/fluid/design/others/images/graph_construction_example_forward_backward.png
diff --git a/doc/design/images/graph_construction_example_forward_only.png b/doc/fluid/design/others/images/graph_construction_example_forward_only.png
similarity index 100%
rename from doc/design/images/graph_construction_example_forward_only.png
rename to doc/fluid/design/others/images/graph_construction_example_forward_only.png
diff --git a/doc/design/parameters_in_cpp.md b/doc/fluid/design/others/parameters_in_cpp.md
similarity index 100%
rename from doc/design/parameters_in_cpp.md
rename to doc/fluid/design/others/parameters_in_cpp.md
diff --git a/doc/design/simple_op_design.md b/doc/fluid/design/others/simple_op_design.md
similarity index 100%
rename from doc/design/simple_op_design.md
rename to doc/fluid/design/others/simple_op_design.md
diff --git a/doc/design/test.dot b/doc/fluid/design/others/test.dot
similarity index 100%
rename from doc/design/test.dot
rename to doc/fluid/design/others/test.dot
diff --git a/doc/design/test.dot.png b/doc/fluid/design/others/test.dot.png
similarity index 100%
rename from doc/design/test.dot.png
rename to doc/fluid/design/others/test.dot.png
diff --git a/doc/design/ci_build_whl.png b/doc/fluid/dev/ci_build_whl.png
similarity index 100%
rename from doc/design/ci_build_whl.png
rename to doc/fluid/dev/ci_build_whl.png
diff --git a/paddle/fluid/operators/op_documentation/name_convention.md b/doc/fluid/dev/name_convention.md
similarity index 100%
rename from paddle/fluid/operators/op_documentation/name_convention.md
rename to doc/fluid/dev/name_convention.md
diff --git a/paddle/fluid/operators/op_documentation/op_markdown_format.md b/doc/fluid/dev/op_markdown_format.md
similarity index 100%
rename from paddle/fluid/operators/op_documentation/op_markdown_format.md
rename to doc/fluid/dev/op_markdown_format.md
diff --git a/doc/design/releasing_process.md b/doc/fluid/dev/releasing_process.md
similarity index 100%
rename from doc/design/releasing_process.md
rename to doc/fluid/dev/releasing_process.md
diff --git a/doc/design/support_new_device.md b/doc/fluid/dev/support_new_device.md
similarity index 100%
rename from doc/design/support_new_device.md
rename to doc/fluid/dev/support_new_device.md
diff --git a/doc/design/reader/README.md b/doc/fluid/getstarted/concepts/reader/README.md
similarity index 100%
rename from doc/design/reader/README.md
rename to doc/fluid/getstarted/concepts/reader/README.md
diff --git a/doc/design/model_format.md b/doc/fluid/getstarted/concepts/save_model/model_format.md
similarity index 100%
rename from doc/design/model_format.md
rename to doc/fluid/getstarted/concepts/save_model/model_format.md
diff --git a/doc/design/error_clip.md b/doc/fluid/howto/performance/error_clip.md
similarity index 100%
rename from doc/design/error_clip.md
rename to doc/fluid/howto/performance/error_clip.md
diff --git a/doc/design/images/profiler.png b/doc/fluid/howto/performance/images/profiler.png
similarity index 100%
rename from doc/design/images/profiler.png
rename to doc/fluid/howto/performance/images/profiler.png
diff --git a/doc/design/profiler.md b/doc/fluid/howto/performance/profiler.md
similarity index 100%
rename from doc/design/profiler.md
rename to doc/fluid/howto/performance/profiler.md
diff --git a/doc/design/images/multigpu_allreduce.graffle b/doc/fluid/howto/third_party/images/multigpu_allreduce.graffle
similarity index 100%
rename from doc/design/images/multigpu_allreduce.graffle
rename to doc/fluid/howto/third_party/images/multigpu_allreduce.graffle
diff --git a/doc/design/images/multigpu_allreduce.png b/doc/fluid/howto/third_party/images/multigpu_allreduce.png
similarity index 100%
rename from doc/design/images/multigpu_allreduce.png
rename to doc/fluid/howto/third_party/images/multigpu_allreduce.png
diff --git a/doc/design/images/multigpu_before_convert.graffle b/doc/fluid/howto/third_party/images/multigpu_before_convert.graffle
similarity index 100%
rename from doc/design/images/multigpu_before_convert.graffle
rename to doc/fluid/howto/third_party/images/multigpu_before_convert.graffle
diff --git a/doc/design/images/multigpu_before_convert.png b/doc/fluid/howto/third_party/images/multigpu_before_convert.png
similarity index 100%
rename from doc/design/images/multigpu_before_convert.png
rename to doc/fluid/howto/third_party/images/multigpu_before_convert.png
diff --git a/doc/design/mkl/mkldnn_fluid.md b/doc/fluid/howto/third_party/mkldnn_fluid.md
similarity index 100%
rename from doc/design/mkl/mkldnn_fluid.md
rename to doc/fluid/howto/third_party/mkldnn_fluid.md
diff --git a/doc/design/paddle_nccl.md b/doc/fluid/howto/third_party/paddle_nccl.md
similarity index 100%
rename from doc/design/paddle_nccl.md
rename to doc/fluid/howto/third_party/paddle_nccl.md
diff --git a/doc/v2/api/fluid/optimizer.rst b/doc/v2/api/fluid/optimizer.rst
index 9b165f870459b4f9ef2efe24f5604a3fcb96f7f3..2f820595c35c2bccd6a5c8a20c60d796c04c8e97 100644
--- a/doc/v2/api/fluid/optimizer.rst
+++ b/doc/v2/api/fluid/optimizer.rst
@@ -47,3 +47,10 @@ DecayedAdagrad
     :members:
     :noindex:
 
+Adadelta
+--------------
+
+..  autoclass:: paddle.fluid.optimizer.AdadeltaOptimizer
+    :members:
+    :noindex:
+
diff --git a/doc/design/cluster_train/README.md b/doc/v2/design/cluster_train/README.md
similarity index 100%
rename from doc/design/cluster_train/README.md
rename to doc/v2/design/cluster_train/README.md
diff --git a/doc/design/cluster_train/checkpointing.md b/doc/v2/design/cluster_train/checkpointing.md
similarity index 100%
rename from doc/design/cluster_train/checkpointing.md
rename to doc/v2/design/cluster_train/checkpointing.md
diff --git a/doc/design/cluster_train/data_dispatch.md b/doc/v2/design/cluster_train/data_dispatch.md
similarity index 100%
rename from doc/design/cluster_train/data_dispatch.md
rename to doc/v2/design/cluster_train/data_dispatch.md
diff --git a/doc/design/cluster_train/large_model_dist_train.md b/doc/v2/design/cluster_train/large_model_dist_train.md
similarity index 100%
rename from doc/design/cluster_train/large_model_dist_train.md
rename to doc/v2/design/cluster_train/large_model_dist_train.md
diff --git a/doc/design/cluster_train/master_server.md b/doc/v2/design/cluster_train/master_server.md
similarity index 100%
rename from doc/design/cluster_train/master_server.md
rename to doc/v2/design/cluster_train/master_server.md
diff --git a/doc/design/cluster_train/pserver_client.md b/doc/v2/design/cluster_train/pserver_client.md
similarity index 100%
rename from doc/design/cluster_train/pserver_client.md
rename to doc/v2/design/cluster_train/pserver_client.md
diff --git a/doc/design/cluster_train/remote_parameter_updater.md b/doc/v2/design/cluster_train/remote_parameter_updater.md
similarity index 100%
rename from doc/design/cluster_train/remote_parameter_updater.md
rename to doc/v2/design/cluster_train/remote_parameter_updater.md
diff --git a/doc/design/cluster_train/save_model.md b/doc/v2/design/cluster_train/save_model.md
similarity index 100%
rename from doc/design/cluster_train/save_model.md
rename to doc/v2/design/cluster_train/save_model.md
diff --git a/doc/design/cluster_train/src/checkpointing.png b/doc/v2/design/cluster_train/src/checkpointing.png
similarity index 100%
rename from doc/design/cluster_train/src/checkpointing.png
rename to doc/v2/design/cluster_train/src/checkpointing.png
diff --git a/doc/design/cluster_train/src/data_dispatch.png b/doc/v2/design/cluster_train/src/data_dispatch.png
similarity index 100%
rename from doc/design/cluster_train/src/data_dispatch.png
rename to doc/v2/design/cluster_train/src/data_dispatch.png
diff --git a/doc/design/cluster_train/src/dataset.graffle b/doc/v2/design/cluster_train/src/dataset.graffle
similarity index 100%
rename from doc/design/cluster_train/src/dataset.graffle
rename to doc/v2/design/cluster_train/src/dataset.graffle
diff --git a/doc/design/cluster_train/src/dataset.png b/doc/v2/design/cluster_train/src/dataset.png
similarity index 100%
rename from doc/design/cluster_train/src/dataset.png
rename to doc/v2/design/cluster_train/src/dataset.png
diff --git a/doc/design/cluster_train/src/file_storage.graffle b/doc/v2/design/cluster_train/src/file_storage.graffle
similarity index 100%
rename from doc/design/cluster_train/src/file_storage.graffle
rename to doc/v2/design/cluster_train/src/file_storage.graffle
diff --git a/doc/design/cluster_train/src/file_storage.png b/doc/v2/design/cluster_train/src/file_storage.png
similarity index 100%
rename from doc/design/cluster_train/src/file_storage.png
rename to doc/v2/design/cluster_train/src/file_storage.png
diff --git a/doc/design/cluster_train/src/init_lock.graffle b/doc/v2/design/cluster_train/src/init_lock.graffle
similarity index 100%
rename from doc/design/cluster_train/src/init_lock.graffle
rename to doc/v2/design/cluster_train/src/init_lock.graffle
diff --git a/doc/design/cluster_train/src/init_lock.png b/doc/v2/design/cluster_train/src/init_lock.png
similarity index 100%
rename from doc/design/cluster_train/src/init_lock.png
rename to doc/v2/design/cluster_train/src/init_lock.png
diff --git a/doc/design/cluster_train/src/paddle-cloud-in-data-center.png b/doc/v2/design/cluster_train/src/paddle-cloud-in-data-center.png
similarity index 100%
rename from doc/design/cluster_train/src/paddle-cloud-in-data-center.png
rename to doc/v2/design/cluster_train/src/paddle-cloud-in-data-center.png
diff --git a/doc/design/cluster_train/src/paddle-etcd.graffle b/doc/v2/design/cluster_train/src/paddle-etcd.graffle
similarity index 100%
rename from doc/design/cluster_train/src/paddle-etcd.graffle
rename to doc/v2/design/cluster_train/src/paddle-etcd.graffle
diff --git a/doc/design/cluster_train/src/paddle-etcd.png b/doc/v2/design/cluster_train/src/paddle-etcd.png
similarity index 100%
rename from doc/design/cluster_train/src/paddle-etcd.png
rename to doc/v2/design/cluster_train/src/paddle-etcd.png
diff --git a/doc/design/cluster_train/src/paddle-model-sharding.graffle b/doc/v2/design/cluster_train/src/paddle-model-sharding.graffle
similarity index 100%
rename from doc/design/cluster_train/src/paddle-model-sharding.graffle
rename to doc/v2/design/cluster_train/src/paddle-model-sharding.graffle
diff --git a/doc/design/cluster_train/src/paddle-model-sharding.png b/doc/v2/design/cluster_train/src/paddle-model-sharding.png
similarity index 100%
rename from doc/design/cluster_train/src/paddle-model-sharding.png
rename to doc/v2/design/cluster_train/src/paddle-model-sharding.png
diff --git a/doc/design/cluster_train/src/paddle-ps-0.png b/doc/v2/design/cluster_train/src/paddle-ps-0.png
similarity index 100%
rename from doc/design/cluster_train/src/paddle-ps-0.png
rename to doc/v2/design/cluster_train/src/paddle-ps-0.png
diff --git a/doc/design/cluster_train/src/paddle-ps-1.png b/doc/v2/design/cluster_train/src/paddle-ps-1.png
similarity index 100%
rename from doc/design/cluster_train/src/paddle-ps-1.png
rename to doc/v2/design/cluster_train/src/paddle-ps-1.png
diff --git a/doc/design/cluster_train/src/paddle-ps.graffle b/doc/v2/design/cluster_train/src/paddle-ps.graffle
similarity index 100%
rename from doc/design/cluster_train/src/paddle-ps.graffle
rename to doc/v2/design/cluster_train/src/paddle-ps.graffle
diff --git a/doc/design/cluster_train/src/paddle-task-queues.graffle b/doc/v2/design/cluster_train/src/paddle-task-queues.graffle
similarity index 100%
rename from doc/design/cluster_train/src/paddle-task-queues.graffle
rename to doc/v2/design/cluster_train/src/paddle-task-queues.graffle
diff --git a/doc/design/cluster_train/src/paddle-task-queues.png b/doc/v2/design/cluster_train/src/paddle-task-queues.png
similarity index 100%
rename from doc/design/cluster_train/src/paddle-task-queues.png
rename to doc/v2/design/cluster_train/src/paddle-task-queues.png
diff --git a/doc/design/cluster_train/src/paddle-task-states.graffle b/doc/v2/design/cluster_train/src/paddle-task-states.graffle
similarity index 100%
rename from doc/design/cluster_train/src/paddle-task-states.graffle
rename to doc/v2/design/cluster_train/src/paddle-task-states.graffle
diff --git a/doc/design/cluster_train/src/paddle-task-states.png b/doc/v2/design/cluster_train/src/paddle-task-states.png
similarity index 100%
rename from doc/design/cluster_train/src/paddle-task-states.png
rename to doc/v2/design/cluster_train/src/paddle-task-states.png
diff --git a/doc/design/cluster_train/src/pserver_init.graffle b/doc/v2/design/cluster_train/src/pserver_init.graffle
similarity index 100%
rename from doc/design/cluster_train/src/pserver_init.graffle
rename to doc/v2/design/cluster_train/src/pserver_init.graffle
diff --git a/doc/design/cluster_train/src/pserver_init.png b/doc/v2/design/cluster_train/src/pserver_init.png
similarity index 100%
rename from doc/design/cluster_train/src/pserver_init.png
rename to doc/v2/design/cluster_train/src/pserver_init.png
diff --git a/doc/design/cluster_train/src/submit-job.graffle b/doc/v2/design/cluster_train/src/submit-job.graffle
similarity index 100%
rename from doc/design/cluster_train/src/submit-job.graffle
rename to doc/v2/design/cluster_train/src/submit-job.graffle
diff --git a/doc/design/cluster_train/src/submit-job.png b/doc/v2/design/cluster_train/src/submit-job.png
similarity index 100%
rename from doc/design/cluster_train/src/submit-job.png
rename to doc/v2/design/cluster_train/src/submit-job.png
diff --git a/doc/design/cluster_train/src/trainer.graffle b/doc/v2/design/cluster_train/src/trainer.graffle
similarity index 100%
rename from doc/design/cluster_train/src/trainer.graffle
rename to doc/v2/design/cluster_train/src/trainer.graffle
diff --git a/doc/design/cluster_train/src/trainer.png b/doc/v2/design/cluster_train/src/trainer.png
similarity index 100%
rename from doc/design/cluster_train/src/trainer.png
rename to doc/v2/design/cluster_train/src/trainer.png
diff --git a/doc/design/cluster_train/submit-job.md b/doc/v2/design/cluster_train/submit-job.md
similarity index 100%
rename from doc/design/cluster_train/submit-job.md
rename to doc/v2/design/cluster_train/submit-job.md
diff --git a/doc/design/mkl/image/engine.png b/doc/v2/design/mkl/image/engine.png
similarity index 100%
rename from doc/design/mkl/image/engine.png
rename to doc/v2/design/mkl/image/engine.png
diff --git a/doc/design/mkl/image/gradients.png b/doc/v2/design/mkl/image/gradients.png
similarity index 100%
rename from doc/design/mkl/image/gradients.png
rename to doc/v2/design/mkl/image/gradients.png
diff --git a/doc/design/mkl/image/layers.png b/doc/v2/design/mkl/image/layers.png
similarity index 100%
rename from doc/design/mkl/image/layers.png
rename to doc/v2/design/mkl/image/layers.png
diff --git a/doc/design/mkl/image/matrix.png b/doc/v2/design/mkl/image/matrix.png
similarity index 100%
rename from doc/design/mkl/image/matrix.png
rename to doc/v2/design/mkl/image/matrix.png
diff --git a/doc/design/mkl/image/overview.png b/doc/v2/design/mkl/image/overview.png
similarity index 100%
rename from doc/design/mkl/image/overview.png
rename to doc/v2/design/mkl/image/overview.png
diff --git a/doc/design/mkl/mkl_packed.md b/doc/v2/design/mkl/mkl_packed.md
similarity index 100%
rename from doc/design/mkl/mkl_packed.md
rename to doc/v2/design/mkl/mkl_packed.md
diff --git a/doc/design/mkl/mkldnn.md b/doc/v2/design/mkl/mkldnn.md
similarity index 100%
rename from doc/design/mkl/mkldnn.md
rename to doc/v2/design/mkl/mkldnn.md
diff --git a/doc/v2/howto/cluster/index_en.rst b/doc/v2/howto/cluster/index_en.rst
index 2640a09dcc904619bc97c9bd3f3d81a9dc307663..c965d30d54e71339cf10d4b05f25e740c81adbf9 100644
--- a/doc/v2/howto/cluster/index_en.rst
+++ b/doc/v2/howto/cluster/index_en.rst
@@ -1,8 +1,7 @@
 Distributed Training
 ====================
 
-In this section, we'll explain how to run distributed training jobs with PaddlePaddle on different types of clusters. The diagram below shows the main architecture of a distributed trainning job:
-
+The effectiveness of a deep learning model is often directly related to the scale of its data: the same model generally achieves better results when trained on a larger dataset. However, once the amount of data grows beyond a certain point, it no longer fits on a single computer. At this point, using multiple computers for distributed training is a natural solution. In distributed training, the training data is divided into multiple shards, and each machine participating in the training reads its own shard and collaboratively updates the parameters of the overall model.
 .. image:: src/ps_en.png
    :width: 500
 
@@ -10,13 +9,27 @@ In this section, we'll explain how to run distributed training jobs with PaddleP
 - Trainer: each trainer reads the data shard, and train the neural network. Then the trainer will upload calculated "gradients" to parameter servers, and wait for parameters to be optimized on the parameter server side. When that finishes, the trainer download optimized parameters and continues its training.
 - Parameter server: every parameter server stores part of the whole neural network model data. They will do optimization calculations when gradients are uploaded from trainers, and then send updated parameters to trainers.
 
-PaddlePaddle can support both synchronize stochastic gradient descent (SGD) and asynchronous SGD.
+Synchronous stochastic gradient descent training of a neural network is achieved through the cooperation of trainers and parameter servers.
+
+PaddlePaddle supports both synchronous stochastic gradient descent (SGD) and asynchronous SGD.
 
-When training with synchronize SGD, PaddlePaddle uses an internal "synchronize barrier" which makes gradients update and parameter download in strict order. On the other hand, asynchronous SGD won't wait for all trainers to finish upload at a single step, this will increase the parallelism of distributed training: parameter servers do not depend on each other, they'll do parameter optimization concurrently. Parameter servers will not wait for trainers, so trainers will also do their work concurrently. But asynchronous SGD will introduce more randomness and noises in the gradient.
+Before starting cluster training, you need to complete some preparations, such as configuring the cluster and installing PaddlePaddle. To learn how to set up the basic environment for distributed training, see the link below:
 
 ..  toctree::
   :maxdepth: 1
 
   preparations_en.md
+
+Cluster training has a large number of configurable parameters, such as the number of machines used and the communication ports. To learn how to configure the distributed training process by setting these startup parameters, see the link below:
+
+..  toctree::
+  :maxdepth: 1
+
   cmd_argument_en.md
+
+PaddlePaddle is compatible with a variety of clusters, each with its own advantages. To learn how to run PaddlePaddle on different types of clusters, see the link below:
+
+..  toctree::
+  :maxdepth: 1
+
   multi_cluster/index_en.rst
diff --git a/paddle/capi/CMakeLists.txt b/paddle/capi/CMakeLists.txt
index ebb083c5a477d5be91ef14be74dd9de349d07931..e06e9a2b363d1ffc6876b98bcb7304b0a54dbcaa 100644
--- a/paddle/capi/CMakeLists.txt
+++ b/paddle/capi/CMakeLists.txt
@@ -36,7 +36,7 @@ target_include_directories(paddle_capi PUBLIC ${CMAKE_CURRENT_BINARY_DIR})
 add_style_check_target(paddle_capi ${CAPI_SOURCES} ${CAPI_HEADER}
   ${CAPI_PRIVATE_HEADER})
 
-add_dependencies(paddle_capi paddle_proto)
+add_dependencies(paddle_capi paddle_proto paddle_gserver)
 
 # TODO: paddle_capi_whole will be removed.
 set(PADDLE_CAPI_LAYERS_LIBS
diff --git a/paddle/fluid/framework/CMakeLists.txt b/paddle/fluid/framework/CMakeLists.txt
index 15e5574ecfd406b87db8370948352b7e736937ea..a4ea74a6d2fbc29dc33a6b57ee453f49ed36c7fa 100644
--- a/paddle/fluid/framework/CMakeLists.txt
+++ b/paddle/fluid/framework/CMakeLists.txt
@@ -103,4 +103,5 @@ cc_test(cow_ptr_tests SRCS details/cow_ptr_test.cc)
 cc_test(channel_test SRCS channel_test.cc)
 cc_test(tuple_test SRCS tuple_test.cc )
 cc_test(concurrency_test SRCS concurrency_test.cc DEPS go_op channel_close_op channel_create_op
-        channel_send_op channel_recv_op sum_op elementwise_add_op executor proto_desc)
+        channel_send_op channel_recv_op sum_op select_op elementwise_add_op compare_op
+        conditional_block_op while_op assign_op print_op executor proto_desc)
diff --git a/paddle/fluid/framework/channel.h b/paddle/fluid/framework/channel.h
index 51e2b03f9cb9abf6b3effe4035d4eec2ba4f9fbf..adfaba26ace78f547161ad4029a741f3ca8a6764 100644
--- a/paddle/fluid/framework/channel.h
+++ b/paddle/fluid/framework/channel.h
@@ -162,24 +162,12 @@ class ChannelHolder {
     }
   }
 
-  template <typename T>
   void RemoveFromSendQ(const void* referrer) {
-    if (IsInitialized()) {
-      Channel<T>* channel = static_cast<Channel<T>*>(holder_->Ptr());
-      if (channel != nullptr) {
-        channel->RemoveFromSendQ(referrer);
-      }
-    }
+    if (IsInitialized()) holder_->RemoveFromSendQ(referrer);
   }
 
-  template <typename T>
   void RemoveFromReceiveQ(const void* referrer) {
-    if (IsInitialized()) {
-      Channel<T>* channel = static_cast<Channel<T>*>(holder_->Ptr());
-      if (channel != nullptr) {
-        channel->RemoveFromReceiveQ(referrer);
-      }
-    }
+    if (IsInitialized()) holder_->RemoveFromReceiveQ(referrer);
   }
 
   inline bool IsInitialized() const { return holder_ != nullptr; }
@@ -201,6 +189,8 @@ class ChannelHolder {
     virtual bool IsClosed() = 0;
     virtual bool CanSend() = 0;
     virtual bool CanReceive() = 0;
+    virtual void RemoveFromSendQ(const void* referrer) = 0;
+    virtual void RemoveFromReceiveQ(const void* referrer) = 0;
     virtual void Close() = 0;
     virtual void Lock() = 0;
     virtual void Unlock() = 0;
@@ -238,6 +228,18 @@ class ChannelHolder {
       return false;
     }
 
+    virtual void RemoveFromSendQ(const void* referrer) {
+      if (channel_) {
+        channel_->RemoveFromSendQ(referrer);
+      }
+    }
+
+    virtual void RemoveFromReceiveQ(const void* referrer) {
+      if (channel_) {
+        channel_->RemoveFromReceiveQ(referrer);
+      }
+    }
+
     virtual void Close() {
       if (channel_) channel_->Close();
     }
diff --git a/paddle/fluid/framework/channel_impl.h b/paddle/fluid/framework/channel_impl.h
index c194c03e264cccfa9ad755ea24bc6372e82bfb00..457abbf373d4549229e8fd8bd6b2087cc6b8f5c8 100644
--- a/paddle/fluid/framework/channel_impl.h
+++ b/paddle/fluid/framework/channel_impl.h
@@ -151,7 +151,7 @@ bool ChannelImpl<T>::Send(T *item) {
       // We do not care about notifying other
       // because they would have been notified
       // by the executed select case.
-      return Send(item);
+      return send_return(Send(item));
 
     // Wake up the blocked process and unlock
     m->Notify();
@@ -214,7 +214,7 @@ bool ChannelImpl<T>::Receive(T *item) {
       // We do not care about notifying other
       // because they would have been notified
       // by the executed select case.
-      return Receive(item);
+      return recv_return(Receive(item));
 
     // Wake up the blocked process and unlock
     m->Notify();
@@ -331,7 +331,6 @@ void ChannelImpl<T>::RemoveFromSendQ(const void *referrer) {
 
     if (sendMsg->referrer == referrer) {
       it = sendq.erase(it);
-      send_ctr--;
     } else {
       ++it;
     }
@@ -347,7 +346,6 @@ void ChannelImpl<T>::RemoveFromReceiveQ(const void *referrer) {
 
     if (recvMsg->referrer == referrer) {
       it = recvq.erase(it);
-      recv_ctr--;
     } else {
       ++it;
     }
diff --git a/paddle/fluid/framework/channel_test.cc b/paddle/fluid/framework/channel_test.cc
index edfb41c72489113d9803c2957baed1ce44f8296d..73be5cdbe2a1f5994ecee4c415e83962f50532fe 100644
--- a/paddle/fluid/framework/channel_test.cc
+++ b/paddle/fluid/framework/channel_test.cc
@@ -871,3 +871,67 @@ TEST(ChannelHolder, ChannelHolderDestroyUnblocksSendersTest) {
   ch->Reset<int>(0);
   ChannelHolderDestroyUnblockSenders(ch, false);
 }
+
+// This tests that a ChannelHolder can be closed many times.
+void ChannelHolderManyTimesClose(ChannelHolder *ch) {
+  const int num_threads = 15;
+  std::thread t[num_threads];
+  bool thread_ended[num_threads];
+
+  // Launches threads that try to send data to channel.
+  for (size_t i = 0; i < num_threads / 3; i++) {
+    thread_ended[i] = false;
+    t[i] = std::thread(
+        [&](bool *ended) {
+          int data = 10;
+          ch->Send(&data);
+          *ended = true;
+        },
+        &thread_ended[i]);
+  }
+
+  // Launches threads that try to receive data from channel.
+  for (size_t i = num_threads / 3; i < 2 * num_threads / 3; i++) {
+    thread_ended[i] = false;
+    t[i] = std::thread(
+        [&](bool *p) {
+          int data;
+          if (ch->Receive(&data)) {
+            EXPECT_EQ(data, 10);
+          }
+          *p = true;
+        },
+        &thread_ended[i]);
+  }
+
+  // Launches threads that try to close the channel.
+  for (size_t i = 2 * num_threads / 3; i < num_threads; i++) {
+    thread_ended[i] = false;
+    t[i] = std::thread(
+        [&](bool *p) {
+          if (!ch->IsClosed()) {
+            ch->close();
+          }
+          *p = true;
+        },
+        &thread_ended[i]);
+  }
+
+  std::this_thread::sleep_for(std::chrono::milliseconds(100));  // wait
+
+  // Verify that all threads are unblocked
+  for (size_t i = 0; i < num_threads; i++) {
+    EXPECT_EQ(thread_ended[i], true);
+  }
+  EXPECT_TRUE(ch->IsClosed());
+  // delete the channel
+  delete ch;
+  for (size_t i = 0; i < num_threads; i++) t[i].join();
+}
+
+TEST(ChannelHolder, ChannelHolderManyTimesCloseTest) {
+  // Check for Buffered Channel
+  ChannelHolder *ch = new ChannelHolder();
+  ch->Reset<int>(10);
+  ChannelHolderManyTimesClose(ch);
+}
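The property the test above exercises (closing repeatedly is harmless, and every blocked thread wakes up once the channel is closed) can be sketched with a toy channel. This is a minimal illustration under stated assumptions: `ToyChannel` and `CloseManyTimesUnblocksReceivers` are hypothetical names, and the code does not reflect how Paddle's `ChannelImpl` is actually implemented.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical toy buffered channel; NOT Paddle's ChannelImpl.
class ToyChannel {
 public:
  explicit ToyChannel(std::size_t capacity) : capacity_(capacity) {}

  // Blocks while the buffer is full; returns false once the channel is closed.
  bool Send(int value) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return closed_ || buf_.size() < capacity_; });
    if (closed_) return false;
    buf_.push_back(value);
    cv_.notify_all();
    return true;
  }

  // Blocks while the buffer is empty; returns false when closed and drained.
  bool Receive(int *out) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return closed_ || !buf_.empty(); });
    if (buf_.empty()) return false;
    *out = buf_.front();
    buf_.pop_front();
    cv_.notify_all();
    return true;
  }

  // Idempotent: a second Close() only repeats the wake-up.
  void Close() {
    std::lock_guard<std::mutex> lock(mu_);
    closed_ = true;
    cv_.notify_all();
  }

  bool IsClosed() {
    std::lock_guard<std::mutex> lock(mu_);
    return closed_;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<int> buf_;
  std::size_t capacity_;
  bool closed_ = false;
};

// Starts receivers on an empty channel, closes it from several threads, and
// reports whether every receiver was unblocked with a "closed" result.
bool CloseManyTimesUnblocksReceivers() {
  ToyChannel ch(10);
  std::vector<int> got(5, 1);  // 1 = Receive succeeded, 0 = channel closed
  std::vector<std::thread> threads;
  for (int i = 0; i < 5; ++i) {
    threads.emplace_back([&ch, &got, i] {
      int data = 0;
      got[i] = ch.Receive(&data) ? 1 : 0;
    });
  }
  for (int i = 0; i < 5; ++i) {
    threads.emplace_back([&ch] { ch.Close(); });
  }
  for (auto &t : threads) t.join();  // join replaces the test's fixed sleep
  bool all_unblocked = true;
  for (int g : got) all_unblocked = all_unblocked && (g == 0);
  return all_unblocked && ch.IsClosed();
}
```

Joining the threads instead of sleeping for a fixed interval keeps the sketch deterministic; calling `CloseManyTimesUnblocksReceivers()` from a `main` is enough to try it.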
diff --git a/paddle/fluid/framework/concurrency_test.cc b/paddle/fluid/framework/concurrency_test.cc
index 5770b0a5a18659e615e80a7c48113d8b543b69ec..25152054eb8452a9667bd65b4441665476c1d46d 100644
--- a/paddle/fluid/framework/concurrency_test.cc
+++ b/paddle/fluid/framework/concurrency_test.cc
@@ -19,7 +19,6 @@ limitations under the License. */
 #include "paddle/fluid/framework/channel.h"
 #include "paddle/fluid/framework/executor.h"
 #include "paddle/fluid/framework/op_registry.h"
-#include "paddle/fluid/framework/program_desc.h"
 
 USE_NO_KERNEL_OP(go);
 USE_NO_KERNEL_OP(channel_close);
@@ -27,6 +26,12 @@ USE_NO_KERNEL_OP(channel_create);
 USE_NO_KERNEL_OP(channel_recv);
 USE_NO_KERNEL_OP(channel_send);
 USE_NO_KERNEL_OP(elementwise_add);
+USE_NO_KERNEL_OP(select);
+USE_NO_KERNEL_OP(conditional_block);
+USE_NO_KERNEL_OP(equal);
+USE_NO_KERNEL_OP(assign);
+USE_NO_KERNEL_OP(while);
+USE_NO_KERNEL_OP(print);
 
 namespace f = paddle::framework;
 namespace p = paddle::platform;
@@ -35,27 +40,15 @@ namespace paddle {
 namespace framework {
 
 template <typename T>
-void CreateIntVariable(Scope &scope, p::CPUPlace &place, std::string name,
-                       T value) {
-  // Create LoDTensor<int> of dim [1,1]
+LoDTensor *CreateVariable(Scope &scope, p::CPUPlace &place, std::string name,
+                          T value) {
+  // Create LoDTensor<int> of dim [1]
   auto var = scope.Var(name);
   auto tensor = var->GetMutable<LoDTensor>();
-  tensor->Resize({1, 1});
+  tensor->Resize({1});
   T *expect = tensor->mutable_data<T>(place);
   expect[0] = value;
-}
-
-void InitTensorsInScope(Scope &scope, p::CPUPlace &place) {
-  p::CPUDeviceContext ctx(place);
-
-  // Create channel variable
-  scope.Var("Channel");
-
-  // Create Variables, x0 will be put into channel,
-  // result will be pulled from channel
-  CreateIntVariable(scope, place, "Status", false);
-  CreateIntVariable(scope, place, "x0", 99);
-  CreateIntVariable(scope, place, "result", 0);
+  return tensor;
 }
 
 void AddOp(const std::string &type, const VariableNameMap &inputs,
@@ -73,12 +66,116 @@ void AddOp(const std::string &type, const VariableNameMap &inputs,
   op->SetAttrMap(attrs);
 }
 
+void AddCase(ProgramDesc *program, Scope *scope, p::CPUPlace *place,
+             BlockDesc *casesBlock, int caseId, int caseType,
+             std::string caseChannel, std::string caseVarName,
+             std::function<void(BlockDesc *, Scope *)> func) {
+  std::string caseCondName = std::string("caseCond") + std::to_string(caseId);
+  std::string caseCondXVarName =
+      std::string("caseCondX") + std::to_string(caseId);
+
+  BlockDesc *caseBlock = program->AppendBlock(*casesBlock);
+  func(caseBlock, scope);
+
+  CreateVariable(*scope, *place, caseCondName, false);
+  CreateVariable(*scope, *place, caseCondXVarName, caseId);
+  CreateVariable(*scope, *place, caseVarName, caseId);
+
+  scope->Var("step_scope");
+
+  AddOp("equal", {{"X", {caseCondXVarName}}, {"Y", {"caseToExecute"}}},
+        {{"Out", {caseCondName}}}, {}, casesBlock);
+
+  AddOp("conditional_block", {{"X", {caseCondName}}, {"Params", {}}},
+        {{"Out", {}}, {"Scope", {"step_scope"}}},
+        {{"sub_block", caseBlock}, {"is_scalar_condition", true}}, casesBlock);
+}
+
+void AddFibonacciSelect(Scope *scope, p::CPUPlace *place, ProgramDesc *program,
+                        BlockDesc *parentBlock, std::string dataChanName,
+                        std::string quitChanName) {
+  BlockDesc *whileBlock = program->AppendBlock(*parentBlock);
+
+  CreateVariable(*scope, *place, "whileExitCond", true);
+  CreateVariable(*scope, *place, "caseToExecute", -1);
+  CreateVariable(*scope, *place, "case1var", 0);
+
+  CreateVariable(*scope, *place, "xtemp", 0);
+
+  // TODO(thuan): Need to create fibXToSend, since channel send moves the
+  // actual data, which makes it no longer accessible for the fib
+  // calculation.
+  // TODO(abhinav): Change channel send to do a copy instead of a move!
+  CreateVariable(*scope, *place, "fibXToSend", 0);
+
+  CreateVariable(*scope, *place, "fibX", 0);
+  CreateVariable(*scope, *place, "fibY", 1);
+  CreateVariable(*scope, *place, "quitVar", 0);
+
+  BlockDesc *casesBlock = program->AppendBlock(*whileBlock);
+  std::function<void(BlockDesc * caseBlock)> f = [](BlockDesc *caseBlock) {};
+
+  // TODO(thuan): Remove this once we change channel send to do a copy instead
+  // of move
+  AddOp("assign", {{"X", {"fibX"}}}, {{"Out", {"fibXToSend"}}}, {}, whileBlock);
+
+  // Case 0: Send to dataChanName
+  std::function<void(BlockDesc * caseBlock, Scope * scope)> case0Func = [&](
+      BlockDesc *caseBlock, Scope *scope) {
+    AddOp("assign", {{"X", {"fibX"}}}, {{"Out", {"xtemp"}}}, {}, caseBlock);
+    AddOp("assign", {{"X", {"fibY"}}}, {{"Out", {"fibX"}}}, {}, caseBlock);
+    AddOp("elementwise_add", {{"X", {"xtemp"}}, {"Y", {"fibY"}}},
+          {{"Out", {"fibY"}}}, {}, caseBlock);
+  };
+  AddCase(program, scope, place, casesBlock, 0, 1, dataChanName, "fibXToSend",
+          case0Func);
+  std::string case0Config =
+      std::string("0,1,") + dataChanName + std::string(",fibXToSend");
+
+  // Case 1: Receive from quitChanName
+  std::function<void(BlockDesc * caseBlock, Scope * scope)> case2Func = [&](
+      BlockDesc *caseBlock, Scope *scope) {
+    // Exit the while loop after we receive from quit channel.
+    // We assign a false to "whileExitCond" variable, which will
+    // break out of while_op loop
+    CreateVariable(*scope, *place, "whileFalse", false);
+    AddOp("assign", {{"X", {"whileFalse"}}}, {{"Out", {"whileExitCond"}}}, {},
+          caseBlock);
+  };
+  AddCase(program, scope, place, casesBlock, 1, 2, quitChanName, "quitVar",
+          case2Func);
+  std::string case1Config =
+      std::string("1,2,") + quitChanName + std::string(",quitVar");
+
+  // Select block
+  AddOp("select", {{"X", {dataChanName, quitChanName}},
+                   {"case_to_execute", {"caseToExecute"}}},
+        {}, {{"sub_block", casesBlock},
+             {"cases", std::vector<std::string>{case0Config, case1Config}}},
+        whileBlock);
+
+  scope->Var("stepScopes");
+  AddOp("while",
+        {{"X", {dataChanName, quitChanName}}, {"Condition", {"whileExitCond"}}},
+        {{"Out", {}}, {"StepScopes", {"stepScopes"}}},
+        {{"sub_block", whileBlock}}, parentBlock);
+}
+
 TEST(Concurrency, Go_Op) {
   Scope scope;
   p::CPUPlace place;
 
   // Initialize scope variables
-  InitTensorsInScope(scope, place);
+  p::CPUDeviceContext ctx(place);
+
+  // Create channel variable
+  scope.Var("Channel");
+
+  // Create Variables, x0 will be put into channel,
+  // result will be pulled from channel
+  CreateVariable(scope, place, "Status", false);
+  CreateVariable(scope, place, "x0", 99);
+  CreateVariable(scope, place, "result", 0);
 
   framework::Executor executor(place);
   ProgramDesc program;
@@ -118,5 +215,78 @@ TEST(Concurrency, Go_Op) {
   auto *finalData = tensor.data<int>();
   EXPECT_EQ(finalData[0], 99);
 }
+
+/**
+ * This test implements the fibonacci function using go_op and select_op
+ */
+TEST(Concurrency, Select) {
+  Scope scope;
+  p::CPUPlace place;
+
+  // Initialize scope variables
+  p::CPUDeviceContext ctx(place);
+
+  CreateVariable(scope, place, "Status", false);
+  CreateVariable(scope, place, "result", 0);
+  CreateVariable(scope, place, "currentXFib", 0);
+
+  framework::Executor executor(place);
+  ProgramDesc program;
+  BlockDesc *block = program.MutableBlock(0);
+
+  // Create channel OP
+  std::string dataChanName = "Channel";
+  scope.Var(dataChanName);
+  AddOp("channel_create", {}, {{"Out", {dataChanName}}},
+        {{"capacity", 0}, {"data_type", f::proto::VarType::LOD_TENSOR}}, block);
+
+  std::string quitChanName = "Quit";
+  scope.Var(quitChanName);
+  AddOp("channel_create", {}, {{"Out", {quitChanName}}},
+        {{"capacity", 0}, {"data_type", f::proto::VarType::LOD_TENSOR}}, block);
+
+  // Create Go Op routine, which loops 10 times over fibonacci sequence
+  CreateVariable(scope, place, "xReceiveVar", 0);
+
+  BlockDesc *goOpBlock = program.AppendBlock(program.Block(0));
+  for (int i = 0; i < 10; ++i) {
+    AddOp("channel_recv", {{"Channel", {dataChanName}}},
+          {{"Status", {"Status"}}, {"Out", {"currentXFib"}}}, {}, goOpBlock);
+    AddOp("print", {{"In", {"currentXFib"}}}, {{"Out", {"currentXFib"}}},
+          {{"first_n", 100},
+           {"summarize", -1},
+           {"print_tensor_name", false},
+           {"print_tensor_type", true},
+           {"print_tensor_shape", false},
+           {"print_tensor_lod", false},
+           {"print_phase", std::string("FORWARD")},
+           {"message", std::string("X: ")}},
+          goOpBlock);
+  }
+
+  CreateVariable(scope, place, "quitSignal", 0);
+  AddOp("channel_send", {{"Channel", {quitChanName}}, {"X", {"quitSignal"}}},
+        {{"Status", {"Status"}}}, {}, goOpBlock);
+
+  // Create Go Op
+  AddOp("go", {{"X", {dataChanName, quitChanName}}}, {},
+        {{"sub_block", goOpBlock}}, block);
+
+  AddFibonacciSelect(&scope, &place, &program, block, dataChanName,
+                     quitChanName);
+
+  // Create Channel Close Op
+  AddOp("channel_close", {{"Channel", {dataChanName}}}, {}, {}, block);
+  AddOp("channel_close", {{"Channel", {quitChanName}}}, {}, {}, block);
+
+  executor.Run(program, &scope, 0, true, true);
+
+  // After executor.Run, "currentXFib" holds the last value received by the
+  // go block: 34, the 10th value sent (0, 1, 1, 2, 3, 5, 8, 13, 21, 34).
+  const LoDTensor &tensor = (scope.FindVar("currentXFib"))->Get<LoDTensor>();
+  auto *finalData = tensor.data<int>();
+  EXPECT_EQ(finalData[0], 34);
+}
+
 }  // namespace framework
 }  // namespace paddle
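The value 34 expected by the Select test follows from the three ops in the send case (`assign`, `assign`, `elementwise_add`). The sketch below replays that update rule in plain C++ to show which values would be sent; `FibonacciSends` is a hypothetical helper written for illustration, and it models only the arithmetic, not the channel or `select` machinery.

```cpp
#include <vector>

// Hypothetical helper: replays the variable updates of the send case.
// fib_x and fib_y start at 0 and 1, matching "fibX" and "fibY" in the test.
std::vector<int> FibonacciSends(int count) {
  std::vector<int> sent;
  int fib_x = 0;
  int fib_y = 1;
  for (int i = 0; i < count; ++i) {
    sent.push_back(fib_x);   // fibX is copied to fibXToSend and sent
    int xtemp = fib_x;       // assign: fibX -> xtemp
    fib_x = fib_y;           // assign: fibY -> fibX
    fib_y = xtemp + fib_y;   // elementwise_add: xtemp + fibY -> fibY
  }
  return sent;
}
```

With `count` set to 10 the sequence is 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, so the last value received by the go block is 34.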
diff --git a/paddle/fluid/framework/executor.cc b/paddle/fluid/framework/executor.cc
index 5cae38b2a857b2037f0e5ae4da50d1591da0c11a..7155d5ef2febc20aaa684c04a7a59f781857c9e5 100644
--- a/paddle/fluid/framework/executor.cc
+++ b/paddle/fluid/framework/executor.cc
@@ -25,6 +25,7 @@ limitations under the License. */
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/framework/reader.h"
 #include "paddle/fluid/platform/place.h"
+#include "paddle/fluid/platform/profiler.h"
 
 DECLARE_bool(benchmark);
 DEFINE_bool(check_nan_inf, false,
@@ -33,12 +34,17 @@ DEFINE_bool(check_nan_inf, false,
 
 namespace paddle {
 namespace framework {
+namespace {
+// Block ids start from 0. This id (-1) represents the whole program, i.e.
+// the code block wrapping block 0.
+int kProgramId = -1;
+}  // namespace
 
 struct ExecutorPrepareContext {
   ExecutorPrepareContext(const framework::ProgramDesc& prog, size_t block_id)
       : prog_(prog), block_id_(block_id) {}
 
-  framework::ProgramDesc prog_;
+  const framework::ProgramDesc& prog_;
   size_t block_id_;
   std::vector<std::unique_ptr<OperatorBase>> ops_;
 };
@@ -94,6 +100,7 @@ static void CheckTensorNANOrInf(const std::string& name,
 
 void Executor::Run(const ProgramDesc& pdesc, Scope* scope, int block_id,
                    bool create_local_scope, bool create_vars) {
+  platform::RecordBlock b(block_id);
   auto* ctx = Prepare(pdesc, block_id);
   RunPreparedContext(ctx, scope, create_local_scope, create_vars);
   delete ctx;
@@ -106,10 +113,11 @@ void Executor::Run(const ProgramDesc& pdesc, Scope* scope, int block_id,
 // and feed_holder_name. Raise exception when any mismatch is found.
 // Return true if the block has feed operators and holder of matching info.
 static bool has_feed_operators(
-    BlockDesc* block, std::map<std::string, const LoDTensor*>& feed_targets,
+    const BlockDesc& block,
+    std::map<std::string, const LoDTensor*>& feed_targets,
     const std::string& feed_holder_name) {
   size_t feed_count = 0;
-  for (auto* op : block->AllOps()) {
+  for (auto* op : block.AllOps()) {
     if (op->Type() == kFeedOpType) {
       feed_count++;
       PADDLE_ENFORCE_EQ(op->Input("X")[0], feed_holder_name,
@@ -128,7 +136,7 @@ static bool has_feed_operators(
         "The number of feed operators should match 'feed_targets'");
 
     // When feed operator are present, so should be feed_holder
-    auto var = block->FindVar(feed_holder_name);
+    auto var = block.FindVar(feed_holder_name);
     PADDLE_ENFORCE_NOT_NULL(var, "Block should already have a '%s' variable",
                             feed_holder_name);
     PADDLE_ENFORCE_EQ(var->GetType(), proto::VarType::FEED_MINIBATCH,
@@ -146,10 +154,10 @@ static bool has_feed_operators(
 // and fetch_holder_name. Raise exception when any mismatch is found.
 // Return true if the block has fetch operators and holder of matching info.
 static bool has_fetch_operators(
-    BlockDesc* block, std::map<std::string, LoDTensor*>& fetch_targets,
+    const BlockDesc& block, std::map<std::string, LoDTensor*>& fetch_targets,
     const std::string& fetch_holder_name) {
   size_t fetch_count = 0;
-  for (auto* op : block->AllOps()) {
+  for (auto* op : block.AllOps()) {
     if (op->Type() == kFetchOpType) {
       fetch_count++;
       PADDLE_ENFORCE_EQ(op->Output("Out")[0], fetch_holder_name,
@@ -168,7 +176,7 @@ static bool has_fetch_operators(
         "The number of fetch operators should match 'fetch_targets'");
 
     // When fetch operator are present, so should be fetch_holder
-    auto var = block->FindVar(fetch_holder_name);
+    auto var = block.FindVar(fetch_holder_name);
     PADDLE_ENFORCE_NOT_NULL(var, "Block should already have a '%s' variable",
                             fetch_holder_name);
     PADDLE_ENFORCE_EQ(var->GetType(), proto::VarType::FETCH_LIST,
@@ -184,10 +192,20 @@ void Executor::Run(const ProgramDesc& program, Scope* scope,
                    std::map<std::string, LoDTensor*>& fetch_targets,
                    const std::string& feed_holder_name,
                    const std::string& fetch_holder_name) {
-  auto* copy_program = new ProgramDesc(program);
+  platform::RecordBlock b(kProgramId);
+  bool has_feed_ops =
+      has_feed_operators(program.Block(0), feed_targets, feed_holder_name);
+  bool has_fetch_ops =
+      has_fetch_operators(program.Block(0), fetch_targets, fetch_holder_name);
+
+  // Copy only when feed/fetch ops must be added; owned_copy keeps it alive.
+  std::unique_ptr<ProgramDesc> owned_copy(
+      has_feed_ops && has_fetch_ops ? nullptr : new ProgramDesc(program));
+  ProgramDesc* copy_program =
+      owned_copy ? owned_copy.get() : const_cast<ProgramDesc*>(&program);
   auto* global_block = copy_program->MutableBlock(0);
 
-  if (!has_feed_operators(global_block, feed_targets, feed_holder_name)) {
+  if (!has_feed_ops) {
     // create feed_holder variable
     auto* feed_holder = global_block->Var(feed_holder_name);
     feed_holder->SetType(proto::VarType::FEED_MINIBATCH);
@@ -220,7 +238,7 @@ void Executor::Run(const ProgramDesc& program, Scope* scope,
     }
   }
 
-  if (!has_fetch_operators(global_block, fetch_targets, fetch_holder_name)) {
+  if (!has_fetch_ops) {
     // create fetch_holder variable
     auto* fetch_holder = global_block->Var(fetch_holder_name);
     fetch_holder->SetType(proto::VarType::FETCH_LIST);
@@ -254,8 +272,6 @@ void Executor::Run(const ProgramDesc& program, Scope* scope,
           GetFetchVariable(*scope, fetch_holder_name, idx);
     }
   }
-
-  delete copy_program;
 }
 
 ExecutorPrepareContext* Executor::Prepare(const ProgramDesc& program,
@@ -305,9 +321,8 @@ void Executor::RunPreparedContext(ExecutorPrepareContext* ctx, Scope* scope,
   }    // if (create_vars)
 
   for (auto& op : ctx->ops_) {
-    VLOG(4) << place_ << " " << op->DebugStringEx(local_scope);
-    op->Run(*local_scope, place_);
     VLOG(3) << place_ << " " << op->DebugStringEx(local_scope);
+    op->Run(*local_scope, place_);
 
     if (FLAGS_benchmark) {
       VLOG(2) << "Memory used after operator " + op->Type() + " running: "
diff --git a/paddle/fluid/framework/init.cc b/paddle/fluid/framework/init.cc
index 2e0a224ff5df749fd8c809dc88a85a1643542abf..3c0d93642ac41e8d90f9a248e81cea7a4fe12293 100644
--- a/paddle/fluid/framework/init.cc
+++ b/paddle/fluid/framework/init.cc
@@ -26,6 +26,7 @@ namespace paddle {
 namespace framework {
 
 std::once_flag gflags_init_flag;
+std::once_flag p2p_init_flag;
 
 void InitGflags(std::vector<std::string> &argv) {
   std::call_once(gflags_init_flag, [&]() {
@@ -42,6 +43,27 @@ void InitGflags(std::vector<std::string> &argv) {
   });
 }
 
+void InitP2P(int count) {
+#ifdef PADDLE_WITH_CUDA
+  std::call_once(p2p_init_flag, [&]() {
+    for (int i = 0; i < count; ++i) {
+      for (int j = 0; j < count; ++j) {
+        if (i == j) continue;
+        int can_access = -1;
+        PADDLE_ENFORCE(cudaDeviceCanAccessPeer(&can_access, i, j),
+                       "Failed to test P2P access.");
+        if (can_access != 1) {
+          LOG(WARNING) << "Cannot enable P2P access from " << i << " to " << j;
+        } else {
+          cudaSetDevice(i);
+          cudaDeviceEnablePeerAccess(j, 0);
+        }
+      }
+    }
+  });
+#endif
+}
+
 void InitDevices() {
   /*Init all avaiable devices by default */
 
@@ -63,7 +85,7 @@ void InitDevices() {
   for (int i = 0; i < count; ++i) {
     places.emplace_back(platform::CUDAPlace(i));
   }
-
+  InitP2P(count);
   platform::DeviceContextPool::Init(places);
 }
 
diff --git a/paddle/fluid/inference/CMakeLists.txt b/paddle/fluid/inference/CMakeLists.txt
index 17ccca8cdcbcaabaddbbc0ca1d3ca4fdf054b0fb..aff427310f15be72f5c8d0fa1537ffa6bbe2881d 100644
--- a/paddle/fluid/inference/CMakeLists.txt
+++ b/paddle/fluid/inference/CMakeLists.txt
@@ -13,6 +13,11 @@ cc_library(paddle_fluid_shared SHARED
     SRCS io.cc
     DEPS ARCHIVE_START ${GLOB_OP_LIB} ${FLUID_CORE_MODULES} ARCHIVE_END)
 set_target_properties(paddle_fluid_shared PROPERTIES OUTPUT_NAME paddle_fluid)
+if(NOT APPLE)
+  # TODO(liuyiqun): Temporarily disable the link flag because it is not supported on Mac.
+  set(LINK_FLAGS "-Wl,--version-script ${CMAKE_CURRENT_SOURCE_DIR}/paddle_fluid.map")
+  set_target_properties(paddle_fluid_shared PROPERTIES LINK_FLAGS "${LINK_FLAGS}")
+endif()
 
 if(WITH_TESTING)
   add_subdirectory(tests/book)
diff --git a/paddle/fluid/inference/paddle_fluid.map b/paddle/fluid/inference/paddle_fluid.map
new file mode 100644
index 0000000000000000000000000000000000000000..5203784dc1fcb672eb6a26d9dfd3ffbe02e08038
--- /dev/null
+++ b/paddle/fluid/inference/paddle_fluid.map
@@ -0,0 +1,6 @@
+{
+	global:
+		*paddle*;
+	local:
+		*;
+};
diff --git a/paddle/fluid/operators/CMakeLists.txt b/paddle/fluid/operators/CMakeLists.txt
index 625e0f7561899d30b40f9daa56f743a37bdaa27f..d30124d4a3b89b802a4abaae07a33b76526f163d 100644
--- a/paddle/fluid/operators/CMakeLists.txt
+++ b/paddle/fluid/operators/CMakeLists.txt
@@ -165,7 +165,6 @@ op_library(cond_op DEPS framework_proto tensor net_op)
 op_library(cross_entropy_op DEPS cross_entropy)
 op_library(softmax_with_cross_entropy_op DEPS cross_entropy softmax)
 op_library(softmax_op DEPS softmax)
-op_library(detection_output_op DEPS softmax)
 op_library(sequence_softmax_op DEPS softmax)
 op_library(sum_op DEPS selected_rows_functor)
 op_library(sgd_op DEPS selected_rows_functor)
@@ -203,6 +202,11 @@ op_library(save_combine_op DEPS lod_tensor)
 op_library(load_combine_op DEPS lod_tensor)
 op_library(concat_op DEPS concat)
 
+# FIXME(thuan): Move CSP operators to paddle/fluid/framework/operators/concurrency
+add_subdirectory(concurrency)
+op_library(channel_send_op DEPS concurrency)
+op_library(channel_recv_op DEPS concurrency)
+
 list(REMOVE_ITEM GENERAL_OPS ${DEPS_OPS})
 foreach(src ${GENERAL_OPS})
     op_library(${src})
diff --git a/paddle/fluid/operators/cast_op.cc b/paddle/fluid/operators/cast_op.cc
index 72f8cb04f2de3af4ee526c3d9b86ff96e34f0b0a..dd0068d571f72c9c22334e523cd091fe4c8da5a6 100644
--- a/paddle/fluid/operators/cast_op.cc
+++ b/paddle/fluid/operators/cast_op.cc
@@ -14,6 +14,7 @@ limitations under the License. */
 
 #include "paddle/fluid/operators/cast_op.h"
 #include "paddle/fluid/framework/op_registry.h"
+#include "paddle/fluid/platform/float16.h"
 
 namespace paddle {
 namespace operators {
@@ -88,4 +89,5 @@ REGISTER_OP_CPU_KERNEL(cast, ops::CastOpKernel<CPU, float>,
                        ops::CastOpKernel<CPU, double>,
                        ops::CastOpKernel<CPU, int>,
                        ops::CastOpKernel<CPU, int64_t>,
-                       ops::CastOpKernel<CPU, bool>);
+                       ops::CastOpKernel<CPU, bool>,
+                       ops::CastOpKernel<CPU, paddle::platform::float16>);
diff --git a/paddle/fluid/operators/cast_op.cu b/paddle/fluid/operators/cast_op.cu
index 507e9a531aae70e60bc6748bfab800310d6e0c21..c486c5850e25fcf4370f02cb145c244743a4cc4b 100644
--- a/paddle/fluid/operators/cast_op.cu
+++ b/paddle/fluid/operators/cast_op.cu
@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License. */
 
 #include "paddle/fluid/operators/cast_op.h"
+#include "paddle/fluid/platform/float16.h"
 
 template <typename T>
 using CastOpKernel =
@@ -20,4 +21,5 @@ using CastOpKernel =
 
 REGISTER_OP_CUDA_KERNEL(cast, CastOpKernel<float>, CastOpKernel<double>,
                         CastOpKernel<int>, CastOpKernel<int64_t>,
-                        CastOpKernel<bool>);
+                        CastOpKernel<bool>,
+                        CastOpKernel<paddle::platform::float16>);
diff --git a/paddle/fluid/operators/channel_recv_op.cc b/paddle/fluid/operators/channel_recv_op.cc
index c12b88e7a91c4ea7044223464a2f902db494d1a8..844b3ae3b7bf87c9b253128165b3c938801d5d60 100644
--- a/paddle/fluid/operators/channel_recv_op.cc
+++ b/paddle/fluid/operators/channel_recv_op.cc
@@ -18,6 +18,7 @@ limitations under the License. */
 #include <paddle/fluid/framework/reader.h>
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/framework/var_type.h"
+#include "paddle/fluid/operators/concurrency/channel_util.h"
 #include "paddle/fluid/operators/math/math_function.h"
 
 static constexpr char Channel[] = "Channel";
@@ -36,25 +37,6 @@ void SetReceiveStatus(const platform::Place &dev_place,
   status_tensor[0] = status;
 }
 
-bool ChannelReceive(framework::ChannelHolder *ch, framework::Variable *var) {
-  // Get type of channel and use that to call mutable data for Variable
-  auto type = framework::ToVarType(ch->Type());
-  if (type == framework::proto::VarType_Type_LOD_TENSOR)
-    return ch->Receive(var->GetMutable<framework::LoDTensor>());
-  else if (type == framework::proto::VarType_Type_LOD_RANK_TABLE)
-    return ch->Receive(var->GetMutable<framework::LoDRankTable>());
-  else if (type == framework::proto::VarType_Type_LOD_TENSOR_ARRAY)
-    return ch->Receive(var->GetMutable<framework::LoDTensorArray>());
-  else if (type == framework::proto::VarType_Type_SELECTED_ROWS)
-    return ch->Receive(var->GetMutable<framework::SelectedRows>());
-  else if (type == framework::proto::VarType_Type_READER)
-    return ch->Receive(var->GetMutable<framework::ReaderHolder>());
-  else if (type == framework::proto::VarType_Type_CHANNEL)
-    return ch->Receive(var->GetMutable<framework::ChannelHolder>());
-  else
-    PADDLE_THROW("ChannelReceive:Unsupported type");
-}
-
 class ChannelRecvOp : public framework::OperatorBase {
  public:
   ChannelRecvOp(const std::string &type,
@@ -81,7 +63,7 @@ class ChannelRecvOp : public framework::OperatorBase {
         scope.FindVar(Input(Channel))->GetMutable<framework::ChannelHolder>();
     auto output_var = scope.FindVar(Output(Out));
     // Receive the data from the channel.
-    bool ok = ChannelReceive(ch, output_var);
+    bool ok = concurrency::ChannelReceive(ch, output_var);
 
     // Set the status output of the `ChannelReceive` call.
     SetReceiveStatus(dev_place, *scope.FindVar(Output(Status)), ok);
diff --git a/paddle/fluid/operators/channel_send_op.cc b/paddle/fluid/operators/channel_send_op.cc
index 6d7715ad229e821f02437246e3326063cb1ee757..47cf7d7efc9996e8a8db11b79c0310f77c2435a4 100644
--- a/paddle/fluid/operators/channel_send_op.cc
+++ b/paddle/fluid/operators/channel_send_op.cc
@@ -18,6 +18,7 @@ limitations under the License. */
 #include <paddle/fluid/framework/reader.h>
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/framework/var_type.h"
+#include "paddle/fluid/operators/concurrency/channel_util.h"
 #include "paddle/fluid/operators/math/math_function.h"
 
 static constexpr char Channel[] = "Channel";
@@ -37,24 +38,6 @@ void SetSendStatus(const platform::Place &dev_place,
   status_tensor[0] = status;
 }
 
-bool ChannelSend(framework::ChannelHolder *ch, framework::Variable *var) {
-  auto type = framework::ToVarType(var->Type());
-  if (type == framework::proto::VarType_Type_LOD_TENSOR)
-    return ch->Send(var->GetMutable<framework::LoDTensor>());
-  else if (type == framework::proto::VarType_Type_LOD_RANK_TABLE)
-    return ch->Send(var->GetMutable<framework::LoDRankTable>());
-  else if (type == framework::proto::VarType_Type_LOD_TENSOR_ARRAY)
-    return ch->Send(var->GetMutable<framework::LoDTensorArray>());
-  else if (type == framework::proto::VarType_Type_SELECTED_ROWS)
-    return ch->Send(var->GetMutable<framework::SelectedRows>());
-  else if (type == framework::proto::VarType_Type_READER)
-    return ch->Send(var->GetMutable<framework::ReaderHolder>());
-  else if (type == framework::proto::VarType_Type_CHANNEL)
-    return ch->Send(var->GetMutable<framework::ChannelHolder>());
-  else
-    PADDLE_THROW("ChannelSend:Unsupported type");
-}
-
 class ChannelSendOp : public framework::OperatorBase {
  public:
   ChannelSendOp(const std::string &type,
@@ -82,7 +65,7 @@ class ChannelSendOp : public framework::OperatorBase {
     auto input_var = scope.FindVar(Input(X));
 
     // Send the input data through the channel.
-    bool ok = ChannelSend(ch, input_var);
+    bool ok = concurrency::ChannelSend(ch, input_var);
 
     // Set the status output of the `ChannelSend` call.
     SetSendStatus(dev_place, *scope.FindVar(Output(Status)), ok);
diff --git a/paddle/fluid/operators/concurrency/CMakeLists.txt b/paddle/fluid/operators/concurrency/CMakeLists.txt
new file mode 100644
index 0000000000000000000000000000000000000000..e4617440d152b4c15d09e81cd19c76739b95b979
--- /dev/null
+++ b/paddle/fluid/operators/concurrency/CMakeLists.txt
@@ -0,0 +1 @@
+cc_library(concurrency SRCS channel_util.cc DEPS device_context framework_proto boost eigen3)
diff --git a/paddle/fluid/operators/concurrency/channel_util.cc b/paddle/fluid/operators/concurrency/channel_util.cc
new file mode 100644
index 0000000000000000000000000000000000000000..a483af7affd824da7d18676d934dc959167ef71f
--- /dev/null
+++ b/paddle/fluid/operators/concurrency/channel_util.cc
@@ -0,0 +1,111 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include "channel_util.h"
+#include "paddle/fluid/framework/var_type.h"
+
+namespace poc = paddle::operators::concurrency;
+
+bool poc::ChannelSend(framework::ChannelHolder *ch, framework::Variable *var) {
+  auto type = framework::ToVarType(var->Type());
+  if (type == framework::proto::VarType_Type_LOD_TENSOR)
+    return ch->Send(var->GetMutable<framework::LoDTensor>());
+  else if (type == framework::proto::VarType_Type_LOD_RANK_TABLE)
+    return ch->Send(var->GetMutable<framework::LoDRankTable>());
+  else if (type == framework::proto::VarType_Type_LOD_TENSOR_ARRAY)
+    return ch->Send(var->GetMutable<framework::LoDTensorArray>());
+  else if (type == framework::proto::VarType_Type_SELECTED_ROWS)
+    return ch->Send(var->GetMutable<framework::SelectedRows>());
+  else if (type == framework::proto::VarType_Type_READER)
+    return ch->Send(var->GetMutable<framework::ReaderHolder>());
+  else if (type == framework::proto::VarType_Type_CHANNEL)
+    return ch->Send(var->GetMutable<framework::ChannelHolder>());
+  else
+    PADDLE_THROW("ChannelSend:Unsupported type");
+}
+
+bool poc::ChannelReceive(framework::ChannelHolder *ch,
+                         framework::Variable *var) {
+  // Get type of channel and use that to call mutable data for Variable
+  auto type = framework::ToVarType(ch->Type());
+  if (type == framework::proto::VarType_Type_LOD_TENSOR)
+    return ch->Receive(var->GetMutable<framework::LoDTensor>());
+  else if (type == framework::proto::VarType_Type_LOD_RANK_TABLE)
+    return ch->Receive(var->GetMutable<framework::LoDRankTable>());
+  else if (type == framework::proto::VarType_Type_LOD_TENSOR_ARRAY)
+    return ch->Receive(var->GetMutable<framework::LoDTensorArray>());
+  else if (type == framework::proto::VarType_Type_SELECTED_ROWS)
+    return ch->Receive(var->GetMutable<framework::SelectedRows>());
+  else if (type == framework::proto::VarType_Type_READER)
+    return ch->Receive(var->GetMutable<framework::ReaderHolder>());
+  else if (type == framework::proto::VarType_Type_CHANNEL)
+    return ch->Receive(var->GetMutable<framework::ChannelHolder>());
+  else
+    PADDLE_THROW("ChannelReceive:Unsupported type");
+}
+
+void poc::ChannelAddToSendQ(framework::ChannelHolder *ch, const void *referrer,
+                            framework::Variable *var,
+                            std::shared_ptr<std::condition_variable_any> cond,
+                            std::function<bool(framework::ChannelAction)> cb) {
+  auto type = framework::ToVarType(var->Type());
+  if (type == framework::proto::VarType_Type_LOD_TENSOR) {
+    ch->AddToSendQ(referrer, var->GetMutable<framework::LoDTensor>(), cond, cb);
+  } else if (type == framework::proto::VarType_Type_LOD_RANK_TABLE) {
+    ch->AddToSendQ(referrer, var->GetMutable<framework::LoDRankTable>(), cond,
+                   cb);
+  } else if (type == framework::proto::VarType_Type_LOD_TENSOR_ARRAY) {
+    ch->AddToSendQ(referrer, var->GetMutable<framework::LoDTensorArray>(), cond,
+                   cb);
+  } else if (type == framework::proto::VarType_Type_SELECTED_ROWS) {
+    ch->AddToSendQ(referrer, var->GetMutable<framework::SelectedRows>(), cond,
+                   cb);
+  } else if (type == framework::proto::VarType_Type_READER) {
+    ch->AddToSendQ(referrer, var->GetMutable<framework::ReaderHolder>(), cond,
+                   cb);
+  } else if (type == framework::proto::VarType_Type_CHANNEL) {
+    ch->AddToSendQ(referrer, var->GetMutable<framework::ChannelHolder>(), cond,
+                   cb);
+  } else {
+    PADDLE_THROW("ChannelAddToSendQ: Unsupported type");
+  }
+}
+
+void poc::ChannelAddToReceiveQ(
+    framework::ChannelHolder *ch, const void *referrer,
+    framework::Variable *var, std::shared_ptr<std::condition_variable_any> cond,
+    std::function<bool(framework::ChannelAction)> cb) {
+  auto type = framework::ToVarType(var->Type());
+  if (type == framework::proto::VarType_Type_LOD_TENSOR) {
+    ch->AddToReceiveQ(referrer, var->GetMutable<framework::LoDTensor>(), cond,
+                      cb);
+  } else if (type == framework::proto::VarType_Type_LOD_RANK_TABLE) {
+    ch->AddToReceiveQ(referrer, var->GetMutable<framework::LoDRankTable>(),
+                      cond, cb);
+  } else if (type == framework::proto::VarType_Type_LOD_TENSOR_ARRAY) {
+    ch->AddToReceiveQ(referrer, var->GetMutable<framework::LoDTensorArray>(),
+                      cond, cb);
+  } else if (type == framework::proto::VarType_Type_SELECTED_ROWS) {
+    ch->AddToReceiveQ(referrer, var->GetMutable<framework::SelectedRows>(),
+                      cond, cb);
+  } else if (type == framework::proto::VarType_Type_READER) {
+    ch->AddToReceiveQ(referrer, var->GetMutable<framework::ReaderHolder>(),
+                      cond, cb);
+  } else if (type == framework::proto::VarType_Type_CHANNEL) {
+    ch->AddToReceiveQ(referrer, var->GetMutable<framework::ChannelHolder>(),
+                      cond, cb);
+  } else {
+    PADDLE_THROW("ChannelAddToReceiveQ: Unsupported type");
+  }
+}
diff --git a/paddle/fluid/operators/concurrency/channel_util.h b/paddle/fluid/operators/concurrency/channel_util.h
new file mode 100644
index 0000000000000000000000000000000000000000..c3674bd9815df451751707bfa84d18dbb5fa0f6b
--- /dev/null
+++ b/paddle/fluid/operators/concurrency/channel_util.h
@@ -0,0 +1,38 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+
+#include "paddle/fluid/framework/channel.h"
+#include "paddle/fluid/framework/variable.h"
+
+namespace paddle {
+namespace operators {
+namespace concurrency {
+
+bool ChannelSend(framework::ChannelHolder *ch, framework::Variable *var);
+bool ChannelReceive(framework::ChannelHolder *ch, framework::Variable *var);
+
+void ChannelAddToSendQ(framework::ChannelHolder *ch, const void *referrer,
+                       framework::Variable *var,
+                       std::shared_ptr<std::condition_variable_any> cond,
+                       std::function<bool(framework::ChannelAction)> cb);
+void ChannelAddToReceiveQ(framework::ChannelHolder *ch, const void *referrer,
+                          framework::Variable *var,
+                          std::shared_ptr<std::condition_variable_any> cond,
+                          std::function<bool(framework::ChannelAction)> cb);
+
+}  // namespace concurrency
+}  // namespace operators
+}  // namespace paddle
diff --git a/paddle/fluid/operators/conv_cudnn_op.cu.cc b/paddle/fluid/operators/conv_cudnn_op.cu.cc
index ff0fbf21f86269885df5491afab7443df813f13f..a32aba4c1ff2f5e775aeb41f25b02322dbc6a64a 100644
--- a/paddle/fluid/operators/conv_cudnn_op.cu.cc
+++ b/paddle/fluid/operators/conv_cudnn_op.cu.cc
@@ -18,6 +18,7 @@ limitations under the License. */
 #include "paddle/fluid/operators/conv_op.h"
 #include "paddle/fluid/platform/assert.h"
 #include "paddle/fluid/platform/cudnn_helper.h"
+#include "paddle/fluid/platform/float16.h"
 
 namespace paddle {
 namespace operators {
@@ -27,6 +28,8 @@ using ScopedTensorDescriptor = platform::ScopedTensorDescriptor;
 using ScopedFilterDescriptor = platform::ScopedFilterDescriptor;
 using ScopedConvolutionDescriptor = platform::ScopedConvolutionDescriptor;
 using DataLayout = platform::DataLayout;
+template <typename T>
+using ScalingParamType = typename platform::CudnnDataType<T>::ScalingParamType;
 
 static constexpr size_t kCONV_CUDNN_WORKSPACE_LIMIT_BYTES =
     static_cast<size_t>(1024) * 1024 * 1024;
@@ -133,7 +136,7 @@ class CUDNNConvOpKernel : public framework::OpKernel<T> {
     platform::CUDAPlace gpu = boost::get<platform::CUDAPlace>(ctx.GetPlace());
     cudnn_workspace = paddle::memory::Alloc(gpu, workspace_size_in_bytes);
     // ------------------- cudnn conv forward ---------------------
-    T alpha = 1.0f, beta = 0.0f;
+    ScalingParamType<T> alpha = 1.0f, beta = 0.0f;
     for (int i = 0; i < groups; i++) {
       PADDLE_ENFORCE(platform::dynload::cudnnConvolutionForward(
           handle, &alpha, cudnn_input_desc, input_data + i * group_offset_in,
@@ -280,7 +283,7 @@ class CUDNNConvGradOpKernel : public framework::OpKernel<T> {
     platform::CUDAPlace gpu = boost::get<platform::CUDAPlace>(ctx.GetPlace());
     cudnn_workspace = paddle::memory::Alloc(gpu, workspace_size_in_bytes);
     // ------------------- cudnn conv backward data ---------------------
-    T alpha = 1.0f, beta = 0.0f;
+    ScalingParamType<T> alpha = 1.0f, beta = 0.0f;
     if (input_grad) {
       T* input_grad_data = input_grad->mutable_data<T>(ctx.GetPlace());
       // Because beta is zero, it is unnecessary to reset input_grad.
@@ -315,16 +318,18 @@ class CUDNNConvGradOpKernel : public framework::OpKernel<T> {
 }  // namespace operators
 }  // namespace paddle
 
-REGISTER_OP_KERNEL(conv2d, CUDNN, ::paddle::platform::CUDAPlace,
+namespace plat = paddle::platform;
+REGISTER_OP_KERNEL(conv2d, CUDNN, plat::CUDAPlace,
                    paddle::operators::CUDNNConvOpKernel<float>,
-                   paddle::operators::CUDNNConvOpKernel<double>);
-REGISTER_OP_KERNEL(conv2d_grad, CUDNN, ::paddle::platform::CUDAPlace,
+                   paddle::operators::CUDNNConvOpKernel<double>,
+                   paddle::operators::CUDNNConvOpKernel<plat::float16>);
+REGISTER_OP_KERNEL(conv2d_grad, CUDNN, plat::CUDAPlace,
                    paddle::operators::CUDNNConvGradOpKernel<float>,
                    paddle::operators::CUDNNConvGradOpKernel<double>);
 
-REGISTER_OP_KERNEL(conv3d, CUDNN, ::paddle::platform::CUDAPlace,
+REGISTER_OP_KERNEL(conv3d, CUDNN, plat::CUDAPlace,
                    paddle::operators::CUDNNConvOpKernel<float>,
                    paddle::operators::CUDNNConvOpKernel<double>);
-REGISTER_OP_KERNEL(conv3d_grad, CUDNN, ::paddle::platform::CUDAPlace,
+REGISTER_OP_KERNEL(conv3d_grad, CUDNN, plat::CUDAPlace,
                    paddle::operators::CUDNNConvGradOpKernel<float>,
                    paddle::operators::CUDNNConvGradOpKernel<double>);
diff --git a/paddle/fluid/operators/conv_op.cc b/paddle/fluid/operators/conv_op.cc
index 4b02b80d7772fa15d2333692551da5e59d93765f..650bc92be22af9ea8afcacf590a11190109e8811 100644
--- a/paddle/fluid/operators/conv_op.cc
+++ b/paddle/fluid/operators/conv_op.cc
@@ -70,25 +70,36 @@ void ConvOp::InferShape(framework::InferShapeContext* ctx) const {
 
 framework::OpKernelType ConvOp::GetExpectedKernelType(
     const framework::ExecutionContext& ctx) const {
-  framework::LibraryType library_{framework::LibraryType::kPlain};
+  framework::LibraryType library{framework::LibraryType::kPlain};
 #ifdef PADDLE_WITH_CUDA
   if (platform::CanCUDNNBeUsed(ctx)) {
-    library_ = framework::LibraryType::kCUDNN;
+    library = framework::LibraryType::kCUDNN;
   }
 #endif
 #ifdef PADDLE_WITH_MKLDNN
-  if (library_ == framework::LibraryType::kPlain &&
+  if (library == framework::LibraryType::kPlain &&
       platform::CanMKLDNNBeUsed(ctx)) {
-    library_ = framework::LibraryType::kMKLDNN;
+    library = framework::LibraryType::kMKLDNN;
   }
 #endif
 
+  auto input_data_type =
+      framework::ToDataType(ctx.Input<Tensor>("Input")->type());
+  auto filter_data_type =
+      framework::ToDataType(ctx.Input<Tensor>("Filter")->type());
+  PADDLE_ENFORCE_EQ(input_data_type, filter_data_type,
+                    "input and filter data type should be consistent");
+
+  if (input_data_type == framework::proto::VarType::FP16) {
+    PADDLE_ENFORCE_EQ(library, framework::LibraryType::kCUDNN,
+                      "float16 can only be used when CUDNN is used");
+  }
+
   std::string data_format = ctx.Attr<std::string>("data_format");
   // TODO(pzelazko-intel): enable MKLDNN layout when it's ready
-  framework::DataLayout layout_ = framework::StringToDataLayout(data_format);
-  return framework::OpKernelType(
-      framework::ToDataType(ctx.Input<Tensor>("Input")->type()), ctx.GetPlace(),
-      layout_, library_);
+  framework::DataLayout layout = framework::StringToDataLayout(data_format);
+  return framework::OpKernelType(input_data_type, ctx.GetPlace(), layout,
+                                 library);
 }
 
 Conv2DOpMaker::Conv2DOpMaker(OpProto* proto, OpAttrChecker* op_checker)
diff --git a/paddle/fluid/operators/detection_output_op.cc b/paddle/fluid/operators/detection_output_op.cc
deleted file mode 100644
index f7520475917ff23535f11ccfde0ee915112bba30..0000000000000000000000000000000000000000
--- a/paddle/fluid/operators/detection_output_op.cc
+++ /dev/null
@@ -1,89 +0,0 @@
-/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-Indicesou may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License. */
-
-#include "paddle/fluid/operators/detection_output_op.h"
-namespace paddle {
-namespace operators {
-
-class DetectionOutputOpMaker : public framework::OpProtoAndCheckerMaker {
- public:
-  DetectionOutputOpMaker(OpProto* proto, OpAttrChecker* op_checker)
-      : OpProtoAndCheckerMaker(proto, op_checker) {
-    AddInput("Loc",
-             "(Tensor) The input tensor of detection_output operator."
-             "The input predict locations"
-             "The format of input tensor is kNCHW. Where K is priorbox point "
-             "numbers,"
-             "N is How many boxes are there on each point, "
-             "C is 4, H and W both are 1.");
-    AddInput("Conf",
-             "(Tensor) The input tensor of detection_output operator."
-             "The input priorbox confidence."
-             "The format of input tensor is kNCHW. Where K is priorbox point "
-             "numbers,"
-             "N is How many boxes are there on each point, "
-             "C is the number of classes, H and W both are 1.");
-    AddInput("PriorBox",
-             "(Tensor) The input tensor of detection_output operator."
-             "The format of input tensor is the position and variance "
-             "of the boxes");
-    AddOutput("Out",
-              "(Tensor) The output tensor of detection_output operator.");
-    AddAttr<int>("background_label_id", "(int), The background class index.");
-    AddAttr<int>("num_classes", "(int), The number of the classification.");
-    AddAttr<float>("nms_threshold",
-                   "(float), The Non-maximum suppression threshold.");
-    AddAttr<float>("confidence_threshold",
-                   "(float), The classification confidence threshold.");
-    AddAttr<int>("top_k", "(int), The bbox number kept of the layer’s output.");
-    AddAttr<int>("nms_top_k",
-                 "(int), The bbox number kept of the NMS’s output.");
-    AddComment(R"DOC(
-          detection output for SSD(single shot multibox detector)
-          Apply the NMS to the output of network and compute the predict
-          bounding box location. The output’s shape of this layer could
-          be zero if there is no valid bounding box.
-        )DOC");
-  }
-};
-
-class DetectionOutputOp : public framework::OperatorWithKernel {
- public:
-  using framework::OperatorWithKernel::OperatorWithKernel;
-  void InferShape(framework::InferShapeContext* ctx) const override {
-    PADDLE_ENFORCE(ctx->HasInput("Loc"),
-                   "Input(X) of DetectionOutputOp"
-                   "should not be null.");
-    PADDLE_ENFORCE(ctx->HasInput("Conf"),
-                   "Input(X) of DetectionOutputOp"
-                   "should not be null.");
-    PADDLE_ENFORCE(ctx->HasInput("PriorBox"),
-                   "Input(X) of DetectionOutputOp"
-                   "should not be null.");
-    PADDLE_ENFORCE(ctx->HasOutput("Out"),
-                   "Output(Out) of DetectionOutputOp should not be null.");
-    std::vector<int64_t> output_shape({1, 7});
-    ctx->SetOutputDim("Out", framework::make_ddim(output_shape));
-  }
-};
-}  // namespace operators
-}  // namespace paddle
-
-namespace ops = paddle::operators;
-REGISTER_OP_WITHOUT_GRADIENT(detection_output, ops::DetectionOutputOp,
-                             ops::DetectionOutputOpMaker);
-REGISTER_OP_CPU_KERNEL(
-    detection_output,
-    ops::DetectionOutputKernel<paddle::platform::CPUDeviceContext, float>,
-    ops::DetectionOutputKernel<paddle::platform::CPUDeviceContext, double>);
diff --git a/paddle/fluid/operators/detection_output_op.cu.cc b/paddle/fluid/operators/detection_output_op.cu.cc
deleted file mode 100644
index 0f48765c9c67c1d3fa32b19d5e87b2acaa3c486a..0000000000000000000000000000000000000000
--- a/paddle/fluid/operators/detection_output_op.cu.cc
+++ /dev/null
@@ -1,21 +0,0 @@
-/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-Indicesou may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License. */
-
-#include "paddle/fluid/operators/detection_output_op.h"
-
-namespace ops = paddle::operators;
-REGISTER_OP_CUDA_KERNEL(
-    detection_output,
-    ops::DetectionOutputKernel<paddle::platform::CUDADeviceContext, float>,
-    ops::DetectionOutputKernel<paddle::platform::CUDADeviceContext, double>);
diff --git a/paddle/fluid/operators/detection_output_op.h b/paddle/fluid/operators/detection_output_op.h
deleted file mode 100644
index af9081c93436776b6ca6ee7139e340054111e440..0000000000000000000000000000000000000000
--- a/paddle/fluid/operators/detection_output_op.h
+++ /dev/null
@@ -1,167 +0,0 @@
-/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
-
-   Licensed under the Apache License, Version 2.0 (the "License");
-   you may not use this file except in compliance with the License.
-   Indicesou may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
-   Unless required by applicable law or agreed to in writing, software
-   distributed under the License is distributed on an "AS IS" BASIS,
-   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-   See the License for the specific language governing permissions and
-   limitations under the License. */
-
-#pragma once
-#include "paddle/fluid/framework/op_registry.h"
-#include "paddle/fluid/framework/tensor.h"
-#include "paddle/fluid/operators/math/detection_util.h"
-#include "paddle/fluid/operators/math/math_function.h"
-#include "paddle/fluid/operators/math/softmax.h"
-#include "paddle/fluid/operators/strided_memcpy.h"
-namespace paddle {
-namespace operators {
-template <typename DeviceContext, typename T>
-inline void transpose_fun(const framework::ExecutionContext& context,
-                          const framework::Tensor& src,
-                          framework::Tensor* dst) {
-  int input_nums = src.dims()[0];
-  int offset = 0;
-  for (int j = 0; j < input_nums; ++j) {
-    framework::Tensor in_p_tensor = src.Slice(j, j + 1);
-    std::vector<int64_t> shape_vec(
-        {in_p_tensor.dims()[0], in_p_tensor.dims()[1], in_p_tensor.dims()[3],
-         in_p_tensor.dims()[4], in_p_tensor.dims()[2]});
-    framework::DDim shape(framework::make_ddim(shape_vec));
-    framework::Tensor in_p_tensor_transpose;
-    in_p_tensor_transpose.mutable_data<T>(shape, context.GetPlace());
-    std::vector<int> shape_axis({0, 1, 3, 4, 2});
-    math::Transpose<DeviceContext, T, 5> trans5;
-    trans5(context.template device_context<DeviceContext>(), in_p_tensor,
-           &in_p_tensor_transpose, shape_axis);
-    auto dst_stride = framework::stride(dst->dims());
-    auto src_stride = framework::stride(in_p_tensor_transpose.dims());
-    StridedMemcpy<T>(context.device_context(), in_p_tensor_transpose.data<T>(),
-                     src_stride, in_p_tensor_transpose.dims(), dst_stride,
-                     dst->data<T>() + offset);
-    offset += in_p_tensor_transpose.dims()[4] * src_stride[4];
-  }
-}
-template <typename DeviceContext, typename T>
-class DetectionOutputKernel : public framework::OpKernel<T> {
- public:
-  void Compute(const framework::ExecutionContext& context) const override {
-    const framework::Tensor* in_loc = context.Input<framework::Tensor>("Loc");
-    const framework::Tensor* in_conf = context.Input<framework::Tensor>("Conf");
-    const framework::Tensor* in_priorbox =
-        context.Input<framework::Tensor>("PriorBox");
-    auto* out = context.Output<framework::Tensor>("Out");
-    int num_classes = context.template Attr<int>("num_classes");
-    int top_k = context.template Attr<int>("top_k");
-    int nms_top_k = context.template Attr<int>("nms_top_k");
-    int background_label_id = context.template Attr<int>("background_label_id");
-    float nms_threshold = context.template Attr<float>("nms_threshold");
-    float confidence_threshold =
-        context.template Attr<float>("confidence_threshold");
-    size_t batch_size = in_conf->dims()[1];
-    int conf_sum_size = in_conf->numel();
-    // for softmax
-    std::vector<int64_t> conf_shape_softmax_vec(
-        {conf_sum_size / num_classes, num_classes});
-    framework::DDim conf_shape_softmax(
-        framework::make_ddim(conf_shape_softmax_vec));
-    // for knchw => nhwc
-    std::vector<int64_t> loc_shape_vec({1, in_loc->dims()[1], in_loc->dims()[3],
-                                        in_loc->dims()[4],
-                                        in_loc->dims()[2] * in_loc->dims()[0]});
-    std::vector<int64_t> conf_shape_vec(
-        {1, in_conf->dims()[1], in_conf->dims()[3], in_conf->dims()[4],
-         in_conf->dims()[2] * in_conf->dims()[0]});
-    framework::DDim loc_shape(framework::make_ddim(loc_shape_vec));
-    framework::DDim conf_shape(framework::make_ddim(conf_shape_vec));
-    framework::Tensor loc_tensor;
-    framework::Tensor conf_tensor;
-    loc_tensor.mutable_data<T>(loc_shape, context.GetPlace());
-    conf_tensor.mutable_data<T>(conf_shape, context.GetPlace());
-    // for cpu
-    framework::Tensor loc_cpu;
-    framework::Tensor conf_cpu;
-    framework::Tensor priorbox_cpu;
-    const T* priorbox_data = in_priorbox->data<T>();
-    transpose_fun<DeviceContext, T>(context, *in_loc, &loc_tensor);
-    transpose_fun<DeviceContext, T>(context, *in_conf, &conf_tensor);
-    conf_tensor.Resize(conf_shape_softmax);
-    math::SoftmaxFunctor<DeviceContext, T>()(
-        context.template device_context<DeviceContext>(), &conf_tensor,
-        &conf_tensor);
-    T* loc_data = loc_tensor.data<T>();
-    T* conf_data = conf_tensor.data<T>();
-    if (platform::is_gpu_place(context.GetPlace())) {
-      loc_cpu.mutable_data<T>(loc_tensor.dims(), platform::CPUPlace());
-      framework::TensorCopy(loc_tensor, platform::CPUPlace(),
-                            context.device_context(), &loc_cpu);
-      loc_data = loc_cpu.data<T>();
-      conf_cpu.mutable_data<T>(conf_tensor.dims(), platform::CPUPlace());
-      framework::TensorCopy(conf_tensor, platform::CPUPlace(),
-                            context.device_context(), &conf_cpu);
-      conf_data = conf_cpu.data<T>();
-      priorbox_cpu.mutable_data<T>(in_priorbox->dims(), platform::CPUPlace());
-      framework::TensorCopy(*in_priorbox, platform::CPUPlace(),
-                            context.device_context(), &priorbox_cpu);
-      priorbox_data = priorbox_cpu.data<T>();
-    }
-    // get decode bboxes
-    size_t num_priors = in_priorbox->numel() / 8;
-    std::vector<std::vector<operators::math::BBox<T>>> all_decoded_bboxes;
-    for (size_t n = 0; n < batch_size; ++n) {
-      std::vector<operators::math::BBox<T>> decoded_bboxes;
-      for (size_t i = 0; i < num_priors; ++i) {
-        size_t prior_offset = i * 8;
-        size_t loc_pred_offset = n * num_priors * 4 + i * 4;
-        std::vector<math::BBox<T>> prior_bbox_vec;
-        math::GetBBoxFromPriorData<T>(priorbox_data + prior_offset, 1,
-                                      prior_bbox_vec);
-        std::vector<std::vector<T>> prior_bbox_var;
-        math::GetBBoxVarFromPriorData<T>(priorbox_data + prior_offset, 1,
-                                         prior_bbox_var);
-        std::vector<T> loc_pred_data;
-        for (size_t j = 0; j < 4; ++j)
-          loc_pred_data.push_back(*(loc_data + loc_pred_offset + j));
-        math::BBox<T> bbox = math::DecodeBBoxWithVar<T>(
-            prior_bbox_vec[0], prior_bbox_var[0], loc_pred_data);
-        decoded_bboxes.push_back(bbox);
-      }
-      all_decoded_bboxes.push_back(decoded_bboxes);
-    }
-    std::vector<std::map<size_t, std::vector<size_t>>> all_indices;
-    int num_kept = math::GetDetectionIndices<T>(
-        conf_data, num_priors, num_classes, background_label_id, batch_size,
-        confidence_threshold, nms_top_k, nms_threshold, top_k,
-        all_decoded_bboxes, &all_indices);
-
-    if (num_kept <= 0) {
-      std::vector<int64_t> out_shape_vec({0, 0});
-      framework::DDim out_shape(framework::make_ddim(out_shape_vec));
-      out->Resize(out_shape);
-      return;
-    }
-    std::vector<int64_t> out_shape_vec({num_kept, 7});
-    framework::DDim out_shape(framework::make_ddim(out_shape_vec));
-    out->mutable_data<T>(out_shape, context.GetPlace());
-    framework::Tensor out_cpu;
-    T* out_data = out->data<T>();
-    if (platform::is_gpu_place(context.GetPlace())) {
-      out_cpu.mutable_data<T>(out->dims(), platform::CPUPlace());
-      out_data = out_cpu.data<T>();
-    }
-    math::GetDetectionOutput<T>(conf_data, num_kept, num_priors, num_classes,
-                                batch_size, all_indices, all_decoded_bboxes,
-                                out_data);
-    if (platform::is_gpu_place(context.GetPlace())) {
-      framework::TensorCopy(out_cpu, platform::CUDAPlace(),
-                            context.device_context(), out);
-    }
-  }
-};
-}  // namespace operators
-}  // namespace paddle
diff --git a/paddle/fluid/operators/dropout_op.cc b/paddle/fluid/operators/dropout_op.cc
index 1074ed6acc22a81f46c466d917ef973945a12898..e4436549f6185ba04a5f270893596a6dcb11e89b 100644
--- a/paddle/fluid/operators/dropout_op.cc
+++ b/paddle/fluid/operators/dropout_op.cc
@@ -35,7 +35,6 @@ class DropoutOp : public framework::OperatorWithKernel {
   }
 };
 
-template <typename AttrType>
 class DropoutOpMaker : public framework::OpProtoAndCheckerMaker {
  public:
   DropoutOpMaker(OpProto* proto, OpAttrChecker* op_checker)
@@ -73,7 +72,6 @@ are set equal to their corresponding inputs.
   }
 };
 
-template <typename AttrType>
 class DropoutOpGrad : public framework::OperatorWithKernel {
  public:
   using framework::OperatorWithKernel::OperatorWithKernel;
@@ -103,11 +101,10 @@ class DropoutOpGrad : public framework::OperatorWithKernel {
 }  // namespace paddle
 
 namespace ops = paddle::operators;
-REGISTER_OP(dropout, ops::DropoutOp, ops::DropoutOpMaker<float>, dropout_grad,
-            ops::DropoutOpGrad<float>);
+REGISTER_OP(dropout, ops::DropoutOp, ops::DropoutOpMaker, dropout_grad,
+            ops::DropoutOpGrad);
 REGISTER_OP_CPU_KERNEL(
-    dropout,
-    ops::CPUDropoutKernel<paddle::platform::CPUDeviceContext, float, float>);
+    dropout, ops::CPUDropoutKernel<paddle::platform::CPUDeviceContext, float>);
 REGISTER_OP_CPU_KERNEL(
     dropout_grad,
     ops::DropoutGradKernel<paddle::platform::CPUDeviceContext, float>);
diff --git a/paddle/fluid/operators/dropout_op.cu b/paddle/fluid/operators/dropout_op.cu
index d6f9c04359d733cb4f3f0586e9239ee67deb7078..f6c85a2a537b37feb20e6d62729dc5075af2a5d9 100644
--- a/paddle/fluid/operators/dropout_op.cu
+++ b/paddle/fluid/operators/dropout_op.cu
@@ -18,17 +18,18 @@ limitations under the License. */
 #include <thrust/random.h>
 #include <thrust/transform.h>
 #include "paddle/fluid/operators/dropout_op.h"
+#include "paddle/fluid/platform/float16.h"
 
 namespace paddle {
 namespace operators {
 
-template <typename T, typename AttrType>
+template <typename T>
 __global__ void RandomGenerator(const size_t n, const int seed,
-                                const AttrType dropout_prob, const T* src,
+                                const float dropout_prob, const T* src,
                                 T* mask_data, T* dst) {
   thrust::minstd_rand rng;
   rng.seed(seed);
-  thrust::uniform_real_distribution<AttrType> dist(0, 1);
+  thrust::uniform_real_distribution<float> dist(0, 1);
 
   int idx = blockDim.x * blockIdx.x + threadIdx.x;
   for (; idx < n; idx += blockDim.x * gridDim.x) {
@@ -44,14 +45,14 @@ __global__ void RandomGenerator(const size_t n, const int seed,
 // It seems that Eigen::Tensor::setRandom in GPU will SEGFAULT.
 // Use std::random and thrust::random(thrust is a std library in CUDA) to
 // implement uniform random.
-template <typename Place, typename T, typename AttrType>
+template <typename Place, typename T>
 class GPUDropoutKernel : public framework::OpKernel<T> {
  public:
   void Compute(const framework::ExecutionContext& context) const override {
     auto* x = context.Input<Tensor>("X");
     auto* y = context.Output<Tensor>("Out");
     y->mutable_data<T>(context.GetPlace());
-    AttrType dropout_prob = context.Attr<AttrType>("dropout_prob");
+    float dropout_prob = context.Attr<float>("dropout_prob");
 
     auto X = EigenMatrix<T>::Reshape(*x, 1);
     auto Y = EigenMatrix<T>::Reshape(*y, 1);
@@ -70,11 +71,11 @@ class GPUDropoutKernel : public framework::OpKernel<T> {
 
       int threads = 512;
       int grid = (x->numel() + threads - 1) / threads;
-      RandomGenerator<T, AttrType><<<grid, threads, 0,
-                                     context.cuda_device_context().stream()>>>(
+      RandomGenerator<
+          T><<<grid, threads, 0, context.cuda_device_context().stream()>>>(
           size, seed, dropout_prob, x_data, mask_data, y_data);
     } else {
-      Y.device(place) = X * (1.0f - dropout_prob);
+      Y.device(place) = X * static_cast<T>(1.0f - dropout_prob);
     }
   }
 };
@@ -83,9 +84,9 @@ class GPUDropoutKernel : public framework::OpKernel<T> {
 }  // namespace paddle
 
 namespace ops = paddle::operators;
+namespace plat = paddle::platform;
 REGISTER_OP_CUDA_KERNEL(
-    dropout,
-    ops::GPUDropoutKernel<paddle::platform::CUDADeviceContext, float, float>);
-REGISTER_OP_CUDA_KERNEL(
-    dropout_grad,
-    ops::DropoutGradKernel<paddle::platform::CUDADeviceContext, float>);
+    dropout, ops::GPUDropoutKernel<plat::CUDADeviceContext, float>,
+    ops::GPUDropoutKernel<plat::CUDADeviceContext, plat::float16>);
+REGISTER_OP_CUDA_KERNEL(dropout_grad,
+                        ops::DropoutGradKernel<plat::CUDADeviceContext, float>);
diff --git a/paddle/fluid/operators/dropout_op.h b/paddle/fluid/operators/dropout_op.h
index 209e4dec1756dc65fbf147c4dbbf0913d3c6ef7e..b5ee86ae2d11dfc835e1a3a6826ce016baf38a29 100644
--- a/paddle/fluid/operators/dropout_op.h
+++ b/paddle/fluid/operators/dropout_op.h
@@ -25,7 +25,7 @@ template <typename T, int MajorType = Eigen::RowMajor,
           typename IndexType = Eigen::DenseIndex>
 using EigenMatrix = framework::EigenMatrix<T, MajorType, IndexType>;
 
-template <typename DeviceContext, typename T, typename AttrType>
+template <typename DeviceContext, typename T>
 class CPUDropoutKernel : public framework::OpKernel<T> {
  public:
   void Compute(const framework::ExecutionContext& context) const override {
diff --git a/paddle/fluid/operators/feed_op.cc b/paddle/fluid/operators/feed_op.cc
index 90c31877f6a87d1e237283d489353b4aba26c97b..debacf07c360b9aa69000a0d891f04239ed08807 100644
--- a/paddle/fluid/operators/feed_op.cc
+++ b/paddle/fluid/operators/feed_op.cc
@@ -15,6 +15,7 @@ limitations under the License. */
 #include "paddle/fluid/framework/feed_fetch_type.h"
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/framework/operator.h"
+#include "paddle/fluid/platform/profiler.h"
 
 namespace paddle {
 namespace operators {
@@ -28,6 +29,10 @@ class FeedOp : public framework::OperatorBase {
  private:
   void RunImpl(const framework::Scope &scope,
                const platform::Place &place) const override {
+    // get device context from pool
+    auto *dev_ctx = platform::DeviceContextPool::Instance().Get(place);
+    platform::RecordEvent record_event(Type(), dev_ctx);
+
     auto feed_var_name = Input("X");
     auto *feed_var = scope.FindVar(feed_var_name);
 
@@ -50,14 +55,10 @@ class FeedOp : public framework::OperatorBase {
     auto &feed_item = feed_list.at(static_cast<size_t>(col));
     auto *out_item = out_var->GetMutable<framework::FeedFetchType>();
 
-    // get device context from pool
-    platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance();
-    auto &dev_ctx = *pool.Get(place);
-
     if (platform::is_same_place(feed_item.place(), place)) {
       out_item->ShareDataWith(feed_item);
     } else {
-      framework::TensorCopy(feed_item, place, dev_ctx, out_item);
+      framework::TensorCopy(feed_item, place, *dev_ctx, out_item);
     }
     out_item->set_lod(feed_item.lod());
   }
diff --git a/paddle/fluid/operators/fetch_op.cc b/paddle/fluid/operators/fetch_op.cc
index d66f01d1b7ce8528a7c0177b2889aff7e0c5a12b..7c7f3e9059fbb1e3f2cca4f04edfff55c9452761 100644
--- a/paddle/fluid/operators/fetch_op.cc
+++ b/paddle/fluid/operators/fetch_op.cc
@@ -15,6 +15,7 @@ limitations under the License. */
 #include "paddle/fluid/framework/feed_fetch_type.h"
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/platform/device_context.h"
+#include "paddle/fluid/platform/profiler.h"
 
 namespace paddle {
 namespace operators {
@@ -29,6 +30,9 @@ class FetchOp : public framework::OperatorBase {
  private:
   void RunImpl(const framework::Scope &scope,
                const platform::Place &place) const override {
+    platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance();
+    platform::RecordEvent record_event(Type(), pool.Get(place));
+
     auto fetch_var_name = Input("X");
     auto *fetch_var = scope.FindVar(fetch_var_name);
     PADDLE_ENFORCE(fetch_var != nullptr,
@@ -53,7 +57,6 @@ class FetchOp : public framework::OperatorBase {
 
     // FIXME(yuyang18): Should we assume the fetch operator always generate
     // CPU outputs?
-    platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance();
     auto &dev_ctx = *pool.Get(src_item.place());
 
     TensorCopy(src_item, platform::CPUPlace(), dev_ctx, &dst_item);
diff --git a/paddle/fluid/operators/load_op.cc b/paddle/fluid/operators/load_op.cc
index 05f809ac5628420251957116bb2390b4502f11b8..6ffe0bec5e38432676ecadfa1abbbe70a1425bb1 100644
--- a/paddle/fluid/operators/load_op.cc
+++ b/paddle/fluid/operators/load_op.cc
@@ -15,6 +15,7 @@ limitations under the License. */
 
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/platform/device_context.h"
+#include "paddle/fluid/platform/profiler.h"
 
 namespace paddle {
 namespace operators {
@@ -29,6 +30,9 @@ class LoadOp : public framework::OperatorBase {
  private:
   void RunImpl(const framework::Scope &scope,
                const platform::Place &place) const override {
+    auto *dev_ctx = platform::DeviceContextPool::Instance().Get(place);
+    platform::RecordEvent record_event(Type(), dev_ctx);
+
     auto filename = Attr<std::string>("file_path");
     std::ifstream fin(filename);
     PADDLE_ENFORCE(static_cast<bool>(fin), "Cannot open file %s for load op",
@@ -41,9 +45,7 @@ class LoadOp : public framework::OperatorBase {
 
     auto *tensor = out_var->GetMutable<framework::LoDTensor>();
 
-    platform::DeviceContextPool &pool = platform::DeviceContextPool::Instance();
-    auto &dev_ctx = *pool.Get(place);
-    DeserializeFromStream(fin, tensor, dev_ctx);
+    DeserializeFromStream(fin, tensor, *dev_ctx);
 
     if (platform::is_gpu_place(place)) {
       // copy CPU to GPU
@@ -55,7 +57,7 @@ class LoadOp : public framework::OperatorBase {
       out_var->Clear();
       tensor = out_var->GetMutable<framework::LoDTensor>();
       tensor->set_lod(cpu_tensor.lod());
-      TensorCopy(cpu_tensor, place, dev_ctx, tensor);
+      TensorCopy(cpu_tensor, place, *dev_ctx, tensor);
     }
   }
 };
diff --git a/paddle/fluid/operators/lod_reset_op.cc b/paddle/fluid/operators/lod_reset_op.cc
index 6a66297cb843ead1a507a6867c1c562224861cbf..7d5687f2d0666d393d7bb1c1a2fdde6c95e6d615 100644
--- a/paddle/fluid/operators/lod_reset_op.cc
+++ b/paddle/fluid/operators/lod_reset_op.cc
@@ -22,17 +22,16 @@ class LoDResetOp : public framework::OperatorWithKernel {
   using framework::OperatorWithKernel::OperatorWithKernel;
 
   void InferShape(framework::InferShapeContext *ctx) const override {
-    // input check
     PADDLE_ENFORCE(ctx->HasInput("X"),
                    "Input(X) of LoDResetOp should not be null.");
     PADDLE_ENFORCE(ctx->HasOutput("Out"),
                    "Output(Out) of LoDResetOp should not be null.");
-    // If target LoD is not set form Input(), then it must be set from Attr().
-    if (!ctx->HasInput("TargetLoD")) {
+
+    if (!ctx->HasInput("Y")) {
       auto level0 = ctx->Attrs().Get<std::vector<int>>("target_lod");
-      PADDLE_ENFORCE(level0.size() > 1,
-                     "Target LoD is not found, should be set to be a valid one "
-                     "through Input() or Attr().");
+      PADDLE_ENFORCE_GT(level0.size(), 1,
+                        "If Input(Y) is not provided, the target lod should "
+                        "be specified by attribute `target_lod`.");
     }
     ctx->SetOutputDim("Out", ctx->GetInputDim("X"));
   }
@@ -50,36 +49,77 @@ class LoDResetOpMaker : public framework::OpProtoAndCheckerMaker {
  public:
   LoDResetOpMaker(OpProto *proto, OpAttrChecker *op_checker)
       : OpProtoAndCheckerMaker(proto, op_checker) {
-    AddInput("X", "(LoDTensor) The input tensor of lod_reset operator.");
-    AddInput("TargetLoD",
-             "(Tensor, optional) The target level 0 LoD from Input().")
+    AddInput("X",
+             "(Tensor, LoDTensor) Input variable of LoDResetOp which "
+             "could be a Tensor or LoDTensor, and from which the output "
+             "variable inherits its data.");
+    AddInput("Y",
+             "(Tensor, LoDTensor, optional) If provided and Y is LoDTensor, "
+             "lod of Input(Y) would be considered as the target lod first, "
+             "otherwise data of Input(Y) would be considered as the "
+             "target lod.")
         .AsDispensable();
-    AddOutput("Out", "(LoDTensor) The output tensor of lod_reset operator.");
+    AddOutput("Out",
+              "(LoDTensor) Output variable of LoDResetOp which should be a "
+              "LoDTensor.");
     AddAttr<std::vector<int>>("target_lod",
                               "The target level 0 LoD from Attr().")
         .SetDefault(std::vector<int>{});
     AddComment(R"DOC(LoDReset operator
 
-Reset LoD of Input(X) into a new one specified by Input(TargetLoD) or
-Attr(target_lod), or set LoD for Input(X) if it doesn't have one.
-Currently the lod_reset operator only supports the reset of level 0 LoD.
-At least one of Input(TargetLoD) and Attr(target_lod) must be set,
-and if both of them are set, Input(TargetLoD) will be chosen as the
-target LoD.
+Set LoD of `X` to a new one specified by `Y` or attribute `target_lod`. When `Y`
+is provided and is a LoDTensor, `Y.lod` is used as the target LoD; otherwise
+`Y.data` is used as the target LoD. If `Y` is not provided, the target LoD
+must be specified by attribute `target_lod`.
+If the target LoD is specified by `Y.data` or `target_lod`, only a one-level
+LoD is supported.
+
+Example 1:
+
+Given a 1-level LoDTensor input(X):
+    X.lod =  [[ 0,     2,                   5,     6 ]]
+    X.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
+    X.dims = [6, 1]
+
+attr(target_lod): [0, 4, 6]
+
+then we get a 1-level LoDTensor:
+    Out.lod =  [[ 0,                   4,            6 ]]
+    Out.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
+    Out.dims = [6, 1]
+
+Example 2:
 
-An example:
-Given a float LoDTensor X with shape (6, 1), its transpose form represents
+Given a 1-level LoDTensor input(X):
+    X.lod =  [[ 0,     2,                   5,     6 ]]
+    X.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
+    X.dims = [6, 1]
 
-    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
+input(Y) is a Tensor:
+    Y.data = [[0, 2, 6]]
+    Y.dims = [1, 3]
 
-with LoD = [[0, 2, 5, 6]] and the three (transposed) sequences look like
+then we get a 1-level LoDTensor:
+    Out.lod =  [[ 0,     2,                          6 ]]
+    Out.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
+    Out.dims = [6, 1]
 
-    [1.0, 2.0], [3.0, 4.0, 5.0], [6.0].
+Example 3:
 
-If target LoD = [0, 4, 6], the lod_reset operator will reset the LoD and
-the sequences that the LoDTensor Output(Out) contains becomes:
+Given a 1-level LoDTensor input(X):
+    X.lod =  [[ 0,     2,                   5,     6 ]]
+    X.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
+    X.dims = [6, 1]
 
-    [1.0, 2.0, 3.0, 4.0], [5.0, 6.0].
+input(Y) is a 2-level LoDTensor:
+    Y.lod =  [[0, 2, 4], [0, 2, 5, 6]]
+    Y.data = [[1.1], [2.1], [3.1], [4.1], [5.1], [6.1]]
+    Y.dims = [6, 1]
+
+then we get a 2-level LoDTensor:
+    Out.lod =  [[0, 2, 4], [0, 2, 5, 6]]
+    Out.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
+    Out.dims = [6, 1]
 
 )DOC");
   }
@@ -90,10 +130,16 @@ class LoDResetGradOp : public framework::OperatorWithKernel {
   using framework::OperatorWithKernel::OperatorWithKernel;
 
   void InferShape(framework::InferShapeContext *ctx) const override {
-    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) shouldn't be null.");
+    PADDLE_ENFORCE(ctx->HasInput("X"),
+                   "Input(X) of LoDResetGradOp should not be null.");
     PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
-                   "Input(Out@GRAD) shouldn't be null.");
-    ctx->SetOutputDim(framework::GradVarName("X"), ctx->GetInputDim("X"));
+                   "Input(Out@GRAD) of LoDResetGradOp should not be null.");
+
+    auto x_grad_name = framework::GradVarName("X");
+    if (ctx->HasOutput(x_grad_name)) {
+      ctx->SetOutputDim(x_grad_name, ctx->GetInputDim("X"));
+      ctx->ShareLoD("X", /*->*/ x_grad_name);
+    }
   }
 
  protected:
@@ -111,9 +157,13 @@ class LoDResetGradOp : public framework::OperatorWithKernel {
 namespace ops = paddle::operators;
 REGISTER_OP(lod_reset, ops::LoDResetOp, ops::LoDResetOpMaker, lod_reset_grad,
             ops::LoDResetGradOp);
-REGISTER_OP_CPU_KERNEL(lod_reset,
-                       ops::LoDResetKernel<paddle::platform::CPUPlace, float>,
-                       ops::LoDResetKernel<paddle::platform::CPUPlace, double>);
+REGISTER_OP_CPU_KERNEL(
+    lod_reset, ops::LoDResetKernel<paddle::platform::CPUPlace, float>,
+    ops::LoDResetKernel<paddle::platform::CPUPlace, double>,
+    ops::LoDResetKernel<paddle::platform::CPUPlace, int>,
+    ops::LoDResetKernel<paddle::platform::CPUPlace, int64_t>);
 REGISTER_OP_CPU_KERNEL(
     lod_reset_grad, ops::LoDResetGradKernel<paddle::platform::CPUPlace, float>,
-    ops::LoDResetGradKernel<paddle::platform::CPUPlace, double>);
+    ops::LoDResetGradKernel<paddle::platform::CPUPlace, double>,
+    ops::LoDResetGradKernel<paddle::platform::CPUPlace, int>,
+    ops::LoDResetGradKernel<paddle::platform::CPUPlace, int64_t>);
diff --git a/paddle/fluid/operators/lod_reset_op.cu b/paddle/fluid/operators/lod_reset_op.cu
index b0e87a851a77a1cc98d419a63d4d9e5e1b9dd163..888d4c12eb4e3f4fd94d8dd4178c59acd0abb23b 100644
--- a/paddle/fluid/operators/lod_reset_op.cu
+++ b/paddle/fluid/operators/lod_reset_op.cu
@@ -18,8 +18,12 @@ namespace ops = paddle::operators;
 
 REGISTER_OP_CUDA_KERNEL(
     lod_reset, ops::LoDResetKernel<paddle::platform::CUDADeviceContext, float>,
-    ops::LoDResetKernel<paddle::platform::CUDADeviceContext, double>);
+    ops::LoDResetKernel<paddle::platform::CUDADeviceContext, double>,
+    ops::LoDResetKernel<paddle::platform::CUDADeviceContext, int>,
+    ops::LoDResetKernel<paddle::platform::CUDADeviceContext, int64_t>);
 REGISTER_OP_CUDA_KERNEL(
     lod_reset_grad,
     ops::LoDResetGradKernel<paddle::platform::CUDADeviceContext, float>,
-    ops::LoDResetGradKernel<paddle::platform::CUDADeviceContext, double>);
+    ops::LoDResetGradKernel<paddle::platform::CUDADeviceContext, double>,
+    ops::LoDResetGradKernel<paddle::platform::CUDADeviceContext, int>,
+    ops::LoDResetGradKernel<paddle::platform::CUDADeviceContext, int64_t>);
diff --git a/paddle/fluid/operators/lod_reset_op.h b/paddle/fluid/operators/lod_reset_op.h
index 8186d4f8262101edc723af390eee1aec4fa6f3a5..99f01c2a255ade81421c2bba95ff3d38ced6f87c 100644
--- a/paddle/fluid/operators/lod_reset_op.h
+++ b/paddle/fluid/operators/lod_reset_op.h
@@ -26,35 +26,46 @@ class LoDResetKernel : public framework::OpKernel<T> {
   void Compute(const framework::ExecutionContext& ctx) const {
     auto* out = ctx.Output<framework::LoDTensor>("Out");
     auto* in = ctx.Input<framework::LoDTensor>("X");
-    auto* lod_t = ctx.Input<framework::Tensor>("TargetLoD");
+    auto* lod_t = ctx.Input<framework::LoDTensor>("Y");
+
+    out->ShareDataWith(*in);
 
     std::vector<int> level0;
     if (lod_t) {
-      auto* lod = lod_t->data<int>();
-      if (platform::is_gpu_place(ctx.GetPlace())) {
-        framework::Tensor lod_cpu;
-        framework::TensorCopy(*lod_t, platform::CPUPlace(),
-                              ctx.device_context(), &lod_cpu);
-        lod = lod_cpu.data<int>();
+      if (lod_t->lod().size() > 0) {
+        auto y_lod = lod_t->lod();
+        auto last_level = y_lod[y_lod.size() - 1];
+        PADDLE_ENFORCE_EQ(last_level.back(), in->dims()[0],
+                          "Last value of `Y`'s last level LoD should be equal "
+                          "to the first dimension of `X`");
+        out->set_lod(y_lod);
+        return;  // early return, since lod is already set
+      } else {
+        auto* lod = lod_t->data<int>();
+        if (platform::is_gpu_place(ctx.GetPlace())) {
+          framework::Tensor lod_cpu;
+          framework::TensorCopy(*lod_t, platform::CPUPlace(),
+                                ctx.device_context(), &lod_cpu);
+          lod = lod_cpu.data<int>();
+        }
+        level0 = std::vector<int>(lod, lod + lod_t->numel());
       }
-      level0 = std::vector<int>(lod, lod + lod_t->numel());
     } else {
       level0 = ctx.Attr<std::vector<int>>("target_lod");
     }
 
-    PADDLE_ENFORCE(level0.size() > 1UL,
-                   "The size of target LoD should be greater than 1.");
-    PADDLE_ENFORCE(level0[0] == 0,
-                   "Target LoD should be a vector starting from 0.");
-    PADDLE_ENFORCE(level0.back() == in->dims()[0],
-                   "Target LoD should be a vector end with the "
-                   "first dimension of Input(X).");
+    PADDLE_ENFORCE_GT(level0.size(), 1UL,
+                      "Size of target LoD should be greater than 1.");
+    PADDLE_ENFORCE_EQ(level0[0], 0,
+                      "Target LoD should be a vector starting from 0.");
+    PADDLE_ENFORCE_EQ(level0.back(), in->dims()[0],
+                      "Target LoD should be a vector ending with the "
+                      "first dimension of Input(X).");
     for (size_t i = 0; i < level0.size() - 1; ++i) {
       PADDLE_ENFORCE(level0[i + 1] > level0[i],
                      "Target LoD should be an ascending vector.");
     }
 
-    out->ShareDataWith(*in);
     // cast level0 to size_t
     std::vector<size_t> ulevel0(level0.size(), 0);
     std::transform(level0.begin(), level0.end(), ulevel0.begin(),
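Review note: the level-0 checks in this kernel (size greater than 1, starts at 0, ends at the first dimension of `X`, strictly ascending) can be exercised outside the framework. The sketch below is a minimal standalone version under stated assumptions: `CheckedLevel0` and `SplitByLoD` are hypothetical helpers that do not exist in Paddle, and exceptions stand in for `PADDLE_ENFORCE`. It reproduces Example 1 from the operator's DOC string.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of Paddle): applies the same four checks the
// kernel enforces on a level-0 target LoD, then casts the offsets to size_t.
std::vector<size_t> CheckedLevel0(const std::vector<int>& level0,
                                  int64_t first_dim) {
  if (level0.size() <= 1) {
    throw std::invalid_argument("size of target LoD should be > 1");
  }
  if (level0.front() != 0) {
    throw std::invalid_argument("target LoD should start from 0");
  }
  if (level0.back() != first_dim) {
    throw std::invalid_argument("target LoD should end with X.dims[0]");
  }
  for (size_t i = 0; i + 1 < level0.size(); ++i) {
    if (level0[i + 1] <= level0[i]) {
      throw std::invalid_argument("target LoD should be ascending");
    }
  }
  return std::vector<size_t>(level0.begin(), level0.end());
}

// Hypothetical helper: splits rows into sequences using LoD offsets, where
// sequence i covers rows [lod[i], lod[i + 1]).
std::vector<std::vector<float>> SplitByLoD(const std::vector<float>& rows,
                                           const std::vector<size_t>& lod) {
  std::vector<std::vector<float>> sequences;
  for (size_t i = 0; i + 1 < lod.size(); ++i) {
    sequences.emplace_back(rows.begin() + lod[i], rows.begin() + lod[i + 1]);
  }
  return sequences;
}
```

With `target_lod = [0, 4, 6]` and six rows, `SplitByLoD` returns two sequences of lengths 4 and 2, matching `Out.lod` in Example 1. A non-ascending LoD such as `[0, 4, 3, 6]` is rejected.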
diff --git a/paddle/fluid/operators/lrn_op.cc b/paddle/fluid/operators/lrn_op.cc
index b0c213d637c244e4cbacbe75218537973efed047..692e85dcffa583abcb22a1629953badc67489efa 100644
--- a/paddle/fluid/operators/lrn_op.cc
+++ b/paddle/fluid/operators/lrn_op.cc
@@ -36,7 +36,7 @@ struct LRNFunctor<platform::CPUDeviceContext, T> {
     auto e_x = framework::EigenTensor<T, 4>::From(input);
     for (int m = 0; m < N; m++) {
       for (int i = 0; i < C; i++) {
-        for (int c = start; c <= end; c++) {
+        for (int c = start; c < end; c++) {
           int ch = i + c;
           if (ch >= 0 && ch < C) {
             auto s = e_mid.slice(Eigen::array<int, 4>({{m, i, 0, 0}}),
@@ -92,7 +92,7 @@ struct LRNGradFunctor<platform::CPUDeviceContext, T> {
                                  Eigen::array<int, 4>({{1, 1, H, W}}));
 
         i_x_g = i_mid.pow(-beta) * i_out_g;
-        for (int c = start; c <= end; c++) {
+        for (int c = start; c < end; c++) {
           int ch = i + c;
           if (ch < 0 || ch >= C) {
             continue;
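Review note: changing `c <= end` to `c < end` shrinks the normalization window by one channel. The hunks do not show how `start` and `end` are computed, so the sketch below assumes `start = -(n - 1) / 2` and `end = start + n` for a window of size `n`. Under that assumption the inclusive bound visits `n + 1` channels and the exclusive bound visits exactly `n`. `WindowChannels` is a hypothetical helper written for illustration, not code from the functor.

```cpp
#include <vector>

// Hypothetical helper: lists the channel indices that the window around
// channel i would touch. The start/end formulas are an ASSUMPTION, since
// the diff hunks above do not show them.
std::vector<int> WindowChannels(int i, int C, int n, bool inclusive_end) {
  const int start = -(n - 1) / 2;  // assumed lower offset of the window
  const int end = start + n;       // assumed one-past-the-last offset
  std::vector<int> channels;
  for (int c = start; inclusive_end ? (c <= end) : (c < end); ++c) {
    const int ch = i + c;
    if (ch >= 0 && ch < C) {  // same bounds check as in the functor
      channels.push_back(ch);
    }
  }
  return channels;
}
```

For `i = 2`, `C = 8`, `n = 5`, the exclusive bound visits channels 0 through 4 (five channels), while the inclusive bound also visits channel 5.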
diff --git a/paddle/fluid/operators/math/CMakeLists.txt b/paddle/fluid/operators/math/CMakeLists.txt
index a181d802262d15b188060dae4330cec0e24714ab..547d081006f1c28ba73cb02d38e36bb612cea494 100644
--- a/paddle/fluid/operators/math/CMakeLists.txt
+++ b/paddle/fluid/operators/math/CMakeLists.txt
@@ -38,12 +38,12 @@ math_library(lstm_compute DEPS activation_functions)
 math_library(math_function DEPS cblas)
 math_library(maxouting)
 math_library(pooling)
-math_library(selected_rows_functor DEPS selected_rows)
+math_library(selected_rows_functor DEPS selected_rows math_function)
 math_library(sequence2batch)
 math_library(sequence_padding)
 math_library(sequence_pooling DEPS math_function)
 math_library(sequence_scale)
-math_library(softmax)
+math_library(softmax DEPS math_function)
 math_library(unpooling)
 math_library(vol2col)
 
diff --git a/paddle/fluid/operators/math/concat.cc b/paddle/fluid/operators/math/concat.cc
index b542143419e05e9baf29e9a2322447f32ddd9829..b672c79afd97e36894af647fd4bc6edfb885ff13 100644
--- a/paddle/fluid/operators/math/concat.cc
+++ b/paddle/fluid/operators/math/concat.cc
@@ -44,7 +44,7 @@ class ConcatFunctor<platform::CPUDeviceContext, T> {
       out_cols += t_cols;
       input_cols[i] = t_cols;
     }
-    auto& cpu_place = boost::get<platform::CPUPlace>(context.GetPlace());
+    auto cpu_place = boost::get<platform::CPUPlace>(context.GetPlace());
 
     // computation
     for (int k = 0; k < out_rows; ++k) {
@@ -87,7 +87,7 @@ class ConcatGradFunctor<platform::CPUDeviceContext, T> {
       input_cols += t_cols;
       output_cols[i] = t_cols;
     }
-    auto& cpu_place = boost::get<platform::CPUPlace>(context.GetPlace());
+    auto cpu_place = boost::get<platform::CPUPlace>(context.GetPlace());
 
     // computation
     for (int k = 0; k < input_rows; ++k) {
diff --git a/paddle/fluid/operators/math/detection_util.h b/paddle/fluid/operators/math/detection_util.h
deleted file mode 100644
index c31764cfaf5bbdfea2f3ed06f31f97965a8858ed..0000000000000000000000000000000000000000
--- a/paddle/fluid/operators/math/detection_util.h
+++ /dev/null
@@ -1,300 +0,0 @@
-/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-    http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License. */
-#pragma once
-#include <map>
-#include "paddle/fluid/framework/selected_rows.h"
-#include "paddle/fluid/platform/device_context.h"
-
-namespace paddle {
-namespace operators {
-namespace math {
-template <typename T>
-struct BBox {
-  BBox(T x_min, T y_min, T x_max, T y_max)
-      : x_min(x_min),
-        y_min(y_min),
-        x_max(x_max),
-        y_max(y_max),
-        is_difficult(false) {}
-
-  BBox() {}
-
-  T get_width() const { return x_max - x_min; }
-
-  T get_height() const { return y_max - y_min; }
-
-  T get_center_x() const { return (x_min + x_max) / 2; }
-
-  T get_center_y() const { return (y_min + y_max) / 2; }
-
-  T get_area() const { return get_width() * get_height(); }
-
-  // coordinate of bounding box
-  T x_min;
-  T y_min;
-  T x_max;
-  T y_max;
-  // whether difficult object (e.g. object with heavy occlusion is difficult)
-  bool is_difficult;
-};
-// KNCHW ==> NHWC
-// template <typename T>
-template <typename T>
-void GetBBoxFromPriorData(const T* prior_data, const size_t num_bboxes,
-                          std::vector<BBox<T>>& bbox_vec);
-template <typename T>
-void GetBBoxVarFromPriorData(const T* prior_data, const size_t num,
-                             std::vector<std::vector<T>>& var_vec);
-template <typename T>
-BBox<T> DecodeBBoxWithVar(BBox<T>& prior_bbox,
-                          const std::vector<T>& prior_bbox_var,
-                          const std::vector<T>& loc_pred_data);
-template <typename T1, typename T2>
-bool SortScorePairDescend(const std::pair<T1, T2>& pair1,
-                          const std::pair<T1, T2>& pair2);
-template <typename T>
-bool SortScorePairDescend(const std::pair<T, BBox<T>>& pair1,
-                          const std::pair<T, BBox<T>>& pair2);
-template <typename T>
-T jaccard_overlap(const BBox<T>& bbox1, const BBox<T>& bbox2);
-
-template <typename T>
-void ApplyNmsFast(const std::vector<BBox<T>>& bboxes, const T* conf_score_data,
-                  size_t class_idx, size_t top_k, T conf_threshold,
-                  T nms_threshold, size_t num_priors, size_t num_classes,
-                  std::vector<size_t>* indices);
-template <typename T>
-int GetDetectionIndices(
-    const T* conf_data, const size_t num_priors, const size_t num_classes,
-    const size_t background_label_id, const size_t batch_size,
-    const T conf_threshold, const size_t nms_top_k, const T nms_threshold,
-    const size_t top_k,
-    const std::vector<std::vector<BBox<T>>>& all_decoded_bboxes,
-    std::vector<std::map<size_t, std::vector<size_t>>>* all_detection_indices);
-template <typename T>
-BBox<T> ClipBBox(const BBox<T>& bbox);
-template <typename T>
-void GetDetectionOutput(
-    const T* conf_data, const size_t num_kept, const size_t num_priors,
-    const size_t num_classes, const size_t batch_size,
-    const std::vector<std::map<size_t, std::vector<size_t>>>& all_indices,
-    const std::vector<std::vector<BBox<T>>>& all_decoded_bboxes, T* out_data);
-template <typename T>
-void GetBBoxFromPriorData(const T* prior_data, const size_t num_bboxes,
-                          std::vector<BBox<T>>& bbox_vec) {
-  size_t out_offset = bbox_vec.size();
-  bbox_vec.resize(bbox_vec.size() + num_bboxes);
-  for (size_t i = 0; i < num_bboxes; ++i) {
-    BBox<T> bbox;
-    bbox.x_min = *(prior_data + i * 8);
-    bbox.y_min = *(prior_data + i * 8 + 1);
-    bbox.x_max = *(prior_data + i * 8 + 2);
-    bbox.y_max = *(prior_data + i * 8 + 3);
-    bbox_vec[out_offset + i] = bbox;
-  }
-}
-template <typename T>
-void GetBBoxVarFromPriorData(const T* prior_data, const size_t num,
-                             std::vector<std::vector<T>>& var_vec) {
-  size_t out_offset = var_vec.size();
-  var_vec.resize(var_vec.size() + num);
-  for (size_t i = 0; i < num; ++i) {
-    std::vector<T> var;
-    var.push_back(*(prior_data + i * 8 + 4));
-    var.push_back(*(prior_data + i * 8 + 5));
-    var.push_back(*(prior_data + i * 8 + 6));
-    var.push_back(*(prior_data + i * 8 + 7));
-    var_vec[out_offset + i] = var;
-  }
-}
-template <typename T>
-BBox<T> DecodeBBoxWithVar(BBox<T>& prior_bbox,
-                          const std::vector<T>& prior_bbox_var,
-                          const std::vector<T>& loc_pred_data) {
-  T prior_bbox_width = prior_bbox.get_width();
-  T prior_bbox_height = prior_bbox.get_height();
-  T prior_bbox_center_x = prior_bbox.get_center_x();
-  T prior_bbox_center_y = prior_bbox.get_center_y();
-
-  T decoded_bbox_center_x =
-      prior_bbox_var[0] * loc_pred_data[0] * prior_bbox_width +
-      prior_bbox_center_x;
-  T decoded_bbox_center_y =
-      prior_bbox_var[1] * loc_pred_data[1] * prior_bbox_height +
-      prior_bbox_center_y;
-  T decoded_bbox_width =
-      std::exp(prior_bbox_var[2] * loc_pred_data[2]) * prior_bbox_width;
-  T decoded_bbox_height =
-      std::exp(prior_bbox_var[3] * loc_pred_data[3]) * prior_bbox_height;
-
-  BBox<T> decoded_bbox;
-  decoded_bbox.x_min = decoded_bbox_center_x - decoded_bbox_width / 2;
-  decoded_bbox.y_min = decoded_bbox_center_y - decoded_bbox_height / 2;
-  decoded_bbox.x_max = decoded_bbox_center_x + decoded_bbox_width / 2;
-  decoded_bbox.y_max = decoded_bbox_center_y + decoded_bbox_height / 2;
-
-  return decoded_bbox;
-}
-template <typename T1, typename T2>
-bool SortScorePairDescend(const std::pair<T1, T2>& pair1,
-                          const std::pair<T1, T2>& pair2) {
-  return pair1.first > pair2.first;
-}
-template <typename T>
-T jaccard_overlap(const BBox<T>& bbox1, const BBox<T>& bbox2) {
-  if (bbox2.x_min > bbox1.x_max || bbox2.x_max < bbox1.x_min ||
-      bbox2.y_min > bbox1.y_max || bbox2.y_max < bbox1.y_min) {
-    return 0.0;
-  } else {
-    T inter_x_min = std::max(bbox1.x_min, bbox2.x_min);
-    T inter_y_min = std::max(bbox1.y_min, bbox2.y_min);
-    T interX_max = std::min(bbox1.x_max, bbox2.x_max);
-    T interY_max = std::min(bbox1.y_max, bbox2.y_max);
-
-    T inter_width = interX_max - inter_x_min;
-    T inter_height = interY_max - inter_y_min;
-    T inter_area = inter_width * inter_height;
-
-    T bbox_area1 = bbox1.get_area();
-    T bbox_area2 = bbox2.get_area();
-
-    return inter_area / (bbox_area1 + bbox_area2 - inter_area);
-  }
-}
-
-template <typename T>
-void ApplyNmsFast(const std::vector<BBox<T>>& bboxes, const T* conf_score_data,
-                  size_t class_idx, size_t top_k, T conf_threshold,
-                  T nms_threshold, size_t num_priors, size_t num_classes,
-                  std::vector<size_t>* indices) {
-  std::vector<std::pair<T, size_t>> scores;
-  for (size_t i = 0; i < num_priors; ++i) {
-    size_t conf_offset = i * num_classes + class_idx;
-    if (conf_score_data[conf_offset] > conf_threshold)
-      scores.push_back(std::make_pair(conf_score_data[conf_offset], i));
-  }
-  std::stable_sort(scores.begin(), scores.end(),
-                   SortScorePairDescend<T, size_t>);
-  if (top_k > 0 && top_k < scores.size()) scores.resize(top_k);
-  while (scores.size() > 0) {
-    const size_t idx = scores.front().second;
-    bool keep = true;
-    for (size_t i = 0; i < indices->size(); ++i) {
-      if (keep) {
-        const size_t saved_idx = (*indices)[i];
-        T overlap = jaccard_overlap<T>(bboxes[idx], bboxes[saved_idx]);
-        keep = overlap <= nms_threshold;
-      } else {
-        break;
-      }
-    }
-    if (keep) indices->push_back(idx);
-    scores.erase(scores.begin());
-  }
-}
-template <typename T>
-int GetDetectionIndices(
-    const T* conf_data, const size_t num_priors, const size_t num_classes,
-    const size_t background_label_id, const size_t batch_size,
-    const T conf_threshold, const size_t nms_top_k, const T nms_threshold,
-    const size_t top_k,
-    const std::vector<std::vector<BBox<T>>>& all_decoded_bboxes,
-    std::vector<std::map<size_t, std::vector<size_t>>>* all_detection_indices) {
-  int total_keep_num = 0;
-  for (size_t n = 0; n < batch_size; ++n) {
-    const std::vector<BBox<T>>& decoded_bboxes = all_decoded_bboxes[n];
-    size_t num_detected = 0;
-    std::map<size_t, std::vector<size_t>> indices;
-    size_t conf_offset = n * num_priors * num_classes;
-    for (size_t c = 0; c < num_classes; ++c) {
-      if (c == background_label_id) continue;
-      ApplyNmsFast<T>(decoded_bboxes, conf_data + conf_offset, c, nms_top_k,
-                      conf_threshold, nms_threshold, num_priors, num_classes,
-                      &(indices[c]));
-      num_detected += indices[c].size();
-    }
-    if (top_k > 0 && num_detected > top_k) {
-      // std::vector<pair<T,T>> score_index_pairs;
-      std::vector<std::pair<T, std::pair<size_t, size_t>>> score_index_pairs;
-      for (size_t c = 0; c < num_classes; ++c) {
-        const std::vector<size_t>& label_indices = indices[c];
-        for (size_t i = 0; i < label_indices.size(); ++i) {
-          size_t idx = label_indices[i];
-          score_index_pairs.push_back(
-              std::make_pair((conf_data + conf_offset)[idx * num_classes + c],
-                             std::make_pair(c, idx)));
-        }
-      }
-      std::sort(score_index_pairs.begin(), score_index_pairs.end(),
-                SortScorePairDescend<T, std::pair<size_t, size_t>>);
-      score_index_pairs.resize(top_k);
-      std::map<size_t, std::vector<size_t>> new_indices;
-      for (size_t i = 0; i < score_index_pairs.size(); ++i) {
-        size_t label = score_index_pairs[i].second.first;
-        size_t idx = score_index_pairs[i].second.second;
-        new_indices[label].push_back(idx);
-      }
-      all_detection_indices->push_back(new_indices);
-      total_keep_num += top_k;
-    } else {
-      all_detection_indices->push_back(indices);
-      total_keep_num += num_detected;
-    }
-  }
-  return total_keep_num;
-}
-template <typename T>
-BBox<T> ClipBBox(const BBox<T>& bbox) {
-  T one = static_cast<T>(1.0);
-  T zero = static_cast<T>(0.0);
-  BBox<T> clipped_bbox;
-  clipped_bbox.x_min = std::max(std::min(bbox.x_min, one), zero);
-  clipped_bbox.y_min = std::max(std::min(bbox.y_min, one), zero);
-  clipped_bbox.x_max = std::max(std::min(bbox.x_max, one), zero);
-  clipped_bbox.y_max = std::max(std::min(bbox.y_max, one), zero);
-  return clipped_bbox;
-}
-template <typename T>
-void GetDetectionOutput(
-    const T* conf_data, const size_t num_kept, const size_t num_priors,
-    const size_t num_classes, const size_t batch_size,
-    const std::vector<std::map<size_t, std::vector<size_t>>>& all_indices,
-    const std::vector<std::vector<BBox<T>>>& all_decoded_bboxes, T* out_data) {
-  size_t count = 0;
-  for (size_t n = 0; n < batch_size; ++n) {
-    for (std::map<size_t, std::vector<size_t>>::const_iterator it =
-             all_indices[n].begin();
-         it != all_indices[n].end(); ++it) {
-      size_t label = it->first;
-      const std::vector<size_t>& indices = it->second;
-      const std::vector<BBox<T>>& decoded_bboxes = all_decoded_bboxes[n];
-      for (size_t i = 0; i < indices.size(); ++i) {
-        size_t idx = indices[i];
-        size_t conf_offset = n * num_priors * num_classes + idx * num_classes;
-        out_data[count * 7] = n;
-        out_data[count * 7 + 1] = label;
-        out_data[count * 7 + 2] = (conf_data + conf_offset)[label];
-        BBox<T> clipped_bbox = ClipBBox<T>(decoded_bboxes[idx]);
-        out_data[count * 7 + 3] = clipped_bbox.x_min;
-        out_data[count * 7 + 4] = clipped_bbox.y_min;
-        out_data[count * 7 + 5] = clipped_bbox.x_max;
-        out_data[count * 7 + 6] = clipped_bbox.y_max;
-        ++count;
-      }
-    }
-  }
-}
-}  // namespace math
-}  // namespace operators
-}  // namespace paddle
diff --git a/paddle/fluid/operators/math/math_function.cc b/paddle/fluid/operators/math/math_function.cc
index 35d251f71a0cb631d5900498ea3188b5ddeae334..17e576a9d5c8f50fbe84b066a93460f03ae6bb08 100644
--- a/paddle/fluid/operators/math/math_function.cc
+++ b/paddle/fluid/operators/math/math_function.cc
@@ -371,6 +371,8 @@ template struct RowwiseAdd<platform::CPUDeviceContext, double>;
 
 template struct ColwiseSum<platform::CPUDeviceContext, float>;
 template struct ColwiseSum<platform::CPUDeviceContext, double>;
+template struct ColwiseSum<platform::CPUDeviceContext, int>;
+template struct ColwiseSum<platform::CPUDeviceContext, int64_t>;
 
 template struct RowwiseSum<platform::CPUDeviceContext, float>;
 template struct RowwiseSum<platform::CPUDeviceContext, double>;
diff --git a/paddle/fluid/operators/math/math_function.cu b/paddle/fluid/operators/math/math_function.cu
index 3abbcdb71d03eaf6f8eba3d97150d27ac5a5405e..c6ca2693a053360ce5dc44765acf1520a11cce2c 100644
--- a/paddle/fluid/operators/math/math_function.cu
+++ b/paddle/fluid/operators/math/math_function.cu
@@ -422,6 +422,8 @@ struct RowwiseAdd<platform::CUDADeviceContext, T> {
 template struct RowwiseAdd<platform::CUDADeviceContext, float>;
 template struct RowwiseAdd<platform::CUDADeviceContext, double>;
 template struct ColwiseSum<platform::CUDADeviceContext, float>;
+template struct ColwiseSum<platform::CUDADeviceContext, int>;
+template struct ColwiseSum<platform::CUDADeviceContext, int64_t>;
 // template struct ColwiseSum<platform::CUDADeviceContext, double>;
 // The ColwiseSum<platform::CUDADeviceContext, double> failed in debug mode,
 // and only failed for this case. So reimplemented it.
diff --git a/paddle/fluid/operators/parallel_do_op.cc b/paddle/fluid/operators/parallel_do_op.cc
index bf4d0476df32d7454d4064cb6ee454e6ad5d6fc5..4001b9a130348b4e3ea99f3017eae6d85e41fc6e 100644
--- a/paddle/fluid/operators/parallel_do_op.cc
+++ b/paddle/fluid/operators/parallel_do_op.cc
@@ -18,6 +18,7 @@ limitations under the License. */
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/framework/threadpool.h"
 #include "paddle/fluid/operators/detail/safe_ref.h"
+#include "paddle/fluid/platform/profiler.h"
 
 namespace paddle {
 namespace operators {
@@ -158,11 +159,14 @@ class ParallelDoOp : public framework::OperatorBase {
       auto &place = places[place_idx];
       auto *cur_scope = sub_scopes[place_idx];
 
-      workers.emplace_back(framework::Async([program, cur_scope, place, block] {
-        framework::Executor executor(place);
-        executor.Run(*program, cur_scope, block->ID(),
-                     false /*create_local_scope*/);
-      }));
+      workers.emplace_back(
+          framework::Async([program, cur_scope, place, block, place_idx] {
+            // Give the thread an id to distinguish parallel block with same id.
+            platform::RecordThread rt(static_cast<int>(place_idx) + 1);
+            framework::Executor executor(place);
+            executor.Run(*program, cur_scope, block->ID(),
+                         false /*create_local_scope*/);
+          }));
     }
     for (auto &worker : workers) {
       worker.wait();
@@ -234,11 +238,14 @@ class ParallelDoGradOp : public framework::OperatorBase {
       auto *cur_scope = sub_scopes[i];
 
       // execute
-      workers.emplace_back(framework::Async([program, cur_scope, place, block] {
-        framework::Executor executor(place);
-        executor.Run(*program, cur_scope, block->ID(),
-                     false /*create_local_scope*/);
-      }));
+      workers.emplace_back(
+          framework::Async([program, cur_scope, place, block, i] {
+            // Give the thread an id to distinguish parallel block with same id.
+            platform::RecordThread rt(static_cast<int>(i) + 1);
+            framework::Executor executor(place);
+            executor.Run(*program, cur_scope, block->ID(),
+                         false /*create_local_scope*/);
+          }));
     }
     for (auto &worker : workers) {
       worker.wait();
diff --git a/paddle/fluid/operators/pool_cudnn_op.cu.cc b/paddle/fluid/operators/pool_cudnn_op.cu.cc
index 781d96981e4c033d9287ab3de9860dfd9fcd2875..39c862b03ad497dca5c38ccecff20be510ab60e5 100644
--- a/paddle/fluid/operators/pool_cudnn_op.cu.cc
+++ b/paddle/fluid/operators/pool_cudnn_op.cu.cc
@@ -24,6 +24,8 @@ using ScopedTensorDescriptor = platform::ScopedTensorDescriptor;
 using ScopedPoolingDescriptor = platform::ScopedPoolingDescriptor;
 using DataLayout = platform::DataLayout;
 using PoolingMode = platform::PoolingMode;
+template <typename T>
+using ScalingParamType = typename platform::CudnnDataType<T>::ScalingParamType;
 
 template <typename T>
 class PoolCUDNNOpKernel : public framework::OpKernel<T> {
@@ -78,8 +80,7 @@ class PoolCUDNNOpKernel : public framework::OpKernel<T> {
 
     // ------------------- cudnn pool algorithm ---------------------
     auto handle = ctx.cuda_device_context().cudnn_handle();
-    T alpha = 1.0f, beta = 0.0f;
-
+    ScalingParamType<T> alpha = 1.0f, beta = 0.0f;
     PADDLE_ENFORCE(platform::dynload::cudnnPoolingForward(
         handle, cudnn_pool_desc, &alpha, cudnn_input_desc, input_data, &beta,
         cudnn_output_desc, output_data));
@@ -144,8 +145,7 @@ class PoolCUDNNGradOpKernel : public framework::OpKernel<T> {
 
     // ------------------- cudnn pool algorithm ---------------------
     auto handle = ctx.cuda_device_context().cudnn_handle();
-    T alpha = 1.0f, beta = 0.0f;
-
+    ScalingParamType<T> alpha = 1.0f, beta = 0.0f;
     if (input_grad) {
       T *input_grad_data = input_grad->mutable_data<T>(ctx.GetPlace());
       // Because beta is zero, it is unnecessary to reset input_grad.
@@ -162,17 +162,19 @@ class PoolCUDNNGradOpKernel : public framework::OpKernel<T> {
 }  // namespace paddle
 
 namespace ops = paddle::operators;
+namespace plat = paddle::platform;
 
-REGISTER_OP_KERNEL(pool2d, CUDNN, ::paddle::platform::CUDAPlace,
+REGISTER_OP_KERNEL(pool2d, CUDNN, plat::CUDAPlace,
                    ops::PoolCUDNNOpKernel<float>,
-                   ops::PoolCUDNNOpKernel<double>);
-REGISTER_OP_KERNEL(pool2d_grad, CUDNN, ::paddle::platform::CUDAPlace,
+                   ops::PoolCUDNNOpKernel<double>,
+                   ops::PoolCUDNNOpKernel<plat::float16>);
+REGISTER_OP_KERNEL(pool2d_grad, CUDNN, plat::CUDAPlace,
                    ops::PoolCUDNNGradOpKernel<float>,
                    ops::PoolCUDNNGradOpKernel<double>);
 
-REGISTER_OP_KERNEL(pool3d, CUDNN, ::paddle::platform::CUDAPlace,
+REGISTER_OP_KERNEL(pool3d, CUDNN, plat::CUDAPlace,
                    ops::PoolCUDNNOpKernel<float>,
                    ops::PoolCUDNNOpKernel<double>);
-REGISTER_OP_KERNEL(pool3d_grad, CUDNN, ::paddle::platform::CUDAPlace,
+REGISTER_OP_KERNEL(pool3d_grad, CUDNN, plat::CUDAPlace,
                    ops::PoolCUDNNGradOpKernel<float>,
                    ops::PoolCUDNNGradOpKernel<double>);
diff --git a/paddle/fluid/operators/pool_op.cc b/paddle/fluid/operators/pool_op.cc
index d78da10016a0e2b1d9a0ca9f3dfe4e8009bbe61d..b144ec5f7d315cb340dcd94b4a519bfcfd2a0e66 100644
--- a/paddle/fluid/operators/pool_op.cc
+++ b/paddle/fluid/operators/pool_op.cc
@@ -124,11 +124,15 @@ framework::OpKernelType PoolOpGrad::GetExpectedKernelType(
   }
 #endif
 
+  auto input_data_type = framework::ToDataType(ctx.Input<Tensor>("X")->type());
+  if (input_data_type == framework::proto::VarType::FP16) {
+    PADDLE_ENFORCE_EQ(library_, framework::LibraryType::kCUDNN,
+                      "float16 can only be used when CUDNN is used");
+  }
   std::string data_format = ctx.Attr<std::string>("data_format");
   framework::DataLayout layout_ = framework::StringToDataLayout(data_format);
-  return framework::OpKernelType(
-      framework::ToDataType(ctx.Input<Tensor>("X")->type()), ctx.GetPlace(),
-      layout_, library_);
+  return framework::OpKernelType(input_data_type, ctx.GetPlace(), layout_,
+                                 library_);
 }
 
 Pool2dOpMaker::Pool2dOpMaker(OpProto *proto, OpAttrChecker *op_checker)
diff --git a/paddle/fluid/operators/reader/create_double_buffer_reader_op.cc b/paddle/fluid/operators/reader/create_double_buffer_reader_op.cc
index 447fae10535c1b458ed7de24ad3659b3c48ecb4a..76cdb794ccdb4a015ae8630940a5c26845e7a7b3 100644
--- a/paddle/fluid/operators/reader/create_double_buffer_reader_op.cc
+++ b/paddle/fluid/operators/reader/create_double_buffer_reader_op.cc
@@ -48,20 +48,24 @@ class DoubleBufferReader : public framework::DecoratedReader {
 
   void start_thread() {
     buffer_ = framework::MakeChannel<Item>(kDoubleBufferSize);
-    std::thread prefetch([this] { PrefetchThreadFunc(); });
-    prefetch.detach();
+    prefetcher_ = std::thread([this] { PrefetchThreadFunc(); });
   }
 
   void ReadNext(std::vector<framework::LoDTensor>* out) override;
   void ReInit() override;
 
-  ~DoubleBufferReader() { buffer_->Close(); }
+  ~DoubleBufferReader() {
+    buffer_->Close();
+    prefetcher_.join();
+    delete buffer_;
+  }
 
   bool HasNext() const override;
 
  private:
   void PrefetchThreadFunc();
 
+  std::thread prefetcher_;
   framework::Channel<Item>* buffer_;
   platform::Place place_;
   std::vector<std::unique_ptr<platform::DeviceContext>> ctxs_;
@@ -137,6 +141,8 @@ void DoubleBufferReader::ReadNext(std::vector<framework::LoDTensor>* out) {
 void DoubleBufferReader::ReInit() {
   reader_->ReInit();
   buffer_->Close();
+  prefetcher_.join();
+  delete buffer_;
   start_thread();
 }
 
@@ -162,11 +168,12 @@ void DoubleBufferReader::PrefetchThreadFunc() {
 
     if (!buffer_->Send(&batch)) {
       VLOG(5) << "WARNING: The double buffer channel has been closed. The "
-                 "prefetch thread terminates.";
+                 "prefetch thread will terminate.";
       break;
     }
   }
   buffer_->Close();
+  VLOG(5) << "Prefetch thread terminates.";
 }
 
 bool DoubleBufferReader::HasNext() const {
diff --git a/paddle/fluid/operators/reader/create_shuffle_reader_op.cc b/paddle/fluid/operators/reader/create_shuffle_reader_op.cc
index 70e2f587dc414a850ddc341b98f26ae54636755c..3a1f3805a0483c2f5eabdc7432556051d8308964 100644
--- a/paddle/fluid/operators/reader/create_shuffle_reader_op.cc
+++ b/paddle/fluid/operators/reader/create_shuffle_reader_op.cc
@@ -34,6 +34,9 @@ class ShuffleReader : public framework::DecoratedReader {
   }
 
   void ReadNext(std::vector<framework::LoDTensor>* out) override {
+    if (!HasNext()) {
+      PADDLE_THROW("There is no next data!");
+    }
     if (iteration_pos_ >= buffer_.size()) {
       VLOG(10) << "Resetting shuffle buffer";
       ReadIntoBuffers();
@@ -50,7 +53,6 @@ class ShuffleReader : public framework::DecoratedReader {
     buffer_.clear();
     buffer_.reserve(buffer_size_);
     iteration_pos_ = 0;
-    PADDLE_ENFORCE(reader_->HasNext());
     for (size_t i = 0; i < buffer_size_; ++i) {
       if (!reader_->HasNext()) {
         break;
diff --git a/paddle/fluid/operators/select_op.cc b/paddle/fluid/operators/select_op.cc
new file mode 100644
index 0000000000000000000000000000000000000000..8344a239df7b3fcbe91f91a17a3c5958013b55a6
--- /dev/null
+++ b/paddle/fluid/operators/select_op.cc
@@ -0,0 +1,414 @@
+/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#include <boost/tokenizer.hpp>
+#include <memory>
+#include <thread>
+#include <vector>
+#include "paddle/fluid/framework/channel.h"
+#include "paddle/fluid/framework/executor.h"
+#include "paddle/fluid/framework/lod_tensor.h"
+#include "paddle/fluid/framework/op_registry.h"
+#include "paddle/fluid/operators/concurrency/channel_util.h"
+
+namespace paddle {
+namespace operators {
+
+static constexpr char kX[] = "X";
+static constexpr char kCaseToExecute[] = "case_to_execute";
+
+static constexpr char kCases[] = "cases";
+static constexpr char kCasesBlock[] = "sub_block";
+
+class SelectOp : public framework::OperatorBase {
+ public:
+  SelectOp(const std::string &type, const framework::VariableNameMap &inputs,
+           const framework::VariableNameMap &outputs,
+           const framework::AttributeMap &attrs)
+      : framework::OperatorBase(type, inputs, outputs, attrs) {}
+
+ private:
+  enum class SelectOpCaseType {
+    DEFAULT = 0,
+    SEND = 1,
+    RECEIVE = 2,
+  };
+
+  struct SelectOpCase {
+    int caseIndex;
+    SelectOpCaseType caseType;
+    std::string channelName;
+    std::string varName;
+
+    SelectOpCase() {}
+
+    SelectOpCase(int caseIndex, SelectOpCaseType caseType,
+                 std::string channelName, std::string varName)
+        : caseIndex(caseIndex),
+          caseType(caseType),
+          channelName(channelName),
+          varName(varName) {}
+  };
+
+  void RunImpl(const framework::Scope &scope,
+               const platform::Place &dev_place) const override {
+    std::vector<std::string> casesConfigs =
+        Attr<std::vector<std::string>>(kCases);
+
+    framework::BlockDesc *casesBlock =
+        Attr<framework::BlockDesc *>(kCasesBlock);
+
+    framework::Scope &casesBlockScope = scope.NewScope();
+
+    std::string caseToExecuteVarName = Input(kCaseToExecute);
+    framework::Variable *caseToExecuteVar =
+        casesBlockScope.FindVar(caseToExecuteVarName);
+
+    // Construct cases from "conditional_block_op"(s) in the casesBlock
+    std::vector<std::shared_ptr<SelectOpCase>> cases =
+        ParseAndShuffleCases(&casesConfigs);
+
+    // Get all unique channels involved in select
+    std::set<framework::ChannelHolder *> channelsSet;
+    for (auto c : cases) {
+      if (!c->channelName.empty()) {
+        auto channelVar = scope.FindVar(c->channelName);
+        framework::ChannelHolder *ch =
+            channelVar->GetMutable<framework::ChannelHolder>();
+
+        if (channelsSet.find(ch) == channelsSet.end()) {
+          channelsSet.insert(ch);
+        }
+      }
+    }
+
+    // Order all channels by their pointer address
+    std::vector<framework::ChannelHolder *> channels(channelsSet.begin(),
+                                                     channelsSet.end());
+    std::sort(channels.begin(), channels.end());
+
+    // Poll all cases
+    int32_t caseToExecute = pollCases(&scope, &cases, channels);
+
+    // At this point, the case to execute has already been determined,
+    // so we can proceed with executing the cases block
+    framework::LoDTensor *caseToExecuteTensor =
+        caseToExecuteVar->GetMutable<framework::LoDTensor>();
+    caseToExecuteTensor->data<int32_t>()[0] = caseToExecute;
+
+    // Execute the cases block, only one case will be executed since we set the
+    // case_to_execute value to the index of the case we want to execute
+    framework::Executor executor(dev_place);
+    framework::ProgramDesc *program = casesBlock->Program();
+    executor.Run(*program, &casesBlockScope, casesBlock->ID(),
+                 false /*create_local_scope*/);
+  }
+
+  /**
+   * Goes through all entries in casesConfigs and parses each serialized
+   * case ('<index>,<type>,<channel>,<value>') into a SelectOpCase object.
+   * We randomize the case order, and set the default case (if any exists)
+   * as the last case.
+   * @param casesConfigs
+   * @return
+   */
+  std::vector<std::shared_ptr<SelectOpCase>> ParseAndShuffleCases(
+      std::vector<std::string> *casesConfigs) const {
+    std::vector<std::shared_ptr<SelectOpCase>> cases;
+    std::shared_ptr<SelectOpCase> defaultCase;
+
+    if (casesConfigs != nullptr) {
+      boost::char_delimiters_separator<char> sep(false, ",", "");
+      for (std::vector<std::string>::iterator itr = casesConfigs->begin();
+           itr < casesConfigs->end(); ++itr) {
+        std::string caseConfig = *itr;
+        boost::tokenizer<> tokens(caseConfig, sep);
+
+        boost::tokenizer<>::iterator tok_iter = tokens.begin();
+        PADDLE_ENFORCE(tok_iter != tokens.end(), "Cannot get case index");
+        std::string caseIndexString = *tok_iter;
+        int caseIndex = std::stoi(caseIndexString);
+
+        ++tok_iter;
+        PADDLE_ENFORCE(tok_iter != tokens.end(), "Cannot get case type");
+        std::string caseTypeString = *tok_iter;
+        SelectOpCaseType caseType = (SelectOpCaseType)std::stoi(caseTypeString);
+
+        std::string caseChannel;
+        std::string caseChannelVar;
+
+        ++tok_iter;
+        if (caseType != SelectOpCaseType::DEFAULT) {
+          PADDLE_ENFORCE(tok_iter != tokens.end(), "Cannot get case channel");
+          caseChannel = *tok_iter;
+
+          ++tok_iter;
+          PADDLE_ENFORCE(tok_iter != tokens.end(),
+                         "Cannot get case channel variable");
+          caseChannelVar = *tok_iter;
+        }
+
+        auto c = std::make_shared<SelectOpCase>(caseIndex, caseType,
+                                                caseChannel, caseChannelVar);
+
+        if (caseType == SelectOpCaseType::DEFAULT) {
+          PADDLE_ENFORCE(defaultCase == nullptr,
+                         "Select can only contain one default case.");
+          defaultCase = c;
+        } else {
+          cases.push_back(c);
+        }
+      }
+    }
+
+    // Randomly sort cases, with default case being last
+    std::random_shuffle(cases.begin(), cases.end());
+    if (defaultCase != nullptr) {
+      cases.push_back(defaultCase);
+    }
+
+    return cases;
+  }
+
+  /**
+   * Polls the cases and determines whether any case condition is true,
+   * recursing if necessary.
+   *
+   * If none of the case conditions is true (and there is no default case),
+   * the thread blocks. The thread may be woken up by a channel operation,
+   * at which point we execute the case (or poll again if the channel was
+   * closed).
+   * @param scope
+   * @param cases
+   * @param channels
+   * @return
+   */
+  int32_t pollCases(const framework::Scope *scope,
+                    std::vector<std::shared_ptr<SelectOpCase>> *cases,
+                    std::vector<framework::ChannelHolder *> channels) const {
+    // Lock all involved channels
+    lockChannels(channels);
+
+    std::atomic<int> caseToExecute(-1);
+
+    std::vector<std::shared_ptr<SelectOpCase>>::iterator it = cases->begin();
+    while (it != cases->end()) {
+      std::shared_ptr<SelectOpCase> c = *it;
+
+      auto chVar = scope->FindVar(c->channelName);
+      framework::ChannelHolder *ch =
+          chVar->GetMutable<framework::ChannelHolder>();
+
+      switch (c->caseType) {
+        case SelectOpCaseType::SEND:
+          PADDLE_ENFORCE(!ch->IsClosed(), "Cannot send to a closed channel");
+          if (ch->CanSend()) {
+            // We can send to channel directly, send the data to channel
+            // and execute case
+            auto chVar = scope->FindVar(c->varName);
+            concurrency::ChannelSend(ch, chVar);
+            caseToExecute = c->caseIndex;
+          }
+          break;
+        case SelectOpCaseType::RECEIVE:
+          if (ch->CanReceive()) {
+            // We can receive from channel directly, receive the data from
+            // channel and execute case
+            auto chVar = scope->FindVar(c->varName);
+            concurrency::ChannelReceive(ch, chVar);
+            caseToExecute = c->caseIndex;
+          }
+          break;
+        case SelectOpCaseType::DEFAULT:
+          caseToExecute = c->caseIndex;
+          break;
+      }
+
+      if (caseToExecute != -1) {
+        // We found a case to execute, stop looking at other case statements
+        break;
+      }
+
+      ++it;
+    }
+
+    if (caseToExecute == -1) {
+      // None of the cases are eligible to execute, enqueue current thread
+      // into all the sending/receiving queue of each involved channel
+      std::atomic<bool> completed(false);
+      std::recursive_mutex mutex;
+      std::unique_lock<std::recursive_mutex> lock{mutex};
+      // std::condition_variable_any selectCond;
+      auto selectCond = std::make_shared<std::condition_variable_any>();
+
+      std::recursive_mutex callbackMutex;
+      pushThreadOnChannelQueues(scope, cases, selectCond, caseToExecute,
+                                completed, callbackMutex);
+
+      // TODO(thuan): Atomically unlock all channels and sleep current thread
+      unlockChannels(channels);
+      selectCond->wait(lock, [&completed]() { return completed.load(); });
+
+      // Select has been woken up by case operation
+      lockChannels(channels);
+      removeThreadOnChannelQueues(scope, cases);
+
+      if (caseToExecute == -1) {
+        // Recursively poll cases, since we were woken up by a channel close
+        // TODO(thuan): Need to test if this is a valid case
+        unlockChannels(channels);
+        return pollCases(scope, cases, channels);
+      }
+    }
+
+    // At this point, caseToExecute != -1, and we can proceed with executing
+    // the case block
+    unlockChannels(channels);
+
+    return caseToExecute;
+  }
+
+  void lockChannels(std::vector<framework::ChannelHolder *> chs) const {
+    std::vector<framework::ChannelHolder *>::iterator it = chs.begin();
+    while (it != chs.end()) {
+      framework::ChannelHolder *ch = *it;
+      ch->Lock();
+      ++it;
+    }
+  }
+
+  void unlockChannels(std::vector<framework::ChannelHolder *> chs) const {
+    std::vector<framework::ChannelHolder *>::reverse_iterator it = chs.rbegin();
+    while (it != chs.rend()) {
+      framework::ChannelHolder *ch = *it;
+      ch->Unlock();
+      ++it;
+    }
+  }
+
+  void pushThreadOnChannelQueues(
+      const framework::Scope *scope,
+      std::vector<std::shared_ptr<SelectOpCase>> *cases,
+      std::shared_ptr<std::condition_variable_any> rCond,
+      std::atomic<int> &caseToExecute, std::atomic<bool> &completed,
+      std::recursive_mutex &callbackMutex) const {
+    std::vector<std::shared_ptr<SelectOpCase>>::iterator it = cases->begin();
+    while (it != cases->end()) {
+      std::shared_ptr<SelectOpCase> c = *it;
+
+      auto chVar = scope->FindVar(c->channelName);
+      framework::ChannelHolder *ch =
+          chVar->GetMutable<framework::ChannelHolder>();
+
+      std::function<bool(framework::ChannelAction channelAction)> cb =
+          [&caseToExecute, &completed, &callbackMutex,
+           c](framework::ChannelAction channelAction) {
+            std::lock_guard<std::recursive_mutex> lock{callbackMutex};
+
+            bool canProcess = false;
+            if (!completed) {
+              // If the channel wasn't closed, we set the caseToExecute index
+              // as this current case
+              if (channelAction != framework::ChannelAction::CLOSE) {
+                caseToExecute = c->caseIndex;
+              }
+              // This will allow our condition variable to break out of wait
+              completed = true;
+              canProcess = true;
+            }
+
+            return canProcess;
+          };
+
+      switch (c->caseType) {
+        case SelectOpCaseType::SEND: {
+          auto chOutputVar = scope->FindVar(c->varName);
+          concurrency::ChannelAddToSendQ(ch, this, chOutputVar, rCond, cb);
+          break;
+        }
+        case SelectOpCaseType::RECEIVE: {
+          auto chOutputVar = scope->FindVar(c->varName);
+          concurrency::ChannelAddToReceiveQ(ch, this, chOutputVar, rCond, cb);
+          break;
+        }
+        default:
+          break;
+      }
+      ++it;
+    }
+  }
+
+  void removeThreadOnChannelQueues(
+      const framework::Scope *scope,
+      std::vector<std::shared_ptr<SelectOpCase>> *cases) const {
+    std::vector<std::shared_ptr<SelectOpCase>>::iterator it = cases->begin();
+    while (it != cases->end()) {
+      std::shared_ptr<SelectOpCase> c = *it;
+
+      auto chVar = scope->FindVar(c->channelName);
+      framework::ChannelHolder *ch =
+          chVar->GetMutable<framework::ChannelHolder>();
+      switch (c->caseType) {
+        case SelectOpCaseType::SEND: {
+          ch->RemoveFromSendQ(this);
+          break;
+        }
+        case SelectOpCaseType::RECEIVE: {
+          ch->RemoveFromReceiveQ(this);
+          break;
+        }
+        default:
+          break;
+      }
+      ++it;
+    }
+  }
+};
+
+class SelectOpMaker : public framework::OpProtoAndCheckerMaker {
+ public:
+  SelectOpMaker(OpProto *proto, OpAttrChecker *op_checker)
+      : OpProtoAndCheckerMaker(proto, op_checker) {
+    AddInput(kX,
+             "A set of variables, which are required by operators inside the "
+             "cases of Select Op")
+        .AsDuplicable();
+    AddInput(kCaseToExecute,
+             "(Int) The variable that sets the index of the case to execute, "
+             "after evaluating the channels being sent to and received from")
+        .AsDuplicable();
+    AddAttr<std::vector<std::string>>(kCases,
+                                      "(String vector) Serialized list of "
+                                      "all cases in the select op. Each "
+                                      "case is serialized as: "
+                                      "'<index>,<type>,<channel>,<value>' "
+                                      "where type is 0 for default, 1 for "
+                                      "send, and 2 for receive. "
+                                      "No channel or value is needed for "
+                                      "default cases.");
+    AddAttr<framework::BlockDesc *>(kCasesBlock,
+                                    "The cases block inside select_op");
+    AddComment(R"DOC(
+)DOC");
+  }
+};
+
+// TODO(thuan): Implement Gradient Operator for SELECT_OP
+
+}  // namespace operators
+}  // namespace paddle
+
+REGISTER_OPERATOR(select, paddle::operators::SelectOp,
+                  paddle::framework::EmptyGradOpMaker,
+                  paddle::operators::SelectOpMaker);
diff --git a/paddle/fluid/operators/sequence_expand_op.cc b/paddle/fluid/operators/sequence_expand_op.cc
index a5d84d629b2e50763dac9bc571ac490414a8a406..786fe63e7580ce16b946d5049a490eed2c3c6ced 100644
--- a/paddle/fluid/operators/sequence_expand_op.cc
+++ b/paddle/fluid/operators/sequence_expand_op.cc
@@ -17,7 +17,7 @@ limitations under the License. */
 namespace paddle {
 namespace operators {
 
-using framework::Tensor;
+using framework::LoDTensor;
 
 class SequenceExpandOp : public framework::OperatorWithKernel {
  public:
@@ -25,15 +25,71 @@ class SequenceExpandOp : public framework::OperatorWithKernel {
 
  protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
-    PADDLE_ENFORCE(ctx->HasInput("X"));
-    PADDLE_ENFORCE(ctx->HasOutput("Out"));
-    PADDLE_ENFORCE(ctx->HasInput("Y"));
-    framework::DDim out_dim;
-    auto y_dim = ctx->GetInputDim("Y");
-    out_dim = ctx->GetInputDim("X");
-    out_dim[0] = y_dim[0];
-    ctx->ShareLoD("Y", "Out");
-    ctx->SetOutputDim("Out", out_dim);
+    PADDLE_ENFORCE(ctx->HasInput("X"),
+                   "Input(X) of SequenceExpandOp should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Y"),
+                   "Input(Y) of SequenceExpandOp should not be null.");
+    PADDLE_ENFORCE(ctx->HasOutput("Out"),
+                   "Output(Out) of SequenceExpandOp should not be null.");
+
+    auto x_dims = ctx->GetInputDim("X");
+    auto out_dims = x_dims;
+    int ref_level = ctx->Attrs().Get<int>("ref_level");
+
+    PADDLE_ENFORCE_GE(x_dims.size(), 2,
+                      "Dimension number of Input(X) should be at least 2.");
+
+    if (ctx->IsRuntime()) {
+      framework::Variable* x_var =
+          boost::get<framework::Variable*>(ctx->GetInputVarPtrs("X")[0]);
+      framework::Variable* y_var =
+          boost::get<framework::Variable*>(ctx->GetInputVarPtrs("Y")[0]);
+
+      auto& x_lod = x_var->Get<LoDTensor>().lod();
+      auto& y_lod = y_var->Get<LoDTensor>().lod();
+
+      PADDLE_ENFORCE_LE(x_lod.size(), 1,
+                        "Level number of Input(X)'s lod should not be "
+                        "greater than 1.");
+      PADDLE_ENFORCE_GT(y_lod.size(), 0,
+                        "Level number of Input(Y)'s lod should be "
+                        "greater than 0.");
+      PADDLE_ENFORCE(
+          ref_level == -1 ||
+              (ref_level >= 0 && ref_level < static_cast<int>(y_lod.size())),
+          "Invalid `ref_level`, which should be either equal to -1 "
+          "or in [0, %d)",
+          y_lod.size());
+
+      if (ref_level == -1) ref_level = y_lod.size() - 1;
+
+      if (x_lod.size() > 0) {
+        PADDLE_ENFORCE(x_lod[0].size() == y_lod[ref_level].size(),
+                       "Level number of Input(X)'s lod could be 0. Otherwise "
+                       "size of Input(X)'s first level lod should be equal to "
+                       "size of Input(Y)'s referred level lod.");
+      }
+
+      int64_t out_first_dim = 0;
+      if (y_lod[ref_level].size() <= 1) {
+        out_first_dim = x_dims[0];
+      } else {
+        for (size_t i = 1; i < y_lod[ref_level].size(); ++i) {
+          int x_seq_len = 1;
+          if (x_lod.size() == 1) {
+            x_seq_len = x_lod[0][i] - x_lod[0][i - 1];
+          }
+          out_first_dim +=
+              (y_lod[ref_level][i] - y_lod[ref_level][i - 1]) * x_seq_len;
+        }
+      }
+      out_dims[0] = out_first_dim;
+      ctx->SetOutputDim("Out", out_dims);
+    } else {
+      out_dims[0] = -1;
+      ctx->SetOutputDim("Out", out_dims);
+      ctx->ShareLoD("X", /*->*/ "Out");
+    }
   }
 };
 
@@ -42,83 +98,81 @@ class SequenceExpandOpMaker : public framework::OpProtoAndCheckerMaker {
   SequenceExpandOpMaker(OpProto* proto, OpAttrChecker* op_checker)
       : OpProtoAndCheckerMaker(proto, op_checker) {
     AddInput("X",
-             "(Tensor or LoDTensor) The input(X) of this operator can be a "
-             "LoDTensor or a base Tensor.");
+             "(LoDTensor, default LoDTensor<float>) A 2-D LoDTensor whose lod "
+             "level is at most 1.");
     AddInput("Y",
-             "(LoDTensor)The reference input(Y) of sequence_expand op."
-             "It must be a LoDTensor with k-level(k>0)."
-             "The input(X) will be expanded according to LOD of input(Y)."
-             "The element numbers of last level in input(Y) "
-             "must be equal to dims[0] of input(X).");
+             "(LoDTensor, default LoDTensor<float>) Reference LoDTensor whose "
+             "lod (at the specified level) is used to expand Input(X).");
     AddOutput("Out",
-              "(LodTensor)The output of sequence_expand op."
-              "The lod of output will be as same as input(Y)'s lod.");
+              "(LoDTensor, default LoDTensor<float>) Output LoDTensor which is "
+              "generated from Input(X) by referring lod of Input(Y).");
+    AddAttr<int>("ref_level", "Specify lod level of Input(Y).").SetDefault(-1);
     AddComment(R"DOC(
 Sequence Expand Operator.
 
-This operator expands input(X) according to LOD of input(Y).
+This operator expands `X` according to the specified lod level of `Y`. The
+current implementation requires the lod level of `X` to be at most 1. Attribute
+`ref_level` specifies which lod level of `Y` is referred to when expanding `X`.
+If `ref_level` is set to -1, the last lod level of `Y` is used.
+Note that the rank of `X` should be at least 2; when the rank exceeds 2, `X`
+is viewed as a 2-D tensor.
+
 Following are cases to better explain how this works:
+
 Case 1:
 
-Given a 2-level LoDTensor input(X)
-    X.lod = [[0,       2, 3],
-             [0, 1,    3, 4]]
-    X.data = [a, b, c, d]
+Given a 1-level LoDTensor input(X)
+    X.lod =  [[0,   2,        4]]
+    X.data = [[a], [b], [c], [d]]
     X.dims = [4, 1]
 and input(Y)
     Y.lod = [[0,    2,    4],
              [0, 3, 6, 7, 8]]
-with condition len(Y.lod[-1]) -1 == X.dims[0]
-then we get 2-level LoDTensor
-    Out.lod = [[0,                2,    4],
-               [0,       3,       6, 7, 8]]
-    Out.data = [a, a, a, b, b, b, c, d]
+ref_level: 0
+then we get 1-level LoDTensor
+    Out.lod =  [[0,   2,        4,        6,        8]]
+    Out.data = [[a], [b], [a], [b], [c], [d], [c], [d]]
     Out.dims = [8, 1]
 
 Case 2:
 
+Given a 1-level LoDTensor input(X)
+    X.lod =  [[0,   1,        4]]
+    X.data = [[a], [b], [c], [d]]
+    X.dims = [4, 1]
+and input(Y)
+    Y.lod = [[0,    2,    4],
+             [0, 3, 6, 6, 8]]
+ref_level: 0
+then we get 1-level LoDTensor
+    Out.lod =  [[0,   1,   2,        5,             8]]
+    Out.data = [[a], [a], [b], [c], [d], [b], [c], [d]]
+    Out.dims = [8, 1]
+
+Case 3:
+
 Given a common Tensor input(X)
-    X.data = [a, b, c]
+    X.data = [[a], [b], [c]]
     X.dims = [3, 1]
 and input(Y)
     Y.lod = [[0, 2, 3, 6]]
-with condition len(Y.lod[-1]) -1 == X.dims[0]
-then we get 1-level LoDTensor
-    Out.lod = [[0,    2, 3,      6]]
-    Out.data = [a, a, b, c, c, c]
+ref_level: -1
+then we get a common Tensor
+    Out.data = [[a], [a], [b], [c], [c], [c]]
     Out.dims = [6, 1]
 
-Case 3:
+Case 4:
 
 Given a common Tensor input(X)
     X.data = [[a, b], [c, d], [e, f]]
     X.dims = [3, 2]
 and input(Y)
     Y.lod = [[0, 2, 3, 6]]
-with condition len(Y.lod[-1]) -1 == X.dims[0]
-then we get 1-level LoDTensor
-    Out.lod = [[0,           2,     3,                     6]]
-    Out.data = [[a,b], [a,b] [c,d], [e, f], [e, f], [e, f]]
+ref_level: 0
+then we get a common LoDTensor
+    Out.data = [[a, b], [a, b], [c, d], [e, f], [e, f], [e, f]]
     Out.dims = [6, 2]
 
-Case 4:
-
-Given 2-level a LoDTensor input(X)
-    X.lod = [[0,       2, 3],
-             [0, 1,    3, 4]]
-    X.data = [a, b, c, d]
-    X.dims = [4, 1]
-and input(Y)
-    Y.lod = [[0,    2,    4],
-             [0, 3, 6, 6, 8]]
-with condition len(Y.lod[-1]) -1 == X.dims[0]
-then we get 2-level LoDTensor
-    Out.lod = [[0,                2,    4],
-               [0,       3,       6, 6, 8]]
-    Out.data = [a, a, a, b, b, b, d, d]
-    Out.dims = [8, 1]
-
-
 )DOC");
   }
 };
@@ -129,12 +183,14 @@ class SequenceExpandOpGrad : public framework::OperatorWithKernel {
 
  protected:
   void InferShape(framework::InferShapeContext* ctx) const override {
-    PADDLE_ENFORCE(ctx->HasInput("X"));
-    PADDLE_ENFORCE(ctx->HasInput("Out"));
+    PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) should not be null.");
+    PADDLE_ENFORCE(ctx->HasInput("Out"), "Input(Out) should not be null.");
     PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
-                   "The input(Out@GRAD) should not be null");
+                   "Input(Out@GRAD) should not be null.");
+
     auto x_dims = ctx->GetInputDim("X");
     auto x_grad_name = framework::GradVarName("X");
+
     if (ctx->HasOutput(x_grad_name)) {
       ctx->SetOutputDim(x_grad_name, x_dims);
     }
@@ -149,7 +205,13 @@ REGISTER_OP(sequence_expand, ops::SequenceExpandOp, ops::SequenceExpandOpMaker,
             sequence_expand_grad, ops::SequenceExpandOpGrad);
 REGISTER_OP_CPU_KERNEL(
     sequence_expand,
-    ops::SequenceExpandKernel<paddle::platform::CPUDeviceContext, float>);
+    ops::SequenceExpandKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::SequenceExpandKernel<paddle::platform::CPUDeviceContext, double>,
+    ops::SequenceExpandKernel<paddle::platform::CPUDeviceContext, int>,
+    ops::SequenceExpandKernel<paddle::platform::CPUDeviceContext, int64_t>);
 REGISTER_OP_CPU_KERNEL(
     sequence_expand_grad,
-    ops::SequenceExpandGradKernel<paddle::platform::CPUDeviceContext, float>);
+    ops::SequenceExpandGradKernel<paddle::platform::CPUDeviceContext, float>,
+    ops::SequenceExpandGradKernel<paddle::platform::CPUDeviceContext, double>,
+    ops::SequenceExpandGradKernel<paddle::platform::CPUDeviceContext, int>,
+    ops::SequenceExpandGradKernel<paddle::platform::CPUDeviceContext, int64_t>);
diff --git a/paddle/fluid/operators/sequence_expand_op.cu b/paddle/fluid/operators/sequence_expand_op.cu
index 26622d23afa1c703e237628bcb11db8f1da73210..bb51bb2902eea797de3449fcb6c8b52b4f0e7fbf 100644
--- a/paddle/fluid/operators/sequence_expand_op.cu
+++ b/paddle/fluid/operators/sequence_expand_op.cu
@@ -18,7 +18,14 @@ limitations under the License. */
 namespace ops = paddle::operators;
 REGISTER_OP_CUDA_KERNEL(
     sequence_expand,
-    ops::SequenceExpandKernel<paddle::platform::CUDADeviceContext, float>);
+    ops::SequenceExpandKernel<paddle::platform::CUDADeviceContext, float>,
+    ops::SequenceExpandKernel<paddle::platform::CUDADeviceContext, double>,
+    ops::SequenceExpandKernel<paddle::platform::CUDADeviceContext, int>,
+    ops::SequenceExpandKernel<paddle::platform::CUDADeviceContext, int64_t>);
 REGISTER_OP_CUDA_KERNEL(
     sequence_expand_grad,
-    ops::SequenceExpandGradKernel<paddle::platform::CUDADeviceContext, float>);
+    ops::SequenceExpandGradKernel<paddle::platform::CUDADeviceContext, float>,
+    ops::SequenceExpandGradKernel<paddle::platform::CUDADeviceContext, double>,
+    ops::SequenceExpandGradKernel<paddle::platform::CUDADeviceContext, int>,
+    ops::SequenceExpandGradKernel<paddle::platform::CUDADeviceContext,
+                                  int64_t>);
diff --git a/paddle/fluid/operators/sequence_expand_op.h b/paddle/fluid/operators/sequence_expand_op.h
index 76dde976db2d19e307ae7406be8280f9b4987187..db7d8bd6821fabd9714a160970558291ec47197f 100644
--- a/paddle/fluid/operators/sequence_expand_op.h
+++ b/paddle/fluid/operators/sequence_expand_op.h
@@ -16,45 +16,75 @@ limitations under the License. */
 
 #include "paddle/fluid/framework/op_registry.h"
 #include "paddle/fluid/memory/memcpy.h"
-#include "unsupported/Eigen/CXX11/Tensor"
+#include "paddle/fluid/operators/math/math_function.h"
 
 namespace paddle {
 namespace operators {
 
 using LoDTensor = framework::LoDTensor;
+template <typename T, int MajorType = Eigen::RowMajor,
+          typename IndexType = Eigen::DenseIndex>
+using EigenMatrix = framework::EigenMatrix<T, MajorType, IndexType>;
 
 template <typename DeviceContext, typename T>
 class SequenceExpandKernel : public framework::OpKernel<T> {
  public:
   void Compute(const framework::ExecutionContext& context) const override {
     auto* x = context.Input<LoDTensor>("X");
-    auto* out = context.Output<LoDTensor>("Out");
-    const T* x_data = x->data<T>();
-    auto x_dims = x->dims();
     auto* y = context.Input<LoDTensor>("Y");
-    PADDLE_ENFORCE(!y->lod().empty(), "y should have lod");
-    PADDLE_ENFORCE_EQ(static_cast<size_t>(x_dims[0]),
-                      y->lod().back().size() - 1,
-                      "The size of last lod level in Input(Y)"
-                      "must be equal to dims[0] of Input(X).");
-    out->set_lod(y->lod());
-    auto* place =
-        context.template device_context<DeviceContext>().eigen_device();
-    size_t element_len = framework::product(x_dims) / x_dims[0];
-    T* out_data = out->mutable_data<T>(context.GetPlace());
-    auto out_starts = out->lod().back();
-
-    for (size_t i = 0; i < out_starts.size() - 1; i++) {
-      int scale = out_starts[i + 1] - out_starts[i];
-      Eigen::TensorMap<
-          Eigen::Tensor<const T, 2, Eigen::RowMajor, Eigen::DenseIndex>>
-          x_t(x_data, 1, element_len);
-      Eigen::TensorMap<Eigen::Tensor<T, 2, Eigen::RowMajor, Eigen::DenseIndex>>
-          out_t(out_data, scale, element_len);
-      Eigen::array<int, 2> cast({{scale, 1}});
-      out_t.device(*place) = x_t.broadcast(cast);
-      x_data += element_len;
-      out_data += element_len * scale;
+    auto* out = context.Output<LoDTensor>("Out");
+
+    int ref_level = context.Attr<int>("ref_level");
+    auto& x_lod = x->lod();
+    auto& y_lod = y->lod();
+
+    if (ref_level == -1) ref_level = y_lod.size() - 1;
+
+    out->mutable_data<T>(context.GetPlace());
+
+    if (y_lod[ref_level].size() <= 1) {
+      framework::TensorCopy(*x, context.GetPlace(), out);
+      return;
+    }
+
+    auto& out_lod = *out->mutable_lod();
+    if (x_lod.size() == 1) {
+      out_lod.resize(1);
+      out_lod[0] = {0};
+    }
+
+    int out_offset = 0;
+    auto& eigen_place =
+        *context.template device_context<DeviceContext>().eigen_device();
+    for (size_t i = 1; i < y_lod[ref_level].size(); ++i) {
+      int repeat_num = y_lod[ref_level][i] - y_lod[ref_level][i - 1];
+      int x_start = i - 1;
+      int x_end = i;
+      if (x_lod.size() == 1) {
+        x_start = x_lod[0][i - 1];
+        x_end = x_lod[0][i];
+      }
+      int x_seq_len = x_end - x_start;
+      if (repeat_num > 0) {
+        auto x_sub_tensor = x->Slice(x_start, x_end);
+        x_sub_tensor.Resize({1, x_sub_tensor.numel()});
+        int out_start = out_offset;
+        if (x_lod.size() == 1) {
+          out_start = out_lod[0][out_offset];
+        }
+        auto out_sub_tensor =
+            out->Slice(out_start, out_start + x_seq_len * repeat_num);
+        out_sub_tensor.Resize({repeat_num, x_sub_tensor.dims()[1]});
+        EigenMatrix<T>::From(out_sub_tensor).device(eigen_place) =
+            EigenMatrix<T>::From(x_sub_tensor)
+                .broadcast(Eigen::array<int, 2>({{repeat_num, 1}}));
+      }
+      for (int j = 0; j < repeat_num; ++j) {
+        if (x_lod.size() == 1) {
+          out_lod[0].push_back(out_lod[0].back() + x_seq_len);
+        }
+        out_offset++;
+      }
     }
   }
 };
@@ -75,27 +105,51 @@ template <typename DeviceContext, typename T>
 class SequenceExpandGradKernel : public framework::OpKernel<T> {
  public:
   void Compute(const framework::ExecutionContext& context) const override {
-    auto* d_out = context.Input<LoDTensor>(framework::GradVarName("Out"));
+    auto* g_out = context.Input<LoDTensor>(framework::GradVarName("Out"));
     auto* x = context.Input<LoDTensor>("X");
-    auto* out = context.Input<LoDTensor>("Out");
-    auto* d_x = context.Output<LoDTensor>(framework::GradVarName("X"));
-    auto out_last_level = out->lod().back();
-    d_x->set_lod(x->lod());
-    const T* d_out_data = d_out->data<T>();
-    T* d_x_data = d_x->mutable_data<T>(context.GetPlace());
-    size_t element_len = d_out->numel() / d_out->dims()[0];
-    for (size_t i = 0; i < out_last_level.size() - 1; ++i) {
-      size_t repeat = out_last_level[i + 1] - out_last_level[i];
-      Eigen::TensorMap<
-          Eigen::Tensor<const T, 2, Eigen::RowMajor, Eigen::DenseIndex>>
-      d_out_t(d_out_data, static_cast<int>(repeat), element_len);
-      Eigen::TensorMap<Eigen::Tensor<T, 1, Eigen::RowMajor, Eigen::DenseIndex>>
-      d_x_t(d_x_data, static_cast<int>(element_len));
-      auto place =
-          context.template device_context<DeviceContext>().eigen_device();
-      d_x_t.device(*place) = d_out_t.sum(Eigen::array<int, 1>({{0}}));
-      d_out_data += (repeat * element_len);
-      d_x_data += element_len;
+    auto* y = context.Input<LoDTensor>("Y");
+    auto* g_x = context.Output<LoDTensor>(framework::GradVarName("X"));
+    int ref_level = context.Attr<int>("ref_level");
+
+    g_x->mutable_data<T>(context.GetPlace());
+    g_x->set_lod(x->lod());
+
+    auto& x_lod = x->lod();
+    auto& y_lod = y->lod();
+
+    if (ref_level == -1) ref_level = y_lod.size() - 1;
+
+    // just copy the gradient
+    if (y_lod[ref_level].size() <= 1) {
+      framework::TensorCopy(*g_out, context.GetPlace(), g_x);
+      return;
+    }
+
+    auto& dev_ctx = context.template device_context<DeviceContext>();
+
+    math::SetConstant<DeviceContext, T> set_zero;
+    set_zero(dev_ctx, g_x, static_cast<T>(0));
+
+    int g_out_offset = 0;
+    for (size_t i = 1; i < y_lod[ref_level].size(); ++i) {
+      int repeat_num = y_lod[ref_level][i] - y_lod[ref_level][i - 1];
+      if (repeat_num > 0) {
+        int x_start = i - 1;
+        int x_end = i;
+        if (x_lod.size() == 1) {
+          x_start = x_lod[0][i - 1];
+          x_end = x_lod[0][i];
+        }
+        int x_seq_len = x_end - x_start;
+        auto g_x_sub = g_x->Slice(x_start, x_end);
+        g_x_sub.Resize(flatten_to_1d(g_x_sub.dims()));
+        int g_out_end = g_out_offset + repeat_num * x_seq_len;
+        auto g_out_sub = g_out->Slice(g_out_offset, g_out_end);
+        g_out_sub.Resize({repeat_num, g_x_sub.dims()[0]});
+        math::ColwiseSum<DeviceContext, T> col_sum;
+        col_sum(dev_ctx, g_out_sub, &g_x_sub);
+        g_out_offset += repeat_num * x_seq_len;
+      }
     }
   }
 };
diff --git a/paddle/fluid/operators/softmax_with_cross_entropy_op.cu b/paddle/fluid/operators/softmax_with_cross_entropy_op.cu
index 39b246a5bedb2819fc9b7fc407cfe03e59af0b68..8f7840cee1dd95a828fd4ac8815e335a5db47e3d 100644
--- a/paddle/fluid/operators/softmax_with_cross_entropy_op.cu
+++ b/paddle/fluid/operators/softmax_with_cross_entropy_op.cu
@@ -23,21 +23,21 @@ using Tensor = framework::Tensor;
 
 namespace {
 template <typename T>
-__global__ void CrossEntropyGrad(T* logit_grad, const T* loss_grad,
-                                 const int64_t* labels, const int batch_size,
-                                 const int class_num) {
-  int tid = blockIdx.x * blockDim.x + threadIdx.x;
-  int sample_idx = tid / class_num;
-
-  if (tid < batch_size) {
-    PADDLE_ASSERT(labels[sample_idx] >= 0 && labels[sample_idx] < class_num);
-    logit_grad[tid * class_num + labels[tid]] -= static_cast<T>(1.);
+__global__ void CrossEntropyGrad(T* logit_grad, const int64_t* labels,
+                                 const int batch_size, const int class_num) {
+  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < batch_size;
+       i += blockDim.x * gridDim.x) {
+    int idx = i * class_num + labels[i];
+    logit_grad[idx] -= static_cast<T>(1.);
   }
+}
 
-  __syncthreads();
-
-  if (tid < batch_size * class_num) {
-    logit_grad[tid] *= loss_grad[sample_idx];
+template <typename T>
+__global__ void Scale(T* logit_grad, const T* loss_grad, const int num,
+                      const int class_num) {
+  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num;
+       i += blockDim.x * gridDim.x) {
+    logit_grad[i] *= loss_grad[i / class_num];
   }
 }
 
@@ -94,22 +94,22 @@ class SoftmaxWithCrossEntropyGradCUDAKernel : public framework::OpKernel<T> {
     const int batch_size = logit_grad->dims()[0];
     const int class_num = logit_grad->dims()[1];
     int block = 512;
-    int grid = (batch_size * class_num + block - 1) / block;
+    auto stream = context.cuda_device_context().stream();
 
     if (context.Attr<bool>("soft_label")) {
+      int grid = (batch_size * class_num + block - 1) / block;
       const T* label_data = labels->data<T>();
-      SoftCrossEntropyGradientKernel<
-          T><<<grid, block, 0,
-               context.template device_context<platform::CUDADeviceContext>()
-                   .stream()>>>(logit_grad_data, loss_grad_data, label_data,
-                                batch_size, class_num);
+      SoftCrossEntropyGradientKernel<T><<<grid, block, 0, stream>>>(
+          logit_grad_data, loss_grad_data, label_data, batch_size, class_num);
     } else {
+      int grid = (batch_size + block - 1) / block;
       const int64_t* label_data = labels->data<int64_t>();
-      CrossEntropyGrad<
-          T><<<grid, block, 0,
-               context.template device_context<platform::CUDADeviceContext>()
-                   .stream()>>>(logit_grad_data, loss_grad_data, label_data,
-                                batch_size, class_num);
+      CrossEntropyGrad<T><<<grid, block, 0, stream>>>(
+          logit_grad_data, label_data, batch_size, class_num);
+      int num = batch_size * class_num;
+      grid = (num + block - 1) / block;
+      Scale<T><<<grid, block, 0, stream>>>(logit_grad_data, loss_grad_data, num,
+                                           class_num);
     }
   }
 };
diff --git a/paddle/fluid/platform/cudnn_helper.h b/paddle/fluid/platform/cudnn_helper.h
index 9a2ac3ff33df3f8b9e24203f9dba2130e1d16510..7e001ecc56173db76e8c576e7efd66f41192f292 100644
--- a/paddle/fluid/platform/cudnn_helper.h
+++ b/paddle/fluid/platform/cudnn_helper.h
@@ -19,6 +19,7 @@ limitations under the License. */
 #include "paddle/fluid/framework/operator.h"
 #include "paddle/fluid/platform/dynload/cudnn.h"
 #include "paddle/fluid/platform/enforce.h"
+#include "paddle/fluid/platform/float16.h"
 #include "paddle/fluid/platform/macros.h"
 
 namespace paddle {
@@ -80,6 +81,22 @@ enum class PoolingMode {
 template <typename T>
 class CudnnDataType;
 
+template <>
+class CudnnDataType<float16> {
+ public:
+  static const cudnnDataType_t type = CUDNN_DATA_HALF;
+  // The scaling param type is float for HALF and FLOAT tensors
+  typedef const float ScalingParamType;
+  static ScalingParamType* kOne() {
+    static ScalingParamType v = 1.0;
+    return &v;
+  }
+  static ScalingParamType* kZero() {
+    static ScalingParamType v = 0.0;
+    return &v;
+  }
+};
+
 template <>
 class CudnnDataType<float> {
  public:
diff --git a/paddle/fluid/platform/device_tracer.cc b/paddle/fluid/platform/device_tracer.cc
index 78e00d5420bbea40c9bea4be919ec4ce5ececdcb..3b4437f576e1c2e931a86ec6d5e823ec1f344c52 100644
--- a/paddle/fluid/platform/device_tracer.cc
+++ b/paddle/fluid/platform/device_tracer.cc
@@ -26,8 +26,14 @@ limitations under the License. */
 namespace paddle {
 namespace platform {
 namespace {
+// Current thread's id. Note, we don't distinguish nested threads
+// for now.
+thread_local int cur_thread_id = 0;
+// Tracking the nested block stacks of each thread.
+thread_local std::deque<int> block_id_stack;
+// Tracking the nested event stacks.
+thread_local std::deque<std::string> annotation_stack;
 
-thread_local const char *cur_annotation = nullptr;
 std::once_flag tracer_once_flag;
 DeviceTracer *tracer = nullptr;
 }  // namespace
@@ -191,19 +197,19 @@ class DeviceTracerImpl : public DeviceTracer {
     correlations_[id] = anno;
   }
 
-  void AddCPURecords(const char *anno, uint64_t start_ns, uint64_t end_ns) {
-    if (!anno) {
-      // TODO(panyx0718): Currently, it doesn't support nested situation
-      // Up-level can be cleared by low-level and therefore get nullptr
-      // here.
+  void AddCPURecords(const std::string &anno, uint64_t start_ns,
+                     uint64_t end_ns, int64_t device_id, int64_t thread_id) {
+    if (anno.empty()) {
+      VLOG(1) << "Empty timeline annotation.";
       return;
     }
     std::lock_guard<std::mutex> l(trace_mu_);
-    cpu_records_.push_back(CPURecord{anno, start_ns, end_ns, 0});
+    cpu_records_.push_back(
+        CPURecord{anno, start_ns, end_ns, device_id, thread_id});
   }
 
   void AddMemRecords(const std::string &name, uint64_t start_ns,
-                     uint64_t end_ns, uint32_t device_id, uint32_t stream_id,
+                     uint64_t end_ns, int64_t device_id, int64_t stream_id,
                      uint32_t correlation_id, uint64_t bytes) {
     // 0 means timestamp information could not be collected for the kernel.
     if (start_ns == 0 || end_ns == 0) {
@@ -215,8 +221,8 @@ class DeviceTracerImpl : public DeviceTracer {
                                      stream_id, correlation_id, bytes});
   }
 
-  void AddKernelRecords(uint64_t start, uint64_t end, uint32_t device_id,
-                        uint32_t stream_id, uint32_t correlation_id) {
+  void AddKernelRecords(uint64_t start, uint64_t end, int64_t device_id,
+                        int64_t stream_id, uint32_t correlation_id) {
     // 0 means timestamp information could not be collected for the kernel.
     if (start == 0 || end == 0) {
       VLOG(3) << correlation_id << " cannot be traced";
@@ -270,27 +276,30 @@ class DeviceTracerImpl : public DeviceTracer {
         continue;
       }
       auto *event = profile_pb.add_events();
+      event->set_type(proto::Event::GPUKernel);
       event->set_name(correlations_.at(r.correlation_id));
       event->set_start_ns(r.start_ns);
       event->set_end_ns(r.end_ns);
-      event->set_stream_id(r.stream_id);
+      event->set_sub_device_id(r.stream_id);
       event->set_device_id(r.device_id);
     }
 
     for (const CPURecord &r : cpu_records_) {
       auto *event = profile_pb.add_events();
+      event->set_type(proto::Event::CPU);
       event->set_name(r.name);
       event->set_start_ns(r.start_ns);
       event->set_end_ns(r.end_ns);
-      event->set_stream_id(r.thread_id);
-      event->set_device_id(-1);
+      event->set_sub_device_id(r.thread_id);
+      event->set_device_id(r.device_id);
     }
     for (const MemRecord &r : mem_records_) {
       auto *event = profile_pb.add_events();
+      event->set_type(proto::Event::GPUKernel);
       event->set_name(r.name);
       event->set_start_ns(r.start_ns);
       event->set_end_ns(r.end_ns);
-      event->set_stream_id(r.stream_id);
+      event->set_sub_device_id(r.stream_id);
       event->set_device_id(r.device_id);
       event->mutable_memcopy()->set_bytes(r.bytes);
     }
@@ -323,8 +332,9 @@ class DeviceTracerImpl : public DeviceTracer {
     if ((domain == CUPTI_CB_DOMAIN_DRIVER_API) &&
         (cbid == CUPTI_DRIVER_TRACE_CBID_cuLaunchKernel)) {
       if (cbInfo->callbackSite == CUPTI_API_ENTER) {
-        const std::string anno =
-            cur_annotation ? cur_annotation : cbInfo->symbolName;
+        const std::string anno = !annotation_stack.empty()
+                                     ? annotation_stack.back()
+                                     : cbInfo->symbolName;
         tracer->AddAnnotation(cbInfo->correlationId, anno);
       }
     } else {
@@ -351,14 +361,15 @@ class DeviceTracerDummy : public DeviceTracer {
 
   void AddAnnotation(uint64_t id, const std::string &anno) {}
 
-  void AddCPURecords(const char *anno, uint64_t start_ns, uint64_t end_ns) {}
+  void AddCPURecords(const std::string &anno, uint64_t start_ns,
+                     uint64_t end_ns, int64_t device_id, int64_t thread_id) {}
 
   void AddMemRecords(const std::string &name, uint64_t start_ns,
-                     uint64_t end_ns, uint32_t device_id, uint32_t stream_id,
+                     uint64_t end_ns, int64_t device_id, int64_t stream_id,
                      uint32_t correlation_id, uint64_t bytes) {}
 
-  void AddKernelRecords(uint64_t start, uint64_t end, uint32_t device_id,
-                        uint32_t stream_id, uint32_t correlation_id) {}
+  void AddKernelRecords(uint64_t start, uint64_t end, int64_t device_id,
+                        int64_t stream_id, uint32_t correlation_id) {}
 
   bool IsEnabled() { return false; }
 
@@ -384,11 +395,28 @@ DeviceTracer *GetDeviceTracer() {
   return tracer;
 }
 
-void SetCurAnnotation(const char *anno) { cur_annotation = anno; }
+void SetCurAnnotation(const std::string &anno) {
+  annotation_stack.push_back(anno);
+}
+
+void ClearCurAnnotation() { annotation_stack.pop_back(); }
+
+std::string CurAnnotation() {
+  if (annotation_stack.empty()) return "";
+  return annotation_stack.back();
+}
+
+void SetCurBlock(int block_id) { block_id_stack.push_back(block_id); }
+
+void ClearCurBlock() { block_id_stack.pop_back(); }
+
+int BlockDepth() { return block_id_stack.size(); }
+
+void SetCurThread(int thread_id) { cur_thread_id = thread_id; }
 
-void ClearCurAnnotation() { cur_annotation = nullptr; }
+void ClearCurThread() { cur_thread_id = 0; }
 
-const char *CurAnnotation() { return cur_annotation; }
+int CurThread() { return cur_thread_id; }
 
 }  // namespace platform
 }  // namespace paddle
diff --git a/paddle/fluid/platform/device_tracer.h b/paddle/fluid/platform/device_tracer.h
index 23f7cdbdffc9f48ac5555455bf745233c81dd0cb..deb3d23f786353b8e7a2f28d094e364158885a34 100644
--- a/paddle/fluid/platform/device_tracer.h
+++ b/paddle/fluid/platform/device_tracer.h
@@ -32,22 +32,23 @@ class DeviceTracer {
   struct KernelRecord {
     uint64_t start_ns;
     uint64_t end_ns;
-    uint32_t device_id;
-    uint32_t stream_id;
+    int64_t device_id;
+    int64_t stream_id;
     uint32_t correlation_id;
   };
   struct CPURecord {
     std::string name;
     uint64_t start_ns;
     uint64_t end_ns;
-    uint64_t thread_id;
+    int64_t device_id;
+    int64_t thread_id;
   };
   struct MemRecord {
     std::string name;
     uint64_t start_ns;
     uint64_t end_ns;
-    uint32_t device_id;
-    uint32_t stream_id;
+    int64_t device_id;
+    int64_t stream_id;
     uint32_t correlation_id;
     uint64_t bytes;
   };
@@ -64,18 +65,18 @@ class DeviceTracer {
   virtual void AddAnnotation(uint64_t id, const std::string& anno) = 0;
 
   virtual void AddMemRecords(const std::string& name, uint64_t start_ns,
-                             uint64_t end_ns, uint32_t device_id,
-                             uint32_t stream_id, uint32_t correlation_id,
+                             uint64_t end_ns, int64_t device_id,
+                             int64_t stream_id, uint32_t correlation_id,
                              uint64_t bytes) = 0;
 
-  virtual void AddCPURecords(const char* anno, uint64_t start_ns,
-                             uint64_t end_ns) = 0;
+  virtual void AddCPURecords(const std::string& anno, uint64_t start_ns,
+                             uint64_t end_ns, int64_t device_id,
+                             int64_t thread_id) = 0;
 
   // Add a cuda kernel stats. `correlation_id` will be mapped to annotation
   // added before for human readability.
-  virtual void AddKernelRecords(uint64_t start, uint64_t end,
-                                uint32_t device_id, uint32_t stream_id,
-                                uint32_t correlation_id) = 0;
+  virtual void AddKernelRecords(uint64_t start, uint64_t end, int64_t device_id,
+                                int64_t stream_id, uint32_t correlation_id) = 0;
 
   // Generate a proto after done (Disabled).
   virtual proto::Profile GenProfile(const std::string& profile_path) = 0;
@@ -87,10 +88,18 @@ class DeviceTracer {
 DeviceTracer* GetDeviceTracer();
 
 // Set a name for the cuda kernel operation being launched by the thread.
-void SetCurAnnotation(const char* anno);
+void SetCurAnnotation(const std::string& anno);
 // Clear the name after the operation is done.
 void ClearCurAnnotation();
 // Current name of the operation being run in the thread.
-const char* CurAnnotation();
+std::string CurAnnotation();
+
+void SetCurBlock(int block_id);
+void ClearCurBlock();
+int BlockDepth();
+
+void SetCurThread(int thread_id);
+void ClearCurThread();
+int CurThread();
 }  // namespace platform
 }  // namespace paddle
diff --git a/paddle/fluid/platform/float16.h b/paddle/fluid/platform/float16.h
index 52fb8c2531357ad7a2b2f8613e5c7fbcef52c6bb..d3312a47f479160439d720dd993ee25a56d732fe 100644
--- a/paddle/fluid/platform/float16.h
+++ b/paddle/fluid/platform/float16.h
@@ -483,8 +483,123 @@ DEVICE inline bool operator>=(const half& a, const half& b) {
 
 #endif  // PADDLE_CUDA_FP16
 
-// Arithmetic operators on ARMv8.2-A CPU
-#if defined(PADDLE_WITH_NATIVE_FP16)
+// Arithmetic operators for float16 on GPU
+#if defined(PADDLE_CUDA_FP16)
+HOSTDEVICE inline float16 operator+(const float16& a, const float16& b) {
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
+  return float16(__hadd(half(a), half(b)));
+#else
+  return float16(float(a) + float(b));
+#endif
+}
+
+HOSTDEVICE inline float16 operator-(const float16& a, const float16& b) {
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
+  return float16(__hsub(half(a), half(b)));
+#else
+  return float16(float(a) - float(b));
+#endif
+}
+
+HOSTDEVICE inline float16 operator*(const float16& a, const float16& b) {
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
+  return float16(__hmul(half(a), half(b)));
+#else
+  return float16(float(a) * float(b));
+#endif
+}
+
+HOSTDEVICE inline float16 operator/(const float16& a, const float16& b) {
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 300
+  // TODO(kexinzhao): check which cuda version starts to support __hdiv
+  float num = __half2float(half(a));
+  float denom = __half2float(half(b));
+  return float16(num / denom);
+#else
+  return float16(float(a) / float(b));
+#endif
+}
+
+HOSTDEVICE inline float16 operator-(const float16& a) {
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
+  return float16(__hneg(half(a)));
+#else
+  float16 res;
+  res.x = a.x ^ 0x8000;
+  return res;
+#endif
+}
+
+HOSTDEVICE inline float16& operator+=(float16& a, const float16& b) {
+  a = a + b;
+  return a;
+}
+
+HOSTDEVICE inline float16& operator-=(float16& a, const float16& b) {
+  a = a - b;
+  return a;
+}
+
+HOSTDEVICE inline float16& operator*=(float16& a, const float16& b) {
+  a = a * b;
+  return a;
+}
+
+HOSTDEVICE inline float16& operator/=(float16& a, const float16& b) {
+  a = a / b;
+  return a;
+}
+
+HOSTDEVICE inline bool operator==(const float16& a, const float16& b) {
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
+  return __heq(half(a), half(b));
+#else
+  return float(a) == float(b);
+#endif
+}
+
+HOSTDEVICE inline bool operator!=(const float16& a, const float16& b) {
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
+  return __hne(half(a), half(b));
+#else
+  return float(a) != float(b);
+#endif
+}
+
+HOSTDEVICE inline bool operator<(const float16& a, const float16& b) {
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
+  return __hlt(half(a), half(b));
+#else
+  return float(a) < float(b);
+#endif
+}
+
+HOSTDEVICE inline bool operator<=(const float16& a, const float16& b) {
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
+  return __hle(half(a), half(b));
+#else
+  return float(a) <= float(b);
+#endif
+}
+
+HOSTDEVICE inline bool operator>(const float16& a, const float16& b) {
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
+  return __hgt(half(a), half(b));
+#else
+  return float(a) > float(b);
+#endif
+}
+
+HOSTDEVICE inline bool operator>=(const float16& a, const float16& b) {
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
+  return __hge(half(a), half(b));
+#else
+  return float(a) >= float(b);
+#endif
+}
+
+// Arithmetic operators for float16 on ARMv8.2-A CPU
+#elif defined(PADDLE_WITH_NATIVE_FP16)
 HOST inline float16 operator+(const float16& a, const float16& b) {
   float16 res;
   asm volatile(
@@ -668,71 +783,71 @@ HOST inline bool operator>=(const float16& a, const float16& b) {
   return (res & 0xffff) != 0;
 }
 
-// Arithmetic operators, software emulated on other CPU
+// Arithmetic operators for float16, software emulated on other CPU
 #else
-HOSTDEVICE inline float16 operator+(const float16& a, const float16& b) {
+HOST inline float16 operator+(const float16& a, const float16& b) {
   return float16(float(a) + float(b));
 }
 
-HOSTDEVICE inline float16 operator-(const float16& a, const float16& b) {
+HOST inline float16 operator-(const float16& a, const float16& b) {
   return float16(float(a) - float(b));
 }
 
-HOSTDEVICE inline float16 operator*(const float16& a, const float16& b) {
+HOST inline float16 operator*(const float16& a, const float16& b) {
   return float16(float(a) * float(b));
 }
 
-HOSTDEVICE inline float16 operator/(const float16& a, const float16& b) {
+HOST inline float16 operator/(const float16& a, const float16& b) {
   return float16(float(a) / float(b));
 }
 
-HOSTDEVICE inline float16 operator-(const float16& a) {
+HOST inline float16 operator-(const float16& a) {
   float16 res;
   res.x = a.x ^ 0x8000;
   return res;
 }
 
-HOSTDEVICE inline float16& operator+=(float16& a, const float16& b) {
+HOST inline float16& operator+=(float16& a, const float16& b) {
   a = float16(float(a) + float(b));
   return a;
 }
 
-HOSTDEVICE inline float16& operator-=(float16& a, const float16& b) {
+HOST inline float16& operator-=(float16& a, const float16& b) {
   a = float16(float(a) - float(b));
   return a;
 }
 
-HOSTDEVICE inline float16& operator*=(float16& a, const float16& b) {
+HOST inline float16& operator*=(float16& a, const float16& b) {
   a = float16(float(a) * float(b));
   return a;
 }
 
-HOSTDEVICE inline float16& operator/=(float16& a, const float16& b) {
+HOST inline float16& operator/=(float16& a, const float16& b) {
   a = float16(float(a) / float(b));
   return a;
 }
 
-HOSTDEVICE inline bool operator==(const float16& a, const float16& b) {
+HOST inline bool operator==(const float16& a, const float16& b) {
   return float(a) == float(b);
 }
 
-HOSTDEVICE inline bool operator!=(const float16& a, const float16& b) {
+HOST inline bool operator!=(const float16& a, const float16& b) {
   return float(a) != float(b);
 }
 
-HOSTDEVICE inline bool operator<(const float16& a, const float16& b) {
+HOST inline bool operator<(const float16& a, const float16& b) {
   return float(a) < float(b);
 }
 
-HOSTDEVICE inline bool operator<=(const float16& a, const float16& b) {
+HOST inline bool operator<=(const float16& a, const float16& b) {
   return float(a) <= float(b);
 }
 
-HOSTDEVICE inline bool operator>(const float16& a, const float16& b) {
+HOST inline bool operator>(const float16& a, const float16& b) {
   return float(a) > float(b);
 }
 
-HOSTDEVICE inline bool operator>=(const float16& a, const float16& b) {
+HOST inline bool operator>=(const float16& a, const float16& b) {
   return float(a) >= float(b);
 }
 #endif
diff --git a/paddle/fluid/platform/profiler.cc b/paddle/fluid/platform/profiler.cc
index 28ef3e04b1c50e0d42eeb27608259c6449429da5..b25206ff35cc87dcdd363bc0de54530f629d73ed 100644
--- a/paddle/fluid/platform/profiler.cc
+++ b/paddle/fluid/platform/profiler.cc
@@ -147,19 +147,48 @@ RecordEvent::RecordEvent(const std::string& name, const DeviceContext* dev_ctx)
   name_ = name;
   PushEvent(name_, dev_ctx_);
   // Maybe need the same push/pop behavior.
-  SetCurAnnotation(name_.c_str());
+  SetCurAnnotation(name_);
 }
 
 RecordEvent::~RecordEvent() {
   if (g_state == ProfilerState::kDisabled) return;
   DeviceTracer* tracer = GetDeviceTracer();
   if (tracer) {
-    tracer->AddCPURecords(CurAnnotation(), start_ns_, PosixInNsec());
+    tracer->AddCPURecords(CurAnnotation(), start_ns_, PosixInNsec(),
+                          BlockDepth(), CurThread());
   }
   ClearCurAnnotation();
   PopEvent(name_, dev_ctx_);
 }
 
+RecordBlock::RecordBlock(int block_id) : start_ns_(PosixInNsec()) {
+  if (g_state == ProfilerState::kDisabled) return;
+  SetCurBlock(block_id);
+  name_ = string::Sprintf("block_%d", block_id);
+}
+
+RecordBlock::~RecordBlock() {
+  if (g_state == ProfilerState::kDisabled) return;
+  DeviceTracer* tracer = GetDeviceTracer();
+  if (tracer) {
+    // We try to put all blocks at the same nested depth in the
+    // same timeline lane, and distinguish them using thread_id.
+    tracer->AddCPURecords(name_, start_ns_, PosixInNsec(), BlockDepth(),
+                          CurThread());
+  }
+  ClearCurBlock();
+}
+
+RecordThread::RecordThread(int thread_id) {
+  if (g_state == ProfilerState::kDisabled) return;
+  SetCurThread(thread_id);
+}
+
+RecordThread::~RecordThread() {
+  if (g_state == ProfilerState::kDisabled) return;
+  ClearCurThread();
+}
+
 void EnableProfiler(ProfilerState state) {
   PADDLE_ENFORCE(state != ProfilerState::kDisabled,
                  "Can't enbale profling, since the input state is ",
diff --git a/paddle/fluid/platform/profiler.h b/paddle/fluid/platform/profiler.h
index 3542ce6cda87e3b013d60393e4ba93da61921940..de9a5cc20d76bf84778e0933831f218abb66c465 100644
--- a/paddle/fluid/platform/profiler.h
+++ b/paddle/fluid/platform/profiler.h
@@ -118,6 +118,20 @@ struct RecordEvent {
   std::string full_name_;
 };
 
+struct RecordBlock {
+  explicit RecordBlock(int block_id);
+  ~RecordBlock();
+
+ private:
+  std::string name_;
+  uint64_t start_ns_;
+};
+
+struct RecordThread {
+  explicit RecordThread(int thread_id);
+  ~RecordThread();
+};
+
 // Return the event list of all threads. Assumed the returned value calls
 // event_lists, event_lists[i][j] represents the j-th Event of i-th thread.
 std::vector<std::vector<Event>> GetAllEvents();
diff --git a/paddle/fluid/platform/profiler.proto b/paddle/fluid/platform/profiler.proto
index 71b5a9b12ef4a045ebfd3ee3d06ee25032083ff5..7b42aa785ec6ad5731e3adee1e9f189127a826a1 100644
--- a/paddle/fluid/platform/profiler.proto
+++ b/paddle/fluid/platform/profiler.proto
@@ -18,12 +18,17 @@ package paddle.platform.proto;
 message MemCopy { optional uint64 bytes = 1; }
 
 message Event {
+  enum EventType {
+    CPU = 0;
+    GPUKernel = 1;
+  }
+  optional EventType type = 8;
   optional string name = 1;
   optional uint64 start_ns = 2;
   optional uint64 end_ns = 3;
   // When positive, it represents gpu id. When -1, it represents CPU.
   optional int64 device_id = 5;
-  optional uint32 stream_id = 6;
+  optional int64 sub_device_id = 6;
 
   optional MemCopy memcopy = 7;
 }
diff --git a/paddle/fluid/pybind/tensor_py.h b/paddle/fluid/pybind/tensor_py.h
index 3b206f2f87abe01363fb7e61c319559c6dd24594..6f8c597f8e610594851c318c122563523e4e7ea6 100644
--- a/paddle/fluid/pybind/tensor_py.h
+++ b/paddle/fluid/pybind/tensor_py.h
@@ -72,6 +72,7 @@ struct CastToPyBufferImpl<true, I, ARGS...> {
         paddle::platform::GpuMemcpyAsync(
             dst_ptr, src_ptr, sizeof(CUR_TYPE) * tensor.numel(),
             cudaMemcpyDeviceToHost, dev_ctx->stream());
+        dev_ctx->Wait();
 #else
         PADDLE_THROW("'CUDAPlace' is not supported in CPU only device.");
 #endif
diff --git a/paddle/fluid/recordio/header.cc b/paddle/fluid/recordio/header.cc
index e50de15b7c2b480357f5f6c7daa2b4a676749679..ed09d58f6a3e2dba50bf4407c0463480575b248e 100644
--- a/paddle/fluid/recordio/header.cc
+++ b/paddle/fluid/recordio/header.cc
@@ -29,8 +29,8 @@ Header::Header(uint32_t num, uint32_t sum, Compressor c, uint32_t cs)
 
 bool Header::Parse(std::istream& is) {
   uint32_t magic;
-  size_t read_size =
-      is.readsome(reinterpret_cast<char*>(&magic), sizeof(uint32_t));
+  is.read(reinterpret_cast<char*>(&magic), sizeof(uint32_t));
+  size_t read_size = is.gcount();
   if (read_size < sizeof(uint32_t)) {
     return false;
   }
diff --git a/python/paddle/fluid/__init__.py b/python/paddle/fluid/__init__.py
index dcde08632a6bb4c5936c32048c2cc1dca7608b06..fcea28220485039c9daf3c5fa2688c31f9f34c42 100644
--- a/python/paddle/fluid/__init__.py
+++ b/python/paddle/fluid/__init__.py
@@ -35,7 +35,7 @@ from core import LoDTensor, CPUPlace, CUDAPlace
 from distribute_transpiler import DistributeTranspiler
 from distribute_transpiler_simple import SimpleDistributeTranspiler
 from concurrency import (Go, make_channel, channel_send, channel_recv,
-                         channel_close)
+                         channel_close, Select)
 import clip
 from memory_optimization_transpiler import memory_optimize, release_memory
 import profiler
diff --git a/python/paddle/fluid/concurrency.py b/python/paddle/fluid/concurrency.py
index dec224fc886cd0739add0ebb6488625ef5063b8d..0fc4981a8e9da09f15e6d0a5e5c6761e01328876 100644
--- a/python/paddle/fluid/concurrency.py
+++ b/python/paddle/fluid/concurrency.py
@@ -12,17 +12,14 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-from layers.control_flow import BlockGuard
+from layers.control_flow import BlockGuard, Select
 from layer_helper import LayerHelper, unique_name
 from layers import fill_constant
 import core
 
 __all__ = [
-    'Go',
-    'make_channel',
-    'channel_send',
-    'channel_recv',
-    'channel_close',
+    'Go', 'make_channel', 'channel_send', 'channel_recv', 'channel_close',
+    'Select'
 ]
 
 
@@ -134,7 +131,7 @@ def make_channel(dtype, capacity=0):
     return channel
 
 
-def channel_send(channel, value):
+def channel_send(channel, value, copy=False):
     """
     Sends a value through a channel variable. Used by an unbuffered or buffered
     channel to pass data from within or to a concurrent Go block, where
@@ -144,6 +141,8 @@ def channel_send(channel, value):
         channel (Variable|Channel): Channel variable created using
         `make_channel`.
         value (Variable): Value to send to channel
+        copy (bool): Whether to copy the data when sending. If False, the
+        data is moved, and the input cannot be used after the move.
     Returns:
         Variable: The boolean status on whether or not the channel
                   successfully sent the passed value.
@@ -165,11 +164,26 @@ def channel_send(channel, value):
         type=core.VarDesc.VarType.LOD_TENSOR,
         dtype=core.VarDesc.VarType.BOOL)
 
+    X = value
+
+    if copy is True:
+        copied_X = helper.create_variable(
+            name=unique_name.generate(value.name + '_copy'),
+            type=value.type,
+            dtype=value.dtype,
+            shape=value.shape,
+            lod_level=value.lod_level,
+            capacity=value.capacity)
+
+        assign_op = channel_send_block.append_op(
+            type="assign_op", inputs={"X": value}, outputs={"Out": copied_X})
+        X = copied_X
+
     channel_send_op = channel_send_block.append_op(
         type="channel_send",
         inputs={
             "Channel": channel,
-            "X": value,
+            "X": X,
         },
         outputs={"Status": status})
 
@@ -198,7 +212,7 @@ def channel_recv(channel, return_value):
 
           ch = fluid.make_channel(dtype='int32', capacity=10)
           with fluid.Go():
-            returned_value = fluid.channel_recv(ch, 'int32')
+            returned_value, return_status = fluid.channel_recv(ch, 'int32')
 
           # Code to send data through the channel.
     """
diff --git a/python/paddle/fluid/framework.py b/python/paddle/fluid/framework.py
index d14d6349b1bcf598e25bbeb9913d2d0da71a5054..70ecffd910a46570b5a8e576d88039fa5e22e726 100644
--- a/python/paddle/fluid/framework.py
+++ b/python/paddle/fluid/framework.py
@@ -487,7 +487,7 @@ class Operator(object):
             'rnn_memory_helper_grad', 'conditional_block', 'while', 'send',
             'recv', 'listen_and_serv', 'parallel_do', 'save_combine',
             'load_combine', 'ncclInit', 'channel_create', 'channel_close',
-            'channel_send', 'channel_recv'
+            'channel_send', 'channel_recv', 'select'
         }
         if type not in no_kernel_op_set:
             self.desc.infer_var_type(self.block.desc)
diff --git a/python/paddle/fluid/layers/control_flow.py b/python/paddle/fluid/layers/control_flow.py
index 1bb1aa30ee1019c6f80eb64b6dc20459e7a3073b..02cd0a05a11d8d1d52d42c2b62799f1093d0abc2 100644
--- a/python/paddle/fluid/layers/control_flow.py
+++ b/python/paddle/fluid/layers/control_flow.py
@@ -16,7 +16,7 @@ import contextlib
 from layer_function_generator import autodoc
 from tensor import assign, fill_constant
 from .. import core
-from ..framework import Program, Variable, Operator
+from ..framework import Program, Variable, Operator, Block
 from ..layer_helper import LayerHelper, unique_name
 from ops import logical_and, logical_not, logical_or
 
@@ -29,6 +29,7 @@ __all__ = [
     'WhileGuard',
     'While',
     'Switch',
+    'Select',
     'lod_rank_table',
     'max_sequence_len',
     'topk',
@@ -1211,6 +1212,186 @@ class Switch(object):
         return True
 
 
+class SelectCase(object):
+    DEFAULT = 0
+    SEND = 1
+    RECEIVE = 2
+
+    def __init__(self,
+                 case_idx,
+                 case_to_execute,
+                 channel_action_fn=None,
+                 channel=None,
+                 value=None):
+        self.helper = LayerHelper('conditional_block')
+        self.main_program = self.helper.main_program
+        self.is_scalar_condition = True
+
+        self.case_to_execute = case_to_execute
+        self.idx = case_idx
+
+        # Since we aren't going to use the `channel_send` or `channel_recv`
+        # functions directly, we just need to capture the name.
+        self.action = (self.SEND
+                       if channel_action_fn.__name__ == ('channel_send') else
+                       self.RECEIVE) if channel_action_fn else (self.DEFAULT)
+        self.value = value
+        self.channel = channel
+
+    def __enter__(self):
+        self.block = self.main_program.create_block()
+
+    def construct_op(self):
+        main_program = self.helper.main_program
+        cases_block = main_program.current_block()
+
+        inner_outputs = set()
+        input_set = set()
+        params = set()
+
+        for op in self.block.ops:
+            # Iterate over all operators, get all the inputs
+            # and add as input to the SelectCase operator.
+            for iname in op.input_names:
+                for in_var_name in op.input(iname):
+                    if in_var_name not in inner_outputs:
+                        input_set.add(in_var_name)
+
+            for oname in op.output_names:
+                for out_var_name in op.output(oname):
+                    inner_outputs.add(out_var_name)
+
+        param_list = [
+            cases_block.var(each_name) for each_name in params
+            if each_name not in input_set
+        ]
+
+        # Iterate over all operators, get all the outputs, and
+        # add them to the output list of the SelectCase operator only if
+        # they exist in the parent block.
+        out_vars = []
+        for inner_out_name in inner_outputs:
+            if inner_out_name in cases_block.vars:
+                out_vars.append(cases_block.var(inner_out_name))
+
+        # First, create an op that will determine whether or not this is the
+        # conditional variable to execute.
+        should_execute_block = equal(
+            fill_constant(
+                shape=[1], dtype=core.VarDesc.VarType.INT32, value=self.idx),
+            self.case_to_execute)
+
+        step_scope = cases_block.create_var(
+            type=core.VarDesc.VarType.STEP_SCOPES)
+
+        cases_block.append_op(
+            type='conditional_block',
+            inputs={'X': [should_execute_block],
+                    'Params': param_list},
+            outputs={'Out': out_vars,
+                     'Scope': [step_scope]},
+            attrs={
+                'sub_block': self.block,
+                'is_scalar_condition': self.is_scalar_condition
+            })
+
+        return '%s,%s,%s,%s' % (self.idx, self.action, self.channel.name
+                                if self.channel else '', self.value.name
+                                if self.value else '')
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        self.main_program.rollback()
+        if exc_type is not None:
+            return False  # re-raise exception
+        return True
+
+
+class Select(BlockGuard):
+    def __init__(self, name=None):
+        self.helper = LayerHelper('select', name=name)
+        self.cases = []
+
+        super(Select, self).__init__(self.helper.main_program)
+        self.case_to_execute = fill_constant(
+            shape=[1], dtype=core.VarDesc.VarType.INT32, value=-1)
+
+    def __enter__(self):
+        super(Select, self).__enter__()
+        return self
+
+    def case(self, channel_action_fn, channel, value):
+        """Create a new block for this condition.
+        """
+        select_case = SelectCase(
+            len(self.cases), self.case_to_execute, channel_action_fn, channel,
+            value)
+
+        self.cases.append(select_case)
+
+        return select_case
+
+    def default(self):
+        """Create a default case block for this condition.
+        """
+        default_case = SelectCase(len(self.cases), self.case_to_execute)
+
+        self.cases.append(default_case)
+
+        return default_case
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        if exc_type is not None:
+            return False
+
+        # Create a select op and another block to wrap its
+        # case blocks.
+        select_block = self.helper.main_program.current_block()
+        parent_block = self.helper.main_program.block(select_block.parent_idx)
+
+        # Construct each case op, inside the newly created select block.
+        serialized_cases = []
+        for case in self.cases:
+            serialized_cases.append(case.construct_op())
+
+        intermediate = set()
+        params = set()
+
+        for case_block in select_block.ops:
+            if case_block.attrs and 'sub_block' in case_block.attrs:
+                for each_op in case_block.attrs['sub_block'].ops:
+                    assert isinstance(each_op, Operator)
+                    for iname in each_op.input_names:
+                        for in_var_name in each_op.input(iname):
+                            if in_var_name not in intermediate:
+                                params.add(in_var_name)
+
+                    for oname in each_op.output_names:
+                        for out_var_name in each_op.output(oname):
+                            intermediate.add(out_var_name)
+
+        # TODO(varunarora): Figure out if defining output is needed.
+        out_list = [
+            parent_block.var(var_name) for var_name in parent_block.vars
+            if var_name in intermediate
+        ]
+
+        X = [select_block.var_recursive(x_name) for x_name in params]
+
+        # Needs to be used by `equal` inside the cases block.
+        X.append(self.case_to_execute)
+
+        # Construct the select op.
+        parent_block.append_op(
+            type='select',
+            inputs={'X': X,
+                    'case_to_execute': self.case_to_execute},
+            attrs={'sub_block': select_block,
+                   'cases': serialized_cases},
+            outputs={})
+
+        return super(Select, self).__exit__(exc_type, exc_val, exc_tb)
+
+
 class IfElseBlockGuard(object):
     def __init__(self, is_true, ifelse):
         if not isinstance(ifelse, IfElse):
diff --git a/python/paddle/fluid/layers/layer_function_generator.py b/python/paddle/fluid/layers/layer_function_generator.py
index bd79022a0c39cf18bd05d49ac62986d342a4ae06..35b01a79914b3427836d4abd51aa2e2eb471d517 100644
--- a/python/paddle/fluid/layers/layer_function_generator.py
+++ b/python/paddle/fluid/layers/layer_function_generator.py
@@ -16,10 +16,7 @@ import cStringIO
 import functools
 import warnings
 
-from .. import proto
-
-framework_pb2 = proto.framework_pb2
-
+from ..proto import framework_pb2
 from ..framework import OpProtoHolder, Variable
 from ..layer_helper import LayerHelper
 
diff --git a/python/paddle/fluid/layers/nn.py b/python/paddle/fluid/layers/nn.py
index bf161d6618b10da66f25d3f11300a4a2b10b875a..75d3d895081e29e25fd5cf29d19e4b8459035ffb 100644
--- a/python/paddle/fluid/layers/nn.py
+++ b/python/paddle/fluid/layers/nn.py
@@ -73,6 +73,7 @@ __all__ = [
     'smooth_l1',
     'one_hot',
     'autoincreased_step_counter',
+    'lod_reset',
 ]
 
 
@@ -1808,52 +1809,52 @@ def conv2d_transpose(input,
     return out
 
 
-def sequence_expand(x, y, name=None):
+def sequence_expand(x, y, ref_level=-1, name=None):
     """Sequence Expand Layer. This layer will expand the input variable **x**
-    according to LoD information of **y**. And the following examples will
-    explain how sequence_expand works:
+    according to the specified LoD level of **y**. Please note that the LoD
+    level of **x** is at most 1 and the rank of **x** is at least 2. When the
+    rank of **x** is greater than 2, it is viewed as a 2-D tensor.
+    The following examples explain how sequence_expand works:
 
     .. code-block:: text
 
         * Case 1
             x is a LoDTensor:
-                x.lod = [[0,       2, 3],
-                         [0, 1,    3, 4]]
-                x.data = [a, b, c, d]
+                x.lod  = [[0,   2,        4]]
+                x.data = [[a], [b], [c], [d]]
                 x.dims = [4, 1]
 
             y is a LoDTensor:
                 y.lod = [[0,    2,    4],
                          [0, 3, 6, 7, 8]]
 
-            with condition len(y.lod[-1]) - 1 == x.dims[0]
+            ref_level: 0
 
-            then output is a 2-level LoDTensor:
-                out.lod = [[0,                2,    4],
-                           [0,       3,       6, 7, 8]]
-                out.data = [a, a, a, b, b, b, c, d]
+            then output is a 1-level LoDTensor:
+                out.lod =  [[0,   2,        4,        6,        8]]
+                out.data = [[a], [b], [a], [b], [c], [d], [c], [d]]
                 out.dims = [8, 1]
 
         * Case 2
             x is a Tensor:
-                x.data = [a, b, c]
+                x.data = [[a], [b], [c]]
                 x.dims = [3, 1]
 
             y is a LoDTensor:
-                y.lod = [[0, 2, 3, 6]]
-
-            with condition len(y.lod[-1]) - 1 == x.dims[0]
+                y.lod = [[0, 2, 2, 5]]
 
-            then output is a 1-level LoDTensor:
-                out.lod = [[0,    2, 3,      6]]
-                out.data = [a, a, b, c, c, c]
-                out.dims = [6, 1]
+            ref_level: -1
 
+            then output is a Tensor:
+                out.data = [[a], [a], [c], [c], [c]]
+                out.dims = [5, 1]
     Args:
         x (Variable): The input variable which is a Tensor or LoDTensor.
         y (Variable): The input variable which is a LoDTensor.
+        ref_level (int): LoD level of `y` that `x` refers to. If set to -1,
+                         the last LoD level is used.
         name(str|None): A name for this layer(optional). If set None, the layer
-                       will be named automatically.
+                        will be named automatically.
 
     Returns:
         Variable: The expanded variable which is a LoDTensor.
@@ -1864,14 +1865,17 @@ def sequence_expand(x, y, name=None):
             x = fluid.layers.data(name='x', shape=[10], dtype='float32')
             y = fluid.layers.data(name='y', shape=[10, 20],
                              dtype='float32', lod_level=1)
-            out = layers.sequence_expand(x=x, y=y)
+            out = layers.sequence_expand(x=x, y=y, ref_level=0)
     """
     helper = LayerHelper('sequence_expand', input=x, **locals())
     dtype = helper.input_dtype()
     tmp = helper.create_tmp_variable(dtype)
     helper.append_op(
-        type='sequence_expand', inputs={'X': x,
-                                        'Y': y}, outputs={'Out': tmp})
+        type='sequence_expand',
+        inputs={'X': x,
+                'Y': y},
+        outputs={'Out': tmp},
+        attrs={'ref_level': ref_level})
     return tmp
 
 
@@ -2225,7 +2229,7 @@ def reduce_prod(input, dim=None, keep_dim=False, name=None):
         keep_dim (bool|False): Whether to reserve the reduced dimension in the
             output Tensor. The result tensor will have one fewer dimension
             than the :attr:`input` unless :attr:`keep_dim` is true.
-        name(str|None): A name for this layer(optional). If set None, the 
+        name(str|None): A name for this layer(optional). If set None, the
             layer will be named automatically.
 
     Returns:
@@ -2241,7 +2245,7 @@ def reduce_prod(input, dim=None, keep_dim=False, name=None):
             fluid.layers.reduce_prod(x)  # [0.0002268]
             fluid.layers.reduce_prod(x, dim=0)  # [0.02, 0.06, 0.3, 0.63]
             fluid.layers.reduce_prod(x, dim=-1)  # [0.027, 0.0084]
-            fluid.layers.reduce_prod(x, dim=1, 
+            fluid.layers.reduce_prod(x, dim=1,
                                      keep_dim=True)  # [[0.027], [0.0084]]
     """
     helper = LayerHelper('reduce_prod', **locals())
@@ -3292,3 +3296,98 @@ def autoincreased_step_counter(counter_name=None, begin=1, step=1):
         counter.stop_gradient = True
 
     return counter
+
+
+def lod_reset(x, y=None, target_lod=None):
+    """
+    LoD Reset Operator. Set LoD of **x** to a new one specified by **y** or
+    **target_lod**. When **y** is provided, **y.lod** is considered as the
+    target LoD first; otherwise **y.data** is considered as the target LoD.
+    If **y** is not provided, the target LoD should be specified by
+    **target_lod**. If the target LoD is specified by **y.data** or
+    **target_lod**, only one-level LoD is supported.
+
+    .. code-block:: text
+
+        * Example 1:
+
+            Given a 1-level LoDTensor x:
+                x.lod =  [[ 0,     2,                   5,     6 ]]
+                x.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
+                x.dims = [6, 1]
+
+            target_lod: [0, 4, 6]
+
+            then we get a 1-level LoDTensor:
+                out.lod =  [[ 0,                   4,            6 ]]
+                out.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
+                out.dims = [6, 1]
+
+        * Example 2:
+
+            Given a 1-level LoDTensor x:
+                x.lod =  [[ 0,     2,                   5,     6 ]]
+                x.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
+                x.dims = [6, 1]
+
+            y is a Tensor:
+                y.data = [[0, 2, 6]]
+                y.dims = [1, 3]
+
+            then we get a 1-level LoDTensor:
+                out.lod =  [[ 0,     2,                          6 ]]
+                out.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
+                out.dims = [6, 1]
+
+        * Example 3:
+
+            Given a 1-level LoDTensor x:
+                x.lod =  [[ 0,      2,                   5,    6 ]]
+                x.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
+                x.dims = [6, 1]
+
+            y is a 2-level LoDTensor:
+                y.lod =  [[0, 2, 4], [0, 2, 5, 6]]
+                y.data = [[1.1], [2.1], [3.1], [4.1], [5.1], [6.1]]
+                y.dims = [6, 1]
+
+            then we get a 2-level LoDTensor:
+                out.lod =  [[0, 2, 4], [0, 2, 5, 6]]
+                out.data = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
+                out.dims = [6, 1]
+
+    Args:
+        x (Variable): Input variable which could be a Tensor or LoDTensor.
+        y (Variable|None): If provided, output's LoD would be derived from y.
+        target_lod (list|tuple|None): One level LoD which should be considered
+                                      as target LoD when y is not provided.
+
+    Returns:
+        Variable: Output variable with LoD specified by this operator.
+
+    Raises:
+        ValueError: If y and target_lod are both None.
+
+    Examples:
+        .. code-block:: python
+
+            x = layers.data(name='x', shape=[10])
+            y = layers.data(name='y', shape=[10, 20], lod_level=2)
+            out = layers.lod_reset(x=x, y=y)
+    """
+    helper = LayerHelper("lod_reset", **locals())
+    out = helper.create_tmp_variable(dtype=x.dtype)
+    if y is not None:
+        helper.append_op(
+            type="lod_reset", inputs={'X': x,
+                                      'Y': y}, outputs={'Out': out})
+    elif target_lod is not None:
+        helper.append_op(
+            type="lod_reset",
+            inputs={'X': x},
+            attrs={'target_lod': target_lod},
+            outputs={'Out': out})
+    else:
+        raise ValueError("y and target_lod should not be both None.")
+
+    return out
diff --git a/python/paddle/fluid/optimizer.py b/python/paddle/fluid/optimizer.py
index 8b8621469d856e63dfd9685d7cd3d4c1d2ada1ce..e8623ee0dae820ece1874ed58ecf962b5f296d94 100644
--- a/python/paddle/fluid/optimizer.py
+++ b/python/paddle/fluid/optimizer.py
@@ -24,7 +24,9 @@ from layer_helper import LayerHelper
 from regularizer import append_regularization_ops
 from clip import append_gradient_clip_ops, error_clip_callback
 
-__all__ = ['SGD', 'Momentum', 'Adagrad', 'Adam', 'Adamax', 'DecayedAdagrad']
+__all__ = [
+    'SGD', 'Momentum', 'Adagrad', 'Adam', 'Adamax', 'DecayedAdagrad', 'Adadelta'
+]
 
 
 class Optimizer(object):
@@ -580,6 +582,88 @@ class DecayedAdagradOptimizer(Optimizer):
         return decayed_adagrad_op
 
 
+class AdadeltaOptimizer(Optimizer):
+    """
+    **Adadelta Optimizer**
+    Simple Adadelta optimizer with average squared grad state and
+    average squared update state.
+    For details of Adadelta, please refer to
+    `ADADELTA: AN ADAPTIVE LEARNING RATE METHOD
+    <http://www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf>`_.
+
+    ..  math::
+
+        E(g_t^2) &= \\rho * E(g_{t-1}^2) + (1-\\rho) * g^2 \\\\
+        learning\\_rate &= sqrt( ( E(dx_{t-1}^2) + \\epsilon ) / ( \\
+                          E(g_t^2) + \\epsilon ) ) \\\\
+        E(dx_t^2) &= \\rho * E(dx_{t-1}^2) + (1-\\rho) * (-g*learning\\_rate)^2
+
+    Args:
+        learning_rate(float): global learning rate
+        rho(float): rho in equation
+        epsilon(float): epsilon in equation
+
+    Examples:
+        .. code-block:: python
+
+            optimizer = fluid.optimizer.Adadelta(
+                learning_rate=0.0003, epsilon=1.0e-6, rho=0.95)
+            _, params_grads = optimizer.minimize(cost)
+    """
+
+    _avg_squared_grad_acc_str = "_avg_squared_grad"
+    _avg_squared_update_acc_str = "_avg_squared_update"
+
+    def __init__(self, learning_rate, epsilon=1.0e-6, rho=0.95, **kwargs):
+        if learning_rate is None:
+            raise ValueError("learning_rate is not set.")
+        if epsilon is None:
+            raise ValueError("epsilon is not set.")
+        if rho is None:
+            raise ValueError("rho is not set.")
+        super(AdadeltaOptimizer, self).__init__(
+            learning_rate=learning_rate, **kwargs)
+        self.type = "adadelta"
+        self._epsilon = epsilon
+        self._rho = rho
+
+    def _create_accumulators(self, block, parameters):
+        if not isinstance(block, framework.Block):
+            raise TypeError("block is not instance of framework.Block.")
+
+        for p in parameters:
+            self._add_accumulator(self._avg_squared_grad_acc_str, p)
+            self._add_accumulator(self._avg_squared_update_acc_str, p)
+
+    def _append_optimize_op(self, block, param_and_grad):
+        if not isinstance(block, framework.Block):
+            raise TypeError("block is not instance of framework.Block.")
+
+        avg_squared_grad_acc = self._get_accumulator(
+            self._avg_squared_grad_acc_str, param_and_grad[0])
+        avg_squared_update_acc = self._get_accumulator(
+            self._avg_squared_update_acc_str, param_and_grad[0])
+
+        # Create the adadelta optimizer op
+        adadelta_op = block.append_op(
+            type=self.type,
+            inputs={
+                "Param": param_and_grad[0],
+                "Grad": param_and_grad[1],
+                "AvgSquaredGrad": avg_squared_grad_acc,
+                "AvgSquaredUpdate": avg_squared_update_acc
+            },
+            outputs={
+                "ParamOut": param_and_grad[0],
+                "AvgSquaredGradOut": avg_squared_grad_acc,
+                "AvgSquaredUpdateOut": avg_squared_update_acc
+            },
+            attrs={"epsilon": self._epsilon,
+                   "rho": self._rho})
+
+        return adadelta_op
+
+
 # We short the class name, since users will use the optimizer with the package
 # name. The sample code:
 #
@@ -594,3 +678,4 @@ Adagrad = AdagradOptimizer
 Adam = AdamOptimizer
 Adamax = AdamaxOptimizer
 DecayedAdagrad = DecayedAdagradOptimizer
+Adadelta = AdadeltaOptimizer
diff --git a/python/paddle/fluid/regularizer.py b/python/paddle/fluid/regularizer.py
index 029db7d2dd4b7def8cea374e3f2ed31226f2bc18..604c6f9ab36c2332223d1ba943d67113922615b3 100644
--- a/python/paddle/fluid/regularizer.py
+++ b/python/paddle/fluid/regularizer.py
@@ -44,6 +44,11 @@ def append_regularization_ops(parameters_and_grads, regularization=None):
     """
     params_and_grads = []
     for param, grad in parameters_and_grads:
+        # If no gradient then we don't need to do anything
+        if grad is None:
+            params_and_grads.append((param, grad))
+            continue
+
         regularization_term = None
         if param.regularizer is not None:
             # Add variable for regularization term in grad block
@@ -51,9 +56,8 @@ def append_regularization_ops(parameters_and_grads, regularization=None):
         elif regularization is not None:
             regularization_term = regularization(param, grad, grad.block)
 
-        # If no gradient or no regularization specified,
-        # then we don't need to do anything
-        if grad is None or regularization_term is None:
+        # If no regularization specified, then we don't need to do anything
+        if regularization_term is None:
             params_and_grads.append((param, grad))
             continue
 
diff --git a/python/paddle/fluid/tests/book/test_machine_translation.py b/python/paddle/fluid/tests/book/test_machine_translation.py
index fa38bd3762423497b82c3b421b3a1db4cd87525b..3a1a0859ecfd4ac5337e2112f8b22e32d8474f22 100644
--- a/python/paddle/fluid/tests/book/test_machine_translation.py
+++ b/python/paddle/fluid/tests/book/test_machine_translation.py
@@ -118,12 +118,12 @@ def decoder_decode(context, is_sparse):
             is_sparse=is_sparse)
 
         # use rnn unit to update rnn
-        current_state = pd.fc(input=[pre_ids_emb, pre_state_expanded],
+        current_state = pd.fc(input=[pre_state_expanded, pre_ids_emb],
                               size=decoder_size,
                               act='tanh')
-
+        current_state_with_lod = pd.lod_reset(x=current_state, y=pre_score)
         # use score to do beam search
-        current_score = pd.fc(input=current_state,
+        current_score = pd.fc(input=current_state_with_lod,
                               size=target_dict_dim,
                               act='softmax')
         topk_scores, topk_indices = pd.topk(current_score, k=50)
diff --git a/python/paddle/fluid/tests/test_concurrency.py b/python/paddle/fluid/tests/test_concurrency.py
index 9f7bf63c5e017251e87af94690ff32c47c538c6b..924895a9afac610059bac5f617c49712441339cc 100644
--- a/python/paddle/fluid/tests/test_concurrency.py
+++ b/python/paddle/fluid/tests/test_concurrency.py
@@ -15,9 +15,9 @@
 import unittest
 import paddle.fluid as fluid
 import paddle.fluid.core as core
-from paddle.fluid import framework, unique_name
+from paddle.fluid import framework, unique_name, layer_helper
 from paddle.fluid.executor import Executor
-from paddle.fluid.layers import fill_constant
+from paddle.fluid.layers import fill_constant, assign, While, elementwise_add, Print
 
 
 class TestRoutineOp(unittest.TestCase):
@@ -86,8 +86,7 @@ class TestRoutineOp(unittest.TestCase):
         self.assertEqual(leftmost_data[0][0], n + 1)
 
     def _create_one_dim_tensor(self, value):
-        one_dim_tensor = fill_constant(
-            shape=[1], dtype=core.VarDesc.VarType.INT64, value=value)
+        one_dim_tensor = fill_constant(shape=[1], dtype='int', value=value)
         one_dim_tensor.stop_gradient = True
         return one_dim_tensor
 
@@ -95,6 +94,180 @@ class TestRoutineOp(unittest.TestCase):
         return framework.default_main_program().current_block().create_var(
             name=unique_name.generate(name), type=type, dtype=dtype)
 
+    def _create_persistable_tensor(self, name, type, dtype):
+        return framework.default_main_program().current_block().create_var(
+            name=unique_name.generate(name),
+            type=type,
+            dtype=dtype,
+            persistable=True)
+
+    def test_select(self):
+        with framework.program_guard(framework.Program()):
+            ch1 = fluid.make_channel(
+                dtype=core.VarDesc.VarType.LOD_TENSOR, capacity=1)
+
+            result1 = self._create_tensor('return_value',
+                                          core.VarDesc.VarType.LOD_TENSOR,
+                                          core.VarDesc.VarType.FP64)
+
+            input_value = fill_constant(
+                shape=[1], dtype=core.VarDesc.VarType.FP64, value=10)
+
+            with fluid.Select() as select:
+                with select.case(fluid.channel_send, ch1, input_value):
+                    # Execute something.
+                    pass
+
+                with select.default():
+                    pass
+
+            # This should not block because we are using a buffered channel.
+            result1, status = fluid.channel_recv(ch1, result1)
+            fluid.channel_close(ch1)
+
+            cpu = core.CPUPlace()
+            exe = Executor(cpu)
+
+            result = exe.run(fetch_list=[result1])
+            self.assertEqual(result[0][0], 10)
+
+    def test_fibonacci(self):
+        """
+        Mimics Fibonacci Go example: https://tour.golang.org/concurrency/5
+        """
+        with framework.program_guard(framework.Program()):
+            quit_ch_input_var = self._create_persistable_tensor(
+                'quit_ch_input', core.VarDesc.VarType.LOD_TENSOR,
+                core.VarDesc.VarType.INT32)
+            quit_ch_input = fill_constant(
+                shape=[1],
+                dtype=core.VarDesc.VarType.INT32,
+                value=0,
+                out=quit_ch_input_var)
+
+            result = self._create_persistable_tensor(
+                'result', core.VarDesc.VarType.LOD_TENSOR,
+                core.VarDesc.VarType.INT32)
+            fill_constant(
+                shape=[1],
+                dtype=core.VarDesc.VarType.INT32,
+                value=0,
+                out=result)
+
+            x = fill_constant(
+                shape=[1], dtype=core.VarDesc.VarType.INT32, value=0)
+            y = fill_constant(
+                shape=[1], dtype=core.VarDesc.VarType.INT32, value=1)
+
+            while_cond = fill_constant(
+                shape=[1], dtype=core.VarDesc.VarType.BOOL, value=True)
+
+            while_false = fill_constant(
+                shape=[1], dtype=core.VarDesc.VarType.BOOL, value=False)
+
+            x_tmp = fill_constant(
+                shape=[1], dtype=core.VarDesc.VarType.INT32, value=0)
+
+            def fibonacci(channel, quit_channel):
+                while_op = While(cond=while_cond)
+                with while_op.block():
+                    result2 = fill_constant(
+                        shape=[1], dtype=core.VarDesc.VarType.INT32, value=0)
+                    x_to_send_tmp = fill_constant(
+                        shape=[1], dtype=core.VarDesc.VarType.INT32, value=0)
+
+                    # TODO(abhinav): Need to perform copy when doing a channel send.
+                    #   Once this is complete, we can remove these lines
+                    assign(input=x, output=x_to_send_tmp)
+
+                    with fluid.Select() as select:
+                        with select.case(fluid.channel_send, channel,
+                                         x_to_send_tmp):
+                            assign(input=x, output=x_tmp)
+                            assign(input=y, output=x)
+                            assign(elementwise_add(x=x_tmp, y=y), output=y)
+
+                        with select.case(fluid.channel_recv, quit_channel,
+                                         result2):
+                            # Quit
+                            helper = layer_helper.LayerHelper('assign')
+                            helper.append_op(
+                                type='assign',
+                                inputs={'X': [while_false]},
+                                outputs={'Out': [while_cond]})
+
+            ch1 = fluid.make_channel(dtype=core.VarDesc.VarType.LOD_TENSOR)
+            quit_ch = fluid.make_channel(dtype=core.VarDesc.VarType.LOD_TENSOR)
+
+            with fluid.Go():
+                for i in xrange(10):
+                    fluid.channel_recv(ch1, result)
+                    Print(result)
+
+                fluid.channel_send(quit_ch, quit_ch_input)
+
+            fibonacci(ch1, quit_ch)
+
+            fluid.channel_close(ch1)
+            fluid.channel_close(quit_ch)
+
+            cpu = core.CPUPlace()
+            exe = Executor(cpu)
+
+            exe_result = exe.run(fetch_list=[result])
+            self.assertEqual(exe_result[0][0], 34)
+
+    def test_ping_pong(self):
+        """
+        Mimics Ping Pong example: https://gobyexample.com/channel-directions
+        """
+        with framework.program_guard(framework.Program()):
+            result = self._create_tensor('return_value',
+                                         core.VarDesc.VarType.LOD_TENSOR,
+                                         core.VarDesc.VarType.FP64)
+
+            ping_result = self._create_tensor('ping_return_value',
+                                              core.VarDesc.VarType.LOD_TENSOR,
+                                              core.VarDesc.VarType.FP64)
+
+            pong_result = self._create_tensor('pong_return_value',
+                                              core.VarDesc.VarType.LOD_TENSOR,
+                                              core.VarDesc.VarType.FP64)
+
+            def ping(ch, message):
+                message_to_send_tmp = fill_constant(
+                    shape=[1], dtype=core.VarDesc.VarType.FP64, value=0)
+
+                assign(input=message, output=message_to_send_tmp)
+                fluid.channel_send(ch, message_to_send_tmp)
+
+            def pong(ch1, ch2):
+                fluid.channel_recv(ch1, ping_result)
+                assign(input=ping_result, output=pong_result)
+                fluid.channel_send(ch2, pong_result)
+
+            pings = fluid.make_channel(
+                dtype=core.VarDesc.VarType.LOD_TENSOR, capacity=1)
+            pongs = fluid.make_channel(
+                dtype=core.VarDesc.VarType.LOD_TENSOR, capacity=1)
+
+            msg = fill_constant(
+                shape=[1], dtype=core.VarDesc.VarType.FP64, value=9)
+
+            ping(pings, msg)
+            pong(pings, pongs)
+
+            fluid.channel_recv(pongs, result)
+
+            fluid.channel_close(pings)
+            fluid.channel_close(pongs)
+
+            cpu = core.CPUPlace()
+            exe = Executor(cpu)
+
+            exe_result = exe.run(fetch_list=[result])
+            self.assertEqual(exe_result[0][0], 9)
+
 
 if __name__ == '__main__':
     unittest.main()
diff --git a/python/paddle/fluid/tests/unittests/CMakeLists.txt b/python/paddle/fluid/tests/unittests/CMakeLists.txt
index f96c2ca4f0593b6c2624d449304f23425c69ab93..0ad273c7161977e18f91f952fd3a9dc144bf73f0 100644
--- a/python/paddle/fluid/tests/unittests/CMakeLists.txt
+++ b/python/paddle/fluid/tests/unittests/CMakeLists.txt
@@ -11,7 +11,6 @@ list(REMOVE_ITEM TEST_OPS test_lstm_unit_op) # # FIXME(qijun) https://github.com
 list(REMOVE_ITEM TEST_OPS test_nce) # IXME(qijun) https://github.com/PaddlePaddle/Paddle/issues/7778
 list(REMOVE_ITEM TEST_OPS test_recurrent_op) # FIXME(qijun) https://github.com/PaddlePaddle/Paddle/issues/6152
 list(REMOVE_ITEM TEST_OPS test_cond_op) # FIXME(qijun): https://github.com/PaddlePaddle/Paddle/issues/5101#issuecomment-339814957
-list(REMOVE_ITEM TEST_OPS test_detection_output_op) # FIXME: detection_output_op will be rewritten. This unittest should be
 
 list(REMOVE_ITEM TEST_OPS op_test) # op_test is a helper python file, not a test
 list(REMOVE_ITEM TEST_OPS decorators) # decorators is a helper python file, not a test
diff --git a/python/paddle/fluid/tests/unittests/op_test.py b/python/paddle/fluid/tests/unittests/op_test.py
index f7e02595ec3b41ae7bb32353c258736968ca78d4..8393f7827b1c7d361ebea72f2cfc6033268772f0 100644
--- a/python/paddle/fluid/tests/unittests/op_test.py
+++ b/python/paddle/fluid/tests/unittests/op_test.py
@@ -469,6 +469,28 @@ class OpTest(unittest.TestCase):
             tensor.set_lod(lod)
         return tensor
 
+    @staticmethod
+    def np_dtype_to_fluid_dtype(input):
+        """Change the dtype of float16 numpy array
+
+        numpy float16 is bound to paddle::platform::float16
+        in tensor_py.h with the help of the uint16 data type, since
+        the internal memory representation of float16 is
+        uint16_t in paddle and np.uint16 in numpy, which are
+        themselves bound together by pybind.
+
+        Args:
+            input: input numpy array
+
+        Returns:
+            input: The dtype of input will be changed to np.uint16 if
+                it is originally np.float16, so that the internal memory
+                of input is reinterpreted as dtype np.uint16.
+        """
+        if input.dtype == np.float16:
+            input.dtype = np.uint16
+        return input
+
     def _get_gradient(self, input_to_check, place, output_names, no_grad_set):
         prog = Program()
         block = prog.global_block()
diff --git a/python/paddle/fluid/tests/unittests/test_cast_op.py b/python/paddle/fluid/tests/unittests/test_cast_op.py
index 8fb8d03828393ccfe57c0848d79b960c641ad39a..b8d3ed3aa3eb0e47e79f46cdf681a3b9cca46036 100644
--- a/python/paddle/fluid/tests/unittests/test_cast_op.py
+++ b/python/paddle/fluid/tests/unittests/test_cast_op.py
@@ -18,7 +18,7 @@ import numpy as np
 import paddle.fluid.core as core
 
 
-class TestCastOp(op_test.OpTest):
+class TestCastOp1(op_test.OpTest):
     def setUp(self):
         ipt = np.random.random(size=[10, 10])
         self.inputs = {'X': ipt.astype('float32')}
@@ -36,5 +36,36 @@ class TestCastOp(op_test.OpTest):
         self.check_grad(['X'], ['Out'])
 
 
+class TestCastOp2(op_test.OpTest):
+    def setUp(self):
+        ipt = np.random.random(size=[10, 10])
+        # numpy float16 is bound to fluid float16 via uint16
+        self.inputs = {'X': ipt.astype('float16').view(np.uint16)}
+        self.outputs = {'Out': ipt.astype('float32')}
+        self.attrs = {
+            'in_dtype': int(core.VarDesc.VarType.FP16),
+            'out_dtype': int(core.VarDesc.VarType.FP32)
+        }
+        self.op_type = 'cast'
+
+    def test_check_output(self):
+        self.check_output(atol=1e-3)
+
+
+class TestCastOp3(op_test.OpTest):
+    def setUp(self):
+        ipt = np.random.random(size=[10, 10])
+        self.inputs = {'X': ipt.astype('float32')}
+        self.outputs = {'Out': ipt.astype('float16')}
+        self.attrs = {
+            'in_dtype': int(core.VarDesc.VarType.FP32),
+            'out_dtype': int(core.VarDesc.VarType.FP16)
+        }
+        self.op_type = 'cast'
+
+    def test_check_output(self):
+        self.check_output(atol=1e-3)
+
+
 if __name__ == '__main__':
     unittest.main()
diff --git a/python/paddle/fluid/tests/unittests/test_conv2d_op.py b/python/paddle/fluid/tests/unittests/test_conv2d_op.py
index a49fecf09509f7b1d9f758eebcf90bf9fbf7669f..4b6e3fb69a12095c77f343515fe3b6d1f3fccb14 100644
--- a/python/paddle/fluid/tests/unittests/test_conv2d_op.py
+++ b/python/paddle/fluid/tests/unittests/test_conv2d_op.py
@@ -63,9 +63,11 @@ def conv2d_forward_naive(input, filter, group, conv_param):
 
 class TestConv2dOp(OpTest):
     def setUp(self):
+        self.op_type = "conv2d"
         self.use_cudnn = False
         self.use_mkldnn = False
-        self.init_op_type()
+        self.dtype = np.float32
+        self.init_kernel_type()
         self.init_group()
         self.init_dilation()
         self.init_test_case()
@@ -75,12 +77,16 @@ class TestConv2dOp(OpTest):
             'pad': self.pad,
             'dilation': self.dilations
         }
-        input = np.random.random(self.input_size).astype("float32")
-        filter = np.random.random(self.filter_size).astype("float32")
+
+        input = np.random.random(self.input_size).astype(self.dtype)
+        filter = np.random.random(self.filter_size).astype(self.dtype)
         output = conv2d_forward_naive(input, filter, self.groups,
-                                      conv2d_param).astype('float32')
+                                      conv2d_param).astype(self.dtype)
 
-        self.inputs = {'Input': input, 'Filter': filter}
+        self.inputs = {
+            'Input': OpTest.np_dtype_to_fluid_dtype(input),
+            'Filter': OpTest.np_dtype_to_fluid_dtype(filter)
+        }
         self.attrs = {
             'strides': self.stride,
             'paddings': self.pad,
@@ -99,6 +105,8 @@ class TestConv2dOp(OpTest):
             self.check_output()
 
     def test_check_grad(self):
+        if self.dtype == np.float16:
+            return
         if self.use_cudnn:
             place = core.CUDAPlace(0)
             self.check_grad_with_place(
@@ -111,6 +119,8 @@ class TestConv2dOp(OpTest):
                 set(['Input', 'Filter']), 'Output', max_relative_error=0.02)
 
     def test_check_grad_no_filter(self):
+        if self.dtype == np.float16:
+            return
         if self.use_cudnn:
             place = core.CUDAPlace(0)
             self.check_grad_with_place(
@@ -126,6 +136,8 @@ class TestConv2dOp(OpTest):
                 no_grad_set=set(['Filter']))
 
     def test_check_grad_no_input(self):
+        if self.dtype == np.float16:
+            return
         if self.use_cudnn:
             place = core.CUDAPlace(0)
             self.check_grad_with_place(
@@ -154,8 +166,8 @@ class TestConv2dOp(OpTest):
     def init_group(self):
         self.groups = 1
 
-    def init_op_type(self):
-        self.op_type = "conv2d"
+    def init_kernel_type(self):
+        pass
 
 
 class TestWithPad(TestConv2dOp):
@@ -227,39 +239,105 @@ class TestWithInput1x1Filter1x1(TestConv2dOp):
 
 #----------------Conv2dCUDNN----------------
 class TestCUDNN(TestConv2dOp):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_cudnn = True
-        self.op_type = "conv2d"
+
+
+class TestFP16CUDNN(TestConv2dOp):
+    def init_kernel_type(self):
+        self.use_cudnn = True
+        self.dtype = np.float16
+
+    def test_check_output(self):
+        if core.is_compiled_with_cuda():
+            place = core.CUDAPlace(0)
+            if core.is_float16_supported(place):
+                self.check_output_with_place(place, atol=2e-2)
 
 
 class TestCUDNNWithPad(TestWithPad):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_cudnn = True
-        self.op_type = "conv2d"
+
+
+class TestFP16CUDNNWithPad(TestWithPad):
+    def init_kernel_type(self):
+        self.use_cudnn = True
+        self.dtype = np.float16
+
+    def test_check_output(self):
+        if core.is_compiled_with_cuda():
+            place = core.CUDAPlace(0)
+            if core.is_float16_supported(place):
+                self.check_output_with_place(place, atol=2e-2)
 
 
 class TestCUDNNWithStride(TestWithStride):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_cudnn = True
-        self.op_type = "conv2d"
+
+
+class TestFP16CUDNNWithStride(TestWithStride):
+    def init_kernel_type(self):
+        self.use_cudnn = True
+        self.dtype = np.float16
+
+    def test_check_output(self):
+        if core.is_compiled_with_cuda():
+            place = core.CUDAPlace(0)
+            if core.is_float16_supported(place):
+                self.check_output_with_place(place, atol=2e-2)
 
 
 class TestCUDNNWithGroup(TestWithGroup):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_cudnn = True
-        self.op_type = "conv2d"
+
+
+class TestFP16CUDNNWithGroup(TestWithGroup):
+    def init_kernel_type(self):
+        self.use_cudnn = True
+        self.dtype = np.float16
+
+    def test_check_output(self):
+        if core.is_compiled_with_cuda():
+            place = core.CUDAPlace(0)
+            if core.is_float16_supported(place):
+                self.check_output_with_place(place, atol=2e-2)
 
 
 class TestCUDNNWith1x1(TestWith1x1):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_cudnn = True
-        self.op_type = "conv2d"
+
+
+class TestFP16CUDNNWith1x1(TestWith1x1):
+    def init_kernel_type(self):
+        self.use_cudnn = True
+        self.dtype = np.float16
+
+    def test_check_output(self):
+        if core.is_compiled_with_cuda():
+            place = core.CUDAPlace(0)
+            if core.is_float16_supported(place):
+                self.check_output_with_place(place, atol=2e-2)
 
 
 class TestCUDNNWithInput1x1Filter1x1(TestWithInput1x1Filter1x1):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_cudnn = True
-        self.op_type = "conv2d"
+
+
+class TestFP16CUDNNWithInput1x1Filter1x1(TestWithInput1x1Filter1x1):
+    def init_kernel_type(self):
+        self.use_cudnn = True
+        self.dtype = np.float16
+
+    def test_check_output(self):
+        if core.is_compiled_with_cuda():
+            place = core.CUDAPlace(0)
+            if core.is_float16_supported(place):
+                self.check_output_with_place(place, atol=2e-2)
 
 
 class TestDepthwiseConv(TestConv2dOp):
@@ -295,21 +373,18 @@ class TestDepthwiseConv2(TestConv2dOp):
 
 #----------------Conv2dMKLDNN----------------
 class TestMKLDNN(TestConv2dOp):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_mkldnn = True
-        self.op_type = "conv2d"
 
 
 class TestMKLDNNWithPad(TestWithPad):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_mkldnn = True
-        self.op_type = "conv2d"
 
 
 class TestMKLDNNWithStride(TestWithStride):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_mkldnn = True
-        self.op_type = "conv2d"
 
 
 if __name__ == '__main__':
diff --git a/python/paddle/fluid/tests/unittests/test_detection_output_op.py b/python/paddle/fluid/tests/unittests/test_detection_output_op.py
deleted file mode 100644
index 94681319144ee3e0d51b57944f5692183578c01b..0000000000000000000000000000000000000000
--- a/python/paddle/fluid/tests/unittests/test_detection_output_op.py
+++ /dev/null
@@ -1,71 +0,0 @@
-#   Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import unittest
-import numpy as np
-from op_test import OpTest
-
-
-class TestUnpoolOp(OpTest):
-    def setUp(self):
-        self.op_type = "detection_output"
-        self.init_test_case()
-
-        #loc.shape ((1, 4, 4, 1, 1))
-        #conf.shape ((1, 4, 2, 1, 1))
-
-        loc = np.array([[[[[0.1]], [[0.1]], [[0.1]], [[0.1]]],
-                         [[[0.1]], [[0.1]], [[0.1]], [[0.1]]],
-                         [[[0.1]], [[0.1]], [[0.1]], [[0.1]]],
-                         [[[0.1]], [[0.1]], [[0.1]], [[0.1]]]]])
-        conf = np.array([[[[[0.1]], [[0.9]]], [[[0.2]], [[0.8]]],
-                          [[[0.3]], [[0.7]]], [[[0.4]], [[0.6]]]]])
-        priorbox = np.array([
-            0.1, 0.1, 0.5, 0.5, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.6, 0.6, 0.1,
-            0.1, 0.2, 0.2, 0.3, 0.3, 0.7, 0.7, 0.1, 0.1, 0.2, 0.2, 0.4, 0.4,
-            0.8, 0.8, 0.1, 0.1, 0.2, 0.2
-        ])
-
-        output = np.array([
-            0, 1, 0.68997443, 0.099959746, 0.099959746, 0.50804031, 0.50804031
-        ])
-        self.inputs = {
-            'Loc': loc.astype('float32'),
-            'Conf': conf.astype('float32'),
-            'PriorBox': priorbox.astype('float32')
-        }
-        self.attrs = {
-            'num_classes': self.num_classes,
-            'top_k': self.top_k,
-            'nms_top_k': self.nms_top_k,
-            'background_label_id': self.background_label_id,
-            'nms_threshold': self.nms_threshold,
-            'confidence_threshold': self.confidence_threshold,
-        }
-        self.outputs = {'Out': output.astype('float32')}
-
-    def test_check_output(self):
-        self.check_output()
-
-    def init_test_case(self):
-        self.num_classes = 2
-        self.top_k = 10
-        self.nms_top_k = 20
-        self.background_label_id = 0
-        self.nms_threshold = 0.01
-        self.confidence_threshold = 0.01
-
-
-if __name__ == '__main__':
-    unittest.main()
diff --git a/python/paddle/fluid/tests/unittests/test_dropout_op.py b/python/paddle/fluid/tests/unittests/test_dropout_op.py
index 60930a612c128cbf18e89711b9246d148e41ec58..eaa3435a86462236a99489749abe877648677053 100644
--- a/python/paddle/fluid/tests/unittests/test_dropout_op.py
+++ b/python/paddle/fluid/tests/unittests/test_dropout_op.py
@@ -14,6 +14,7 @@
 
 import unittest
 import numpy as np
+import paddle.fluid.core as core
 from op_test import OpTest
 
 
@@ -82,5 +83,37 @@ class TestDropoutOp5(OpTest):
         self.check_output()
 
 
+class TestFP16DropoutOp(OpTest):
+    def setUp(self):
+        self.op_type = "dropout"
+        self.init_test_case()
+
+        x = np.random.random(self.input_size).astype("float16")
+        out = x * (1.0 - self.prob)
+        self.inputs = {'X': OpTest.np_dtype_to_fluid_dtype(x)}
+        self.attrs = {
+            'dropout_prob': self.prob,
+            'fix_seed': self.fix_seed,
+            'is_test': True
+        }
+        self.outputs = {'Out': out}
+
+    def init_test_case(self):
+        self.input_size = [32, 64]
+        self.prob = 0.35
+        self.fix_seed = True
+
+    def test_check_output(self):
+        if core.is_compiled_with_cuda() and core.op_support_gpu("dropout"):
+            self.check_output_with_place(core.CUDAPlace(0), atol=1e-3)
+
+
+class TestFP16DropoutOp2(TestFP16DropoutOp):
+    def init_test_case(self):
+        self.input_size = [32, 64, 3]
+        self.prob = 0.75
+        self.fix_seed = False
+
+
 if __name__ == '__main__':
     unittest.main()
diff --git a/python/paddle/fluid/tests/unittests/test_layers.py b/python/paddle/fluid/tests/unittests/test_layers.py
index 90d70aa39fdc4d4d3f9062eb6a3eb0cdd014acfc..b5fd59cf3a1bea50b799c3ace8f3b9cea088b9d5 100644
--- a/python/paddle/fluid/tests/unittests/test_layers.py
+++ b/python/paddle/fluid/tests/unittests/test_layers.py
@@ -181,8 +181,8 @@ class TestBook(unittest.TestCase):
         with program_guard(program):
             x = layers.data(name='x', shape=[10], dtype='float32')
             y = layers.data(
-                name='y', shape=[10, 20], dtype='float32', lod_level=1)
-            self.assertIsNotNone(layers.sequence_expand(x=x, y=y))
+                name='y', shape=[10, 20], dtype='float32', lod_level=2)
+            self.assertIsNotNone(layers.sequence_expand(x=x, y=y, ref_level=1))
         print(str(program))
 
     def test_lstm_unit(self):
@@ -327,6 +327,15 @@ class TestBook(unittest.TestCase):
             self.assertIsNotNone(loss)
         print(str(program))
 
+    def test_lod_reset(self):
+        program = Program()
+        with program_guard(program):
+            x = layers.data(name='x', shape=[10], dtype='float32')
+            y = layers.data(
+                name='y', shape=[10, 20], dtype='float32', lod_level=2)
+            print(layers.lod_reset(x=x, y=y))
+        print(str(program))
+
 
 if __name__ == '__main__':
     unittest.main()
diff --git a/python/paddle/fluid/tests/unittests/test_lod_reset_op.py b/python/paddle/fluid/tests/unittests/test_lod_reset_op.py
index 3bf8230f8748dd87ec3c85b0cbd78df2e695a96b..6b6d4c824aeae319dacf224408ce96a0d9c5bb35 100644
--- a/python/paddle/fluid/tests/unittests/test_lod_reset_op.py
+++ b/python/paddle/fluid/tests/unittests/test_lod_reset_op.py
@@ -42,7 +42,7 @@ class TestLodResetOpByInput(OpTest):
         target_lod_0 = [0, 4, 7, 10]
         self.inputs = {
             'X': (x, lod),
-            'TargetLoD': np.array([target_lod_0]).astype('int32')
+            'Y': np.array([target_lod_0]).astype('int32')
         }
         self.outputs = {'Out': (x, [target_lod_0])}
 
@@ -50,7 +50,7 @@ class TestLodResetOpByInput(OpTest):
         self.check_output()
 
     def test_check_grad(self):
-        self.check_grad(["X"], "Out", no_grad_set=set("TargetLoD"))
+        self.check_grad(["X"], "Out", no_grad_set=set("Y"))
 
 
 class TestLodResetOpBoth(OpTest):
@@ -62,7 +62,7 @@ class TestLodResetOpBoth(OpTest):
         target_lod_0_in = [0, 4, 7, 10]
         self.inputs = {
             'X': (x, lod),
-            'TargetLoD': np.array(target_lod_0_in).astype('int32')
+            'Y': np.array(target_lod_0_in).astype('int32')
         }
         self.attrs = {'target_lod': target_lod_0_attr}
         self.outputs = {'Out': (x, [target_lod_0_in])}
@@ -71,7 +71,24 @@ class TestLodResetOpBoth(OpTest):
         self.check_output()
 
     def test_check_grad(self):
-        self.check_grad(["X"], "Out", no_grad_set=set("TargetLoD"))
+        self.check_grad(["X"], "Out", no_grad_set=set("Y"))
+
+
+class TestLodResetOpYIsLoDTensor(OpTest):
+    def setUp(self):
+        self.op_type = "lod_reset"
+        x = np.random.random((10, 20)).astype("float32")
+        lod = [[0, 3, 5, 10]]
+        y = np.random.random((10, 10)).astype("float32")
+        target_lod_0 = [[0, 4, 7, 10]]
+        self.inputs = {'X': (x, lod), 'Y': (y, target_lod_0)}
+        self.outputs = {'Out': (x, target_lod_0)}
+
+    def test_check_output(self):
+        self.check_output()
+
+    def test_check_grad(self):
+        self.check_grad(["X"], "Out", no_grad_set=set("Y"))
 
 
 if __name__ == '__main__':
diff --git a/python/paddle/fluid/tests/unittests/test_lrn_op.py b/python/paddle/fluid/tests/unittests/test_lrn_op.py
index 7f2352c5882ce36d8d681a737806f3ee0e3ace98..eaff45cbb2a58798e9d55149510bec72eea370cd 100644
--- a/python/paddle/fluid/tests/unittests/test_lrn_op.py
+++ b/python/paddle/fluid/tests/unittests/test_lrn_op.py
@@ -41,7 +41,7 @@ class TestLRNOp(OpTest):
         mid.fill(self.k)
         for m in range(0, self.N):
             for i in range(0, self.C):
-                for c in range(start, end + 1):
+                for c in range(start, end):
                     ch = i + c
                     if ch < 0 or ch >= self.C:
                         continue
diff --git a/python/paddle/fluid/tests/unittests/test_pool2d_op.py b/python/paddle/fluid/tests/unittests/test_pool2d_op.py
index 964d78f1966aa10e36eeaabe943d44e002d50293..764fa575fba1615de3171e848890b3836e640849 100644
--- a/python/paddle/fluid/tests/unittests/test_pool2d_op.py
+++ b/python/paddle/fluid/tests/unittests/test_pool2d_op.py
@@ -78,20 +78,22 @@ def avg_pool2D_forward_naive(x,
 
 class TestPool2d_Op(OpTest):
     def setUp(self):
+        self.op_type = "pool2d"
         self.use_cudnn = False
         self.use_mkldnn = False
+        self.dtype = np.float32
         self.init_test_case()
         self.init_global_pool()
-        self.init_op_type()
+        self.init_kernel_type()
         self.init_pool_type()
         self.init_ceil_mode()
         if self.global_pool:
             self.paddings = [0 for _ in range(len(self.paddings))]
-        input = np.random.random(self.shape).astype("float32")
+        input = np.random.random(self.shape).astype(self.dtype)
         output = self.pool2D_forward_naive(input, self.ksize, self.strides,
                                            self.paddings, self.global_pool,
-                                           self.ceil_mode).astype("float32")
-        self.inputs = {'X': input}
+                                           self.ceil_mode).astype(self.dtype)
+        self.inputs = {'X': OpTest.np_dtype_to_fluid_dtype(input)}
 
         self.attrs = {
             'strides': self.strides,
@@ -105,7 +107,7 @@ class TestPool2d_Op(OpTest):
             'data_format': 'AnyLayout'  # TODO(dzhwinter) : should be fix latter
         }
 
-        self.outputs = {'Out': output.astype('float32')}
+        self.outputs = {'Out': output}
 
     def test_check_output(self):
         if self.use_cudnn:
@@ -115,6 +117,8 @@ class TestPool2d_Op(OpTest):
             self.check_output()
 
     def test_check_grad(self):
+        if self.dtype == np.float16:
+            return
         if self.use_cudnn and self.pool_type != "max":
             place = core.CUDAPlace(0)
             self.check_grad_with_place(
@@ -128,8 +132,8 @@ class TestPool2d_Op(OpTest):
         self.strides = [1, 1]
         self.paddings = [0, 0]
 
-    def init_op_type(self):
-        self.op_type = "pool2d"
+    def init_kernel_type(self):
+        pass
 
     def init_pool_type(self):
         self.pool_type = "avg"
@@ -149,9 +153,6 @@ class TestCase1(TestPool2d_Op):
         self.strides = [1, 1]
         self.paddings = [0, 0]
 
-    def init_op_type(self):
-        self.op_type = "pool2d"
-
     def init_pool_type(self):
         self.pool_type = "avg"
         self.pool2D_forward_naive = avg_pool2D_forward_naive
@@ -167,9 +168,6 @@ class TestCase2(TestPool2d_Op):
         self.strides = [1, 1]
         self.paddings = [1, 1]
 
-    def init_op_type(self):
-        self.op_type = "pool2d"
-
     def init_pool_type(self):
         self.pool_type = "avg"
         self.pool2D_forward_naive = avg_pool2D_forward_naive
@@ -179,27 +177,18 @@ class TestCase2(TestPool2d_Op):
 
 
 class TestCase3(TestPool2d_Op):
-    def init_op_type(self):
-        self.op_type = "pool2d"
-
     def init_pool_type(self):
         self.pool_type = "max"
         self.pool2D_forward_naive = max_pool2D_forward_naive
 
 
 class TestCase4(TestCase1):
-    def init_op_type(self):
-        self.op_type = "pool2d"
-
     def init_pool_type(self):
         self.pool_type = "max"
         self.pool2D_forward_naive = max_pool2D_forward_naive
 
 
 class TestCase5(TestCase2):
-    def init_op_type(self):
-        self.op_type = "pool2d"
-
     def init_pool_type(self):
         self.pool_type = "max"
         self.pool2D_forward_naive = max_pool2D_forward_naive
@@ -207,39 +196,105 @@ class TestCase5(TestCase2):
 
 #--------------------test pool2d--------------------
 class TestCUDNNCase1(TestPool2d_Op):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_cudnn = True
-        self.op_type = "pool2d"
+
+
+class TestFP16CUDNNCase1(TestPool2d_Op):
+    def init_kernel_type(self):
+        self.use_cudnn = True
+        self.dtype = np.float16
+
+    def test_check_output(self):
+        if core.is_compiled_with_cuda():
+            place = core.CUDAPlace(0)
+            if core.is_float16_supported(place):
+                self.check_output_with_place(place, atol=1e-3)
 
 
 class TestCUDNNCase2(TestCase1):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_cudnn = True
-        self.op_type = "pool2d"
+
+
+class TestFP16CUDNNCase2(TestCase1):
+    def init_kernel_type(self):
+        self.use_cudnn = True
+        self.dtype = np.float16
+
+    def test_check_output(self):
+        if core.is_compiled_with_cuda():
+            place = core.CUDAPlace(0)
+            if core.is_float16_supported(place):
+                self.check_output_with_place(place, atol=1e-3)
 
 
 class TestCUDNNCase3(TestCase2):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_cudnn = True
-        self.op_type = "pool2d"
+
+
+class TestFP16CUDNNCase3(TestCase2):
+    def init_kernel_type(self):
+        self.use_cudnn = True
+        self.dtype = np.float16
+
+    def test_check_output(self):
+        if core.is_compiled_with_cuda():
+            place = core.CUDAPlace(0)
+            if core.is_float16_supported(place):
+                self.check_output_with_place(place, atol=1e-3)
 
 
 class TestCUDNNCase4(TestCase3):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_cudnn = True
-        self.op_type = "pool2d"
+
+
+class TestFP16CUDNNCase4(TestCase3):
+    def init_kernel_type(self):
+        self.use_cudnn = True
+        self.dtype = np.float16
+
+    def test_check_output(self):
+        if core.is_compiled_with_cuda():
+            place = core.CUDAPlace(0)
+            if core.is_float16_supported(place):
+                self.check_output_with_place(place, atol=1e-3)
 
 
 class TestCUDNNCase5(TestCase4):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_cudnn = True
-        self.op_type = "pool2d"
+
+
+class TestFP16CUDNNCase5(TestCase4):
+    def init_kernel_type(self):
+        self.use_cudnn = True
+        self.dtype = np.float16
+
+    def test_check_output(self):
+        if core.is_compiled_with_cuda():
+            place = core.CUDAPlace(0)
+            if core.is_float16_supported(place):
+                self.check_output_with_place(place, atol=1e-3)
 
 
 class TestCUDNNCase6(TestCase5):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_cudnn = True
-        self.op_type = "pool2d"
+
+
+class TestFP16CUDNNCase6(TestCase5):
+    def init_kernel_type(self):
+        self.use_cudnn = True
+        self.dtype = np.float16
+
+    def test_check_output(self):
+        if core.is_compiled_with_cuda():
+            place = core.CUDAPlace(0)
+            if core.is_float16_supported(place):
+                self.check_output_with_place(place, atol=1e-3)
 
 
 class TestCeilModeCase1(TestCUDNNCase1):
@@ -264,39 +319,33 @@ class TestCeilModeCase4(TestCase2):
 
 #--------------------test pool2d MKLDNN--------------------
 class TestMKLDNNCase1(TestPool2d_Op):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_mkldnn = True
-        self.op_type = "pool2d"
 
 
 class TestMKLDNNCase2(TestCase1):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_mkldnn = True
-        self.op_type = "pool2d"
 
 
 class TestMKLDNNCase3(TestCase2):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_mkldnn = True
-        self.op_type = "pool2d"
 
 
 class TestMKLDNNCase4(TestCase3):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_mkldnn = True
-        self.op_type = "pool2d"
 
 
 class TestMKLDNNCase5(TestCase4):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_mkldnn = True
-        self.op_type = "pool2d"
 
 
 class TestMKLDNNCase6(TestCase5):
-    def init_op_type(self):
+    def init_kernel_type(self):
         self.use_mkldnn = True
-        self.op_type = "pool2d"
 
 
 if __name__ == '__main__':
diff --git a/python/paddle/fluid/tests/unittests/test_profiler.py b/python/paddle/fluid/tests/unittests/test_profiler.py
index 1da6b94eea30e65913ce713e0e5e355507534161..cf6fe14a86aa1ab6ea3f60ad15f33d708e9b803a 100644
--- a/python/paddle/fluid/tests/unittests/test_profiler.py
+++ b/python/paddle/fluid/tests/unittests/test_profiler.py
@@ -31,8 +31,22 @@ class TestProfiler(unittest.TestCase):
 
         with fluid.program_guard(main_program, startup_program):
             image = fluid.layers.data(name='x', shape=[784], dtype='float32')
-            hidden1 = fluid.layers.fc(input=image, size=128, act='relu')
-            hidden2 = fluid.layers.fc(input=hidden1, size=64, act='relu')
+            hidden1 = fluid.layers.fc(input=image, size=64, act='relu')
+            i = layers.zeros(shape=[1], dtype='int64')
+            counter = fluid.layers.zeros(
+                shape=[1], dtype='int64', force_cpu=True)
+            until = layers.fill_constant([1], dtype='int64', value=10)
+            data_arr = layers.array_write(hidden1, i)
+            cond = fluid.layers.less_than(x=counter, y=until)
+            while_op = fluid.layers.While(cond=cond)
+            with while_op.block():
+                hidden_n = fluid.layers.fc(input=hidden1, size=64, act='relu')
+                layers.array_write(hidden_n, i, data_arr)
+                fluid.layers.increment(x=counter, value=1, in_place=True)
+                layers.less_than(x=counter, y=until, cond=cond)
+
+            hidden_n = layers.array_read(data_arr, i)
+            hidden2 = fluid.layers.fc(input=hidden_n, size=64, act='relu')
             predict = fluid.layers.fc(input=hidden2, size=10, act='softmax')
             label = fluid.layers.data(name='y', shape=[1], dtype='int64')
             cost = fluid.layers.cross_entropy(input=predict, label=label)
diff --git a/python/paddle/fluid/tests/unittests/test_sequence_expand.py b/python/paddle/fluid/tests/unittests/test_sequence_expand.py
index 957fa5d2c4a795cfd01047c1b7845674e4c1d549..7feb509c4d6f5768552fc2515081f7e68f420967 100644
--- a/python/paddle/fluid/tests/unittests/test_sequence_expand.py
+++ b/python/paddle/fluid/tests/unittests/test_sequence_expand.py
@@ -27,12 +27,36 @@ class TestSequenceExpand(OpTest):
     def compute(self):
         x = self.inputs['X']
         x_data, x_lod = x if type(x) == tuple else (x, None)
-        n = 1 + x_data.shape[0] if not x_lod else len(x_lod[0])
         y_data, y_lod = self.inputs['Y']
-        repeats = [((y_lod[-1][i + 1] - y_lod[-1][i]))
-                   for i in range(len(y_lod[-1]) - 1)]
-        out = x_data.repeat(repeats, axis=0)
-        self.outputs = {'Out': out}
+
+        if hasattr(self, 'attrs'):
+            ref_level = self.attrs['ref_level']
+        else:
+            ref_level = len(y_lod) - 1
+
+        out = np.zeros(shape=((0, ) + x_data.shape[1:]), dtype=x_data.dtype)
+
+        if x_lod is None:
+            x_idx = [i for i in xrange(x_data.shape[0] + 1)]
+        else:
+            x_idx = x_lod[0]
+            out_lod = [[0]]
+
+        for i in xrange(1, len(y_lod[ref_level])):
+            repeat_num = y_lod[ref_level][i] - y_lod[ref_level][i - 1]
+            x_len = x_idx[i] - x_idx[i - 1]
+            if repeat_num > 0:
+                x_sub = x_data[x_idx[i - 1]:x_idx[i], :]
+                x_sub = np.repeat(x_sub, repeat_num, axis=0)
+                out = np.vstack((out, x_sub))
+                if x_lod is not None:
+                    for j in xrange(repeat_num):
+                        out_lod[0].append(out_lod[0][-1] + x_len)
+
+        if x_lod is None:
+            self.outputs = {'Out': out}
+        else:
+            self.outputs = {'Out': (out, out_lod)}
 
     def setUp(self):
         self.op_type = 'sequence_expand'
@@ -52,7 +76,8 @@ class TestSequenceExpandCase1(TestSequenceExpand):
         x_lod = [[0, 2, 5]]
         y_data = np.random.uniform(0.1, 1, [13, 1]).astype('float32')
         y_lod = [[0, 2, 5], [0, 2, 4, 7, 10, 13]]
-        self.inputs = {'X': (x_data, x_lod), 'Y': (y_data, y_lod)}
+        self.inputs = {'X': x_data, 'Y': (y_data, y_lod)}
+        self.attrs = {'ref_level': 0}
 
 
 class TestSequenceExpandCase2(TestSequenceExpand):
@@ -60,8 +85,9 @@ class TestSequenceExpandCase2(TestSequenceExpand):
         x_data = np.random.uniform(0.1, 1, [1, 2, 2]).astype('float32')
         x_lod = [[0, 1]]
         y_data = np.random.uniform(0.1, 1, [2, 2, 2]).astype('float32')
-        y_lod = [[0, 2]]
+        y_lod = [[0, 2], [0, 2]]
         self.inputs = {'X': (x_data, x_lod), 'Y': (y_data, y_lod)}
+        self.attrs = {'ref_level': 0}
 
 
 class TestSequenceExpandCase3(TestSequenceExpand):
@@ -75,14 +101,9 @@ class TestSequenceExpandCase3(TestSequenceExpand):
 
 class TestSequenceExpandCase4(TestSequenceExpand):
     def set_data(self):
-        x_data = np.array(
-            [0.1, 0.3, 0.2, 0.15, 0.25, 0.2, 0.15, 0.25, 0.1, 0.3]).reshape(
-                [2, 5]).astype('float32')
-        x_lod = [[
-            0,
-            1,
-            2,
-        ]]
+        data = [0.1, 0.3, 0.2, 0.15, 0.25, 0.2, 0.15, 0.25, 0.1, 0.3]
+        x_data = np.array(data).reshape([5, 2]).astype('float32')
+        x_lod = [[0, 2, 5]]
         y_data = np.random.uniform(0.1, 1, [2, 1]).astype('float32')
         y_lod = [[0, 1, 2], [0, 1, 2]]
         self.inputs = {'X': (x_data, x_lod), 'Y': (y_data, y_lod)}
diff --git a/python/paddle/fluid/tests/unittests/test_softmax_with_cross_entropy_op.py b/python/paddle/fluid/tests/unittests/test_softmax_with_cross_entropy_op.py
index 889fea2ce66e64d439b51498722e571f48cd1f96..c0d9fc8f22a7c4f791d80a9cad87d003b5d54299 100644
--- a/python/paddle/fluid/tests/unittests/test_softmax_with_cross_entropy_op.py
+++ b/python/paddle/fluid/tests/unittests/test_softmax_with_cross_entropy_op.py
@@ -26,7 +26,7 @@ class TestSoftmaxWithCrossEntropyOp(OpTest):
 
     def setUp(self):
         self.op_type = "softmax_with_cross_entropy"
-        batch_size = 2
+        batch_size = 41
         class_num = 37
 
         logits = np.random.uniform(0.1, 1.0,
@@ -59,7 +59,7 @@ class TestSoftmaxWithCrossEntropyOp2(OpTest):
 
     def setUp(self):
         self.op_type = "softmax_with_cross_entropy"
-        batch_size = 2
+        batch_size = 41
         class_num = 37
 
         logits = np.random.uniform(0.1, 1.0,
diff --git a/tools/timeline.py b/tools/timeline.py
index ee83a1baecdd4243bb6c546486a837393980fb65..f4083c824e7333a74661d096d4954609f767c83e 100644
--- a/tools/timeline.py
+++ b/tools/timeline.py
@@ -121,27 +121,34 @@ class Timeline(object):
 
     def _allocate_pids(self):
         for event in self._profile_pb.events:
-            if event.device_id not in self._devices:
-                pid = self._allocate_pid()
-                self._devices[event.device_id] = pid
-                if event.device_id >= 0:
-                    self._chrome_trace.emit_pid("gpu:%s:stream:%d" %
-                                                (pid, event.stream_id), pid)
-                elif event.device_id == -1:
-                    self._chrome_trace.emit_pid("cpu:thread_hash:%d" %
-                                                event.stream_id, pid)
+            if event.type == profiler_pb2.Event.CPU:
+                if (event.device_id, "CPU") not in self._devices:
+                    pid = self._allocate_pid()
+                    self._devices[(event.device_id, "CPU")] = pid
+                    self._chrome_trace.emit_pid("cpu:block:%d" %
+                                                (event.device_id), pid)
+            elif event.type == profiler_pb2.Event.GPUKernel:
+                if (event.device_id, "GPUKernel") not in self._devices:
+                    pid = self._allocate_pid()
+                    self._devices[(event.device_id, "GPUKernel")] = pid
+                    self._chrome_trace.emit_pid("gpu:%d" % (event.device_id),
+                                                pid)
 
     def _allocate_events(self):
         for event in self._profile_pb.events:
-            pid = self._devices[event.device_id]
+            if event.type == profiler_pb2.Event.CPU:
+                type = "CPU"
+            elif event.type == profiler_pb2.Event.GPUKernel:
+                type = "GPUKernel"
+            pid = self._devices[(event.device_id, type)]
             args = {'name': event.name}
             if event.memcopy.bytes > 0:
                 args = {'mem_bytes': event.memcopy.bytes}
             # TODO(panyx0718): Chrome tracing only handles ms. However, some
             # ops takes micro-seconds. Hence, we keep the ns here.
-            self._chrome_trace.emit_region(event.start_ns,
-                                           (event.end_ns - event.start_ns) /
-                                           1.0, pid, 0, 'Op', event.name, args)
+            self._chrome_trace.emit_region(
+                event.start_ns, (event.end_ns - event.start_ns) / 1.0, pid,
+                event.sub_device_id, 'Op', event.name, args)
 
     def generate_chrome_trace(self):
         self._allocate_pids()