diff --git a/benchmark/IntelOptimizedPaddle.md b/benchmark/IntelOptimizedPaddle.md
deleted file mode 100644
index 8b7dc5b7db800896eb4de2054ab5e584aed93999..0000000000000000000000000000000000000000
--- a/benchmark/IntelOptimizedPaddle.md
+++ /dev/null
@@ -1,112 +0,0 @@
-# Benchmark
-
-Machine:
-
-- Server: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, 2 Sockets, 20 Cores per socket
-- Laptop: TBD
-
-System: CentOS release 6.3 (Final), Docker 1.12.1.
-
-PaddlePaddle:
-- paddlepaddle/paddle:0.11.0 (for MKLML and MKL-DNN)
- - MKL-DNN tag v0.11
- - MKLML 2018.0.1.20171007
-- paddlepaddle/paddle:0.11.0-openblas (for OpenBLAS)
- - OpenBLAS v0.2.20
-
-On each machine, we test and compare the performance of single-node training with MKL-DNN, MKLML, and OpenBLAS.
-
-## Benchmark Model
-
-### Server
-
-#### Training
-Tested with batch sizes 64, 128, and 256 on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz.
-Note that the speeds below include forward, backward, and parameter update time, so they cannot be compared directly with the Caffe `time` [command](https://github.com/PaddlePaddle/Paddle/blob/develop/benchmark/caffe/image/run.sh#L9) benchmark, which only covers forward and backward. The parameter update time becomes significant when the weights are large, especially for AlexNet.
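-
-As a rough, purely illustrative sketch (the `images_per_second` helper below is hypothetical and not part of the benchmark scripts), throughput in images/second relates to the measured average batch time as follows:
-
-```python
-# Illustrative only: convert an average batch time (forward + backward +
-# parameter update) into images/second.
-def images_per_second(batch_size, avg_batch_time_ms):
-    return batch_size * 1000.0 / avg_batch_time_ms
-
-# A 64-image batch taking roughly 2250 ms gives ~28.4 images/second,
-# close to the VGG-19 MKL-DNN figure below.
-print(images_per_second(64, 2250.0))
-```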
-
-Input image size - 3 * 224 * 224, Time: images/second
-
-- VGG-19
-
-| BatchSize | 64 | 128 | 256 |
-|--------------|-------| -----| --------|
-| OpenBLAS | 7.80 | 9.00 | 10.80 |
-| MKLML | 12.12 | 13.70 | 16.18 |
-| MKL-DNN | 28.46 | 29.83 | 30.44 |
-
-
-
-- ResNet-50
-
-| BatchSize | 64 | 128 | 256 |
-|--------------|-------| ------| -------|
-| OpenBLAS | 25.22 | 25.68 | 27.12 |
-| MKLML | 32.52 | 31.89 | 33.12 |
-| MKL-DNN | 81.69 | 82.35 | 84.08 |
-
-
-
-- GoogLeNet
-
-| BatchSize | 64 | 128 | 256 |
-|--------------|-------| ------| -------|
-| OpenBLAS | 89.52 | 96.97 | 108.25 |
-| MKLML | 128.46| 137.89| 158.63 |
-| MKL-DNN | 250.46| 264.83| 269.50 |
-
-
-
-- AlexNet
-
-| BatchSize | 64 | 128 | 256 |
-|--------------|--------| ------ | -------|
-| OpenBLAS | 45.62 | 72.79 | 107.22 |
-| MKLML | 66.37 | 105.60 | 144.04 |
-| MKL-DNN | 399.00 | 498.94 | 626.53 |
-
-
-
-#### Inference
-Tested with batch sizes 1, 2, 4, 8, and 16 on Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz.
-- VGG-19
-
-| BatchSize | 1 | 2 | 4 | 8 | 16 |
-|-----------|-------|-------|-------|-------|-------|
-| OpenBLAS | 1.10 | 1.96 | 3.62 | 3.63 | 2.25 |
-| MKLML | 5.58 | 9.80 | 15.15 | 21.21 | 28.67 |
-| MKL-DNN | 75.07 | 88.64 | 82.58 | 92.29 | 96.75 |
-
-
-
-- ResNet-50
-
-| BatchSize | 1 | 2 | 4 | 8 | 16 |
-|-----------|-------|--------|--------|--------|--------|
-| OpenBLAS | 3.31 | 6.72 | 11.59 | 13.17 | 9.27 |
-| MKLML | 6.33 | 12.02 | 22.88 | 40.53 | 63.09 |
-| MKL-DNN | 107.83| 148.84 | 177.78 | 189.35 | 217.69 |
-
-
-
-- GoogLeNet
-
-| BatchSize | 1 | 2 | 4 | 8 | 16 |
-|-----------|--------|--------|--------|--------|--------|
-| OpenBLAS | 12.06 | 23.56 | 34.48 | 36.45 | 23.12 |
-| MKLML | 22.74 | 41.56 | 81.22 | 133.47 | 210.53 |
-| MKL-DNN | 175.10 | 272.92 | 450.70 | 512.00 | 600.94 |
-
-
-
-- AlexNet
-
-| BatchSize | 1 | 2 | 4 | 8 | 16 |
-|-----------|--------|--------|--------|--------|--------|
-| OpenBLAS | 3.53 | 6.23 | 15.04 | 26.06 | 31.62 |
-| MKLML | 21.32 | 36.55 | 73.06 | 131.15 | 192.77 |
-| MKL-DNN | 442.91 | 656.41 | 719.10 | 847.68 | 850.51 |
-
-
-
-### Laptop
-TBD
diff --git a/benchmark/README.md b/benchmark/README.md
deleted file mode 100644
index 367013f0457f9bbb9ae1335ea63dce181316d444..0000000000000000000000000000000000000000
--- a/benchmark/README.md
+++ /dev/null
@@ -1,168 +0,0 @@
-# Benchmark
-
-Machine:
-
-- CPU: 12-core Intel(R) Xeon(R) CPU E5-2620 v2 @2.10GHz
-- GPU: Tesla K40m
-- cuDNN: v5.1
-- system: Docker 1.12.1; all platforms are tested in a Docker environment.
-
-Platforms:
-
-- PaddlePaddle: paddledev/paddle:gpu-devel-v0.9.0a0
-- Tensorflow: gcr.io/tensorflow/tensorflow:0.11.0rc0-gpu
-- Caffe: kaixhin/cuda-caffe
-
-Several convolutional and recurrent neural networks are used for testing.
-
-## Image
-
-### Benchmark Model
-
-AlexNet, GoogleNet and a small network used in Caffe.
-
-- [AlexNet](https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet): the same model, except that the group size is one.
-
-- [GoogleNet](https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet): the same model, except that loss1 and loss2 are removed when benchmarking.
-
-- [SmallNet](https://github.com/BVLC/caffe/blob/master/examples/cifar10/cifar10\_quick\_train\_test.prototxt)
-
-
-### Single-GPU
-
-- AlexNet: input - 3 * 227 * 227, Time: ms/batch
-
-| BatchSize | 64 | 128 | 256 | 512 |
-|--------------|-----| -----| ------| -----|
-| PaddlePaddle | 195 | 334 | 602 | 1629 |
-| TensorFlow | 223 | 364 | 645 | 1235 |
-| Caffe | 324 | 627 | 1232 | 2513 |
-
-**Note**
-
-All platforms use cuDNN v5.1. Caffe is slower in this experiment because the workspace limit of its cuDNN convolution interface is 8 * 1024 * 1024, which is smaller than the limits used by PaddlePaddle and TensorFlow. Caffe would be faster if this workspace limit were increased.
-
-- GoogleNet: input - 3 * 224 * 224, Time: ms/batch
-
-
-| BatchSize | 64 | 128 | 256 |
-|--------------|-------| -------| --------|
-| PaddlePaddle | 613 | 1149 | 2348 |
-| TensorFlow | 644 | 1176 | 2219 |
-| Caffe | 694 | 1364 | out of memory |
-
-- SmallNet: input - 3 * 32 * 32, Time: ms/batch
-
-| BatchSize | 64 | 128 | 256 | 512 |
-|--------------|--------| -------- | --------|---------|
-| PaddlePaddle | 10.463 | 18.184 | 33.113 | 63.039 |
-| TensorFlow | 9 | 15 | 28 | 59 |
-| Caffe | 9.373 | 16.6606 | 31.4797 | 59.719 |
-
-**Note**
-
-All single-GPU Caffe experiments use `caffe time` to measure elapsed time, which does not include parameter update time, whereas both the PaddlePaddle and TensorFlow measurements do include it. Compared with the total time, this part is relatively small on a single machine, so we ignore it.
-
-TensorFlow implements its own convolution-algorithm search instead of using the algorithm-search interface in cuDNN.
-
-### Multi-GPU: 4 GPUs
-
-- AlexNet, ms / batch
-
-| total-BatchSize | 128 * 4 | 256 * 4 |
-|------------------|----------| -----------|
-| PaddlePaddle | 347 | 622 |
-| TensorFlow | 377 | 675 |
-| Caffe | 1229 | 2435 |
-
-For example, if `total-BatchSize = 128 * 4`, the speedup ratio is calculated by
-
-```
- time_at_1gpu_batch_128 * 4 / time_at_4gpu_total_batch_512
-= (334 * 4)/347
-= 3.85
-```
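-
-The same calculation as a small, purely illustrative Python helper (the `speedup_ratio` function is hypothetical and not part of the benchmark scripts):
-
-```python
-def speedup_ratio(time_1gpu_ms, time_ngpu_ms, num_gpus=4):
-    # Compare num_gpus single-GPU batches against one multi-GPU batch of
-    # the same total size; float() keeps the division exact on Python 2.
-    return (time_1gpu_ms * num_gpus) / float(time_ngpu_ms)
-
-print(speedup_ratio(334, 347))  # ~3.85 for AlexNet with total batch 128 * 4
-```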
-
-
-
-
-- GoogleNet, ms / batch
-
-| total-BatchSize | 128 * 4 | 256 * 4 |
-|-------------------|--------------| ----------- |
-| PaddlePaddle | 1178 | 2367 |
-| TensorFlow | 1210 | 2292 |
-| Caffe | 2007 | out of memory |
-
-
-
-
-## RNN
-We use an LSTM network for text classification as the benchmark.
-
-### Dataset
-- [IMDB](http://www.iro.umontreal.ca/~lisa/deep/data/imdb.pkl)
-- Sequence length is 100. PaddlePaddle supports training with variable-length sequences, but TensorFlow requires padding, so we also pad sequences to length 100 in PaddlePaddle for a fair comparison (a padding sketch follows this list).
-- Dictionary size=30000
-- Peephole connections are used by default in PaddlePaddle's `lstmemory` and are also configured in TensorFlow.
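-
-A minimal sketch of the padding step (illustrative only; the benchmark's data provider defines its own `pad_sequences` helper):
-
-```python
-import numpy as np
-
-def pad_to_fixed_length(sequences, maxlen=100, value=0):
-    # Right-pad (or truncate) each word-id sequence to a fixed length,
-    # matching what TensorFlow requires.
-    out = np.full((len(sequences), maxlen), value, dtype='int32')
-    for i, seq in enumerate(sequences):
-        trunc = seq[:maxlen]
-        out[i, :len(trunc)] = trunc
-    return out
-
-print(pad_to_fixed_length([[3, 7, 2], [5, 1]], maxlen=5))
-```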
-
-### Single-GPU
-
-#### LSTM in Text Classification
-
-We test a `2 lstm layer + fc` network with different hidden sizes and batch sizes.
-
-- Batch size = 64, ms / batch
-
-| hidden_size | 256 | 512 | 1280 |
-|--------------|-------| -------| --------|
-| PaddlePaddle | 83 | 184 | 641 |
-| TensorFlow | 175 | 280 | 818 |
-
-- Batch size = 128, ms / batch
-
-| hidden_size | 256 | 512 | 1280 |
-|--------------|------- | -------| --------|
-| PaddlePaddle | 110 | 261 | 1007 |
-| TensorFlow | 181 | 361 | 1237 |
-
-
-- Batch size = 256, ms / batch
-
-| hidden_size | 256 | 512 | 1280 |
-|--------------|-------| -------| --------|
-| PaddlePaddle | 170 | 414 | 1655 |
-| TensorFlow | 238 | 536 | 1905 |
-
-
-
-#### Seq2Seq
-
-The benchmark of the sequence-to-sequence network will be added later.
-
-
-### Multi-GPU: 4 GPUs
-
-#### LSTM in Text Classification
-
-- hidden_size = 256, ms / batch
-
-| batch_size | 256 | 512 |
-|--------------| -------| --------|
-| PaddlePaddle | 90 | 118 |
-| TensorFlow | 226 | 118 |
-
-
-- hidden_size = 512, ms / batch
-
-| batch_size | 256 | 512 |
-|--------------| -------| --------|
-| PaddlePaddle | 189 | 268 |
-| TensorFlow | 297 | 383 |
-
-
-
-
-#### Seq2Seq
-
-The benchmark of the sequence-to-sequence network will be added later.
diff --git a/benchmark/fluid/Dockerfile b/benchmark/fluid/Dockerfile
index 2e1e0d376899fd664866621263db62258e7c3869..81ea870050fe5db4a60fee40221991e38de6bd2e 100644
--- a/benchmark/fluid/Dockerfile
+++ b/benchmark/fluid/Dockerfile
@@ -15,9 +15,6 @@ RUN ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.7 /usr/lib/libcudnn.so && ln -s
RUN pip install -U pip
RUN pip install -U kubernetes paddlepaddle
-RUN sh -c 'echo "import paddle.v2 as paddle\npaddle.dataset.cifar.train10()\npaddle.dataset.flowers.fetch()" | python'
-RUN sh -c 'echo "import paddle.v2 as paddle\npaddle.dataset.mnist.train()\npaddle.dataset.mnist.test()\npaddle.dataset.imdb.fetch()" | python'
-RUN sh -c 'echo "import paddle.v2 as paddle\npaddle.dataset.imikolov.fetch()" | python'
RUN pip uninstall -y paddlepaddle && mkdir /workspace
ADD https://raw.githubusercontent.com/PaddlePaddle/cloud/develop/docker/paddle_k8s /usr/bin
diff --git a/benchmark/paddle/image/check_env.sh b/benchmark/fluid/check_env.sh
similarity index 100%
rename from benchmark/paddle/image/check_env.sh
rename to benchmark/fluid/check_env.sh
diff --git a/benchmark/paddle/image/alexnet.py b/benchmark/paddle/image/alexnet.py
deleted file mode 100644
index 9efc3f0494e4a817a7357f29e684f621bce1921e..0000000000000000000000000000000000000000
--- a/benchmark/paddle/image/alexnet.py
+++ /dev/null
@@ -1,93 +0,0 @@
-# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from paddle.trainer_config_helpers import *
-
-height = 227
-width = 227
-num_class = 1000
-batch_size = get_config_arg('batch_size', int, 128)
-gp = get_config_arg('layer_num', int, 1)
-is_infer = get_config_arg("is_infer", bool, False)
-num_samples = get_config_arg('num_samples', int, 2560)
-
-args = {
- 'height': height,
- 'width': width,
- 'color': True,
- 'num_class': num_class,
- 'is_infer': is_infer,
- 'num_samples': num_samples
-}
-define_py_data_sources2(
- "train.list" if not is_infer else None,
- "test.list" if is_infer else None,
- module="provider",
- obj="process",
- args=args)
-
-settings(
- batch_size=batch_size,
- learning_rate=0.01 / batch_size,
- learning_method=MomentumOptimizer(0.9),
- regularization=L2Regularization(0.0005 * batch_size))
-
-# conv1
-net = data_layer('data', size=height * width * 3)
-net = img_conv_layer(
- input=net,
- filter_size=11,
- num_channels=3,
- num_filters=96,
- stride=4,
- padding=1)
-net = img_cmrnorm_layer(input=net, size=5, scale=0.0001, power=0.75)
-net = img_pool_layer(input=net, pool_size=3, stride=2)
-
-# conv2
-net = img_conv_layer(
- input=net, filter_size=5, num_filters=256, stride=1, padding=2, groups=gp)
-net = img_cmrnorm_layer(input=net, size=5, scale=0.0001, power=0.75)
-net = img_pool_layer(input=net, pool_size=3, stride=2)
-
-# conv3
-net = img_conv_layer(
- input=net, filter_size=3, num_filters=384, stride=1, padding=1)
-# conv4
-net = img_conv_layer(
- input=net, filter_size=3, num_filters=384, stride=1, padding=1, groups=gp)
-
-# conv5
-net = img_conv_layer(
- input=net, filter_size=3, num_filters=256, stride=1, padding=1, groups=gp)
-net = img_pool_layer(input=net, pool_size=3, stride=2)
-
-net = fc_layer(
- input=net,
- size=4096,
- act=ReluActivation(),
- layer_attr=ExtraAttr(drop_rate=0.5))
-net = fc_layer(
- input=net,
- size=4096,
- act=ReluActivation(),
- layer_attr=ExtraAttr(drop_rate=0.5))
-net = fc_layer(input=net, size=1000, act=SoftmaxActivation())
-
-if is_infer:
- outputs(net)
-else:
- lab = data_layer('label', num_class)
- loss = cross_entropy(input=net, label=lab)
- outputs(loss)
diff --git a/benchmark/paddle/image/googlenet.py b/benchmark/paddle/image/googlenet.py
deleted file mode 100644
index 2a850ccb7f2c75b467554181fc5f4aa8f2b97a09..0000000000000000000000000000000000000000
--- a/benchmark/paddle/image/googlenet.py
+++ /dev/null
@@ -1,245 +0,0 @@
-#!/usr/bin/env python
-from paddle.trainer_config_helpers import *
-
-height = 224
-width = 224
-num_class = 1000
-batch_size = get_config_arg('batch_size', int, 128)
-use_gpu = get_config_arg('use_gpu', bool, True)
-is_infer = get_config_arg("is_infer", bool, False)
-num_samples = get_config_arg('num_samples', int, 2560)
-
-args = {
- 'height': height,
- 'width': width,
- 'color': True,
- 'num_class': num_class,
- 'is_infer': is_infer,
- 'num_samples': num_samples
-}
-define_py_data_sources2(
- "train.list" if not is_infer else None,
- "test.list" if is_infer else None,
- module="provider",
- obj="process",
- args=args)
-
-settings(
- batch_size=batch_size,
- learning_rate=0.01 / batch_size,
- learning_method=MomentumOptimizer(0.9),
- regularization=L2Regularization(0.0005 * batch_size))
-
-conv_projection = conv_projection if use_gpu else img_conv_layer
-
-def inception2(name, input, channels, \
- filter1,
- filter3R, filter3,
- filter5R, filter5,
- proj):
-
- conv1 = name + '_1'
- conv3r = name + '_3r'
- conv3 = name + '_3'
- conv5r = name + '_5r'
- conv5 = name + '_5'
- maxpool = name + '_max'
- convproj = name + '_proj'
-
- cov1 = img_conv_layer(
- name=conv1,
- input=input,
- filter_size=1,
- num_channels=channels,
- num_filters=filter1,
- stride=1,
- padding=0)
-
- cov3r = img_conv_layer(
- name=conv3r,
- input=input,
- filter_size=1,
- num_channels=channels,
- num_filters=filter3R,
- stride=1,
- padding=0)
- cov3 = img_conv_layer(
- name=conv3,
- input=cov3r,
- filter_size=3,
- num_filters=filter3,
- stride=1,
- padding=1)
-
- cov5r = img_conv_layer(
- name=conv5r,
- input=input,
- filter_size=1,
- num_channels=channels,
- num_filters=filter5R,
- stride=1,
- padding=0)
- cov5 = img_conv_layer(
- name=conv5,
- input=cov5r,
- filter_size=5,
- num_filters=filter5,
- stride=1,
- padding=2)
-
- pool1 = img_pool_layer(
- name=maxpool,
- input=input,
- pool_size=3,
- num_channels=channels,
- stride=1,
- padding=1)
- covprj = img_conv_layer(
- name=convproj,
- input=pool1,
- filter_size=1,
- num_filters=proj,
- stride=1,
- padding=0)
-
- cat = concat_layer(name=name, input=[cov1, cov3, cov5, covprj])
- return cat
-
-def inception(name, input, channels, \
- filter1,
- filter3R, filter3,
- filter5R, filter5,
- proj):
-
- cov1 = conv_projection(
- input=input,
- filter_size=1,
- num_channels=channels,
- num_filters=filter1,
- stride=1,
- padding=0)
-
- cov3r = img_conv_layer(
- name=name + '_3r',
- input=input,
- filter_size=1,
- num_channels=channels,
- num_filters=filter3R,
- stride=1,
- padding=0)
- cov3 = conv_projection(
- input=cov3r, filter_size=3, num_filters=filter3, stride=1, padding=1)
-
- cov5r = img_conv_layer(
- name=name + '_5r',
- input=input,
- filter_size=1,
- num_channels=channels,
- num_filters=filter5R,
- stride=1,
- padding=0)
- cov5 = conv_projection(
- input=cov5r, filter_size=5, num_filters=filter5, stride=1, padding=2)
-
- pool1 = img_pool_layer(
- name=name + '_max',
- input=input,
- pool_size=3,
- num_channels=channels,
- stride=1,
- padding=1)
- covprj = conv_projection(
- input=pool1, filter_size=1, num_filters=proj, stride=1, padding=0)
-
- cat = concat_layer(
- name=name,
- input=[cov1, cov3, cov5, covprj],
- bias_attr=True if use_gpu else False,
- act=ReluActivation())
- return cat
-
-
-data = data_layer(name="input", size=3 * height * width)
-
-# stage 1
-conv1 = img_conv_layer(
- name="conv1",
- input=data,
- filter_size=7,
- num_channels=3,
- num_filters=64,
- stride=2,
- padding=3)
-pool1 = img_pool_layer(
- name="pool1", input=conv1, pool_size=3, num_channels=64, stride=2)
-
-# stage 2
-conv2_1 = img_conv_layer(
- name="conv2_1",
- input=pool1,
- filter_size=1,
- num_filters=64,
- stride=1,
- padding=0)
-conv2_2 = img_conv_layer(
- name="conv2_2",
- input=conv2_1,
- filter_size=3,
- num_filters=192,
- stride=1,
- padding=1)
-pool2 = img_pool_layer(
- name="pool2", input=conv2_2, pool_size=3, num_channels=192, stride=2)
-
-# stage 3
-ince3a = inception("ince3a", pool2, 192, 64, 96, 128, 16, 32, 32)
-ince3b = inception("ince3b", ince3a, 256, 128, 128, 192, 32, 96, 64)
-pool3 = img_pool_layer(
- name="pool3", input=ince3b, num_channels=480, pool_size=3, stride=2)
-
-# stage 4
-ince4a = inception("ince4a", pool3, 480, 192, 96, 208, 16, 48, 64)
-ince4b = inception("ince4b", ince4a, 512, 160, 112, 224, 24, 64, 64)
-ince4c = inception("ince4c", ince4b, 512, 128, 128, 256, 24, 64, 64)
-ince4d = inception("ince4d", ince4c, 512, 112, 144, 288, 32, 64, 64)
-ince4e = inception("ince4e", ince4d, 528, 256, 160, 320, 32, 128, 128)
-pool4 = img_pool_layer(
- name="pool4", input=ince4e, num_channels=832, pool_size=3, stride=2)
-
-# stage 5
-ince5a = inception("ince5a", pool4, 832, 256, 160, 320, 32, 128, 128)
-ince5b = inception("ince5b", ince5a, 832, 384, 192, 384, 48, 128, 128)
-pool5 = img_pool_layer(
- name="pool5",
- input=ince5b,
- num_channels=1024,
- pool_size=7,
- stride=7,
- pool_type=AvgPooling())
-
-# We remove loss1 and loss2 for all systems when benchmarking.
-# output 1
-# pool_o1 = img_pool_layer(name="pool_o1", input=ince4a, num_channels=512, pool_size=5, stride=3, pool_type=AvgPooling())
-# conv_o1 = img_conv_layer(name="conv_o1", input=pool_o1, filter_size=1, num_filters=128, stride=1, padding=0)
-# fc_o1 = fc_layer(name="fc_o1", input=conv_o1, size=1024, layer_attr=ExtraAttr(drop_rate=0.7), act=ReluActivation())
-# out1 = fc_layer(name="output1", input=fc_o1, size=1000, act=SoftmaxActivation())
-# loss1 = cross_entropy(name='loss1', input=out1, label=lab, coeff=0.3)
-
-# output 2
-#pool_o2 = img_pool_layer(name="pool_o2", input=ince4d, num_channels=528, pool_size=5, stride=3, pool_type=AvgPooling())
-#conv_o2 = img_conv_layer(name="conv_o2", input=pool_o2, filter_size=1, num_filters=128, stride=1, padding=0)
-#fc_o2 = fc_layer(name="fc_o2", input=conv_o2, size=1024, layer_attr=ExtraAttr(drop_rate=0.7), act=ReluActivation())
-#out2 = fc_layer(name="output2", input=fc_o2, size=1000, act=SoftmaxActivation())
-#loss2 = cross_entropy(name='loss2', input=out2, label=lab, coeff=0.3)
-
-# output 3
-dropout = dropout_layer(name="dropout", input=pool5, dropout_rate=0.4)
-out3 = fc_layer(
- name="output3", input=dropout, size=1000, act=SoftmaxActivation())
-
-if is_infer:
- outputs(out3)
-else:
- lab = data_layer(name="label", size=num_class)
- loss3 = cross_entropy(name='loss3', input=out3, label=lab)
- outputs(loss3)
diff --git a/benchmark/paddle/image/plotlog.py b/benchmark/paddle/image/plotlog.py
deleted file mode 100644
index 8679d4f272d1b7aaf8d5a397f07698a6b70e4fcd..0000000000000000000000000000000000000000
--- a/benchmark/paddle/image/plotlog.py
+++ /dev/null
@@ -1,114 +0,0 @@
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import sys
-import argparse
-import matplotlib.pyplot as plt
-
-
-def parse_args():
- parser = argparse.ArgumentParser('Parse Log')
- parser.add_argument(
- '--file_path', '-f', type=str, help='the path of the log file')
- parser.add_argument(
- '--sample_rate',
- '-s',
- type=float,
- default=1.0,
- help='the rate to take samples from log')
- parser.add_argument(
- '--log_period', '-p', type=int, default=1, help='the period of log')
-
- args = parser.parse_args()
- return args
-
-
-def parse_file(file_name):
- loss = []
- error = []
- with open(file_name) as f:
- for i, line in enumerate(f):
- line = line.strip()
- if not line.startswith('pass'):
- continue
- line_split = line.split(' ')
- if len(line_split) != 5:
- continue
-
- loss_str = line_split[2][:-1]
- cur_loss = float(loss_str.split('=')[-1])
- loss.append(cur_loss)
-
- err_str = line_split[3][:-1]
- cur_err = float(err_str.split('=')[-1])
- error.append(cur_err)
-
- accuracy = [1.0 - err for err in error]
-
- return loss, accuracy
-
-
-def sample(metric, sample_rate):
- interval = int(1.0 / sample_rate)
- if interval > len(metric):
- return metric[:1]
-
- num = len(metric) / interval
- idx = [interval * i for i in range(num)]
- metric_sample = [metric[id] for id in idx]
- return metric_sample
-
-
-def plot_metric(metric,
- batch_id,
- graph_title,
- line_style='b-',
- line_label='y',
- line_num=1):
- plt.figure()
- plt.title(graph_title)
- if line_num == 1:
- plt.plot(batch_id, metric, line_style, label=line_label)
- else:
- for i in range(line_num):
- plt.plot(batch_id, metric[i], line_style[i], label=line_label[i])
- plt.xlabel('batch')
- plt.ylabel(graph_title)
- plt.legend()
- plt.savefig(graph_title + '.jpg')
- plt.close()
-
-
-def main():
- args = parse_args()
- assert args.sample_rate > 0. and args.sample_rate <= 1.0, "The sample rate should be in the range (0, 1]."
-
- loss, accuracy = parse_file(args.file_path)
- batch = [args.log_period * i for i in range(len(loss))]
-
- batch_sample = sample(batch, args.sample_rate)
- loss_sample = sample(loss, args.sample_rate)
- accuracy_sample = sample(accuracy, args.sample_rate)
-
- plot_metric(loss_sample, batch_sample, 'loss', line_label='loss')
- plot_metric(
- accuracy_sample,
- batch_sample,
- 'accuracy',
- line_style='g-',
- line_label='accuracy')
-
-
-if __name__ == '__main__':
- main()
diff --git a/benchmark/paddle/image/provider.py b/benchmark/paddle/image/provider.py
deleted file mode 100644
index 6ad817ccefab3e44a8f962e907ba2110a6ed4a45..0000000000000000000000000000000000000000
--- a/benchmark/paddle/image/provider.py
+++ /dev/null
@@ -1,47 +0,0 @@
-# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import io, os
-import random
-import numpy as np
-from paddle.trainer.PyDataProvider2 import *
-
-
-def initHook(settings, height, width, color, num_class, **kwargs):
- settings.height = height
- settings.width = width
- settings.color = color
- settings.num_class = num_class
- if settings.color:
- settings.data_size = settings.height * settings.width * 3
- else:
- settings.data_size = settings.height * settings.width
- settings.is_infer = kwargs.get('is_infer', False)
- settings.num_samples = kwargs.get('num_samples', 2560)
- if settings.is_infer:
- settings.slots = [dense_vector(settings.data_size)]
- else:
- settings.slots = [dense_vector(settings.data_size), integer_value(1)]
-
-
-@provider(
- init_hook=initHook, min_pool_size=-1, cache=CacheType.CACHE_PASS_IN_MEM)
-def process(settings, file_list):
- for i in xrange(settings.num_samples):
- img = np.random.rand(1, settings.data_size).reshape(-1, 1).flatten()
- if settings.is_infer:
- yield img.astype('float32')
- else:
- lab = random.randint(0, settings.num_class - 1)
- yield img.astype('float32'), int(lab)
diff --git a/benchmark/paddle/image/resnet.py b/benchmark/paddle/image/resnet.py
deleted file mode 100644
index 2846e4763f1cda4602f03af5ec649d57ee6cf0d8..0000000000000000000000000000000000000000
--- a/benchmark/paddle/image/resnet.py
+++ /dev/null
@@ -1,230 +0,0 @@
-#!/usr/bin/env python
-from paddle.trainer_config_helpers import *
-
-height = 224
-width = 224
-num_class = 1000
-batch_size = get_config_arg('batch_size', int, 64)
-layer_num = get_config_arg("layer_num", int, 50)
-is_infer = get_config_arg("is_infer", bool, False)
-num_samples = get_config_arg('num_samples', int, 2560)
-
-args = {
- 'height': height,
- 'width': width,
- 'color': True,
- 'num_class': num_class,
- 'is_infer': is_infer,
- 'num_samples': num_samples
-}
-define_py_data_sources2(
- "train.list" if not is_infer else None,
- "test.list" if is_infer else None,
- module="provider",
- obj="process",
- args=args)
-
-settings(
- batch_size=batch_size,
- learning_rate=0.01 / batch_size,
- learning_method=MomentumOptimizer(0.9),
- regularization=L2Regularization(0.0005 * batch_size))
-
-
-#######################Network Configuration #############
-def conv_bn_layer(name,
- input,
- filter_size,
- num_filters,
- stride,
- padding,
- channels=None,
- active_type=ReluActivation()):
- """
- A wrapper for conv layer with batch normalization layers.
- Note:
- conv layer has no activation.
- """
-
- tmp = img_conv_layer(
- name=name + "_conv",
- input=input,
- filter_size=filter_size,
- num_channels=channels,
- num_filters=num_filters,
- stride=stride,
- padding=padding,
- act=LinearActivation(),
- bias_attr=False)
- return batch_norm_layer(
- name=name + "_bn",
- input=tmp,
- act=active_type,
- use_global_stats=is_infer)
-
-
-def bottleneck_block(name, input, num_filters1, num_filters2):
- """
- A wrapper for the bottleneck building block in ResNet.
- The last conv_bn_layer has no activation.
- The addto layer uses ReLU activation.
- """
- last_name = conv_bn_layer(
- name=name + '_branch2a',
- input=input,
- filter_size=1,
- num_filters=num_filters1,
- stride=1,
- padding=0)
- last_name = conv_bn_layer(
- name=name + '_branch2b',
- input=last_name,
- filter_size=3,
- num_filters=num_filters1,
- stride=1,
- padding=1)
- last_name = conv_bn_layer(
- name=name + '_branch2c',
- input=last_name,
- filter_size=1,
- num_filters=num_filters2,
- stride=1,
- padding=0,
- active_type=LinearActivation())
-
- return addto_layer(
- name=name + "_addto", input=[input, last_name], act=ReluActivation())
-
-
-def mid_projection(name, input, num_filters1, num_filters2, stride=2):
- """
- A wrapper for the middle projection in ResNet.
- Projection shortcuts are used for increasing dimensions,
- while all other shortcuts are identity.
- branch1: a projection shortcut used to increase dimensions,
- with no activation.
- branch2x: bottleneck building block with identity shortcuts.
- """
- # stride = 2
- branch1 = conv_bn_layer(
- name=name + '_branch1',
- input=input,
- filter_size=1,
- num_filters=num_filters2,
- stride=stride,
- padding=0,
- active_type=LinearActivation())
-
- last_name = conv_bn_layer(
- name=name + '_branch2a',
- input=input,
- filter_size=1,
- num_filters=num_filters1,
- stride=stride,
- padding=0)
- last_name = conv_bn_layer(
- name=name + '_branch2b',
- input=last_name,
- filter_size=3,
- num_filters=num_filters1,
- stride=1,
- padding=1)
-
- last_name = conv_bn_layer(
- name=name + '_branch2c',
- input=last_name,
- filter_size=1,
- num_filters=num_filters2,
- stride=1,
- padding=0,
- active_type=LinearActivation())
-
- return addto_layer(
- name=name + "_addto", input=[branch1, last_name], act=ReluActivation())
-
-
-img = data_layer(name='image', size=height * width * 3)
-
-
-def deep_res_net(res2_num=3, res3_num=4, res4_num=6, res5_num=3):
- """
- A wrapper for the 50-, 101-, and 152-layer ResNets.
- res2_num: number of blocks stacked in conv2_x
- res3_num: number of blocks stacked in conv3_x
- res4_num: number of blocks stacked in conv4_x
- res5_num: number of blocks stacked in conv5_x
- """
- # For ImageNet
- # conv1: 112x112
- tmp = conv_bn_layer(
- "conv1",
- input=img,
- filter_size=7,
- channels=3,
- num_filters=64,
- stride=2,
- padding=3)
- tmp = img_pool_layer(name="pool1", input=tmp, pool_size=3, stride=2)
-
- # conv2_x: 56x56
- tmp = mid_projection(
- name="res2_1", input=tmp, num_filters1=64, num_filters2=256, stride=1)
- for i in xrange(2, res2_num + 1, 1):
- tmp = bottleneck_block(
- name="res2_" + str(i), input=tmp, num_filters1=64, num_filters2=256)
-
- # conv3_x: 28x28
- tmp = mid_projection(
- name="res3_1", input=tmp, num_filters1=128, num_filters2=512)
- for i in xrange(2, res3_num + 1, 1):
- tmp = bottleneck_block(
- name="res3_" + str(i),
- input=tmp,
- num_filters1=128,
- num_filters2=512)
-
- # conv4_x: 14x14
- tmp = mid_projection(
- name="res4_1", input=tmp, num_filters1=256, num_filters2=1024)
- for i in xrange(2, res4_num + 1, 1):
- tmp = bottleneck_block(
- name="res4_" + str(i),
- input=tmp,
- num_filters1=256,
- num_filters2=1024)
-
- # conv5_x: 7x7
- tmp = mid_projection(
- name="res5_1", input=tmp, num_filters1=512, num_filters2=2048)
- for i in xrange(2, res5_num + 1, 1):
- tmp = bottleneck_block(
- name="res5_" + str(i),
- input=tmp,
- num_filters1=512,
- num_filters2=2048)
-
- tmp = img_pool_layer(
- name='avgpool',
- input=tmp,
- pool_size=7,
- stride=1,
- pool_type=AvgPooling())
-
- return fc_layer(input=tmp, size=num_class, act=SoftmaxActivation())
-
-
-if layer_num == 50:
- resnet = deep_res_net(3, 4, 6, 3)
-elif layer_num == 101:
- resnet = deep_res_net(3, 4, 23, 3)
-elif layer_num == 152:
- resnet = deep_res_net(3, 8, 36, 3)
-else:
- print("Wrong layer number.")
-
-if is_infer:
- outputs(resnet)
-else:
- lbl = data_layer(name="label", size=num_class)
- loss = cross_entropy(name='loss', input=resnet, label=lbl)
- outputs(loss)
diff --git a/benchmark/paddle/image/run.sh b/benchmark/paddle/image/run.sh
deleted file mode 100755
index 5b58a8d773aab795e5439b0f0e5d81bec66b5f56..0000000000000000000000000000000000000000
--- a/benchmark/paddle/image/run.sh
+++ /dev/null
@@ -1,53 +0,0 @@
-#!/bin/bash
-
-set -e
-
-function train() {
- cfg=$1
- thread=$2
- bz=$3
- args="batch_size=$3"
- prefix=$4
- paddle train --job=time \
- --config=$cfg \
- --use_gpu=True \
- --trainer_count=$thread \
- --log_period=10 \
- --test_period=100 \
- --config_args=$args \
- > logs/$prefix-${thread}gpu-$bz.log 2>&1
-}
-
-if [ ! -d "train.list" ]; then
- echo " " > train.list
-fi
-if [ ! -d "logs" ]; then
- mkdir logs
-fi
-
-#========single-gpu=========#
-# alexnet
-train alexnet.py 1 64 alexnet
-train alexnet.py 1 128 alexnet
-train alexnet.py 1 256 alexnet
-train alexnet.py 1 512 alexnet
-
-# googlenet
-train googlenet.py 1 64 googlenet
-train googlenet.py 1 128 googlenet
-train googlenet.py 1 256 googlenet
-
-# smallnet
-train smallnet_mnist_cifar.py 1 64 smallnet
-train smallnet_mnist_cifar.py 1 128 smallnet
-train smallnet_mnist_cifar.py 1 256 smallnet
-train smallnet_mnist_cifar.py 1 512 smallnet
-
-
-############################
-#========multi-gpus=========#
-train alexnet.py 4 512 alexnet
-train alexnet.py 4 1024 alexnet
-
-train googlenet.py 4 512 googlenet
-train googlenet.py 4 1024 googlenet
diff --git a/benchmark/paddle/image/run_mkl_infer.sh b/benchmark/paddle/image/run_mkl_infer.sh
deleted file mode 100755
index 0fad5e04cc992a3ec97591d3833957bb7517a8f3..0000000000000000000000000000000000000000
--- a/benchmark/paddle/image/run_mkl_infer.sh
+++ /dev/null
@@ -1,89 +0,0 @@
-#!/bin/bash
-
-set -e
-
-function clock_to_seconds() {
- hours=`echo $1 | awk -F ':' '{print $1}'`
- mins=`echo $1 | awk -F ':' '{print $2}'`
- secs=`echo $1 | awk -F ':' '{print $3}'`
- echo `awk 'BEGIN{printf "%.2f",('$secs' + '$mins' * 60 + '$hours' * 3600)}'`
-}
-
-function infer() {
- unset OMP_NUM_THREADS MKL_NUM_THREADS OMP_DYNAMIC KMP_AFFINITY
- topology=$1
- layer_num=$2
- bs=$3
- use_mkldnn=$4
- if [ $4 == "True" ]; then
- thread=1
- log="logs/infer-${topology}-${layer_num}-mkldnn-${bs}.log"
- elif [ $4 == "False" ]; then
- thread=`nproc`
- if [ $thread -gt $bs ]; then
- thread=$bs
- fi
- log="logs/infer-${topology}-${layer_num}-${thread}mklml-${bs}.log"
- else
- echo "Wrong input $4, use True or False."
- exit 0
- fi
-
- models_in="models/${topology}-${layer_num}/pass-00000/"
- if [ ! -d $models_in ]; then
- echo "Training model ${topology}_${layer_num}"
- paddle train --job=train \
- --config="${topology}.py" \
- --use_mkldnn=True \
- --use_gpu=False \
- --trainer_count=1 \
- --num_passes=1 \
- --save_dir="models/${topology}-${layer_num}" \
- --config_args="batch_size=128,layer_num=${layer_num},num_samples=256" \
- > /dev/null 2>&1
- echo "Done"
- fi
- log_period=$((256 / bs))
- paddle train --job=test \
- --config="${topology}.py" \
- --use_mkldnn=$use_mkldnn \
- --use_gpu=False \
- --trainer_count=$thread \
- --log_period=$log_period \
- --config_args="batch_size=${bs},layer_num=${layer_num},is_infer=True" \
- --init_model_path=$models_in \
- 2>&1 | tee ${log}
-
- # calculate the time of the last 5 log periods (1280 samples);
- # the time before that is warm-up time.
- start=`tail ${log} -n 7 | head -n 1 | awk -F ' ' '{print $2}' | xargs`
- end=`tail ${log} -n 2 | head -n 1 | awk -F ' ' '{print $2}' | xargs`
- start_sec=`clock_to_seconds $start`
- end_sec=`clock_to_seconds $end`
- fps=`awk 'BEGIN{printf "%.2f",(1280 / ('$end_sec' - '$start_sec'))}'`
- echo "Last 1280 samples start: ${start}(${start_sec} sec), end: ${end}(${end_sec} sec;" >> ${log}
- echo "FPS: $fps images/sec" 2>&1 | tee -a ${log}
-}
-
-if [ ! -f "train.list" ]; then
- echo " " > train.list
-fi
-if [ ! -f "test.list" ]; then
- echo " " > test.list
-fi
-if [ ! -d "logs" ]; then
- mkdir logs
-fi
-if [ ! -d "models" ]; then
- mkdir -p models
-fi
-
-# inference benchmark
-for use_mkldnn in True False; do
- for batchsize in 1 2 4 8 16; do
- infer vgg 19 $batchsize $use_mkldnn
- infer resnet 50 $batchsize $use_mkldnn
- infer googlenet v1 $batchsize $use_mkldnn
- infer alexnet 2 $batchsize $use_mkldnn
- done
-done
diff --git a/benchmark/paddle/image/run_mkl_train.sh b/benchmark/paddle/image/run_mkl_train.sh
deleted file mode 100755
index 1583bf134a276a08aa2f8e84dc63adbb205a83d6..0000000000000000000000000000000000000000
--- a/benchmark/paddle/image/run_mkl_train.sh
+++ /dev/null
@@ -1,54 +0,0 @@
-#!/bin/bash
-
-set -e
-
-function train() {
- unset OMP_NUM_THREADS MKL_NUM_THREADS OMP_DYNAMIC KMP_AFFINITY
- topology=$1
- layer_num=$2
- bs=$3
- use_mkldnn=$4
- if [ $4 == "True" ]; then
- thread=1
- log="logs/train-${topology}-${layer_num}-mkldnn-${bs}.log"
- elif [ $4 == "False" ]; then
- thread=`nproc`
- # each trainer uses only 1 core to avoid conflicts
- log="logs/train-${topology}-${layer_num}-${thread}mklml-${bs}.log"
- else
- echo "Wrong input $4, use True or False."
- exit 0
- fi
- args="batch_size=${bs},layer_num=${layer_num}"
- config="${topology}.py"
- paddle train --job=time \
- --config=$config \
- --use_mkldnn=$use_mkldnn \
- --use_gpu=False \
- --trainer_count=$thread \
- --log_period=10 \
- --test_period=100 \
- --config_args=$args \
- 2>&1 | tee ${log}
-
- avg_time=`tail ${log} -n 1 | awk -F ' ' '{print $8}' | sed 's/avg=//'`
- fps=`awk 'BEGIN{printf "%.2f",('$bs' / '$avg_time' * 1000)}'`
- echo "FPS: $fps images/sec" 2>&1 | tee -a ${log}
-}
-
-if [ ! -f "train.list" ]; then
- echo " " > train.list
-fi
-if [ ! -d "logs" ]; then
- mkdir logs
-fi
-
-# training benchmark
-for use_mkldnn in True False; do
- for batchsize in 64 128 256; do
- train vgg 19 $batchsize $use_mkldnn
- train resnet 50 $batchsize $use_mkldnn
- train googlenet v1 $batchsize $use_mkldnn
- train alexnet 2 $batchsize $use_mkldnn
- done
-done
diff --git a/benchmark/paddle/image/run_openblas_infer.sh b/benchmark/paddle/image/run_openblas_infer.sh
deleted file mode 100755
index 987381cabc2e793886099212660723c122b73bb0..0000000000000000000000000000000000000000
--- a/benchmark/paddle/image/run_openblas_infer.sh
+++ /dev/null
@@ -1,71 +0,0 @@
-#!/bin/bash
-
-set -e
-
-function clock_to_seconds() {
- hours=`echo $1 | awk -F ':' '{print $1}'`
- mins=`echo $1 | awk -F ':' '{print $2}'`
- secs=`echo $1 | awk -F ':' '{print $3}'`
- echo `awk 'BEGIN{printf "%.2f",('$secs' + '$mins' * 60 + '$hours' * 3600)}'`
-}
-
-function infer() {
- export OPENBLAS_MAIN_FREE=1
- topology=$1
- layer_num=$2
- bs=$3
- trainers=`nproc`
- if [ $trainers -gt $bs ]; then
- trainers=$bs
- fi
- log="logs/infer-${topology}-${layer_num}-${trainers}openblas-${bs}.log"
- threads=$((`nproc` / trainers))
- if [ $threads -eq 0 ]; then
- threads=1
- fi
- export OPENBLAS_NUM_THREADS=$threads
-
- models_in="models/${topology}-${layer_num}/pass-00000/"
- if [ ! -d $models_in ]; then
- echo "./run_mkl_infer.sh to save the model first"
- exit 0
- fi
- log_period=$((32 / bs))
- paddle train --job=test \
- --config="${topology}.py" \
- --use_mkldnn=False \
- --use_gpu=False \
- --trainer_count=$trainers \
- --log_period=$log_period \
- --config_args="batch_size=${bs},layer_num=${layer_num},is_infer=True,num_samples=256" \
- --init_model_path=$models_in \
- 2>&1 | tee ${log}
-
- # calculate the time of the last 5 log periods (160 = 32 * 5 samples);
- # the time before that is warm-up time.
- start=`tail ${log} -n 7 | head -n 1 | awk -F ' ' '{print $2}' | xargs`
- end=`tail ${log} -n 2 | head -n 1 | awk -F ' ' '{print $2}' | xargs`
- start_sec=`clock_to_seconds $start`
- end_sec=`clock_to_seconds $end`
- fps=`awk 'BEGIN{printf "%.2f",(160 / ('$end_sec' - '$start_sec'))}'`
- echo "Last 160 samples start: ${start}(${start_sec} sec), end: ${end}(${end_sec} sec;" >> ${log}
- echo "FPS: $fps images/sec" 2>&1 | tee -a ${log}
-}
-
-if [ ! -f "train.list" ]; then
- echo " " > train.list
-fi
-if [ ! -f "test.list" ]; then
- echo " " > test.list
-fi
-if [ ! -d "logs" ]; then
- mkdir logs
-fi
-
-# inference benchmark
-for batchsize in 1 2 4 8 16; do
- infer vgg 19 $batchsize
- infer resnet 50 $batchsize
- infer googlenet v1 $batchsize
- infer alexnet 2 $batchsize
-done
diff --git a/benchmark/paddle/image/run_openblas_train.sh b/benchmark/paddle/image/run_openblas_train.sh
deleted file mode 100755
index cc64e1d09da02087b1737190a0b75dc7758600a6..0000000000000000000000000000000000000000
--- a/benchmark/paddle/image/run_openblas_train.sh
+++ /dev/null
@@ -1,43 +0,0 @@
-#!/bin/bash
-
-set -e
-
-function train() {
- export OPENBLAS_NUM_THREADS=1
- topology=$1
- layer_num=$2
- bs=$3
- thread=`nproc`
- # each trainer uses only 1 core to avoid conflicts
- log="logs/train-${topology}-${layer_num}-${thread}openblas-${bs}.log"
- args="batch_size=${bs},layer_num=${layer_num}"
- config="${topology}.py"
- paddle train --job=time \
- --config=$config \
- --use_mkldnn=False \
- --use_gpu=False \
- --trainer_count=$thread \
- --log_period=3 \
- --test_period=30 \
- --config_args=$args \
- 2>&1 | tee ${log}
-
- avg_time=`tail ${log} -n 1 | awk -F ' ' '{print $8}' | sed 's/avg=//'`
- fps=`awk 'BEGIN{printf "%.2f",('$bs' / '$avg_time' * 1000)}'`
- echo "FPS: $fps images/sec" 2>&1 | tee -a ${log}
-}
-
-if [ ! -f "train.list" ]; then
- echo " " > train.list
-fi
-if [ ! -d "logs" ]; then
- mkdir logs
-fi
-
-# training benchmark
-for batchsize in 64 128 256; do
- train vgg 19 $batchsize
- train resnet 50 $batchsize
- train googlenet v1 $batchsize
- train alexnet 2 $batchsize
-done
diff --git a/benchmark/paddle/image/smallnet_mnist_cifar.py b/benchmark/paddle/image/smallnet_mnist_cifar.py
deleted file mode 100644
index 58879c454f37991405d83bbb593bb5d1e977ff53..0000000000000000000000000000000000000000
--- a/benchmark/paddle/image/smallnet_mnist_cifar.py
+++ /dev/null
@@ -1,49 +0,0 @@
-#!/usr/bin/env python
-
-from paddle.trainer_config_helpers import *
-
-height = 32
-width = 32
-num_class = 10
-
-batch_size = get_config_arg('batch_size', int, 128)
-
-args = {'height': height, 'width': width, 'color': True, 'num_class': num_class}
-define_py_data_sources2(
- "train.list", None, module="provider", obj="process", args=args)
-
-settings(
- batch_size=batch_size,
- learning_rate=0.01 / batch_size,
- learning_method=MomentumOptimizer(0.9),
- regularization=L2Regularization(0.0005 * batch_size))
-
-# conv1
-net = data_layer('data', size=height * width * 3)
-net = img_conv_layer(
- input=net,
- filter_size=5,
- num_channels=3,
- num_filters=32,
- stride=1,
- padding=2)
-net = img_pool_layer(input=net, pool_size=3, stride=2, padding=1)
-
-# conv2
-net = img_conv_layer(
- input=net, filter_size=5, num_filters=32, stride=1, padding=2)
-net = img_pool_layer(
- input=net, pool_size=3, stride=2, padding=1, pool_type=AvgPooling())
-
-# conv3
-net = img_conv_layer(
- input=net, filter_size=3, num_filters=64, stride=1, padding=1)
-net = img_pool_layer(
- input=net, pool_size=3, stride=2, padding=1, pool_type=AvgPooling())
-
-net = fc_layer(input=net, size=64, act=ReluActivation())
-net = fc_layer(input=net, size=10, act=SoftmaxActivation())
-
-lab = data_layer('label', num_class)
-loss = classification_cost(input=net, label=lab)
-outputs(loss)
diff --git a/benchmark/paddle/image/vgg.py b/benchmark/paddle/image/vgg.py
deleted file mode 100644
index ca0a6798fb8c35b68cf84d263855955eb93ba0b0..0000000000000000000000000000000000000000
--- a/benchmark/paddle/image/vgg.py
+++ /dev/null
@@ -1,119 +0,0 @@
-#!/usr/bin/env python
-from paddle.trainer_config_helpers import *
-
-height = 224
-width = 224
-num_class = 1000
-batch_size = get_config_arg('batch_size', int, 64)
-layer_num = get_config_arg('layer_num', int, 19)
-is_infer = get_config_arg("is_infer", bool, False)
-num_samples = get_config_arg('num_samples', int, 2560)
-
-args = {
- 'height': height,
- 'width': width,
- 'color': True,
- 'num_class': num_class,
- 'is_infer': is_infer,
- 'num_samples': num_samples
-}
-define_py_data_sources2(
- "train.list" if not is_infer else None,
- "test.list" if is_infer else None,
- module="provider",
- obj="process",
- args=args)
-
-settings(
- batch_size=batch_size,
- learning_rate=0.001 / batch_size,
- learning_method=MomentumOptimizer(0.9),
- regularization=L2Regularization(0.0005 * batch_size))
-
-img = data_layer(name='image', size=height * width * 3)
-
-
-def vgg_network(vgg_num=3):
- tmp = img_conv_group(
- input=img,
- num_channels=3,
- conv_padding=1,
- conv_num_filter=[64, 64],
- conv_filter_size=3,
- conv_act=ReluActivation(),
- pool_size=2,
- pool_stride=2,
- pool_type=MaxPooling())
-
- tmp = img_conv_group(
- input=tmp,
- conv_num_filter=[128, 128],
- conv_padding=1,
- conv_filter_size=3,
- conv_act=ReluActivation(),
- pool_stride=2,
- pool_type=MaxPooling(),
- pool_size=2)
-
- channels = []
- for i in range(vgg_num):
- channels.append(256)
- tmp = img_conv_group(
- input=tmp,
- conv_num_filter=channels,
- conv_padding=1,
- conv_filter_size=3,
- conv_act=ReluActivation(),
- pool_stride=2,
- pool_type=MaxPooling(),
- pool_size=2)
- channels = []
- for i in range(vgg_num):
- channels.append(512)
- tmp = img_conv_group(
- input=tmp,
- conv_num_filter=channels,
- conv_padding=1,
- conv_filter_size=3,
- conv_act=ReluActivation(),
- pool_stride=2,
- pool_type=MaxPooling(),
- pool_size=2)
- tmp = img_conv_group(
- input=tmp,
- conv_num_filter=channels,
- conv_padding=1,
- conv_filter_size=3,
- conv_act=ReluActivation(),
- pool_stride=2,
- pool_type=MaxPooling(),
- pool_size=2)
-
- tmp = fc_layer(
- input=tmp,
- size=4096,
- act=ReluActivation(),
- layer_attr=ExtraAttr(drop_rate=0.5))
-
- tmp = fc_layer(
- input=tmp,
- size=4096,
- act=ReluActivation(),
- layer_attr=ExtraAttr(drop_rate=0.5))
-
- return fc_layer(input=tmp, size=num_class, act=SoftmaxActivation())
-
-
-if layer_num == 16:
- vgg = vgg_network(3)
-elif layer_num == 19:
- vgg = vgg_network(4)
-else:
- print("Wrong layer number.")
-
-if is_infer:
- outputs(vgg)
-else:
- lab = data_layer('label', num_class)
- loss = cross_entropy(input=vgg, label=lab)
- outputs(loss)
diff --git a/benchmark/paddle/rnn/imdb.py b/benchmark/paddle/rnn/imdb.py
deleted file mode 100755
index 2a67f9b0cf52484d9d44fe9db0b1e57cdd20fd43..0000000000000000000000000000000000000000
--- a/benchmark/paddle/rnn/imdb.py
+++ /dev/null
@@ -1,60 +0,0 @@
-# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from __future__ import print_function
-import six.moves.cPickle as pickle
-import gzip
-import os
-import numpy
-
-
-def get_dataset_file(dataset, default_dataset, origin):
- data_dir, data_file = os.path.split(dataset)
- if (not os.path.isfile(dataset)) and data_file == default_dataset:
- from six.moves import urllib
- print('Downloading data from %s' % origin)
- urllib.request.urlretrieve(origin, dataset)
-
- return dataset
-
-
-def create_data(path="imdb.pkl"):
-
- if (not os.path.isfile('imdb.train.pkl')):
- path = get_dataset_file(
- path, "imdb.pkl",
- "http://www.iro.umontreal.ca/~lisa/deep/data/imdb.pkl")
-
- if path.endswith(".gz"):
- f = gzip.open(path, 'rb')
- else:
- f = open(path, 'rb')
-
- train_set = pickle.load(f)
- test_set = pickle.load(f)
- f.close()
-
- pickle.dump(train_set, open('imdb.train.pkl', 'wb'))
- pickle.dump(test_set, open('imdb.test.pkl', 'wb'))
-
- if (not os.path.isfile('train.list')):
- file('train.list', 'w').write('imdb.train.pkl\n')
-
-
-def main():
- create_data('imdb.pkl')
-
-
-if __name__ == "__main__":
- main()
diff --git a/benchmark/paddle/rnn/provider.py b/benchmark/paddle/rnn/provider.py
deleted file mode 100644
index 23cc0c44a98d0ae7f586d1a376a603198f2c6144..0000000000000000000000000000000000000000
--- a/benchmark/paddle/rnn/provider.py
+++ /dev/null
@@ -1,86 +0,0 @@
-# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import io, os
-import random
-import numpy as np
-import six.moves.cPickle as pickle
-from paddle.trainer.PyDataProvider2 import *
-
-
-def remove_unk(x, n_words):
- return [[1 if w >= n_words else w for w in sen] for sen in x]
-
-
-# ==============================================================
-# TensorFlow uses fixed-length sequences, while PaddlePaddle can process
-# variable-length ones. Padding is used in the benchmark in order to
-# compare with the other platform.
-# ==============================================================
-def pad_sequences(sequences,
- maxlen=None,
- dtype='int32',
- padding='post',
- truncating='post',
- value=0.):
- lengths = [len(s) for s in sequences]
-
- nb_samples = len(sequences)
- if maxlen is None:
- maxlen = np.max(lengths)
-
- x = (np.ones((nb_samples, maxlen)) * value).astype(dtype)
- for idx, s in enumerate(sequences):
- if len(s) == 0:
- continue # empty list was found
- if truncating == 'pre':
- trunc = s[-maxlen:]
- elif truncating == 'post':
- trunc = s[:maxlen]
- else:
- raise ValueError("Truncating type '%s' not understood" % padding)
-
- if padding == 'post':
- x[idx, :len(trunc)] = trunc
- elif padding == 'pre':
- x[idx, -len(trunc):] = trunc
- else:
- raise ValueError("Padding type '%s' not understood" % padding)
- return x
-
-
-def initHook(settings, vocab_size, pad_seq, maxlen, **kwargs):
- settings.vocab_size = vocab_size
- settings.pad_seq = pad_seq
- settings.maxlen = maxlen
- settings.input_types = [
- integer_value_sequence(vocab_size), integer_value(2)
- ]
-
-
-@provider(
- init_hook=initHook, min_pool_size=-1, cache=CacheType.CACHE_PASS_IN_MEM)
-def process(settings, file):
- f = open(file, 'rb')
- train_set = pickle.load(f)
- f.close()
- x, y = train_set
-
- # remove unknown words, i.e. words outside the dictionary
- x = remove_unk(x, settings.vocab_size)
- if settings.pad_seq:
- x = pad_sequences(x, maxlen=settings.maxlen, value=0.)
-
- for i in range(len(y)):
- yield map(int, x[i]), int(y[i])
diff --git a/benchmark/paddle/rnn/rnn.py b/benchmark/paddle/rnn/rnn.py
deleted file mode 100755
index 83eb3e565473f7e7e91cddeaa3cd2aafb7e3df2c..0000000000000000000000000000000000000000
--- a/benchmark/paddle/rnn/rnn.py
+++ /dev/null
@@ -1,38 +0,0 @@
-#!/usr/bin/env python
-
-from paddle.trainer_config_helpers import *
-import imdb
-
-num_class = 2
-vocab_size = 30000
-fixedlen = 100
-batch_size = get_config_arg('batch_size', int, 128)
-lstm_num = get_config_arg('lstm_num', int, 1)
-hidden_size = get_config_arg('hidden_size', int, 128)
-# whether to pad sequences to a fixed length
-pad_seq = get_config_arg('pad_seq', bool, True)
-imdb.create_data('imdb.pkl')
-
-args = {'vocab_size': vocab_size, 'pad_seq': pad_seq, 'maxlen': fixedlen}
-define_py_data_sources2(
- "train.list", None, module="provider", obj="process", args=args)
-
-settings(
- batch_size=batch_size,
- learning_rate=2e-3,
- learning_method=AdamOptimizer(),
- regularization=L2Regularization(8e-4),
- gradient_clipping_threshold=25)
-
-net = data_layer('data', size=vocab_size)
-net = embedding_layer(input=net, size=128)
-
-for i in xrange(lstm_num):
- net = simple_lstm(input=net, size=hidden_size)
-
-net = last_seq(input=net)
-net = fc_layer(input=net, size=2, act=SoftmaxActivation())
-
-lab = data_layer('label', num_class)
-loss = classification_cost(input=net, label=lab)
-outputs(loss)
diff --git a/benchmark/paddle/rnn/run.sh b/benchmark/paddle/rnn/run.sh
deleted file mode 100755
index f99a562b3f88a98560f4bf7aee98ceee9daefe67..0000000000000000000000000000000000000000
--- a/benchmark/paddle/rnn/run.sh
+++ /dev/null
@@ -1,52 +0,0 @@
-#!/bin/bash
-
-set -e
-
-function train() {
- cfg=$1
- thread=$2
- args="lstm_num=${3},seq_pad=${4},hidden_size=${5},batch_size=${6}"
- paddle train --job=time \
- --config=$cfg \
- --use_gpu=1 \
- --trainer_count=$thread \
- --log_period=10 \
- --test_period=100 \
- --num_passes=1 \
- --feed_data=1 \
- --config_args=$args \
- >logs/rnn-pad${4}-${thread}gpu-lstm${3}-batch${6}-hid${5}.log 2>&1
-}
-
-if [ ! -d "logs" ]; then
- mkdir logs
-fi
-
-## padding, single gpu
-#-----config--gpu--lstm_num--padding--hidden_size--batch_size
-## lstm_num=2, batch_size=64
-train rnn.py 1 2 1 256 64
-train rnn.py 1 2 1 512 64
-train rnn.py 1 2 1 1280 64
-
-## lstm_num=2, batch_size=128
-train rnn.py 1 2 1 256 128
-train rnn.py 1 2 1 512 128
-train rnn.py 1 2 1 1280 128
-
-## lstm_num=2, batch_size=256
-train rnn.py 1 2 1 256 256
-train rnn.py 1 2 1 512 256
-train rnn.py 1 2 1 1280 256
-
-
-#==================multi gpus=====================#
-# hidden_size=256, lstm_num=2, different batch size
-train rnn.py 4 2 1 256 128
-train rnn.py 4 2 1 256 256
-train rnn.py 4 2 1 256 512
-
-# hidden_size=512, lstm_num=2, different batch size
-train rnn.py 4 2 1 512 128
-train rnn.py 4 2 1 512 256
-train rnn.py 4 2 1 512 512
diff --git a/benchmark/tensorflow/machine_translation.py b/benchmark/tensorflow/machine_translation.py
index 8f77dce98353af53803246be8dc61063836b7867..7837669edc7a206c03e5b9fa2989bf45b35f0605 100644
--- a/benchmark/tensorflow/machine_translation.py
+++ b/benchmark/tensorflow/machine_translation.py
@@ -35,8 +35,6 @@ import os
import argparse
import time
-import paddle.v2 as paddle
-
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--embedding_dim",
diff --git a/benchmark/tensorflow/mnist.py b/benchmark/tensorflow/mnist.py
index 7140eed6eaff49b5c65f9ccb2e38f113a4cdbdbf..03d533fecfededddd3956ba83ea600456782cfc9 100644
--- a/benchmark/tensorflow/mnist.py
+++ b/benchmark/tensorflow/mnist.py
@@ -21,7 +21,6 @@ import time
import numpy as np
import tensorflow as tf
-import paddle.v2 as paddle
DTYPE = tf.float32
diff --git a/benchmark/tensorflow/resnet.py b/benchmark/tensorflow/resnet.py
index c432fa8d59571e128b9ff9e3ffa1949b792ef3a4..fdb044195766b847e16a0cc33424a999c1d9166e 100644
--- a/benchmark/tensorflow/resnet.py
+++ b/benchmark/tensorflow/resnet.py
@@ -27,7 +27,6 @@ import argparse
import time
import numpy as np
-import paddle.v2 as paddle
import tensorflow as tf
DTYPE = tf.float32
diff --git a/benchmark/tensorflow/stacked_dynamic_lstm.py b/benchmark/tensorflow/stacked_dynamic_lstm.py
index 5285033005044d907d0b7e91eb66ee7281c4f27a..1f532dc2fa082ea0f6b1da560e1a57b96d2ef1bb 100644
--- a/benchmark/tensorflow/stacked_dynamic_lstm.py
+++ b/benchmark/tensorflow/stacked_dynamic_lstm.py
@@ -21,8 +21,6 @@ import argparse
import time
import tensorflow as tf
-import paddle.v2 as paddle
-
def parse_args():
parser = argparse.ArgumentParser("LSTM model benchmark.")
diff --git a/benchmark/tensorflow/vgg.py b/benchmark/tensorflow/vgg.py
index fba5ec71a46b3ac8b2e1244424c39fd5192e5458..d32c835bd7a7dafaafe0970fb6b422db3c866370 100644
--- a/benchmark/tensorflow/vgg.py
+++ b/benchmark/tensorflow/vgg.py
@@ -13,7 +13,6 @@
# limitations under the License.
"""VGG16 benchmark in TensorFlow"""
import tensorflow as tf
-import paddle.v2 as paddle
import numpy as np
import argparse
import time
diff --git a/paddle/fluid/framework/ir/identity_scale_op_clean_pass.cc b/paddle/fluid/framework/ir/identity_scale_op_clean_pass.cc
index 3b738aa159ebfd77f00c9e532fbd94542e2097db..5bdc0c5faed7131b873edf9b43c847c010b6e3f3 100644
--- a/paddle/fluid/framework/ir/identity_scale_op_clean_pass.cc
+++ b/paddle/fluid/framework/ir/identity_scale_op_clean_pass.cc
@@ -38,9 +38,13 @@ std::unique_ptr IdentityScaleOpCleanPass::ApplyImpl(
->assert_is_op("scale")
->assert_op_attr("scale", 1.)
->assert_op_attr("bias", 0.);
- auto scale_out = detector.mutable_pattern()
- ->NewNode("scale_out")
- ->assert_is_op_output("scale");
+ auto scale_out =
+ detector.mutable_pattern()
+ ->NewNode("scale_out")
+ ->assert_is_op_output("scale")
+ // scale's output var should have only one consumer, or it can't be
+ // removed.
+ ->assert_more([](Node* x) { return x->outputs.size() == 1UL; });
pre_op->LinksTo({scale_in});
scale_op->LinksFrom({scale_in}).LinksTo({scale_out});
diff --git a/paddle/fluid/imperative/layer.cc b/paddle/fluid/imperative/layer.cc
index 47488d4dea79f285769f29c93f7888a7f783f070..8f20f0c06e043ddc629e47c6e49280c5467b0e20 100644
--- a/paddle/fluid/imperative/layer.cc
+++ b/paddle/fluid/imperative/layer.cc
@@ -207,7 +207,7 @@ framework::LoDTensor& VarBase::GradValue() {
std::map> OpBase::ApplyGrad() {
if (grad_op_descs_.empty() && backward_id_ <= 0) {
- LOG(WARNING) << "op with no grad: " << op_desc_->Type();
+ VLOG(3) << "op with no grad: " << op_desc_->Type();
return {};
}
diff --git a/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.cc b/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.cc
index a64f85ee9ac1a7bb8f0ed7bb8678166bbbcd5746..96befe7f8a5d16402338ac337daa96d714b4d310 100644
--- a/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.cc
+++ b/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.cc
@@ -460,77 +460,6 @@ inline bool CheckNodeIndegreeEquals(const Node &node, size_t n) {
return node.inputs.size() == n;
}
-NodesTSIterator::NodesTSIterator(const std::vector &source) {
- PADDLE_ENFORCE(!source.empty(),
- "Start points of topological sorting should not be empty!");
- // CHECK all the inputs' in-degree is 0
- for (auto *node : source) {
- PADDLE_ENFORCE(CheckNodeIndegreeEquals(*node, 0));
- }
-
- std::unordered_set visited;
- std::unordered_set to_visit{source.begin(), source.end()};
-
- std::vector inlink_visited;
- while (!to_visit.empty()) {
- std::vector queue(to_visit.begin(), to_visit.end());
- for (auto *p : queue) {
- if (Agent(p).deleted()) {
- visited.insert(p);
- to_visit.erase(p);
- }
-
- inlink_visited.clear();
-
- std::copy_if(p->inputs.begin(), p->inputs.end(),
- std::back_inserter(inlink_visited),
- [&](Node *x) -> bool { return visited.count(x) != 0; });
-
- if (inlink_visited.size() == p->inputs.size()) {
- sorted_.push_back(p);
- for (auto *_ : p->outputs) {
- if (!visited.count(_)) {
- to_visit.insert(_);
- }
- }
-
- to_visit.erase(p);
- visited.insert(p);
- }
- }
- }
-}
-
-NodesTSIterator::NodesTSIterator(const NodesTSIterator &other)
- : sorted_(other.sorted_), cursor_(other.cursor_) {}
-
-Node &NodesTSIterator::operator*() {
- PADDLE_ENFORCE_LT(cursor_, sorted_.size());
- return *sorted_[cursor_];
-}
-
-NodesTSIterator &NodesTSIterator::operator++() {
- if (++cursor_ >= sorted_.size()) {
- sorted_.clear();
- cursor_ = 0;
- }
- return *this;
-}
-NodesTSIterator &NodesTSIterator::operator=(const NodesTSIterator &other) {
- cursor_ = other.cursor_;
- sorted_ = other.sorted_;
- return *this;
-}
-
-bool NodesTSIterator::operator==(const NodesTSIterator &other) {
- return sorted_ == other.sorted_ && cursor_ == other.cursor_;
-}
-
-Node *NodesTSIterator::operator->() {
- PADDLE_ENFORCE_LT(cursor_, sorted_.size());
- return sorted_[cursor_];
-}
-
} // namespace analysis
} // namespace inference
} // namespace paddle
diff --git a/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h b/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h
index ea88edd042aa9d46f66af1aa92f2cb273696c118..5d11c217b69f11d45c6fb6d552dc404fa8313daf 100644
--- a/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h
+++ b/paddle/fluid/inference/analysis/ir_passes/subgraph_detector.h
@@ -30,6 +30,7 @@ namespace inference {
namespace analysis {
using framework::ir::Graph;
+using framework::ir::NodesTSIterator;
const char kIsFunctionNode[] = "__is_function_node__";
const char kFunctionNodeSubGraph[] = "__function_node_sub_graph__";
@@ -132,32 +133,6 @@ struct Agent {
framework::ir::Node *x_;
};
-// Topological sorting iterator on nodes.
-struct NodesTSIterator
- : public std::iterator {
- NodesTSIterator() = default;
- explicit NodesTSIterator(const std::vector &source);
- NodesTSIterator(NodesTSIterator &&other)
- : sorted_(std::move(other.sorted_)), cursor_(other.cursor_) {
- other.cursor_ = 0;
- }
- NodesTSIterator(const NodesTSIterator &other);
-
- framework::ir::Node &operator*();
- NodesTSIterator &operator++();
- // TODO(Superjomn) current implementation just compare the first
- // element, need to compare the graph and all the elements in the queue and
- // set.
- NodesTSIterator &operator=(const NodesTSIterator &other);
- bool operator==(const NodesTSIterator &other);
- bool operator!=(const NodesTSIterator &other) { return !(*this == other); }
- framework::ir::Node *operator->();
-
- private:
- std::vector sorted_;
- size_t cursor_{0};
-};
-
// The nodes those have no input will be treated as start points.
static std::vector ExtractStartPoints(const Graph &g) {
std::vector result;
diff --git a/paddle/fluid/operators/detection/density_prior_box_op.h b/paddle/fluid/operators/detection/density_prior_box_op.h
index 3591681fc3f6951dfc8d73e8edce38180b771eaf..42137215e21af1a529563ecc995a54d610120beb 100644
--- a/paddle/fluid/operators/detection/density_prior_box_op.h
+++ b/paddle/fluid/operators/detection/density_prior_box_op.h
@@ -72,7 +72,7 @@ class DensityPriorBoxOpKernel : public framework::OpKernel {
#ifdef PADDLE_WITH_MKLML
#pragma omp parallel for
#endif
- for (int i = 0; i < fixed_ratios.size(); i++) {
+ for (size_t i = 0; i < fixed_ratios.size(); i++) {
sqrt_fixed_ratios.push_back(sqrt(fixed_ratios[i]));
}
@@ -115,11 +115,10 @@ class DensityPriorBoxOpKernel : public framework::OpKernel {
}
}
if (clip) {
- platform::Transform trans;
- ClipFunctor clip_func;
- trans(ctx.template device_context(),
- boxes->data(), boxes->data() + boxes->numel(),
- boxes->data(), clip_func);
+ T* dt = boxes->data();
+ std::transform(dt, dt + boxes->numel(), dt, [](T v) -> T {
+ return std::min(std::max(v, 0.), 1.);
+ });
}
framework::Tensor var_t;
var_t.mutable_data(
@@ -141,7 +140,7 @@ class DensityPriorBoxOpKernel : public framework::OpKernel {
#pragma omp parallel for collapse(2)
#endif
for (int i = 0; i < box_num; ++i) {
- for (int j = 0; j < variances.size(); ++j) {
+ for (size_t j = 0; j < variances.size(); ++j) {
e_vars(i, j) = variances[j];
}
}
diff --git a/paddle/fluid/operators/detection/prior_box_op.h b/paddle/fluid/operators/detection/prior_box_op.h
index 4e226abbb51c271502f0ca5419d488643b5a1a82..f84405664596ebe25983e5acbbb82bfc18c38124 100644
--- a/paddle/fluid/operators/detection/prior_box_op.h
+++ b/paddle/fluid/operators/detection/prior_box_op.h
@@ -46,13 +46,6 @@ inline void ExpandAspectRatios(const std::vector& input_aspect_ratior,
}
}
-template
-struct ClipFunctor {
- HOSTDEVICE inline T operator()(T in) const {
- return std::min(std::max(in, 0.), 1.);
- }
-};
-
template
class PriorBoxOpKernel : public framework::OpKernel {
public:
@@ -101,31 +94,30 @@ class PriorBoxOpKernel : public framework::OpKernel {
boxes->mutable_data(ctx.GetPlace());
vars->mutable_data(ctx.GetPlace());
- auto e_boxes = framework::EigenTensor::From(*boxes);
+ T* b_t = boxes->data();
for (int h = 0; h < feature_height; ++h) {
for (int w = 0; w < feature_width; ++w) {
T center_x = (w + offset) * step_width;
T center_y = (h + offset) * step_height;
T box_width, box_height;
- int idx = 0;
for (size_t s = 0; s < min_sizes.size(); ++s) {
auto min_size = min_sizes[s];
if (min_max_aspect_ratios_order) {
box_width = box_height = min_size / 2.;
- e_boxes(h, w, idx, 0) = (center_x - box_width) / img_width;
- e_boxes(h, w, idx, 1) = (center_y - box_height) / img_height;
- e_boxes(h, w, idx, 2) = (center_x + box_width) / img_width;
- e_boxes(h, w, idx, 3) = (center_y + box_height) / img_height;
- idx++;
+ b_t[0] = (center_x - box_width) / img_width;
+ b_t[1] = (center_y - box_height) / img_height;
+ b_t[2] = (center_x + box_width) / img_width;
+ b_t[3] = (center_y + box_height) / img_height;
+ b_t += 4;
if (max_sizes.size() > 0) {
auto max_size = max_sizes[s];
// square prior with size sqrt(minSize * maxSize)
box_width = box_height = sqrt(min_size * max_size) / 2.;
- e_boxes(h, w, idx, 0) = (center_x - box_width) / img_width;
- e_boxes(h, w, idx, 1) = (center_y - box_height) / img_height;
- e_boxes(h, w, idx, 2) = (center_x + box_width) / img_width;
- e_boxes(h, w, idx, 3) = (center_y + box_height) / img_height;
- idx++;
+ b_t[0] = (center_x - box_width) / img_width;
+ b_t[1] = (center_y - box_height) / img_height;
+ b_t[2] = (center_x + box_width) / img_width;
+ b_t[3] = (center_y + box_height) / img_height;
+ b_t += 4;
}
// priors with different aspect ratios
for (size_t r = 0; r < aspect_ratios.size(); ++r) {
@@ -135,11 +127,11 @@ class PriorBoxOpKernel : public framework::OpKernel {
}
box_width = min_size * sqrt(ar) / 2.;
box_height = min_size / sqrt(ar) / 2.;
- e_boxes(h, w, idx, 0) = (center_x - box_width) / img_width;
- e_boxes(h, w, idx, 1) = (center_y - box_height) / img_height;
- e_boxes(h, w, idx, 2) = (center_x + box_width) / img_width;
- e_boxes(h, w, idx, 3) = (center_y + box_height) / img_height;
- idx++;
+ b_t[0] = (center_x - box_width) / img_width;
+ b_t[1] = (center_y - box_height) / img_height;
+ b_t[2] = (center_x + box_width) / img_width;
+ b_t[3] = (center_y + box_height) / img_height;
+ b_t += 4;
}
} else {
// priors with different aspect ratios
@@ -147,21 +139,21 @@ class PriorBoxOpKernel : public framework::OpKernel {
float ar = aspect_ratios[r];
box_width = min_size * sqrt(ar) / 2.;
box_height = min_size / sqrt(ar) / 2.;
- e_boxes(h, w, idx, 0) = (center_x - box_width) / img_width;
- e_boxes(h, w, idx, 1) = (center_y - box_height) / img_height;
- e_boxes(h, w, idx, 2) = (center_x + box_width) / img_width;
- e_boxes(h, w, idx, 3) = (center_y + box_height) / img_height;
- idx++;
+ b_t[0] = (center_x - box_width) / img_width;
+ b_t[1] = (center_y - box_height) / img_height;
+ b_t[2] = (center_x + box_width) / img_width;
+ b_t[3] = (center_y + box_height) / img_height;
+ b_t += 4;
}
if (max_sizes.size() > 0) {
auto max_size = max_sizes[s];
// square prior with size sqrt(minSize * maxSize)
box_width = box_height = sqrt(min_size * max_size) / 2.;
- e_boxes(h, w, idx, 0) = (center_x - box_width) / img_width;
- e_boxes(h, w, idx, 1) = (center_y - box_height) / img_height;
- e_boxes(h, w, idx, 2) = (center_x + box_width) / img_width;
- e_boxes(h, w, idx, 3) = (center_y + box_height) / img_height;
- idx++;
+ b_t[0] = (center_x - box_width) / img_width;
+ b_t[1] = (center_y - box_height) / img_height;
+ b_t[2] = (center_x + box_width) / img_width;
+ b_t[3] = (center_y + box_height) / img_height;
+ b_t += 4;
}
}
}
@@ -169,11 +161,10 @@ class PriorBoxOpKernel : public framework::OpKernel {
}
if (clip) {
- platform::Transform trans;
- ClipFunctor clip_func;
- trans(ctx.template device_context(),
- boxes->data(), boxes->data() + boxes->numel(),
- boxes->data(), clip_func);
+ T* dt = boxes->data();
+ std::transform(dt, dt + boxes->numel(), dt, [](T v) -> T {
+ return std::min(std::max(v, 0.), 1.);
+ });
}
framework::Tensor var_t;
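
The clipping that replaces `ClipFunctor` in both the density_prior_box and prior_box kernels above is a plain element-wise clamp of the normalized box coordinates to [0, 1]. A minimal NumPy sketch of the same step (illustration only; the sample values are made up):

```python
import numpy as np

# Normalized box coordinates that may fall outside [0, 1] (made-up values).
boxes = np.array([-0.2, 0.35, 0.9, 1.4], dtype=np.float32)

# Equivalent of std::transform with std::min(std::max(v, 0.), 1.) per element.
clipped = np.clip(boxes, 0.0, 1.0)
print(clipped)  # [0.   0.35 0.9  1.  ]
```
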
diff --git a/paddle/fluid/operators/group_norm_op.cc b/paddle/fluid/operators/group_norm_op.cc
index e18d9841bb87c6a684d53e1bceb6c20a37dcfcfa..cbdffa0db8277dbf7257c3b3c1d03c1b459d5b2b 100644
--- a/paddle/fluid/operators/group_norm_op.cc
+++ b/paddle/fluid/operators/group_norm_op.cc
@@ -170,13 +170,48 @@ class GroupNormGradMaker : public framework::SingleGradOpDescMaker {
}
};
+class GroupNormInplaceInToOut : public framework::InplaceInToOut {
+ public:
+ using InplaceInToOut::InplaceInToOut;
+
+ protected:
+ std::unordered_map<std::string, std::string> Apply(
+ const framework::OpDesc &op_desc,
+ framework::BlockDesc *block) const override {
+ return {{"X", "Y"}};
+ }
+};
+
+class GroupNormGradInplaceInToOut : public framework::InplaceInToOut {
+ public:
+ using InplaceInToOut::InplaceInToOut;
+
+ protected:
+ std::unordered_map<std::string, std::string> Apply(
+ const framework::OpDesc &op_desc,
+ framework::BlockDesc *block) const override {
+ return {{framework::GradVarName("Y"), framework::GradVarName("X")}};
+ }
+};
+
+class GroupNormOpInferVarType
+ : public framework::PassInDtypeAndVarTypeToOutput {
+ protected:
+ std::unordered_map<std::string, std::string> GetInputOutputWithSameType()
+ const override {
+ return {{"X", /*->*/ "Y"}};
+ }
+};
+
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OPERATOR(group_norm, ops::GroupNormOp, ops::GroupNormOpMaker,
- ops::GroupNormGradMaker);
-REGISTER_OPERATOR(group_norm_grad, ops::GroupNormGradOp);
+ ops::GroupNormOpInferVarType, ops::GroupNormGradMaker,
+ ops::GroupNormInplaceInToOut);
+REGISTER_OPERATOR(group_norm_grad, ops::GroupNormGradOp,
+ ops::GroupNormGradInplaceInToOut);
REGISTER_OP_CPU_KERNEL(
group_norm, ops::GroupNormKernel,
ops::GroupNormKernel);
diff --git a/paddle/fluid/operators/load_combine_op.cc b/paddle/fluid/operators/load_combine_op.cc
index c4a2282e16483dbe78a32a4148c5bc4349dde3dc..f5c802986e0573e81b3ab6187b57657b52b37215 100644
--- a/paddle/fluid/operators/load_combine_op.cc
+++ b/paddle/fluid/operators/load_combine_op.cc
@@ -64,7 +64,7 @@ class LoadCombineOp : public framework::OperatorBase {
auto *tensor = out_var->GetMutable();
// Error checking
- PADDLE_ENFORCE(static_cast<bool>(buffer), "Cannot read more");
+ PADDLE_ENFORCE(static_cast<bool>(*buffer), "Cannot read more");
// Get data from fin to tensor
DeserializeFromStream(*buffer, tensor, dev_ctx);
@@ -90,6 +90,10 @@ class LoadCombineOp : public framework::OperatorBase {
tensor->ShareDataWith(fp16_tensor);
}
}
+ buffer->peek();
+ PADDLE_ENFORCE(buffer->eof(),
+ "You are not allowed to load partial data via "
+ "load_combine_op, use load_op instead.");
}
};
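
The added `peek()`/`eof()` check makes load_combine fail loudly when the file holds more tensors than the op expects to load. A rough Python sketch of the same invariant, assuming a hypothetical `read_one` callback that deserializes a single tensor from a binary stream:

```python
import io

def load_all(stream, n_tensors, read_one):
    # Deserialize exactly n_tensors, then require the stream to be exhausted,
    # mirroring buffer->peek(); PADDLE_ENFORCE(buffer->eof(), ...).
    tensors = [read_one(stream) for _ in range(n_tensors)]
    if stream.read(1) != b"":
        raise ValueError("You are not allowed to load partial data via "
                         "load_combine_op, use load_op instead.")
    return tensors

# Toy usage: each "tensor" is a single byte here.
print(load_all(io.BytesIO(b"ab"), 2, lambda s: s.read(1)))
```
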
diff --git a/paddle/fluid/operators/lstm_op.h b/paddle/fluid/operators/lstm_op.h
index 7d62d2d020ec2e3a29ad8720a8f04fead3a90a63..3f110024b285d41ccfe305e35c8efca5ed5ee0fe 100644
--- a/paddle/fluid/operators/lstm_op.h
+++ b/paddle/fluid/operators/lstm_op.h
@@ -311,6 +311,10 @@ class LSTMGradKernel : public framework::OpKernel {
lstm_grad.prev_state_grad = c0_g ? ordered_c0_g.data() : nullptr;
}
+ // lstm_value.output_value not used in bp, set to nullptr
+ // lstm_grad.state_active_grad not used in bp, set to nullptr
+ lstm_value.output_value = nullptr;
+ lstm_grad.state_active_grad = nullptr;
int cur_batch_size = bend - bstart;
math::LstmUnitGradFunctor::compute(
device_ctx, lstm_value, lstm_grad, frame_size, cur_batch_size,
diff --git a/paddle/fluid/operators/lstmp_op.h b/paddle/fluid/operators/lstmp_op.h
index 370dd04d1449a8e211febf9a4f9e90e6f5008e20..1f11e57dcb721012c7b8e50d7e138355685053da 100644
--- a/paddle/fluid/operators/lstmp_op.h
+++ b/paddle/fluid/operators/lstmp_op.h
@@ -405,6 +405,11 @@ class LSTMPGradKernel : public framework::OpKernel {
}
int cur_batch_size = bend - bstart;
+ // lstmp_value.output_value not used in bp, set to null
+ // lstmp_grad.state_active_grad not used in bp, set to null
+ lstmp_value.output_value = nullptr;
+ lstmp_grad.state_active_grad = nullptr;
+
math::LstmUnitGradFunctor::compute(
device_ctx, lstmp_value, lstmp_grad, frame_size, cur_batch_size,
gate_act, cell_act, cand_act);
diff --git a/paddle/fluid/operators/row_conv_op.cc b/paddle/fluid/operators/row_conv_op.cc
index 10b1b0c899d833d70fa6afe51998fe210899e3c3..d283bddbe9f974ac6835ee91d5a7851453687b80 100644
--- a/paddle/fluid/operators/row_conv_op.cc
+++ b/paddle/fluid/operators/row_conv_op.cc
@@ -109,23 +109,23 @@ from future subsequences in a computationally efficient manner to improve
unidirectional recurrent neural networks. The row convolution operator is
different from the 1D sequence convolution, and is computed as follows:
-Given an input sequence $in$ of length $t$ and input dimension $d$,
-and a filter ($W$) of size $context \times d$,
+Given an input sequence $X$ of length $t$ and input dimension $D$,
+and a filter ($W$) of size $context \times D$,
the output sequence is convolved as:
$$
-out_{i, :} = \\sum_{j=i}^{i + context} in_{j,:} \\cdot W_{i-j, :}
+out_{i} = \\sum_{j=i}^{i + context - 1} X_{j} \\cdot W_{j-i}
$$
In the above equation:
* $Out_{i}$: The i-th row of output variable with shape [1, D].
-* $\\tau$: Future context size.
+* $context$: Future context size.
* $X_{j}$: The j-th row of input variable with shape [1, D].
-* $W_{i-j}$: The (i-j)-th row of parameters with shape [1, D].
+* $W_{j-i}$: The (j-i)-th row of parameters with shape [1, D].
For more details about row_conv, please refer to
the design document
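
A small NumPy sketch of the corrected formula, out_i = sum_{j=i}^{i+context-1} X_j * W_{j-i}, with the sum truncated when the future context runs past the end of the sequence (a sketch of the math only, not the operator's actual implementation):

```python
import numpy as np

def row_conv(x, w):
    # x: [t, D] input sequence, w: [context, D] filter.
    t = x.shape[0]
    context = w.shape[0]
    out = np.zeros_like(x)
    for i in range(t):
        # out_i = sum_k x_{i+k} * w_k for k in [0, context), truncated at t.
        for k in range(min(context, t - i)):
            out[i] += x[i + k] * w[k]
    return out

print(row_conv(np.ones((5, 3), dtype=np.float32),
               np.full((2, 3), 0.5, dtype=np.float32)))
```
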
diff --git a/paddle/fluid/platform/enforce.h b/paddle/fluid/platform/enforce.h
index 142d38f0609d963ce3ff45c595b8432b0e5edd21..d32f9c8667d34246e55b039165bec05b7bec0376 100644
--- a/paddle/fluid/platform/enforce.h
+++ b/paddle/fluid/platform/enforce.h
@@ -233,9 +233,11 @@ inline void throw_on_error(ncclResult_t stat, const std::string& msg) {
#endif // __APPLE__ and windows
#endif // PADDLE_WITH_CUDA
-#define PADDLE_THROW(...) \
- throw ::paddle::platform::EnforceNotMet( \
- ::paddle::string::Sprintf(__VA_ARGS__), __FILE__, __LINE__)
+#define PADDLE_THROW(...) \
+ do { \
+ throw ::paddle::platform::EnforceNotMet( \
+ ::paddle::string::Sprintf(__VA_ARGS__), __FILE__, __LINE__); \
+ } while (0)
#define PADDLE_ENFORCE(COND, ...) \
do { \
@@ -270,23 +272,25 @@ inline void throw_on_error(ncclResult_t stat, const std::string& msg) {
* extra messages is also supported, for example:
* PADDLE_ENFORCE(a, b, "some simple enforce failed between %d numbers", 2)
*/
-#define PADDLE_ENFORCE_NOT_NULL(__VAL, ...) \
- do { \
- if (UNLIKELY(nullptr == (__VAL))) { \
- PADDLE_THROW(#__VAL " should not be null\n%s", \
- paddle::string::Sprintf("" __VA_ARGS__)); \
- } \
+#define PADDLE_ENFORCE_NOT_NULL(__VAL, ...) \
+ do { \
+ if (UNLIKELY(nullptr == (__VAL))) { \
+ PADDLE_THROW(#__VAL " should not be null\n%s", \
+ ::paddle::string::Sprintf(__VA_ARGS__)); \
+ } \
} while (0)
#define __PADDLE_BINARY_COMPARE(__VAL0, __VAL1, __CMP, __INV_CMP, ...) \
do { \
- if (UNLIKELY(!((__VAL0)__CMP(__VAL1)))) { \
+ auto __cond1__ = (__VAL0); \
+ auto __cond2__ = (__VAL1); \
+ if (UNLIKELY(!((__cond1__)__CMP(__cond2__)))) { \
PADDLE_THROW("Enforce failed. Expected %s " #__CMP \
" %s, but received %s:%s " #__INV_CMP " %s:%s.\n%s", \
#__VAL0, #__VAL1, #__VAL0, \
- paddle::string::to_string(__VAL0), #__VAL1, \
- paddle::string::to_string(__VAL1), \
- paddle::string::Sprintf("" __VA_ARGS__)); \
+ ::paddle::string::to_string(__cond1__), #__VAL1, \
+ ::paddle::string::to_string(__cond2__), \
+ ::paddle::string::Sprintf(__VA_ARGS__)); \
} \
} while (0)
diff --git a/paddle/fluid/pybind/ir.cc b/paddle/fluid/pybind/ir.cc
index 24059140ab20e24917b93a5f60936b1087797ff9..1cd1be8e8d9da8c6a82ceefc3284084bfeda0252 100644
--- a/paddle/fluid/pybind/ir.cc
+++ b/paddle/fluid/pybind/ir.cc
@@ -13,10 +13,12 @@
// limitations under the License.
#include "paddle/fluid/pybind/ir.h"
+#include
#include
#include
#include
#include "paddle/fluid/framework/ir/graph.h"
+#include "paddle/fluid/framework/ir/graph_helper.h"
#include "paddle/fluid/framework/ir/graph_pattern_detector.h"
#include "paddle/fluid/framework/ir/node.h"
#include "paddle/fluid/framework/op_desc.h"
@@ -27,6 +29,10 @@ namespace py = pybind11;
using paddle::framework::ir::Graph;
using paddle::framework::ir::Node;
using paddle::framework::ir::GraphSafeRemoveNodes;
+using paddle::framework::ir::HasCircle;
+using paddle::framework::ir::GraphNum;
+using paddle::framework::ir::TopologySortOperations;
+using paddle::framework::ir::BuildOperationAdjList;
using paddle::framework::OpDesc;
using paddle::framework::ProgramDesc;
using paddle::framework::VarDesc;
@@ -36,6 +42,12 @@ namespace paddle {
namespace pybind {
void BindGraph(py::module *m) {
m->def("graph_safe_remove_nodes", GraphSafeRemoveNodes);
+ m->def("has_circle", HasCircle);
+ m->def("graph_num", GraphNum);
+ m->def("topology_sort", TopologySortOperations,
+ return_value_policy::reference);
+ m->def("build_adjacency_list", BuildOperationAdjList,
+ return_value_policy::reference);
py::class_>(
*m, "Graph",
"The graph is a Directed Acyclic Single Static Assignment Graph, see "
@@ -46,7 +58,6 @@ void BindGraph(py::module *m) {
.def("get_float", &Graph::Get)
.def("get_double", &Graph::Get)
.def("get_string", &Graph::Get)
- .def("get_program", &Graph::Get)
.def("get_marked_nodes", &Graph::Get>)
.def("set", [](Graph &self, const std::string &attr_name,
int attr) { return self.Set(attr_name, new int(attr)); })
@@ -63,11 +74,6 @@ void BindGraph(py::module *m) {
[](Graph &self, const std::string &attr_name, double attr) {
return self.Set(attr_name, new double(attr));
})
- .def("set",
- [](Graph &self, const std::string &attr_name,
- const ProgramDesc &attr) {
- return self.Set(attr_name, new ProgramDesc(attr));
- })
.def("set",
[](Graph &self, const std::string &attr_name,
const std::unordered_set &attr) {
@@ -108,42 +114,42 @@ void BindNode(py::module *m) {
.def("is_op", &Node::IsOp)
.def("is_var", &Node::IsVar)
.def("is_ctrl_var", &Node::IsCtrlVar)
+ .def("clear_inputs", [](Node &self) { self.inputs.clear(); })
.def("inputs_remove",
[](Node &self, int node_id) {
- for (auto it = self.inputs.begin(); it != self.inputs.end();
- it++) {
- if ((*it)->id() == node_id) {
- self.inputs.erase(it);
- }
+ auto pos = std::find_if(
+ self.inputs.begin(), self.inputs.end(),
+ [&node_id](const Node *n) { return n->id() == node_id; });
+ if (pos != self.inputs.end()) {
+ self.inputs.erase(pos);
}
})
.def("inputs_remove",
[](Node &self, Node &node) {
- for (auto it = self.inputs.begin(); it != self.inputs.end();
- it++) {
- if (*it == &node) {
- self.inputs.erase(it);
- }
+ auto pos =
+ std::find(self.inputs.begin(), self.inputs.end(), &node);
+ if (pos != self.inputs.end()) {
+ self.inputs.erase(pos);
}
})
.def("inputs_append",
[](Node &self, Node &node) { self.inputs.push_back(&node); })
+ .def("clear_outputs", [](Node &self) { self.outputs.clear(); })
.def("outputs_remove",
[](Node &self, int node_id) {
- for (auto it = self.outputs.begin(); it != self.outputs.end();
- it++) {
- if ((*it)->id() == node_id) {
- self.outputs.erase(it);
- }
+ auto pos = std::find_if(
+ self.outputs.begin(), self.outputs.end(),
+ [&node_id](const Node *n) { return n->id() == node_id; });
+ if (pos != self.outputs.end()) {
+ self.outputs.erase(pos);
}
})
.def("outputs_remove",
[](Node &self, Node &node) {
- for (auto it = self.outputs.begin(); it != self.outputs.end();
- it++) {
- if (*it == &node) {
- self.outputs.erase(it);
- }
+ auto pos =
+ std::find(self.outputs.begin(), self.outputs.end(), &node);
+ if (pos != self.outputs.end()) {
+ self.outputs.erase(pos);
}
})
.def("outputs_append",
diff --git a/paddle/fluid/pybind/pybind.cc b/paddle/fluid/pybind/pybind.cc
index 351513712cc4297bf7fbe67878aeba162ef66e4d..a4a01ad647b038bd2bfea00fefa30abb19f58b66 100644
--- a/paddle/fluid/pybind/pybind.cc
+++ b/paddle/fluid/pybind/pybind.cc
@@ -829,8 +829,7 @@ All parameter, weight, gradient are variables in Paddle.
m.def("disable_profiler", platform::DisableProfiler);
m.def("is_profiler_enabled", platform::IsProfileEnabled);
m.def("reset_profiler", platform::ResetProfiler);
- m.def("get_pass", [](const py::bytes &binary_str) {
- std::string pass_type(binary_str);
+ m.def("get_pass", [](const std::string &pass_type) {
auto pass = framework::ir::PassRegistry::Instance().Get(pass_type);
return std::shared_ptr(std::move(pass));
});
@@ -838,10 +837,9 @@ All parameter, weight, gradient are variables in Paddle.
py::class_> pass(m, "Pass");
pass.def(py::init())
.def("has", &ir::Pass::Has)
- .def("set",
- [](ir::Pass &self, const std::string &attr_name,
- const ProgramDesc &attr) {
- return self.Set(attr_name, new ProgramDesc(attr));
+ .def("set_not_owned",
+ [](ir::Pass &self, const std::string &attr_name, ProgramDesc &attr) {
+ self.SetNotOwned(attr_name, &attr);
})
.def(
"set",
@@ -850,7 +848,6 @@ All parameter, weight, gradient are variables in Paddle.
})
.def("set", [](ir::Pass &self, const std::string &name,
int val) { self.Set(name, new int(val)); })
- .def("get_program", &ir::Pass::Get)
.def("type", &ir::Pass::Type)
.def("apply", [](ir::Pass &self, std::shared_ptr graph) {
std::unique_ptr origin_graph(graph.get());
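
With these bindings a registered IR pass can be fetched by name as a plain `str`, and a `ProgramDesc` can be attached without copying it. A hedged usage sketch; the pass name below is an assumption (any pass registered in `PassRegistry` would do), not something prescribed by this patch:

```python
import paddle.fluid as fluid
from paddle.fluid import core

main = fluid.Program()
# 'graph_viz_pass' is assumed to be registered; substitute any registered pass.
ir_pass = core.get_pass('graph_viz_pass')
print(ir_pass.type())
# The pass now only keeps a reference to the program instead of a copy:
ir_pass.set_not_owned('program', main.desc)
```
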
diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt
index 90b8fd1a0aab159eb1a829d67485c845182d295b..bcc997ff4511db45d2a775092c0798d7c1e9be06 100644
--- a/python/CMakeLists.txt
+++ b/python/CMakeLists.txt
@@ -64,6 +64,7 @@ if (WITH_TESTING)
add_subdirectory(paddle/dataset/tests)
add_subdirectory(paddle/fluid/tests)
add_subdirectory(paddle/fluid/contrib/tests)
+ add_subdirectory(paddle/fluid/contrib/slim/tests)
endif()
install(DIRECTORY ${PADDLE_PYTHON_PACKAGE_DIR}
DESTINATION opt/paddle/share/wheels
diff --git a/python/paddle/fluid/contrib/int8_inference/README.md b/python/paddle/fluid/contrib/int8_inference/README.md
index a9691dad4494f5eacf427b2806b2393baa57dc1e..460ae393f158ae320c93601365a68b8cfe2ba50e 100644
--- a/python/paddle/fluid/contrib/int8_inference/README.md
+++ b/python/paddle/fluid/contrib/int8_inference/README.md
@@ -63,10 +63,10 @@ Notes:
## 4. How to reproduce the results
* Small dataset
```bash
-python python/paddle/fluid/contrib/tests/test_calibration.py
+FLAGS_use_mkldnn=true python python/paddle/fluid/contrib/tests/test_calibration.py
```
* Full dataset
```bash
-DATASET=full python python/paddle/fluid/contrib/tests/test_calibration.py
+FLAGS_use_mkldnn=true DATASET=full python python/paddle/fluid/contrib/tests/test_calibration.py
```
diff --git a/python/paddle/fluid/contrib/slim/quantization/quantization_pass.py b/python/paddle/fluid/contrib/slim/quantization/quantization_pass.py
index 266a106bc507104c0a8db1c882b55ac59e88195e..18b58e6f388bbe9495333b12f32d63b74fddcb3a 100644
--- a/python/paddle/fluid/contrib/slim/quantization/quantization_pass.py
+++ b/python/paddle/fluid/contrib/slim/quantization/quantization_pass.py
@@ -13,14 +13,19 @@
# limitations under the License.
import collections
+import numpy as np
+import six
+from ..... import compat as cpt
from .... import core
from ....framework import IrGraph
from ....framework import Program
-from ....framework import Variable
from ....initializer import Constant
from .... import unique_name
-__all__ = ['QuantizationTransformPass']
+__all__ = [
+ 'QuantizationTransformPass', 'QuantizationFreezePass', 'ConvertToInt8Pass',
+ 'TransformForMobilePass'
+]
class QuantizationTransformPass(object):
@@ -35,7 +40,13 @@ class QuantizationTransformPass(object):
"""
Convert and rewrite the IrGraph according to weight and
activation quantization type.
+
Args:
+ scope(fluid.Scope): When the activation uses 'range_abs_max' as the quantize
+ type, this pass will create some new parameters. The scope is used to
+ initialize these new parameters.
+ program_exe(fluid.Executor): program_exe is used to initialize the new
+ parameters described above.
weight_bits (int): quantization bit number for weights,
the bias is not quantized.
activation_bits (int): quantization bit number for activation.
@@ -49,6 +60,7 @@ class QuantizationTransformPass(object):
support 'abs_max'. The 'range_abs_max' usually is not used for
weight, since weights are fixed once the model is well trained.
window_size (int): the window size for 'range_abs_max' quantization.
+
Examples:
.. code-block:: python
# The original graph will be rewrite.
@@ -88,31 +100,35 @@ class QuantizationTransformPass(object):
self._quantizable_grad_ops = [
'%s_grad' % (op) for op in self._quantizable_ops
]
- self._fake_quant_op_types = [
- 'fake_quantize_abs_max', 'fake_quantize_range_abs_max'
- ]
- self._fake_dequant_op_types = ['fake_dequantize_max_abs']
self._is_test = None
self._global_step = None
def apply(self, graph):
+ """
+ Quantize the graph for the training process. According to the weight and
+ activation quantization type, fake quantize and fake dequantize operators
+ will be inserted into the graph.
+
+ Args:
+ graph(IrGraph): the applied graph.
+ """
assert isinstance(graph,
IrGraph), 'graph must be the instance of IrGraph.'
self._need_initialized.clear()
self._is_test = graph.is_test()
# marked the variable which has been dequantized.
dequantized_vars = collections.OrderedDict()
- params = [p.name() for p in graph.all_parameters()]
+ persistable_vars = [p.name() for p in graph.all_persistable_vars()]
def _transform_forward(graph, op):
for var_node in op.inputs:
if var_node.name() in dequantized_vars:
dequant_var_node = dequantized_vars[var_node.name()]
else:
- quant_bits = self._weight_bits if var_node.name() in params \
+ quant_bits = self._weight_bits if var_node.name() in persistable_vars \
else self._activation_bits
quant_type = self._weight_quantize_type if var_node.name() \
- in params else self._activation_quantize_type
+ in persistable_vars else self._activation_quantize_type
quant_var_node, scale_var_node = self._insert_quant_op(
graph, var_node, quant_bits, quant_type)
dequant_var_node = self._insert_dequant_op(
@@ -150,9 +166,14 @@ class QuantizationTransformPass(object):
assert self._program_exe is not None, \
'The program_exe cannot be set None when activation_quantize_type equals to range_abs_max.'
init_program = Program()
- for var_desc, initializer in self._need_initialized.iteritems():
- var = Variable(init_program.global_block())
- var._set_desc(var_desc)
+ for var_desc, initializer in six.iteritems(self._need_initialized):
+ var = init_program.global_block().create_var(
+ name=var_desc.name(),
+ shape=var_desc.shape(),
+ dtype=var_desc.dtype(),
+ type=var_desc.type(),
+ lod_level=var_desc.lod_level(),
+ persistable=var_desc.persistable())
initializer(var, init_program.global_block())
self._program_exe.run(program=init_program, scope=self._scope)
@@ -161,7 +182,7 @@ class QuantizationTransformPass(object):
def _create_global_step(self, graph):
if self._weight_quantize_type == 'range_abs_max' or \
self._activation_quantize_type == 'range_abs_max':
- counter_name = '@STEP_COUNTER@'
+ counter_name = cpt.to_text('@STEP_COUNTER@')
for node in graph.all_vars():
if node.name() == counter_name:
self._global_step = node
@@ -175,9 +196,14 @@ class QuantizationTransformPass(object):
Constant(value=0, force_cpu=True)
global_step_out = graph.create_var_node_from_desc(
global_step_in.var())
+ # The attribute of `op_role` is needed by ParallelExecutor.
increment_op = graph.create_op_node(
op_type='increment',
- attrs={'step': 1.0},
+ attrs={
+ 'step': 1.0,
+ 'op_role':
+ core.op_proto_and_checker_maker.OpRole.Forward
+ },
inputs={'X': global_step_in},
outputs={'Out': global_step_out})
graph.link_to(global_step_in, increment_op)
@@ -212,7 +238,10 @@ class QuantizationTransformPass(object):
var_dtype=var_node.var().dtype())
quant_op_node = graph.create_op_node(
op_type='fake_quantize_abs_max',
- attrs={'bit_length': quant_bits},
+ attrs={
+ 'bit_length': quant_bits,
+ 'op_role': core.op_proto_and_checker_maker.OpRole.Forward
+ },
inputs={'X': var_node},
outputs={'Out': quant_var_node,
'OutScale': scale_var_node})
@@ -257,7 +286,8 @@ class QuantizationTransformPass(object):
attrs = {
'window_size': self._window_size,
'bit_length': quant_bits,
- 'is_test': self._is_test
+ 'is_test': self._is_test,
+ 'op_role': core.op_proto_and_checker_maker.OpRole.Forward
}
quant_op_node = graph.create_op_node(
op_type='fake_quantize_range_abs_max',
@@ -290,7 +320,10 @@ class QuantizationTransformPass(object):
max_range = (1 << (quant_bits - 1)) - 1
dequant_op_node = graph.create_op_node(
op_type='fake_dequantize_max_abs',
- attrs={'max_range': float(max_range)},
+ attrs={
+ 'max_range': float(max_range),
+ 'op_role': core.op_proto_and_checker_maker.OpRole.Forward
+ },
inputs={'X': var_node,
'Scale': scale_var_node},
outputs={'Out': dequant_var_node})
@@ -316,3 +349,330 @@ class QuantizationTransformPass(object):
Return the scale name of quantized variable for the input `var_name`.
"""
return "%s.scale" % (var_name)
+
+
+class QuantizationFreezePass(object):
+ """
+ The freeze pass is used to adjust the quantize operator order, for example:
+ 1) `activation -> quant -> dequant -> conv2d` will be frozen into
+ `activation -> quant -> conv2d -> dequant`
+ 2) `weight -> quant -> dequant -> conv2d` will be frozen into `weight -> conv2d`,
+ and the weight will be scaled offline.
+
+ Args:
+ scope(fluid.Scope): scope is used to get the weight tensor values.
+ place(fluid.CPUPlace|fluid.CUDAPlace): place is used to restore the weight tensors.
+ weight_bits (int): quantization bit number for weights.
+ activation_bits (int): quantization bit number for activation.
+ weight_quantize_type (str): quantization type for weights; supports 'abs_max'.
+ 'range_abs_max' is usually not used for weights, since weights are fixed once the
+ model is well trained.
+ """
+
+ def __init__(self,
+ scope,
+ place,
+ weight_bits=8,
+ activation_bits=8,
+ weight_quantize_type='abs_max'):
+ assert scope is not None, \
+ 'The scope cannot be set None.'
+ assert place is not None, \
+ 'The place cannot be set None.'
+ self._scope = scope
+ self._place = place
+ self._weight_bits = weight_bits
+ self._activation_bits = activation_bits
+ self._weight_quantize_type = weight_quantize_type
+ self._quantizable_ops = ['conv2d', 'depthwise_conv2d', 'mul']
+ self._fake_quant_op_names = [
+ 'fake_quantize_abs_max', 'fake_quantize_range_abs_max'
+ ]
+ self._fake_dequant_op_names = ['fake_dequantize_max_abs']
+ self._op_input_rename_map = collections.OrderedDict()
+ self._op_output_rename_map = collections.OrderedDict()
+ self._var_scale_map = collections.OrderedDict()
+
+ def apply(self, graph):
+ """
+ Adjust the order of the quantize/dequantize operators for the inference process.
+
+ Args:
+ graph(IrGraph): the applied graph.
+ """
+ persistable_vars = [p.name() for p in graph.all_persistable_vars()]
+ ops = graph.all_ops()
+ for op_node in ops:
+ op_name = op_node.name()
+ if op_name in self._fake_quant_op_names:
+ input_arg_name = op_node.op().input('X')[0]
+ if input_arg_name in persistable_vars:
+ if self._weight_quantize_type == 'abs_max':
+ param = self._load_var(input_arg_name)
+ scale_v = np.max(np.abs(param))
+ else:
+ scale_v = self._load_var(op_node.op().output('OutScale')
+ [0])[0]
+ self._var_scale_map[input_arg_name] = scale_v
+ else:
+ scale_v = graph.var_node(op_node.op().output('OutScale')[0])
+ self._var_scale_map[input_arg_name] = scale_v
+ if input_arg_name in persistable_vars:
+ self._remove_fake_quant_and_dequant_op(graph, op_node)
+ # quantize weight and restore
+ param_v = self._load_var(input_arg_name)
+ quantized_param_v = self._quant(param_v, scale_v,
+ self._weight_bits)
+ self._restore_var(input_arg_name, quantized_param_v)
+
+ ops = graph.all_ops()
+ for op_node in ops:
+ op_name = op_node.name()
+ if op_name in self._fake_dequant_op_names:
+ self._remove_fake_quant_and_dequant_op(graph, op_node)
+
+ ops = graph.all_ops()
+ for op_node in ops:
+ op_name = op_node.name()
+ if op_name in self._quantizable_ops:
+ self._insert_post_dequant_op(graph, op_node)
+
+ for op_node in ops:
+ # a dequant_op was inserted after fc/conv, so rename the inputs of the ops that follow
+ for var_node in op_node.inputs:
+ name = var_node.name()
+ if name in self._op_output_rename_map:
+ old_in = graph.var_node(name)
+ new_in = self._op_output_rename_map[name]
+ graph.update_input_link(old_in, new_in, op_node)
+
+ # remove the unused var node in the graph
+ self._remove_unused_var_nodes(graph)
+ return graph
+
+ def _remove_fake_quant_and_dequant_op(self, graph, op_node):
+ k = op_node.op().output('Out')[0]
+ v = op_node.op().input('X')[0]
+ if v not in self._op_input_rename_map:
+ self._op_input_rename_map[k] = v
+ else:
+ self._op_input_rename_map[k] = self._op_input_rename_map[v]
+ graph.safe_remove_nodes(op_node)
+
+ def _insert_post_dequant_op(self, graph, op_node):
+ max_range = None
+ scale_var_node = None
+ persistable_vars = [p.name() for p in graph.all_persistable_vars()]
+ for var_node in op_node.inputs:
+ name = var_node.name()
+ if name in self._op_input_rename_map:
+ old_in = graph.var_node(name)
+ new_in = graph.var_node(self._op_input_rename_map[name])
+ new_in.clear_outputs()
+ graph.update_input_link(old_in, new_in, op_node)
+ original_var_name = self._original_var_name(name)
+ scale_v = self._var_scale_map[original_var_name]
+ if original_var_name in persistable_vars:
+ param_range = (1 << (self._weight_bits - 1)) - 1
+ act_range = (1 << (self._activation_bits - 1)) - 1
+ assert self._is_float(
+ scale_v), 'The scale of parameter %s is not a float.' % (
+ original_var_name)
+ max_range = param_range * act_range / scale_v
+ else:
+ assert isinstance(scale_v, core.Node)
+ scale_var_node = self._var_scale_map[original_var_name]
+
+ if len(op_node.outputs) != 1:
+ raise ValueError("Only support one output, but op %s has"
+ " more than one output." % (op_node.name()))
+
+ output_var_node = op_node.outputs[0]
+ dequant_var_node = graph.create_var_node(
+ name=self._dequantized_var_name(output_var_node.name()),
+ var_type=output_var_node.var().type(),
+ shape=output_var_node.var().shape(),
+ var_dtype=output_var_node.var().dtype())
+ dequant_op_node = graph.create_op_node(
+ op_type='fake_dequantize_max_abs',
+ attrs={
+ 'max_range': float(max_range),
+ 'op_role': core.op_proto_and_checker_maker.OpRole.Forward
+ },
+ inputs={'X': output_var_node,
+ 'Scale': scale_var_node},
+ outputs={'Out': dequant_var_node})
+ graph.link_to(output_var_node, dequant_op_node)
+ graph.link_to(scale_var_node, dequant_op_node)
+ graph.link_to(dequant_op_node, dequant_var_node)
+ self._op_output_rename_map[output_var_node.name()] = dequant_var_node
+ return dequant_var_node
+
+ def _load_var(self, name):
+ return np.array(self._scope.find_var(name).get_tensor())
+
+ def _restore_var(self, name, array):
+ tensor = self._scope.find_var(name).get_tensor()
+ tensor.set(array, self._place)
+
+ def _remove_unused_var_nodes(self, graph):
+ all_used_vars = set()
+ ops = graph.all_ops()
+ for op_node in ops:
+ for input_node in op_node.inputs:
+ all_used_vars.add(input_node)
+ for output_node in op_node.outputs:
+ all_used_vars.add(output_node)
+
+ all_unused_vars = graph.all_vars() - all_used_vars
+ graph.safe_remove_nodes(all_unused_vars)
+
+ def _original_var_name(self, var_name):
+ """
+ Return the original variable name.
+ """
+ if var_name.endswith('.quantized.dequantized'):
+ return var_name[:-len('.quantized.dequantized')]
+ if var_name.endswith('.quantized'):
+ return var_name[:-len('.quantized')]
+ if var_name.endswith('.dequantized'):
+ return var_name[:-len('.dequantized')]
+ if var_name.endswith('.scale'):
+ return var_name[:-len('.scale')]
+ else:
+ return var_name
+
+ def _dequantized_var_name(self, var_name):
+ """
+ Return dequantized variable name for the input `var_name`.
+ """
+ return "%s.dequantized" % (var_name)
+
+ def _is_float(self, v):
+ return isinstance(v, float) or isinstance(v, np.float32) \
+ or isinstance(v, np.float64)
+
+ def _quant(self, x, scale, num_bits):
+ return np.round(x / scale * ((1 << (num_bits - 1)) - 1))
+
+
+class ConvertToInt8Pass(object):
+ """
+ Convert the weights into int8_t type.
+
+ Args:
+ scope(fluid.Scope): scope is used to get the weight tensor values.
+ place(fluid.CPUPlace|fluid.CUDAPlace): place is used to restore the
+ 8bits weight tensors.
+ """
+
+ def __init__(self, scope, place):
+ assert scope is not None, \
+ 'The scope cannot be set None.'
+ assert place is not None, \
+ 'The place cannot be set None.'
+ self._scope = scope
+ self._place = place
+ self._quantizable_ops = ['conv2d', 'depthwise_conv2d', 'mul']
+
+ def apply(self, graph):
+ """
+ Convert the data type of the graph's weights. After that, the weights of
+ the graph are of int8_t type.
+
+ Args:
+ graph(IrGraph): the applied graph.
+ """
+ persistable_vars = [p.name() for p in graph.all_persistable_vars()]
+ ops = graph.all_ops()
+ input_map = {}
+ for op_node in ops:
+ op_name = op_node.name()
+ if op_name in self._quantizable_ops:
+ for var_node in op_node.inputs:
+ name = var_node.name()
+ if name in persistable_vars:
+ if name not in input_map:
+ int8_var_node = self._convert_to_int8(graph,
+ var_node)
+ input_map[name] = int8_var_node
+ graph.update_input_link(var_node, input_map[name],
+ op_node)
+
+ # remove the unused var node in the graph
+ self._remove_unused_var_nodes(graph)
+ return graph
+
+ def _convert_to_int8(self, graph, var_node):
+ int8_var_node_name = var_node.name() + ".int8"
+ int8_var_node = graph.create_param_node(
+ name=cpt.to_text(int8_var_node_name),
+ var_type=var_node.var().type(),
+ shape=var_node.var().shape(),
+ var_dtype=core.VarDesc.VarType.INT8)
+ array = self._load_var(var_node.name())
+ self._scope.var(int8_var_node_name)
+ self._store_var(int8_var_node_name, array, np.int8)
+ return int8_var_node
+
+ def _load_var(self, name):
+ return np.array(self._scope.find_var(name).get_tensor())
+
+ def _store_var(self, name, array, dtype):
+ tensor = self._scope.find_var(name).get_tensor()
+ tensor.set(array.astype(dtype), self._place)
+
+ def _remove_unused_var_nodes(self, graph):
+ all_used_vars = set()
+ ops = graph.all_ops()
+ for op_node in ops:
+ for input_node in op_node.inputs:
+ all_used_vars.add(input_node)
+ for output_node in op_node.outputs:
+ all_used_vars.add(output_node)
+
+ all_unused_vars = graph.all_vars() - all_used_vars
+ graph.safe_remove_nodes(all_unused_vars)
+
+
+class TransformForMobilePass(object):
+ """
+ This pass is used to convert the frozen graph for paddle-mobile execution.
+ """
+
+ def __init__(self):
+ self._fake_quant_op_names = [
+ 'fake_quantize_abs_max', 'fake_quantize_range_abs_max'
+ ]
+ self._fake_dequant_op_names = ['fake_dequantize_max_abs']
+
+ def apply(self, graph):
+ """
+ Because paddle-mobile uses `quantize` and `dequantize` as the names of the
+ quantize and dequantize operators, the `apply` function simply renames
+ these operators accordingly.
+
+ Args:
+ graph(IrGraph): the graph to be transformed.
+ """
+ ops = graph.all_ops()
+ for op_node in ops:
+ name = op_node.name()
+ if name in self._fake_quant_op_names:
+ op_node.op().set_type('quantize')
+ quant_node = graph.create_op_node_from_desc(op_node.op())
+ for input_node in op_node.inputs:
+ graph.link_to(input_node, quant_node)
+ for output_node in op_node.outputs:
+ graph.link_to(quant_node, output_node)
+ graph.safe_remove_nodes(op_node)
+ if name in self._fake_dequant_op_names:
+ op_node.op().set_type('dequantize')
+ dequant_node = graph.create_op_node_from_desc(op_node.op())
+ for input_node in op_node.inputs:
+ graph.link_to(input_node, dequant_node)
+ for output_node in op_node.outputs:
+ graph.link_to(dequant_node, output_node)
+ graph.safe_remove_nodes(op_node)
+
+ return graph
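
For reference, the weight handling done by `QuantizationFreezePass._quant` plus `ConvertToInt8Pass` boils down to an abs_max scale, rounding into the signed range, and an int8 cast. A minimal NumPy sketch under that reading of the code (the weight array is made up):

```python
import numpy as np

def quant_abs_max(w, num_bits=8):
    # abs_max scale, then round into the signed num_bits range,
    # mirroring _quant(): round(x / scale * ((1 << (num_bits - 1)) - 1)).
    scale = np.max(np.abs(w))
    q = np.round(w / scale * ((1 << (num_bits - 1)) - 1))
    return q, scale

w = np.random.uniform(-1.0, 1.0, size=(4, 4)).astype(np.float32)
q, scale = quant_abs_max(w)
w_int8 = q.astype(np.int8)  # what ConvertToInt8Pass stores back into the scope
print(w_int8.dtype, scale)
```
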
diff --git a/python/paddle/fluid/contrib/slim/tests/CMakeLists.txt b/python/paddle/fluid/contrib/slim/tests/CMakeLists.txt
new file mode 100644
index 0000000000000000000000000000000000000000..79bec8c4ad34d682895250bc29b1fddb3a569bd4
--- /dev/null
+++ b/python/paddle/fluid/contrib/slim/tests/CMakeLists.txt
@@ -0,0 +1,6 @@
+file(GLOB TEST_OPS RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}" "test_*.py")
+string(REPLACE ".py" "" TEST_OPS "${TEST_OPS}")
+
+foreach(src ${TEST_OPS})
+ py_test(${src} SRCS ${src}.py)
+endforeach()
diff --git a/python/paddle/fluid/contrib/slim/unitest/__init__.py b/python/paddle/fluid/contrib/slim/tests/__init__.py
similarity index 100%
rename from python/paddle/fluid/contrib/slim/unitest/__init__.py
rename to python/paddle/fluid/contrib/slim/tests/__init__.py
diff --git a/python/paddle/fluid/contrib/slim/unitest/configs/config.yaml b/python/paddle/fluid/contrib/slim/tests/configs/config.yaml
similarity index 88%
rename from python/paddle/fluid/contrib/slim/unitest/configs/config.yaml
rename to python/paddle/fluid/contrib/slim/tests/configs/config.yaml
index db488b96330210df15b02b19d90abd5c9101f844..d9b49029d3e34d487ad65fe0f7e54e2cee1d5838 100644
--- a/python/paddle/fluid/contrib/slim/unitest/configs/config.yaml
+++ b/python/paddle/fluid/contrib/slim/tests/configs/config.yaml
@@ -1,5 +1,5 @@
version: 1.0
-include: ["./unitest/configs/pruners.yaml", "./unitest/configs/pruners_0.yaml"]
+include: ["./configs/pruners.yaml", "./configs/pruners_0.yaml"]
pruners:
pruner_1:
class: 'RatioPruner'
diff --git a/python/paddle/fluid/contrib/slim/unitest/configs/pruners.yaml b/python/paddle/fluid/contrib/slim/tests/configs/pruners.yaml
similarity index 100%
rename from python/paddle/fluid/contrib/slim/unitest/configs/pruners.yaml
rename to python/paddle/fluid/contrib/slim/tests/configs/pruners.yaml
diff --git a/python/paddle/fluid/contrib/slim/unitest/configs/pruners_0.yaml b/python/paddle/fluid/contrib/slim/tests/configs/pruners_0.yaml
similarity index 100%
rename from python/paddle/fluid/contrib/slim/unitest/configs/pruners_0.yaml
rename to python/paddle/fluid/contrib/slim/tests/configs/pruners_0.yaml
diff --git a/python/paddle/fluid/contrib/slim/unitest/test_factory.py b/python/paddle/fluid/contrib/slim/tests/test_factory.py
similarity index 95%
rename from python/paddle/fluid/contrib/slim/unitest/test_factory.py
rename to python/paddle/fluid/contrib/slim/tests/test_factory.py
index 07f28aac905d1a2813dbde6143235c7916fd9278..2fc72b6475e6bdd977dafb57696046a1100d0087 100644
--- a/python/paddle/fluid/contrib/slim/unitest/test_factory.py
+++ b/python/paddle/fluid/contrib/slim/tests/test_factory.py
@@ -18,7 +18,7 @@ import unittest
class TestFactory(unittest.TestCase):
def test_parse(self):
- factory = ConfigFactory('./unitest/configs/config.yaml')
+ factory = ConfigFactory('./configs/config.yaml')
pruner = factory.instance('pruner_1')
self.assertEquals(pruner.ratios['conv1_1.w'], 0.3)
diff --git a/python/paddle/fluid/contrib/slim/tests/test_graph.py b/python/paddle/fluid/contrib/slim/tests/test_graph.py
new file mode 100644
index 0000000000000000000000000000000000000000..75e0c95b5c3cc06d66eab9de0b85e5d7ed110837
--- /dev/null
+++ b/python/paddle/fluid/contrib/slim/tests/test_graph.py
@@ -0,0 +1,80 @@
+# copyright (c) 2018 paddlepaddle authors. all rights reserved.
+#
+# licensed under the apache license, version 2.0 (the "license");
+# you may not use this file except in compliance with the license.
+# you may obtain a copy of the license at
+#
+# http://www.apache.org/licenses/license-2.0
+#
+# unless required by applicable law or agreed to in writing, software
+# distributed under the license is distributed on an "as is" basis,
+# without warranties or conditions of any kind, either express or implied.
+# see the license for the specific language governing permissions and
+# limitations under the license.
+
+from __future__ import print_function
+import unittest
+import paddle.fluid as fluid
+import six
+from paddle.fluid.framework import IrGraph
+from paddle.fluid import core
+
+
+def residual_block(num):
+ def conv_bn_layer(input,
+ ch_out,
+ filter_size,
+ stride,
+ padding,
+ act='relu',
+ bias_attr=False):
+ tmp = fluid.layers.conv2d(
+ input=input,
+ filter_size=filter_size,
+ num_filters=ch_out,
+ stride=stride,
+ padding=padding,
+ act=None,
+ bias_attr=bias_attr)
+ return fluid.layers.batch_norm(input=tmp, act=act)
+
+ data = fluid.layers.data(name='image', shape=[1, 32, 32], dtype='float32')
+ label = fluid.layers.data(name='label', shape=[1], dtype='int64')
+ hidden = data
+ for _ in six.moves.xrange(num):
+ conv = conv_bn_layer(hidden, 16, 3, 1, 1, act=None, bias_attr=True)
+ short = conv_bn_layer(hidden, 16, 1, 1, 0, act=None)
+ hidden = fluid.layers.elementwise_add(x=conv, y=short, act='relu')
+ fc = fluid.layers.fc(input=hidden, size=10)
+ loss = fluid.layers.cross_entropy(input=fc, label=label)
+ loss = fluid.layers.mean(loss)
+ return loss
+
+
+class TestGraph(unittest.TestCase):
+ def test_graph_functions(self):
+ main = fluid.Program()
+ startup = fluid.Program()
+ with fluid.program_guard(main, startup):
+ loss = residual_block(2)
+ opt = fluid.optimizer.Adam(learning_rate=0.001)
+ opt.minimize(loss)
+ graph = IrGraph(core.Graph(main.desc), for_test=False)
+ marked_nodes = set()
+ for op in graph.all_ops():
+ if op.name().find('conv2d') > -1:
+ marked_nodes.add(op)
+ graph.draw('.', 'residual', marked_nodes)
+ self.assertFalse(graph.has_circle())
+ self.assertEqual(graph.graph_num(), 1)
+ nodes = graph.topology_sort()
+ self.assertEqual(len(nodes), len(graph.all_ops()))
+ nodes_map = graph.build_adjacency_list()
+ self.assertEqual(len(nodes_map), len(graph.all_ops()))
+ nodes_num = len(graph.all_nodes())
+ graph.safe_remove_nodes(marked_nodes)
+ self.assertEqual(len(graph.all_nodes()), nodes_num - len(marked_nodes))
+
+
+if __name__ == '__main__':
+ unittest.main()
diff --git a/python/paddle/fluid/contrib/slim/tests/test_quantization_pass.py b/python/paddle/fluid/contrib/slim/tests/test_quantization_pass.py
new file mode 100644
index 0000000000000000000000000000000000000000..2f291132f3049af21420f863972792c1a862b9ad
--- /dev/null
+++ b/python/paddle/fluid/contrib/slim/tests/test_quantization_pass.py
@@ -0,0 +1,372 @@
+# copyright (c) 2018 paddlepaddle authors. all rights reserved.
+#
+# licensed under the apache license, version 2.0 (the "license");
+# you may not use this file except in compliance with the license.
+# you may obtain a copy of the license at
+#
+# http://www.apache.org/licenses/license-2.0
+#
+# unless required by applicable law or agreed to in writing, software
+# distributed under the license is distributed on an "as is" basis,
+# without warranties or conditions of any kind, either express or implied.
+# see the license for the specific language governing permissions and
+# limitations under the license.
+
+import unittest
+import random
+import numpy as np
+import paddle.fluid as fluid
+import six
+import paddle
+from paddle.fluid.framework import IrGraph
+from paddle.fluid.contrib.slim.quantization import QuantizationTransformPass
+from paddle.fluid.contrib.slim.quantization import QuantizationFreezePass
+from paddle.fluid.contrib.slim.quantization import ConvertToInt8Pass
+from paddle.fluid.contrib.slim.quantization import TransformForMobilePass
+from paddle.fluid import core
+
+
+def linear_fc(num):
+ data = fluid.layers.data(name='image', shape=[1, 32, 32], dtype='float32')
+ label = fluid.layers.data(name='label', shape=[1], dtype='int64')
+ hidden = data
+ for _ in six.moves.xrange(num):
+ hidden = fluid.layers.fc(hidden, size=128, act='relu')
+ loss = fluid.layers.cross_entropy(input=hidden, label=label)
+ loss = fluid.layers.mean(loss)
+ return loss
+
+
+def residual_block(num):
+ def conv_bn_layer(input,
+ ch_out,
+ filter_size,
+ stride,
+ padding,
+ act='relu',
+ bias_attr=False):
+ tmp = fluid.layers.conv2d(
+ input=input,
+ filter_size=filter_size,
+ num_filters=ch_out,
+ stride=stride,
+ padding=padding,
+ act=None,
+ bias_attr=bias_attr)
+ return fluid.layers.batch_norm(input=tmp, act=act)
+
+ data = fluid.layers.data(name='image', shape=[1, 32, 32], dtype='float32')
+ label = fluid.layers.data(name='label', shape=[1], dtype='int64')
+ hidden = data
+ for _ in six.moves.xrange(num):
+ conv = conv_bn_layer(hidden, 16, 3, 1, 1, act=None, bias_attr=True)
+ short = conv_bn_layer(hidden, 16, 1, 1, 0, act=None)
+ hidden = fluid.layers.elementwise_add(x=conv, y=short, act='relu')
+ fc = fluid.layers.fc(input=hidden, size=10)
+ loss = fluid.layers.cross_entropy(input=fc, label=label)
+ loss = fluid.layers.mean(loss)
+ return loss
+
+
+def conv_net(img, label):
+ conv_pool_1 = fluid.nets.simple_img_conv_pool(
+ input=img,
+ filter_size=5,
+ num_filters=20,
+ pool_size=2,
+ pool_stride=2,
+ act="relu")
+ conv_pool_1 = fluid.layers.batch_norm(conv_pool_1)
+ conv_pool_2 = fluid.nets.simple_img_conv_pool(
+ input=conv_pool_1,
+ filter_size=5,
+ num_filters=50,
+ pool_size=2,
+ pool_stride=2,
+ act="relu")
+ prediction = fluid.layers.fc(input=conv_pool_2, size=10, act='softmax')
+ loss = fluid.layers.cross_entropy(input=prediction, label=label)
+ avg_loss = fluid.layers.mean(loss)
+ return avg_loss
+
+
+class TestQuantizationTransformPass(unittest.TestCase):
+ def setUp(self):
+ self.quantizable_op_and_inputs = {
+ 'conv2d': ['Input', 'Filter'],
+ 'depthwise_conv2d': ['Input', 'Filter'],
+ 'mul': ['X', 'Y']
+ }
+ self.quantizable_grad_op_inputs = {
+ 'conv2d_grad': ['Input', 'Filter'],
+ 'depthwise_conv2d_grad': ['Input', 'Filter'],
+ 'mul_grad': ['X', 'Y']
+ }
+
+ def check_program(self, transform_pass, program):
+ quantized_ops = set()
+ for block in program.blocks:
+ for op in block.ops:
+ # check forward
+ if op.type in self.quantizable_op_and_inputs:
+ for arg_name in op.input_arg_names:
+ self.assertTrue(
+ arg_name.endswith('.quantized.dequantized'))
+ quantized_ops.add(arg_name)
+
+ for op in block.ops:
+ # check backward
+ if op.type in self.quantizable_grad_op_inputs:
+ for pname in self.quantizable_grad_op_inputs[op.type]:
+ arg_name = op.input(pname)[0]
+ self.assertTrue(
+ arg_name.endswith('.quantized.dequantized'))
+ self.assertTrue(arg_name in quantized_ops)
+
+ def linear_fc_quant(self, quant_type):
+ main = fluid.Program()
+ startup = fluid.Program()
+ with fluid.program_guard(main, startup):
+ loss = linear_fc(3)
+ opt = fluid.optimizer.Adam(learning_rate=0.001)
+ opt.minimize(loss)
+ exe = fluid.Executor(fluid.CPUPlace())
+ graph = IrGraph(core.Graph(main.desc), for_test=False)
+ transform_pass = QuantizationTransformPass(
+ scope=fluid.global_scope(),
+ program_exe=exe,
+ activation_quantize_type=quant_type)
+ transform_pass.apply(graph)
+ marked_nodes = set()
+ for op in graph.all_ops():
+ if op.name().find('quantize') > -1:
+ marked_nodes.add(op)
+ graph.draw('.', 'quantize_fc_' + quant_type, marked_nodes)
+ program = graph.to_program()
+ self.check_program(transform_pass, program)
+ val_graph = IrGraph(core.Graph(program.desc), for_test=False)
+ val_marked_nodes = set()
+ for op in val_graph.all_ops():
+ if op.name().find('quantize') > -1:
+ val_marked_nodes.add(op)
+ val_graph.draw('.', 'val_fc_' + quant_type, val_marked_nodes)
+
+ def test_linear_fc_quant_abs_max(self):
+ self.act_quant_op_type = 'fake_quantize_abs_max'
+ self.linear_fc_quant('abs_max')
+
+ def test_linear_fc_quant_range_abs_max(self):
+ self.act_quant_op_type = 'fake_quantize_range_abs_max'
+ self.linear_fc_quant('range_abs_max')
+
+ def residual_block_quant(self, quant_type):
+ main = fluid.Program()
+ startup = fluid.Program()
+ with fluid.program_guard(main, startup):
+ loss = residual_block(2)
+ opt = fluid.optimizer.Adam(learning_rate=0.001)
+ opt.minimize(loss)
+ exe = fluid.Executor(fluid.CPUPlace())
+ graph = IrGraph(core.Graph(main.desc), for_test=False)
+ transform_pass = QuantizationTransformPass(
+ scope=fluid.global_scope(),
+ program_exe=exe,
+ activation_quantize_type=quant_type)
+ transform_pass.apply(graph)
+ marked_nodes = set()
+ for op in graph.all_ops():
+ if op.name().find('quantize') > -1:
+ marked_nodes.add(op)
+ graph.draw('.', 'quantize_residual_' + quant_type, marked_nodes)
+ program = graph.to_program()
+ self.check_program(transform_pass, program)
+ val_graph = IrGraph(core.Graph(program.desc), for_test=False)
+ val_marked_nodes = set()
+ for op in val_graph.all_ops():
+ if op.name().find('quantize') > -1:
+ val_marked_nodes.add(op)
+ val_graph.draw('.', 'val_residual_' + quant_type, val_marked_nodes)
+
+ def test_residual_block_abs_max(self):
+ self.act_quant_op_type = 'fake_quantize_abs_max'
+ self.residual_block_quant('abs_max')
+
+ def test_residual_block_range_abs_max(self):
+ self.act_quant_op_type = 'fake_quantize_range_abs_max'
+ self.residual_block_quant('range_abs_max')
+
+
+class TestQuantizationFreezePass(unittest.TestCase):
+ def freeze_graph(self, use_cuda, seed, quant_type):
+ def build_program(main, startup, is_test):
+ main.random_seed = seed
+ startup.random_seed = seed
+ with fluid.unique_name.guard():
+ with fluid.program_guard(main, startup):
+ img = fluid.layers.data(
+ name='image', shape=[1, 28, 28], dtype='float32')
+ label = fluid.layers.data(
+ name='label', shape=[1], dtype='int64')
+ loss = conv_net(img, label)
+ if not is_test:
+ opt = fluid.optimizer.Adam(learning_rate=0.001)
+ opt.minimize(loss)
+ return [img, label], loss
+
+ random.seed(0)
+ np.random.seed(0)
+
+ main = fluid.Program()
+ startup = fluid.Program()
+ test_program = fluid.Program()
+ feeds, loss = build_program(main, startup, False)
+ build_program(test_program, startup, True)
+ test_program = test_program.clone(for_test=True)
+ main_graph = IrGraph(core.Graph(main.desc), for_test=False)
+ test_graph = IrGraph(core.Graph(test_program.desc), for_test=True)
+
+ place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+ exe = fluid.Executor(place)
+ scope = fluid.Scope()
+ with fluid.scope_guard(scope):
+ exe.run(startup)
+ transform_pass = QuantizationTransformPass(
+ scope=scope, program_exe=exe, activation_quantize_type=quant_type)
+ transform_pass.apply(main_graph)
+ transform_pass.apply(test_graph)
+ dev_name = '_gpu_' if use_cuda else '_cpu_'
+ marked_nodes = set()
+ for op in main_graph.all_ops():
+ if op.name().find('quantize') > -1:
+ marked_nodes.add(op)
+ main_graph.draw('.', 'main' + dev_name + quant_type, marked_nodes)
+ marked_nodes = set()
+ for op in test_graph.all_ops():
+ if op.name().find('quantize') > -1:
+ marked_nodes.add(op)
+ test_graph.draw('.', 'test' + dev_name + quant_type, marked_nodes)
+
+ quantized_main_program = main_graph.to_program()
+ quantized_test_program = test_graph.to_program()
+ iters = 5
+ batch_size = 8
+
+ #train_exe = fluid.ParallelExecutor(
+ # main_program=quantized_main_program,
+ # use_cuda=bool(use_cuda),
+ # loss_name=loss.name,
+ # scope=scope)
+ train_reader = paddle.batch(
+ paddle.reader.shuffle(
+ paddle.dataset.mnist.train(), buf_size=500),
+ batch_size=batch_size)
+ test_reader = paddle.batch(
+ paddle.dataset.mnist.test(), batch_size=batch_size)
+ feeder = fluid.DataFeeder(feed_list=feeds, place=place)
+ with fluid.scope_guard(scope):
+ for _ in range(iters):
+ data = next(train_reader())
+ loss_v = exe.run(program=quantized_main_program,
+ feed=feeder.feed(data),
+ fetch_list=[loss])
+ #loss_v = train_exe.run(feed=feeder.feed(data),
+ # fetch_list=[loss.name])
+ #print('{}: {}'.format('loss' + dev_name + quant_type, loss_v))
+
+ test_data = next(test_reader())
+ with fluid.program_guard(quantized_test_program):
+ w_var = fluid.framework._get_var('conv2d_1.w_0.quantized',
+ quantized_test_program)
+ # Testing
+ with fluid.scope_guard(scope):
+ test_loss1, w_quant = exe.run(program=quantized_test_program,
+ feed=feeder.feed(test_data),
+ fetch_list=[loss, w_var])
+
+ # Freeze the graph for inference, but the weights of fc/conv are still of float type.
+ freeze_pass = QuantizationFreezePass(scope=scope, place=place)
+ freeze_pass.apply(test_graph)
+ marked_nodes = set()
+ for op in test_graph.all_ops():
+ if op.name().find('quantize') > -1:
+ marked_nodes.add(op)
+ test_graph.draw('.', 'test_freeze' + dev_name + quant_type,
+ marked_nodes)
+
+ server_program = test_graph.to_program()
+ with fluid.scope_guard(scope):
+ test_loss2, = exe.run(program=server_program,
+ feed=feeder.feed(test_data),
+ fetch_list=[loss])
+ self.assertAlmostEqual(test_loss1, test_loss2, delta=5e-3)
+ #print('{}: {}'.format('test_loss1' + dev_name + quant_type, test_loss1))
+ #print('{}: {}'.format('test_loss2' + dev_name + quant_type, test_loss2))
+ w_freeze = np.array(scope.find_var('conv2d_1.w_0').get_tensor())
+ # This check may fail due to limited calculation precision.
+ # self.assertAlmostEqual(np.sum(w_freeze), np.sum(w_quant))
+ #print('{}: {}'.format('w_freeze' + dev_name + quant_type,
+ # np.sum(w_freeze)))
+ #print('{}: {}'.format('w_quant' + dev_name + quant_type,
+ # np.sum(w_quant)))
+
+ # Convert parameter to 8-bit.
+ convert_int8_pass = ConvertToInt8Pass(scope=scope, place=place)
+ convert_int8_pass.apply(test_graph)
+ marked_nodes = set()
+ for op in test_graph.all_ops():
+ if op.name().find('quantize') > -1:
+ marked_nodes.add(op)
+ test_graph.draw('.', 'test_int8' + dev_name + quant_type, marked_nodes)
+ server_program_int8 = test_graph.to_program()
+ # Save the 8-bit parameter and model file.
+ with fluid.scope_guard(scope):
+ fluid.io.save_inference_model('server_int8' + dev_name + quant_type,
+ ['image', 'label'], [loss], exe,
+ server_program_int8)
+ # Test whether the 8-bit parameter and model file can be loaded successfully.
+ [infer, feed, fetch] = fluid.io.load_inference_model(
+ 'server_int8' + dev_name + quant_type, exe)
+ # Check the loaded 8-bit weight.
+ w_8bit = np.array(scope.find_var('conv2d_1.w_0.int8').get_tensor())
+ self.assertEqual(w_8bit.dtype, np.int8)
+ self.assertEqual(np.sum(w_8bit), np.sum(w_freeze))
+ #print('{}: {}'.format('w_8bit' + dev_name + quant_type, np.sum(w_8bit)))
+ #print('{}: {}'.format('w_freeze' + dev_name + quant_type,
+ # np.sum(w_freeze)))
+
+ mobile_pass = TransformForMobilePass()
+ mobile_pass.apply(test_graph)
+ marked_nodes = set()
+ for op in test_graph.all_ops():
+ if op.name().find('quantize') > -1:
+ marked_nodes.add(op)
+ test_graph.draw('.', 'test_mobile' + dev_name + quant_type,
+ marked_nodes)
+
+ mobile_program = test_graph.to_program()
+ with fluid.scope_guard(scope):
+ fluid.io.save_inference_model('mobile_int8' + dev_name + quant_type,
+ ['image', 'label'], [loss], exe,
+ mobile_program)
+
+ def test_freeze_graph_cuda_dynamic(self):
+ if fluid.core.is_compiled_with_cuda():
+ with fluid.unique_name.guard():
+ self.freeze_graph(True, seed=1, quant_type='abs_max')
+
+ def test_freeze_graph_cpu_dynamic(self):
+ with fluid.unique_name.guard():
+ self.freeze_graph(False, seed=2, quant_type='abs_max')
+
+ def test_freeze_graph_cuda_static(self):
+ if fluid.core.is_compiled_with_cuda():
+ with fluid.unique_name.guard():
+ self.freeze_graph(True, seed=1, quant_type='range_abs_max')
+
+ def test_freeze_graph_cpu_static(self):
+ with fluid.unique_name.guard():
+ self.freeze_graph(False, seed=2, quant_type='range_abs_max')
+
+
+if __name__ == '__main__':
+ unittest.main()
diff --git a/python/paddle/fluid/contrib/slim/unitest/test_quantization_pass.py b/python/paddle/fluid/contrib/slim/unitest/test_quantization_pass.py
deleted file mode 100644
index 1bd4b95d6b90b7f16d507061190f0b463f6c4cc5..0000000000000000000000000000000000000000
--- a/python/paddle/fluid/contrib/slim/unitest/test_quantization_pass.py
+++ /dev/null
@@ -1,175 +0,0 @@
-# copyright (c) 2018 paddlepaddle authors. all rights reserved.
-#
-# licensed under the apache license, version 2.0 (the "license");
-# you may not use this file except in compliance with the license.
-# you may obtain a copy of the license at
-#
-# http://www.apache.org/licenses/license-2.0
-#
-# unless required by applicable law or agreed to in writing, software
-# distributed under the license is distributed on an "as is" basis,
-# without warranties or conditions of any kind, either express or implied.
-# see the license for the specific language governing permissions and
-# limitations under the license.
-
-import unittest
-import random
-import numpy as np
-import paddle.fluid as fluid
-import six
-from paddle.fluid.framework import Program
-from paddle.fluid.framework import IrGraph
-from paddle.fluid.contrib.slim.quantization import QuantizationTransformPass
-from paddle.fluid import core
-
-
-def linear_fc(num):
- data = fluid.layers.data(name='image', shape=[1, 32, 32], dtype='float32')
- label = fluid.layers.data(name='label', shape=[1], dtype='int64')
- hidden = data
- for _ in six.moves.xrange(num):
- hidden = fluid.layers.fc(hidden, size=128, act='relu')
- loss = fluid.layers.cross_entropy(input=hidden, label=label)
- loss = fluid.layers.mean(loss)
- return loss
-
-
-def residual_block(num):
- def conv_bn_layer(input,
- ch_out,
- filter_size,
- stride,
- padding,
- act='relu',
- bias_attr=False):
- tmp = fluid.layers.conv2d(
- input=input,
- filter_size=filter_size,
- num_filters=ch_out,
- stride=stride,
- padding=padding,
- act=None,
- bias_attr=bias_attr)
- return fluid.layers.batch_norm(input=tmp, act=act)
-
- data = fluid.layers.data(name='image', shape=[1, 32, 32], dtype='float32')
- label = fluid.layers.data(name='label', shape=[1], dtype='int64')
- hidden = data
- for _ in six.moves.xrange(num):
- conv = conv_bn_layer(hidden, 16, 3, 1, 1, act=None, bias_attr=True)
- short = conv_bn_layer(hidden, 16, 1, 1, 0, act=None)
- hidden = fluid.layers.elementwise_add(x=conv, y=short, act='relu')
- fc = fluid.layers.fc(input=hidden, size=10)
- loss = fluid.layers.cross_entropy(input=fc, label=label)
- loss = fluid.layers.mean(loss)
- return loss
-
-
-class TestQuantizationTransformPass(unittest.TestCase):
- def setUp(self):
- self.quantizable_op_and_inputs = {
- 'conv2d': ['Input', 'Filter'],
- 'depthwise_conv2d': ['Input', 'Filter'],
- 'mul': ['X', 'Y']
- }
- self.quantizable_grad_op_inputs = {
- 'conv2d_grad': ['Input', 'Filter'],
- 'depthwise_conv2d_grad': ['Input', 'Filter'],
- 'mul_grad': ['X', 'Y']
- }
-
- def check_program(self, transform_pass, program):
- quantized_ops = set()
- for block in program.blocks:
- for op in block.ops:
- # check forward
- if op.type in self.quantizable_op_and_inputs:
- for arg_name in op.input_arg_names:
- self.assertTrue(
- arg_name.endswith('.quantized.dequantized'))
- quantized_ops.add(arg_name)
-
- for op in block.ops:
- # check backward
- if op.type in self.quantizable_grad_op_inputs:
- for pname in self.quantizable_grad_op_inputs[op.type]:
- arg_name = op.input(pname)[0]
- self.assertTrue(
- arg_name.endswith('.quantized.dequantized'))
- self.assertTrue(arg_name in quantized_ops)
-
- def linear_fc_quant(self, quant_type):
- main = fluid.Program()
- startup = fluid.Program()
- with fluid.program_guard(main, startup):
- loss = linear_fc(3)
- opt = fluid.optimizer.Adam(learning_rate=0.001)
- opt.minimize(loss)
- exe = fluid.Executor(fluid.CPUPlace())
- graph = IrGraph(core.Graph(main.desc), for_test=False)
- transform_pass = QuantizationTransformPass(
- scope=fluid.global_scope(),
- program_exe=exe,
- activation_quantize_type=quant_type)
- transform_pass.apply(graph)
- marked_nodes = set()
- for op in graph.all_ops():
- if op.name().find('quantize') > -1:
- marked_nodes.add(op)
- graph.draw('.', 'quantize_fc_' + quant_type, marked_nodes)
- program = graph.to_program()
- self.check_program(transform_pass, program)
- val_graph = IrGraph(core.Graph(program.desc), for_test=False)
- val_marked_nodes = set()
- for op in val_graph.all_ops():
- if op.name().find('quantize') > -1:
- val_marked_nodes.add(op)
- val_graph.draw('.', 'val_fc_' + quant_type, val_marked_nodes)
-
- def test_linear_fc_quant_abs_max(self):
- self.act_quant_op_type = 'fake_quantize_abs_max'
- self.linear_fc_quant('abs_max')
-
- def test_linear_fc_quant_range_abs_max(self):
- self.act_quant_op_type = 'fake_quantize_range_abs_max'
- self.linear_fc_quant('range_abs_max')
-
- def residual_block_quant(self, quant_type):
- main = fluid.Program()
- startup = fluid.Program()
- with fluid.program_guard(main, startup):
- loss = residual_block(2)
- opt = fluid.optimizer.Adam(learning_rate=0.001)
- opt.minimize(loss)
- exe = fluid.Executor(fluid.CPUPlace())
- graph = IrGraph(core.Graph(main.desc), for_test=False)
- transform_pass = QuantizationTransformPass(
- scope=fluid.global_scope(),
- program_exe=exe,
- activation_quantize_type=quant_type)
- transform_pass.apply(graph)
- marked_nodes = set()
- for op in graph.all_ops():
- if op.name().find('quantize') > -1:
- marked_nodes.add(op)
- graph.draw('.', 'quantize_residual_' + quant_type, marked_nodes)
- program = graph.to_program()
- self.check_program(transform_pass, program)
- val_graph = IrGraph(core.Graph(program.desc), for_test=False)
- val_marked_nodes = set()
- for op in val_graph.all_ops():
- if op.name().find('quantize') > -1:
- val_marked_nodes.add(op)
- val_graph.draw('.', 'val_residual_' + quant_type, val_marked_nodes)
-
- def test_residual_block_abs_max(self):
- self.act_quant_op_type = 'fake_quantize_abs_max'
- self.residual_block_quant('abs_max')
-
- def test_residual_block_range_abs_max(self):
- self.act_quant_op_type = 'fake_quantize_range_abs_max'
- self.residual_block_quant('range_abs_max')
-
-
-if __name__ == '__main__':
- unittest.main()
diff --git a/python/paddle/fluid/contrib/tests/CMakeLists.txt b/python/paddle/fluid/contrib/tests/CMakeLists.txt
index 81aee1233d1db756686d1a934b94672dc5c770fe..a2c59416467e5dbe66f058666633807eb0e45047 100644
--- a/python/paddle/fluid/contrib/tests/CMakeLists.txt
+++ b/python/paddle/fluid/contrib/tests/CMakeLists.txt
@@ -6,5 +6,9 @@ if(APPLE OR WIN32 OR NOT WITH_MKL)
endif()
foreach(src ${TEST_OPS})
- py_test(${src} SRCS ${src}.py)
+ if(src MATCHES "test_calibration")
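+      # test_calibration exercises the MKL-DNN INT8 calibration path, so
+      # MKL-DNN is enabled through the test environment instead of inside
+      # the test itself.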
+ py_test(${src} SRCS ${src}.py ENVS FLAGS_use_mkldnn=true)
+ else()
+ py_test(${src} SRCS ${src}.py)
+ endif()
endforeach()
diff --git a/python/paddle/fluid/contrib/tests/test_calibration.py b/python/paddle/fluid/contrib/tests/test_calibration.py
index 424ea245a0f2dff0d437ace386f2e4e0fa6b517d..b9f938bebed71dc9611df8d743a066858ea38bca 100644
--- a/python/paddle/fluid/contrib/tests/test_calibration.py
+++ b/python/paddle/fluid/contrib/tests/test_calibration.py
@@ -199,7 +199,6 @@ class TestCalibrationForResnet50(unittest.TestCase):
def run_program(self, model_path, generate_int8=False, algo='direct'):
image_shape = [3, 224, 224]
- os.environ['FLAGS_use_mkldnn'] = 'True'
fluid.memory_optimize(fluid.default_main_program())
@@ -241,9 +240,6 @@ class TestCalibrationForResnet50(unittest.TestCase):
label = label.reshape([-1, 1])
running_program = calibrator.sampling_program.clone(
) if generate_int8 else infer_program.clone()
- for op in running_program.current_block().ops:
- if op.has_attr("use_mkldnn"):
- op._set_attr("use_mkldnn", True)
t1 = time.time()
_, acc1, _ = exe.run(
diff --git a/python/paddle/fluid/contrib/tests/test_quantize_transpiler.py b/python/paddle/fluid/contrib/tests/test_quantize_transpiler.py
index 86fa84ad4bd7a55fb27f4e43128f0bfda6dfe6db..77fdf0087b93c3ad44a2492de68f8f57ce243ef3 100644
--- a/python/paddle/fluid/contrib/tests/test_quantize_transpiler.py
+++ b/python/paddle/fluid/contrib/tests/test_quantize_transpiler.py
@@ -204,9 +204,11 @@ class TestQuantizeTranspiler(unittest.TestCase):
build_program(test_program, startup, True)
test_program = test_program.clone(for_test=True)
- quant_transpiler = QuantizeTranspiler()
- quant_transpiler.training_transpile(main)
- quant_transpiler.training_transpile(test_program)
+ quant_type = 'range_abs_max' # 'range_abs_max' or 'abs_max'
+ quant_transpiler = QuantizeTranspiler(
+ activation_quantize_type=quant_type)
+ quant_transpiler.training_transpile(main, startup)
+ quant_transpiler.training_transpile(test_program, startup)
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
diff --git a/python/paddle/fluid/framework.py b/python/paddle/fluid/framework.py
index 832c97c7deb49b4e118e15989ab7a34da6ce57a0..ef304b11106628f8541b348fb263274a0c4b31e9 100644
--- a/python/paddle/fluid/framework.py
+++ b/python/paddle/fluid/framework.py
@@ -16,6 +16,8 @@ from __future__ import print_function
import collections
from collections import defaultdict
+from collections import Iterable
+import contextlib
from .wrapped_decorator import signature_safe_contextmanager
import os
import re
@@ -1529,12 +1531,16 @@ class Block(object):
class IrGraph(object):
"""
- IrGraph uses core.Graph as the delegation to accomplish the manipulation.
+    Python IrGraph. Beneath it is a core.Graph, which is used to
+    create a C++ IR pass graph. An IrGraph is just a graph view of
+    a Program. In an IrGraph, both Variables and Operators are graph
+    nodes.
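+
+    Examples:
+        .. code-block:: python
+
+            # A minimal sketch of wrapping a Program as an IrGraph, mirroring
+            # how the quantization tests in this change construct graphs.
+            import paddle.fluid as fluid
+            from paddle.fluid import core
+            from paddle.fluid.framework import IrGraph
+
+            program = fluid.Program()
+            graph = IrGraph(core.Graph(program.desc), for_test=False)
+            op_nodes = graph.all_ops()    # set of operator nodes
+            var_nodes = graph.all_vars()  # set of variable nodes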
"""
def __init__(self, graph, for_test=False):
"""
- Construct the IrGraph using core.Graph.
+ Construct an IrGraph using core.Graph.
+
Args:
graph(core.Graph): C++ Graph.
for_test(bool): True for the test graph and false for the train graph.
@@ -1545,23 +1551,81 @@ class IrGraph(object):
self._for_test = for_test
def is_test(self):
+ """
+        Return True if the graph is used for testing, otherwise return False.
+ """
return self._for_test
- def all_parameters(self):
- param_nodes = set()
- for node in self.graph.nodes():
- if node.is_var() and node.var() is not None and node.var(
- ).persistable():
- param_nodes.add(node)
- return param_nodes
+ def all_nodes(self):
+ """
+ Return all nodes included in the graph as a set.
+ """
+ return {node for node in self.graph.nodes()}
def all_vars(self):
+ """
+ Return all variable nodes included in the graph as a set.
+ """
return {node for node in self.graph.nodes() if node.is_var()}
+ def all_persistable_vars(self):
+ """
+ Return all persistable variable nodes included in the graph as a set.
+ """
+ persistable_nodes = set()
+ for node in self.graph.nodes():
+ if node.is_var() and node.var() is not None and node.var(
+ ).persistable():
+ persistable_nodes.add(node)
+ return persistable_nodes
+
def all_ops(self):
+ """
+ Return all operator nodes included in the graph as a set.
+ """
return {node for node in self.graph.nodes() if node.is_op()}
+ def var_node(self, name):
+ """
+ Get a variable node by name from the graph.
+
+ Args:
+ name(str): the name of the variable node.
+
+ Raises:
+            TypeError: If the input name is not a string.
+            ValueError: If this graph doesn't have a variable with the given name.
+
+ Returns:
+            core.Node: the variable node with the given name.
+ """
+ if not isinstance(name, six.string_types):
+ raise TypeError(
+ "var require string as parameter, but get %s instead." %
+ (type(name)))
+ target_var_node = None
+ var_nodes = self.all_vars()
+ for var_node in var_nodes:
+ if var_node.name() == name:
+ target_var_node = var_node
+ if target_var_node is None:
+ raise ValueError("var_node %s not in this graph" % name)
+ return target_var_node
+
def create_param_node(self, name, var_type, shape, var_dtype):
+ """
+        Create a persistable variable node in the graph. In an IrGraph,
+        persistable variables and parameters cannot be distinguished.
+
+ Args:
+ name(str): the name of the persistable variable node.
+            var_type(core.VarDesc.VarType): the type of the persistable variable node.
+ shape(list): the shape of the persistable variable node.
+ var_dtype(core.VarDesc.VarType): the data type of the persistable variable node.
+
+ Returns:
+ core.Node: the created persistable variable node.
+ """
var_desc = core.VarDesc(name)
var_desc.set_type(var_type)
var_desc.set_shape(shape)
@@ -1570,6 +1634,20 @@ class IrGraph(object):
return self.graph.create_var_node(var_desc)
def create_var_node(self, name, var_type, shape, var_dtype):
+ """
+ Create a variable node in the graph. The created variable node is
+ not persistable.
+
+ Args:
+ name(str): the name of the variable node.
+            var_type(core.VarDesc.VarType): the type of the variable node.
+ shape(list): the shape of the variable node.
+ var_dtype(core.VarDesc.VarType): the data type of the variable node.
+
+ Returns:
+ core.Node: the created variable node.
+ """
+
var_desc = core.VarDesc(name)
var_desc.set_type(var_type)
var_desc.set_shape(shape)
@@ -1577,19 +1655,41 @@ class IrGraph(object):
return self.graph.create_var_node(var_desc)
def create_var_node_from_desc(self, var_desc):
+ """
+        Create a variable node from an existing VarDesc in the graph.
+        Depending on the given VarDesc, the created variable node may be persistable.
+
+ Args:
+            var_desc(core.VarDesc): the given variable description.
+
+ Returns:
+ core.Node: the created variable node.
+ """
return self.graph.create_var_node(var_desc)
def create_op_node(self, op_type, attrs, inputs, outputs):
+ """
+        Create an operator node in the graph.
+
+ Args:
+ op_type(str): the type of the operator node.
+ attrs(dict): the attributes of the operator node.
+ inputs(dict): the inputs of the operator node.
+            outputs(dict): the outputs of the operator node.
+
+ Returns:
+ core.Node: the created operator node.
+ """
op_desc = core.OpDesc()
op_desc.set_type(op_type)
- for attr, value in attrs.iteritems():
+ for attr, value in six.iteritems(attrs):
self._update_desc_attr(op_desc, attr, value)
- for input_name, var_nodes in inputs.iteritems():
+ for input_name, var_nodes in six.iteritems(inputs):
if not isinstance(var_nodes, list):
var_nodes = [var_nodes]
op_desc.set_input(input_name,
[var_node.name() for var_node in var_nodes])
- for output_name, var_nodes in outputs.iteritems():
+ for output_name, var_nodes in six.iteritems(outputs):
if not isinstance(var_nodes, list):
var_nodes = [var_nodes]
op_desc.set_output(output_name,
@@ -1597,11 +1697,29 @@ class IrGraph(object):
return self.graph.create_op_node(op_desc)
def create_op_node_from_desc(self, op_desc):
+ """
+        Create an operator node from an existing OpDesc in the graph.
+
+ Args:
+            op_desc(core.OpDesc): the given operator description.
+
+ Returns:
+ core.Node: the created operator node.
+ """
return self.graph.create_op_node(op_desc)
def update_input_link(self, old_input_node, new_input_node, op_node):
- assert old_input_node in self.graph.nodes() and new_input_node in self.graph.nodes() and \
- op_node in self.graph.nodes(), 'Th three arguments must be in the graph nodes.'
+ """
+        Update the input link of an operator node.
+
+ Args:
+            old_input_node(core.Node): the old input node of the given op_node.
+            new_input_node(core.Node): the new input node of the given op_node.
+            op_node(core.Node): the operator node whose input link needs to be updated.
+ """
+ assert old_input_node in self.graph.nodes() and new_input_node in \
+ self.graph.nodes() and op_node in self.graph.nodes(), \
+            'The three arguments (old_input_node, new_input_node and op_node) must be in the graph nodes.'
old_input_node.outputs_remove(op_node)
op_node.inputs_remove(old_input_node)
new_input_node.outputs_append(op_node)
@@ -1609,17 +1727,85 @@ class IrGraph(object):
op_node.op()._rename_input(old_input_node.name(), new_input_node.name())
def link_to(self, node_in, node_out):
+ """
+ Connect two nodes.
+
+ Args:
+ node_in(core.Node): the input node.
+ node_out(core.Node): the output node.
+ """
assert node_in in self.graph.nodes() and node_out in self.graph.nodes(), \
- 'Th two arguments must be in the graph nodes.'
+            'The two arguments (node_in and node_out) must be in the graph nodes.'
node_in.outputs_append(node_out)
node_out.inputs_append(node_in)
def safe_remove_nodes(self, remove_nodes):
+ """
+        Remove the given nodes safely: the links connected to the removed
+        nodes are removed as well.
+
+ Args:
+            remove_nodes(set): the nodes to be removed. A single node or any
+                iterable of nodes is also accepted.
+ """
if not isinstance(remove_nodes, set):
- remove_nodes = set(remove_nodes)
+ if isinstance(remove_nodes, Iterable):
+ remove_nodes = set(remove_nodes)
+ else:
+ remove_nodes = {remove_nodes}
core.graph_safe_remove_nodes(self.graph, remove_nodes)
- def draw(self, save_path, name, marked_nodes=None):
+ def has_circle(self):
+ """
+        Check if the graph contains a cycle.
+
+ Returns:
+            bool: True if the graph contains a cycle, otherwise False.
+ """
+ return core.has_circle(self.graph)
+
+ def graph_num(self):
+ """
+        Count the number of disconnected subgraphs in this graph.
+
+ Returns:
+            int: the number of disconnected subgraphs.
+ """
+ return core.graph_num(self.graph)
+
+ def topology_sort(self):
+ """
+        Perform a topological sort on the graph.
+
+        Notes: the `graph` must not contain a cycle.
+
+ Returns:
+            set(core.Node): nodes in topological order.
+ """
+ return core.topology_sort(self.graph)
+
+ def build_adjacency_list(self):
+ """
+ Build an adjacency list of operations for the `graph`.
+
+ Returns:
+ dict{core.Node: set(core.Node)}: the adjacency list.
+ """
+ return core.build_adjacency_list(self.graph)
+
+ def draw(self, save_path, name, marked_nodes=None, remove_ctr_var=True):
+ """
+        Draw the graph. If the `dot` command is installed, the drawn graph
+        is saved as a PDF file; otherwise it is saved as a dot file.
+
+ Args:
+            save_path(str): the path where the drawn graph is saved.
+ name(str): the name of drawn graph.
+            marked_nodes(set(core.Node)): nodes to be highlighted in the drawing.
+            Default value is None.
+            remove_ctr_var(bool): If set to True, all control variable nodes in
+            the graph are removed before drawing. Default value is True.
+ """
+
def _convert_to_pdf(dot_file_path):
pdf_save_path = os.path.splitext(dot_file_path)[0] + '.pdf'
exited_code = subprocess.call('dot -Tpdf ' + dot_file_path \
@@ -1629,15 +1815,17 @@ class IrGraph(object):
print('The {} is saved as the dot filetype.'.format(
dot_file_path))
- remove_ctr_vars = set()
+ if remove_ctr_var:
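+            # Drop control-dependency variable nodes so that only data-flow
+            # edges appear in the rendered graph.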
+ remove_ctr_vars = set()
+ for node in self.graph.nodes():
+ if node.is_ctrl_var():
+ remove_ctr_vars.add(node)
+ self.safe_remove_nodes(remove_ctr_vars)
ops_num = 0
for node in self.graph.nodes():
- if node.is_ctrl_var():
- remove_ctr_vars.add(node)
- elif node.is_op():
+ if node.is_op():
ops_num += 1
print('Total ops num = {}.'.format(ops_num))
- self.safe_remove_nodes(remove_ctr_vars)
if marked_nodes is not None:
if not isinstance(marked_nodes, set):
marked_nodes = set(marked_nodes)
@@ -1652,10 +1840,20 @@ class IrGraph(object):
_convert_to_pdf(viz_dot_path)
def to_program(self):
+ """
+ Convert the graph into a Program.
+
+        Notes: When the graph includes backward operator nodes, the
+        conversion may fail. Usually, this function is only used to
+        convert a test graph.
+
+ Returns:
+ Program: a program converted from the graph.
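+
+        A minimal usage sketch (mirroring how the tests in this change convert
+        a test graph back to a Program):
+
+        .. code-block:: python
+
+            test_graph = IrGraph(core.Graph(test_program.desc), for_test=True)
+            # ... apply IR passes to test_graph ...
+            inference_program = test_graph.to_program()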
+ """
convert_pass = core.get_pass('graph_to_program_pass')
- convert_pass.set('program', Program().desc)
+ desc = core.ProgramDesc()
+ convert_pass.set_not_owned('program', desc)
convert_pass.apply(self.graph)
- desc = convert_pass.get_program('program')
program = Program._construct_from_desc(desc)
return program
diff --git a/python/paddle/fluid/imperative/layers.py b/python/paddle/fluid/imperative/layers.py
index 71ff95bdea36967c1fa6b5c94cc7ca305e7a544a..59fe6bbf74b80c2260c5b4881fee8807482c9c68 100644
--- a/python/paddle/fluid/imperative/layers.py
+++ b/python/paddle/fluid/imperative/layers.py
@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
+import collections
import contextlib
import sys
import numpy as np
@@ -30,31 +31,45 @@ class Layer(core.Layer):
def __init__(self, dtype=core.VarDesc.VarType.FP32, name=None):
self._built = False
self._dtype = dtype
+ self._parameters = collections.OrderedDict()
+ self._sub_layers = collections.OrderedDict()
+
+ def parameters(self, include_sublayers=True):
+ """Returns a list of Parameters from current and sub-layers.
+
+ Args:
+ include_sublayers: If true, also include the parameters from
+ sublayers.
+
+ Returns a list of Parameters.
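+
+        A minimal sketch (FC comes from paddle.fluid.imperative.nn; a layer's
+        parameters are created once it has been built/called):
+
+            from paddle.fluid.imperative.nn import FC
+
+            class MLP(fluid.imperative.Layer):
+                def __init__(self):
+                    super(MLP, self).__init__()
+                    # assigning a Layer to an attribute registers it as a sublayer
+                    self._fc1 = FC(size=10)
+                    self._fc2 = FC(size=10)
+
+                def forward(self, inputs):
+                    return self._fc2(self._fc1(inputs))
+
+            # after mlp has been called once:
+            #     params = mlp.parameters()   # parameters of _fc1 and _fc2
+            #     layers = mlp.sublayers()    # [_fc1, _fc2]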
+ """
+ ret = [p for p in self._parameters.values()]
+ if include_sublayers:
+ for l in self._sub_layers.values():
+ for p in l.parameters(include_sublayers):
+ ret.append(p)
+ return ret
+
+ def sublayers(self, include_sublayers=True):
+ """Returns a list of sub layers.
- def parameters(self):
- params = []
- for key in self.__dict__.keys():
- value = self.__dict__[key]
- if isinstance(value, framework.Parameter):
- params.append(value)
- elif isinstance(value, core.Layer):
- params.extend(value.parameters())
- elif isinstance(value, collections.Container):
- if len(value) == 0:
- continue
- if isinstance(value[0], framework.Parameter):
- params.extend(value)
- elif isinstance(value[0], core.Layer):
- for v in value:
- params.extend(v.parameters())
-
- return params
+ Args:
+ include_sublayers: If true, also include the layers from sublayers.
+
+ Returns a list of sub layers.
+ """
+ ret = [l for l in self._sub_layers.values()]
+ if include_sublayers:
+ for l in self._sub_layers.values():
+ for sub_l in l.sublayers(include_sublayers):
+ ret.append(sub_l)
+ return ret
def clear_gradients(self):
for p in self.parameters():
p._clear_gradient()
- def _build_once(self, inputs):
+ def _build_once(self, *args):
pass
def __call__(self, *inputs):
@@ -71,6 +86,66 @@ class Layer(core.Layer):
def backward(self, *inputs):
raise ValueError("Layer shouldn't implement backward")
+ def add_sublayer(self, name, sublayer):
+ """Adds a sub Layer instance.
+
+        The added sublayer can be accessed as self.name.
+
+ Args:
+ name: name of this sublayer.
+ sublayer: an instance of Layer.
+ Returns:
+ the sublayer passed in.
+ """
+ assert isinstance(sublayer, core.Layer)
+ self._sub_layers[name] = sublayer
+ return sublayer
+
+ def add_parameter(self, name, parameter):
+ """Adds a Parameter instance.
+
+        The added parameter can be accessed as self.name.
+
+ Args:
+            name: name of this parameter.
+ parameter: an instance of Parameter.
+ Returns:
+ the parameter passed in.
+ """
+ assert isinstance(parameter, framework.Parameter)
+ self._parameters[name] = parameter
+ return parameter
+
+ def __getattr__(self, name):
+ if name in self._parameters:
+ return self._parameters[name]
+ elif name in self._sub_layers:
+ return self._sub_layers[name]
+
+ def __setattr__(self, name, value):
+ if isinstance(value, framework.Parameter):
+ params = self.__dict__.get('_parameters', None)
+ if params is None:
+ raise ValueError(
+ "super(YourLayer, self).__init__() should be called first")
+ params[name] = value
+ elif isinstance(value, core.Layer):
+ layers = self.__dict__.get('_sub_layers', None)
+ if layers is None:
+ raise ValueError(
+ "super(YourLayer, self).__init__() should be called first")
+ layers[name] = value
+ else:
+ object.__setattr__(self, name, value)
+
+ def __delattr__(self, name):
+ if name in self._parameters:
+ del self._parameters[name]
+ elif name in self._sub_layers:
+ del self._sub_layers[name]
+ else:
+ object.__delattr__(self, name)
+
class PyLayer(core.PyLayer):
"""Layers composed of user-defined python codes."""
diff --git a/python/paddle/fluid/imperative/nn.py b/python/paddle/fluid/imperative/nn.py
index 6c5961cc63d1c140e0a6f33aac054acdbbe8e8e0..c86a373ae4a92053538c93386003f9014c32841f 100644
--- a/python/paddle/fluid/imperative/nn.py
+++ b/python/paddle/fluid/imperative/nn.py
@@ -225,9 +225,6 @@ class FC(layers.Layer):
act=act,
name=name)
- def parameters(self):
- return [self._w, self._b]
-
def _build_once(self, input):
input_shape = input.shape
param_shape = [
@@ -478,9 +475,6 @@ class Embedding(layers.Layer):
dtype=self._dtype,
is_bias=False)
- def parameters(self):
- return [self._w]
-
def forward(self, input):
out = self._helper.create_variable_for_type_inference(self._dtype)
self._helper.append_op(
diff --git a/python/paddle/fluid/layers/nn.py b/python/paddle/fluid/layers/nn.py
index 46ce58fd2db19205e4d1194c30cba76a83fa2de8..fe257f9a37fd5291ec0d8d5f33971e94bfae09e0 100644
--- a/python/paddle/fluid/layers/nn.py
+++ b/python/paddle/fluid/layers/nn.py
@@ -3236,7 +3236,7 @@ def group_norm(input,
# create output
mean_out = helper.create_variable(dtype=dtype, stop_gradient=True)
variance_out = helper.create_variable(dtype=dtype, stop_gradient=True)
- group_norm_out = helper.create_variable(dtype)
+ group_norm_out = helper.create_variable(dtype=dtype)
helper.append_op(
type="group_norm",
@@ -5936,13 +5936,10 @@ def reshape(x, shape, actual_shape=None, act=None, inplace=False, name=None):
than :attr:`shape`.
act (str): The non-linear activation to be applied to the reshaped tensor
variable.
- inplace(bool): Must use :attr:`False` if :attr:`x` is used in multiple
- operators. If this flag is set :attr:`True`, reuse input
- :attr:`x` to reshape, which will change the shape of
- tensor variable :attr:`x` and might cause errors when
- :attr:`x` is used in multiple operators. If :attr:`False`,
- preserve the shape :attr:`x` and create a new output tensor
- variable whose data is copied from input x but reshaped.
+        inplace(bool): If ``inplace`` is :attr:`True`, the input and output of
+                       ``layers.reshape`` are the same variable; otherwise the
+                       input and output are different variables. Note that if
+                       :attr:`x` is the input of more than one layer, ``inplace``
+                       must be :attr:`False`.
name (str): The name of this layer. It is optional.
Returns:
@@ -8335,6 +8332,46 @@ def stack(x, axis=0):
If :code:`axis` < 0, it would be replaced with :code:`axis+rank(x[0])+1`.
If :code:`axis` is None, it would be replaced with 0.
+    For example:
+
+ .. code-block:: text
+
+ Case 1:
+ Input:
+ x[0].data = [ [1.0 , 2.0 ] ]
+ x[0].dims = [1, 2]
+ x[1].data = [ [3.0 , 4.0 ] ]
+ x[1].dims = [1, 2]
+ x[2].data = [ [5.0 , 6.0 ] ]
+ x[2].dims = [1, 2]
+
+ Attrs:
+ axis = 0
+
+ Output:
+ Out.data =[ [ [1.0, 2.0] ],
+ [ [3.0, 4.0] ],
+ [ [5.0, 6.0] ] ]
+ Out.dims = [3, 1, 2]
+
+ Case 2:
+ Given
+ x[0].data = [ [1.0 , 2.0 ] ]
+ x[0].dims = [1, 2]
+ x[1].data = [ [3.0 , 4.0 ] ]
+ x[1].dims = [1, 2]
+ x[2].data = [ [5.0 , 6.0 ] ]
+ x[2].dims = [1, 2]
+
+ Attrs:
+ axis = 1 or axis = -2
+
+ Output:
+ Out.data =[ [ [1.0, 2.0]
+ [3.0, 4.0]
+ [5.0, 6.0] ] ]
+ Out.dims = [1, 3, 2]
+
Args:
x (Variable|list(Variable)|tuple(Variable)): Input variables.
axis (int|None): The axis along which all inputs are stacked.
diff --git a/python/paddle/fluid/layers/tensor.py b/python/paddle/fluid/layers/tensor.py
index 2153ca254f0e286a77160a2d53473e1bc76109d5..af747c3cecac66492bb2e2642a88f66a5cfae3db 100644
--- a/python/paddle/fluid/layers/tensor.py
+++ b/python/paddle/fluid/layers/tensor.py
@@ -567,7 +567,7 @@ def ones(shape, dtype, force_cpu=False):
It also sets *stop_gradient* to True.
Args:
- shape(tuple|list|None): Shape of output tensor
+ shape(tuple|list): Shape of output tensor
dtype(np.dtype|core.VarDesc.VarType|str): Data type of output tensor
Returns:
@@ -578,6 +578,10 @@ def ones(shape, dtype, force_cpu=False):
data = fluid.layers.ones(shape=[1], dtype='int64')
"""
+ assert isinstance(shape, list) or isinstance(
+ shape, tuple), "The shape's type should be list or tuple."
+ assert reduce(lambda x, y: x * y,
+ shape) > 0, "The shape is invalid: %s." % (str(shape))
return fill_constant(value=1.0, **locals())
diff --git a/python/paddle/fluid/tests/demo/file_reader/convert_data_to_recordio.py b/python/paddle/fluid/tests/demo/file_reader/convert_data_to_recordio.py
index 45a104ec9625eacfcb87ea6eae619e3d71410da9..b00af91a9dce637e312c9dc5d7d3824106b5a051 100644
--- a/python/paddle/fluid/tests/demo/file_reader/convert_data_to_recordio.py
+++ b/python/paddle/fluid/tests/demo/file_reader/convert_data_to_recordio.py
@@ -16,7 +16,6 @@ from __future__ import print_function
import sys
import paddle.fluid as fluid
-import paddle.v2 as paddle
def load_vocab(filename):
diff --git a/python/paddle/fluid/tests/demo/pyreader.py b/python/paddle/fluid/tests/demo/pyreader.py
index ec61e0ebae4feb1a2177da916b77b2ba2d3981b9..bbcef4c3ff23d955662be10b5f4b96a66da4c7d8 100644
--- a/python/paddle/fluid/tests/demo/pyreader.py
+++ b/python/paddle/fluid/tests/demo/pyreader.py
@@ -20,7 +20,6 @@ import six
import paddle
import paddle.dataset.mnist as mnist
import paddle.fluid as fluid
-import paddle.v2
def network(is_train):
@@ -72,7 +71,7 @@ def main():
use_cuda=use_cuda, share_vars_from=trainer, main_program=test_prog)
train_reader.decorate_paddle_reader(
- paddle.v2.reader.shuffle(
+ paddle.reader.shuffle(
paddle.batch(mnist.train(), 512), buf_size=8192))
test_reader.decorate_paddle_reader(paddle.batch(mnist.test(), 512))
diff --git a/python/paddle/fluid/tests/unittests/test_base_layer.py b/python/paddle/fluid/tests/unittests/test_base_layer.py
new file mode 100644
index 0000000000000000000000000000000000000000..bf00698d63624d4e20a0853641219a2735d89d25
--- /dev/null
+++ b/python/paddle/fluid/tests/unittests/test_base_layer.py
@@ -0,0 +1,82 @@
+# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+import numpy as np
+
+import paddle.fluid as fluid
+from paddle.fluid.layer_helper import LayerHelper
+
+
+class L1(fluid.imperative.Layer):
+ def __init__(self):
+ super(L1, self).__init__()
+ self._helper = LayerHelper(
+ 'MyLayer',
+ param_attr=fluid.ParamAttr(
+ initializer=fluid.initializer.Constant(value=0.1)))
+
+ self.w1 = self._helper.create_parameter(
+ attr=self._helper.param_attr,
+ shape=[2, 2],
+ dtype='float32',
+ is_bias=False)
+ self.w2 = self._helper.create_parameter(
+ attr=self._helper.param_attr,
+ shape=[2, 2],
+ dtype='float32',
+ is_bias=False)
+
+ def forward(self):
+ return self.w1 + self.w2
+
+
+class L2(fluid.imperative.Layer):
+ def __init__(self):
+ super(L2, self).__init__()
+ self.layer1 = L1()
+ self.layer2 = L1()
+
+ def forward(self):
+ return self.layer1() + self.layer2()
+
+
+class L3(fluid.imperative.Layer):
+ def __init__(self):
+ super(L3, self).__init__()
+ self.layer1 = L2()
+ self.layer2 = L2()
+
+ def forward(self):
+ return self.layer1() + self.layer2()
+
+
+class TestBaseLayer(unittest.TestCase):
+ def test_one_level(self):
+ with fluid.imperative.guard():
+ l = L1()
+ ret = l()
+ self.assertEqual(l.w1.name, "MyLayer_0.w_0")
+ self.assertEqual(l.w2.name, "MyLayer_0.w_1")
+ self.assertTrue(np.allclose(ret._numpy(), 0.2 * np.ones([2, 2])))
+
+ def test_three_level(self):
+ with fluid.imperative.guard():
+ l = L3()
+ ret = l()
+ self.assertTrue(np.allclose(ret._numpy(), 0.8 * np.ones([2, 2])))
+
+
+if __name__ == '__main__':
+ unittest.main()
diff --git a/python/paddle/fluid/tests/unittests/test_dist_transpiler.py b/python/paddle/fluid/tests/unittests/test_dist_transpiler.py
index 3566fed215229223f4d2ecd1bbb66cb297dd7716..12132477d28c74c7da718321140a3ddef784fc30 100644
--- a/python/paddle/fluid/tests/unittests/test_dist_transpiler.py
+++ b/python/paddle/fluid/tests/unittests/test_dist_transpiler.py
@@ -22,6 +22,9 @@ import six
import unittest
import numpy as np
+import gc
+gc.set_debug(gc.DEBUG_COLLECTABLE)
+
import paddle.fluid as fluid
@@ -99,6 +102,12 @@ class TranspilerTest(unittest.TestCase):
with fluid.unique_name.guard():
with fluid.program_guard(main, startup):
self.transpiler_test_impl()
+        # NOTE: run gc.collect() to release pybind-side objects and prevent
+        # random double deallocation when they are inherited in Python.
+ del self.transpiler
+ del main
+ del startup
+ gc.collect()
class TestBasicModel(TranspilerTest):
@@ -797,6 +806,7 @@ class TestNCCL2Transpile(TranspilerTest):
print([op.type for op in startup.global_block().ops])
self.assertEqual(startup.global_block().ops[-1].type, "gen_nccl_id")
self.assertIsNotNone(startup.global_block().vars.get("NCCLID"))
+ gc.collect()
else:
pass
diff --git a/python/paddle/fluid/tests/unittests/test_imperative.py b/python/paddle/fluid/tests/unittests/test_imperative.py
index baaddf9f2e5b123300f1d083b33ea644665348fd..c54e998ea875e1bd27f9816f88db0e38bc488459 100644
--- a/python/paddle/fluid/tests/unittests/test_imperative.py
+++ b/python/paddle/fluid/tests/unittests/test_imperative.py
@@ -333,6 +333,18 @@ class TestImperative(unittest.TestCase):
self.assertTrue(np.allclose(dy_out, static_out))
self.assertTrue(np.allclose(dy_grad, static_grad))
+ params = mlp.parameters(True)
+ self.assertEqual("FC_0.w_0", params[0].name)
+ self.assertEqual("FC_0.b_0", params[1].name)
+ self.assertEqual("FC_1.w_0", params[2].name)
+ self.assertEqual("FC_1.b_0", params[3].name)
+ self.assertEqual(len(params), 4)
+
+ sublayers = mlp.sublayers(True)
+ self.assertEqual(mlp._fc1, sublayers[0])
+ self.assertEqual(mlp._fc2, sublayers[1])
+ self.assertEqual(len(sublayers), 2)
+
def test_rnn(self):
np_inp = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0],
[10.0, 11.0, 12.0]])
diff --git a/python/paddle/fluid/tests/unittests/test_imperative_gan.py b/python/paddle/fluid/tests/unittests/test_imperative_gan.py
index 681661bfc63db95653be371688a047efe96f3866..33c196d1ab52b393491561e75054e6c323fce18d 100644
--- a/python/paddle/fluid/tests/unittests/test_imperative_gan.py
+++ b/python/paddle/fluid/tests/unittests/test_imperative_gan.py
@@ -33,9 +33,6 @@ class Discriminator(fluid.imperative.Layer):
self._fc1 = FC(size=32, act='elu', name="d_fc1")
self._fc2 = FC(size=1, name="d_fc2")
- def parameters(self):
- return self._fc1.parameters() + self._fc2.parameters()
-
def forward(self, inputs):
x = self._fc1(inputs)
return self._fc2(x)
@@ -48,10 +45,6 @@ class Generator(fluid.imperative.Layer):
self._fc2 = FC(size=64, act='elu', name="g_fc2")
self._fc3 = FC(size=1, name="g_fc3")
- def parameters(self):
- return self._fc1.parameters() + self._fc2.parameters(
- ) + self._fc3.parameters()
-
def forward(self, inputs):
x = self._fc1(inputs)
x = self._fc2(x)
diff --git a/python/paddle/fluid/tests/unittests/test_imperative_ptb_rnn.py b/python/paddle/fluid/tests/unittests/test_imperative_ptb_rnn.py
index afe990e74ff96dfbca4f335b561f9bbe7d295246..82aff18b728f45388cd747e3bce19c1c9d6f91cc 100644
--- a/python/paddle/fluid/tests/unittests/test_imperative_ptb_rnn.py
+++ b/python/paddle/fluid/tests/unittests/test_imperative_ptb_rnn.py
@@ -75,16 +75,6 @@ class SimpleLSTMRNN(fluid.imperative.Layer):
self.hidden_array.append(pre_hidden)
self.cell_array.append(pre_cell)
- def parameters(self):
- parameters = list()
- for param in self.weight_1_arr:
- parameters.append(param)
- for param in self.weight_2_arr:
- parameters.append(param)
- for bias in self.bias_arr:
- parameters.append(bias)
- return parameters
-
def forward(self, input_embedding, init_hidden=None, init_cell=None):
res = []
for index in range(self._num_steps):
@@ -177,12 +167,6 @@ class PtbModel(fluid.imperative.Layer):
def _build_once(self, input, label, init_hidden, init_cell):
pass
- def parameters(self):
- parameters = self.simple_lstm_rnn.parameters() + [
- self.softmax_weight, self.softmax_bias
- ] + self.embedding.parameters()
- return parameters
-
def forward(self, input, label, init_hidden, init_cell):
init_h = fluid.layers.reshape(
diff --git a/python/paddle/fluid/tests/unittests/test_imperative_resnet.py b/python/paddle/fluid/tests/unittests/test_imperative_resnet.py
index c27fd0b8024a8fa3310a62de34299fb621e2902f..128d18621db8374c6c385dddbefc0d29e760a02f 100644
--- a/python/paddle/fluid/tests/unittests/test_imperative_resnet.py
+++ b/python/paddle/fluid/tests/unittests/test_imperative_resnet.py
@@ -21,7 +21,6 @@ import paddle
import paddle.fluid as fluid
from paddle.fluid import core
from paddle.fluid.layer_helper import LayerHelper
-from paddle.fluid.optimizer import SGDOptimizer
from paddle.fluid.imperative.nn import Conv2D, Pool2D, BatchNorm, FC
from paddle.fluid.imperative.base import to_variable
from test_imperative_base import new_program_scope
@@ -173,11 +172,13 @@ class ResNet(fluid.imperative.Layer):
for block in range(len(depth)):
shortcut = False
for i in range(depth[block]):
- bottleneck_block = BottleneckBlock(
- num_channels=num_channels,
- num_filters=num_filters[block],
- stride=2 if i == 0 and block != 0 else 1,
- shortcut=shortcut)
+ bottleneck_block = self.add_sublayer(
+ 'bb_%d_%d' % (block, i),
+ BottleneckBlock(
+ num_channels=num_channels,
+ num_filters=num_filters[block],
+ stride=2 if i == 0 and block != 0 else 1,
+ shortcut=shortcut))
num_channels = bottleneck_block._num_channels_out
self.bottleneck_block_list.append(bottleneck_block)
shortcut = True
@@ -223,8 +224,7 @@ class TestImperativeResnet(unittest.TestCase):
batch_size=batch_size)
dy_param_init_value = {}
- for param in fluid.default_main_program().global_block(
- ).all_parameters():
+ for param in resnet.parameters():
dy_param_init_value[param.name] = param._numpy()
for batch_id, data in enumerate(train_reader()):
@@ -247,16 +247,14 @@ class TestImperativeResnet(unittest.TestCase):
dy_out = avg_loss._numpy()
if batch_id == 0:
- for param in fluid.default_main_program().global_block(
- ).all_parameters():
+ for param in resnet.parameters():
if param.name not in dy_param_init_value:
dy_param_init_value[param.name] = param._numpy()
avg_loss._backward()
dy_grad_value = {}
- for param in fluid.default_main_program().global_block(
- ).all_parameters():
+ for param in resnet.parameters():
if not param.stop_gradient:
np_array = np.array(param._ivar._grad_ivar().value()
.get_tensor())
@@ -267,8 +265,7 @@ class TestImperativeResnet(unittest.TestCase):
resnet.clear_gradients()
dy_param_value = {}
- for param in fluid.default_main_program().global_block(
- ).all_parameters():
+ for param in resnet.parameters():
dy_param_value[param.name] = param._numpy()
with new_program_scope():
@@ -349,6 +346,7 @@ class TestImperativeResnet(unittest.TestCase):
self.assertTrue(np.allclose(static_out, dy_out))
self.assertEqual(len(dy_param_init_value), len(static_param_init_value))
+
for key, value in six.iteritems(static_param_init_value):
self.assertTrue(np.allclose(value, dy_param_init_value[key]))
self.assertTrue(np.isfinite(value.all()))
diff --git a/python/paddle/fluid/transpiler/memory_optimization_transpiler.py b/python/paddle/fluid/transpiler/memory_optimization_transpiler.py
index 52c1aea288fa2bb7478ad14186367900c05f64e7..ee8cde441ffc63ebd923bd579a7f44d1e2218cf0 100755
--- a/python/paddle/fluid/transpiler/memory_optimization_transpiler.py
+++ b/python/paddle/fluid/transpiler/memory_optimization_transpiler.py
@@ -355,6 +355,10 @@ class ControlFlowGraph(object):
is_forward).dtype()
cache_dtype = self._find_var(block_desc, cache_var,
is_forward).dtype()
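+                    # Skip memory reuse when the dtypes differ; a cached buffer
+                    # of a different dtype cannot be safely reinterpreted.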
+ if x_dtype != cache_dtype:
+ if PRINT_LOG:
+ print("x_dtype and cache_dtype are different")
+ continue
if not compare_shape(x_shape, cache_shape, level):
continue
diff --git a/python/paddle/utils/dump_config.py b/python/paddle/utils/dump_config.py
deleted file mode 100644
index 6a96a0a78fc77c50904ee7822c725c41e646c5e6..0000000000000000000000000000000000000000
--- a/python/paddle/utils/dump_config.py
+++ /dev/null
@@ -1,45 +0,0 @@
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from paddle.trainer.config_parser import parse_config
-from paddle.proto import TrainerConfig_pb2
-import sys
-
-__all__ = []
-
-if __name__ == '__main__':
- whole_conf = False
- binary = False
- if len(sys.argv) == 2:
- conf = parse_config(sys.argv[1], '')
- elif len(sys.argv) == 3:
- conf = parse_config(sys.argv[1], sys.argv[2])
- elif len(sys.argv) == 4:
- conf = parse_config(sys.argv[1], sys.argv[2])
- if sys.argv[3] == '--whole':
- whole_conf = True
- elif sys.argv[3] == '--binary':
- binary = True
- else:
- raise RuntimeError()
-
- assert isinstance(conf, TrainerConfig_pb2.TrainerConfig)
-
- if whole_conf:
- print(conf)
- else:
- if binary:
- sys.stdout.write(conf.model_config.SerializeToString())
- else:
- print(conf.model_config)
diff --git a/python/paddle/utils/dump_v2_config.py b/python/paddle/utils/dump_v2_config.py
deleted file mode 100644
index 5dc2111e379fd39b40e1e9bcf2e577b57b101a68..0000000000000000000000000000000000000000
--- a/python/paddle/utils/dump_v2_config.py
+++ /dev/null
@@ -1,62 +0,0 @@
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-import collections
-
-from paddle.trainer_config_helpers.layers import LayerOutput
-from paddle.v2.layer import parse_network
-from paddle.proto import TrainerConfig_pb2
-
-__all__ = ["dump_v2_config"]
-
-
-def dump_v2_config(topology, save_path, binary=False):
- """ Dump the network topology to a specified file.
-
- This function is only used to dump network defined by using PaddlePaddle V2
- APIs. This function will NOT dump configurations related to PaddlePaddle
- optimizer.
-
- :param topology: The output layers (can be more than one layers given in a
- Python List or Tuple) of the entire network. Using the
- specified layers (if more than one layer is given) as root,
- traversing back to the data layer(s), all the layers
- connected to the specified output layers will be dumped.
- Layers not connceted to the specified will not be dumped.
- :type topology: LayerOutput|List|Tuple
- :param save_path: The path to save the dumped network topology.
- :type save_path: str
- :param binary: Whether to dump the serialized network topology or not.
- The default value is false. NOTE that, if you call this
- function to generate network topology for PaddlePaddle C-API,
- a serialized version of network topology is required. When
- using PaddlePaddle C-API, this flag MUST be set to True.
- :type binary: bool
- """
-
- if isinstance(topology, LayerOutput):
- topology = [topology]
- elif isinstance(topology, collections.Sequence):
- for out_layer in topology:
- assert isinstance(out_layer, LayerOutput), (
- "The type of each element in the parameter topology "
- "should be LayerOutput.")
- else:
- raise RuntimeError("Error input type for parameter topology.")
-
- model_str = parse_network(topology)
- with open(save_path, "w") as fout:
- if binary:
- fout.write(model_str.SerializeToString())
- else:
- fout.write(str(model_str))
diff --git a/python/paddle/utils/image_multiproc.py b/python/paddle/utils/image_multiproc.py
deleted file mode 100644
index d1bbda3fd3562efe486377d41a9fb7359bafa4e7..0000000000000000000000000000000000000000
--- a/python/paddle/utils/image_multiproc.py
+++ /dev/null
@@ -1,278 +0,0 @@
-# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os, sys
-import numpy as np
-from PIL import Image
-import six
-from six.moves import cStringIO as StringIO
-import multiprocessing
-import functools
-import itertools
-
-from paddle.utils.image_util import *
-from paddle.trainer.config_parser import logger
-
-try:
- import cv2
-except ImportError:
- logger.warning("OpenCV2 is not installed, using PIL to process")
- cv2 = None
-
-__all__ = ["CvTransformer", "PILTransformer", "MultiProcessImageTransformer"]
-
-
-class CvTransformer(ImageTransformer):
- """
- CvTransformer used python-opencv to process image.
- """
-
- def __init__(
- self,
- min_size=None,
- crop_size=None,
- transpose=(2, 0, 1), # transpose to C * H * W
- channel_swap=None,
- mean=None,
- is_train=True,
- is_color=True):
- ImageTransformer.__init__(self, transpose, channel_swap, mean, is_color)
- self.min_size = min_size
- self.crop_size = crop_size
- self.is_train = is_train
-
- def resize(self, im, min_size):
- row, col = im.shape[:2]
- new_row, new_col = min_size, min_size
- if row > col:
- new_row = min_size * row / col
- else:
- new_col = min_size * col / row
- im = cv2.resize(im, (new_row, new_col), interpolation=cv2.INTER_CUBIC)
- return im
-
- def crop_and_flip(self, im):
- """
- Return cropped image.
- The size of the cropped image is inner_size * inner_size.
- im: (H x W x K) ndarrays
- """
- row, col = im.shape[:2]
- start_h, start_w = 0, 0
- if self.is_train:
- start_h = np.random.randint(0, row - self.crop_size + 1)
- start_w = np.random.randint(0, col - self.crop_size + 1)
- else:
- start_h = (row - self.crop_size) / 2
- start_w = (col - self.crop_size) / 2
- end_h, end_w = start_h + self.crop_size, start_w + self.crop_size
- if self.is_color:
- im = im[start_h:end_h, start_w:end_w, :]
- else:
- im = im[start_h:end_h, start_w:end_w]
- if (self.is_train) and (np.random.randint(2) == 0):
- if self.is_color:
- im = im[:, ::-1, :]
- else:
- im = im[:, ::-1]
- return im
-
- def transform(self, im):
- im = self.resize(im, self.min_size)
- im = self.crop_and_flip(im)
- # transpose, swap channel, sub mean
- im = im.astype('float32')
- ImageTransformer.transformer(self, im)
- return im
-
- def load_image_from_string(self, data):
- flag = cv2.CV_LOAD_IMAGE_COLOR if self.is_color else cv2.CV_LOAD_IMAGE_GRAYSCALE
- im = cv2.imdecode(np.fromstring(data, np.uint8), flag)
- return im
-
- def transform_from_string(self, data):
- im = self.load_image_from_string(data)
- return self.transform(im)
-
- def load_image_from_file(self, file):
- flag = cv2.CV_LOAD_IMAGE_COLOR if self.is_color else cv2.CV_LOAD_IMAGE_GRAYSCALE
- im = cv2.imread(file, flag)
- return im
-
- def transform_from_file(self, file):
- im = self.load_image_from_file(file)
- return self.transform(im)
-
-
-class PILTransformer(ImageTransformer):
- """
- PILTransformer used PIL to process image.
- """
-
- def __init__(
- self,
- min_size=None,
- crop_size=None,
- transpose=(2, 0, 1), # transpose to C * H * W
- channel_swap=None,
- mean=None,
- is_train=True,
- is_color=True):
- ImageTransformer.__init__(self, transpose, channel_swap, mean, is_color)
- self.min_size = min_size
- self.crop_size = crop_size
- self.is_train = is_train
-
- def resize(self, im, min_size):
- row, col = im.size[:2]
- new_row, new_col = min_size, min_size
- if row > col:
- new_row = min_size * row / col
- else:
- new_col = min_size * col / row
- im = im.resize((new_row, new_col), Image.ANTIALIAS)
- return im
-
- def crop_and_flip(self, im):
- """
- Return cropped image.
- The size of the cropped image is inner_size * inner_size.
- """
- row, col = im.size[:2]
- start_h, start_w = 0, 0
- if self.is_train:
- start_h = np.random.randint(0, row - self.crop_size + 1)
- start_w = np.random.randint(0, col - self.crop_size + 1)
- else:
- start_h = (row - self.crop_size) / 2
- start_w = (col - self.crop_size) / 2
- end_h, end_w = start_h + self.crop_size, start_w + self.crop_size
- im = im.crop((start_h, start_w, end_h, end_w))
- if (self.is_train) and (np.random.randint(2) == 0):
- im = im.transpose(Image.FLIP_LEFT_RIGHT)
- return im
-
- def transform(self, im):
- im = self.resize(im, self.min_size)
- im = self.crop_and_flip(im)
- im = np.array(im, dtype=np.float32) # convert to numpy.array
- # transpose, swap channel, sub mean
- ImageTransformer.transformer(self, im)
- return im
-
- def load_image_from_string(self, data):
- im = Image.open(StringIO(data))
- return im
-
- def transform_from_string(self, data):
- im = self.load_image_from_string(data)
- return self.transform(im)
-
- def load_image_from_file(self, file):
- im = Image.open(file)
- return im
-
- def transform_from_file(self, file):
- im = self.load_image_from_file(file)
- return self.transform(im)
-
-
-def job(is_img_string, transformer, data_label_pack):
- (data, label) = data_label_pack
- if is_img_string:
- return transformer.transform_from_string(data), label
- else:
- return transformer.transform_from_file(data), label
-
-
-class MultiProcessImageTransformer(object):
- def __init__(self,
- procnum=10,
- resize_size=None,
- crop_size=None,
- transpose=(2, 0, 1),
- channel_swap=None,
- mean=None,
- is_train=True,
- is_color=True,
- is_img_string=True):
- """
- Processing image with multi-process. If it is used in PyDataProvider,
- the simple usage for CNN is as follows:
-
- .. code-block:: python
-
- def hool(settings, is_train, **kwargs):
- settings.is_train = is_train
- settings.mean_value = np.array([103.939,116.779,123.68], dtype=np.float32)
- settings.input_types = [
- dense_vector(3 * 224 * 224),
- integer_value(1)]
- settings.transformer = MultiProcessImageTransformer(
- procnum=10,
- resize_size=256,
- crop_size=224,
- transpose=(2, 0, 1),
- mean=settings.mean_values,
- is_train=settings.is_train)
-
-
- @provider(init_hook=hook, pool_size=20480)
- def process(settings, file_list):
- with open(file_list, 'r') as fdata:
- for line in fdata:
- data_dic = np.load(line.strip()) # load the data batch pickled by Pickle.
- data = data_dic['data']
- labels = data_dic['label']
- labels = np.array(labels, dtype=np.float32)
- for im, lab in settings.dp.run(data, labels):
- yield [im.astype('float32'), int(lab)]
-
- :param procnum: processor number.
- :type procnum: int
- :param resize_size: the shorter edge size of image after resizing.
- :type resize_size: int
- :param crop_size: the croping size.
- :type crop_size: int
- :param transpose: the transpose order, Paddle only allow C * H * W order.
- :type transpose: tuple or list
- :param channel_swap: the channel swap order, RGB or BRG.
- :type channel_swap: tuple or list
- :param mean: the mean values of image, per-channel mean or element-wise mean.
- :type mean: array, The dimension is 1 for per-channel mean.
- The dimension is 3 for element-wise mean.
- :param is_train: training peroid or testing peroid.
- :type is_train: bool.
- :param is_color: the image is color or gray.
- :type is_color: bool.
- :param is_img_string: The input can be the file name of image or image string.
- :type is_img_string: bool.
- """
-
- self.procnum = procnum
- self.pool = multiprocessing.Pool(procnum)
- self.is_img_string = is_img_string
- if cv2 is not None:
- self.transformer = CvTransformer(resize_size, crop_size, transpose,
- channel_swap, mean, is_train,
- is_color)
- else:
- self.transformer = PILTransformer(resize_size, crop_size, transpose,
- channel_swap, mean, is_train,
- is_color)
-
- def run(self, data, label):
- fun = functools.partial(job, self.is_img_string, self.transformer)
- return self.pool.imap_unordered(
- fun, six.moves.zip(data, label), chunksize=100 * self.procnum)
diff --git a/python/paddle/utils/make_model_diagram.py b/python/paddle/utils/make_model_diagram.py
deleted file mode 100644
index 52759d3ad230c3a5a5488a8bc46a2e8f8fae1025..0000000000000000000000000000000000000000
--- a/python/paddle/utils/make_model_diagram.py
+++ /dev/null
@@ -1,140 +0,0 @@
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# Generate dot diagram file for the given paddle model config
-# The generated file can be viewed using Graphviz (http://graphviz.org)
-
-from __future__ import print_function
-
-import six
-import sys
-import traceback
-
-from paddle.trainer.config_parser import parse_config
-
-
-def make_layer_label(layer_config):
- label = '%s type=%s' % (layer_config.name, layer_config.type)
- if layer_config.reversed:
- label += ' <=='
-
- label2 = ''
- if layer_config.active_type:
- label2 += 'act=%s ' % layer_config.active_type
- if layer_config.bias_parameter_name:
- label2 += 'bias=%s ' % layer_config.bias_parameter_name
-
- if label2:
- label += '\l' + label2
- return label
-
-
-def make_diagram(config_file, dot_file, config_arg_str):
- config = parse_config(config_file, config_arg_str)
- make_diagram_from_proto(config.model_config, dot_file)
-
-
-def make_diagram_from_proto(model_config, dot_file):
- # print >> sys.stderr, config
- name2id = {}
- f = open(dot_file, 'w')
- submodel_layers = set()
-
- def make_link(link):
- return 'l%s -> l%s;' % (name2id[link.layer_name],
- name2id[link.link_name])
-
- def make_mem(mem):
- s = ''
- if mem.boot_layer_name:
- s += 'l%s -> l%s;\n' % (name2id[mem.boot_layer_name],
- name2id[mem.layer_name])
- s += 'l%s -> l%s [style=dashed];' % (name2id[mem.layer_name],
- name2id[mem.link_name])
- return s
-
- print('digraph graphname {', file=f)
- print('node [width=0.375,height=0.25];', file=f)
- for i in six.moves.xrange(len(model_config.layers)):
- l = model_config.layers[i]
- name2id[l.name] = i
-
- i = 0
- for sub_model in model_config.sub_models:
- if sub_model.name == 'root':
- continue
- print('subgraph cluster_%s {' % i, file=f)
- print('style=dashed;', file=f)
- label = '%s ' % sub_model.name
- if sub_model.reversed:
- label += '<=='
- print('label = "%s";' % label, file=f)
- i += 1
- submodel_layers.add(sub_model.name)
- for layer_name in sub_model.layer_names:
- submodel_layers.add(layer_name)
- lid = name2id[layer_name]
- layer_config = model_config.layers[lid]
- label = make_layer_label(layer_config)
- print('l%s [label="%s", shape=box];' % (lid, label), file=f)
- print('}', file=f)
-
- for i in six.moves.xrange(len(model_config.layers)):
- l = model_config.layers[i]
- if l.name not in submodel_layers:
- label = make_layer_label(l)
- print('l%s [label="%s", shape=box];' % (i, label), file=f)
-
- for sub_model in model_config.sub_models:
- if sub_model.name == 'root':
- continue
- for link in sub_model.in_links:
- print(make_link(link), file=f)
- for link in sub_model.out_links:
- print(make_link(link), file=f)
- for mem in sub_model.memories:
- print(make_mem(mem), file=f)
-
- for i in six.moves.xrange(len(model_config.layers)):
- for l in model_config.layers[i].inputs:
- print(
- 'l%s -> l%s [label="%s"];' % (name2id[l.input_layer_name], i,
- l.input_parameter_name),
- file=f)
-
- print('}', file=f)
- f.close()
-
-
-def usage():
- print(
-        ("Usage: python make_model_diagram.py" +
- " CONFIG_FILE DOT_FILE [config_str]"),
- file=sys.stderr)
- exit(1)
-
-
-if __name__ == '__main__':
- if len(sys.argv) < 3 or len(sys.argv) > 4:
- usage()
-
- config_file = sys.argv[1]
- dot_file = sys.argv[2]
- config_arg_str = sys.argv[3] if len(sys.argv) == 4 else ''
-
- try:
- make_diagram(config_file, dot_file, config_arg_str)
- except:
- traceback.print_exc()
- raise
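A short sketch of driving the deleted diagram utility from Python and rendering its output; the config file name is a placeholder and the Graphviz dot binary is assumed to be installed.

# Sketch only: trainer_config.py is a placeholder model config.
import subprocess

from paddle.utils.make_model_diagram import make_diagram

make_diagram("trainer_config.py", "model.dot", config_arg_str="")
# Render the generated dot file to PNG with Graphviz.
subprocess.check_call(["dot", "-Tpng", "model.dot", "-o", "model.png"])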
diff --git a/python/paddle/utils/merge_model.py b/python/paddle/utils/merge_model.py
deleted file mode 100644
index b74649e93640c3600636034d58792b8d12dffeda..0000000000000000000000000000000000000000
--- a/python/paddle/utils/merge_model.py
+++ /dev/null
@@ -1,73 +0,0 @@
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import gzip
-import struct
-import os
-
-from paddle.trainer_config_helpers.layers import LayerOutput
-from paddle.v2.parameters import Parameters
-from paddle.proto import ModelConfig_pb2
-from paddle.v2.topology import Topology
-
-
-def merge_v2_model(net, param_file, output_file):
- '''Merge the model config and parameters into one file.
-
-    The model configuration file (a .py file) describes the model
-    structure. The parameters file (a .tar.gz archive) stores the
-    parameters of the model.
-
-    @param net The output layer of the network for inference.
-    @param param_file Path of the parameters file (.tar.gz) saved by
-                      the v2 API.
-    @param output_file Path of the merged file to be generated.
-
- Usage:
-
- from paddle.utils.merge_model import merge_v2_model
- # import your network configuration
- from example_net import net_conf
-
- net = net_conf(is_predict=True)
- param_file = './param_pass_00000.tar.gz'
- output_file = './output.paddle'
-
- merge_v2_model(net, param_file, output_file)
-
- '''
-
- assert isinstance(net, LayerOutput), \
- "The net should be the output of the network for inference"
- assert os.path.exists(param_file), \
-        "The model parameters file %s does not exist" % (param_file)
-
- model_proto = Topology(net).proto()
- assert isinstance(model_proto, ModelConfig_pb2.ModelConfig)
-
- with gzip.open(param_file) as f:
- params = Parameters.from_tar(f)
-
- if os.path.exists(output_file):
- os.remove(output_file)
-
-    with open(output_file, 'wb') as f:
- param_names = [param.name for param in model_proto.parameters]
- conf_str = model_proto.SerializeToString()
- f.write(struct.pack('q', len(conf_str)))
- f.write(conf_str)
- for pname in param_names:
- params.serialize(pname, f)
-
-    print('Generated %s successfully!' % (output_file))
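For reference, a sketch of reading back the header of a file produced by merge_v2_model: an 8-byte native-endian length field followed by the serialized ModelConfig protobuf. The parameter payload written by Parameters.serialize is left undecoded here, and the path is a placeholder.

# Sketch only; './output.paddle' is a placeholder path.
import struct

from paddle.proto import ModelConfig_pb2

with open("./output.paddle", "rb") as f:
    (conf_len,) = struct.unpack("q", f.read(8))      # 8-byte config length
    model_proto = ModelConfig_pb2.ModelConfig()
    model_proto.ParseFromString(f.read(conf_len))    # serialized model config
    print([p.name for p in model_proto.parameters])  # parameter names, in write order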
diff --git a/python/paddle/utils/predefined_net.py b/python/paddle/utils/predefined_net.py
deleted file mode 100644
index 2801f4877c079615239b92be146b3e33df16b37f..0000000000000000000000000000000000000000
--- a/python/paddle/utils/predefined_net.py
+++ /dev/null
@@ -1,381 +0,0 @@
-# Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserved
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import numpy as np
-import six
-import os
-from paddle.trainer.config_parser import *
-from paddle.utils.preprocess_img import \
- ImageClassificationDatasetCreater
-from paddle.trainer_config_helpers import *
-
-
-def image_data(data_dir,
- processed_image_size,
- overwrite=False,
- color=True,
- train_list="batches/train.list",
- test_list="batches/test.list",
- meta_file="batches/batches.meta",
- use_jpeg=1):
- """
- Predefined image data provider for image classification.
- train_list: a text file containing a list of training batches.
- test_list: a text file containing a list of test batches.
-    processed_image_size: all the input images will be resized to this size.
-        If an image is not square, its shorter edge is resized to this size
-        and the aspect ratio is preserved.
-    color: whether the images are color or grayscale.
-    meta_file: the path of the meta file that stores the mean image file and
-        other dataset information, such as the size of the images,
-        the size of the mean image, and the number of classes.
-    async_load_data: whether to load image data asynchronously.
-    """
- data_creator = ImageClassificationDatasetCreater(
- data_dir, processed_image_size, color)
- batch_data_dir = data_dir
- train_list = os.path.join(batch_data_dir, train_list)
- test_list = os.path.join(batch_data_dir, test_list)
- meta_path = os.path.join(batch_data_dir, meta_file)
- image_size = processed_image_size
- conf = np.load(meta_path)
- mean_image_size = conf["mean_image_size"]
- is_color = conf["color"]
- num_classes = conf["num_classes"]
- color_string = "color" if is_color else "gray"
-
- args = {
- 'meta': meta_path,
- 'mean_img_size': mean_image_size,
- 'img_size': image_size,
- 'num_classes': num_classes,
- 'use_jpeg': use_jpeg != 0,
- 'color': color_string
- }
-
- define_py_data_sources2(
- train_list,
- test_list,
- module='image_provider',
- obj='processData',
- args=args)
- return {
- "image_size": image_size,
- "num_classes": num_classes,
- "is_color": is_color
- }
-
-
-def get_extra_layer_attr(drop_rate):
- if drop_rate == 0:
- return None
- else:
- return ExtraLayerAttribute(drop_rate=drop_rate)
-
-
-def image_data_layers(image_size, num_classes, is_color=False,
- is_predict=False):
- """
- Data layers for image classification.
- image_size: image size.
- num_classes: num of classes.
- is_color: whether the input images are color.
- is_predict: whether the network is used for prediction.
- """
- num_image_channels = 3 if is_color else 1
- data_input = data_layer("input",
- image_size * image_size * num_image_channels)
- if is_predict:
- return data_input, None, num_image_channels
- else:
- label_input = data_layer("label", 1)
- return data_input, label_input, num_image_channels
-
-
-def simple_conv_net(data_conf, is_color=False):
- """
-    A wrapper for a simple network for MNIST digit recognition.
-    It contains two convolutional layers, one fully connected layer, and
-    one softmax layer.
- data_conf is a dictionary with the following keys:
- image_size: image size.
- num_classes: num of classes.
- is_color: whether the input images are color.
- """
- for k, v in six.iteritems(data_conf):
- globals()[k] = v
- data_input, label_input, num_image_channels = \
- image_data_layers(image_size, num_classes, is_color, is_predict)
- filter_sizes = [5, 5]
- num_channels = [32, 64]
- strides = [1, 1]
- fc_dims = [500]
- conv_bn_pool1 = img_conv_bn_pool(
- name="g1",
- input=data_input,
- filter_size=filter_sizes[0],
- num_channel=num_image_channels,
- num_filters=num_channels[0],
- conv_stride=1,
- conv_padding=0,
- pool_size=3,
- pool_stride=2,
- act=ReluActivation())
- conv_bn_pool2 = img_conv_bn_pool(
- name="g2",
- input=conv_bn_pool1,
- filter_size=filter_sizes[1],
- num_channel=num_channels[0],
- num_filters=num_channels[1],
- conv_stride=1,
- conv_padding=0,
- pool_size=3,
- pool_stride=2,
- act=ReluActivation())
- fc3 = fc_layer(
- name="fc3", input=conv_bn_pool2, dim=fc_dims[0], act=ReluActivation())
- fc3_dropped = dropout_layer(name="fc3_dropped", input=fc3, dropout_rate=0.5)
- output = fc_layer(
- name="output",
- input=fc3_dropped,
- dim=fc_dims[0],
- act=SoftmaxActivation())
- if is_predict:
- end_of_network(output)
- else:
- cost = classify(name="cost", input=output, label=label_input)
- end_of_network(cost)
-
-
-def conv_layer_group(prefix_num,
- num_layers,
- input,
- input_channels,
- output_channels,
- drop_rates=[],
- strides=[],
- with_bn=[]):
- """
-    A group of convolutional layers and batch normalization layers,
-    followed by one pooling layer.
-    It is used in the VGG network for image classification.
-    prefix_num: the prefix number of the layer names.
-        For example, if prefix_num = 1, the first convolutional layer's
-        name will be conv1_1.
- num_layers: number of the convolutional layers.
- input: the name of the input layer.
- input_channels: the number of channels of the input feature map.
- output_channels: the number of channels of the output feature map.
- drop_rates: the drop rates of the BN layers. It will be all zero by default.
- strides: the stride of the convolution for the layers.
- It will be all 1 by default.
- with_bn: whether to use Batch Normalization for Conv layers.
- By default, it is all false.
- """
- if len(drop_rates) == 0: drop_rates = [0] * num_layers
- if len(strides) == 0: strides = [1] * num_layers
- if len(with_bn) == 0: with_bn = [False] * num_layers
- assert (len(drop_rates) == num_layers)
- assert (len(strides) == num_layers)
-
- for i in range(1, num_layers + 1):
- if i == 1:
- i_conv_in = input
- else:
- i_conv_in = group_output
- i_channels_conv = input_channels if i == 1 else output_channels
- conv_act = LinearActivation() if with_bn[i - 1] else ReluActivation()
- conv_output = img_conv_layer(
- name="conv%d_%d" % (prefix_num, i),
- input=i_conv_in,
- filter_size=3,
- num_channels=i_channels_conv,
- num_filters=output_channels,
- stride=strides[i - 1],
- padding=1,
- act=conv_act)
- if with_bn[i - 1]:
- bn = batch_norm_layer(
- name="conv%d_%d_bn" % (prefix_num, i),
- input=conv_output,
- num_channels=output_channels,
- act=ReluActivation(),
- layer_attr=get_extra_layer_attr(drop_rate=drop_rates[i - 1]))
- group_output = bn
- else:
- group_output = conv_output
- pool = img_pool_layer(
- name="pool%d" % prefix_num,
- input=group_output,
- pool_size=2,
- num_channels=output_channels,
- stride=2)
- return pool
-
-
-def vgg_conv_net(image_size,
- num_classes,
- num_layers,
- channels,
- strides,
- with_bn,
- fc_dims,
- drop_rates,
- drop_rates_fc=[],
- is_color=True,
- is_predict=False):
- """
-    A wrapper for a VGG network for image classification.
-    It is a set of convolutional groups followed by several fully
-    connected layers and a cross-entropy classification loss.
-    The detailed architecture can be found in the paper:
-    Very Deep Convolutional Networks for Large-Scale Visual Recognition
-    http://www.robots.ox.ac.uk/~vgg/research/very_deep/
- image_size: image size.
- num_classes: num of classes.
- num_layers: the number of layers for all the convolution groups.
- channels: the number of output filters for all the convolution groups.
- with_bn: whether each layer of a convolution group is followed by a
- batch normalization.
- drop_rates: the dropout rates for all the convolutional layers.
- fc_dims: the dimension for all the fully connected layers.
- is_color: whether the input images are color.
- """
- data_input, label_input, num_image_channels = \
- image_data_layers(image_size, num_classes, is_color, is_predict)
- assert (len(num_layers) == len(channels))
- assert (len(num_layers) == len(strides))
- assert (len(num_layers) == len(with_bn))
- num_fc_layers = len(fc_dims)
- assert (num_fc_layers + 1 == len(drop_rates_fc))
-
- for i in range(len(num_layers)):
- input_layer = data_input if i == 0 else group_output
- input_channels = 3 if i == 0 else channels[i - 1]
- group_output = conv_layer_group(
- prefix_num=i + 1,
- num_layers=num_layers[i],
- input=input_layer,
- input_channels=input_channels,
- output_channels=channels[i],
- drop_rates=drop_rates[i],
- strides=strides[i],
- with_bn=with_bn[i])
- conv_output_name = group_output
- if drop_rates_fc[0] != 0.0:
- dropped_pool_name = "pool_dropped"
- conv_output_name = dropout_layer(
- name=dropped_pool_name,
- input=conv_output_name,
- dropout_rate=drop_rates_fc[0])
- for i in range(len(fc_dims)):
- input_layer_name = conv_output_name if i == 0 else fc_output
- active_type = LinearActivation() if i == len(
- fc_dims) - 1 else ReluActivation()
- drop_rate = 0.0 if i == len(fc_dims) - 1 else drop_rates_fc[i + 1]
- fc_output = fc_layer(
- name="fc%d" % (i + 1),
- input=input_layer_name,
- size=fc_dims[i],
- act=active_type,
- layer_attr=get_extra_layer_attr(drop_rate))
- bn = batch_norm_layer(
- name="fc_bn",
- input=fc_output,
- num_channels=fc_dims[len(fc_dims) - 1],
- act=ReluActivation(),
- layer_attr=get_extra_layer_attr(drop_rate=drop_rates_fc[-1]))
- output = fc_layer(
- name="output", input=bn, size=num_classes, act=SoftmaxActivation())
- if is_predict:
- outputs(output)
- else:
- cost = classification_cost(name="cost", input=output, label=label_input)
- outputs(cost)
-
-
-def vgg16_conv_net(image_size, num_classes, is_color=True, is_predict=False):
- """
-    A wrapper for a 16-layer VGG network for image classification.
-    The detailed architecture can be found in the paper:
-    Very Deep Convolutional Networks for Large-Scale Visual Recognition
-    http://www.robots.ox.ac.uk/~vgg/research/very_deep/
- image_size: image size.
- num_classes: num of classes.
- is_color: whether the input images are color.
- """
- vgg_conv_net(image_size, num_classes,
- num_layers=[2, 2, 3, 3, 3],
- channels=[64, 128, 256, 512, 512],
- strides=[[], [], [], [], []],
- with_bn=[[False, True], [False, True], [False, False, True], \
- [False, False, True], [False, False, True]],
- drop_rates=[[]] * 5,
- drop_rates_fc=[0.0, 0.5, 0.5],
- fc_dims=[4096, 4096],
- is_predict=is_predict)
-
-
-def small_vgg(data_conf, is_predict=False):
- """
-    A wrapper for a small VGG network for CIFAR-10 image classification.
-    The detailed architecture can be found in the blog post
-    "92.45% on CIFAR-10 in Torch":
-    http://torch.ch/blog/2015/07/30/cifar.html
- Due to the constraints of CuDNN, it only has four convolutional groups
- rather than five.
- Thus, it only achieves 91.2% test accuracy and 98.1% training accuracy.
- data_conf is a dictionary with the following keys:
- image_size: image size.
- num_classes: num of classes.
- is_color: whether the input images are color.
- """
- for k, v in six.iteritems(data_conf):
- globals()[k] = v
- vgg_conv_net(image_size, num_classes,
- num_layers=[2, 2, 3, 3],
- channels=[64, 128, 256, 512],
- strides=[[], [], [], []],
- with_bn=[[True, True], [True, True], [True, True, True], \
- [True, True, True]],
- drop_rates=[[0.3, 0.0], [0.4, 0.0],
- [0.4, 0.4, 0.0], [0.4, 0.4, 0.0]],
- drop_rates_fc=[0.5, 0.5],
- fc_dims=[512],
- is_predict=is_predict)
-
-
-def training_settings(learning_rate=0.1,
- batch_size=128,
- algorithm="sgd",
- momentum=0.9,
- decay_rate=0.001):
- """
- Training settings.
- learning_rate: learning rate of the training.
- batch_size: the size of each training batch.
- algorithm: training algorithm, can be
- - sgd
- - adagrad
- - adadelta
- - rmsprop
- momentum: momentum of the training algorithm.
- decay_rate: weight decay rate.
- """
- Settings(
- algorithm=algorithm,
- batch_size=batch_size,
- learning_rate=learning_rate / float(batch_size))
- default_momentum(momentum)
- default_decay_rate(decay_rate * batch_size)
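To show how the helpers in the deleted predefined_net.py fit together, here is a sketch of a v1-style trainer config; the data directory is a placeholder and it assumes the batch and meta files produced by ImageClassificationDatasetCreater already exist under it.

# Sketch only; './data/cifar10' is a placeholder prepared dataset directory.
from paddle.utils.predefined_net import (image_data, small_vgg,
                                         training_settings)

# Register the data provider and read image_size/num_classes/is_color back.
data_conf = image_data(data_dir="./data/cifar10", processed_image_size=32)
# Optimizer settings: SGD with momentum and weight decay.
training_settings(learning_rate=0.1, batch_size=128, algorithm="sgd",
                  momentum=0.9, decay_rate=0.001)
# Build the small VGG topology for training (is_predict=True for inference).
small_vgg(data_conf, is_predict=False)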