diff --git a/.gitignore b/.gitignore index 61a80a88edb71e9ba4192f84ab7821ba139bb9ce..60b49517f73affaf8da00009ba11684dc1a352c0 100644 --- a/.gitignore +++ b/.gitignore @@ -4,3 +4,4 @@ *.pyc *~ *.vscode +*.idea \ No newline at end of file diff --git a/PaddleCV/Paddle3D/PointNet++/.gitignore b/PaddleCV/Paddle3D/PointNet++/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..b278bea2f2de85ff3778008fe1302b9e3fdfba81 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/.gitignore @@ -0,0 +1,8 @@ +checkpoints* +ext_op/src/*.o +ext_op/src/*.so +*.log* +dataset/Indoor3DSemSeg/* +!dataset/Indoor3DSemSeg/*.sh +dataset/ModelNet40/* +!dataset/ModelNet40/*.sh diff --git a/PaddleCV/Paddle3D/PointNet++/README.md b/PaddleCV/Paddle3D/PointNet++/README.md new file mode 100644 index 0000000000000000000000000000000000000000..cb5842bce83e7bc0f7a510fb185e4972670b04ac --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/README.md @@ -0,0 +1,247 @@ +# PointNet++ 分类和语义分割模型 + +--- +## 内容 + +- [简介](#简介) +- [快速开始](#快速开始) +- [参考文献](#参考文献) +- [版本更新](#版本更新) + +## 简介 + +[PointNet++](https://arxiv.org/abs/1706.02413) 是 Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas 等人提出的,针对3D数据进行分类和语义分割的模型。该模型基于PointNet进行了拓展, 使用分层点集特征学习来提取点云数据的特征,首先通过对输入point进行分组和采样提取局部区域模式,然后使用多层感知器来获取点特征。PointNet++ 还将点特征传播用于语义分割模型,采用基于距离插值和跨级跳转连接的分层传播策略,对点特征进行向上采样,获得所有原始点的点特征。 + + +网络结构如下所示: + +

+用于点云分类和分割的 PointNet++ 网络结构
+</p>
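+
+As a rough illustration, the point feature propagation mentioned in the introduction (distance-based interpolation over the three nearest neighbors) can be sketched with the `three_nn` and `three_interp` custom operators provided under `ext_op`. This is a minimal sketch only, not the actual model code imported from `models` by the train/eval scripts (not shown here); the helper name `feature_propagation` and its arguments are illustrative.
+
+```
+import paddle.fluid as fluid
+from ext_op import three_nn, three_interp
+
+def feature_propagation(unknown_xyz, known_xyz, known_feat, eps=1e-10):
+    # unknown_xyz: [B, N, 3] coordinates to propagate features to
+    # known_xyz:   [B, M, 3] coordinates that already carry features
+    # known_feat:  [B, M, C] features of the known points
+    dist, idx = three_nn(unknown_xyz, known_xyz, eps=eps)    # both outputs are [B, N, 3]
+    recip = 1.0 / (dist + eps)                               # inverse-distance weights
+    norm = fluid.layers.reduce_sum(recip, dim=2, keep_dim=True)
+    weight = recip / fluid.layers.expand(norm, [1, 1, 3])    # normalize so each row sums to 1
+    return three_interp(known_feat, weight, idx)             # interpolated features, [B, N, C]
+```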
+ +集合抽象层是网络的基本模块,每个集合抽象层由三个关键层构成:采样层、分组层和特征提取层。 + +- **采样层**:采样层使用最远点采样(FPS)的方法,从输入点中选择一组点,它定义了局部区域的中心。与随机抽样的方法相比,在质心数目相同的情况下,FPS可以更好的覆盖整个点集。 + +- **分组层**:分组层通过寻找中心体周围的“邻近”点来构造局部区域集。在度量空间采样的点集中,点的邻域由度量距离定义。这种方法被称为“query ball”,它使得局部区域的特征在空间上更加一般化。 + +- **特征提取层**: 特征提取层使用 mini-PointNet 对分组层给出的各个区域进行特征提取,获得局部特征。 + + + +**注意:** PointNet++ 模型构建依赖于自定义的 C++ 算子,目前仅支持GPU设备在Linux/Unix系统上进行编译,本模型**不能运行在Windows系统或CPU设备上** + + +## 快速开始 + +### 安装 + +**安装 [PaddlePaddle](https://github.com/PaddlePaddle/Paddle):** + +在当前目录下运行样例代码需要 PaddelPaddle Fluid [develop每日版本](https://www.paddlepaddle.org.cn/install/doc/tables#多版本whl包列表-dev-11)或使用PaddlePaddle [develop分支](https://github.com/PaddlePaddle/Paddle/tree/develop)源码编译安装. + +为了使自定义算子与paddle版本兼容,建议您**优先使用源码编译paddle**,源码编译方式请参考[编译安装](https://www.paddlepaddle.org.cn/install/doc/source/ubuntu) + + +### 编译自定义OP + +请确认Paddle版本为PaddelPaddle Fluid develop每日版本或基于Paddle develop分支源码编译安装,**推荐使用源码编译安装的方式**。 + +自定义OP编译方式如下: + + 进入 `ext_op/src` 目录,执行编译脚本 + ``` + cd ext_op/src + sh make.sh + ``` + + 成功编译后,`ext_op/src` 目录下将会生成 `pointnet2_lib.so` + + 执行下列操作,确保自定义算子编译正确: + + ``` + # 设置动态库的路径到 LD_LIBRARY_PATH 中 + export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'` + + # 回到 ext_op 目录,添加 PYTHONPATH + cd .. + export PYTHONPATH=$PYTHONPATH:`pwd` + + # 运行单测 + python tests/test_farthest_point_sampling_op.py + python tests/test_gather_point_op.py + python tests/test_group_points_op.py + python tests/test_query_ball_op.py + python tests/test_three_interp_op.py + python tests/test_three_nn_op.py + ``` + 单测运行成功会输出提示信息,如下所示: + + ``` + . + ---------------------------------------------------------------------- + Ran 1 test in 13.205s + + OK + ``` + +**说明:** 更多关于自定义OP的编译说明,请参考[自定义OP编译](./ext_op/README.md) + + +### 数据准备 + +**ModelNet40 数据集:** + +PointNet++ 分类模型在 [ModelNet40 数据集](https://shapenet.cs.stanford.edu/media/modelnet40_ply_hdf5_2048.zip)上进行训练,我们提供了数据集下载脚本: + +``` +cd dataset/ModelNet40 +sh download.sh +``` + +数据目录结构如下所示: + +``` + dataset/ModelNet40/modelnet40_ply_hdf5_2048 + ├── train_files.txt + ├── test_files.txt + ├── shape_names.txt + ├── ply_data_train0.h5 + ├── ply_data_train_0_id2file.json + ├── ply_data_test0.h5 + ├── ply_data_test_0_id2file.json + | ... + +``` + +**Indoor3DSemSeg 数据集:** + +PointNet++ 分割模型在 [Indoor3DSemSeg 数据集](https://shapenet.cs.stanford.edu/media/indoor3d_sem_seg_hdf5_data.zip)上进行训练,我们提供了数据集下载脚本: + +``` +cd dataset/Indoor3DSemSeg +sh download.sh +``` + +数据目录结构如下所示: + +``` + dataset/Indoor3DSemSeg/ + ├── all_files.txt + ├── room_filelist.txt + ├── ply_data_all_0.h5 + ├── ply_data_all_1.h5 + | ... 
+ +``` + +### 训练 + +分类/分割模型默认使用单卡训练,在启动训练前请指定单卡GPU,并将动态库的路径添加到 LD_LIBRARY_PATH 中: + +``` +# 指定0号卡进行GPU训练 +export CUDA_VISIBLE_DEVICES=0 + +# 设置动态库的路径到 LD_LIBRARY_PATH 中 +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'` + +``` + +**分类模型:** + +可通过如下方式启动 PointNet++ 分类模型的训练: + +``` +# 开始训练 +python train_cls.py --model=MSG --batch_size=16 --save_dir=checkpoints_msg_cls +``` + +我们同时提供了训练分类模型的“快速开始”脚本: + +``` +sh scripts/train_cls.sh +``` + +**语义分割模型:** + +可通过如下方式启动 PointNet++ 语义分割模型的训练: + +``` +# 开始训练 +python train_seg.py --model=MSG --batch_size=32 --save_dir=checkpoints_msg_seg +``` + +我们同时提供了训练语义分割模型的“快速开始”脚本: + +``` +sh scripts/train_seg.sh +``` + +### 模型评估 + + +分类/分割模型默认使用单卡评估,首先指定单卡GPU,并将动态库的路径添加到 LD_LIBRARY_PATH 中: + +``` +# 指定0号卡进行GPU评估 +export CUDA_VISIBLE_DEVICES=0 + +# 设置动态库的路径到 LD_LIBRARY_PATH 中 +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'` + +``` + +**分类模型:** + +可通过如下方式启动 PointNet++ 分类模型的评估: + +``` +# 对给定权重进行评估 +python eval_cls.py --model=MSG --weights=checkpoints_cls/200 +``` + +我们同时提供了评估分类模型的“快速开始”脚本: + +``` +sh scripts/eval_cls.sh +``` + +分类模型的评估结果如下所示: + +| model | Top-1 | download | +| :----- | :---: | :---: | +| SSG(Single-Scale Group) | 89.3 | [model](https://paddlemodels.bj.bcebos.com/Paddle3D/pointnet2_ssg_cls.tar) | +| MSG(Multi-Scale Group) | 90.0 | [model](https://paddlemodels.bj.bcebos.com/Paddle3D/pointnet2_msg_cls.tar) | + +**语义分割模型:** + +可通过如下方式启动 PointNet++ 语义分割模型的评估: + +``` +# 对给定权重进行评估 +python eval_seg.py --model=MSG --weights=checkpoints_seg/200 +``` + +我们同时提供了评估语义分割模型的“快速开始”脚本: + +``` +sh scripts/eval_seg.sh +``` + +语义分割模型的评估结果如下所示: + +| model | Top-1 | download | +| :----- | :---: | :---: | +| SSG(Single-Scale Group) | 86.1 | [model](https://paddlemodels.bj.bcebos.com/Paddle3D/pointnet2_ssg_seg.tar) | +| MSG(Multi-Scale Group) | 86.6 | [model](https://paddlemodels.bj.bcebos.com/Paddle3D/pointnet2_msg_seg.tar) | + +## 参考文献 + +- [PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space](https://arxiv.org/abs/1706.02413), Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. +- [PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation](https://www.semanticscholar.org/paper/PointNet%3A-Deep-Learning-on-Point-Sets-for-3D-and-Qi-Su/d997beefc0922d97202789d2ac307c55c2c52fba), Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. + +## 版本更新 + +- 11/2019, 新增 PointNet++ 分类和语义分割模型。 diff --git a/PaddleCV/Paddle3D/PointNet++/README_en.md b/PaddleCV/Paddle3D/PointNet++/README_en.md new file mode 100644 index 0000000000000000000000000000000000000000..f42fe9dbf5037a685fb0afbd920465fbcf8e4406 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/README_en.md @@ -0,0 +1,253 @@ +# PointNet++ classification and semantic segmentation model + +--- +## Table of Contents + +- [Introduction](#introduction) +- [Quick Start](#quick-start) +- [Reference](#reference) +- [Update](#update) + +## Introduction + +[PointNet++](https://arxiv.org/abs/1706.02413) is a point classification and segmentation model for 3D data proposed by Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. +This model is a extension work based on PointNet extract features of point clouds data with hierarchical point set feature learning, perform set abstractions by grouping and sampling points at first to extract +local region patterns, then use multi-layer perceptron to get point features. 
PointNet++ also uses point feature propagation for the semantic segmentation model: it adopts a hierarchical
+propagation strategy with distance-based interpolation and across-level skip links, and upsamples point features to obtain features for all the original points.
+
+The network structure is shown below.
+
+<p align="center">

+PointNet++ architecture for Point set Segmentation and Classification
+</p>
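+
+As a rough illustration of the sampling-grouping-PointNet pipeline described below, one set abstraction level can be sketched with the custom operators provided under `ext_op`. This is a simplified sketch only, not the actual model code imported from `models` by the train/eval scripts (a full implementation typically also recenters grouped points relative to their centroid and concatenates any input point features); the helper name `set_abstraction` and its parameters are illustrative.
+
+```
+import paddle.fluid as fluid
+from ext_op import farthest_point_sampling, gather_point, query_ball, group_points
+
+def set_abstraction(xyz, npoint, radius, nsample, mlp_channels):
+    # xyz: [B, N, 3] input point coordinates
+    # Sampling layer: pick npoint centroids by farthest point sampling
+    centroid_idx = farthest_point_sampling(xyz, npoint)             # [B, npoint]
+    new_xyz = gather_point(xyz, centroid_idx)                       # [B, npoint, 3]
+    # Grouping layer: ball query around each centroid
+    group_idx = query_ball(xyz, new_xyz, radius, nsample)           # [B, npoint, nsample]
+    grouped_xyz = group_points(xyz, group_idx)                      # [B, npoint, nsample, 3]
+    # PointNet layer: shared MLP (1x1 conv) per local region, then max pooling over the region
+    feat = fluid.layers.transpose(grouped_xyz, perm=[0, 3, 1, 2])   # [B, 3, npoint, nsample]
+    for channels in mlp_channels:
+        feat = fluid.layers.conv2d(feat, num_filters=channels, filter_size=1, act='relu')
+    new_feat = fluid.layers.reduce_max(feat, dim=3)                 # [B, C, npoint]
+    return new_xyz, new_feat
+```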
+ +Set Abstraction layer is the basic module of the network, each set abstraction layer is made of three key layers:Sampling layer, Grouping layer and PointNet layer. + +- **Sample layer**: Sampling layer uses farthest point sampling(FPS) to select a set of points from input points, which defines the centroids of local regions. Compared with random sampling, it has better converage of the entire point set given the same number of centroids. + +- **Grouping layer**: Grouping layer constructs local region sets by finding "neighboring" points around the centroids. In a point set sampled from a metric space, the neighborhood of a point is defined by metric distance. This method is called "ball query", which make local region feature more generalizable across space. + +- **PointNet layer**: PointNet layer uses a mini-PointNet to encode local region patterns into feature vectors. + + +**NOTE:** PointNet++ model builds base on custom C++ operations, which can only support GPU devices and compiled on Linux/Unix currently, this model **cannot run on Windows or CPU deivices**. + + +## Quick Start + +### Installation + +**Install [PaddlePaddle](https://github.com/PaddlePaddle/Paddle):** + +Running sample code in this directory requires PaddelPaddle Fluid develop [daily version wheel](https://www.paddlepaddle.org.cn/install/doc/tables#多版本whl包列表-dev-11) or compiled from PaddlePaddle [develop branch](https://github.com/PaddlePaddle/Paddle/tree/develop). + +In order to make the custom OP compatible with the Paddle version, it is recommended to **compile from PaddlePaddle develop branch source code**. For source code compilation, please refer to [Compile and Install](https://www.paddlepaddle.org.cn/install/doc/source/ubuntu) + +### Compile custom operations + +Please make sure you are using PaddlePaddle Fluid develop daily version or compiled from PaddlePaddle develop branch. +Custom operations can be compiled as follows: + +``` +cd ext_op/src +sh make.sh +``` + +If the compilation is finished successfully, `pointnet2_lib.so` will be generated under `exr_op/src`. + +Make sure custom operations pass as follows: + +``` +# export paddle libs to LD_LIBRARY_PATH for custom op library +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'` + +# back to ext_op and add PYTHONPATH +cd .. +export PYTHONPATH=$PYTHONPATH:`pwd` + +# Run unit tests +python test/test_farthest_point_sampling_op.py +python test/test_gather_point_op.py +python test/test_group_points_op.py +python test/test_query_ball_op.py +python test/test_three_interp_op.py +python test/test_three_nn_op.py +``` +The prompt message for successful running is as follows: + +``` +. +---------------------------------------------------------------------- +Ran 1 test in 13.205s + +OK +``` + +### Data preparation + +**ModelNet40 dataset:** + +PointNet++ classification models are trained on [ModelNet40 dataset](https://shapenet.cs.stanford.edu/media/modelnet40_ply_hdf5_2048.zip), we also provide download scripts as follows: + +``` +cd dataset/ModelNet40 +sh download.sh +``` + +The dataset catalog structure is as follows: + +``` + dataset/ModelNet40/modelnet40_ply_hdf5_2048 + ├── train_files.txt + ├── test_files.txt + ├── shape_names.txt + ├── ply_data_train0.h5 + ├── ply_data_train_0_id2file.json + ├── ply_data_test0.h5 + ├── ply_data_test_0_id2file.json + | ... 
+ +``` + +**Indoor3DSemSeg dataset:** + +PointNet++ semantic segmentation models are trained on [Indoor3DSemSeg dataset](https://shapenet.cs.stanford.edu/media/indoor3d_sem_seg_hdf5_data.zip), we also provide download scripts as follows: + +``` +cd dataset/Indoor3DSemSeg +sh download.sh +``` + +The dataset catalog structure is as follows: + +``` + dataset/Indoor3DSemSeg/ + ├── all_files.txt + ├── room_filelist.txt + ├── ply_data_all_0.h5 + ├── ply_data_all_1.h5 + | ... + +``` + +### Training + +**Classification Model:** + +For PointNet++ classification model, training can be start as follows: + +``` +# For single GPU deivces +export CUDA_VISIBLE_DEVICES=0 + +# enable gc to save GPU memory +export FLAGS_fast_eager_deletion_mode=1 +export FLAGS_eager_delete_tensor_gb=0.0 +export FLAGS_fraction_of_gpu_memory_to_use=0.98 + +# export paddle libs to LD_LIBRARY_PATH for custom op library +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'` + +# start training +python train_cls.py --model=MSG --batch_size=16 --save_dir=checkpoints_msg_cls +``` + +We also provided quick start script for training classification model as follows: + +``` +sh scripts/train_cls.sh +``` + +**Semantic Segmentation Model:** + +For PointNet++ semantic segmentation model, training can be start as follows: + +``` +# For single GPU deivces +export CUDA_VISIBLE_DEVICES=0 + +# enable gc to save GPU memory +export FLAGS_fast_eager_deletion_mode=1 +export FLAGS_eager_delete_tensor_gb=0.0 +export FLAGS_fraction_of_gpu_memory_to_use=0.98 + +# export paddle libs to LD_LIBRARY_PATH for custom op library +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'` + +# start training +python train_seg.py --model=MSG --batch_size=32 --save_dir=checkpoints_msg_seg +``` + +We also provided quick start scripts for training semantic segmentation model as follows: + +``` +sh scripts/train_seg.sh +``` + +### Evaluation + +**Classification Model:** + +For PointNet++ classification model, evaluation can be start as follows: + +``` +# For single GPU deivces +export CUDA_VISIBLE_DEVICES=0 + +# export paddle libs to LD_LIBRARY_PATH for custom op library +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'` + +# start evaluation with given weights +python eval_cls.py --model=MSG --weights=checkpoints_cls/200 +``` + +We also provided quick start script for training classification model as follows: + +``` +sh scripts/eval_cls.sh +``` + +Classification model evaluation result is shown as below: + +| model | Top-1 | download | +| :----- | :---: | :---: | +| SSG(Single-Scale Group) | 89.3 | [model]() | +| MSG(Multi-Scale Group) | 90.0 | [model]() | + +**Semantic Segmentation Model:** + +For PointNet++ semantic segmentation model, evaluation can be start as follows: + +``` +# For single GPU deivces +export CUDA_VISIBLE_DEVICES=0 + +# export paddle libs to LD_LIBRARY_PATH for custom op library +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'` + +# start evaluation with given weights +python eval_seg.py --model=MSG --weights=checkpoints_seg/200 +``` + +We also provided quick start scripts for training semantic segmentation model as follows: + +``` +sh scripts/eval_seg.sh +``` + +Semantic segmentation model evaluation result is shown as below: + +| model | Top-1 | download | +| :----- | :---: | :---: | +| SSG(Single-Scale Group) | 86.1 | [model]() | +| 
MSG(Multi-Scale Group) | 86.8 | [model]() | + +## Reference + +- [PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space](https://arxiv.org/abs/1706.02413), Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. +- [PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation](https://www.semanticscholar.org/paper/PointNet%3A-Deep-Learning-on-Point-Sets-for-3D-and-Qi-Su/d997beefc0922d97202789d2ac307c55c2c52fba), Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. + +## Update + +- 11/2019, Add PointNet++ classification and semantic segmentation model. diff --git a/PaddleCV/Paddle3D/PointNet++/data/__init__.py b/PaddleCV/Paddle3D/PointNet++/data/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..939af885c1485aaef3d6a85ba950cecb310827c5 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/data/__init__.py @@ -0,0 +1,21 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +from . import indoor3d_reader +from . import modelnet40_reader +from .indoor3d_reader import * +from .modelnet40_reader import * + +__all__ = indoor3d_reader.__all__ +__all__ += modelnet40_reader.__all__ diff --git a/PaddleCV/Paddle3D/PointNet++/data/data_utils.py b/PaddleCV/Paddle3D/PointNet++/data/data_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..0f0872895bd55be2d9f84837e28671d5486d8b40 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/data/data_utils.py @@ -0,0 +1,127 @@ +""" +This code is based on https://github.com/erikwijmans/Pointnet2_PyTorch/blob/master/pointnet2/data/data_utils.py +""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals + +import numpy as np + +def angle_axis(angle, axis): + """ + Returns a 4x4 rotation matrix that performs a rotation around axis by angle + Parameters + ---------- + angle : float + Angle to rotate by + axis: np.ndarray + Axis to rotate about + Returns + ------- + Tensor + 3x3 rotation matrix + """ + u = axis / np.linalg.norm(axis) + cosval, sinval = np.cos(angle), np.sin(angle) + + # yapf: disable + cross_prod_mat = np.array([[0.0, -u[2], u[1]], + [u[2], 0.0, -u[0]], + [-u[1], u[0], 0.0]]) + + R = np.array( + cosval * np.eye(3) + + sinval * cross_prod_mat + + (1.0 - cosval) * np.outer(u, u)).astype("float32") + return R + +class PointcloudScale(object): + def __init__(self, lo=0.8, hi=1.25): + self.lo, self.hi = lo, hi + + def __call__(self, points): + scaler = np.random.uniform(self.lo, self.hi) + points[:, 0:3] *= scaler + return points + +class PointcloudRotate(object): + def __init__(self, axis=np.array([0.0, 1.0, 0.0])): + self.axis = axis + + def __call__(self, points): + rotation_angle = np.random.uniform() * 2 * np.pi + rotation_matrix = angle_axis(rotation_angle, self.axis) + + normals = points.shape[1] > 3 + if not normals: + return np.matmul(points, rotation_matrix.T) + else: + pc_xyz = points[:, 0:3] + pc_normals = 
points[:, 3:] + points[:, 0:3] = np.matmul(pc_xyz, rotation_matrix.T) + points[:, 3:] = np.matmul(pc_normals, rotation_matrix.T) + return points + +class PointcloudTranslate(object): + def __init__(self, translate_range=0.1): + self.translate_range = translate_range + + def __call__(self, points): + translation = np.random.uniform(-self.translate_range, self.translate_range) + points[:, 0:3] += translation + return points + + +class PointcloudJitter(object): + def __init__(self, std=0.01, clip=0.05): + self.std, self.clip = std, clip + + def __call__(self, points): + jittered_data = np.random.normal(loc=0,scale=self.std,size=(points.shape[0],3)) + jittered_data = np.clip(jittered_data, -self.clip, self.clip) + + points[:, 0:3] += jittered_data + return points + +class PointcloudRotatePerturbation(object): + def __init__(self, angle_sigma=0.06, angle_clip=0.18): + self.angle_sigma, self.angle_clip = angle_sigma, angle_clip + + def _get_angles(self): + angles = np.clip( + self.angle_sigma * np.random.randn(3), -self.angle_clip, self.angle_clip + ) + return angles + def __call__(self, points): + angles = self._get_angles() + Rx = angle_axis(angles[0], np.array([1.0, 0.0, 0.0])) + Ry = angle_axis(angles[1], np.array([0.0, 1.0, 0.0])) + Rz = angle_axis(angles[2], np.array([0.0, 0.0, 1.0])) + + rotation_matrix = np.matmul(np.matmul(Rz, Ry), Rx) + + normals = points.shape[1] > 3 + if not normals: + return np.matmul(points, rotation_matrix.T) + else: + pc_xyz = points[:, 0:3] + pc_normals = points[:, 3:] + points[:, 0:3] = np.matmul(pc_xyz, rotation_matrix.T) + points[:, 3:] = np.matmul(pc_normals, rotation_matrix.T) + return points + + +class PointcloudRandomInputDropout(object): + def __init__(self, max_dropout_ratio=0.875): + assert max_dropout_ratio >= 0 and max_dropout_ratio < 1 + self.max_dropout_ratio = max_dropout_ratio + + def __call__(self, points): + dropout_ratio = np.random.random() * self.max_dropout_ratio # 0~0.875 + drop_idx = np.where(np.random.random((points.shape[0])) <= dropout_ratio)[0] + if len(drop_idx) > 0: + points[drop_idx] = points[0] # set to the first point + + return points diff --git a/PaddleCV/Paddle3D/PointNet++/data/indoor3d_reader.py b/PaddleCV/Paddle3D/PointNet++/data/indoor3d_reader.py new file mode 100644 index 0000000000000000000000000000000000000000..c27a37963ecb808f51ed65235e739d89c70c0ac6 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/data/indoor3d_reader.py @@ -0,0 +1,129 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals + +import os +import os.path as osp +import signal +import numpy as np +import h5py +import random +import logging + +__all__ = ["Indoor3DReader"] + +logger = logging.getLogger(__name__) + + +class Indoor3DReader(object): + def __init__(self, data_dir, test_area="Area_5"): + self.data_dir = data_dir + self.test_area = test_area + self.load_data() + + def _read_data_file(self, fname): + assert osp.isfile(fname), \ + "{} is not a file".format(fname) + with open(fname) as f: + return [line.strip() for line in f] + + def _load_h5_file(self, fname): + assert osp.isfile(fname), \ + "{} is not a file".format(fname) + f = h5py.File(fname, mode='r') + return f['data'][:], f['label'][:] + + def load_data(self): + logger.info("Loading Indoor3D dataset from {} ...".format(self.data_dir)) + # read all_files.txt + all_files_fname = osp.join(self.data_dir, 'all_files.txt') + all_files = self._read_data_file(all_files_fname) + + # read room_filelist.txt + room_fname = osp.join(self.data_dir, 'room_filelist.txt') + room_filelist = self._read_data_file(room_fname) + + points, labels = [], [] + for f in all_files: + h5_fname = osp.join(self.data_dir, osp.split(f)[-1]) + point, label = self._load_h5_file(h5_fname) + points.append(point) + labels.append(label) + points = np.concatenate(points, 0) + labels = np.concatenate(labels, 0) + + train_idxs, test_idxs = [], [] + for i, room in enumerate(room_filelist): + if self.test_area in room: + test_idxs.append(i) + else: + train_idxs.append(i) + + self.data = {} + self.data['train'] = {} + self.data['train']['points'] = points[train_idxs, ...] + self.data['train']['labels'] = labels[train_idxs, ...] + self.data['test'] = {} + self.data['test']['points'] = points[test_idxs, ...] + self.data['test']['labels'] = labels[test_idxs, ...] + logger.info("Load data finished") + + def get_reader(self, batch_size, num_points, mode='train', shuffle=True): + assert mode in ['train', 'test'], \ + "mode can only be 'train' or 'test'" + data = self.data[mode] + points = data['points'] + labels = data['labels'] + + if mode == 'train' and shuffle: + idxs = np.arange(len(points)) + np.random.shuffle(idxs) + points = points[idxs] + labels = labels[idxs] + + def reader(): + batch_out = [] + for point, label in zip(points, labels): + # shuffle points + p = point.copy() + l = label.copy() + pt_idxs = np.arange(num_points) + np.random.shuffle(pt_idxs) + p = p[pt_idxs] + l = l[pt_idxs] + + xyz = p[:, :3] + feature = p[:, 3:] + label = l[:, np.newaxis] + batch_out.append((xyz, feature, label)) + + if len(batch_out) == batch_size: + yield batch_out + batch_out = [] + + return reader + + +def _term_reader(signum, frame): + logger.info('pid {} terminated, terminate reader process ' + 'group {}...'.format(os.getpid(), os.getpgrp())) + os.killpg(os.getpgid(os.getpid()), signal.SIGKILL) + +signal.signal(signal.SIGINT, _term_reader) +signal.signal(signal.SIGTERM, _term_reader) + diff --git a/PaddleCV/Paddle3D/PointNet++/data/modelnet40_reader.py b/PaddleCV/Paddle3D/PointNet++/data/modelnet40_reader.py new file mode 100644 index 0000000000000000000000000000000000000000..e32f10ab719db0742df127f243c8064bf4dfdd48 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/data/modelnet40_reader.py @@ -0,0 +1,116 @@ +# Copyright (c) 2019 PaddlePaddle Authors. 
All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals + +import os +import os.path as osp +import signal +import numpy as np +import h5py +import random +import logging + +__all__ = ["ModelNet40ClsReader"] + +logger = logging.getLogger(__name__) + + +class ModelNet40ClsReader(object): + def __init__(self, data_dir, mode='train', transforms=None): + assert mode in ['train', 'test'], \ + "mode can only be 'train' or 'test'" + self.data_dir = data_dir + self.mode = mode + self.transforms = transforms + self.load_data() + + def _read_data_file(self, fname): + assert osp.isfile(fname), \ + "{} is not a file".format(fname) + with open(fname) as f: + return [line.strip()[5:] for line in f] + + def _load_h5_file(self, fname): + assert osp.isfile(fname), \ + "{} is not a file".format(fname) + f = h5py.File(fname, mode='r') + return f['data'][:], f['label'][:] + + def load_data(self): + logger.info("Loading ModelNet40 dataset {} split from {} " + "...".format(self.mode, self.data_dir)) + if self.mode == 'train': + files_fname = osp.join(self.data_dir, 'train_files.txt') + files = self._read_data_file(files_fname) + else: + files_fname = osp.join(self.data_dir, 'test_files.txt') + files = self._read_data_file(files_fname) + + points, labels = [], [] + for f in files: + h5_fname = osp.join(self.data_dir, osp.split(f)[-1]) + point, label = self._load_h5_file(h5_fname) + points.append(point) + labels.append(label) + self.points = np.concatenate(points, 0) + self.labels = np.concatenate(labels, 0) + logger.info("Load {} data finished".format(self.mode)) + + def get_reader(self, batch_size, num_points, shuffle=True): + self.num_points = min(num_points, self.points.shape[1]) + points = self.points + labels = self.labels + if shuffle and self.mode == 'train': + idxs = np.arange(len(self.points)) + np.random.shuffle(idxs) + points = points[idxs] + labels = labels[idxs] + + def reader(): + batch_out = [] + for point, label in zip(points, labels): + p = point.copy() + l = label.copy() + pt_idxs = np.arange(self.num_points) + if shuffle: + np.random.shuffle(pt_idxs) + c_points = p[pt_idxs] + if self.transforms is not None: + for trans in self.transforms: + c_points = trans(c_points) + + xyz = c_points[:, :3] + # modelnet40 only have xyz features + # feature = c_points[:, 3:] + label = l[:, np.newaxis] + batch_out.append((xyz, label)) + + if len(batch_out) == batch_size: + yield batch_out + batch_out = [] + return reader + + +def _term_reader(signum, frame): + logger.info('pid {} terminated, terminate reader process ' + 'group {}...'.format(os.getpid(), os.getpgrp())) + os.killpg(os.getpgid(os.getpid()), signal.SIGKILL) + +signal.signal(signal.SIGINT, _term_reader) +signal.signal(signal.SIGTERM, _term_reader) + diff --git a/PaddleCV/Paddle3D/PointNet++/dataset/Indoor3DSemSeg/download.sh 
b/PaddleCV/Paddle3D/PointNet++/dataset/Indoor3DSemSeg/download.sh new file mode 100644 index 0000000000000000000000000000000000000000..27a889806416ef56e09660058d5db1da1f0de725 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/dataset/Indoor3DSemSeg/download.sh @@ -0,0 +1,8 @@ +DIR="$( cd "$(dirname "$0")" ; pwd -P )" +cd "$DIR" + +echo "Downloading https://shapenet.cs.stanford.edu/media/indoor3d_sem_seg_hdf5_data.zip" +wget https://shapenet.cs.stanford.edu/media/indoor3d_sem_seg_hdf5_data.zip + +echo "Unzip indoor3d_sem_seg_hdf5_data.zip" +unzip indoor3d_sem_seg_hdf5_data.zip diff --git a/PaddleCV/Paddle3D/PointNet++/dataset/ModelNet40/download.sh b/PaddleCV/Paddle3D/PointNet++/dataset/ModelNet40/download.sh new file mode 100644 index 0000000000000000000000000000000000000000..0a6e95328eac4188cb2fee6b7f331be6e76ae16d --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/dataset/ModelNet40/download.sh @@ -0,0 +1,8 @@ +DIR="$( cd "$(dirname "$0")" ; pwd -P )" +cd "$DIR" + +echo "Downloading https://shapenet.cs.stanford.edu/media/modelnet40_ply_hdf5_2048.zip" +wget https://shapenet.cs.stanford.edu/media/modelnet40_ply_hdf5_2048.zip + +echo "Unzip modelnet40_ply_hdf5_2048.zip" +unzip modelnet40_ply_hdf5_2048.zip diff --git a/PaddleCV/Paddle3D/PointNet++/eval_cls.py b/PaddleCV/Paddle3D/PointNet++/eval_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..a25731a658b18ec8814b8521303a90b6f5dcf02b --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/eval_cls.py @@ -0,0 +1,148 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
+ +import os +import sys +import time +import shutil +import argparse +import ast +import logging +import numpy as np +import paddle.fluid as fluid + +from models import * +from data.data_utils import * +from data.modelnet40_reader import ModelNet40ClsReader +from utils import * + +logging.root.handlers = [] +FORMAT = '%(asctime)s-%(levelname)s: %(message)s' +logging.basicConfig(level=logging.INFO, format=FORMAT, stream=sys.stdout) +logger = logging.getLogger(__name__) + +np.random.seed(1024) + + +def parse_args(): + parser = argparse.ArgumentParser("PointNet++ semantic segmentation train script") + parser.add_argument( + '--model', + type=str, + default='MSG', + help='SSG or MSG model to train, default MSG') + parser.add_argument( + '--use_gpu', + type=ast.literal_eval, + default=True, + help='default use gpu.') + parser.add_argument( + '--batch_size', + type=int, + default=1, + help='evaluation batch size, default 1') + parser.add_argument( + '--num_points', + type=int, + default=4096, + help='number of points in a sample, default: 4096') + parser.add_argument( + '--num_classes', + type=int, + default=40, + help='number of classes in dataset, default: 13') + parser.add_argument( + '--weights', + type=str, + default='checkpoints/200', + help='directory name to save train snapshoot') + parser.add_argument( + '--data_dir', + type=str, + default='dataset/ModelNet40/modelnet40_ply_hdf5_2048', + help='dataset directory') + parser.add_argument( + '--log_interval', + type=int, + default=100, + help='mini-batch interval for logging.') + args = parser.parse_args() + return args + + +def eval(): + args = parse_args() + print_arguments(args) + # check whether the installed paddle is compiled with GPU + check_gpu(args.use_gpu) + + assert args.model in ['MSG', 'SSG'], \ + "--model can only be 'MSG' or 'SSG'" + + # build model + startup = fluid.Program() + eval_prog = fluid.Program() + with fluid.program_guard(eval_prog, startup): + with fluid.unique_name.guard(): + eval_model = PointNet2ClsMSG(args.num_classes, args.num_points) \ + if args.model == 'MSG' else \ + PointNet2ClsSSG(args.num_classes, args.num_points) + eval_model.build_model() + eval_feeds = eval_model.get_feeds() + eval_outputs = eval_model.get_outputs() + eval_pyreader = eval_model.get_pyreader() + eval_prog = eval_prog.clone(True) + eval_keys, eval_values = parse_outputs(eval_outputs) + + place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace() + exe = fluid.Executor(place) + exe.run(startup) + + assert os.path.exists(args.weights), "weights {} not exists.".format(args.weights) + def if_exist(var): + return os.path.exists(os.path.join(args.weights, var.name)) + fluid.io.load_vars(exe, args.weights, eval_prog, predicate=if_exist) + + eval_compile_prog = fluid.compiler.CompiledProgram(eval_prog) + + # get reader + modelnet_reader = ModelNet40ClsReader(args.data_dir, mode='test') + eval_reader = modelnet_reader.get_reader(args.batch_size, args.num_points) + eval_pyreader.decorate_sample_list_generator(eval_reader, place) + + eval_stat = Stat() + try: + eval_pyreader.start() + eval_iter = 0 + eval_periods = [] + while True: + cur_time = time.time() + eval_outs = exe.run(eval_compile_prog, fetch_list=eval_values) + period = time.time() - cur_time + eval_periods.append(period) + eval_stat.update(eval_keys, eval_outs) + if eval_iter % args.log_interval == 0: + log_str = "" + for name, value in zip(eval_keys, eval_outs): + log_str += "{}: {:.4f}, ".format(name, np.mean(value)) + logger.info("[EVAL] batch {}: {}time: 
{:.2f}".format(eval_iter, log_str, period)) + eval_iter += 1 + except fluid.core.EOFException: + logger.info("[EVAL] Eval finished, {}average time: {:.2f}".format(eval_stat.get_mean_log(), np.mean(eval_periods[1:]))) + finally: + eval_pyreader.reset() + + +if __name__ == "__main__": + eval() diff --git a/PaddleCV/Paddle3D/PointNet++/eval_seg.py b/PaddleCV/Paddle3D/PointNet++/eval_seg.py new file mode 100644 index 0000000000000000000000000000000000000000..56c257bb6dee2027a49d3abe48097bb7bfd4a610 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/eval_seg.py @@ -0,0 +1,147 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +import os +import sys +import time +import shutil +import argparse +import ast +import logging +import numpy as np +import paddle.fluid as fluid + +from models import * +from data.indoor3d_reader import Indoor3DReader +from utils import * + +logging.root.handlers = [] +FORMAT = '%(asctime)s-%(levelname)s: %(message)s' +logging.basicConfig(level=logging.INFO, format=FORMAT, stream=sys.stdout) +logger = logging.getLogger(__name__) + +np.random.seed(1024) + + +def parse_args(): + parser = argparse.ArgumentParser("PointNet++ semantic segmentation train script") + parser.add_argument( + '--model', + type=str, + default='MSG', + help='SSG or MSG model to train, default MSG') + parser.add_argument( + '--use_gpu', + type=ast.literal_eval, + default=True, + help='default use gpu.') + parser.add_argument( + '--batch_size', + type=int, + default=1, + help='evaluation batch size, default 1') + parser.add_argument( + '--num_points', + type=int, + default=4096, + help='number of points in a sample, default: 4096') + parser.add_argument( + '--num_classes', + type=int, + default=13, + help='number of classes in dataset, default: 13') + parser.add_argument( + '--weights', + type=str, + default='checkpoints/200', + help='directory name to save train snapshoot') + parser.add_argument( + '--data_dir', + type=str, + default='dataset/Indoor3DSemSeg/indoor3d_sem_seg_hdf5_data', + help='dataset directory') + parser.add_argument( + '--log_interval', + type=int, + default=100, + help='mini-batch interval for logging.') + args = parser.parse_args() + return args + + +def eval(): + args = parse_args() + print_arguments(args) + # check whether the installed paddle is compiled with GPU + check_gpu(args.use_gpu) + + assert args.model in ['MSG', 'SSG'], \ + "--model can only be 'MSG' or 'SSG'" + + # build model + startup = fluid.Program() + eval_prog = fluid.Program() + with fluid.program_guard(eval_prog, startup): + with fluid.unique_name.guard(): + eval_model = PointNet2SemSegMSG(args.num_classes, args.num_points) \ + if args.model == 'MSG' else \ + PointNet2SemSegSSG(args.num_classes, args.num_points) + eval_model.build_model() + eval_feeds = eval_model.get_feeds() + eval_outputs = eval_model.get_outputs() + eval_pyreader = eval_model.get_pyreader() + eval_prog = eval_prog.clone(True) + eval_keys, eval_values = parse_outputs(eval_outputs) + + place 
= fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace() + exe = fluid.Executor(place) + exe.run(startup) + + assert os.path.exists(args.weights), "weights {} not exists.".format(args.weights) + def if_exist(var): + return os.path.exists(os.path.join(args.weights, var.name)) + fluid.io.load_vars(exe, args.weights, eval_prog, predicate=if_exist) + + eval_compile_prog = fluid.compiler.CompiledProgram(eval_prog) + + # get reader + indoor_reader = Indoor3DReader(args.data_dir) + eval_reader = indoor_reader.get_reader(args.batch_size, args.num_points, mode='test') + eval_pyreader.decorate_sample_list_generator(eval_reader, place) + + eval_stat = Stat() + try: + eval_pyreader.start() + eval_iter = 0 + eval_periods = [] + while True: + cur_time = time.time() + eval_outs = exe.run(eval_compile_prog, fetch_list=eval_values) + period = time.time() - cur_time + eval_periods.append(period) + eval_stat.update(eval_keys, eval_outs) + if eval_iter % args.log_interval == 0: + log_str = "" + for name, value in zip(eval_keys, eval_outs): + log_str += "{}: {:.4f}, ".format(name, np.mean(value)) + logger.info("[EVAL] batch {}: {}time: {:.2f}".format(eval_iter, log_str, period)) + eval_iter += 1 + except fluid.core.EOFException: + logger.info("[EVAL] Eval finished, {}average time: {:.2f}".format(eval_stat.get_mean_log(), np.mean(eval_periods[1:]))) + finally: + eval_pyreader.reset() + + +if __name__ == "__main__": + eval() diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/README.md b/PaddleCV/Paddle3D/PointNet++/ext_op/README.md new file mode 100644 index 0000000000000000000000000000000000000000..3316b51a0194f4e006d1c9455504f7d712c35c6e --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/README.md @@ -0,0 +1,96 @@ +# 自定义OP的编译过程 + +## 代码结构 + + - src: 扩展OP C++/CUDA 源码 + - pointnet_lib.py: Python封装 + - tests: 各OP单测程序 + +## 安装PaddlePaddle + +请通过如下方式安装PaddlePaddle: + +- 通过[Paddle develop分支](https://github.com/PaddlePaddle/Paddle/tree/develop)源码编译安装,编译方法如下: + + 1. [Ubuntu](https://www.paddlepaddle.org.cn/install/doc/source/ubuntu) + 1. [CentOS](https://www.paddlepaddle.org.cn/install/doc/source/centos) + 1. [MasOS](https://www.paddlepaddle.org.cn/install/doc/source/macos) + 1. [Windows](https://www.paddlepaddle.org.cn/install/doc/source/windows) + + **说明:** 推荐使用docker编译 + +- 安装Paddle develop[每日版本whl包](https://www.paddlepaddle.org.cn/install/doc/tables#多版本whl包列表-dev-11) + + **注意:** 编译自定义OP使用的gcc版本须与Paddle编译使用gcc版本一致,Paddle develop每日版本目前采用**gcc 4.8.2**版本编译,若使用每日版本,请使用**gcc 4.8.2**版本编译自定义OP,否则可能出现兼容性问题。 + +## 编译自定义OP + +自定义op需要将实现的C++、CUDA代码编译成动态库,mask.sh中通过g++/nvcc编译,当然您也可以写Makefile或者CMake。 + +编译需要include PaddlePaddle的相关头文件,链接PaddlePaddle的lib库。 头文件和lib库可通过下面命令获取到: + +``` +# python +>>> import paddle +>>> print(paddle.sysconfig.get_include()) +/paddle/pyenv/local/lib/python2.7/site-packages/paddle/include +>>> print(paddle.sysconfig.get_lib()) +/paddle/pyenv/local/lib/python2.7/site-packages/paddle/libs +``` + +我们提供动态库编译脚本如下: + +``` +cd src +sh make.sh +``` + +最终编译会产出`pointnet_lib.so` + +**说明:** 若使用源码编译安装PaddlePaddle的方式,编译过程中`cmake`未设置`WITH_MKLDNN`的方式, +编译自定义OP时会报错找不到`mkldnn.h`等文件,可在`make.sh`中删除编译命令中的`-DPADDLE_WITH_MKLDNN`选项。 + +## 设置环境变量 + +需要将Paddle的核心库设置到`LD_LIBRARY_PATH`里, 先运行下面程序获取路径: + +``` +import paddle +print(paddle.sysconfig.get_lib()) +``` + +可通过如下方式添加动态库路径: + +``` +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'` +``` + +## 执行单测 + +执行下列单测,确保自定义算子可在网络中正确使用: + +``` +# 回到 ext_op 目录,添加 PYTHONPATH +cd .. 
+export PYTHONPATH=$PYTHONPATH:`pwd` + +# 运行单测 +python test/test_farthest_point_sampling_op.py +python test/test_gather_point_op.py +python test/test_group_points_op.py +python test/test_query_ball_op.py +python test/test_three_interp_op.py +python test/test_three_nn_op.py +``` + +单测运行成功会输出提示信息,如下所示: + +``` +. +---------------------------------------------------------------------- +Ran 1 test in 13.205s + +OK +``` + +更多关于如何在框架外部自定义 C++ OP,可阅读[官网说明文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_usage/index_cn.html) diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/__init__.py b/PaddleCV/Paddle3D/PointNet++/ext_op/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..4afd4332108896cd690efda3fc6f0b1eb86fb086 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/__init__.py @@ -0,0 +1,18 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +from . import pointnet_lib +from .pointnet_lib import * + +__all__ = pointnet_lib.__all__ diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/pointnet_lib.py b/PaddleCV/Paddle3D/PointNet++/ext_op/pointnet_lib.py new file mode 100644 index 0000000000000000000000000000000000000000..5f607bf8775a6c0b20440f1635ae1c05b2ad8f07 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/pointnet_lib.py @@ -0,0 +1,264 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +import os +import paddle.fluid as fluid + +file_dir = os.path.dirname(os.path.abspath(__file__)) +fluid.load_op_library(os.path.join(file_dir, 'src/pointnet_lib.so')) + +from paddle.fluid.layer_helper import LayerHelper + +__all__ = ['three_nn', 'three_interp', 'query_ball', 'gather_point', + 'farthest_point_sampling', 'group_points'] + + +def three_nn(input, known, eps=1e-10, name=None): + """ + **Three Nearest Neighbor Layer** + + This operator samples the top-3 nearest neighbor of each point + coordinates specified by Input(X) between known point coordinates + specified by Input(Known) and calcualte the distance between these + nearest neighbors. + + Args: + input (Variable): The input tensor of three_nn operator. This + is a 3-D tensor with shape of [B, N, 3]. + known (Variable): The input tensor of known points of three_nn + operator. This is a 3-D tensor with shape of + [B, M, 3]. + name(str|None): A name for this layer(optional). If set None, the layer + will be named automatically. 
+ + Returns: + distance (Variable): The output distance tensor of three_nn operator. + This is a 3-D tensor with shape of [B, N, 3]. + idx (Variable): The output index tensor of three_nn operator. + This is a 3-D tensor with shape of [B, N, 3]. + + Examples: + + .. code-block:: python + + import paddle.fluid as fluid + x = fluid.layers.data(name='x', shape=[16, 3], dtype='float32') + known = fluid.layers.data(name='known', shape=[32, 3], dtype='float32') + distance, idx = fluid.layers.three_nn(input, known) + """ + helper = LayerHelper('three_nn', **locals()) + dtype = helper.input_dtype() + dist = helper.create_variable_for_type_inference(dtype) + idx = helper.create_variable_for_type_inference(dtype) + helper.append_op( + type="three_nn", + inputs={"X": input, + "Known": known}, + outputs={"Distance": dist, + "Idx": idx}, + attrs={'eps': eps}) + return (dist, idx) + + +def three_interp(input, weight, idx, name=None): + """ + **Three Interpolate Layer** + + This operator calculate interpolate results from input, weight and + index. + + Args: + input (Variable): The input tensor of three_interp operator. This + is a 3-D tensor with shape of [B, M, C]. + weight (Variable): The weight tensor of three_interp operator. This + is a 3-D tensor with shape of [B, N, 3]. + idx (Variable): The index tensor of three_interp operator. This + is a 3-D tensor with shape of [B, N, 3]. + name(str|None): A name for this layer(optional). If set None, the layer + will be named automatically. + + Returns: + output (Variable): The output tensor of three_interp operator. + This is a 3-D tensor with shape of [B, N, C]. + + Examples: + + .. code-block:: python + + import paddle.fluid as fluid + x = fluid.layers.data(name='x', shape=[16, 3], dtype='float32') + weight = fluid.layers.data(name='weight', shape=[32, 3], dtype='float32') + index = fluid.layers.data(name='index', shape=[32, 3], dtype='int32') + out = fluid.layers.three_interp(x, weight, index) + """ + helper = LayerHelper('three_interp', **locals()) + dtype = helper.input_dtype() + out = helper.create_variable_for_type_inference(dtype) + helper.append_op( + type="three_interp", + inputs={"X": input, + "Weight": weight, + "Idx": idx}, + outputs={"Out": out, }) + return out + + +def query_ball(input, new_points, radius, n_sample): + """ + **Query Ball Layer** + + Output is a tensor with the indicies of the features that form the query balls. + + Args: + input(Variable): XYZ coordinates of features with shape of [B,N,3]. + new_points(Variable): Centers coordinates of the ball query with shape of [B,M,3]. + radius(float|Variable): Radius of the balls. + n_sample(int|Variable): Maximum number of features in the balls. + Return: + output(Variable): Tensor with the indicies of the features that form the query balls,with shape of [B,M,n_sample] + + Examples: + .. 
code-block::python + + import paddle.fluid as fluid + x = fluid.layers.data(name='points',shape=[-1,5,3],dtype='float32') + new_points = fluid.layers.data(name='new_points', shape=[-1,2,3], dtype='float32') + output = fluid.layers.query_ball(x,new_points,radius=4.0,n_sample=5) + + + + """ + helper = LayerHelper('query_ball', **locals()) + dtype = helper.input_dtype() + out = helper.create_variable_for_type_inference(dtype) + helper.append_op( + type="query_ball", + inputs={"Points": input, + "New_Points": new_points}, + attrs={"N_sample": n_sample, + "Radius": radius}, + outputs={"Output": out}) + return out + + +def farthest_point_sampling(input, sampled_point_num): + ''' + Sampling point based on its max eucliden distance with other points. + + Args: + input (Variable): input point cloud dataset with shape (B, N, 3) + B is batch size, N is points's nums, 3 is (x,y,z) coordinate + sampled_point_num (int): sampled points's nums + + Retrun: + output (Variable): return sampled points with shape (B, M) + B is batch size, M is points's nums + + Examples: + .. code-block:: python + x = fluid.layers.data(name='data', shape=(2,100,3), dtype='float32') + sampled_points = fluid.layers.farthest_point_sampling( + x, 50 + ) + ''' + + helper = LayerHelper('farthest_point_sampling', **locals()) + dtype = input.dtype + op_out = helper.create_variable_for_type_inference(dtype) + helper.append_op( + type='farthest_point_sampling', + inputs={'X': input}, + outputs={'Output': op_out}, + attrs={'sampled_point_num': sampled_point_num}) + return op_out + + +def gather_point(input, index): + """ + **Gather Point Layer** + Output is obtained by gathering entries of X indexed by `index` + and concatenate them together. + .. math:: + Out = X[Index] + .. code-block:: text + Given: + X = [[1, 2, 3], + [3, 4, 5], + [5, 6, 7]] + Index = [[1, 2] + Then: + Out = [[3, 4, 5], + [5, 6, 7]] + Args: + input (Variable): The source input with rank>=1, This + is a 3-D tensor with shape of [B, N, 3]. + index (Variable): The index input with shape of [B, M]. + + Returns: + output (Variable): The output is a tensor with shape of [B,M]. + Examples: + .. code-block:: python + import paddle.fluid as fluid + x = fluid.layers.data(name='x', shape=[-1, 5, 3], dtype='float32') + index = fluid.layers.data(name='index', shape=[-1, 1], dtype='int32') + output = fluid.layers.gather_point(x, index) + """ + + helper = LayerHelper('gather_point', **locals()) + dtype = helper.input_dtype() + out = helper.create_variable_for_type_inference(dtype) + helper.append_op( + type="gather_point", + inputs={"X": input, + "Index": index}, + outputs={"Output": out}) + return out + + +def group_points(input, idx, name=None): + """ + **Group Points Layer** + + This operator group input points with index. + + Args: + input (Variable): The input tensor of three_interp operator. This + is a 3-D tensor with shape of [B, N, C]. + idx (Variable): The index tensor of three_interp operator. This + is a 3-D tensor with shape of [B, M, S]. + name(str|None): A name for this layer(optional). If set None, the layer + will be named automatically. + + Returns: + output (Variable): The output tensor of three_interp operator. + This is a 4-D tensor with shape of [B, M, S, C]. + + Examples: + + .. 
code-block:: python + + import paddle.fluid as fluid + x = fluid.layers.data(name='x', shape=[16, 3], dtype='float32') + index = fluid.layers.data(name='index', shape=[32, 3], dtype='int32') + out = fluid.layers.group_points(x, index) + """ + helper = LayerHelper('group_points', **locals()) + dtype = helper.input_dtype() + out = helper.create_variable_for_type_inference(dtype) + helper.append_op( + type="group_points", + inputs={"X": input, + "Idx": idx}, + outputs={"Out": out, }) + return out diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/src/farthest_point_sampling_op.cc b/PaddleCV/Paddle3D/PointNet++/ext_op/src/farthest_point_sampling_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..ace1e01c9475b20bb13019d211e39770f7160bac --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/src/farthest_point_sampling_op.cc @@ -0,0 +1,69 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; + +class FarthestPointSamplingOpMaker : public framework::OpProtoAndCheckerMaker { +public: + void Make() override { + AddInput("X", + "(Tensor)input point cloud dataset with shape (B, N, 3)" + "B is batch size, N is points's nums, 3 is (x,y,z) coordinate"); + AddOutput("Output", + "(Tensor)return sampled points with shape (B, M)" + "B is batch size, M is points's nums"); + AddAttr("sampled_point_num", "sampling points's num") + .SetDefault(0) + .EqualGreaterThan(0); + AddComment( + R"Doc( + Sampling point based on + its max eucliden distance with other points.)Doc"); + } +}; + +class FarthestPointSamplingOp : public framework::OperatorWithKernel { +public: + using framework::OperatorWithKernel::OperatorWithKernel; + +protected: + void InferShape(framework::InferShapeContext *ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) shoud not be null"); + auto x_dims = ctx->GetInputDim("X"); + PADDLE_ENFORCE(x_dims.size() == 3, + "Input(X) of FathestPointSamplingOp should be 3-D Tensor"); + const int m = ctx->Attrs().Get("sampled_point_num"); + ctx->SetOutputDim("Output", {x_dims[0], m}); + } + +protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext &ctx) const override { + auto input_data_type = ctx.Input("X")->type(); + return framework::OpKernelType(input_data_type, ctx.GetPlace()); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(farthest_point_sampling, + ops::FarthestPointSamplingOp, + ops::FarthestPointSamplingOpMaker); diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/src/farthest_point_sampling_op.cu b/PaddleCV/Paddle3D/PointNet++/ext_op/src/farthest_point_sampling_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..56515254991d09b66335f202546a176411ade2f2 --- /dev/null +++ 
b/PaddleCV/Paddle3D/PointNet++/ext_op/src/farthest_point_sampling_op.cu @@ -0,0 +1,151 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; + +template +__global__ void farthestpointsamplingKernel(int b, + int n, + int m, + const T *__restrict__ dataset, + T *__restrict__ temp, + int *__restrict__ idxs) { + // 1. add first point + // 2. add the point having farthest distance with first point's + // 3. make second point as first point, repeat 1,2 + if (m <= 0) return; + const int BlockSize = block_size; + __shared__ float dists[BlockSize]; + __shared__ int dists_i[BlockSize]; + const int BufferSize = 3072; + __shared__ float buf[BufferSize * 3]; + + // one block one batch, n points + // one thread one point + for (int i = blockIdx.x; i < b; i += gridDim.x) { + // can select old point as first point randomly + int old = 0; + if (threadIdx.x == 0) idxs[i * m + 0] = old; + + for (int j = threadIdx.x; j < n; j += blockDim.x) { + temp[blockIdx.x * n + j] = 1e38; + } + for (int j = threadIdx.x; j < min(BufferSize, n) * 3; j += blockDim.x) { + buf[j] = dataset[i * n * 3 + j]; + } + // wait all threads do this in the same block + __syncthreads(); + + // out m points + for (int j = 1; j < m; j++) { + // Step 1. + // fatherest distance + int besti = 0; + float best = -1; + // first point in m points + float x1 = dataset[i * n * 3 + old * 3 + 0]; + float y1 = dataset[i * n * 3 + old * 3 + 1]; + float z1 = dataset[i * n * 3 + old * 3 + 2]; + + // Step 2. + // find farthest point of (x1, y1, z1) + for (int k = threadIdx.x; k < n; k += blockDim.x) { + float td = temp[blockIdx.x * n + k]; + float x2, y2, z2; + if (k < BufferSize) { + x2 = buf[k * 3 + 0]; + y2 = buf[k * 3 + 1]; + z2 = buf[k * 3 + 2]; + } else { + x2 = dataset[i * n * 3 + k * 3 + 0]; + y2 = dataset[i * n * 3 + k * 3 + 1]; + z2 = dataset[i * n * 3 + k * 3 + 2]; + } + // compute eucliden distance + float d = (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1) + + (z2 - z1) * (z2 - z1); + float d2 = min(d, td); + if (d2 != td) temp[blockIdx.x * n + k] = d2; + if (d2 > best) { + best = d2; + besti = k; + } + } + + // step 3. 
+ dists[threadIdx.x] = best; + dists_i[threadIdx.x] = besti; + for (int u = 0; (1 << u) < blockDim.x; u++) { + __syncthreads(); + if (threadIdx.x < (blockDim.x >> (u + 1))) { + int i1 = (threadIdx.x * 2) << u; + int i2 = (threadIdx.x * 2 + 1) << u; + if (dists[i1] < dists[i2]) { + dists[i1] = dists[i2]; + dists_i[i1] = dists_i[i2]; + } + } + } + __syncthreads(); + // store the found node index + old = dists_i[0]; + if (threadIdx.x == 0) idxs[i * m + j] = old; + } + } +} + +template +class FarthestPointSamplingOpCUDAKernel : public framework::OpKernel { +public: + void Compute(const framework::ExecutionContext &ctx) const override { + PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), + "This kernel only runs on GPU device."); + auto *input = ctx.Input("X"); + auto *output = ctx.Output("Output"); + if (input->numel() == 0) return; + // allocate memory + auto *ptr_out_points_index = output->mutable_data(ctx.GetPlace()); + + // b, n, m + int batch_size = input->dims()[0]; + int in_n_points = input->dims()[1]; + int out_m_points = ctx.Attr("sampled_point_num"); + + const T *ptr_in_points = input->data(); + + Tensor tmp; + auto *ptr_tmp_e = + tmp.mutable_data({batch_size, in_n_points}, ctx.GetPlace()); + + // run fathest point sampling kernel + // P40 have max 512 thread + farthestpointsamplingKernel<<<32, 512>>>(batch_size, + in_n_points, + out_m_points, + ptr_in_points, + ptr_tmp_e, + ptr_out_points_index); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL(farthest_point_sampling, + ops::FarthestPointSamplingOpCUDAKernel, + ops::FarthestPointSamplingOpCUDAKernel); diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/src/gather_point_op.cc b/PaddleCV/Paddle3D/PointNet++/ext_op/src/gather_point_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..0f41f1b3ad7cfc22e7fa7abfa8cbfa277ad9b136 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/src/gather_point_op.cc @@ -0,0 +1,118 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
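Per batch element, the kernel above maintains in `temp` the smallest squared distance from every point to the set of points already selected, and each iteration picks the point that maximizes that distance. A NumPy sketch of the same per-batch logic (illustrative only; the shared-memory buffer and tree reduction implement the argmax on the GPU):

```python
import numpy as np

def fps_reference(points, m):
    """points: (N, 3) for one batch element; returns m sampled indices."""
    n = points.shape[0]
    idxs = np.zeros(m, dtype=np.int32)          # first sample is index 0,
    min_dist = np.full(n, 1e38, dtype=np.float32)  # like `old = 0` above
    old = 0
    for j in range(1, m):
        d = np.sum((points - points[old]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, d)      # role of the `temp` buffer
        old = int(np.argmax(min_dist))          # farthest from selected set
        idxs[j] = old
    return idxs
```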
*/ + +#include "paddle/fluid/framework/op_registry.h" +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; + +class GatherPointOp : public framework::OperatorWithKernel { +public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), "Input(X) shoud not be null"); + auto x_dims = ctx->GetInputDim("X"); + PADDLE_ENFORCE(x_dims.size() == 3 && x_dims[2] == 3, + "Input(X) of GatherPointOp should be 3-D Tensor, the last " + "dimension must be 3"); + auto index_dims = ctx->GetInputDim("Index"); + PADDLE_ENFORCE(index_dims.size() == 2 && index_dims[0] == x_dims[0], + "Index of GatherPointop should be 2-D Tensor"); + ctx->SetOutputDim("Output", {x_dims[0], index_dims[1], 3}); + } + +protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + auto input_data_type = ctx.Input("X")->type(); + return framework::OpKernelType(input_data_type, ctx.GetPlace()); + } +}; + +class GatherPointOpMaker : public framework::OpProtoAndCheckerMaker { +public: + void Make() override { + AddInput("X", + "Input points with shape (batch, n, 3), n is input " + "points's num"); + AddInput("Index", + "input index with shape (batch, m), m is output points's num"); + AddOutput("Output", "output points with shape(batch, m, 3)"); + AddComment( + R"Doc( + Gather Point Operator. + Out is obtained by gathering entries of X indexed by Index and + concatenate them together. + + Example: + X = [[1, 2, 3], + [3, 4, 5], + [5, 6, 7]] + Index = [[1, 2]] + + Then: + Out = [[3, 4, 5],[5, 6, 7]])Doc"); + } +}; + +class GatherPointOpGrad : public framework::OperatorWithKernel { +public: + using framework::OperatorWithKernel::OperatorWithKernel; + +protected: + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("Index"), "Input(Index) should not be null"); + PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Output")), + "Input(Output@GRAD) should not be null"); + auto dim_x = ctx->GetInputDim("X"); + if (ctx->HasOutput(framework::GradVarName("X"))) { + ctx->SetOutputDim(framework::GradVarName("X"), dim_x); + } + } + + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType( + ctx.Input(framework::GradVarName("Output"))->type(), + ctx.GetPlace()); + } +}; + +template +class GatherPointGradDescMaker : public framework::SingleGradOpMaker { +public: + using framework::SingleGradOpMaker::SingleGradOpMaker; + +protected: + std::unique_ptr Apply() const override { + auto* op = new T(); + op->SetType("gather_point_grad"); + op->SetInput("X", this->Input("X")); + op->SetInput("Index", this->Input("Index")); + op->SetInput(framework::GradVarName("Output"), this->OutputGrad("Output")); + op->SetOutput(framework::GradVarName("X"), this->InputGrad("X")); + op->SetAttrMap(this->Attrs()); + return std::unique_ptr(op); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(gather_point, + ops::GatherPointOp, + ops::GatherPointOpMaker, + ops::GatherPointGradDescMaker, + ops::GatherPointGradDescMaker); +REGISTER_OPERATOR(gather_point_grad, ops::GatherPointOpGrad); diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/src/gather_point_op.cu b/PaddleCV/Paddle3D/PointNet++/ext_op/src/gather_point_op.cu new file mode 100644 index 
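The Doc string of gather_point above writes the example output without its batch dimension; a quick NumPy check of the documented semantics (values taken from the Doc string, code illustrative):

```python
import numpy as np

x = np.array([[[1, 2, 3],
               [3, 4, 5],
               [5, 6, 7]]], dtype=np.float32)   # X: (B=1, N=3, 3)
index = np.array([[1, 2]], dtype=np.int32)      # Index: (B=1, M=2)

# out[b, j, :] = x[b, index[b, j], :]
out = np.stack([x[b][index[b]] for b in range(x.shape[0])])
print(out.shape)   # (1, 2, 3)
print(out[0])      # [[3. 4. 5.] [5. 6. 7.]], as in the Doc example
```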
0000000000000000000000000000000000000000..fa3a96f8a7061b8340d58265d9a876900bf0f234 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/src/gather_point_op.cu @@ -0,0 +1,126 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/platform/cuda_primitives.h" + +#include "util.cu.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; + +template +__global__ void GatherPointKernel(int b, + int n, + int m, + const T *__restrict__ inp, + const int *__restrict__ idx, + T *__restrict__ out) { + for (int i = blockIdx.x; i < b; i += gridDim.x) { + for (int j = blockIdx.y * blockDim.x + threadIdx.x; j < m; + j += blockDim.x * gridDim.y) { + int a = idx[i * m + j]; + for (int k = 0; k < 3; k++) { + out[(i * m + j) * 3 + k] = inp[(i * n + a) * 3 + k]; + } + } + } +} + +template +__global__ void GatherPointGradKernel(int b, + int n, + int m, + const T *__restrict__ out_grad, + const int *__restrict__ idx, + T *__restrict__ in_grad) { + for (int i = blockIdx.x; i < b; i += gridDim.x) { + for (int j = blockIdx.y * blockDim.x + threadIdx.x; j < m; + j += blockDim.x * gridDim.y) { + int a = idx[i * m + j]; + const T *out_grad_pos = &out_grad[(i * m + j) * 3]; + T *in_grad_pos = &in_grad[(i * n + a) * 3]; + for (int k = 0; k < 3; k++) { + platform::CudaAtomicAdd(&in_grad_pos[k], out_grad_pos[k]); + } + } + } +} + +template +class GatherPointOpCUDAKernel : public framework::OpKernel { +public: + void Compute(const framework::ExecutionContext &ctx) const override { + PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), + "This kernel only runs on GPU device."); + auto *points = ctx.Input("X"); + auto *index = ctx.Input("Index"); + auto *output = ctx.Output("Output"); + + if (points->numel() == 0) return; + + const T *p_points = points->data(); + const int *p_index = index->data(); + T *p_out_points = output->mutable_data(ctx.GetPlace()); + + int batch_size = points->dims()[0]; + int n_points = points->dims()[1]; + int m_points = index->dims()[1]; + + GatherPointKernel<<>>( + batch_size, n_points, m_points, p_points, p_index, p_out_points); + } +}; + +template +class GatherPointGradOpCUDAKernel : public framework::OpKernel { +public: + void Compute(const framework::ExecutionContext &ctx) const override { + auto *points = ctx.Input("X"); + auto *index = ctx.Input("Index"); + auto *output_grad = ctx.Input(framework::GradVarName("Output")); + auto *points_grad = ctx.Output(framework::GradVarName("X")); + + if (points->numel() == 0) return; + + const T *p_output_grad = output_grad->data(); + const int *p_index = index->data(); + T *p_points_grad = points_grad->mutable_data(ctx.GetPlace()); + int pnum = points_grad->numel(); + + auto &dev_ctx = ctx.template device_context(); + Zero<<<(pnum + 512 - 1) / 512, 512, 0, dev_ctx.stream()>>>(p_points_grad, + pnum); + + int batch_size = points->dims()[0]; + int n_points = 
points->dims()[1]; + int m_points = index->dims()[1]; + + GatherPointGradKernel<<>>( + batch_size, n_points, m_points, p_output_grad, p_index, p_points_grad); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL(gather_point, + ops::GatherPointOpCUDAKernel, + ops::GatherPointOpCUDAKernel, + ops::GatherPointOpCUDAKernel); +REGISTER_OP_CUDA_KERNEL(gather_point_grad, + ops::GatherPointGradOpCUDAKernel, + ops::GatherPointGradOpCUDAKernel, + ops::GatherPointGradOpCUDAKernel); diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/src/group_points_op.cc b/PaddleCV/Paddle3D/PointNet++/ext_op/src/group_points_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..7266c553b2d2da95a8fa6355a0ffa2250ba01f71 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/src/group_points_op.cc @@ -0,0 +1,124 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. */ + +#include +#include +#include +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +using framework::Tensor; + +class GroupPointsOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + protected: + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), + "Input(X) of GroupPointsOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Idx"), + "Input(Idx) of GroupPointsOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Out"), + "Output(Out) of GroupPointsOp should not be null."); + + auto dim_x = ctx->GetInputDim("X"); // [B, C, N] + PADDLE_ENFORCE_EQ(dim_x.size(), 3, "X's dimension must be 3"); + + auto dim_idx = ctx->GetInputDim("Idx"); // [B, npoints, nsample] + PADDLE_ENFORCE_EQ(dim_idx.size(), 3, "Idx's dimension must be 3"); + + PADDLE_ENFORCE_EQ(dim_x[0], dim_idx[0], + "X and Idx dim[0] should be equal."); + + // output: [B, C, M, S] + std::vector dim_out({dim_x[0], dim_x[1], dim_idx[1], dim_idx[2]}); + ctx->SetOutputDim("Out", framework::make_ddim(dim_out)); + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType(ctx.Input("X")->type(), + ctx.GetPlace()); + } +}; + +class GroupPointsOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override { + AddInput("X", + "The input tensor of group_points operator. " + "This is a 3-D tensor with shape of [B, C, N]."); + AddInput("Idx", + "The input tensor of nearest neighbor index of group_points " + "operator. This is a 3-D tensor with shape of [B, M, S]."); + AddOutput("Out", + "The output tensor of group_points operator. " + "This is a 4-D tensor with shape of [B, C, M, S]."); + + AddComment(R"DOC( + This operator group input points with index. 
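Because several output rows may gather from the same input point, the gather_point backward kernel above first zeroes the gradient buffer and then accumulates with CudaAtomicAdd. A NumPy sketch of that scatter-add (illustrative):

```python
import numpy as np

def gather_point_grad_np(out_grad, index, n):
    # out_grad: (B, M, 3), index: (B, M) -> x_grad: (B, N, 3)
    b, m, _ = out_grad.shape
    x_grad = np.zeros((b, n, 3), dtype=out_grad.dtype)
    for i in range(b):
        for j in range(m):
            # duplicate indices accumulate, matching the atomic adds
            x_grad[i, index[i, j]] += out_grad[i, j]
    return x_grad
```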
+ )DOC"); + } +}; + +class GroupPointsOpGrad : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + protected: + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("Idx"), "Input(Idx) should not be null"); + PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")), + "Input(Out@GRAD) should not be null"); + auto dim_x = ctx->GetInputDim("X"); + if (ctx->HasOutput(framework::GradVarName("X"))) { + ctx->SetOutputDim(framework::GradVarName("X"), dim_x); + } + } + + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType( + ctx.Input(framework::GradVarName("Out"))->type(), + ctx.GetPlace()); + } +}; + +template +class GroupPointsGradDescMaker : public framework::SingleGradOpMaker { + public: + using framework::SingleGradOpMaker::SingleGradOpMaker; + + protected: + std::unique_ptr Apply() const override { + auto* op = new T(); + op->SetType("group_points_grad"); + op->SetInput("X", this->Input("X")); + op->SetInput("Idx", this->Input("Idx")); + op->SetInput(framework::GradVarName("Out"), this->OutputGrad("Out")); + op->SetOutput(framework::GradVarName("X"), this->InputGrad("X")); + op->SetAttrMap(this->Attrs()); + return std::unique_ptr(op); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(group_points, ops::GroupPointsOp, ops::GroupPointsOpMaker, + ops::GroupPointsGradDescMaker, + ops::GroupPointsGradDescMaker); +REGISTER_OPERATOR(group_points_grad, ops::GroupPointsOpGrad); diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/src/group_points_op.cu b/PaddleCV/Paddle3D/PointNet++/ext_op/src/group_points_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..0d7e02898f3dc68f59215b89356bb56b957b524a --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/src/group_points_op.cu @@ -0,0 +1,144 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
*/ + +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/platform/cuda_primitives.h" + +#include "util.cu.h" + +#define TOTAL_THREADS 1024 +#define THREADS_PER_BLOCK 256 +#define DIVUP(m, n) ((m) / (n) + ((m) % (n) > 0)) + +namespace paddle { +namespace operators { + +using framework::Tensor; + +template +__global__ void KeGroupPointsFW(int b, int c, int n, int npoints, int nsample, + const T* __restrict__ points, + const int* __restrict__ idx, + T* __restrict__ out) { + // points: (B, C, N) + // idx: (B, npoints, nsample) + // output: + // out: (B, C, npoints, nsample) + int bs_idx = blockIdx.z; + int c_idx = blockIdx.y; + int index = blockIdx.x * blockDim.x + threadIdx.x; + int pt_idx = index / nsample; + if (bs_idx >= b || c_idx >= c || pt_idx >= npoints) return; + + int sample_idx = index % nsample; + + idx += bs_idx * npoints * nsample + pt_idx * nsample + sample_idx; + int in_idx = bs_idx * c * n + c_idx * n + idx[0]; + int out_idx = bs_idx * c * npoints * nsample + c_idx * npoints * nsample + + pt_idx * nsample + sample_idx; + + out[out_idx] = points[in_idx]; +} + +template + +__global__ void KeGroupPointsBW(int b, int c, int n, int npoints, int nsample, + const T* __restrict__ grad_out, + const int* __restrict__ idx, + T* __restrict__ grad_points) { + // grad_out: (B, C, npoints, nsample) + // idx: (B, npoints, nsample) + // output: + // grad_points: (B, C, N) + int bs_idx = blockIdx.z; + int c_idx = blockIdx.y; + int index = blockIdx.x * blockDim.x + threadIdx.x; + int pt_idx = index / nsample; + if (bs_idx >= b || c_idx >= c || pt_idx >= npoints) return; + + int sample_idx = index % nsample; + grad_out += bs_idx * c * npoints * nsample + c_idx * npoints * nsample + + pt_idx * nsample + sample_idx; + idx += bs_idx * npoints * nsample + pt_idx * nsample + sample_idx; + + platform::CudaAtomicAdd(grad_points + bs_idx * c * n + c_idx * n + idx[0], + grad_out[0]); +} + +template +class GroupPointsOpCUDAKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), + "This kernel only runs on GPU device."); + auto* input = ctx.Input("X"); + auto* idx = ctx.Input("Idx"); + auto* output = ctx.Output("Out"); + auto* input_data = input->data(); + auto* idx_data = idx->data(); + + const int b = input->dims()[0]; + const int c = input->dims()[1]; + const int n = input->dims()[2]; + const int m = idx->dims()[1]; + const int s = idx->dims()[2]; + + auto* output_data = output->mutable_data({b, c, m, s}, ctx.GetPlace()); + + dim3 blocks(DIVUP(m * s, THREADS_PER_BLOCK), c, b); + dim3 threads(THREADS_PER_BLOCK); + KeGroupPointsFW< + T><<>>( + b, c, n, m, s, input_data, idx_data, output_data); + } +}; + +template +class GroupPointsGradOpCUDAKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* input = ctx.Input("X"); + auto* idx = ctx.Input("Idx"); + auto* output_grad = ctx.Input(framework::GradVarName("Out")); + auto* input_grad = ctx.Output(framework::GradVarName("X")); + auto* idx_data = idx->data(); + auto output_grad_data = output_grad->data(); + + const int b = input->dims()[0]; + const int c = input->dims()[1]; + const int n = input->dims()[2]; + const int m = idx->dims()[1]; + const int s = idx->dims()[2]; + + auto* input_grad_data = + input_grad->mutable_data({b, c, n}, ctx.GetPlace()); + auto& dev_ctx = + ctx.template device_context(); + int pnum = input_grad->numel(); + Zero<<<(pnum 
+ 512 - 1) / 512, 512, 0, dev_ctx.stream()>>>(input_grad_data, + pnum); + + dim3 blocks(DIVUP(m * s, THREADS_PER_BLOCK), c, b); + dim3 threads(THREADS_PER_BLOCK); + + KeGroupPointsBW<<>>( + b, c, n, m, s, output_grad_data, idx_data, input_grad_data); + } +}; +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL(group_points, ops::GroupPointsOpCUDAKernel, + ops::GroupPointsOpCUDAKernel); +REGISTER_OP_CUDA_KERNEL(group_points_grad, + ops::GroupPointsGradOpCUDAKernel, + ops::GroupPointsGradOpCUDAKernel); diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/src/make.sh b/PaddleCV/Paddle3D/PointNet++/ext_op/src/make.sh new file mode 100644 index 0000000000000000000000000000000000000000..79505635c0392a32c065a99502614128397bf3e7 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/src/make.sh @@ -0,0 +1,21 @@ +include_dir=$( python -c 'import paddle; print(paddle.sysconfig.get_include())' ) +lib_dir=$( python -c 'import paddle; print(paddle.sysconfig.get_lib())' ) + +echo $include_dir +echo $lib_dir + +OPS='farthest_point_sampling_op gather_point_op group_points_op query_ball_op three_interp_op three_nn_op' +for op in ${OPS} +do +nvcc ${op}.cu -c -o ${op}.cu.o -ccbin cc -DPADDLE_WITH_CUDA -DEIGEN_USE_GPU -DPADDLE_USE_DSO -DPADDLE_WITH_MKLDNN -Xcompiler -fPIC -std=c++11 -Xcompiler -fPIC -w --expt-relaxed-constexpr -O0 -g -DNVCC \ + -I ${include_dir}/third_party/ \ + -I ${include_dir} +done + +g++ farthest_point_sampling_op.cc farthest_point_sampling_op.cu.o gather_point_op.cc gather_point_op.cu.o group_points_op.cc group_points_op.cu.o query_ball_op.cu.o query_ball_op.cc three_interp_op.cu.o three_interp_op.cc three_nn_op.cu.o three_nn_op.cc -o pointnet_lib.so -DPADDLE_WITH_MKLDNN -shared -fPIC -std=c++11 -O0 -g \ + -I ${include_dir}/third_party/ \ + -I ${include_dir} \ + -L ${lib_dir} \ + -L /usr/local/cuda/lib64 -lpaddle_framework -lcudart + +rm *.cu.o diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/src/query_ball_op.cc b/PaddleCV/Paddle3D/PointNet++/ext_op/src/query_ball_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..c473b0d325db422a25fac4a133c7127311418557 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/src/query_ball_op.cc @@ -0,0 +1,82 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
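As with gather_point, the group_points backward kernel above accumulates atomically because the same source point can appear in many groups. A NumPy sketch of the accumulation (illustrative):

```python
import numpy as np

def group_points_grad_np(out_grad, idx, n):
    # out_grad: (B, C, M, S), idx: (B, M, S) -> x_grad: (B, C, N)
    b, c, m, s = out_grad.shape
    x_grad = np.zeros((b, c, n), dtype=out_grad.dtype)
    for i in range(b):
        for j in range(m):
            for k in range(s):
                x_grad[i, :, idx[i, j, k]] += out_grad[i, :, j, k]
    return x_grad
```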
*/ + +#include "paddle/fluid/framework/op_registry.h" +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; + +class QueryBallOp : public framework::OperatorWithKernel { +public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext *ctx) const override { + // points: [b,n,3] + PADDLE_ENFORCE(ctx->HasInput("Points"), "Input(Points) shoud not be null"); + auto p_dims = ctx->GetInputDim("Points"); + PADDLE_ENFORCE(p_dims.size() == 3 && p_dims[2] == 3, + "Input(Points) of QueryBallOp should be 3-D Tensor, the " + "last dimension must be 3"); + // new_points: [b,m,3] + PADDLE_ENFORCE(ctx->HasInput("New_Points"), + "Input(New_Points) shoud not be null"); + auto np_dims = ctx->GetInputDim("New_Points"); + PADDLE_ENFORCE(np_dims.size() == 3 && np_dims[2] == 3, + "Input(New_Points) of QueryBallOp should be 3-D Tensor, the " + "last dimension must be 3"); + int n_sample = ctx->Attrs().Get("N_sample"); + PADDLE_ENFORCE(n_sample >= 0, + "The n_sample should be greater than or equal to 0."); + float radius = ctx->Attrs().Get("Radius"); + PADDLE_ENFORCE(radius >= 0, + "The radius should be greater than or equal to 0."); + // output: [b,m,nsample] + std::vector dim_out({p_dims[0], np_dims[1], n_sample}); + ctx->SetOutputDim("Output", framework::make_ddim(dim_out)); + } + +protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext &ctx) const override { + auto input_data_type = ctx.Input("Points")->type(); + return framework::OpKernelType(input_data_type, ctx.GetPlace()); + } +}; + +class QueryBallOpMaker : public framework::OpProtoAndCheckerMaker { +public: + void Make() override { + AddInput("Points", + "Input points with shape (batch, n, 3), n is input " + "points's num"); + AddInput("New_Points", + "Query points with shape (batch, m, 3), m is query points's num"); + AddOutput("Output", "output points with shape(batch, m, nsample)"); + AddAttr("N_sample", + R"Doc(Number of points selected in each ball region")Doc") + .SetDefault(0) + .EqualGreaterThan(0); + AddAttr("Radius", + R"Doc(Ball search radius with shape(1))Doc") + .SetDefault(0) + .EqualGreaterThan(0); + + AddComment( + R"Doc(Query Ball Points)Doc"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(query_ball, ops::QueryBallOp, ops::QueryBallOpMaker); diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/src/query_ball_op.cu b/PaddleCV/Paddle3D/PointNet++/ext_op/src/query_ball_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..8e8917f1b0e39bbf5793999fe5ebb5d5df699863 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/src/query_ball_op.cu @@ -0,0 +1,113 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/op_registry.h" + +#include "util.cu.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; + +template +// input: radius (1), nsample (1), points (b,n,3), new_points (b,m,3) +// output: idx (b,m,nsample) +__global__ void QueryBall(int b, + int n, + int m, + T radius, + int nsample, + const T *points, + const T *new_points, + int *idx) { + int batch_index = blockIdx.x; + points += n * 3 * batch_index; + new_points += m * 3 * batch_index; + idx += m * nsample * batch_index; + + int index = threadIdx.x; + int stride = blockDim.x; + + for (int j = index; j < m; j += stride) { + int cnt = 0; + for (int k = 0; k < n; ++k) { + if (cnt == nsample) + break; // only pick the FIRST nsample points in the ball + float x2 = new_points[j * 3 + 0]; + float y2 = new_points[j * 3 + 1]; + float z2 = new_points[j * 3 + 2]; + float x1 = points[k * 3 + 0]; + float y1 = points[k * 3 + 1]; + float z1 = points[k * 3 + 2]; + float d = + (x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1) + (z2 - z1) * (z2 - z1); + if (d < radius * radius) { + if (cnt == 0) { // set ALL indices to k, s.t. if there are less points + // in ball than nsample, we still have valid + // (repeating) indices + for (int l = 0; l < nsample; ++l) idx[j * nsample + l] = k; + } + idx[j * nsample + cnt] = k; + cnt += 1; + } + } + } +} + +template +class QueryBallOpCUDAKernel : public framework::OpKernel { +public: + void Compute(const framework::ExecutionContext &ctx) const override { + PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), + "This kernel only runs on GPU device."); + // input: radius (1), nsample (1), points (b,n,3), new_points (b,m,3) + // output: idx (b,m,nsample) + auto *points = ctx.Input("Points"); + auto *new_points = ctx.Input("New_Points"); + auto *output = ctx.Output("Output"); + + float radius = ctx.Attr("Radius"); + int nsample = ctx.Attr("N_sample"); + + if (points->numel() == 0 || new_points->numel() == 0) return; + + int batch_size = points->dims()[0]; + int n = points->dims()[1]; + int m = new_points->dims()[1]; + // allocate memory + int* p_out_points = output->mutable_data({batch_size, m, nsample}, ctx.GetPlace()); + + auto& dev_ctx = ctx.template device_context(); + int pnum = output->numel(); + Zero<<<(pnum + 512 - 1) / 512, 512, 0, dev_ctx.stream()>>>(p_out_points, + pnum); + + const T *p_points = points->data(); + const T *p_new_points = new_points->data(); + + QueryBall<<>>(batch_size, + n, + m, + radius, + nsample, + p_points, + p_new_points, + p_out_points); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL(query_ball, ops::QueryBallOpCUDAKernel); diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/src/three_interp_op.cc b/PaddleCV/Paddle3D/PointNet++/ext_op/src/three_interp_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..b7bfbe7f935b74c46a795dd5370c814e4f5350c4 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/src/three_interp_op.cc @@ -0,0 +1,142 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. 
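A subtlety of the QueryBall kernel above: when a ball contains fewer than nsample points, every slot is first filled with the index of the first hit, so the output always holds valid (possibly repeated) indices. A small NumPy illustration with made-up coordinates:

```python
import numpy as np

points = np.array([[[0.5, 0.0, 0.0],     # index 0: inside the ball
                    [5.0, 5.0, 5.0],     # index 1: outside
                    [0.0, 1.0, 0.0]]])   # index 2: inside
query = np.array([[[0.0, 0.0, 0.0]]])    # one query point at the origin
radius, nsample = 1.5, 4

idx = np.zeros(nsample, dtype=np.int32)
cnt = 0
for k in range(points.shape[1]):
    if cnt == nsample:
        break
    if np.sum((points[0, k] - query[0, 0]) ** 2) < radius ** 2:
        if cnt == 0:
            idx[:] = k        # pad every slot with the first hit
        idx[cnt] = k
        cnt += 1
print(idx)   # [0 2 0 0] -- repeated indices instead of invalid ones
```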
+ You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. */ + +#include +#include +#include +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +using framework::Tensor; + +class ThreeInterpOp : public framework::OperatorWithKernel { +public: + using framework::OperatorWithKernel::OperatorWithKernel; + +protected: + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), + "Input(X) of ThreeInterpOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Weight"), + "Input(Weight) of ThreeInterpOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Idx"), + "Input(Idx) of ThreeInterpOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Out"), + "Output(Out) of ThreeInterpOp should not be null."); + + auto dim_x = ctx->GetInputDim("X"); // [B, M, C] + PADDLE_ENFORCE_EQ(dim_x.size(), 3, "X's dimension must be 3"); + + auto dim_weight = ctx->GetInputDim("Weight"); // [B, N, 3] + PADDLE_ENFORCE_EQ(dim_weight.size(), 3, "Weight's dimension must be 3"); + + PADDLE_ENFORCE_EQ( + dim_x[0], dim_weight[0], "X and Weight dim[0] should be equal."); + + auto dim_idx = ctx->GetInputDim("Idx"); // [B, N, 3] + PADDLE_ENFORCE_EQ(dim_idx.size(), 3, "Idx's dimension must be 3"); + + for (int i = 0; i < 3; i++) { + PADDLE_ENFORCE_EQ( + dim_weight[i], dim_idx[i], "Weight and Idx shape should be same."); + } + + // output: [B, N, C] + std::vector dim_out({dim_x[0], dim_idx[1], dim_x[2]}); + ctx->SetOutputDim("Out", framework::make_ddim(dim_out)); + } + +protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType(ctx.Input("X")->type(), + ctx.GetPlace()); + } +}; + +class ThreeInterpOpMaker : public framework::OpProtoAndCheckerMaker { +public: + void Make() override { + AddInput("X", + "The input tensor of three_interp operator. " + "This is a 3-D tensor with shape of [B, M, C]."); + AddInput("Weight", + "The input tensor of point weight of three_interp operator. " + "This is a 3-D tensor with shape of [B, N, 3]."); + AddInput("Idx", + "The input tensor of nearest neighbor index of three_interp " + "operator. This is a 3-D tensor with shape of [B, N, 3]."); + AddOutput("Out", + "The output tensor of three_interp operator. " + "This is a 3-D tensor with shape of [B, N, 3]."); + + AddComment(R"DOC( + This operator calculate interpolate results from input, weight and + index. 
+ )DOC"); + } +}; + +class ThreeInterpOpGrad : public framework::OperatorWithKernel { +public: + using framework::OperatorWithKernel::OperatorWithKernel; + +protected: + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("Weight"), "Input(Weight) should not be null"); + PADDLE_ENFORCE(ctx->HasInput("Idx"), "Input(Idx) should not be null"); + PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")), + "Input(Out@GRAD) should not be null"); + auto dim_x = ctx->GetInputDim("X"); + if (ctx->HasOutput(framework::GradVarName("X"))) { + ctx->SetOutputDim(framework::GradVarName("X"), dim_x); + } + } + + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType( + ctx.Input(framework::GradVarName("Out"))->type(), + ctx.GetPlace()); + } +}; + +template +class ThreeInterpGradDescMaker : public framework::SingleGradOpMaker { +public: + using framework::SingleGradOpMaker::SingleGradOpMaker; + +protected: + std::unique_ptr Apply() const override { + auto* op = new T(); + op->SetType("three_interp_grad"); + op->SetInput("X", this->Input("X")); + op->SetInput("Weight", this->Input("Weight")); + op->SetInput("Idx", this->Input("Idx")); + op->SetInput(framework::GradVarName("Out"), this->OutputGrad("Out")); + op->SetOutput(framework::GradVarName("X"), this->InputGrad("X")); + op->SetAttrMap(this->Attrs()); + return std::unique_ptr(op); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(three_interp, + ops::ThreeInterpOp, + ops::ThreeInterpOpMaker, + ops::ThreeInterpGradDescMaker, + ops::ThreeInterpGradDescMaker); +REGISTER_OPERATOR(three_interp_grad, ops::ThreeInterpOpGrad); diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/src/three_interp_op.cu b/PaddleCV/Paddle3D/PointNet++/ext_op/src/three_interp_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..0e23440b70da75fbcfa32d32be741904bdfedaa0 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/src/three_interp_op.cu @@ -0,0 +1,152 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
*/ + +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/platform/cuda_primitives.h" + +#include "util.cu.h" + +namespace paddle { +namespace operators { + +using framework::Tensor; + +template +__global__ void KeThreeInterpFw(T* output, + const T* input, + const T* weight, + const int* idx, + const int b, + const int m, + const int c, + const int n) { + int nthreads = b * n * c; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + for (; tid < nthreads; tid += stride) { + int bi = tid / n / c; + int ni = (tid % (n * c)) / c; + int ci = tid % c; + + int input_base_idx = bi * m * c; + int w_idx = bi * n * 3 + ni * 3; + output[tid] = + input[input_base_idx + idx[w_idx] * c + ci] * weight[w_idx] + + input[input_base_idx + idx[w_idx + 1] * c + ci] * weight[w_idx + 1] + + input[input_base_idx + idx[w_idx + 2] * c + ci] * weight[w_idx + 2]; + } +} + +template +__global__ void KeThreeInterpBw(T* input_grad, + const T* output_grad, + const T* weight, + const int* idx, + const int b, + const int m, + const int c, + const int n) { + int nthreads = b * n * c; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + for (; tid < nthreads; tid += stride) { + int bi = tid / n / c; + int ni = (tid % (c * n)) / c; + int ci = tid % c; + + int input_base_idx = bi * m * c; + int w_idx = bi * n * 3 + ni * 3; + platform::CudaAtomicAdd(&input_grad[input_base_idx + idx[w_idx] * c + ci], + output_grad[tid] * weight[w_idx]); + platform::CudaAtomicAdd( + &input_grad[input_base_idx + idx[w_idx + 1] * c + ci], + output_grad[tid] * weight[w_idx + 1]); + platform::CudaAtomicAdd( + &input_grad[input_base_idx + idx[w_idx + 2] * c + ci], + output_grad[tid] * weight[w_idx + 2]); + } +} + +template +class ThreeInterpOpCUDAKernel : public framework::OpKernel { +public: + void Compute(const framework::ExecutionContext& ctx) const override { + PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), + "This kernel only runs on GPU device."); + auto* input = ctx.Input("X"); + auto* weight = ctx.Input("Weight"); + auto* idx = ctx.Input("Idx"); + auto* output = ctx.Output("Out"); + auto* input_data = input->data(); + auto* weight_data = weight->data(); + auto* idx_data = idx->data(); + + const int b = input->dims()[0]; + const int m = input->dims()[1]; + const int c = input->dims()[2]; + const int n = weight->dims()[1]; + + auto* output_data = output->mutable_data({b, n, c}, ctx.GetPlace()); + + int pixelNum = b * n * c; + int grid_dim = (pixelNum + 512 - 1) / 512; + grid_dim = grid_dim > 8 ? 
8 : grid_dim; + + KeThreeInterpFw< + T><<>>( + output_data, input_data, weight_data, idx_data, b, m, c, n); + } +}; + +template +class ThreeInterpGradOpCUDAKernel : public framework::OpKernel { +public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* input = ctx.Input("X"); + auto* weight = ctx.Input("Weight"); + auto* idx = ctx.Input("Idx"); + auto* output_grad = ctx.Input(framework::GradVarName("Out")); + auto* input_grad = ctx.Output(framework::GradVarName("X")); + auto* weight_data = weight->data(); + auto* idx_data = idx->data(); + auto output_grad_data = output_grad->data(); + + const int b = input->dims()[0]; + const int m = input->dims()[1]; + const int c = input->dims()[2]; + const int n = weight->dims()[1]; + + auto* input_grad_data = + input_grad->mutable_data({b, m, c}, ctx.GetPlace()); + auto& dev_ctx = ctx.template device_context(); + int pnum = input_grad->numel(); + Zero<<<(pnum + 512 - 1) / 512, 512, 0, dev_ctx.stream()>>>(input_grad_data, + pnum); + + int pixelNum = b * n * c; + int grid_dim = (pixelNum + 512 - 1) / 512; + grid_dim = grid_dim > 8 ? 8 : grid_dim; + + KeThreeInterpBw< + T><<>>( + input_grad_data, output_grad_data, weight_data, idx_data, b, m, c, n); + } +}; +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL(three_interp, + ops::ThreeInterpOpCUDAKernel, + ops::ThreeInterpOpCUDAKernel); +REGISTER_OP_CUDA_KERNEL(three_interp_grad, + ops::ThreeInterpGradOpCUDAKernel, + ops::ThreeInterpGradOpCUDAKernel); diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/src/three_nn_op.cc b/PaddleCV/Paddle3D/PointNet++/ext_op/src/three_nn_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..5ca8b261c79cb2c7a28d1e2e8064dabc3eba921b --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/src/three_nn_op.cc @@ -0,0 +1,93 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
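The three_interp backward kernel above routes each output gradient back to its three source points, scaled by the same weights as the forward interpolation, again using atomic adds into a zeroed buffer. A NumPy sketch (illustrative):

```python
import numpy as np

def three_interp_grad_np(out_grad, weight, idx, m):
    # out_grad: (B, N, C), weight/idx: (B, N, 3) -> x_grad: (B, M, C)
    b, n, c = out_grad.shape
    x_grad = np.zeros((b, m, c), dtype=out_grad.dtype)
    for i in range(b):
        for j in range(n):
            for k in range(3):
                # same neighbor index and weight as the forward pass
                x_grad[i, idx[i, j, k]] += weight[i, j, k] * out_grad[i, j]
    return x_grad
```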
*/ + +#include +#include +#include +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +using framework::Tensor; + +class ThreeNNOp : public framework::OperatorWithKernel { +public: + using framework::OperatorWithKernel::OperatorWithKernel; + +protected: + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), + "Input(X) of ThreeNNOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("Known"), + "Input(Known) of ThreeNNOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Distance"), + "Output(Distance) of ThreeNNOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Idx"), + "Output(Idx) of ThreeNNOp should not be null."); + + auto dim_x = ctx->GetInputDim("X"); // [B, N, 3] + PADDLE_ENFORCE_EQ(dim_x.size(), 3, "X's dimension must be 3"); + PADDLE_ENFORCE_EQ(dim_x[2], 3, "X dim[2] must be 3"); + + auto dim_known = ctx->GetInputDim("Known"); // [B, M, 3] + PADDLE_ENFORCE_EQ(dim_known.size(), 3, "Known's dimension must be 3"); + PADDLE_ENFORCE_EQ(dim_known[2], 3, "Known dim[2] must be 3"); + + PADDLE_ENFORCE_EQ( + dim_x[0], dim_known[0], "X and Known dim[0] should be equal."); + PADDLE_ENFORCE_GE( + dim_known[1], 3, "Known dim[1] shoule be greater or euqal than 3."); + + ctx->SetOutputDim("Distance", dim_x); + ctx->SetOutputDim("Idx", dim_x); + } + +protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType(ctx.Input("X")->type(), + ctx.GetPlace()); + } +}; + +class ThreeNNOpMaker : public framework::OpProtoAndCheckerMaker { +public: + void Make() override { + AddInput("X", + "The input tensor of three_nn operator. " + "This is a 3-D tensor with shape of [B, N, 3]."); + AddInput("Known", + "The input tensor of known points of three_nn operator. " + "This is a 3-D tensor with shape of [B, M, 3]."); + AddOutput("Distance", + "The output distance tensor of three_nn operator. " + "This is a 3-D tensor with shape of [B, N, 3]."); + AddOutput("Idx", + "The output index tensor of three_nn operator. " + "This is a 3-D tensor with shape of [B, N, 3]."); + + AddAttr("eps", "minimum value of distance.").SetDefault(1e-10); + + AddComment(R"DOC( + This operator samples the top-3 nearest neighbor of each point + coordinates specified by Input(X) between known point coordinates + specified by Input(Known) and calcualte the distance between these + nearest neighbors. + )DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(three_nn, ops::ThreeNNOp, ops::ThreeNNOpMaker); diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/src/three_nn_op.cu b/PaddleCV/Paddle3D/PointNet++/ext_op/src/three_nn_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..4120599a9dfde4d75fcceb6f3f17a002d3d896d9 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/src/three_nn_op.cu @@ -0,0 +1,110 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ See the License for the specific language governing permissions and + limitations under the License. */ + +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/platform/cuda_primitives.h" + +namespace paddle { +namespace operators { + +using framework::Tensor; + +template +__global__ void KeThreeNNFw(T* distance, + int* idx, + const T* input, + const T* known, + const float eps, + const int b, + const int n, + const int m) { + int nthreads = b * n; + int tid = blockIdx.x * blockDim.x + threadIdx.x; + int stride = blockDim.x * gridDim.x; + for (; tid < nthreads; tid += stride) { + int bi = tid / n; + int ni = tid % n; + + int input_idx = tid * 3; + T x1 = input[input_idx]; + T y1 = input[input_idx + 1]; + T z1 = input[input_idx + 2]; + + distance[input_idx] = 1e40; + distance[input_idx + 1] = 1e40; + distance[input_idx + 2] = 1e40; + idx[input_idx] = 0; + idx[input_idx + 1] = 0; + idx[input_idx + 2] = 0; + for (int i = 0; i < m; i++) { + int known_idx = bi * m * 3 + i * 3; + double dist = (x1 - known[known_idx]) * (x1 - known[known_idx]) + + (y1 - known[known_idx + 1]) * (y1 - known[known_idx + 1]) + + (z1 - known[known_idx + 2]) * (z1 - known[known_idx + 2]); + T valid_dist = dist > eps ? static_cast(dist) : eps; + if (dist < distance[input_idx]) { + distance[input_idx + 2] = distance[input_idx + 1]; + idx[input_idx + 2] = idx[input_idx + 1]; + distance[input_idx + 1] = distance[input_idx]; + idx[input_idx + 1] = idx[input_idx]; + distance[input_idx] = dist; + idx[input_idx] = i; + } else if (dist < distance[input_idx + 1]) { + distance[input_idx + 2] = distance[input_idx + 1]; + idx[input_idx + 2] = idx[input_idx + 1]; + distance[input_idx + 1] = dist; + idx[input_idx + 1] = i; + } else if (dist < distance[input_idx + 2]) { + distance[input_idx + 2] = dist; + idx[input_idx + 2] = i; + } + } + } +} + +template +class ThreeNNOpCUDAKernel : public framework::OpKernel { +public: + void Compute(const framework::ExecutionContext& ctx) const override { + PADDLE_ENFORCE(platform::is_gpu_place(ctx.GetPlace()), + "This kernel only runs on GPU device."); + auto* input = ctx.Input("X"); + auto* known = ctx.Input("Known"); + auto* distance = ctx.Output("Distance"); + auto* idx = ctx.Output("Idx"); + auto* input_data = input->data(); + auto* known_data = known->data(); + + const float eps = ctx.Attr("eps"); + + const int b = input->dims()[0]; + const int n = input->dims()[1]; + const int m = known->dims()[1]; + + auto* idx_data = idx->mutable_data({b, n, 3}, ctx.GetPlace()); + auto* distance_data = distance->mutable_data({b, n, 3}, ctx.GetPlace()); + + int pixelNum = b * n; + int grid_dim = (pixelNum + 512 - 1) / 512; + grid_dim = grid_dim > 8 ? 8 : grid_dim; + + KeThreeNNFw<<>>( + distance_data, idx_data, input_data, known_data, eps, b, n, m); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL(three_nn, + ops::ThreeNNOpCUDAKernel, + ops::ThreeNNOpCUDAKernel); diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/src/util.cu.h b/PaddleCV/Paddle3D/PointNet++/ext_op/src/util.cu.h new file mode 100644 index 0000000000000000000000000000000000000000..05f1e9f9046644df1d92c4aca592d4a9017e4d5b --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/src/util.cu.h @@ -0,0 +1,18 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. 
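The three_nn kernel above only returns the three nearest distances and indices; turning those distances into interpolation weights for three_interp is left to the caller. The sketch below shows the usual inverse-distance weighting used in PointNet++ feature propagation; this recipe is an assumption about how the two ops are combined, not something defined by the operators in this diff, and the eps clamp mirrors the op's "minimum value of distance" attribute.

```python
import numpy as np

def interp_weights_np(dist, eps=1e-10):
    """dist: (B, N, 3) squared distances from three_nn.
    Returns normalized 1/d weights of the same shape, suitable as the
    Weight input of three_interp (illustrative recipe)."""
    recip = 1.0 / np.maximum(dist, eps)
    return recip / np.sum(recip, axis=-1, keepdims=True)
```

With the compiled library loaded, the resulting weights would be fed to three_interp together with the Idx output of three_nn.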
+ You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. */ + +template +__global__ void Zero(T* x, int num) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num; + i += blockDim.x * gridDim.x) { + x[i] = static_cast(0); + } +} diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_farthest_point_sampling_op.py b/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_farthest_point_sampling_op.py new file mode 100644 index 0000000000000000000000000000000000000000..76df7c77f100070a75f253cfe9ac005663ecb3d7 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_farthest_point_sampling_op.py @@ -0,0 +1,63 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function + +import unittest +import numpy as np +import paddle.fluid as fluid +import pointnet_lib + + +def farthest_point_sampling_np(xyz, npoint): + B, N, C = xyz.shape + S = npoint + + centroids = np.zeros((B, S)) + distance = np.ones((B, N)) * 1e10 + farthest = 0 + batch_indices = np.arange(B).astype('int32') + for i in range(S): + centroids[:, i] = farthest + centroid = xyz[batch_indices, farthest, :].reshape((B, 1, 3)) + dist = np.sum((xyz - centroid)**2, -1) + mask = dist < distance + distance[mask] = dist[mask] + farthest = np.argmax(distance, -1) + return centroids.astype('int32') + + +class TestFarthestPointSamplingOp(unittest.TestCase): + def test_check_output(self): + x_shape = (1, 512, 3) + x_type = 'float32' + sampled_point_num = 256 + + x = fluid.layers.data( + name='x', shape=x_shape, dtype=x_type, append_batch_size=False) + y = pointnet_lib.farthest_point_sampling(x, sampled_point_num) + + x_np = np.random.randint(1, 100, (x_shape[0] * x_shape[1] * + 3, )).reshape(x_shape).astype(x_type) + out_np = farthest_point_sampling_np(x_np, sampled_point_num) + + place = fluid.CUDAPlace(0) + exe = fluid.Executor(place) + outs = exe.run(feed={'x': x_np}, fetch_list=[y]) + + self.assertTrue(np.allclose(outs[0], out_np)) + + +if __name__ == "__main__": + unittest.main() diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_gather_point_op.py b/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_gather_point_op.py new file mode 100644 index 0000000000000000000000000000000000000000..ff01bc8ad70e1a20ff854b270fa4d4fc2c2f08e1 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_gather_point_op.py @@ -0,0 +1,56 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function + +import unittest +import numpy as np +import paddle.fluid as fluid +import pointnet_lib + + +def gather_point_np(points, index): + result = [] + for i in range(len(index)): + a = points[i][index[i]] + result.append(a.tolist()) + return result + + +class TestGatherPointOp(unittest.TestCase): + def test_check_output(self): + x_shape = (1, 512, 3) + x_type = 'float32' + idx_shape = (1, 32) + idx_type = 'int32' + + x = fluid.layers.data( + name='x', shape=x_shape, dtype=x_type, append_batch_size=False) + idx = fluid.layers.data( + name='idx', shape=idx_shape, dtype=idx_type, append_batch_size=False) + y = pointnet_lib.gather_point(x, idx) + + x_np = np.random.uniform(-10, 10, x_shape).astype(x_type) + idx_np = np.random.randint(0, x_shape[1], idx_shape).astype(idx_type) + out_np = gather_point_np(x_np, idx_np) + + place = fluid.CUDAPlace(0) + exe = fluid.Executor(place) + outs = exe.run(feed={'x': x_np, 'idx': idx_np}, fetch_list=[y]) + + self.assertTrue(np.allclose(outs[0], out_np)) + + +if __name__ == "__main__": + unittest.main() diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_group_points_op.py b/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_group_points_op.py new file mode 100644 index 0000000000000000000000000000000000000000..8ab4fb7a9c5040bf2c8130d1bd211243038c046f --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_group_points_op.py @@ -0,0 +1,60 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from __future__ import print_function + +import unittest +import numpy as np +import paddle.fluid as fluid +import pointnet_lib + + +def group_points_np(x, idx): + b, m, s = idx.shape + _, c, n = x.shape + + output = np.zeros((b, c, m, s)).astype(x.dtype) + for i in range(b): + for j in range(m): + for k in range(s): + output[i, :, j, k] = x[i, :, idx[i, j, k]] + return output + + +class TestGroupPointsOp(unittest.TestCase): + def test_check_output(self): + x_shape = [8, 43, 29] + x_type = 'float32' + idx_shape = [8, 37, 41] + idx_type = 'int32' + + x = fluid.layers.data( + name='x', shape=x_shape, dtype=x_type, append_batch_size=False) + idx = fluid.layers.data( + name='idx', shape=idx_shape, dtype=idx_type, append_batch_size=False) + y = pointnet_lib.group_points(x, idx) + + x_np = np.random.uniform(-10, 10, x_shape).astype(x_type) + idx_np = np.random.randint(0, x_shape[2], idx_shape).astype(idx_type) + out_np = group_points_np(x_np, idx_np) + + place = fluid.CUDAPlace(0) + exe = fluid.Executor(place) + outs = exe.run(feed={'x': x_np, 'idx': idx_np}, fetch_list=[y]) + + self.assertTrue(np.allclose(outs[0], out_np)) + + +if __name__ == "__main__": + unittest.main() diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_query_ball_op.py b/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_query_ball_op.py new file mode 100644 index 0000000000000000000000000000000000000000..ab3ea1821f388108edf753420e90d37c8abfcbc0 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_query_ball_op.py @@ -0,0 +1,69 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from __future__ import print_function + +import unittest +import numpy as np +import paddle.fluid as fluid +import pointnet_lib + + +def query_ball_point_np(points, new_points, radius, nsample): + b, n, c = points.shape + _, m, _ = new_points.shape + out = np.zeros(shape=(b, m, nsample)).astype('int32') + radius_2 = radius * radius + for i in range(b): + for j in range(m): + cnt = 0 + for k in range(n): + if (cnt == nsample): + break + dist = np.sum(np.square(points[i][k] - new_points[i][j])) + if (dist < radius_2): + if cnt == 0: + out[i][j] = np.ones(shape=(nsample)) * k + out[i][j][cnt] = k + cnt += 1 + return out + + +class TestQueryBallOp(unittest.TestCase): + def test_check_output(self): + points_shape = [2, 5, 3] + new_points_shape = [2, 4, 3] + points_type = 'float32' + radius = 6 + nsample = 5 + + points = fluid.layers.data( + name='points', shape=points_shape, dtype=points_type, append_batch_size=False) + new_points = fluid.layers.data( + name='new_points', shape=new_points_shape, dtype=points_type, append_batch_size=False) + y = pointnet_lib.query_ball(points, new_points, radius, nsample) + + points_np = np.random.randint(1, 5, points_shape).astype(points_type) + new_points_np = np.random.randint(1, 5, new_points_shape).astype(points_type) + out_np = query_ball_point_np(points_np, new_points_np, radius, nsample) + + place = fluid.CUDAPlace(0) + exe = fluid.Executor(place) + outs = exe.run(feed={'points': points_np, 'new_points': new_points_np}, fetch_list=[y]) + + self.assertTrue(np.allclose(outs[0], out_np)) + + +if __name__ == "__main__": + unittest.main() diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_three_interp_op.py b/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_three_interp_op.py new file mode 100644 index 0000000000000000000000000000000000000000..e73fbad756ac5e5d3703f8354e2b0641b7cc9383 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_three_interp_op.py @@ -0,0 +1,66 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from __future__ import print_function + +import unittest +import numpy as np +import paddle.fluid as fluid +import pointnet_lib + + +def three_interp_np(x, weight, idx): + b, m, c = x.shape + n = weight.shape[1] + + output = np.zeros((b, n, c)).astype('float32') + for i in range(b): + for j in range(n): + w1, w2, w3 = weight[i, j, :] + i1, i2, i3 = idx[i, j, :] + output[i, j, :] = w1 * x[i, i1, :] \ + + w2 * x[i, i2, :] \ + + w3 * x[i, i3, :] + return output + + +class TestThreeInterpOp(unittest.TestCase): + def test_check_output(self): + input_shape = [8, 21, 29] + input_type = 'float32' + weight_shape = [8, 37, 3] + weight_type = 'float32' + + x = fluid.layers.data( + name='x', shape=input_shape, dtype=input_type, append_batch_size=False) + weight = fluid.layers.data( + name='weight', shape=weight_shape, dtype=weight_type, append_batch_size=False) + idx = fluid.layers.data( + name='idx', shape=weight_shape, dtype="int32", append_batch_size=False) + y = pointnet_lib.three_interp(x, weight, idx) + + x_np = np.random.random(input_shape).astype(input_type) + weight_np = np.random.random(weight_shape).astype(weight_type) + idx_np = np.random.uniform(0, input_shape[1], weight_shape).astype("int32") + out_np = three_interp_np(x_np, weight_np, idx_np) + + place = fluid.CUDAPlace(0) + exe = fluid.Executor(place) + outs = exe.run(feed={'x': x_np, 'weight': weight_np, 'idx': idx_np}, fetch_list=[y]) + + self.assertTrue(np.allclose(outs[0], out_np)) + + +if __name__ == "__main__": + unittest.main() diff --git a/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_three_nn_op.py b/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_three_nn_op.py new file mode 100644 index 0000000000000000000000000000000000000000..c6468e8b8cf881e3bbd1a16b1f8a6896fce07333 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/ext_op/tests/test_three_nn_op.py @@ -0,0 +1,79 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
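+# three_nn semantics (mirrored by three_nn_np below): for every point in `x`,
+# search `known` and return the squared distances (clamped to at least `eps`)
+# and the indices of its three nearest neighbors, sorted from closest to
+# farthest. pointnet_fp_module later takes the square root of these distances
+# to build inverse-distance interpolation weights.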
+ +from __future__ import print_function + +import unittest +import numpy as np +import paddle.fluid as fluid +import pointnet_lib + + +def three_nn_np(x, known, eps=1e-10): + distance = np.ones_like(x).astype('float32') * 1e40 + idx = np.zeros_like(x).astype('int32') + + b, n, _ = x.shape + m = known.shape[1] + for i in range(b): + for j in range(n): + for k in range(m): + sub = x[i, j, :] - known[i, k, :] + d = float(np.sum(sub * sub)) + valid_d = max(d, eps) + if d < distance[i, j, 0]: + distance[i, j, 2] = distance[i, j, 1] + idx[i, j, 2] = idx[i, j, 1] + distance[i, j, 1] = distance[i, j, 0] + idx[i, j, 1] = idx[i, j, 0] + distance[i, j, 0] = valid_d + idx[i, j, 0] = k + elif d < distance[i, j, 1]: + distance[i, j, 2] = distance[i, j, 1] + idx[i, j, 2] = idx[i, j, 1] + distance[i, j, 1] = valid_d + idx[i, j, 1] = k + elif d < distance[i, j, 2]: + distance[i, j, 2] = valid_d + idx[i, j, 2] = k + return distance, idx + + +class TestThreeNNOp(unittest.TestCase): + def test_check_output(self): + input_shape = [16, 32, 3] + known_shape = [16, 8, 3] + input_type = 'float32' + eps = 1e-10 + + x = fluid.layers.data( + name='x', shape=input_shape, dtype=input_type, append_batch_size=False) + known = fluid.layers.data( + name='known', shape=known_shape, dtype=input_type, append_batch_size=False) + dist, idx = pointnet_lib.three_nn(x, known, eps) + + x_np = np.random.random(input_shape).astype(input_type) + known_np = np.random.random(known_shape).astype(input_type) + dist_np, idx_np = three_nn_np(x_np, known_np, eps) + + place = fluid.CUDAPlace(0) + exe = fluid.Executor(place) + outs = exe.run(feed={'x': x_np, 'known': known_np}, fetch_list=[dist, idx]) + + self.assertTrue(np.allclose(outs[0], dist_np)) + self.assertTrue(np.allclose(outs[1], idx_np)) + + +if __name__ == "__main__": + unittest.main() diff --git a/PaddleCV/Paddle3D/PointNet++/image/pointnet2.jpg b/PaddleCV/Paddle3D/PointNet++/image/pointnet2.jpg new file mode 100644 index 0000000000000000000000000000000000000000..5d3b0f4ec8f614332ea258e7bd8d64d261264a92 Binary files /dev/null and b/PaddleCV/Paddle3D/PointNet++/image/pointnet2.jpg differ diff --git a/PaddleCV/Paddle3D/PointNet++/models/__init__.py b/PaddleCV/Paddle3D/PointNet++/models/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..2435d63de22342ce5f3219f6f0347e0afb27c014 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/models/__init__.py @@ -0,0 +1,27 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +from __future__ import absolute_import + +from . import pointnet2_modules +from . import pointnet2_seg +from . 
import pointnet2_cls + +from .pointnet2_modules import * +from .pointnet2_seg import * +from .pointnet2_cls import * + +__all__ = pointnet2_modules.__all__ +__all__ += pointnet2_seg.__all__ +__all__ += pointnet2_cls.__all__ diff --git a/PaddleCV/Paddle3D/PointNet++/models/pointnet2_cls.py b/PaddleCV/Paddle3D/PointNet++/models/pointnet2_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..778433c17794ebfeb520f59655c2c4772ce23b0a --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/models/pointnet2_cls.py @@ -0,0 +1,151 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +""" +Contains PointNet++ classification models +""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np + +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr +from paddle.fluid.initializer import Constant +from .pointnet2_modules import * + +__all__ = ["PointNet2ClsSSG", "PointNet2ClsMSG"] + + +class PointNet2Cls(object): + def __init__(self, num_classes, num_points, use_xyz=True): + self.num_classes = num_classes + self.num_points = num_points + self.use_xyz = use_xyz + self.out_feature = None + self.pyreader = None + self.model_config() + + def model_config(self): + self.SA_confs = [] + + def build_input(self): + self.xyz = fluid.layers.data(name='xyz', shape=[self.num_points, 3], dtype='float32', lod_level=0) + self.label = fluid.layers.data(name='label', shape=[1], dtype='int64', lod_level=0) + self.pyreader = fluid.io.PyReader( + feed_list=[self.xyz, self.label], + capacity=64, + use_double_buffer=True, + iterable=False) + self.feed_vars = [self.xyz, self.label] + + def build_model(self, bn_momentum=0.99): + self.build_input() + + xyz, feature = self.xyz, None + for i, SA_conf in enumerate(self.SA_confs): + xyz, feature = pointnet_sa_module( + xyz=xyz, + feature=feature, + bn_momentum=bn_momentum, + use_xyz=self.use_xyz, + name="sa_{}".format(i), + **SA_conf) + + out = fluid.layers.squeeze(feature, axes=[-1]) + out = fc_bn(out,out_channels=512, bn=True, bn_momentum=bn_momentum, name="fc_1") + out = fluid.layers.dropout(out, 0.5, dropout_implementation="upscale_in_train") + out = fc_bn(out,out_channels=256, bn=True, bn_momentum=bn_momentum, name="fc_2") + out = fluid.layers.dropout(out, 0.5, dropout_implementation="upscale_in_train") + out = fc_bn(out,out_channels=self.num_classes, act=None, name="fc_3") + pred = fluid.layers.softmax(out) + + # calc loss + self.loss = fluid.layers.cross_entropy(pred, self.label) + self.loss = fluid.layers.reduce_mean(self.loss) + + # calc acc + pred = fluid.layers.reshape(pred, shape=[-1, self.num_classes]) + label = fluid.layers.reshape(self.label, shape=[-1, 1]) + self.acc1 = fluid.layers.accuracy(pred, label, k=1) + + def get_feeds(self): + return self.feed_vars + + def get_outputs(self): + return {"loss": self.loss, "accuracy": self.acc1} + + def get_pyreader(self): + return self.pyreader + + +class 
PointNet2ClsSSG(PointNet2Cls): + def __init__(self, num_classes, num_points, use_xyz=True): + super(PointNet2ClsSSG, self).__init__(num_classes, num_points, use_xyz) + + def model_config(self): + self.SA_confs = [ + { + "npoint": 512, + "radiuss": [0.2], + "nsamples": [64], + "mlps": [[64, 64, 128]], + }, + { + "npoint": 128, + "radiuss": [0.4], + "nsamples": [64], + "mlps": [[128, 128, 256]], + }, + { + "npoint":None, + "radiuss": [None], + "nsamples":[None], + "mlps": [[256, 512, 1024]], + }, + ] + + +class PointNet2ClsMSG(PointNet2Cls): + def __init__(self, num_classes, num_points, use_xyz=True): + super(PointNet2ClsMSG, self).__init__(num_classes, num_points, use_xyz) + + def model_config(self): + self.SA_confs = [ + { + "npoint": 512, + "radiuss": [0.1, 0.2, 0.4], + "nsamples": [16, 32, 128], + "mlps": [[32, 32, 64], + [64, 64, 128], + [64,96,128]], + }, + { + "npoint": 128, + "radiuss": [0.2, 0.4, 0.8], + "nsamples": [32, 64, 128], + "mlps": [[64, 64, 128], + [128, 128, 256], + [128,128,256]], + }, + { + "npoint":None, + "radiuss": [None], + "nsamples":[None], + "mlps": [[256, 512, 1024]], + }, + ] + + diff --git a/PaddleCV/Paddle3D/PointNet++/models/pointnet2_modules.py b/PaddleCV/Paddle3D/PointNet++/models/pointnet2_modules.py new file mode 100644 index 0000000000000000000000000000000000000000..08cc15ae1ea9730da670738c79c14b80c5557361 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/models/pointnet2_modules.py @@ -0,0 +1,219 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +""" +Contains PointNet++ utility functions. 
+""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np + +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr +from paddle.fluid.initializer import Constant +from ext_op import * + +__all__ = ["conv_bn", "pointnet_sa_module", "pointnet_fp_module","fc_bn"] + + +def query_and_group(xyz, new_xyz, radius, nsample, features=None, use_xyz=True): + """ + Perform query_ball and group_points + + Args: + xyz (Variable): xyz coordiantes features with shape [B, N, 3] + new_xyz (Variable): centriods features with shape [B, npoint, 3] + radius (float32): radius of ball + nsample (int32): maximum number of gather features + features (Variable): features with shape [B, N, C] + use_xyz (bool): whether use xyz coordiantes features + + Returns: + out (Variable): features with shape [B, npoint, nsample, C + 3] + """ + idx = query_ball(xyz, new_xyz, radius, nsample) + idx.stop_gradient = True + xyz = fluid.layers.transpose(xyz,perm=[0, 2, 1]) + grouped_xyz = group_points(xyz, idx) + expand_new_xyz = fluid.layers.unsqueeze(fluid.layers.transpose(new_xyz, perm=[0, 2, 1]), axes=[-1]) + expand_new_xyz = fluid.layers.expand(expand_new_xyz, [1, 1, 1, grouped_xyz.shape[3]]) + grouped_xyz -= expand_new_xyz + + if features is not None: + grouped_features = group_points(features, idx) + return fluid.layers.concat([grouped_xyz, grouped_features], axis=1) \ + if use_xyz else grouped_features + else: + assert use_xyz, "use_xyz should be True when features is None" + return grouped_xyz + + +def group_all(xyz, features=None, use_xyz=True): + """ + Group all xyz and features when npoint is None + See query_and_group + """ + xyz = fluid.layers.transpose(xyz,perm=[0,2,1]) + grouped_xyz = fluid.layers.unsqueeze(xyz, axes=[2]) + if features is not None: + grouped_features = fluid.layers.unsqueeze(features, axes=[2]) + return fluid.layers.concat([grouped_xyz, grouped_features], axis=1) if use_xyz else grouped_features + else: + return grouped_xyz + + +def conv_bn(input, out_channels, bn=True, bn_momentum=0.99, act='relu', name=None): + param_attr = ParamAttr(name='{}_conv_weight'.format(name),) + bias_attr = ParamAttr(name='{}_conv_bias'.format(name)) \ + if not bn else False + out = fluid.layers.conv2d(input, + num_filters=out_channels, + filter_size=1, + stride=1, + padding=0, + dilation=1, + param_attr=param_attr, + bias_attr=bias_attr, + act=act if not bn else None) + if bn: + bn_name = name + "_bn" + out = fluid.layers.batch_norm(out, + act=act, + momentum=bn_momentum, + param_attr=ParamAttr(name=bn_name + "_scale"), + bias_attr=ParamAttr(name=bn_name + "_offset"), + moving_mean_name=bn_name + '_mean', + moving_variance_name=bn_name + '_var') + + return out + +def fc_bn(input, out_channels, bn=False, bn_momentum=0.99, act='relu', name=None): + param_attr = ParamAttr(name='{}_fc_weight'.format(name)) + if not bn: + bias_attr = ParamAttr(name='{}_fc_bias'.format(name)) + else: + bias_attr = False + out = fluid.layers.fc(input, + size=out_channels, + param_attr=param_attr, + bias_attr=bias_attr) + if bn: + bn_name = name + "_bn" + out = fluid.layers.batch_norm(out, + momentum=bn_momentum, + param_attr=ParamAttr(name=bn_name + "_scale"), + bias_attr=ParamAttr(name=bn_name + "_offset"), + moving_mean_name=bn_name + '_mean', + moving_variance_name=bn_name + '_var') + if act == "relu": + out = fluid.layers.relu(out) + return out + +def MLP(features, out_channels_list, bn=True, bn_momentum=0.99, act='relu', name=None): + out 
= features + for i, out_channels in enumerate(out_channels_list): + out = conv_bn(out, out_channels, bn=bn, act=act, bn_momentum=bn_momentum, name=name + "_{}".format(i)) + return out + + +def pointnet_sa_module(xyz, + npoint=None, + radiuss=[], + nsamples=[], + mlps=[], + feature=None, + bn=True, + bn_momentum=0.99, + use_xyz=True, + name=None): + """ + PointNet MSG(Multi-Scale Group) Set Abstraction Module. + Call with radiuss, nsamples, mlps as single element list for + SSG(Single-Scale Group). + + Args: + xyz (Variable): xyz coordiantes features with shape [B, N, 3] + radiuss ([float32]): list of radius of ball + nsamples ([int32]): list of maximum number of gather features + mlps ([[int32]]): list of out_channels_list + feature (Variable): features with shape [B, C, N] + bn (bool): whether perform batch norm after conv2d + bn_momentum (float): momentum of batch norm + use_xyz (bool): whether use xyz coordiantes features + + Returns: + new_xyz (Variable): centriods features with shape [B, npoint, 3] + out (Variable): features with shape [B, npoint, \sum_i{mlps[i][-1]}] + """ + assert len(radiuss) == len(nsamples) == len(mlps), \ + "radiuss, nsamples, mlps length should be same" + + farthest_idx = farthest_point_sampling(xyz, npoint) + farthest_idx.stop_gradient = True + new_xyz = gather_point(xyz, farthest_idx) if npoint is not None else None + + outs = [] + for i, (radius, nsample, mlp) in enumerate(zip(radiuss, nsamples, mlps)): + out = query_and_group(xyz, new_xyz, radius, nsample, feature, use_xyz) if npoint is not None else group_all(xyz, feature, use_xyz) + out = MLP(out, mlp, bn=bn, bn_momentum=bn_momentum, name=name + '_mlp{}'.format(i)) + out = fluid.layers.pool2d(out, pool_size=[1, out.shape[3]], pool_type='max') + out = fluid.layers.squeeze(out, axes=[-1]) + outs.append(out) + out = fluid.layers.concat(outs, axis=1) + + return (new_xyz, out) + + +def pointnet_fp_module(unknown, known, unknown_feats, known_feats, mlp, bn=True, bn_momentum=0.99, name=None): + """ + PointNet Feature Propagation Module + + Args: + unknown (Variable): unknown xyz coordiantes features with shape [B, N, 3] + known (Variable): known xyz coordiantes features with shape [B, M, 3] + unknown_feats (Variable): unknown features with shape [B, N, C1] to be propagated to + known_feats (Variable): known features with shape [B, M, C2] to be propagated from + mlp ([int32]): out_channels_list + bn (bool): whether perform batch norm after conv2d + bn_momentum (float): momentum of batch norm + + Returns: + new_features (Variable): new features with shape [B, N, mlp[-1]] + """ + if known is None: + raise NotImplementedError("Not implement known as None currently.") + else: + dist, idx = three_nn(unknown, known, eps=0) + dist.stop_gradient = True + idx.stop_gradient = True + dist = fluid.layers.sqrt(dist) + ones = fluid.layers.fill_constant_batch_size_like(dist, dist.shape, dist.dtype, 1) + dist_recip = ones / (dist + 1e-8); # 1.0 / dist + norm = fluid.layers.reduce_sum(dist_recip, dim=-1, keep_dim=True) + weight = dist_recip / norm + weight.stop_gradient = True + interp_feats = three_interp(known_feats, weight, idx) + + new_features = interp_feats if unknown_feats is None else \ + fluid.layers.concat([interp_feats, unknown_feats], axis=-1) + new_features = fluid.layers.transpose(new_features, perm=[0, 2, 1]) + new_features = fluid.layers.unsqueeze(new_features, axes=[-1]) + new_features = MLP(new_features, mlp, bn=bn, bn_momentum=bn_momentum, name=name + '_mlp') + new_features = fluid.layers.squeeze(new_features, 
axes=[-1]) + new_features = fluid.layers.transpose(new_features, perm=[0, 2, 1]) + + return new_features + diff --git a/PaddleCV/Paddle3D/PointNet++/models/pointnet2_seg.py b/PaddleCV/Paddle3D/PointNet++/models/pointnet2_seg.py new file mode 100644 index 0000000000000000000000000000000000000000..04d6d73e2b6d066aa940ab176698af3738b4de94 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/models/pointnet2_seg.py @@ -0,0 +1,188 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +""" +Contains PointNet++ SSG/MSG semantic segmentation models +""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np + +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr +from paddle.fluid.initializer import Constant +from .pointnet2_modules import * + +__all__ = ["PointNet2SemSegSSG", "PointNet2SemSegMSG"] + + +class PointNet2SemSeg(object): + def __init__(self, num_classes, num_points, use_xyz=True): + self.num_classes = num_classes + self.num_points = num_points + self.use_xyz = use_xyz + self.feed_vars = [] + self.out_feature = None + self.pyreader = None + self.model_config() + + def model_config(self): + self.SA_confs = [] + self.FP_confs = [] + + def build_input(self): + self.xyz = fluid.layers.data(name='xyz', shape=[self.num_points, 3], dtype='float32', lod_level=0) + self.feature = fluid.layers.data(name='feature', shape=[self.num_points, 6], dtype='float32', lod_level=0) + self.label = fluid.layers.data(name='label', shape=[self.num_points, 1], dtype='int64', lod_level=0) + self.pyreader = fluid.io.PyReader( + feed_list=[self.xyz, self.feature, self.label], + capacity=64, + use_double_buffer=True, + iterable=False) + self.feed_vars = [self.xyz, self.feature, self.label] + + def build_model(self, bn_momentum=0.99): + self.build_input() + + xyzs, features = [self.xyz], [self.feature] + xyzi, featurei = xyzs[-1], fluid.layers.transpose(self.feature, perm=[0, 2, 1]) + for i, SA_conf in enumerate(self.SA_confs): + xyzi, featurei = pointnet_sa_module( + xyz=xyzi, + feature=featurei, + bn_momentum=bn_momentum, + use_xyz=self.use_xyz, + name="sa_{}".format(i), + **SA_conf) + xyzs.append(xyzi) + features.append(fluid.layers.transpose(featurei, perm=[0, 2, 1])) + for i in range(-1, -(len(self.FP_confs) + 1), -1): + features[i - 1] = pointnet_fp_module( + unknown=xyzs[i - 1], + known=xyzs[i], + unknown_feats=features[i - 1], + known_feats=features[i], + bn_momentum=bn_momentum, + name="fp_{}".format(i+len(self.FP_confs)), + **self.FP_confs[i]) + + out = fluid.layers.transpose(features[0], perm=[0, 2, 1]) + out = fluid.layers.unsqueeze(out, axes=[-1]) + out = conv_bn(out, out_channels=128, bn=True, bn_momentum=bn_momentum, name="output_1") + out = fluid.layers.dropout(out, 0.5, dropout_implementation="upscale_in_train") + out = conv_bn(out, out_channels=self.num_classes, bn=False, act=None, name="output_2") + out = fluid.layers.squeeze(out, axes=[-1]) + out = 
fluid.layers.transpose(out, perm=[0, 2, 1]) + pred = fluid.layers.softmax(out) + + # calc loss + self.loss = fluid.layers.cross_entropy(pred, self.label) + self.loss = fluid.layers.reduce_mean(self.loss) + + # calc acc + pred = fluid.layers.reshape(pred, shape=[-1, self.num_classes]) + label = fluid.layers.reshape(self.label, shape=[-1, 1]) + self.acc1 = fluid.layers.accuracy(pred, label, k=1) + + def get_feeds(self): + return self.feed_vars + + def get_outputs(self): + return {"loss": self.loss, "accuracy": self.acc1} + + def get_pyreader(self): + return self.pyreader + + +class PointNet2SemSegSSG(PointNet2SemSeg): + def __init__(self, num_classes, use_xyz=True): + super(PointNet2SemSegSSG, self).__init__(num_classes, use_xyz) + + def model_config(self): + self.SA_confs = [ + { + "npoint": 1024, + "radiuss": [0.1], + "nsamples": [32], + "mlps": [[32, 32, 64]], + }, + { + "npoint": 256, + "radiuss": [0.2], + "nsamples": [32], + "mlps": [[64, 64, 128]], + }, + { + "npoint": 64, + "radiuss": [0.4], + "nsamples": [32], + "mlps": [[128, 128, 256]], + }, + { + "npoint": 16, + "radiuss": [0.8], + "nsamples": [32], + "mlps": [[256, 256, 512]], + }, + ] + + self.FP_confs = [ + {"mlp": [128, 128, 128]}, + {"mlp": [256, 128]}, + {"mlp": [256, 256]}, + {"mlp": [256, 256]}, + ] + + +class PointNet2SemSegMSG(PointNet2SemSeg): + def __init__(self, num_classes, use_xyz=True): + super(PointNet2SemSegMSG, self).__init__(num_classes, use_xyz) + + def model_config(self): + self.SA_confs = [ + { + "npoint": 1024, + "radiuss": [0.05, 0.1], + "nsamples": [16, 32], + "mlps": [[16, 16, 32], [32, 32, 64]], + }, + { + "npoint": 256, + "radiuss": [0.1, 0.2], + "nsamples": [16, 32], + "mlps": [[64, 64, 128], [64, 96, 128]], + }, + { + "npoint": 64, + "radiuss": [0.2, 0.4], + "nsamples": [16, 32], + "mlps": [[128, 196, 256], [128, 196, 256]], + }, + { + "npoint": 16, + "radiuss": [0.4, 0.8], + "nsamples": [16, 32], + "mlps": [[256, 256, 512], [256, 384, 512]], + }, + ] + + self.FP_confs = [ + {"mlp": [128, 128]}, + {"mlp": [256, 256]}, + {"mlp": [512, 512]}, + {"mlp": [512, 512]}, + ] + diff --git a/PaddleCV/Paddle3D/PointNet++/scripts/eval_cls.sh b/PaddleCV/Paddle3D/PointNet++/scripts/eval_cls.sh new file mode 100644 index 0000000000000000000000000000000000000000..b8ddac6e94dad30794ddb2e0d21f1afb7cf385bd --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/scripts/eval_cls.sh @@ -0,0 +1,4 @@ +export CUDA_VISIBLE_DEVICES=0 + +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'` +python eval_cls.py --model=MSG --weights=checkpoints/200 diff --git a/PaddleCV/Paddle3D/PointNet++/scripts/eval_seg.sh b/PaddleCV/Paddle3D/PointNet++/scripts/eval_seg.sh new file mode 100644 index 0000000000000000000000000000000000000000..3fb7583cc7f54fcedacebcf96df1a61aa121fc54 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/scripts/eval_seg.sh @@ -0,0 +1,4 @@ +export CUDA_VISIBLE_DEVICES=0 + +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'` +python eval_seg.py --model=MSG --weights=checkpoints/200 diff --git a/PaddleCV/Paddle3D/PointNet++/scripts/train_cls.sh b/PaddleCV/Paddle3D/PointNet++/scripts/train_cls.sh new file mode 100644 index 0000000000000000000000000000000000000000..fdcd8d42b6570c463808d5e96b6344bb892211e2 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/scripts/train_cls.sh @@ -0,0 +1,4 @@ +export CUDA_VISIBLE_DEVICES=0 + +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'` 
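+# LD_LIBRARY_PATH is extended with Paddle's lib directory above so that the
+# compiled custom op library (pointnet2_lib.so) can be loaded when launching
+# single-GPU training of the MSG classification model below.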
+python train_cls.py --model=MSG --batch_size=16 --num_points=4096 --epoch=200 diff --git a/PaddleCV/Paddle3D/PointNet++/scripts/train_seg.sh b/PaddleCV/Paddle3D/PointNet++/scripts/train_seg.sh new file mode 100644 index 0000000000000000000000000000000000000000..2b2055056c2eb821f685edb330bf56cdbe19713f --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/scripts/train_seg.sh @@ -0,0 +1,4 @@ +export CUDA_VISIBLE_DEVICES=0 + +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'` +python train_seg.py --model=MSG --batch_size=32 --num_points=4096 --epoch=201 diff --git a/PaddleCV/Paddle3D/PointNet++/train_cls.py b/PaddleCV/Paddle3D/PointNet++/train_cls.py new file mode 100644 index 0000000000000000000000000000000000000000..f9b49b9dceacc48848a9fa9c3570e2fbf8d79a76 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/train_cls.py @@ -0,0 +1,302 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +import os +import sys +import time +import shutil +import argparse +import ast +import logging +import numpy as np +import paddle.fluid as fluid +import paddle.fluid.framework as framework + +from models import * +from data.modelnet40_reader import ModelNet40ClsReader +from data.data_utils import * +from utils import * + +logging.root.handlers = [] +FORMAT = '%(asctime)s-%(levelname)s: %(message)s' +logging.basicConfig(level=logging.INFO, format=FORMAT, stream=sys.stdout) +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser("PointNet++ classification train script") + parser.add_argument( + '--model', + type=str, + default='MSG', + help='SSG or MSG model to train, default MSG') + parser.add_argument( + '--use_gpu', + type=ast.literal_eval, + default=True, + help='default use gpu.') + parser.add_argument( + '--batch_size', + type=int, + default=16, + help='training batch size, default 16') + parser.add_argument( + '--num_points', + type=int, + default=4096, + help='number of points in a sample, default: 4096') + parser.add_argument( + '--num_classes', + type=int, + default=40, + help='number of classes in dataset, default: 40') + parser.add_argument( + '--lr', + type=float, + default=0.01, + help='initial learning rate, default 0.01') + parser.add_argument( + '--lr_decay', + type=float, + default=0.7, + help='learning rate decay gamma, default 0.5') + parser.add_argument( + '--bn_momentum', + type=float, + default=0.99, + help='initial batch norm momentum, default 0.99') + parser.add_argument( + '--decay_steps', + type=int, + default=12500, + help='learning rate and batch norm momentum decay steps, default 12500') + parser.add_argument( + '--weight_decay', + type=float, + default=1e-5, + help='L2 regularization weight decay coeff, default 1e-5.') + parser.add_argument( + '--epoch', + type=int, + default=201, + help='epoch number. 
default 201.') + parser.add_argument( + '--data_dir', + type=str, + default='dataset/ModelNet40/modelnet40_ply_hdf5_2048', + help='dataset directory') + parser.add_argument( + '--save_dir', + type=str, + default='checkpoints_cls', + help='directory name to save train snapshoot') + parser.add_argument( + '--resume', + type=str, + default=None, + help='path to resume training based on previous checkpoints. ' + 'None for not resuming any checkpoints.') + parser.add_argument( + '--log_interval', + type=int, + default=1, + help='mini-batch interval for logging.') + parser.add_argument( + '--enable_ce', + action='store_true', + help='The flag indicating whether to run the task ' + 'for continuous evaluation.') + args = parser.parse_args() + return args + + +def train(): + args = parse_args() + print_arguments(args) + # check whether the installed paddle is compiled with GPU + check_gpu(args.use_gpu) + + if not os.path.isdir(args.save_dir): + os.makedirs(args.save_dir) + + assert args.model in ['MSG', 'SSG'], \ + "--model can only be 'MSG' or 'SSG'" + + # build model + if args.enable_ce: + SEED = 102 + fluid.default_main_program().random_seed = SEED + framework.default_startup_program().random_seed = SEED + + startup = fluid.Program() + train_prog = fluid.Program() + with fluid.program_guard(train_prog, startup): + with fluid.unique_name.guard(): + train_model = PointNet2ClsMSG(args.num_classes, args.num_points) \ + if args.model == "MSG" else \ + PointNet2ClsSSG(args.num_classes, args.num_points) + train_model.build_model(bn_momentum=args.bn_momentum) + train_feeds = train_model.get_feeds() + train_pyreader = train_model.get_pyreader() + train_outputs = train_model.get_outputs() + train_loss = train_outputs['loss'] + lr = fluid.layers.exponential_decay( + learning_rate=args.lr, + decay_steps=args.decay_steps, + decay_rate=args.lr_decay, + staircase=True) + lr = fluid.layers.clip(lr, 1e-5, args.lr) + optimizer = fluid.optimizer.Adam(learning_rate=lr, + regularization=fluid.regularizer.L2Decay(args.weight_decay)) + optimizer.minimize(train_loss) + train_keys, train_values = parse_outputs(train_outputs) + + test_prog = fluid.Program() + with fluid.program_guard(test_prog, startup): + with fluid.unique_name.guard(): + test_model = PointNet2ClsMSG(args.num_classes, args.num_points) \ + if args.model == "MSG" else \ + PointNet2ClsSSG(args.num_classes, args.num_points) + test_model.build_model() + test_feeds = test_model.get_feeds() + test_outputs = test_model.get_outputs() + test_pyreader = test_model.get_pyreader() + test_prog = test_prog.clone(True) + test_keys, test_values = parse_outputs(test_outputs) + + place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace() + exe = fluid.Executor(place) + exe.run(startup) + + if args.resume: + assert os.path.exists(args.resume), \ + "Given resume weight dir {} not exist.".format(args.resume) + def if_exist(var): + return os.path.exists(os.path.join(args.resume, var.name)) + fluid.io.load_vars( + exe, args.resume, predicate=if_exist, main_program=train_prog) + + build_strategy = fluid.BuildStrategy() + build_strategy.memory_optimize = False + build_strategy.enable_inplace = False + build_strategy.fuse_all_optimizer_ops = False + train_compile_prog = fluid.compiler.CompiledProgram( + train_prog).with_data_parallel(loss_name=train_loss.name, + build_strategy=build_strategy) + test_compile_prog = fluid.compiler.CompiledProgram(test_prog) + + def save_model(exe, prog, path): + if os.path.isdir(path): + shutil.rmtree(path) + logger.info("Save model to 
{}".format(path)) + fluid.io.save_persistables(exe, path, prog) + + # get reader + trans_list = [ + PointcloudScale(), + PointcloudRotate(), + PointcloudRotatePerturbation(), + PointcloudTranslate(), + PointcloudJitter(), + PointcloudRandomInputDropout(), + ] + modelnet_reader = ModelNet40ClsReader(args.data_dir, mode='train', transforms=trans_list) + train_reader = modelnet_reader.get_reader(args.batch_size, args.num_points) + train_pyreader.decorate_sample_list_generator(train_reader, place) + modelnet_reader = ModelNet40ClsReader(args.data_dir, mode='test', transforms=None) + test_reader = modelnet_reader.get_reader(args.batch_size, args.num_points) + test_pyreader.decorate_sample_list_generator(test_reader, place) + + train_stat = Stat() + test_stat = Stat() + + ce_time = 0 + ce_loss = [] + + for epoch_id in range(args.epoch): + try: + train_pyreader.start() + train_iter = 0 + train_periods = [] + while True: + cur_time = time.time() + train_outs = exe.run(train_compile_prog, fetch_list=train_values + [lr.name]) + period = time.time() - cur_time + train_periods.append(period) + train_stat.update(train_keys, train_outs[:-1]) + if train_iter % args.log_interval == 0: + log_str = "" + for name, values in zip(train_keys + ['learning_rate'], train_outs): + log_str += "{}: {:.5f}, ".format(name, np.mean(values)) + if name == 'loss': + ce_loss.append(np.mean(values)) + logger.info("[TRAIN] Epoch {}, batch {}: {}time: {:.2f}".format(epoch_id, train_iter, log_str, period)) + train_iter += 1 + except fluid.core.EOFException: + logger.info("[TRAIN] Epoch {} finished, {}average time: {:.2f}".format(epoch_id, train_stat.get_mean_log(), np.mean(train_periods[1:]))) + ce_time = np.mean(train_periods[1:]) + save_model(exe, train_prog, os.path.join(args.save_dir, str(epoch_id))) + + # evaluation + if not args.enable_ce: + try: + test_pyreader.start() + test_iter = 0 + test_periods = [] + while True: + cur_time = time.time() + test_outs = exe.run(test_compile_prog, fetch_list=test_values) + period = time.time() - cur_time + test_periods.append(period) + test_stat.update(test_keys, test_outs) + if test_iter % args.log_interval == 0: + log_str = "" + for name, value in zip(test_keys, test_outs): + log_str += "{}: {:.4f}, ".format(name, np.mean(value)) + logger.info("[TEST] Epoch {}, batch {}: {}time: {:.2f}".format(epoch_id, test_iter, log_str, period)) + test_iter += 1 + except fluid.core.EOFException: + logger.info("[TEST] Epoch {} finished, {}average time: {:.2f}".format(epoch_id, test_stat.get_mean_log(), np.mean(test_periods[1:]))) + finally: + test_pyreader.reset() + test_stat.reset() + test_periods = [] + + finally: + train_pyreader.reset() + train_stat.reset() + train_periods = [] + + # only for ce + if args.enable_ce: + card_num = get_cards() + _loss = 0 + _time = 0 + try: + _time = ce_time + _loss = np.mean(ce_loss[1:]) + except: + print("ce info error") + print("kpis\ttrain_cls_%s_duration_card%s\t%s" % (args.model, card_num, _time)) + print("kpis\ttrain_cls_%s_loss_card%s\t%f" % (args.model, card_num, _loss)) + +def get_cards(): + num = 0 + cards = os.environ.get('CUDA_VISIBLE_DEVICES', '') + if cards != '': + num = len(cards.split(",")) + return num + +if __name__ == "__main__": + train() diff --git a/PaddleCV/Paddle3D/PointNet++/train_seg.py b/PaddleCV/Paddle3D/PointNet++/train_seg.py new file mode 100644 index 0000000000000000000000000000000000000000..11eaabc7b298f70c9b64faa630541ce8d1d89ec6 --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/train_seg.py @@ -0,0 +1,292 @@ +# Copyright (c) 
2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +import os +import sys +import time +import shutil +import argparse +import ast +import logging +import numpy as np +import paddle.fluid as fluid +import paddle.fluid.framework as framework + +from models import * +from data.indoor3d_reader import Indoor3DReader +from utils import * + +logging.root.handlers = [] +FORMAT = '%(asctime)s-%(levelname)s: %(message)s' +logging.basicConfig(level=logging.INFO, format=FORMAT, stream=sys.stdout) +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser("PointNet++ semantic segmentation train script") + parser.add_argument( + '--model', + type=str, + default='MSG', + help='SSG or MSG model to train, default MSG') + parser.add_argument( + '--use_gpu', + type=ast.literal_eval, + default=True, + help='default use gpu.') + parser.add_argument( + '--batch_size', + type=int, + default=32, + help='training batch size, default 32') + parser.add_argument( + '--num_points', + type=int, + default=4096, + help='number of points in a sample, default: 4096') + parser.add_argument( + '--num_classes', + type=int, + default=13, + help='number of classes in dataset, default: 13') + parser.add_argument( + '--lr', + type=float, + default=0.01, + help='initial learning rate, default 0.01') + parser.add_argument( + '--lr_decay', + type=float, + default=0.5, + help='learning rate decay gamma, default 0.5') + parser.add_argument( + '--bn_momentum', + type=float, + default=0.99, + help='initial batch norm momentum, default 0.99') + parser.add_argument( + '--decay_steps', + type=int, + default=6250, + help='learning rate and batch norm momentum decay steps, default 6250') + parser.add_argument( + '--weight_decay', + type=float, + default=0., + help='L2 regularization weight decay coeff, default 0.') + parser.add_argument( + '--epoch', + type=int, + default=201, + help='epoch number. default 201.') + parser.add_argument( + '--data_dir', + type=str, + default='dataset/Indoor3DSemSeg/indoor3d_sem_seg_hdf5_data', + help='dataset directory') + parser.add_argument( + '--save_dir', + type=str, + default='checkpoints_seg', + help='directory name to save train snapshoot') + parser.add_argument( + '--resume', + type=str, + default=None, + help='path to resume training based on previous checkpoints. 
' + 'None for not resuming any checkpoints.') + parser.add_argument( + '--log_interval', + type=int, + default=1, + help='mini-batch interval for logging.') + parser.add_argument( + '--enable_ce', + action='store_true', + help='The flag indicating whether to run the task ' + 'for continuous evaluation.') + args = parser.parse_args() + return args + + +def train(): + args = parse_args() + print_arguments(args) + # check whether the installed paddle is compiled with GPU + check_gpu(args.use_gpu) + + if not os.path.isdir(args.save_dir): + os.makedirs(args.save_dir) + + assert args.model in ['MSG', 'SSG'], \ + "--model can only be 'MSG' or 'SSG'" + + # build model + if args.enable_ce: + SEED = 102 + fluid.default_main_program().random_seed = SEED + framework.default_startup_program().random_seed = SEED + + startup = fluid.Program() + train_prog = fluid.Program() + with fluid.program_guard(train_prog, startup): + with fluid.unique_name.guard(): + train_model = PointNet2SemSegMSG(args.num_classes, args.num_points) \ + if args.model == "MSG" else \ + PointNet2SemSegSSG(args.num_classes, args.num_points) + train_model.build_model(bn_momentum=args.bn_momentum) + train_feeds = train_model.get_feeds() + train_pyreader = train_model.get_pyreader() + train_outputs = train_model.get_outputs() + train_loss = train_outputs['loss'] + lr = fluid.layers.exponential_decay( + learning_rate=args.lr, + decay_steps=args.decay_steps, + decay_rate=args.lr_decay, + staircase=True) + lr = fluid.layers.clip(lr, 1e-5, args.lr) + optimizer = fluid.optimizer.Adam(learning_rate=lr, + regularization=fluid.regularizer.L2Decay(args.weight_decay)) + optimizer.minimize(train_loss) + train_keys, train_values = parse_outputs(train_outputs) + + test_prog = fluid.Program() + with fluid.program_guard(test_prog, startup): + with fluid.unique_name.guard(): + test_model = PointNet2SemSegMSG(args.num_classes, args.num_points) \ + if args.model == "MSG" else \ + PointNet2SemSegSSG(args.num_classes, args.num_points) + test_model.build_model() + test_feeds = test_model.get_feeds() + test_outputs = test_model.get_outputs() + test_pyreader = test_model.get_pyreader() + test_prog = test_prog.clone(True) + test_keys, test_values = parse_outputs(test_outputs) + + place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace() + exe = fluid.Executor(place) + exe.run(startup) + + if args.resume: + assert os.path.exists(args.resume), \ + "Given resume weight dir {} not exist.".format(args.resume) + def if_exist(var): + return os.path.exists(os.path.join(args.resume, var.name)) + fluid.io.load_vars( + exe, args.resume, predicate=if_exist, main_program=train_prog) + + build_strategy = fluid.BuildStrategy() + build_strategy.memory_optimize = False + build_strategy.enable_inplace = False + build_strategy.fuse_all_optimizer_ops = False + train_compile_prog = fluid.compiler.CompiledProgram( + train_prog).with_data_parallel(loss_name=train_loss.name, + build_strategy=build_strategy) + test_compile_prog = fluid.compiler.CompiledProgram(test_prog) + + def save_model(exe, prog, path): + if os.path.isdir(path): + shutil.rmtree(path) + logger.info("Save model to {}".format(path)) + fluid.io.save_persistables(exe, path, prog) + + # get reader + indoor_reader = Indoor3DReader(args.data_dir) + train_reader = indoor_reader.get_reader(args.batch_size, args.num_points, mode='train') + test_reader = indoor_reader.get_reader(args.batch_size, args.num_points, mode='test') + train_pyreader.decorate_sample_list_generator(train_reader, place) + 
test_pyreader.decorate_sample_list_generator(test_reader, place) + + train_stat = Stat() + test_stat = Stat() + + ce_time = 0 + ce_loss = [] + + for epoch_id in range(args.epoch): + try: + train_pyreader.start() + train_iter = 0 + train_periods = [] + while True: + cur_time = time.time() + train_outs = exe.run(train_compile_prog, fetch_list=train_values + [lr.name]) + period = time.time() - cur_time + train_periods.append(period) + train_stat.update(train_keys, train_outs[:-1]) + if train_iter % args.log_interval == 0: + log_str = "" + for name, values in zip(train_keys + ['learning_rate'], train_outs): + log_str += "{}: {:.5f}, ".format(name, np.mean(values)) + if name == 'loss': + ce_loss.append(np.mean(values)) + logger.info("[TRAIN] Epoch {}, batch {}: {}time: {:.2f}".format(epoch_id, train_iter, log_str, period)) + train_iter += 1 + except fluid.core.EOFException: + logger.info("[TRAIN] Epoch {} finished, {}average time: {:.2f}".format(epoch_id, train_stat.get_mean_log(), np.mean(train_periods[1:]))) + ce_time = np.mean(train_periods[1:]) + save_model(exe, train_prog, os.path.join(args.save_dir, str(epoch_id))) + + # evaluation + if not args.enable_ce: + try: + test_pyreader.start() + test_iter = 0 + test_periods = [] + while True: + cur_time = time.time() + test_outs = exe.run(test_compile_prog, fetch_list=test_values) + period = time.time() - cur_time + test_periods.append(period) + test_stat.update(test_keys, test_outs) + if test_iter % args.log_interval == 0: + log_str = "" + for name, value in zip(test_keys, test_outs): + log_str += "{}: {:.4f}, ".format(name, np.mean(value)) + logger.info("[TEST] Epoch {}, batch {}: {}time: {:.2f}".format(epoch_id, test_iter, log_str, period)) + test_iter += 1 + except fluid.core.EOFException: + logger.info("[TEST] Epoch {} finished, {}average time: {:.2f}".format(epoch_id, test_stat.get_mean_log(), np.mean(test_periods[1:]))) + finally: + test_pyreader.reset() + test_stat.reset() + test_periods = [] + + finally: + train_pyreader.reset() + train_stat.reset() + train_periods = [] + + # only for ce + if args.enable_ce: + card_num = get_cards() + _loss = 0 + _time = 0 + try: + _time = ce_time + _loss = np.mean(ce_loss[1:]) + except: + print("ce info error") + print("kpis\ttrain_seg_%s_duration_card%s\t%s" % (args.model, card_num, _time)) + print("kpis\ttrain_seg_%s_loss_card%s\t%f" % (args.model, card_num, _loss)) + +def get_cards(): + num = 0 + cards = os.environ.get('CUDA_VISIBLE_DEVICES', '') + if cards != '': + num = len(cards.split(",")) + return num + +if __name__ == "__main__": + train() diff --git a/PaddleCV/Paddle3D/PointNet++/utils.py b/PaddleCV/Paddle3D/PointNet++/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..2629bebf2869ab8316fe8cada38c26f198ce9dcf --- /dev/null +++ b/PaddleCV/Paddle3D/PointNet++/utils.py @@ -0,0 +1,98 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +""" +Contains common utility functions. 
+""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import sys +import six +import logging +import numpy as np +import paddle.fluid as fluid + +__all__ = ["check_gpu", "print_arguments", "parse_outputs", "Stat"] + +logger = logging.getLogger(__name__) + + +def check_gpu(use_gpu): + """ + Log error and exit when set use_gpu=True in paddlepaddle + cpu version. + """ + err = "Config use_gpu cannot be set as True while you are " \ + "using paddlepaddle cpu version ! \nPlease try: \n" \ + "\t1. Install paddlepaddle-gpu to run model on GPU \n" \ + "\t2. Set --use_gpu=False to run model on CPU" + + try: + if use_gpu and not fluid.is_compiled_with_cuda(): + logger.error(err) + sys.exit(1) + except Exception as e: + pass + + +def print_arguments(args): + """Print argparse's arguments. + + Usage: + + .. code-block:: python + + parser = argparse.ArgumentParser() + parser.add_argument("name", default="Jonh", type=str, help="User name.") + args = parser.parse_args() + print_arguments(args) + + :param args: Input argparse.Namespace for printing. + :type args: argparse.Namespace + """ + logger.info("----------- Configuration Arguments -----------") + for arg, value in sorted(six.iteritems(vars(args))): + logger.info("%s: %s" % (arg, value)) + logger.info("------------------------------------------------") + + +def parse_outputs(outputs): + keys, values = [], [] + for k, v in outputs.items(): + keys.append(k) + v.persistable = True + values.append(v.name) + return keys, values + + +class Stat(object): + def __init__(self): + self.stats = {} + + def update(self, keys, values): + for k, v in zip(keys, values): + if k not in self.stats: + self.stats[k] = [] + self.stats[k].append(v) + + def reset(self): + self.stats = {} + + def get_mean_log(self): + log = "" + for k, v in self.stats.items(): + log += "avg_{}: {:.4f}, ".format(k, np.mean(v)) + return log diff --git a/PaddleCV/Paddle3D/PointRCNN/.gitignore b/PaddleCV/Paddle3D/PointRCNN/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..9ea6e75c687e4ac93fa06d18bd0d1444e5d3b054 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/.gitignore @@ -0,0 +1,14 @@ +*log* +checkpoints* +build +output +result_dir +pp_pointrcnn* +data/gt_database +utils/pts_utils/dist +utils/pts_utils/build +utils/pts_utils/pts_utils.egg-info +utils/cyops/*.c +utils/cyops/*.so +ext_op/src/*.o +ext_op/src/*.so diff --git a/PaddleCV/Paddle3D/PointRCNN/README.md b/PaddleCV/Paddle3D/PointRCNN/README.md new file mode 100644 index 0000000000000000000000000000000000000000..5b2d82920bf702146589879291a2de9ececf1371 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/README.md @@ -0,0 +1,339 @@ +# PointRCNN 3D目标检测模型 + +--- +## 内容 + +- [简介](#简介) +- [快速开始](#快速开始) +- [参考文献](#参考文献) +- [版本更新](#版本更新) + +## 简介 + +[PointRCNN](https://arxiv.org/abs/1812.04244) 是 Shaoshuai Shi, Xiaogang Wang, Hongsheng Li. 等人提出的,是第一个仅使用原始点云的2-stage(两阶段)3D目标检测器,第一阶段将 Pointnet++ with MSG(Multi-scale Grouping)作为backbone,直接将原始点云数据分割为前景点和背景点,并利用前景点生成bounding box。第二阶段在标准坐标系中对生成对bounding box进一步筛选和优化。该模型还提出了基于bin的方式,把回归问题转化为分类问题,验证了在三维边界框回归中的有效性。PointRCNN在KITTI数据集上进行评估,论文发布时在KITTI 3D目标检测排行榜上获得了最佳性能。 + +网络结构如下所示: + +

+<p align="center">
+用于点云的目标检测器 PointRCNN 网络结构
+</p>
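+基于 bin 的回归方式可以用下面的简化示例说明(仅为示意代码,`encode_offset_to_bin` 是为说明而假设的函数,并非本仓库的实现;实际的 bin 划分由 `cfgs/default.yml` 中的 `LOC_SCOPE`、`LOC_BIN_SIZE` 等配置决定):先把中心点偏移量的搜索范围划分为若干个 bin,分类预测偏移量落在哪个 bin,再回归该 bin 内的归一化残差。
+
+```
+import numpy as np
+
+def encode_offset_to_bin(offset, loc_scope=3.0, loc_bin_size=0.5):
+    """把一维中心偏移量编码为 (bin 序号, bin 内归一化残差),仅为示意。"""
+    offset = np.clip(offset, -loc_scope, loc_scope - 1e-3)
+    bin_idx = int(np.floor((offset + loc_scope) / loc_bin_size))
+    bin_center = bin_idx * loc_bin_size + loc_bin_size / 2.0 - loc_scope
+    residual = (offset - bin_center) / loc_bin_size
+    return bin_idx, residual
+
+# 偏移量 1.3m 被编码为 bin 序号 8(共 12 个 bin)和约 0.1 的归一化残差
+print(encode_offset_to_bin(1.3))
+```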

+ +**注意:** PointRCNN 模型构建依赖于自定义的 C++ 算子,目前仅支持GPU设备在Linux/Unix系统上进行编译,本模型**不能运行在Windows系统或CPU设备上** + + +## 快速开始 + +### 安装 + +**安装 [PaddlePaddle](https://github.com/PaddlePaddle/Paddle):** + +在当前目录下运行样例代码需要 PaddelPaddle Fluid [develop每日版本](https://www.paddlepaddle.org.cn/install/doc/tables#多版本whl包列表-dev-11)或使用PaddlePaddle [develop分支](https://github.com/PaddlePaddle/Paddle/tree/develop)源码编译安装. + +为了使自定义算子与paddle版本兼容,建议您**优先使用源码编译paddle**,源码编译方式请参考[编译安装](https://www.paddlepaddle.org.cn/install/doc/source/ubuntu) + +**安装PointRCNN:** + +1. 下载[PaddlePaddle/models](https://github.com/PaddlePaddle/models)模型库 + +通过如下命令下载Paddle models模型库: + +``` +git clone https://github.com/PaddlePaddle/models +``` + +2. 在`PaddleCV/Paddle3D/PointRCNN`目录下下载[pybind11](https://github.com/pybind/pybind11) + +`pts_utils`依赖`pybind11`编译,须在`PaddleCV/Paddle3D/PointRCNN`目录下下载`pybind11`子库,可使用如下命令下载: + +``` +cd PaddleCV/Paddle3D/PointRCNN +git clone https://github.com/pybind/pybind11 +``` + +3. 安装python依赖库 + +使用如下命令安装python依赖库: + +``` +pip install -r requirement.txt +``` + +**注意:** KITTI mAP评估工具只能在python 3.6及以上版本中使用,且python3环境中需要安装`scikit-image`,`Numba`,`fire`等子库。 +`requirement.txt`中的`scikit-image`,`Numba`,`fire`即为KITTI mAP评估工具所需依赖库。 + +4. 编译安装`pts_utils`, `kitti_utils`, `roipool3d_utils`, `iou_utils` 等模块 + +使用如下命令编译安装`pts_utils`, `kitti_utils`, `roipool3d_utils`, `iou_utils` 等模块: +``` +sh build_and_install.sh +``` + +### 编译自定义OP + +请确认Paddle版本为PaddelPaddle Fluid develop每日版本或基于Paddle develop分支源码编译安装,**推荐使用源码编译安装的方式**。 + +自定义OP编译方式如下: + + 进入 `ext_op/src` 目录,执行编译脚本 + ``` + cd ext_op/src + sh make.sh + ``` + + 成功编译后,`ext_op/src` 目录下将会生成 `pointnet_lib.so` + + 执行下列操作,确保自定义算子编译正确: + + ``` + # 设置动态库的路径到 LD_LIBRARY_PATH 中 + export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'` + + # 回到 ext_op 目录,添加 PYTHONPATH + cd .. + export PYTHONPATH=$PYTHONPATH:`pwd` + + # 运行单测 + python tests/test_farthest_point_sampling_op.py + python tests/test_gather_point_op.py + python tests/test_group_points_op.py + python tests/test_query_ball_op.py + python tests/test_three_interp_op.py + python tests/test_three_nn_op.py + ``` + 单测运行成功会输出提示信息,如下所示: + + ``` + . + ---------------------------------------------------------------------- + Ran 1 test in 13.205s + + OK + ``` + +**说明:** 自定义OP编译与[PointNet++](../PointNet++)下一致,更多关于自定义OP的编译说明,请参考[自定义OP编译](../PointNet++/ext_op/README.md) + +### 数据准备 + +**KITTI 3D object detection 数据集:** + +PointRCNN使用数据集[KITTI 3D object detection](http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d) +上进行训练。 + +可通过如下方式下载数据集: + +``` +cd data/KITTI/object +sh download.sh +``` + +此处的images只用做可视化,训练过程中使用[road planes](https://drive.google.com/file/d/1d5mq0RXRnvHPVeKx6Q612z0YRO1t2wAp/view?usp=sharing)数据来做训练时的数据增强, +请下载并解压至`./data/KITTI/object/training`目录下。 + +数据目录结构如下所示: + +``` +PointRCNN +├── data +│ ├── KITTI +│ │ ├── ImageSets +│ │ ├── object +│ │ │ ├──training +│ │ │ │ ├──calib & velodyne & label_2 & image_2 & planes +│ │ │ ├──testing +│ │ │ │ ├──calib & velodyne & image_2 + +``` + + +### 训练 + +**PointRCNN模型:** + +可通过如下方式启动 PointRCNN模型的训练: + +1. 指定单卡训练并设置动态库路径 + +``` +# 指定单卡GPU训练 +export CUDA_VISIBLE_DEVICES=0 + +# 设置动态库的路径到 LD_LIBRARY_PATH 中 +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'` +``` + +2. 生成Groud Truth采样数据,命令如下: + +``` +python tools/generate_gt_database.py --class_name 'Car' --split train +``` + +3. 
训练 RPN 模型 + +``` +python train.py --cfg=./cfgs/default.yml \ + --train_mode=rpn \ + --batch_size=16 \ + --epoch=200 \ + --save_dir=checkpoints +``` + +RPN训练checkpoints默认保存在`checkpoints/rpn`目录,也可以通过`--save_dir`来指定。 + +4. 生成增强离线场景数据并保存RPN模型的输出特征和ROI,用于离线训练 RCNN 模型 + +生成增强的离线场景数据命令如下: + +``` +python tools/generate_aug_scene.py --class_name 'Car' --split train --aug_times 4 +``` + +保存RPN模型对离线增强数据的输出特征和ROI,可以通过参数`--ckpt_dir`来指定RPN训练最终权重保存路径,RPN权重默认保存在`checkpoints/rpn`目录。 +保存输出特征和ROI时须指定`TEST.SPLIT`为`train_aug`,指定`TEST.RPN_POST_NMS_TOP_N`为`300`, `TEST.RPN_NMS_THRESH`为`0.85`。 +通过`--output_dir`指定保存输出特征和ROI的路径,默认保存到`./output`目录。 + +``` +python eval.py --cfg=cfgs/default.yml \ + --eval_mode=rpn \ + --ckpt_dir=./checkpoints/rpn/199 \ + --save_rpn_feature \ + --output_dir=output \ + --set TEST.SPLIT train_aug TEST.RPN_POST_NMS_TOP_N 300 TEST.RPN_NMS_THRESH 0.85 +``` + +`--output_dir`下保存的数据目录结构如下: + +``` +output +├── detections +│ ├── data # 保存ROI数据 +│ │ ├── 000000.txt +│ │ ├── 000003.txt +│ │ ├── ... +├── features # 保存输出特征 +│ ├── 000000_intensity.npy +│ ├── 000000.npy +│ ├── 000000_rawscore.npy +│ ├── 000000_seg.npy +│ ├── 000000_xyz.npy +│ ├── ... +├── seg_result # 保存语义分割结果 +│ ├── 000000.npy +│ ├── 000003.npy +│ ├── ... +``` + +5. 离线训练RCNN,并且通过参数`--rcnn_training_roi_dir` and `--rcnn_training_feature_dir` 来指定 RPN 模型保存的输出特征和ROI路径。 + +``` +python train.py --cfg=./cfgs/default.yml \ + --train_mode=rcnn_offline \ + --batch_size=4 \ + --epoch=30 \ + --save_dir=checkpoints \ + --rcnn_training_roi_dir=output/detections/data \ + --rcnn_training_feature_dir=output/features \ + --set TRAIN.SPLIT train_aug +``` + +RCNN模型训练权重默认保存在`checkpoints/rcnn`目录下,可通过`--save_dir`参数指定。 + +**注意**: 最好的模型是通过保存RPN模型输出特征和ROI并离线数据增强的方式训练RCNN模型得出的,目前默认仅支持这种方式。 + + +### 模型评估 + +**PointRCNN模型:** + +可通过如下方式启动 PointRCNN 模型的评估: + +1. 指定单卡训练并设置动态库路径 + +``` +# 指定单卡GPU训练 +export CUDA_VISIBLE_DEVICES=0 + +# 设置动态库的路径到 LD_LIBRARY_PATH 中 +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'` + +``` + +2. 保存RPN模型对评估数据的输出特征和ROI + +保存RPN模型对评估数据的输出特征和ROI命令如下,可以通过参数`--ckpt_dir`来指定RPN训练最终权重保存路径,RPN权重默认保存在`checkpoints/rpn`目录。 +通过`--output_dir`指定保存输出特征和ROI的路径,默认保存到`./output`目录。 + +``` +python eval.py --cfg=cfgs/default.yml \ + --eval_mode=rpn \ + --ckpt_dir=./checkpoints/rpn/199 \ + --save_rpn_feature \ + --output_dir=output/val +``` + +保存RPN模型对评估数据的输出特征和ROI保存的目录结构与上述保存离线增强数据保存目录结构一致。 + +3. 评估离线RCNN模型 + +评估离线RCNN模型命令如下: + +``` +python eval.py --cfg=cfgs/default.yml \ + --eval_mode=rcnn_offline \ + --ckpt_dir=./checkpoints/rcnn_offline/29 \ + --rcnn_eval_roi_dir=output/val/detections/data \ + --rcnn_eval_feature_dir=output/val/features \ + --save_result +``` + +最终目标检测结果文件保存在`./result_dir`目录下`final_result`文件夹下,同时可通过`--save_result`开启保存`roi_output`和`refine_output`结果文件。 +`result_dir`目录结构如下: + +``` +result_dir +├── final_result +│ ├── data # 最终检测结果 +│ │ ├── 000001.txt +│ │ ├── 000002.txt +│ │ ├── ... +├── roi_output +│ ├── data # RCNN模型输出检测ROI结果 +│ │ ├── 000001.txt +│ │ ├── 000002.txt +│ │ ├── ... +├── refine_output +│ ├── data # 解码后的检测结果 +│ │ ├── 000001.txt +│ │ ├── 000002.txt +│ │ ├── ... +``` + +4. 
使用KITTI mAP工具获得评估结果 + +若在评估过程中使用的python版本为3.6及以上版本,则程序会自动运行KITTI mAP评估,若使用python版本低于3.6, +由于KITTI mAP仅支持python 3.6及以上版本,须使用对应python版本通过如下命令进行评估: + +``` +python3 tools/kitti_eval.py +``` + +使用训练最终权重[RPN模型](https://paddlemodels.bj.bcebos.com/Paddle3D/pointrcnn_rpn.tar)和[RCNN模型](https://paddlemodels.bj.bcebos.com/Paddle3D/pointrcnn_rcnn_offline.tar)评估结果如下所示: + +| Car AP@ | 0.70(easy) | 0.70(moderate) | 0.70(hard) | +| :------- | :--------: | :------------: | :--------: | +| bbox AP: | 90.20 | 88.85 | 88.59 | +| bev AP: | 89.50 | 86.97 | 85.58 | +| 3d AP: | 86.66 | 76.65 | 75.90 | +| aos AP: | 90.10 | 88.64 | 88.26 | + + +## 参考文献 + +- [PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud](https://arxiv.org/abs/1812.04244), Shaoshuai Shi, Xiaogang Wang, Hongsheng Li. +- [PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space](https://arxiv.org/abs/1706.02413), Charles R. Qi, Li Yi, Hao Su, Leonidas J. Guibas. +- [PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation](https://www.semanticscholar.org/paper/PointNet%3A-Deep-Learning-on-Point-Sets-for-3D-and-Qi-Su/d997beefc0922d97202789d2ac307c55c2c52fba), Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas. + +## 版本更新 + +- 11/2019, 新增 PointRCNN模型。 + diff --git a/PaddleCV/Paddle3D/PointRCNN/build_and_install.sh b/PaddleCV/Paddle3D/PointRCNN/build_and_install.sh new file mode 100644 index 0000000000000000000000000000000000000000..83aaef84704445cf9c7bf3e87cc453e0daa708cd --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/build_and_install.sh @@ -0,0 +1,7 @@ +# compile cyops +python utils/cyops/setup.py develop + +# compile and install pts_utils +cd utils/pts_utils +python setup.py install +cd ../.. diff --git a/PaddleCV/Paddle3D/PointRCNN/cfgs/default.yml b/PaddleCV/Paddle3D/PointRCNN/cfgs/default.yml new file mode 100644 index 0000000000000000000000000000000000000000..33dc45086ca48128174fc341e7f9fdee9374d53e --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/cfgs/default.yml @@ -0,0 +1,167 @@ +# This config is based on https://github.com/sshaoshuai/PointRCNN/blob/master/tools/cfgs/default.yaml +CLASSES: Car + +INCLUDE_SIMILAR_TYPE: True + +# config of augmentation +AUG_DATA: True +AUG_METHOD_LIST: ['rotation', 'scaling', 'flip'] +AUG_METHOD_PROB: [1.0, 1.0, 0.5] +AUG_ROT_RANGE: 18 + +GT_AUG_ENABLED: True +GT_EXTRA_NUM: 15 +GT_AUG_RAND_NUM: True +GT_AUG_APPLY_PROB: 1.0 +GT_AUG_HARD_RATIO: 0.6 + +PC_REDUCE_BY_RANGE: True +PC_AREA_SCOPE: [[-40, 40], [-1, 3], [0, 70.4]] # x, y, z scope in rect camera coords +CLS_MEAN_SIZE: [[1.52563191462, 1.62856739989, 3.88311640418]] + + +# 1. 
config of rpn network +RPN: + ENABLED: True + FIXED: False + + # config of input + USE_INTENSITY: False + + # config of bin-based loss + LOC_XZ_FINE: True + LOC_SCOPE: 3.0 + LOC_BIN_SIZE: 0.5 + NUM_HEAD_BIN: 12 + + # config of network structure + BACKBONE: pointnet2_msg + USE_BN: True + NUM_POINTS: 16384 + + SA_CONFIG: + NPOINTS: [4096, 1024, 256, 64] + RADIUS: [[0.1, 0.5], [0.5, 1.0], [1.0, 2.0], [2.0, 4.0]] + NSAMPLE: [[16, 32], [16, 32], [16, 32], [16, 32]] + MLPS: [[[16, 16, 32], [32, 32, 64]], + [[64, 64, 128], [64, 96, 128]], + [[128, 196, 256], [128, 196, 256]], + [[256, 256, 512], [256, 384, 512]]] + FP_MLPS: [[128, 128], [256, 256], [512, 512], [512, 512]] + CLS_FC: [128] + REG_FC: [128] + DP_RATIO: 0.5 + + # config of training + LOSS_CLS: SigmoidFocalLoss + FG_WEIGHT: 15 + FOCAL_ALPHA: [0.25, 0.75] + FOCAL_GAMMA: 2.0 + REG_LOSS_WEIGHT: [1.0, 1.0, 1.0, 1.0] + LOSS_WEIGHT: [1.0, 1.0] + NMS_TYPE: normal + + # config of testing + SCORE_THRESH: 0.3 + +# 2. config of rcnn network +RCNN: + ENABLED: True + + # config of input + ROI_SAMPLE_JIT: False + REG_AUG_METHOD: multiple # multiple, single, normal + ROI_FG_AUG_TIMES: 10 + + USE_RPN_FEATURES: True + USE_MASK: True + MASK_TYPE: seg + USE_INTENSITY: False + USE_DEPTH: True + USE_SEG_SCORE: False + + POOL_EXTRA_WIDTH: 1.0 + + # config of bin-based loss + LOC_SCOPE: 1.5 + LOC_BIN_SIZE: 0.5 + NUM_HEAD_BIN: 9 + LOC_Y_BY_BIN: False + LOC_Y_SCOPE: 0.5 + LOC_Y_BIN_SIZE: 0.25 + SIZE_RES_ON_ROI: False + + # config of network structure + USE_BN: False + DP_RATIO: 0.0 + + BACKBONE: pointnet # pointnet + XYZ_UP_LAYER: [128, 128] + + NUM_POINTS: 512 + SA_CONFIG: + NPOINTS: [128, 32, -1] + RADIUS: [0.2, 0.4, 100] + NSAMPLE: [64, 64, 64] + MLPS: [[128, 128, 128], + [128, 128, 256], + [256, 256, 512]] + CLS_FC: [256, 256] + REG_FC: [256, 256] + + # config of training + LOSS_CLS: BinaryCrossEntropy + FOCAL_ALPHA: [0.25, 0.75] + FOCAL_GAMMA: 2.0 + CLS_WEIGHT: [1.0, 1.0, 1.0] + CLS_FG_THRESH: 0.6 + CLS_BG_THRESH: 0.45 + CLS_BG_THRESH_LO: 0.05 + REG_FG_THRESH: 0.55 + FG_RATIO: 0.5 + ROI_PER_IMAGE: 64 + HARD_BG_RATIO: 0.8 + + # config of testing + SCORE_THRESH: 0.3 + NMS_THRESH: 0.1 + +# general training config +TRAIN: + SPLIT: train + VAL_SPLIT: smallval + + LR: 0.002 + LR_CLIP: 0.00001 + LR_DECAY: 0.5 + DECAY_STEP_LIST: [100, 150, 180, 200] + LR_WARMUP: True + WARMUP_MIN: 0.0002 + WARMUP_EPOCH: 1 + + BN_MOMENTUM: 0.1 + BN_DECAY: 0.5 + BNM_CLIP: 0.01 + BN_DECAY_STEP_LIST: [1000] + + OPTIMIZER: adam # adam, adam_onecycle + WEIGHT_DECAY: 0.001 # L2 regularization + MOMENTUM: 0.9 + + MOMS: [0.95, 0.85] + DIV_FACTOR: 10.0 + PCT_START: 0.4 + + GRAD_NORM_CLIP: 1.0 + + RPN_PRE_NMS_TOP_N: 9000 + RPN_POST_NMS_TOP_N: 512 + RPN_NMS_THRESH: 0.85 + RPN_DISTANCE_BASED_PROPOSE: True + +TEST: + SPLIT: val + RPN_PRE_NMS_TOP_N: 9000 + RPN_POST_NMS_TOP_N: 100 + RPN_NMS_THRESH: 0.8 + RPN_DISTANCE_BASED_PROPOSE: True diff --git a/PaddleCV/Paddle3D/PointRCNN/data/KITTI/object/download.sh b/PaddleCV/Paddle3D/PointRCNN/data/KITTI/object/download.sh new file mode 100644 index 0000000000000000000000000000000000000000..1f5818d38323c5cc7349022ba82d2a55315a59a7 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/data/KITTI/object/download.sh @@ -0,0 +1,25 @@ +DIR="$( cd "$(dirname "$0")" ; pwd -P )" +cd "$DIR" + +echo "Downloading https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_velodyne.zip" +wget https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_velodyne.zip +echo "https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_image_2.zip" +wget 
https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_image_2.zip +echo "https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_calib.zip" +wget https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_calib.zip +echo "https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_label_2.zip" +wget https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_label_2.zip + +echo "Decompressing data_object_velodyne.zip" +unzip data_object_velodyne.zip +echo "Decompressing data_object_image_2.zip" +unzip "data_object_image_2.zip" +echo "Decompressing data_object_calib.zip" +unzip data_object_calib.zip +echo "Decompressing data_object_label_2.zip" +unzip data_object_label_2.zip + +echo "Download KITTI ImageSets" +wget https://paddlemodels.bj.bcebos.com/Paddle3D/pointrcnn_kitti_imagesets.tar +tar xf pointrcnn_kitti_imagesets.tar +mv ImageSets .. diff --git a/PaddleCV/Paddle3D/PointRCNN/data/__init__.py b/PaddleCV/Paddle3D/PointRCNN/data/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..46a4f6ee220f10f50a182f4a2ed510b0551f64a8 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/data/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. diff --git a/PaddleCV/Paddle3D/PointRCNN/data/kitti_dataset.py b/PaddleCV/Paddle3D/PointRCNN/data/kitti_dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..0765a5045f6e330646fde26fe391eb313d022124 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/data/kitti_dataset.py @@ -0,0 +1,77 @@ +""" +This code is based on https://github.com/sshaoshuai/PointRCNN/blob/master/lib/datasets/kitti_dataset.py +""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os +import cv2 +import numpy as np +import utils.calibration as calibration +from utils.object3d import get_objects_from_label +from PIL import Image + +__all__ = ["KittiDataset"] + + +class KittiDataset(object): + def __init__(self, data_dir, split='train'): + assert split in ['train', 'train_aug', 'val', 'test'], "unknown split {}".format(split) + self.split = split + self.is_test = self.split == 'test' + self.imageset_dir = os.path.join(data_dir, 'KITTI', 'object', 'testing' if self.is_test else 'training') + + split_dir = os.path.join(data_dir, 'KITTI', 'ImageSets', split + '.txt') + self.image_idx_list = [x.strip() for x in open(split_dir).readlines()] + self.num_sample = self.image_idx_list.__len__() + + self.image_dir = os.path.join(self.imageset_dir, 'image_2') + self.lidar_dir = os.path.join(self.imageset_dir, 'velodyne') + self.calib_dir = os.path.join(self.imageset_dir, 'calib') + self.label_dir = os.path.join(self.imageset_dir, 'label_2') + self.plane_dir = os.path.join(self.imageset_dir, 'planes') + + def get_image(self, idx): + img_file = os.path.join(self.image_dir, '%06d.png' % idx) + assert os.path.exists(img_file) + return cv2.imread(img_file) # (H, W, 3) BGR mode + + def get_image_shape(self, 
idx): + img_file = os.path.join(self.image_dir, '%06d.png' % idx) + assert os.path.exists(img_file) + im = Image.open(img_file) + width, height = im.size + return height, width, 3 + + def get_lidar(self, idx): + lidar_file = os.path.join(self.lidar_dir, '%06d.bin' % idx) + assert os.path.exists(lidar_file) + return np.fromfile(lidar_file, dtype=np.float32).reshape(-1, 4) + + def get_calib(self, idx): + calib_file = os.path.join(self.calib_dir, '%06d.txt' % idx) + assert os.path.exists(calib_file) + return calibration.Calibration(calib_file) + + def get_label(self, idx): + label_file = os.path.join(self.label_dir, '%06d.txt' % idx) + assert os.path.exists(label_file) + # return kitti_utils.get_objects_from_label(label_file) + return get_objects_from_label(label_file) + + def get_road_plane(self, idx): + plane_file = os.path.join(self.plane_dir, '%06d.txt' % idx) + with open(plane_file, 'r') as f: + lines = f.readlines() + lines = [float(i) for i in lines[3].split()] + plane = np.asarray(lines) + + # Ensure normal is always facing up, this is in the rectified camera coordinate + if plane[1] > 0: + plane = -plane + + norm = np.linalg.norm(plane[0:3]) + plane = plane / norm + return plane diff --git a/PaddleCV/Paddle3D/PointRCNN/data/kitti_rcnn_reader.py b/PaddleCV/Paddle3D/PointRCNN/data/kitti_rcnn_reader.py new file mode 100644 index 0000000000000000000000000000000000000000..57367d2c6ff5abd3c21a15cbe7c0a90ba9e64e62 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/data/kitti_rcnn_reader.py @@ -0,0 +1,1193 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
+""" +This code is based on https://github.com/sshaoshuai/PointRCNN/blob/master/lib/datasets/kitti_rcnn_dataset.py +""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os +import signal +import logging +import multiprocessing +import numpy as np +import scipy +from scipy.spatial import Delaunay +try: + import cPickle as pickle +except: + import pickle + +import pts_utils +import utils.cyops.kitti_utils as kitti_utils +import utils.cyops.roipool3d_utils as roipool3d_utils +from data.kitti_dataset import KittiDataset +from utils.config import cfg +from collections import OrderedDict + +__all__ = ["KittiRCNNReader"] + +logger = logging.getLogger(__name__) + + +def has_empty(data): + for d in data: + if isinstance(d, np.ndarray) and len(d) == 0: + return True + return False + + +def in_hull(p, hull): + """ + :param p: (N, K) test points + :param hull: (M, K) M corners of a box + :return (N) bool + """ + try: + if not isinstance(hull, Delaunay): + hull = Delaunay(hull) + flag = hull.find_simplex(p) >= 0 + except scipy.spatial.qhull.QhullError: + logger.debug('Warning: not a hull.') + flag = np.zeros(p.shape[0], dtype=np.bool) + + return flag + + +class KittiRCNNReader(KittiDataset): + def __init__(self, data_dir, npoints=16384, split='train', classes='Car', mode='TRAIN', + random_select=True, rcnn_training_roi_dir=None, rcnn_training_feature_dir=None, + rcnn_eval_roi_dir=None, rcnn_eval_feature_dir=None, gt_database_dir=None): + super(KittiRCNNReader, self).__init__(data_dir=data_dir, split=split) + if classes == 'Car': + self.classes = ('Background', 'Car') + aug_scene_data_dir = os.path.join(data_dir, 'KITTI', 'aug_scene') + elif classes == 'People': + self.classes = ('Background', 'Pedestrian', 'Cyclist') + elif classes == 'Pedestrian': + self.classes = ('Background', 'Pedestrian') + aug_scene_data_dir = os.path.join(data_dir, 'KITTI', 'aug_scene_ped') + elif classes == 'Cyclist': + self.classes = ('Background', 'Cyclist') + aug_scene_data_dir = os.path.join(data_dir, 'KITTI', 'aug_scene_cyclist') + else: + assert False, "Invalid classes: %s" % classes + + self.num_classes = len(self.classes) + + self.npoints = npoints + self.sample_id_list = [] + self.random_select = random_select + + if split == 'train_aug': + self.aug_label_dir = os.path.join(aug_scene_data_dir, 'training', 'aug_label') + self.aug_pts_dir = os.path.join(aug_scene_data_dir, 'training', 'rectified_data') + else: + self.aug_label_dir = os.path.join(aug_scene_data_dir, 'training', 'aug_label') + self.aug_pts_dir = os.path.join(aug_scene_data_dir, 'training', 'rectified_data') + + # for rcnn training + self.rcnn_training_bbox_list = [] + self.rpn_feature_list = {} + self.pos_bbox_list = [] + self.neg_bbox_list = [] + self.far_neg_bbox_list = [] + self.rcnn_eval_roi_dir = rcnn_eval_roi_dir + self.rcnn_eval_feature_dir = rcnn_eval_feature_dir + self.rcnn_training_roi_dir = rcnn_training_roi_dir + self.rcnn_training_feature_dir = rcnn_training_feature_dir + + self.gt_database = None + + if not self.random_select: + logger.warning('random select is False') + + assert mode in ['TRAIN', 'EVAL', 'TEST'], 'Invalid mode: %s' % mode + self.mode = mode + + if cfg.RPN.ENABLED: + if gt_database_dir is not None: + self.gt_database = pickle.load(open(gt_database_dir, 'rb')) + + if cfg.GT_AUG_HARD_RATIO > 0: + easy_list, hard_list = [], [] + for k in range(self.gt_database.__len__()): + obj = self.gt_database[k] + if obj['points'].shape[0] > 100: + 
easy_list.append(obj) + else: + hard_list.append(obj) + self.gt_database = [easy_list, hard_list] + logger.info('Loading gt_database(easy(pt_num>100): %d, hard(pt_num<=100): %d) from %s' + % (len(easy_list), len(hard_list), gt_database_dir)) + else: + logger.info('Loading gt_database(%d) from %s' % (len(self.gt_database), gt_database_dir)) + + if mode == 'TRAIN': + self.preprocess_rpn_training_data() + else: + self.sample_id_list = [int(sample_id) for sample_id in self.image_idx_list] + logger.info('Load testing samples from %s' % self.imageset_dir) + logger.info('Done: total test samples %d' % len(self.sample_id_list)) + elif cfg.RCNN.ENABLED: + for idx in range(0, self.num_sample): + sample_id = int(self.image_idx_list[idx]) + obj_list = self.filtrate_objects(self.get_label(sample_id)) + if len(obj_list) == 0: + # logger.info('No gt classes: %06d' % sample_id) + continue + self.sample_id_list.append(sample_id) + + logger.info('Done: filter %s results for rcnn training: %d / %d\n' % + (self.mode, len(self.sample_id_list), len(self.image_idx_list))) + + def preprocess_rpn_training_data(self): + """ + Discard samples which don't have current classes, which will not be used for training. + Valid sample_id is stored in self.sample_id_list + """ + logger.info('Loading %s samples from %s ...' % (self.mode, self.label_dir)) + for idx in range(0, self.num_sample): + sample_id = int(self.image_idx_list[idx]) + obj_list = self.filtrate_objects(self.get_label(sample_id)) + if len(obj_list) == 0: + logger.debug('No gt classes: %06d' % sample_id) + continue + self.sample_id_list.append(sample_id) + + logger.info('Done: filter %s results: %d / %d\n' % (self.mode, len(self.sample_id_list), + len(self.image_idx_list))) + + def get_label(self, idx): + if idx < 10000: + label_file = os.path.join(self.label_dir, '%06d.txt' % idx) + else: + label_file = os.path.join(self.aug_label_dir, '%06d.txt' % idx) + + assert os.path.exists(label_file) + return kitti_utils.get_objects_from_label(label_file) + + def get_image(self, idx): + return super(KittiRCNNReader, self).get_image(idx % 10000) + + def get_image_shape(self, idx): + return super(KittiRCNNReader, self).get_image_shape(idx % 10000) + + def get_calib(self, idx): + return super(KittiRCNNReader, self).get_calib(idx % 10000) + + def get_road_plane(self, idx): + return super(KittiRCNNReader, self).get_road_plane(idx % 10000) + + @staticmethod + def get_rpn_features(rpn_feature_dir, idx): + rpn_feature_file = os.path.join(rpn_feature_dir, '%06d.npy' % idx) + rpn_xyz_file = os.path.join(rpn_feature_dir, '%06d_xyz.npy' % idx) + rpn_intensity_file = os.path.join(rpn_feature_dir, '%06d_intensity.npy' % idx) + if cfg.RCNN.USE_SEG_SCORE: + rpn_seg_file = os.path.join(rpn_feature_dir, '%06d_rawscore.npy' % idx) + rpn_seg_score = np.load(rpn_seg_file).reshape(-1) + rpn_seg_score = torch.sigmoid(torch.from_numpy(rpn_seg_score)).numpy() + else: + rpn_seg_file = os.path.join(rpn_feature_dir, '%06d_seg.npy' % idx) + rpn_seg_score = np.load(rpn_seg_file).reshape(-1) + return np.load(rpn_xyz_file), np.load(rpn_feature_file), np.load(rpn_intensity_file).reshape(-1), rpn_seg_score + + def filtrate_objects(self, obj_list): + """ + Discard objects which are not in self.classes (or its similar classes) + :param obj_list: list + :return: list + """ + type_whitelist = self.classes + if self.mode == 'TRAIN' and cfg.INCLUDE_SIMILAR_TYPE: + type_whitelist = list(self.classes) + if 'Car' in self.classes: + type_whitelist.append('Van') + if 'Pedestrian' in self.classes: # or 'Cyclist' 
in self.classes: + type_whitelist.append('Person_sitting') + + valid_obj_list = [] + for obj in obj_list: + if obj.cls_type not in type_whitelist: # rm Van, 20180928 + continue + if self.mode == 'TRAIN' and cfg.PC_REDUCE_BY_RANGE and (self.check_pc_range(obj.pos) is False): + continue + valid_obj_list.append(obj) + return valid_obj_list + + @staticmethod + def filtrate_dc_objects(obj_list): + valid_obj_list = [] + for obj in obj_list: + if obj.cls_type in ['DontCare']: + continue + valid_obj_list.append(obj) + + return valid_obj_list + + @staticmethod + def check_pc_range(xyz): + """ + :param xyz: [x, y, z] + :return: + """ + x_range, y_range, z_range = cfg.PC_AREA_SCOPE + if (x_range[0] <= xyz[0] <= x_range[1]) and (y_range[0] <= xyz[1] <= y_range[1]) and \ + (z_range[0] <= xyz[2] <= z_range[1]): + return True + return False + + @staticmethod + def get_valid_flag(pts_rect, pts_img, pts_rect_depth, img_shape): + """ + Valid point should be in the image (and in the PC_AREA_SCOPE) + :param pts_rect: + :param pts_img: + :param pts_rect_depth: + :param img_shape: + :return: + """ + val_flag_1 = np.logical_and(pts_img[:, 0] >= 0, pts_img[:, 0] < img_shape[1]) + val_flag_2 = np.logical_and(pts_img[:, 1] >= 0, pts_img[:, 1] < img_shape[0]) + val_flag_merge = np.logical_and(val_flag_1, val_flag_2) + pts_valid_flag = np.logical_and(val_flag_merge, pts_rect_depth >= 0) + + if cfg.PC_REDUCE_BY_RANGE: + x_range, y_range, z_range = cfg.PC_AREA_SCOPE + pts_x, pts_y, pts_z = pts_rect[:, 0], pts_rect[:, 1], pts_rect[:, 2] + range_flag = (pts_x >= x_range[0]) & (pts_x <= x_range[1]) \ + & (pts_y >= y_range[0]) & (pts_y <= y_range[1]) \ + & (pts_z >= z_range[0]) & (pts_z <= z_range[1]) + pts_valid_flag = pts_valid_flag & range_flag + return pts_valid_flag + + def get_rpn_sample(self, index): + sample_id = int(self.sample_id_list[index]) + if sample_id < 10000: + calib = self.get_calib(sample_id) + # img = self.get_image(sample_id) + img_shape = self.get_image_shape(sample_id) + pts_lidar = self.get_lidar(sample_id) + + # get valid point (projected points should be in image) + pts_rect = calib.lidar_to_rect(pts_lidar[:, 0:3]) + pts_intensity = pts_lidar[:, 3] + else: + calib = self.get_calib(sample_id % 10000) + # img = self.get_image(sample_id % 10000) + img_shape = self.get_image_shape(sample_id % 10000) + + pts_file = os.path.join(self.aug_pts_dir, '%06d.bin' % sample_id) + assert os.path.exists(pts_file), '%s' % pts_file + aug_pts = np.fromfile(pts_file, dtype=np.float32).reshape(-1, 4) + pts_rect, pts_intensity = aug_pts[:, 0:3], aug_pts[:, 3] + + pts_img, pts_rect_depth = calib.rect_to_img(pts_rect) + pts_valid_flag = self.get_valid_flag(pts_rect, pts_img, pts_rect_depth, img_shape) + + pts_rect = pts_rect[pts_valid_flag][:, 0:3] + pts_intensity = pts_intensity[pts_valid_flag] + + if cfg.GT_AUG_ENABLED and self.mode == 'TRAIN': + # all labels for checking overlapping + all_gt_obj_list = self.filtrate_dc_objects(self.get_label(sample_id)) + all_gt_boxes3d = kitti_utils.objs_to_boxes3d(all_gt_obj_list) + + gt_aug_flag = False + if np.random.rand() < cfg.GT_AUG_APPLY_PROB: + # augment one scene + gt_aug_flag, pts_rect, pts_intensity, extra_gt_boxes3d, extra_gt_obj_list = \ + self.apply_gt_aug_to_one_scene(sample_id, pts_rect, pts_intensity, all_gt_boxes3d) + + # generate inputs + if self.mode == 'TRAIN' or self.random_select: + if self.npoints < len(pts_rect): + pts_depth = pts_rect[:, 2] + pts_near_flag = pts_depth < 40.0 + far_idxs_choice = np.where(pts_near_flag == 0)[0] + near_idxs = 
np.where(pts_near_flag == 1)[0] + near_idxs_choice = np.random.choice(near_idxs, self.npoints - len(far_idxs_choice), replace=False) + + choice = np.concatenate((near_idxs_choice, far_idxs_choice), axis=0) \ + if len(far_idxs_choice) > 0 else near_idxs_choice + np.random.shuffle(choice) + else: + choice = np.arange(0, len(pts_rect), dtype=np.int32) + if self.npoints > len(pts_rect): + extra_choice = np.random.choice(choice, self.npoints - len(pts_rect), replace=False) + choice = np.concatenate((choice, extra_choice), axis=0) + np.random.shuffle(choice) + + ret_pts_rect = pts_rect[choice, :] + ret_pts_intensity = pts_intensity[choice] - 0.5 # translate intensity to [-0.5, 0.5] + else: + ret_pts_rect = np.zeros((self.npoints, pts_rect.shape[1])).astype(pts_rect.dtype) + num_ = min(self.npoints, pts_rect.shape[0]) + ret_pts_rect[:num_] = pts_rect[:num_] + + ret_pts_intensity = pts_intensity - 0.5 + + pts_features = [ret_pts_intensity.reshape(-1, 1)] + ret_pts_features = np.concatenate(pts_features, axis=1) if pts_features.__len__() > 1 else pts_features[0] + + sample_info = {'sample_id': sample_id, 'random_select': self.random_select} + + if self.mode == 'TEST': + if cfg.RPN.USE_INTENSITY: + pts_input = np.concatenate((ret_pts_rect, ret_pts_features), axis=1) # (N, C) + else: + pts_input = ret_pts_rect + sample_info['pts_input'] = pts_input + sample_info['pts_rect'] = ret_pts_rect + sample_info['pts_features'] = ret_pts_features + return sample_info + + gt_obj_list = self.filtrate_objects(self.get_label(sample_id)) + if cfg.GT_AUG_ENABLED and self.mode == 'TRAIN' and gt_aug_flag: + gt_obj_list.extend(extra_gt_obj_list) + gt_boxes3d = kitti_utils.objs_to_boxes3d(gt_obj_list) + + gt_alpha = np.zeros((gt_obj_list.__len__()), dtype=np.float32) + for k, obj in enumerate(gt_obj_list): + gt_alpha[k] = obj.alpha + + # data augmentation + aug_pts_rect = ret_pts_rect.copy() + aug_gt_boxes3d = gt_boxes3d.copy() + if cfg.AUG_DATA and self.mode == 'TRAIN': + aug_pts_rect, aug_gt_boxes3d, aug_method = self.data_augmentation(aug_pts_rect, aug_gt_boxes3d, gt_alpha, + sample_id) + sample_info['aug_method'] = aug_method + + # prepare input + if cfg.RPN.USE_INTENSITY: + pts_input = np.concatenate((aug_pts_rect, ret_pts_features), axis=1) # (N, C) + else: + pts_input = aug_pts_rect + + if cfg.RPN.FIXED: + sample_info['pts_input'] = pts_input + sample_info['pts_rect'] = aug_pts_rect + sample_info['pts_features'] = ret_pts_features + sample_info['gt_boxes3d'] = aug_gt_boxes3d + return sample_info + + if self.mode == 'EVAL' and aug_gt_boxes3d.shape[0] == 0: + aug_gt_boxes3d = np.zeros((1, aug_gt_boxes3d.shape[1])) + + # generate training labels + rpn_cls_label, rpn_reg_label = self.generate_rpn_training_labels(aug_pts_rect, aug_gt_boxes3d) + sample_info['pts_input'] = pts_input + sample_info['pts_rect'] = aug_pts_rect + sample_info['pts_features'] = ret_pts_features + sample_info['rpn_cls_label'] = rpn_cls_label + sample_info['rpn_reg_label'] = rpn_reg_label + sample_info['gt_boxes3d'] = aug_gt_boxes3d + return sample_info + + def apply_gt_aug_to_one_scene(self, sample_id, pts_rect, pts_intensity, all_gt_boxes3d): + """ + :param pts_rect: (N, 3) + :param all_gt_boxex3d: (M2, 7) + :return: + """ + assert self.gt_database is not None + # extra_gt_num = np.random.randint(10, 15) + # try_times = 50 + if cfg.GT_AUG_RAND_NUM: + extra_gt_num = np.random.randint(10, cfg.GT_EXTRA_NUM) + else: + extra_gt_num = cfg.GT_EXTRA_NUM + try_times = 100 + cnt = 0 + cur_gt_boxes3d = all_gt_boxes3d.copy() + cur_gt_boxes3d[:, 4] += 0.5 
# TODO: consider different objects + cur_gt_boxes3d[:, 5] += 0.5 # enlarge new added box to avoid too nearby boxes + cur_gt_corners = kitti_utils.boxes3d_to_corners3d(cur_gt_boxes3d) + + extra_gt_obj_list = [] + extra_gt_boxes3d_list = [] + new_pts_list, new_pts_intensity_list = [], [] + src_pts_flag = np.ones(pts_rect.shape[0], dtype=np.int32) + + road_plane = self.get_road_plane(sample_id) + a, b, c, d = road_plane + + while try_times > 0: + if cnt > extra_gt_num: + break + + try_times -= 1 + if cfg.GT_AUG_HARD_RATIO > 0: + p = np.random.rand() + if p > cfg.GT_AUG_HARD_RATIO: + # use easy sample + rand_idx = np.random.randint(0, len(self.gt_database[0])) + new_gt_dict = self.gt_database[0][rand_idx] + else: + # use hard sample + rand_idx = np.random.randint(0, len(self.gt_database[1])) + new_gt_dict = self.gt_database[1][rand_idx] + else: + rand_idx = np.random.randint(0, self.gt_database.__len__()) + new_gt_dict = self.gt_database[rand_idx] + + new_gt_box3d = new_gt_dict['gt_box3d'].copy() + new_gt_points = new_gt_dict['points'].copy() + new_gt_intensity = new_gt_dict['intensity'].copy() + new_gt_obj = new_gt_dict['obj'] + center = new_gt_box3d[0:3] + if cfg.PC_REDUCE_BY_RANGE and (self.check_pc_range(center) is False): + continue + + if new_gt_points.__len__() < 5: # too few points + continue + + # put it on the road plane + cur_height = (-d - a * center[0] - c * center[2]) / b + move_height = new_gt_box3d[1] - cur_height + new_gt_box3d[1] -= move_height + new_gt_points[:, 1] -= move_height + new_gt_obj.pos[1] -= move_height + + new_enlarged_box3d = new_gt_box3d.copy() + new_enlarged_box3d[4] += 0.5 + new_enlarged_box3d[5] += 0.5 # enlarge new added box to avoid too nearby boxes + + cnt += 1 + new_corners = kitti_utils.boxes3d_to_corners3d(new_enlarged_box3d.reshape(1, 7)) + iou3d = kitti_utils.get_iou3d(new_corners, cur_gt_corners) + valid_flag = iou3d.max() < 1e-8 + if not valid_flag: + continue + + enlarged_box3d = new_gt_box3d.copy() + enlarged_box3d[3] += 2 # remove the points above and below the object + + boxes_pts_mask_list = pts_utils.pts_in_boxes3d(pts_rect, + enlarged_box3d.reshape(1, 7)) + pt_mask_flag = (boxes_pts_mask_list[0] == 1) + src_pts_flag[pt_mask_flag] = 0 # remove the original points which are inside the new box + + new_pts_list.append(new_gt_points) + new_pts_intensity_list.append(new_gt_intensity) + cur_gt_boxes3d = np.concatenate((cur_gt_boxes3d, new_enlarged_box3d.reshape(1, 7)), axis=0) + cur_gt_corners = np.concatenate((cur_gt_corners, new_corners), axis=0) + extra_gt_boxes3d_list.append(new_gt_box3d.reshape(1, 7)) + extra_gt_obj_list.append(new_gt_obj) + + if new_pts_list.__len__() == 0: + return False, pts_rect, pts_intensity, None, None + + extra_gt_boxes3d = np.concatenate(extra_gt_boxes3d_list, axis=0) + # remove original points and add new points + pts_rect = pts_rect[src_pts_flag == 1] + pts_intensity = pts_intensity[src_pts_flag == 1] + new_pts_rect = np.concatenate(new_pts_list, axis=0) + new_pts_intensity = np.concatenate(new_pts_intensity_list, axis=0) + pts_rect = np.concatenate((pts_rect, new_pts_rect), axis=0) + pts_intensity = np.concatenate((pts_intensity, new_pts_intensity), axis=0) + + return True, pts_rect, pts_intensity, extra_gt_boxes3d, extra_gt_obj_list + + def rotate_box3d_along_y(self, box3d, rot_angle): + old_x, old_z, ry = box3d[0], box3d[2], box3d[6] + old_beta = np.arctan2(old_z, old_x) + alpha = -np.sign(old_beta) * np.pi / 2 + old_beta + ry + box3d = kitti_utils.rotate_pc_along_y(box3d.reshape(1, 7), rot_angle=rot_angle)[0] + 
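        # the rotation changes the viewing direction (beta) of the box center;
        # recompute ry from the preserved observation angle alpha so the box keeps
        # the same appearance relative to the camera ray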
new_x, new_z = box3d[0], box3d[2] + new_beta = np.arctan2(new_z, new_x) + box3d[6] = np.sign(new_beta) * np.pi / 2 + alpha - new_beta + return box3d + + def data_augmentation(self, aug_pts_rect, aug_gt_boxes3d, gt_alpha, sample_id=None, mustaug=False, stage=1): + """ + :param aug_pts_rect: (N, 3) + :param aug_gt_boxes3d: (N, 7) + :param gt_alpha: (N) + :return: + """ + aug_list = cfg.AUG_METHOD_LIST + aug_enable = 1 - np.random.rand(3) + if mustaug is True: + aug_enable[0] = -1 + aug_enable[1] = -1 + aug_method = [] + if 'rotation' in aug_list and aug_enable[0] < cfg.AUG_METHOD_PROB[0]: + angle = np.random.uniform(-np.pi / cfg.AUG_ROT_RANGE, np.pi / cfg.AUG_ROT_RANGE) + aug_pts_rect = kitti_utils.rotate_pc_along_y(aug_pts_rect, rot_angle=angle) + if stage == 1: + # xyz change, hwl unchange + aug_gt_boxes3d = kitti_utils.rotate_pc_along_y(aug_gt_boxes3d, rot_angle=angle) + + # calculate the ry after rotation + x, z = aug_gt_boxes3d[:, 0], aug_gt_boxes3d[:, 2] + beta = np.arctan2(z, x) + new_ry = np.sign(beta) * np.pi / 2 + gt_alpha - beta + aug_gt_boxes3d[:, 6] = new_ry # TODO: not in [-np.pi / 2, np.pi / 2] + elif stage == 2: + # for debug stage-2, this implementation has little float precision difference with the above one + assert aug_gt_boxes3d.shape[0] == 2 + aug_gt_boxes3d[0] = self.rotate_box3d_along_y(aug_gt_boxes3d[0], angle) + aug_gt_boxes3d[1] = self.rotate_box3d_along_y(aug_gt_boxes3d[1], angle) + else: + raise NotImplementedError + + aug_method.append(['rotation', angle]) + + if 'scaling' in aug_list and aug_enable[1] < cfg.AUG_METHOD_PROB[1]: + scale = np.random.uniform(0.95, 1.05) + aug_pts_rect = aug_pts_rect * scale + aug_gt_boxes3d[:, 0:6] = aug_gt_boxes3d[:, 0:6] * scale + aug_method.append(['scaling', scale]) + + if 'flip' in aug_list and aug_enable[2] < cfg.AUG_METHOD_PROB[2]: + # flip horizontal + aug_pts_rect[:, 0] = -aug_pts_rect[:, 0] + aug_gt_boxes3d[:, 0] = -aug_gt_boxes3d[:, 0] + # flip orientation: ry > 0: pi - ry, ry < 0: -pi - ry + if stage == 1: + aug_gt_boxes3d[:, 6] = np.sign(aug_gt_boxes3d[:, 6]) * np.pi - aug_gt_boxes3d[:, 6] + elif stage == 2: + assert aug_gt_boxes3d.shape[0] == 2 + aug_gt_boxes3d[0, 6] = np.sign(aug_gt_boxes3d[0, 6]) * np.pi - aug_gt_boxes3d[0, 6] + aug_gt_boxes3d[1, 6] = np.sign(aug_gt_boxes3d[1, 6]) * np.pi - aug_gt_boxes3d[1, 6] + else: + raise NotImplementedError + + aug_method.append('flip') + + return aug_pts_rect, aug_gt_boxes3d, aug_method + + @staticmethod + def generate_rpn_training_labels(pts_rect, gt_boxes3d): + cls_label = np.zeros((pts_rect.shape[0]), dtype=np.int32) + reg_label = np.zeros((pts_rect.shape[0], 7), dtype=np.float32) # dx, dy, dz, ry, h, w, l + gt_corners = kitti_utils.boxes3d_to_corners3d(gt_boxes3d, rotate=True) + extend_gt_boxes3d = kitti_utils.enlarge_box3d(gt_boxes3d, extra_width=0.2) + extend_gt_corners = kitti_utils.boxes3d_to_corners3d(extend_gt_boxes3d, rotate=True) + for k in range(gt_boxes3d.shape[0]): + box_corners = gt_corners[k] + fg_pt_flag = in_hull(pts_rect, box_corners) + fg_pts_rect = pts_rect[fg_pt_flag] + cls_label[fg_pt_flag] = 1 + + # enlarge the bbox3d, ignore nearby points + extend_box_corners = extend_gt_corners[k] + fg_enlarge_flag = in_hull(pts_rect, extend_box_corners) + ignore_flag = np.logical_xor(fg_pt_flag, fg_enlarge_flag) + cls_label[ignore_flag] = -1 + + # pixel offset of object center + center3d = gt_boxes3d[k][0:3].copy() # (x, y, z) + center3d[1] -= gt_boxes3d[k][3] / 2 + reg_label[fg_pt_flag, 0:3] = center3d - fg_pts_rect # Now y is the true center of 3d box 20180928 + 
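            # per foreground point, reg_label[:, 0:3] is the offset to the height-shifted
            # box center computed above; columns 3:7 below carry h, w, l and ry of the gt box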
+ # size and angle encoding + reg_label[fg_pt_flag, 3] = gt_boxes3d[k][3] # h + reg_label[fg_pt_flag, 4] = gt_boxes3d[k][4] # w + reg_label[fg_pt_flag, 5] = gt_boxes3d[k][5] # l + reg_label[fg_pt_flag, 6] = gt_boxes3d[k][6] # ry + + return cls_label, reg_label + + def get_rcnn_sample_jit(self, index): + sample_id = int(self.sample_id_list[index]) + rpn_xyz, rpn_features, rpn_intensity, seg_mask = \ + self.get_rpn_features(self.rcnn_training_feature_dir, sample_id) + + # load rois and gt_boxes3d for this sample + roi_file = os.path.join(self.rcnn_training_roi_dir, '%06d.txt' % sample_id) + roi_obj_list = kitti_utils.get_objects_from_label(roi_file) + roi_boxes3d = kitti_utils.objs_to_boxes3d(roi_obj_list) + # roi_scores is not used currently + # roi_scores = kitti_utils.objs_to_scores(roi_obj_list) + + gt_obj_list = self.filtrate_objects(self.get_label(sample_id)) + gt_boxes3d = kitti_utils.objs_to_boxes3d(gt_obj_list) + sample_info = OrderedDict() + sample_info["sample_id"] = sample_id + sample_info['rpn_xyz'] = rpn_xyz + sample_info['rpn_features'] = rpn_features + sample_info['rpn_intensity'] = rpn_intensity + sample_info['seg_mask'] = seg_mask + sample_info['roi_boxes3d'] = roi_boxes3d + sample_info['pts_depth'] = np.linalg.norm(rpn_xyz, ord=2, axis=1) + sample_info['gt_boxes3d'] = gt_boxes3d + + return sample_info + + def sample_bg_inds(self, hard_bg_inds, easy_bg_inds, bg_rois_per_this_image): + if hard_bg_inds.size > 0 and easy_bg_inds.size > 0: + hard_bg_rois_num = int(bg_rois_per_this_image * cfg.RCNN.HARD_BG_RATIO) + easy_bg_rois_num = bg_rois_per_this_image - hard_bg_rois_num + + # sampling hard bg + rand_num = np.floor(np.random.rand(hard_bg_rois_num) * hard_bg_inds.size).astype(np.int32) + hard_bg_inds = hard_bg_inds[rand_num] + # sampling easy bg + rand_num = np.floor(np.random.rand(easy_bg_rois_num) * easy_bg_inds.size).astype(np.int32) + easy_bg_inds = easy_bg_inds[rand_num] + + bg_inds = np.concatenate([hard_bg_inds, easy_bg_inds], axis=0) + elif hard_bg_inds.size > 0 and easy_bg_inds.size == 0: + hard_bg_rois_num = bg_rois_per_this_image + # sampling hard bg + rand_num = np.floor(np.random.rand(hard_bg_rois_num) * hard_bg_inds.size).astype(np.int32) + bg_inds = hard_bg_inds[rand_num] + elif hard_bg_inds.size == 0 and easy_bg_inds.size > 0: + easy_bg_rois_num = bg_rois_per_this_image + # sampling easy bg + rand_num = np.floor(np.random.rand(easy_bg_rois_num) * easy_bg_inds.size).astype(np.int32) + bg_inds = easy_bg_inds[rand_num] + else: + raise NotImplementedError + + return bg_inds + + def aug_roi_by_noise_batch(self, roi_boxes3d, gt_boxes3d, aug_times=10): + """ + :param roi_boxes3d: (N, 7) + :param gt_boxes3d: (N, 7) + :return: + """ + iou_of_rois = np.zeros(roi_boxes3d.shape[0], dtype=np.float32) + for k in range(roi_boxes3d.__len__()): + temp_iou = cnt = 0 + roi_box3d = roi_boxes3d[k] + gt_box3d = gt_boxes3d[k] + pos_thresh = min(cfg.RCNN.REG_FG_THRESH, cfg.RCNN.CLS_FG_THRESH) + gt_corners = kitti_utils.boxes3d_to_corners3d(gt_box3d.reshape(1, 7), True) + aug_box3d = roi_box3d + while temp_iou < pos_thresh and cnt < aug_times: + if np.random.rand() < 0.2: + aug_box3d = roi_box3d # p=0.2 to keep the original roi box + else: + aug_box3d = self.random_aug_box3d(roi_box3d) + aug_corners = kitti_utils.boxes3d_to_corners3d(aug_box3d.reshape(1, 7), True) + iou3d = kitti_utils.get_iou3d(aug_corners, gt_corners) + temp_iou = iou3d[0][0] + cnt += 1 + roi_boxes3d[k] = aug_box3d + iou_of_rois[k] = temp_iou + return roi_boxes3d, iou_of_rois + + @staticmethod + def 
canonical_transform_batch(pts_input, roi_boxes3d, gt_boxes3d): + """ + :param pts_input: (N, npoints, 3 + C) + :param roi_boxes3d: (N, 7) + :param gt_boxes3d: (N, 7) + :return: + """ + roi_ry = roi_boxes3d[:, 6] % (2 * np.pi) # 0 ~ 2pi + roi_center = roi_boxes3d[:, 0:3] + # shift to center + pts_input[:, :, [0, 1, 2]] = pts_input[:, :, [0, 1, 2]] - roi_center.reshape(-1, 1, 3) + gt_boxes3d_ct = np.copy(gt_boxes3d) + gt_boxes3d_ct[:, 0:3] = gt_boxes3d_ct[:, 0:3] - roi_center + # rotate to the direction of head + gt_boxes3d_ct = kitti_utils.rotate_pc_along_y_np( + gt_boxes3d_ct.reshape(-1, 1, 7), + roi_ry, + ) + # TODO: check here + gt_boxes3d_ct = gt_boxes3d_ct.reshape(-1,7) + gt_boxes3d_ct[:, 6] = gt_boxes3d_ct[:, 6] - roi_ry + pts_input = kitti_utils.rotate_pc_along_y_np( + pts_input, + roi_ry + ) + return pts_input, gt_boxes3d_ct + + def get_rcnn_training_sample_batch(self, index): + sample_id = int(self.sample_id_list[index]) + rpn_xyz, rpn_features, rpn_intensity, seg_mask = \ + self.get_rpn_features(self.rcnn_training_feature_dir, sample_id) + + # load rois and gt_boxes3d for this sample + roi_file = os.path.join(self.rcnn_training_roi_dir, '%06d.txt' % sample_id) + roi_obj_list = kitti_utils.get_objects_from_label(roi_file) + roi_boxes3d = kitti_utils.objs_to_boxes3d(roi_obj_list) + # roi_scores = kitti_utils.objs_to_scores(roi_obj_list) + + gt_obj_list = self.filtrate_objects(self.get_label(sample_id)) + gt_boxes3d = kitti_utils.objs_to_boxes3d(gt_obj_list) + + # calculate original iou + iou3d = kitti_utils.get_iou3d(kitti_utils.boxes3d_to_corners3d(roi_boxes3d, True), + kitti_utils.boxes3d_to_corners3d(gt_boxes3d, True)) + max_overlaps, gt_assignment = iou3d.max(axis=1), iou3d.argmax(axis=1) + max_iou_of_gt, roi_assignment = iou3d.max(axis=0), iou3d.argmax(axis=0) + roi_assignment = roi_assignment[max_iou_of_gt > 0].reshape(-1) + + # sample fg, easy_bg, hard_bg + fg_rois_per_image = int(np.round(cfg.RCNN.FG_RATIO * cfg.RCNN.ROI_PER_IMAGE)) + fg_thresh = min(cfg.RCNN.REG_FG_THRESH, cfg.RCNN.CLS_FG_THRESH) + fg_inds = np.nonzero(max_overlaps >= fg_thresh)[0] + fg_inds = np.concatenate((fg_inds, roi_assignment), axis=0) # consider the roi which has max_overlaps with gt as fg + + easy_bg_inds = np.nonzero((max_overlaps < cfg.RCNN.CLS_BG_THRESH_LO))[0] + hard_bg_inds = np.nonzero((max_overlaps < cfg.RCNN.CLS_BG_THRESH) & + (max_overlaps >= cfg.RCNN.CLS_BG_THRESH_LO))[0] + + fg_num_rois = fg_inds.size + bg_num_rois = hard_bg_inds.size + easy_bg_inds.size + + if fg_num_rois > 0 and bg_num_rois > 0: + # sampling fg + fg_rois_per_this_image = min(fg_rois_per_image, fg_num_rois) + rand_num = np.random.permutation(fg_num_rois) + fg_inds = fg_inds[rand_num[:fg_rois_per_this_image]] + + # sampling bg + bg_rois_per_this_image = cfg.RCNN.ROI_PER_IMAGE - fg_rois_per_this_image + bg_inds = self.sample_bg_inds(hard_bg_inds, easy_bg_inds, bg_rois_per_this_image) + + elif fg_num_rois > 0 and bg_num_rois == 0: + # sampling fg + rand_num = np.floor(np.random.rand(cfg.RCNN.ROI_PER_IMAGE ) * fg_num_rois) + # rand_num = torch.from_numpy(rand_num).type_as(gt_boxes3d).long() + fg_inds = fg_inds[rand_num] + fg_rois_per_this_image = cfg.RCNN.ROI_PER_IMAGE + bg_rois_per_this_image = 0 + elif bg_num_rois > 0 and fg_num_rois == 0: + # sampling bg + bg_rois_per_this_image = cfg.RCNN.ROI_PER_IMAGE + bg_inds = self.sample_bg_inds(hard_bg_inds, easy_bg_inds, bg_rois_per_this_image) + fg_rois_per_this_image = 0 + else: + import pdb + pdb.set_trace() + raise NotImplementedError + + # augment the rois by noise + 
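        # fg ROIs are jittered up to 10 times until their IoU with the matched gt
        # stays above the fg threshold; bg ROIs are jittered only once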
roi_list, roi_iou_list, roi_gt_list = [], [], [] + if fg_rois_per_this_image > 0: + fg_rois_src = roi_boxes3d[fg_inds].copy() + gt_of_fg_rois = gt_boxes3d[gt_assignment[fg_inds]] + fg_rois, fg_iou3d = self.aug_roi_by_noise_batch(fg_rois_src, gt_of_fg_rois, aug_times=10) + roi_list.append(fg_rois) + roi_iou_list.append(fg_iou3d) + roi_gt_list.append(gt_of_fg_rois) + + if bg_rois_per_this_image > 0: + bg_rois_src = roi_boxes3d[bg_inds].copy() + gt_of_bg_rois = gt_boxes3d[gt_assignment[bg_inds]] + bg_rois, bg_iou3d = self.aug_roi_by_noise_batch(bg_rois_src, gt_of_bg_rois, aug_times=1) + roi_list.append(bg_rois) + roi_iou_list.append(bg_iou3d) + roi_gt_list.append(gt_of_bg_rois) + + rois = np.concatenate(roi_list, axis=0) + iou_of_rois = np.concatenate(roi_iou_list, axis=0) + gt_of_rois = np.concatenate(roi_gt_list, axis=0) + + # collect extra features for point cloud pooling + if cfg.RCNN.USE_INTENSITY: + pts_extra_input_list = [rpn_intensity.reshape(-1, 1), seg_mask.reshape(-1, 1)] + else: + pts_extra_input_list = [seg_mask.reshape(-1, 1)] + + if cfg.RCNN.USE_DEPTH: + pts_depth = (np.linalg.norm(rpn_xyz, ord=2, axis=1) / 70.0) - 0.5 + pts_extra_input_list.append(pts_depth.reshape(-1, 1)) + pts_extra_input = np.concatenate(pts_extra_input_list, axis=1) + + # pts, pts_feature, boxes3d, pool_extra_width, sampled_pt_num + pts_input, pts_features, pts_empty_flag = roipool3d_utils.roipool3d_cpu( + rpn_xyz, rpn_features, rois, pts_extra_input, + cfg.RCNN.POOL_EXTRA_WIDTH, + sampled_pt_num=cfg.RCNN.NUM_POINTS, + #canonical_transform=False + ) + + # data augmentation + if cfg.AUG_DATA and self.mode == 'TRAIN': + for k in range(rois.__len__()): + aug_pts = pts_input[k, :, 0:3].copy() + aug_gt_box3d = gt_of_rois[k].copy() + aug_roi_box3d = rois[k].copy() + + # calculate alpha by ry + temp_boxes3d = np.concatenate([aug_roi_box3d.reshape(1, 7), aug_gt_box3d.reshape(1, 7)], axis=0) + temp_x, temp_z, temp_ry = temp_boxes3d[:, 0], temp_boxes3d[:, 2], temp_boxes3d[:, 6] + temp_beta = np.arctan2(temp_z, temp_x).astype(np.float64) + temp_alpha = -np.sign(temp_beta) * np.pi / 2 + temp_beta + temp_ry + + # data augmentation + aug_pts, aug_boxes3d, aug_method = self.data_augmentation(aug_pts, temp_boxes3d, temp_alpha, + mustaug=True, stage=2) + + # assign to original data + pts_input[k, :, 0:3] = aug_pts + rois[k] = aug_boxes3d[0] + gt_of_rois[k] = aug_boxes3d[1] + + valid_mask = (pts_empty_flag == 0).astype(np.int32) + # regression valid mask + reg_valid_mask = (iou_of_rois > cfg.RCNN.REG_FG_THRESH).astype(np.int32) & valid_mask + + # classification label + cls_label = (iou_of_rois > cfg.RCNN.CLS_FG_THRESH).astype(np.int32) + invalid_mask = (iou_of_rois > cfg.RCNN.CLS_BG_THRESH) & (iou_of_rois < cfg.RCNN.CLS_FG_THRESH) + cls_label[invalid_mask] = -1 + cls_label[valid_mask == 0] = -1 + + # canonical transform and sampling + pts_input_ct, gt_boxes3d_ct = self.canonical_transform_batch(pts_input, rois, gt_of_rois) + + pts_input_ = np.concatenate((pts_input_ct, pts_features), axis=-1) + sample_info = OrderedDict() + + sample_info['sample_id'] = sample_id + sample_info['pts_input'] = pts_input_ + sample_info['pts_feature'] = pts_features + sample_info['roi_boxes3d'] = rois + sample_info['cls_label'] = cls_label + sample_info['reg_valid_mask'] = reg_valid_mask + sample_info['gt_boxes3d_ct'] = gt_boxes3d_ct + sample_info['gt_of_rois'] = gt_of_rois + return sample_info + + @staticmethod + def random_aug_box3d(box3d): + """ + :param box3d: (7) [x, y, z, h, w, l, ry] + random shift, scale, orientation + """ + if 
cfg.RCNN.REG_AUG_METHOD == 'single': + pos_shift = (np.random.rand(3) - 0.5) # [-0.5 ~ 0.5] + hwl_scale = (np.random.rand(3) - 0.5) / (0.5 / 0.15) + 1.0 # + angle_rot = (np.random.rand(1) - 0.5) / (0.5 / (np.pi / 12)) # [-pi/12 ~ pi/12] + + aug_box3d = np.concatenate([box3d[0:3] + pos_shift, box3d[3:6] * hwl_scale, + box3d[6:7] + angle_rot]) + return aug_box3d + elif cfg.RCNN.REG_AUG_METHOD == 'multiple': + # pos_range, hwl_range, angle_range, mean_iou + range_config = [[0.2, 0.1, np.pi / 12, 0.7], + [0.3, 0.15, np.pi / 12, 0.6], + [0.5, 0.15, np.pi / 9, 0.5], + [0.8, 0.15, np.pi / 6, 0.3], + [1.0, 0.15, np.pi / 3, 0.2]] + idx = np.random.randint(len(range_config)) + + pos_shift = ((np.random.rand(3) - 0.5) / 0.5) * range_config[idx][0] + hwl_scale = ((np.random.rand(3) - 0.5) / 0.5) * range_config[idx][1] + 1.0 + angle_rot = ((np.random.rand(1) - 0.5) / 0.5) * range_config[idx][2] + + aug_box3d = np.concatenate([box3d[0:3] + pos_shift, box3d[3:6] * hwl_scale, box3d[6:7] + angle_rot]) + return aug_box3d + elif cfg.RCNN.REG_AUG_METHOD == 'normal': + x_shift = np.random.normal(loc=0, scale=0.3) + y_shift = np.random.normal(loc=0, scale=0.2) + z_shift = np.random.normal(loc=0, scale=0.3) + h_shift = np.random.normal(loc=0, scale=0.25) + w_shift = np.random.normal(loc=0, scale=0.15) + l_shift = np.random.normal(loc=0, scale=0.5) + ry_shift = ((np.random.rand() - 0.5) / 0.5) * np.pi / 12 + + aug_box3d = np.array([box3d[0] + x_shift, box3d[1] + y_shift, box3d[2] + z_shift, box3d[3] + h_shift, + box3d[4] + w_shift, box3d[5] + l_shift, box3d[6] + ry_shift]) + return aug_box3d + else: + raise NotImplementedError + + def get_proposal_from_file(self, index): + sample_id = int(self.image_idx_list[index]) + proposal_file = os.path.join(self.rcnn_eval_roi_dir, '%06d.txt' % sample_id) + roi_obj_list = kitti_utils.get_objects_from_label(proposal_file) + + rpn_xyz, rpn_features, rpn_intensity, seg_mask = self.get_rpn_features(self.rcnn_eval_feature_dir, sample_id) + pts_rect, pts_rpn_features, pts_intensity = rpn_xyz, rpn_features, rpn_intensity + + roi_box3d_list, roi_scores = [], [] + for obj in roi_obj_list: + box3d = np.array([obj.pos[0], obj.pos[1], obj.pos[2], obj.h, obj.w, obj.l, obj.ry], dtype=np.float32) + roi_box3d_list.append(box3d.reshape(1, 7)) + roi_scores.append(obj.score) + + roi_boxes3d = np.concatenate(roi_box3d_list, axis=0) # (N, 7) + roi_scores = np.array(roi_scores, dtype=np.float32) # (N) + + if cfg.RCNN.ROI_SAMPLE_JIT: + sample_dict = {'sample_id': sample_id, + 'rpn_xyz': rpn_xyz, + 'rpn_features': rpn_features, + 'seg_mask': seg_mask, + 'roi_boxes3d': roi_boxes3d, + 'roi_scores': roi_scores, + 'pts_depth': np.linalg.norm(rpn_xyz, ord=2, axis=1)} + + if self.mode != 'TEST': + gt_obj_list = self.filtrate_objects(self.get_label(sample_id)) + gt_boxes3d = kitti_utils.objs_to_boxes3d(gt_obj_list) + + roi_corners = kitti_utils.boxes3d_to_corners3d(roi_boxes3d,True) + gt_corners = kitti_utils.boxes3d_to_corners3d(gt_boxes3d,True) + iou3d = kitti_utils.get_iou3d(roi_corners, gt_corners) + if gt_boxes3d.shape[0] > 0: + gt_iou = iou3d.max(axis=1) + else: + gt_iou = np.zeros(roi_boxes3d.shape[0]).astype(np.float32) + + sample_dict['gt_boxes3d'] = gt_boxes3d + sample_dict['gt_iou'] = gt_iou + return sample_dict + + if cfg.RCNN.USE_INTENSITY: + pts_extra_input_list = [pts_intensity.reshape(-1, 1), seg_mask.reshape(-1, 1)] + else: + pts_extra_input_list = [seg_mask.reshape(-1, 1)] + + if cfg.RCNN.USE_DEPTH: + cur_depth = np.linalg.norm(pts_rect, axis=1, ord=2) + cur_depth_norm = (cur_depth / 
70.0) - 0.5 + pts_extra_input_list.append(cur_depth_norm.reshape(-1, 1)) + + pts_extra_input = np.concatenate(pts_extra_input_list, axis=1) + pts_input, pts_features, _ = roipool3d_utils.roipool3d_cpu( + pts_rect, pts_rpn_features, roi_boxes3d, pts_extra_input, + cfg.RCNN.POOL_EXTRA_WIDTH, sampled_pt_num=cfg.RCNN.NUM_POINTS, + canonical_transform=True + ) + pts_input = np.concatenate((pts_input, pts_features), axis=-1) + + sample_dict = OrderedDict() + sample_dict['sample_id'] = sample_id + sample_dict['pts_input'] = pts_input + sample_dict['pts_feature'] = pts_features + sample_dict['roi_boxes3d'] = roi_boxes3d + sample_dict['roi_scores'] = roi_scores + #sample_dict['roi_size'] = roi_boxes3d[:, 3:6] + + if self.mode == 'TEST': + return sample_dict + + gt_obj_list = self.filtrate_objects(self.get_label(sample_id)) + gt_boxes3d = np.zeros((gt_obj_list.__len__(), 7), dtype=np.float32) + + for k, obj in enumerate(gt_obj_list): + gt_boxes3d[k, 0:3], gt_boxes3d[k, 3], gt_boxes3d[k, 4], gt_boxes3d[k, 5], gt_boxes3d[k, 6] \ + = obj.pos, obj.h, obj.w, obj.l, obj.ry + + if gt_boxes3d.__len__() == 0: + gt_iou = np.zeros((roi_boxes3d.shape[0]), dtype=np.float32) + else: + roi_corners = kitti_utils.boxes3d_to_corners3d(roi_boxes3d,True) + gt_corners = kitti_utils.boxes3d_to_corners3d(gt_boxes3d,True) + iou3d = kitti_utils.get_iou3d(roi_corners, gt_corners) + gt_iou = iou3d.max(axis=1) + + sample_dict['gt_iou'] = gt_iou + sample_dict['gt_boxes3d'] = gt_boxes3d + + return sample_dict + + def __len__(self): + if cfg.RPN.ENABLED: + return len(self.sample_id_list) + elif cfg.RCNN.ENABLED: + if self.mode == 'TRAIN': + return len(self.sample_id_list) + else: + return len(self.image_idx_list) + else: + raise NotImplementedError + + def __getitem__(self, index): + if cfg.RPN.ENABLED: + return self.get_rpn_sample(index) + elif cfg.RCNN.ENABLED: + if self.mode == 'TRAIN': + if cfg.RCNN.ROI_SAMPLE_JIT: + return self.get_rcnn_sample_jit(index) + else: + return self.get_rcnn_training_sample_batch(index) + else: + return self.get_proposal_from_file(index) + else: + raise NotImplementedError + + def padding_batch(self, batch_data, batch_size): + max_roi = 0 + max_gt = 0 + + for k in range(batch_size): + # roi_boxes3d + max_roi = max(max_roi, batch_data[k][3].shape[0]) + # gt_boxes3d + max_gt = max(max_gt, batch_data[k][-1].shape[0]) + batch_roi_boxes3d = np.zeros((batch_size, max_roi, 7)) + batch_gt_boxes3d = np.zeros((batch_size, max_gt, 7), dtype=np.float32) + + for i, data in enumerate(batch_data): + roi_num = data[3].shape[0] + gt_num = data[-1].shape[0] + batch_roi_boxes3d[i,:roi_num,:] = data[3] + batch_gt_boxes3d[i,:gt_num,:] = data[-1] + + new_batch = [] + for i, data in enumerate(batch_data): + new_batch.append(data[:3]) + # roi_boxes3d + new_batch[i].append(batch_roi_boxes3d[i]) + # ... 
+ new_batch[i].extend(data[4:7]) + # gt_boxes3d + new_batch[i].append(batch_gt_boxes3d[i]) + return new_batch + + def padding_batch_eval(self, batch_data, batch_size): + max_pts = 0 + max_feats = 0 + max_roi = 0 + max_score = 0 + max_iou = 0 + max_gt = 0 + + for k in range(batch_size): + # pts_input + max_pts = max(max_pts, batch_data[k][1].shape[0]) + # pts_feature + max_feats = max(max_feats, batch_data[k][2].shape[0]) + # roi_boxes3d + max_roi = max(max_roi, batch_data[k][3].shape[0]) + # gt_iou + max_iou = max(max_iou, batch_data[k][-2].shape[0]) + # gt_boxes3d + max_gt = max(max_gt, batch_data[k][-1].shape[0]) + batch_pts_input = np.zeros((batch_size, max_pts, 512, 133), dtype=np.float32) + batch_pts_feat = np.zeros((batch_size, max_feats, 512, 128), dtype=np.float32) + batch_roi_boxes3d = np.zeros((batch_size, max_roi, 7), dtype=np.float32) + batch_gt_iou = np.zeros((batch_size, max_iou), dtype=np.float32) + batch_gt_boxes3d = np.zeros((batch_size, max_gt, 7), dtype=np.float32) + + for i, data in enumerate(batch_data): + # num + pts_num = data[1].shape[0] + pts_feat_num = data[2].shape[0] + roi_num = data[3].shape[0] + iou_num = data[-2].shape[0] + gt_num = data[-1].shape[0] + # data + batch_pts_input[i, :pts_num, :, :] = data[1] + batch_pts_feat[i, :pts_feat_num, :, :] = data[2] + batch_roi_boxes3d[i,:roi_num,:] = data[3] + batch_gt_iou[i,:iou_num] = data[-2] + batch_gt_boxes3d[i,:gt_num,:] = data[-1] + + new_batch = [] + for i, data in enumerate(batch_data): + new_batch.append(data[:1]) + new_batch[i].append(batch_pts_input[i]) + new_batch[i].append(batch_pts_feat[i]) + new_batch[i].append(batch_roi_boxes3d[i]) + new_batch[i].append(data[4]) + new_batch[i].append(batch_gt_iou[i]) + new_batch[i].append(batch_gt_boxes3d[i]) + return new_batch + + def get_reader(self, batch_size, fields, drop_last=False): + def reader(): + batch_out = [] + idxs = np.arange(self.__len__()) + if self.mode == 'TRAIN': + np.random.shuffle(idxs) + for idx in idxs: + sample_all = self.__getitem__(idx) + sample = [sample_all[f] for f in fields] + if has_empty(sample): + logger.info("sample field: %d has empty field"%len(sample)) + continue + batch_out.append(sample) + if len(batch_out) >= batch_size: + if cfg.RPN.ENABLED: + yield batch_out + else: + if self.mode == 'TRAIN': + yield self.padding_batch(batch_out, batch_size) + elif self.mode == 'EVAL': + # batch_size can should be 1 in rcnn_offline eval currently + # if batch_size > 1, batch should be padded as follow + # yield self.padding_batch_eval(batch_out, batch_size) + yield batch_out + else: + logger.error("not only support train/eval padding") + batch_out = [] + if not drop_last: + if len(batch_out) > 0: + yield batch_out + return reader + + def get_multiprocess_reader(self, batch_size, fields, proc_num=8, max_queue_len=128, drop_last=False): + def read_to_queue(idxs, queue): + for idx in idxs: + sample_all = self.__getitem__(idx) + sample = [sample_all[f] for f in fields] + queue.put(sample) + queue.put(None) + + def reader(): + sample_num = self.__len__() + idxs = np.arange(self.__len__()) + if self.mode == 'TRAIN': + np.random.shuffle(idxs) + + proc_idxs = [] + proc_sample_num = int(sample_num / proc_num) + start_idx = 0 + for i in range(proc_num - 1): + proc_idxs.append(idxs[start_idx:start_idx + proc_sample_num]) + start_idx += proc_sample_num + proc_idxs.append(idxs[start_idx:]) + + queue = multiprocessing.Queue(max_queue_len) + p_list = [] + for i in range(proc_num): + p_list.append(multiprocessing.Process( + target=read_to_queue, 
args=(proc_idxs[i], queue,))) + p_list[-1].start() + + finish_num = 0 + batch_out = [] + while finish_num < len(p_list): + sample = queue.get() + if sample is None: + finish_num += 1 + else: + batch_out.append(sample) + if len(batch_out) == batch_size: + yield batch_out + batch_out = [] + + # join process + for p in p_list: + if p.is_alive(): + p.join() + + return reader + + +def _term_reader(signum, frame): + logger.info('pid {} terminated, terminate reader process ' + 'group {}...'.format(os.getpid(), os.getpgrp())) + os.killpg(os.getpgid(os.getpid()), signal.SIGKILL) + +signal.signal(signal.SIGINT, _term_reader) + diff --git a/PaddleCV/Paddle3D/PointRCNN/eval.py b/PaddleCV/Paddle3D/PointRCNN/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..7ee5d37f40bbee8a5486090b1ebda05f0d5928a8 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/eval.py @@ -0,0 +1,343 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import sys +import time +import shutil +import argparse +import logging +import multiprocessing +import numpy as np +from collections import OrderedDict +import paddle +import paddle.fluid as fluid + +from models.point_rcnn import PointRCNN +from data.kitti_rcnn_reader import KittiRCNNReader +from utils.run_utils import * +from utils.config import cfg, load_config, set_config_from_list +from utils.metric_utils import calc_iou_recall, rpn_metric, rcnn_metric + +logging.root.handlers = [] +FORMAT = '%(asctime)s-%(levelname)s: %(message)s' +logging.basicConfig(level=logging.INFO, format=FORMAT, stream=sys.stdout) +logger = logging.getLogger(__name__) + +np.random.seed(1024) # use same seed +METRIC_PROC_NUM = 4 + + +def parse_args(): + parser = argparse.ArgumentParser( + "PointRCNN semantic segmentation train script") + parser.add_argument( + '--cfg', + type=str, + default='cfgs/default.yml', + help='specify the config for training') + parser.add_argument( + '--eval_mode', + type=str, + default='rpn', + required=True, + help='specify the training mode') + parser.add_argument( + '--batch_size', + type=int, + default=1, + help='evaluation batch size, default 1') + parser.add_argument( + '--ckpt_dir', + type=str, + default='checkpoints/199', + help='specify a ckpt directory to be evaluated if needed') + parser.add_argument( + '--data_dir', + type=str, + default='./data', + help='KITTI dataset root directory') + parser.add_argument( + '--output_dir', + type=str, + default='output', + help='output directory') + parser.add_argument( + '--save_rpn_feature', + action='store_true', + default=False, + help='save features for separately rcnn training and evaluation') + parser.add_argument( + '--save_result', + action='store_true', + default=False, + help='save roi and refine result of evaluation') + parser.add_argument( + '--rcnn_eval_roi_dir', + type=str, + default=None, + help='specify the saved rois for rcnn evaluation when using rcnn_offline mode') + parser.add_argument( + '--rcnn_eval_feature_dir', + 
type=str, + default=None, + help='specify the saved features for rcnn evaluation when using rcnn_offline mode') + parser.add_argument( + '--log_interval', + type=int, + default=1, + help='mini-batch interval to log.') + parser.add_argument( + '--set', + dest='set_cfgs', + default=None, + nargs=argparse.REMAINDER, + help='set extra config keys if needed.') + args = parser.parse_args() + return args + + +def eval(): + args = parse_args() + print_arguments(args) + # check whether the installed paddle is compiled with GPU + # PointRCNN model can only run on GPU + check_gpu(True) + + load_config(args.cfg) + if args.set_cfgs is not None: + set_config_from_list(args.set_cfgs) + + if not os.path.isdir(args.output_dir): + os.makedirs(args.output_dir) + + if args.eval_mode == 'rpn': + cfg.RPN.ENABLED = True + cfg.RCNN.ENABLED = False + elif args.eval_mode == 'rcnn': + cfg.RCNN.ENABLED = True + cfg.RPN.ENABLED = cfg.RPN.FIXED = True + assert args.batch_size, "batch size must be 1 in rcnn evaluation" + elif args.eval_mode == 'rcnn_offline': + cfg.RCNN.ENABLED = True + cfg.RPN.ENABLED = False + assert args.batch_size, "batch size must be 1 in rcnn_offline evaluation" + else: + raise NotImplementedError("unkown eval mode: {}".format(args.eval_mode)) + + place = fluid.CUDAPlace(0) + exe = fluid.Executor(place) + + # build model + startup = fluid.Program() + eval_prog = fluid.Program() + with fluid.program_guard(eval_prog, startup): + with fluid.unique_name.guard(): + eval_model = PointRCNN(cfg, args.batch_size, True, 'TEST') + eval_model.build() + eval_pyreader = eval_model.get_pyreader() + eval_feeds = eval_model.get_feeds() + eval_outputs = eval_model.get_outputs() + eval_prog = eval_prog.clone(True) + + extra_keys = [] + if args.eval_mode == 'rpn': + extra_keys.extend(['sample_id', 'rpn_cls_label', 'gt_boxes3d']) + if args.save_rpn_feature: + extra_keys.extend(['pts_rect', 'pts_features', 'pts_input',]) + eval_keys, eval_values = parse_outputs( + eval_outputs, prog=eval_prog, extra_keys=extra_keys) + + eval_compile_prog = fluid.compiler.CompiledProgram( + eval_prog).with_data_parallel() + + exe.run(startup) + + # load checkpoint + assert os.path.isdir( + args.ckpt_dir), "ckpt_dir {} not a directory".format(args.ckpt_dir) + + def if_exist(var): + return os.path.exists(os.path.join(args.ckpt_dir, var.name)) + fluid.io.load_vars(exe, args.ckpt_dir, eval_prog, predicate=if_exist) + + kitti_feature_dir = os.path.join(args.output_dir, 'features') + kitti_output_dir = os.path.join(args.output_dir, 'detections', 'data') + seg_output_dir = os.path.join(args.output_dir, 'seg_result') + if args.save_rpn_feature: + if os.path.exists(kitti_feature_dir): + shutil.rmtree(kitti_feature_dir) + os.makedirs(kitti_feature_dir) + if os.path.exists(kitti_output_dir): + shutil.rmtree(kitti_output_dir) + os.makedirs(kitti_output_dir) + if os.path.exists(seg_output_dir): + shutil.rmtree(seg_output_dir) + os.makedirs(seg_output_dir) + + # must make sure these dirs existing + roi_output_dir = os.path.join('./result_dir', 'roi_result', 'data') + refine_output_dir = os.path.join('./result_dir', 'refine_result', 'data') + final_output_dir = os.path.join("./result_dir", 'final_result', 'data') + if not os.path.exists(final_output_dir): + os.makedirs(final_output_dir) + if args.save_result: + if not os.path.exists(roi_output_dir): + os.makedirs(roi_output_dir) + if not os.path.exists(refine_output_dir): + os.makedirs(refine_output_dir) + + # get reader + kitti_rcnn_reader = KittiRCNNReader(data_dir=args.data_dir, + 
npoints=cfg.RPN.NUM_POINTS, + split=cfg.TEST.SPLIT, + mode='EVAL', + classes=cfg.CLASSES, + rcnn_eval_roi_dir=args.rcnn_eval_roi_dir, + rcnn_eval_feature_dir=args.rcnn_eval_feature_dir) + eval_reader = kitti_rcnn_reader.get_multiprocess_reader(args.batch_size, eval_feeds) + eval_pyreader.decorate_sample_list_generator(eval_reader, place) + + thresh_list = [0.1, 0.3, 0.5, 0.7, 0.9] + queue = multiprocessing.Queue(128) + mgr = multiprocessing.Manager() + lock = multiprocessing.Lock() + mdict = mgr.dict() + if cfg.RPN.ENABLED: + mdict['exit_proc'] = 0 + mdict['total_gt_bbox'] = 0 + mdict['total_cnt'] = 0 + mdict['total_rpn_iou'] = 0 + for i in range(len(thresh_list)): + mdict['total_recalled_bbox_list_{}'.format(i)] = 0 + + p_list = [] + for i in range(METRIC_PROC_NUM): + p_list.append(multiprocessing.Process( + target=rpn_metric, + args=(queue, mdict, lock, thresh_list, args.save_rpn_feature, kitti_feature_dir, + seg_output_dir, kitti_output_dir, kitti_rcnn_reader, cfg.CLASSES))) + p_list[-1].start() + + if cfg.RCNN.ENABLED: + for i in range(len(thresh_list)): + mdict['total_recalled_bbox_list_{}'.format(i)] = 0 + mdict['total_roi_recalled_bbox_list_{}'.format(i)] = 0 + mdict['exit_proc'] = 0 + mdict['total_cls_acc'] = 0 + mdict['total_cls_acc_refined'] = 0 + mdict['total_det_num'] = 0 + mdict['total_gt_bbox'] = 0 + p_list = [] + for i in range(METRIC_PROC_NUM): + p_list.append(multiprocessing.Process( + target=rcnn_metric, + args=(queue, mdict, lock, thresh_list, kitti_rcnn_reader, roi_output_dir, + refine_output_dir, final_output_dir, args.save_result) + )) + p_list[-1].start() + + try: + eval_pyreader.start() + eval_iter = 0 + start_time = time.time() + + cur_time = time.time() + while True: + eval_outs = exe.run(eval_compile_prog, fetch_list=eval_values, return_numpy=False) + rets_dict = {k: (np.array(v), v.recursive_sequence_lengths()) + for k, v in zip(eval_keys, eval_outs)} + run_time = time.time() - cur_time + cur_time = time.time() + queue.put(rets_dict) + eval_iter += 1 + + logger.info("[EVAL] iter {}, time: {:.2f}".format( + eval_iter, run_time)) + + except fluid.core.EOFException: + # terminate metric process + for i in range(METRIC_PROC_NUM): + queue.put(None) + while mdict['exit_proc'] < METRIC_PROC_NUM: + time.sleep(1) + for p in p_list: + if p.is_alive(): + p.join() + + end_time = time.time() + logger.info("[EVAL] total {} iter finished, average time: {:.2f}".format( + eval_iter, (end_time - start_time) / float(eval_iter))) + + if cfg.RPN.ENABLED: + avg_rpn_iou = mdict['total_rpn_iou'] / max(len(kitti_rcnn_reader), 1.) 
+ logger.info("average rpn iou: {:.3f}".format(avg_rpn_iou)) + total_gt_bbox = float(max(mdict['total_gt_bbox'], 1.0)) + for idx, thresh in enumerate(thresh_list): + recall = mdict['total_recalled_bbox_list_{}'.format(idx)] / total_gt_bbox + logger.info("total bbox recall(thresh={:.3f}): {} / {} = {:.3f}".format( + thresh, mdict['total_recalled_bbox_list_{}'.format(idx)], mdict['total_gt_bbox'], recall)) + + if cfg.RCNN.ENABLED: + cnt = float(max(eval_iter, 1.0)) + avg_cls_acc = mdict['total_cls_acc'] / cnt + avg_cls_acc_refined = mdict['total_cls_acc_refined'] / cnt + avg_det_num = mdict['total_det_num'] / cnt + + logger.info("avg_cls_acc: {}".format(avg_cls_acc)) + logger.info("avg_cls_acc_refined: {}".format(avg_cls_acc_refined)) + logger.info("avg_det_num: {}".format(avg_det_num)) + + total_gt_bbox = float(max(mdict['total_gt_bbox'], 1.0)) + for idx, thresh in enumerate(thresh_list): + cur_roi_recall = mdict['total_roi_recalled_bbox_list_{}'.format(idx)] / total_gt_bbox + logger.info('total roi bbox recall(thresh=%.3f): %d / %d = %f' % ( + thresh, mdict['total_roi_recalled_bbox_list_{}'.format(idx)], total_gt_bbox, cur_roi_recall)) + + for idx, thresh in enumerate(thresh_list): + cur_recall = mdict['total_recalled_bbox_list_{}'.format(idx)] / total_gt_bbox + logger.info('total bbox recall(thresh=%.2f) %d / %.2f = %.4f' % ( + thresh, mdict['total_recalled_bbox_list_{}'.format(idx)], total_gt_bbox, cur_recall)) + + split_file = os.path.join('./data/KITTI', 'ImageSets', 'val.txt') + image_idx_list = [x.strip() for x in open(split_file).readlines()] + for k in range(image_idx_list.__len__()): + cur_file = os.path.join(final_output_dir, '%s.txt' % image_idx_list[k]) + if not os.path.exists(cur_file): + with open(cur_file, 'w') as temp_f: + pass + + if float(sys.version[:3]) >= 3.6: + label_dir = os.path.join('./data/KITTI/object/training', 'label_2') + split_file = os.path.join('./data/KITTI', 'ImageSets', 'val.txt') + final_output_dir = os.path.join("./result_dir", 'final_result', 'data') + name_to_class = {'Car': 0, 'Pedestrian': 1, 'Cyclist': 2} + + from tools.kitti_object_eval_python.evaluate import evaluate as kitti_evaluate + ap_result_str, ap_dict = kitti_evaluate( + label_dir, final_output_dir, label_split_file=split_file, + current_class=name_to_class["Car"]) + + logger.info("KITTI evaluate: {}, {}".format(ap_result_str, ap_dict)) + + else: + logger.info("KITTI mAP only support python version >= 3.6, users can " + "run 'python3 tools/kitti_eval.py' to evaluate KITTI mAP.") + + finally: + eval_pyreader.reset() + + +if __name__ == "__main__": + eval() diff --git a/PaddleCV/Paddle3D/PointRCNN/ext_op b/PaddleCV/Paddle3D/PointRCNN/ext_op new file mode 120000 index 0000000000000000000000000000000000000000..dca99c677c8fa26e7cbf3ce1d50a8e6af0621655 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/ext_op @@ -0,0 +1 @@ +../PointNet++/ext_op \ No newline at end of file diff --git a/PaddleCV/Paddle3D/PointRCNN/images/teaser.png b/PaddleCV/Paddle3D/PointRCNN/images/teaser.png new file mode 100644 index 0000000000000000000000000000000000000000..21ae7e98165074ef93dc34fc643b3fddc5fe6c36 Binary files /dev/null and b/PaddleCV/Paddle3D/PointRCNN/images/teaser.png differ diff --git a/PaddleCV/Paddle3D/PointRCNN/models/__init__.py b/PaddleCV/Paddle3D/PointRCNN/models/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..46a4f6ee220f10f50a182f4a2ed510b0551f64a8 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/models/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2019 PaddlePaddle 
Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. diff --git a/PaddleCV/Paddle3D/PointRCNN/models/loss_utils.py b/PaddleCV/Paddle3D/PointRCNN/models/loss_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..d70d1bd69c63b4e31616c1325e69187710f23961 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/models/loss_utils.py @@ -0,0 +1,202 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np + +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr +from paddle.fluid.initializer import Constant + +__all__ = ["get_reg_loss"] + + +def sigmoid_focal_loss(logits, labels, weights, gamma=2.0, alpha=0.25): + sce_loss = fluid.layers.sigmoid_cross_entropy_with_logits(logits, labels) + prob = fluid.layers.sigmoid(logits) + p_t = labels * prob + (1.0 - labels) * (1.0 - prob) + modulating_factor = fluid.layers.pow(1.0 - p_t, gamma) + alpha_weight_factor = labels * alpha + (1.0 - labels) * (1.0 - alpha) + return modulating_factor * alpha_weight_factor * sce_loss * weights + + +def get_reg_loss(pred_reg, reg_label, fg_mask, point_num, loc_scope, + loc_bin_size, num_head_bin, anchor_size, + get_xz_fine=True, get_y_by_bin=False, loc_y_scope=0.5, + loc_y_bin_size=0.25, get_ry_fine=False): + + """ + Bin-based 3D bounding boxes regression loss. See https://arxiv.org/abs/1812.04244 for more details. 
+ + :param pred_reg: (N, C) + :param reg_label: (N, 7) [dx, dy, dz, h, w, l, ry] + :param loc_scope: constant + :param loc_bin_size: constant + :param num_head_bin: constant + :param anchor_size: (N, 3) or (3) + :param get_xz_fine: + :param get_y_by_bin: + :param loc_y_scope: + :param loc_y_bin_size: + :param get_ry_fine: + :return: + """ + fg_num = fluid.layers.cast(fluid.layers.reduce_sum(fg_mask), dtype=pred_reg.dtype) + fg_num = fluid.layers.clip(fg_num, min=1.0, max=point_num) + fg_scale = float(point_num) / fg_num + + per_loc_bin_num = int(loc_scope / loc_bin_size) * 2 + loc_y_bin_num = int(loc_y_scope / loc_y_bin_size) * 2 + + reg_loss_dict = {} + + # xz localization loss + x_offset_label, y_offset_label, z_offset_label = reg_label[:, 0:1], reg_label[:, 1:2], reg_label[:, 2:3] + x_shift = fluid.layers.clip(x_offset_label + loc_scope, 0., loc_scope * 2 - 1e-3) + z_shift = fluid.layers.clip(z_offset_label + loc_scope, 0., loc_scope * 2 - 1e-3) + x_bin_label = fluid.layers.cast(x_shift / loc_bin_size, dtype='int64') + z_bin_label = fluid.layers.cast(z_shift / loc_bin_size, dtype='int64') + + x_bin_l, x_bin_r = 0, per_loc_bin_num + z_bin_l, z_bin_r = per_loc_bin_num, per_loc_bin_num * 2 + start_offset = z_bin_r + + loss_x_bin = fluid.layers.softmax_with_cross_entropy(pred_reg[:, x_bin_l: x_bin_r], x_bin_label) + loss_x_bin = fluid.layers.reduce_mean(loss_x_bin * fg_mask) * fg_scale + loss_z_bin = fluid.layers.softmax_with_cross_entropy(pred_reg[:, z_bin_l: z_bin_r], z_bin_label) + loss_z_bin = fluid.layers.reduce_mean(loss_z_bin * fg_mask) * fg_scale + reg_loss_dict['loss_x_bin'] = loss_x_bin + reg_loss_dict['loss_z_bin'] = loss_z_bin + loc_loss = loss_x_bin + loss_z_bin + + if get_xz_fine: + x_res_l, x_res_r = per_loc_bin_num * 2, per_loc_bin_num * 3 + z_res_l, z_res_r = per_loc_bin_num * 3, per_loc_bin_num * 4 + start_offset = z_res_r + + x_res_label = x_shift - (fluid.layers.cast(x_bin_label, dtype=x_shift.dtype) * loc_bin_size + loc_bin_size / 2.) + z_res_label = z_shift - (fluid.layers.cast(z_bin_label, dtype=z_shift.dtype) * loc_bin_size + loc_bin_size / 2.) + x_res_norm_label = x_res_label / loc_bin_size + z_res_norm_label = z_res_label / loc_bin_size + + x_bin_onehot = fluid.layers.one_hot(x_bin_label, depth=per_loc_bin_num) + z_bin_onehot = fluid.layers.one_hot(z_bin_label, depth=per_loc_bin_num) + + loss_x_res = fluid.layers.smooth_l1(fluid.layers.reduce_sum(pred_reg[:, x_res_l: x_res_r] * x_bin_onehot, dim=1, keep_dim=True), x_res_norm_label) + loss_x_res = fluid.layers.reduce_mean(loss_x_res * fg_mask) * fg_scale + loss_z_res = fluid.layers.smooth_l1(fluid.layers.reduce_sum(pred_reg[:, z_res_l: z_res_r] * z_bin_onehot, dim=1, keep_dim=True), z_res_norm_label) + loss_z_res = fluid.layers.reduce_mean(loss_z_res * fg_mask) * fg_scale + reg_loss_dict['loss_x_res'] = loss_x_res + reg_loss_dict['loss_z_res'] = loss_z_res + loc_loss += loss_x_res + loss_z_res + + # y localization loss + if get_y_by_bin: + y_bin_l, y_bin_r = start_offset, start_offset + loc_y_bin_num + y_res_l, y_res_r = y_bin_r, y_bin_r + loc_y_bin_num + start_offset = y_res_r + + y_shift = fluid.layers.clip(y_offset_label + loc_y_scope, 0., loc_y_scope * 2 - 1e-3) + y_bin_label = fluid.layers.cast(y_shift / loc_y_bin_size, dtype='int64') + y_res_label = y_shift - (fluid.layers.cast(y_bin_label, dtype=y_shift.dtype) * loc_y_bin_size + loc_y_bin_size / 2.) 
+ y_res_norm_label = y_res_label / loc_y_bin_size + + y_bin_onehot = fluid.layers.one_hot(y_bin_label, depth=per_loc_bin_num) + + loss_y_bin = fluid.layers.cross_entropy(pred_reg[:, y_bin_l: y_bin_r], y_bin_label) + loss_y_bin = fluid.layers.reduce_mean(loss_y_bin * fg_mask) * fg_scale + loss_y_res = fluid.layers.smooth_l1(fluid.layers.reduce_sum(pred_reg[:, y_res_l: y_res_r] * y_bin_onehot, dim=1, keep_dim=True), y_res_norm_label) + loss_y_res = fluid.layers.reduce_mean(loss_y_res * fg_mask) * fg_scale + + reg_loss_dict['loss_y_bin'] = loss_y_bin + reg_loss_dict['loss_y_res'] = loss_y_res + + loc_loss += loss_y_bin + loss_y_res + else: + y_offset_l, y_offset_r = start_offset, start_offset + 1 + start_offset = y_offset_r + + loss_y_offset = fluid.layers.smooth_l1(fluid.layers.reduce_sum(pred_reg[:, y_offset_l: y_offset_r], dim=1, keep_dim=True), y_offset_label) + loss_y_offset = fluid.layers.reduce_mean(loss_y_offset * fg_mask) * fg_scale + reg_loss_dict['loss_y_offset'] = loss_y_offset + loc_loss += loss_y_offset + + # angle loss + ry_bin_l, ry_bin_r = start_offset, start_offset + num_head_bin + ry_res_l, ry_res_r = ry_bin_r, ry_bin_r + num_head_bin + + ry_label = reg_label[:, 6:7] + + if get_ry_fine: + # divide pi/2 into several bins + angle_per_class = (np.pi / 2) / num_head_bin + + ry_label = ry_label % (2 * np.pi) # 0 ~ 2pi + opposite_flag = fluid.layers.logical_and(ry_label > np.pi * 0.5, ry_label < np.pi * 1.5) + opposite_flag = fluid.layers.cast(opposite_flag, dtype=ry_label.dtype) + shift_angle = (ry_label + opposite_flag * np.pi + np.pi * 0.5) % (2 * np.pi) # (0 ~ pi) + shift_angle.stop_gradient = True + + shift_angle = fluid.layers.clip(shift_angle - np.pi * 0.25, min=1e-3, max=np.pi * 0.5 - 1e-3) # (0, pi/2) + + # bin center is (5, 10, 15, ..., 85) + ry_bin_label = fluid.layers.cast(shift_angle / angle_per_class, dtype='int64') + ry_res_label = shift_angle - (fluid.layers.cast(ry_bin_label, dtype=shift_angle.dtype) * angle_per_class + angle_per_class / 2) + ry_res_norm_label = ry_res_label / (angle_per_class / 2) + + else: + # divide 2pi into several bins + angle_per_class = (2 * np.pi) / num_head_bin + heading_angle = ry_label % (2 * np.pi) # 0 ~ 2pi + + shift_angle = (heading_angle + angle_per_class / 2) % (2 * np.pi) + shift_angle.stop_gradient = True + ry_bin_label = fluid.layers.cast(shift_angle / angle_per_class, dtype='int64') + ry_res_label = shift_angle - (fluid.layers.cast(ry_bin_label, dtype=shift_angle.dtype) * angle_per_class + angle_per_class / 2) + ry_res_norm_label = ry_res_label / (angle_per_class / 2) + + ry_bin_onehot = fluid.layers.one_hot(ry_bin_label, depth=num_head_bin) + loss_ry_bin = fluid.layers.softmax_with_cross_entropy(pred_reg[:, ry_bin_l:ry_bin_r], ry_bin_label) + loss_ry_bin = fluid.layers.reduce_mean(loss_ry_bin * fg_mask) * fg_scale + loss_ry_res = fluid.layers.smooth_l1(fluid.layers.reduce_sum(pred_reg[:, ry_res_l: ry_res_r] * ry_bin_onehot, dim=1, keep_dim=True), ry_res_norm_label) + loss_ry_res = fluid.layers.reduce_mean(loss_ry_res * fg_mask) * fg_scale + + reg_loss_dict['loss_ry_bin'] = loss_ry_bin + reg_loss_dict['loss_ry_res'] = loss_ry_res + angle_loss = loss_ry_bin + loss_ry_res + + # size loss + size_res_l, size_res_r = ry_res_r, ry_res_r + 3 + assert pred_reg.shape[1] == size_res_r, '%d vs %d' % (pred_reg.shape[1], size_res_r) + + anchor_size_var = fluid.layers.zeros(shape=[3], dtype=reg_label.dtype) + fluid.layers.assign(np.array(anchor_size).astype('float32'), anchor_size_var) + size_res_norm_label = (reg_label[:, 3:6] - 
anchor_size_var) / anchor_size_var + size_res_norm_label = fluid.layers.reshape(size_res_norm_label, shape=[-1, 1], inplace=True) + size_res_norm = pred_reg[:, size_res_l:size_res_r] + size_res_norm = fluid.layers.reshape(size_res_norm, shape=[-1, 1], inplace=True) + size_loss = fluid.layers.smooth_l1(size_res_norm, size_res_norm_label) + size_loss = fluid.layers.reshape(size_loss, shape=[-1, 3]) + size_loss = fluid.layers.reduce_mean(size_loss * fg_mask) * fg_scale + + # Total regression loss + reg_loss_dict['loss_loc'] = loc_loss + reg_loss_dict['loss_angle'] = angle_loss + reg_loss_dict['loss_size'] = size_loss + + return loc_loss, angle_loss, size_loss, reg_loss_dict + diff --git a/PaddleCV/Paddle3D/PointRCNN/models/point_rcnn.py b/PaddleCV/Paddle3D/PointRCNN/models/point_rcnn.py new file mode 100644 index 0000000000000000000000000000000000000000..890ef897405722f9cc1ba1d129bea2c80fce17a1 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/models/point_rcnn.py @@ -0,0 +1,125 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np +from collections import OrderedDict + +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr +from paddle.fluid.initializer import Constant + +from models.rpn import RPN +from models.rcnn import RCNN + + +__all__ = ["PointRCNN"] + + +class PointRCNN(object): + def __init__(self, cfg, batch_size, use_xyz=True, mode='TRAIN', prog=None): + self.cfg = cfg + self.batch_size = batch_size + self.use_xyz = use_xyz + self.mode = mode + self.is_train = mode == 'TRAIN' + self.num_points = self.cfg.RPN.NUM_POINTS + self.prog = prog + self.inputs = None + self.pyreader = None + + def build_inputs(self): + self.inputs = OrderedDict() + + if self.cfg.RPN.ENABLED: + self.inputs['sample_id'] = fluid.layers.data(name='sample_id', shape=[1], dtype='int32') + self.inputs['pts_input'] = fluid.layers.data(name='pts_input', shape=[self.num_points, 3], dtype='float32') + self.inputs['pts_rect'] = fluid.layers.data(name='pts_rect', shape=[self.num_points, 3], dtype='float32') + self.inputs['pts_features'] = fluid.layers.data(name='pts_features', shape=[self.num_points, 1], dtype='float32') + self.inputs['rpn_cls_label'] = fluid.layers.data(name='rpn_cls_label', shape=[self.num_points], dtype='int32') + self.inputs['rpn_reg_label'] = fluid.layers.data(name='rpn_reg_label', shape=[self.num_points, 7], dtype='float32') + self.inputs['gt_boxes3d'] = fluid.layers.data(name='gt_boxes3d', shape=[7], lod_level=1, dtype='float32') + + if self.cfg.RCNN.ENABLED: + if self.cfg.RCNN.ROI_SAMPLE_JIT: + self.inputs['sample_id'] = fluid.layers.data(name='sample_id', shape=[1], dtype='int32', append_batch_size=False) + self.inputs['rpn_xyz'] = fluid.layers.data(name='rpn_xyz', shape=[self.num_points, 3], dtype='float32', append_batch_size=False) + self.inputs['rpn_features'] = 
fluid.layers.data(name='rpn_features', shape=[self.num_points,128], dtype='float32', append_batch_size=False) + self.inputs['rpn_intensity'] = fluid.layers.data(name='rpn_intensity', shape=[self.num_points], dtype='float32', append_batch_size=False) + self.inputs['seg_mask'] = fluid.layers.data(name='seg_mask', shape=[self.num_points], dtype='float32', append_batch_size=False) + self.inputs['roi_boxes3d'] = fluid.layers.data(name='roi_boxes3d', shape=[-1, -1, 7], dtype='float32', append_batch_size=False, lod_level=0) + self.inputs['pts_depth'] = fluid.layers.data(name='pts_depth', shape=[self.num_points], dtype='float32', append_batch_size=False) + self.inputs['gt_boxes3d'] = fluid.layers.data(name='gt_boxes3d', shape=[-1, -1, 7], dtype='float32', append_batch_size=False, lod_level=0) + else: + self.inputs['sample_id'] = fluid.layers.data(name='sample_id', shape=[-1], dtype='int32', append_batch_size=False) + self.inputs['pts_input'] = fluid.layers.data(name='pts_input', shape=[-1,512,133], dtype='float32', append_batch_size=False) + self.inputs['pts_feature'] = fluid.layers.data(name='pts_feature', shape=[-1,512,128], dtype='float32', append_batch_size=False) + self.inputs['roi_boxes3d'] = fluid.layers.data(name='roi_boxes3d', shape=[-1,7], dtype='float32', append_batch_size=False) + if self.is_train: + self.inputs['cls_label'] = fluid.layers.data(name='cls_label', shape=[-1], dtype='float32', append_batch_size=False) + self.inputs['reg_valid_mask'] = fluid.layers.data(name='reg_valid_mask', shape=[-1], dtype='float32', append_batch_size=False) + self.inputs['gt_boxes3d_ct'] = fluid.layers.data(name='gt_boxes3d_ct', shape=[-1,7], dtype='float32', append_batch_size=False) + self.inputs['gt_of_rois'] = fluid.layers.data(name='gt_of_rois', shape=[-1,7], dtype='float32', append_batch_size=False) + else: + self.inputs['roi_scores'] = fluid.layers.data(name='roi_scores', shape=[-1,], dtype='float32', append_batch_size=False) + self.inputs['gt_iou'] = fluid.layers.data(name='gt_iou', shape=[-1], dtype='float32', append_batch_size=False) + self.inputs['gt_boxes3d'] = fluid.layers.data(name='gt_boxes3d', shape=[-1,-1,7], dtype='float32', append_batch_size=False, lod_level=0) + + + self.pyreader = fluid.io.PyReader( + feed_list=list(self.inputs.values()), + capacity=64, + use_double_buffer=True, + iterable=False) + + def build(self): + self.build_inputs() + if self.cfg.RPN.ENABLED: + self.rpn = RPN(self.cfg, self.batch_size, self.use_xyz, + self.mode, self.prog) + self.rpn.build(self.inputs) + self.rpn_outputs = self.rpn.get_outputs() + self.outputs = self.rpn_outputs + + if self.cfg.RCNN.ENABLED: + self.rcnn = RCNN(self.cfg, 1, self.batch_size, self.mode) + self.rcnn.build_model(self.inputs) + self.outputs = self.rcnn.get_outputs() + + if self.mode == 'TRAIN': + if self.cfg.RPN.ENABLED: + self.outputs['rpn_loss'], self.outputs['rpn_loss_cls'], \ + self.outputs['rpn_loss_reg'] = self.rpn.get_loss() + if self.cfg.RCNN.ENABLED: + self.outputs['rcnn_loss'], self.outputs['rcnn_loss_cls'], \ + self.outputs['rcnn_loss_reg'] = self.rcnn.get_loss() + self.outputs['loss'] = self.outputs.get('rpn_loss', 0.) \ + + self.outputs.get('rcnn_loss', 0.) 
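+    # NOTE: build() only assembles the network and losses. A typical caller
+    # (see eval.py in this PR) constructs the model under a program guard and
+    # then feeds batches through the PyReader:
+    #     eval_model = PointRCNN(cfg, args.batch_size, True, 'TEST')
+    #     eval_model.build()
+    #     eval_pyreader = eval_model.get_pyreader()
+    #     eval_outputs = eval_model.get_outputs()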
+ + def get_feeds(self): + return list(self.inputs.keys()) + + def get_outputs(self): + return self.outputs + + def get_loss(self): + rpn_loss, _, _ = self.rpn.get_loss() + rcnn_loss, _, _ = self.rcnn.get_loss() + return rpn_loss + rcnn_loss + + def get_pyreader(self): + return self.pyreader + diff --git a/PaddleCV/Paddle3D/PointRCNN/models/pointnet2_modules.py b/PaddleCV/Paddle3D/PointRCNN/models/pointnet2_modules.py new file mode 100644 index 0000000000000000000000000000000000000000..6f92bb5f77afc50cdb1c92ab694b82c6ac64479f --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/models/pointnet2_modules.py @@ -0,0 +1,203 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +""" +Contains PointNet++ utility functions. +""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np + +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr +from paddle.fluid.initializer import Constant, Normal +from ext_op import * + +__all__ = ["conv_bn", "pointnet_sa_module", "pointnet_fp_module", "MLP"] + + +def query_and_group(xyz, new_xyz, radius, nsample, features=None, use_xyz=True): + """ + Perform query_ball and group_points + + Args: + xyz (Variable): xyz coordiantes features with shape [B, N, 3] + new_xyz (Variable): centriods features with shape [B, npoint, 3] + radius (float32): radius of ball + nsample (int32): maximum number of gather features + features (Variable): features with shape [B, N, C] + use_xyz (bool): whether use xyz coordiantes features + + Returns: + out (Variable): features with shape [B, npoint, nsample, C + 3] + """ + idx = query_ball(xyz, new_xyz, radius, nsample) + idx.stop_gradient = True + xyz = fluid.layers.transpose(xyz,perm=[0, 2, 1]) + grouped_xyz = group_points(xyz, idx) + expand_new_xyz = fluid.layers.unsqueeze(fluid.layers.transpose(new_xyz, perm=[0, 2, 1]), axes=[-1]) + expand_new_xyz = fluid.layers.expand(expand_new_xyz, [1, 1, 1, grouped_xyz.shape[3]]) + grouped_xyz -= expand_new_xyz + + if features is not None: + grouped_features = group_points(features, idx) + return fluid.layers.concat([grouped_xyz, grouped_features], axis=1) \ + if use_xyz else grouped_features + else: + assert use_xyz, "use_xyz should be True when features is None" + return grouped_xyz + + +def group_all(xyz, features=None, use_xyz=True): + """ + Group all xyz and features when npoint is None + See query_and_group + """ + xyz = fluid.layers.transpose(xyz,perm=[0, 2, 1]) + grouped_xyz = fluid.layers.unsqueeze(xyz, axes=[2]) + if features is not None: + grouped_features = fluid.layers.unsqueeze(features, axes=[2]) + return fluid.layers.concat([grouped_xyz, grouped_features], axis=1) if use_xyz else grouped_features + else: + return grouped_xyz + + +def conv_bn(input, out_channels, bn=True, bn_momentum=0.95, act='relu', name=None): + def _get_kaiming_init(): + fan_in = input.shape[1] + std = (1.0 / fan_in / 3.0) ** 0.5 + return Normal(0., std, 0.) 
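+    # _get_kaiming_init above draws weights from N(0, std) with
+    # std = sqrt(1 / (3 * fan_in)), i.e. the same variance as a
+    # U(-1/sqrt(fan_in), 1/sqrt(fan_in)) initializer often used for 1x1 convs.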
+ + param_attr = ParamAttr(name='{}_conv_weight'.format(name), + initializer=_get_kaiming_init()) + bias_attr = ParamAttr(name='{}_conv_bias'.format(name)) \ + if not bn else False + out = fluid.layers.conv2d(input, + num_filters=out_channels, + filter_size=1, + stride=1, + padding=0, + dilation=1, + param_attr=param_attr, + bias_attr=bias_attr, + act=act if not bn else None) + if bn: + bn_name = name + "_bn" + out = fluid.layers.batch_norm(out, + act=act, + momentum=bn_momentum, + param_attr=ParamAttr(name=bn_name + "_scale"), + bias_attr=ParamAttr(name=bn_name + "_offset"), + moving_mean_name=bn_name + '_mean', + moving_variance_name=bn_name + '_var') + + return out + + +def MLP(features, out_channels_list, bn=True, bn_momentum=0.95, act='relu', name=None): + out = features + for i, out_channels in enumerate(out_channels_list): + out = conv_bn(out, out_channels, bn=bn, act=act, bn_momentum=bn_momentum, name=name + "_{}".format(i)) + return out + + +def pointnet_sa_module(xyz, + npoint=None, + radiuss=[], + nsamples=[], + mlps=[], + feature=None, + bn=True, + bn_momentum=0.95, + use_xyz=True, + name=None): + """ + PointNet MSG(Multi-Scale Group) Set Abstraction Module. + Call with radiuss, nsamples, mlps as single element list for + SSG(Single-Scale Group). + + Args: + xyz (Variable): xyz coordiantes features with shape [B, N, 3] + radiuss ([float32]): list of radius of ball + nsamples ([int32]): list of maximum number of gather features + mlps ([[int32]]): list of out_channels_list + feature (Variable): features with shape [B, C, N] + bn (bool): whether perform batch norm after conv2d + bn_momentum (float): momentum of batch norm + use_xyz (bool): whether use xyz coordiantes features + + Returns: + new_xyz (Variable): centriods features with shape [B, npoint, 3] + out (Variable): features with shape [B, npoint, \sum_i{mlps[i][-1]}] + """ + assert len(radiuss) == len(nsamples) == len(mlps), \ + "radiuss, nsamples, mlps length should be same" + + farthest_idx = farthest_point_sampling(xyz, npoint) + farthest_idx.stop_gradient = True + new_xyz = gather_point(xyz, farthest_idx) if npoint is not None else None + + outs = [] + for i, (radius, nsample, mlp) in enumerate(zip(radiuss, nsamples, mlps)): + out = query_and_group(xyz, new_xyz, radius, nsample, feature, use_xyz) if npoint is not None else group_all(xyz, feature, use_xyz) + out = MLP(out, mlp, bn=bn, bn_momentum=bn_momentum, name=name + '_mlp{}'.format(i)) + out = fluid.layers.pool2d(out, pool_size=[1, out.shape[3]], pool_type='max') + out = fluid.layers.squeeze(out, axes=[-1]) + outs.append(out) + out = fluid.layers.concat(outs, axis=1) + + return (new_xyz, out) + + +def pointnet_fp_module(unknown, known, unknown_feats, known_feats, mlp, bn=True, bn_momentum=0.95, name=None): + """ + PointNet Feature Propagation Module + + Args: + unknown (Variable): unknown xyz coordiantes features with shape [B, N, 3] + known (Variable): known xyz coordiantes features with shape [B, M, 3] + unknown_feats (Variable): unknown features with shape [B, N, C1] to be propagated to + known_feats (Variable): known features with shape [B, M, C2] to be propagated from + mlp ([int32]): out_channels_list + bn (bool): whether perform batch norm after conv2d + + Returns: + new_features (Variable): new features with shape [B, N, mlp[-1]] + """ + if known is None: + raise NotImplementedError("Not implement known as None currently.") + else: + dist, idx = three_nn(unknown, known, eps=0.) 
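+        # three_nn finds the 3 nearest `known` neighbours of every `unknown`
+        # point; the lines below turn those neighbour distances into normalized
+        # inverse-distance weights that three_interp uses to propagate features.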
+ dist.stop_gradient = True + idx.stop_gradient = True + dist = fluid.layers.sqrt(dist) + ones = fluid.layers.fill_constant_batch_size_like(dist, dist.shape, dist.dtype, 1) + dist_recip = ones / (dist + 1e-8); # 1.0 / dist + norm = fluid.layers.reduce_sum(dist_recip, dim=-1, keep_dim=True) + weight = dist_recip / norm + weight.stop_gradient = True + interp_feats = three_interp(known_feats, weight, idx) + + new_features = interp_feats if unknown_feats is None else \ + fluid.layers.concat([interp_feats, unknown_feats], axis=-1) + new_features = fluid.layers.transpose(new_features, perm=[0, 2, 1]) + new_features = fluid.layers.unsqueeze(new_features, axes=[-1]) + new_features = MLP(new_features, mlp, bn=bn, bn_momentum=bn_momentum, name=name + '_mlp') + new_features = fluid.layers.squeeze(new_features, axes=[-1]) + new_features = fluid.layers.transpose(new_features, perm=[0, 2, 1]) + + return new_features + diff --git a/PaddleCV/Paddle3D/PointRCNN/models/pointnet2_msg.py b/PaddleCV/Paddle3D/PointRCNN/models/pointnet2_msg.py new file mode 100644 index 0000000000000000000000000000000000000000..b4d5f98c3b320663111cf9eceef4f2649f44007d --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/models/pointnet2_msg.py @@ -0,0 +1,78 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
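The set abstraction (`pointnet_sa_module`) and feature propagation (`pointnet_fp_module`) helpers above are the building blocks that `pointnet2_msg.py` below chains into the RPN backbone. The following is a minimal build-time sketch of one SA + FP round trip with the resulting shapes in comments; it assumes the custom ops under `ext_op` are compiled and importable (see the PointNet++ README), and the layer names and toy sizes are purely illustrative:

```
import paddle.fluid as fluid
from models.pointnet2_modules import pointnet_sa_module, pointnet_fp_module

# toy sizes: 1024 input points with 16 extra feature channels per point
xyz = fluid.layers.data(name='xyz', shape=[1024, 3], dtype='float32')
feat = fluid.layers.data(name='feat', shape=[16, 1024], dtype='float32')

# set abstraction: sample 256 centroids and group them at two radii (MSG)
new_xyz, new_feat = pointnet_sa_module(
    xyz, npoint=256, radiuss=[0.1, 0.2], nsamples=[16, 32],
    mlps=[[32, 32, 64], [32, 32, 64]], feature=feat, name='sa0')
# new_xyz: [B, 256, 3], new_feat: [B, 64 + 64, 256] (channels first)

# feature propagation: interpolate the 256 abstracted points back to all 1024
up_feat = pointnet_fp_module(
    unknown=xyz, known=new_xyz,
    unknown_feats=fluid.layers.transpose(feat, perm=[0, 2, 1]),
    known_feats=fluid.layers.transpose(new_feat, perm=[0, 2, 1]),
    mlp=[64, 64], name='fp0')
# up_feat: [B, 1024, 64]
```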
+""" +Contains PointNet++ SSG/MSG semantic segmentation models +""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np + +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr +from paddle.fluid.initializer import Constant +from models.pointnet2_modules import * + +__all__ = ["PointNet2MSG"] + + +class PointNet2MSG(object): + def __init__(self, cfg, xyz, feature=None, use_xyz=True): + self.cfg = cfg + self.xyz = xyz + self.feature = feature + self.use_xyz = use_xyz + self.model_config() + + def model_config(self): + self.SA_confs = [] + for i in range(self.cfg.RPN.SA_CONFIG.NPOINTS.__len__()): + self.SA_confs.append({ + "npoint": self.cfg.RPN.SA_CONFIG.NPOINTS[i], + "radiuss": self.cfg.RPN.SA_CONFIG.RADIUS[i], + "nsamples": self.cfg.RPN.SA_CONFIG.NSAMPLE[i], + "mlps": self.cfg.RPN.SA_CONFIG.MLPS[i], + }) + + self.FP_confs = [] + for i in range(self.cfg.RPN.FP_MLPS.__len__()): + self.FP_confs.append({"mlp": self.cfg.RPN.FP_MLPS[i]}) + + def build(self, bn_momentum=0.95): + xyzs, features = [self.xyz], [self.feature] + xyzi, featurei = self.xyz, self.feature + for i, SA_conf in enumerate(self.SA_confs): + xyzi, featurei = pointnet_sa_module( + xyz=xyzi, + feature=featurei, + bn_momentum=bn_momentum, + use_xyz=self.use_xyz, + name="sa_{}".format(i), + **SA_conf) + xyzs.append(xyzi) + features.append(fluid.layers.transpose(featurei, perm=[0, 2, 1])) + for i in range(-1, -(len(self.FP_confs) + 1), -1): + features[i - 1] = pointnet_fp_module( + unknown=xyzs[i - 1], + known=xyzs[i], + unknown_feats=features[i - 1], + known_feats=features[i], + bn_momentum=bn_momentum, + name="fp_{}".format(i + len(self.FP_confs)), + **self.FP_confs[i]) + + return xyzs[0], features[0] + diff --git a/PaddleCV/Paddle3D/PointRCNN/models/rcnn.py b/PaddleCV/Paddle3D/PointRCNN/models/rcnn.py new file mode 100644 index 0000000000000000000000000000000000000000..cb2f65332fbf14517abec0c257330fab1c834155 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/models/rcnn.py @@ -0,0 +1,303 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
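Both the RPN and the RCNN head that follows classify foreground points/ROIs with the `sigmoid_focal_loss` defined in `loss_utils.py` above. Below is a NumPy mirror of that formula (the function name `sigmoid_focal_loss_np` and the sample logits are illustrative, not part of the PR), handy for eyeballing how well-classified examples get down-weighted:

```
import numpy as np

def sigmoid_focal_loss_np(logits, labels, weights, gamma=2.0, alpha=0.25):
    """NumPy mirror of models.loss_utils.sigmoid_focal_loss for sanity checks."""
    prob = 1.0 / (1.0 + np.exp(-logits))
    # numerically stable element-wise sigmoid cross entropy, as computed by
    # fluid.layers.sigmoid_cross_entropy_with_logits
    sce = np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))
    p_t = labels * prob + (1.0 - labels) * (1.0 - prob)
    modulating = (1.0 - p_t) ** gamma
    alpha_weight = labels * alpha + (1.0 - labels) * (1.0 - alpha)
    return modulating * alpha_weight * sce * weights

# an easy positive (logit=4) contributes far less than a hard one (logit=-2)
logits = np.array([4.0, -2.0])
labels = np.array([1.0, 1.0])
print(sigmoid_focal_loss_np(logits, labels, np.ones(2)))
```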
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np +import sys + +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr +from paddle.fluid.initializer import Constant + +from models.pointnet2_modules import MLP, pointnet_sa_module, conv_bn +from models.loss_utils import sigmoid_focal_loss , get_reg_loss +from utils.proposal_target import get_proposal_target_func +from utils.cyops.kitti_utils import rotate_pc_along_y + +__all__ = ['RCNN'] + + +class RCNN(object): + def __init__(self, cfg, num_classes, batch_size, mode='TRAIN', use_xyz=True, input_channels=0): + self.cfg = cfg + self.use_xyz = use_xyz + self.num_classes = num_classes + self.input_channels = input_channels + self.inputs = None + self.training = mode == 'TRAIN' + self.batch_size = batch_size + + def create_tmp_var(self, name, dtype, shape): + return fluid.default_main_program().current_block().create_var( + name=name, dtype=dtype, shape=shape + ) + + def build_model(self, inputs): + self.inputs = inputs + if self.cfg.RCNN.ROI_SAMPLE_JIT: + if self.training: + proposal_target = get_proposal_target_func(self.cfg) + + tmp_list = [ + self.inputs['seg_mask'], + self.inputs['rpn_features'], + self.inputs['gt_boxes3d'], + self.inputs['rpn_xyz'], + self.inputs['pts_depth'], + self.inputs['roi_boxes3d'], + self.inputs['rpn_intensity'], + ] + out_name = ['reg_valid_mask' ,'sampled_pts' ,'roi_boxes3d', 'gt_of_rois', 'pts_feature' ,'cls_label','gt_iou'] + reg_valid_mask = self.create_tmp_var(name="reg_valid_mask",dtype='float32',shape=[-1,]) + sampled_pts = self.create_tmp_var(name="sampled_pts",dtype='float32',shape=[-1, self.cfg.RCNN.NUM_POINTS, 3]) + new_roi_boxes3d = self.create_tmp_var(name="new_roi_boxes3d",dtype='float32',shape=[-1, 7]) + gt_of_rois = self.create_tmp_var(name="gt_of_rois", dtype='float32', shape=[-1,7]) + pts_feature = self.create_tmp_var(name="pts_feature", dtype='float32',shape=[-1,512,130]) + cls_label = self.create_tmp_var(name="cls_label",dtype='int64',shape=[-1]) + gt_iou = self.create_tmp_var(name="gt_iou",dtype='float32',shape=[-1]) + + out_list = [reg_valid_mask, sampled_pts, new_roi_boxes3d, gt_of_rois, pts_feature, cls_label, gt_iou] + out = fluid.layers.py_func(func=proposal_target,x=tmp_list,out=out_list) + + self.target_dict = {} + for i,item in enumerate(out): + self.target_dict[out_name[i]] = item + + pts = fluid.layers.concat(input=[self.target_dict['sampled_pts'],self.target_dict['pts_feature']], axis=2) + self.debug = pts + self.target_dict['pts_input'] = pts + else: + rpn_xyz, rpn_features = inputs['rpn_xyz'], inputs['rpn_features'] + batch_rois = inputs['roi_boxes3d'] + rpn_intensity = inputs['rpn_intensity'] + rpn_intensity = fluid.layers.unsqueeze(rpn_intensity,axes=[2]) + seg_mask = fluid.layers.unsqueeze(inputs['seg_mask'],axes=[2]) + if self.cfg.RCNN.USE_INTENSITY: + pts_extra_input_list = [rpn_intensity, seg_mask] + else: + pts_extra_input_list = [seg_mask] + + if self.cfg.RCNN.USE_DEPTH: + pts_depth = inputs['pts_depth'] / 70.0 -0.5 + pts_depth = fluid.layers.unsqueeze(pts_depth,axes=[2]) + pts_extra_input_list.append(pts_depth) + pts_extra_input = fluid.layers.concat(pts_extra_input_list, axis=2) + pts_feature = fluid.layers.concat([pts_extra_input, rpn_features],axis=2) + + pooled_features, pooled_empty_flag = fluid.layers.roi_pool_3d(rpn_xyz,pts_feature,batch_rois, + self.cfg.RCNN.POOL_EXTRA_WIDTH, + sampled_pt_num=self.cfg.RCNN.NUM_POINTS) + # canonical transformation + 
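+                # Canonical transformation: shift each ROI's pooled points to the
+                # ROI center and rotate them by the ROI heading (rotate_pc_along_y
+                # below), so the refinement head always sees a proposal in its own
+                # local coordinate frame.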
batch_size = batch_rois.shape[0] + roi_center = batch_rois[:, :, 0:3] + tmp = pooled_features[:, :, :, 0:3] - fluid.layers.unsqueeze(roi_center,axes=[2]) + pooled_features = fluid.layers.concat(input=[tmp,pooled_features[:,:,:,3:]],axis=3) + concat_list = [] + for i in range(batch_size): + tmp = rotate_pc_along_y(pooled_features[i, :, :, 0:3], + batch_rois[i, :, 6]) + concat = fluid.layers.concat([tmp,pooled_features[i,:,:,3:]],axis=-1) + concat = fluid.layers.unsqueeze(concat,axes=[0]) + concat_list.append(concat) + pooled_features = fluid.layers.concat(concat_list,axis=0) + pts = fluid.layers.reshape(pooled_features,shape=[-1,pooled_features.shape[2],pooled_features.shape[3]]) + + else: + pts = inputs['pts_input'] + self.target_dict = {} + self.target_dict['pts_input'] = inputs['pts_input'] + self.target_dict['roi_boxes3d'] = inputs['roi_boxes3d'] + + if self.training: + self.target_dict['cls_label'] = inputs['cls_label'] + self.target_dict['reg_valid_mask'] = inputs['reg_valid_mask'] + self.target_dict['gt_of_rois'] = inputs['gt_boxes3d_ct'] + + xyz = pts[:,:,0:3] + feature = fluid.layers.transpose(pts[:,:,3:], [0,2,1]) if pts.shape[-1]>3 else None + if self.cfg.RCNN.USE_RPN_FEATURES: + self.rcnn_input_channel = 3 + int(self.cfg.RCNN.USE_INTENSITY) + \ + int(self.cfg.RCNN.USE_MASK) + int(self.cfg.RCNN.USE_DEPTH) + c_out = self.cfg.RCNN.XYZ_UP_LAYER[-1] + + xyz_input = pts[:,:,:self.rcnn_input_channel] + xyz_input = fluid.layers.transpose(xyz_input, [0,2,1]) + xyz_input = fluid.layers.unsqueeze(xyz_input, axes=[3]) + + rpn_feature = pts[:,:,self.rcnn_input_channel:] + rpn_feature = fluid.layers.transpose(rpn_feature, [0,2,1]) + rpn_feature = fluid.layers.unsqueeze(rpn_feature,axes=[3]) + + xyz_feature = MLP( + xyz_input, + out_channels_list=self.cfg.RCNN.XYZ_UP_LAYER, + bn=self.cfg.RCNN.USE_BN, + name="xyz_up_layer") + + merged_feature = fluid.layers.concat([xyz_feature, rpn_feature],axis=1) + merged_feature = MLP( + merged_feature, + out_channels_list=[c_out], + bn=self.cfg.RCNN.USE_BN, + name="xyz_down_layer") + + xyzs = [xyz] + features = [fluid.layers.squeeze(merged_feature,axes=[3])] + else: + xyzs = [xyz] + features = [feature] + + # forward + xyzi, featurei = xyzs[-1], features[-1] + for k in range(len(self.cfg.RCNN.SA_CONFIG.NPOINTS)): + mlps = self.cfg.RCNN.SA_CONFIG.MLPS[k] + npoint = self.cfg.RCNN.SA_CONFIG.NPOINTS[k] if self.cfg.RCNN.SA_CONFIG.NPOINTS[k] != -1 else None + + xyzi, featurei = pointnet_sa_module( + xyz=xyzi, + feature = featurei, + bn = self.cfg.RCNN.USE_BN, + use_xyz = self.use_xyz, + name = "sa_{}".format(k), + npoint = npoint, + mlps = [mlps], + radiuss = [self.cfg.RCNN.SA_CONFIG.RADIUS[k]], + nsamples = [self.cfg.RCNN.SA_CONFIG.NSAMPLE[k]] + ) + xyzs.append(xyzi) + features.append(featurei) + + head_in = features[-1] + head_in = fluid.layers.unsqueeze(head_in, axes=[2]) + + cls_out = head_in + reg_out = cls_out + + for i in range(0, self.cfg.RCNN.CLS_FC.__len__()): + cls_out = conv_bn(cls_out, self.cfg.RCNN.CLS_FC[i], bn=self.cfg.RCNN.USE_BN, name='rcnn_cls_{}'.format(i)) + if i == 0 and self.cfg.RCNN.DP_RATIO >= 0: + cls_out = fluid.layers.dropout(cls_out, self.cfg.RCNN.DP_RATIO, dropout_implementation="upscale_in_train") + cls_channel = 1 if self.num_classes == 2 else self.num_classes + cls_out = conv_bn(cls_out, cls_channel, act=None, name="cls_out", bn=self.cfg.RCNN.USE_BN) + self.cls_out = fluid.layers.squeeze(cls_out,axes=[1,3]) + + per_loc_bin_num = int(self.cfg.RCNN.LOC_SCOPE / self.cfg.RCNN.LOC_BIN_SIZE) * 2 + loc_y_bin_num = 
int(self.cfg.RCNN.LOC_Y_SCOPE / self.cfg.RCNN.LOC_Y_BIN_SIZE) * 2 + reg_channel = per_loc_bin_num * 4 + self.cfg.RCNN.NUM_HEAD_BIN * 2 + 3 + reg_channel += (1 if not self.cfg.RCNN.LOC_Y_BY_BIN else loc_y_bin_num * 2) + for i in range(0, self.cfg.RCNN.REG_FC.__len__()): + reg_out = conv_bn(reg_out, self.cfg.RCNN.REG_FC[i], bn=self.cfg.RCNN.USE_BN, name='rcnn_reg_{}'.format(i)) + if i == 0 and self.cfg.RCNN.DP_RATIO >= 0: + reg_out = fluid.layers.dropout(reg_out, self.cfg.RCNN.DP_RATIO, dropout_implementation="upscale_in_train") + + reg_out = conv_bn(reg_out, reg_channel, act=None, name="reg_out", bn=self.cfg.RCNN.USE_BN) + self.reg_out = fluid.layers.squeeze(reg_out, axes=[2,3]) + + + self.outputs = { + 'rcnn_cls':self.cls_out, + 'rcnn_reg':self.reg_out, + } + if self.training: + self.outputs.update(self.target_dict) + elif not self.training: + self.outputs['sample_id'] = inputs['sample_id'] + self.outputs['pts_input'] = inputs['pts_input'] + self.outputs['roi_boxes3d'] = inputs['roi_boxes3d'] + self.outputs['roi_scores'] = inputs['roi_scores'] + self.outputs['gt_iou'] = inputs['gt_iou'] + self.outputs['gt_boxes3d'] = inputs['gt_boxes3d'] + + if self.cls_out.shape[1] == 1: + raw_scores = fluid.layers.reshape(self.cls_out, shape=[-1]) + norm_scores = fluid.layers.sigmoid(raw_scores) + else: + norm_scores = fluid.layers.softmax(self.cls_out, axis=1) + self.outputs['norm_scores'] = norm_scores + + def get_outputs(self): + return self.outputs + + def get_loss(self): + assert self.inputs is not None, \ + "please call build() first" + rcnn_cls_label = self.outputs['cls_label'] + reg_valid_mask = self.outputs['reg_valid_mask'] + roi_boxes3d = self.outputs['roi_boxes3d'] + roi_size = roi_boxes3d[:, 3:6] + gt_boxes3d_ct = self.outputs['gt_of_rois'] + pts_input = self.outputs['pts_input'] + + rcnn_cls = self.cls_out + rcnn_reg = self.reg_out + + # RCNN classification loss + assert self.cfg.RCNN.LOSS_CLS in ["SigmoidFocalLoss", "BinaryCrossEntropy"], \ + "unsupported RCNN cls loss type {}".format(self.cfg.RCNN.LOSS_CLS) + + if self.cfg.RCNN.LOSS_CLS == "SigmoidFocalLoss": + cls_flat = fluid.layers.reshape(self.cls_out, shape=[-1]) + cls_label_flat = fluid.layers.reshape(rcnn_cls_label, shape=[-1]) + cls_label_flat = fluid.layers.cast(cls_label_flat, dtype=cls_flat.dtype) + cls_target = fluid.layers.cast(cls_label_flat>0, dtype=cls_flat.dtype) + cls_label_flat.stop_gradient = True + pos = fluid.layers.cast(cls_label_flat > 0, dtype=cls_flat.dtype) + pos.stop_gradient = True + pos_normalizer = fluid.layers.reduce_sum(pos) + cls_weights = fluid.layers.cast(cls_label_flat >= 0, dtype=cls_flat.dtype) + cls_weights = cls_weights / fluid.layers.clip(pos_normalizer, min=1.0, max=1e10) + cls_weights.stop_gradient = True + rcnn_loss_cls = sigmoid_focal_loss(cls_flat, cls_target, cls_weights) + rcnn_loss_cls = fluid.layers.reduce_sum(rcnn_loss_cls) + else: # BinaryCrossEntropy + cls_label = fluid.layers.reshape(rcnn_cls_label, shape=self.cls_out.shape) + cls_valid_mask = fluid.layers.cast(cls_label >= 0, dtype=self.cls_out.dtype) + cls_label = fluid.layers.cast(cls_label, dtype=self.cls_out.dtype) + cls_label.stop_gradient = True + rcnn_loss_cls = fluid.layers.sigmoid_cross_entropy_with_logits(self.cls_out, cls_label) + cls_mask_normalzer = fluid.layers.reduce_sum(cls_valid_mask) + rcnn_loss_cls = fluid.layers.reduce_sum(rcnn_loss_cls * cls_valid_mask) \ + / fluid.layers.clip(cls_mask_normalzer, min=1.0, max=1e10) + + # RCNN regression loss + reg_out = self.reg_out + fg_mask = fluid.layers.cast(reg_valid_mask > 
0, dtype=reg_out.dtype) + fg_mask = fluid.layers.unsqueeze(fg_mask, axes=[1]) + fg_mask.stop_gradient = True + gt_boxes3d_ct = fluid.layers.reshape(gt_boxes3d_ct, [-1,7]) + all_anchor_size = roi_size + anchor_size = all_anchor_size[fg_mask] if self.cfg.RCNN.SIZE_RES_ON_ROI else self.cfg.CLS_MEAN_SIZE[0] + + loc_loss, angle_loss, size_loss, loss_dict = get_reg_loss( + reg_out * fg_mask, + gt_boxes3d_ct, + fg_mask, + point_num=float(self.batch_size*64), + loc_scope=self.cfg.RCNN.LOC_SCOPE, + loc_bin_size=self.cfg.RCNN.LOC_BIN_SIZE, + num_head_bin=self.cfg.RCNN.NUM_HEAD_BIN, + anchor_size=anchor_size, + get_xz_fine=True, + get_y_by_bin=self.cfg.RCNN.LOC_Y_BY_BIN, + loc_y_scope=self.cfg.RCNN.LOC_Y_SCOPE, + loc_y_bin_size=self.cfg.RCNN.LOC_Y_BIN_SIZE, + get_ry_fine=True + ) + rcnn_loss_reg = loc_loss + angle_loss + size_loss * 3 + rcnn_loss = rcnn_loss_cls + rcnn_loss_reg + return rcnn_loss, rcnn_loss_cls, rcnn_loss_reg + diff --git a/PaddleCV/Paddle3D/PointRCNN/models/rpn.py b/PaddleCV/Paddle3D/PointRCNN/models/rpn.py new file mode 100644 index 0000000000000000000000000000000000000000..5432e0c2b2b5d6f3e6e492bd34d9f2a8ab14c49f --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/models/rpn.py @@ -0,0 +1,171 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
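`get_reg_loss` (used by the RCNN head above and by the RPN below) supervises localization with the bin-based encoding from the PointRCNN paper: each x/z offset is turned into a coarse bin classification plus a normalized residual regression. A NumPy sketch of that encoding, mirroring the code in `loss_utils.py`; the helper name `encode_offset` and the `loc_scope`/`loc_bin_size` values are illustrative (the real ones come from the config):

```
import numpy as np

def encode_offset(offset, loc_scope=3.0, loc_bin_size=0.5):
    """Bin + residual encoding used by get_reg_loss for the x/z offsets."""
    shift = np.clip(offset + loc_scope, 0.0, loc_scope * 2 - 1e-3)
    bin_idx = np.floor(shift / loc_bin_size).astype(np.int64)
    # residual is measured from the bin center and normalized by the bin size
    residual = shift - (bin_idx * loc_bin_size + loc_bin_size / 2.0)
    return bin_idx, residual / loc_bin_size

print(encode_offset(np.array([-2.8, 0.0, 1.3])))
# -> bins [0, 6, 8], normalized residuals ~ [-0.1, -0.5, 0.1]
```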
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np + +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr +from paddle.fluid.initializer import Normal, Constant + +from utils.proposal_utils import get_proposal_func +from models.pointnet2_msg import PointNet2MSG +from models.pointnet2_modules import conv_bn +from models.loss_utils import sigmoid_focal_loss, get_reg_loss + +__all__ = ["RPN"] + + +class RPN(object): + def __init__(self, cfg, batch_size, use_xyz=True, mode='TRAIN', prog=None): + self.cfg = cfg + self.batch_size = batch_size + self.use_xyz = use_xyz + self.mode = mode + self.is_train = mode == 'TRAIN' + self.inputs = None + self.prog = fluid.default_main_program() if prog is None else prog + + def build(self, inputs): + assert self.cfg.RPN.BACKBONE == 'pointnet2_msg', \ + "RPN backbone only support pointnet2_msg" + self.inputs = inputs + self.outputs = {} + + xyz = inputs["pts_input"] + assert not self.cfg.RPN.USE_INTENSITY, \ + "RPN.USE_INTENSITY not support now" + feature = None + msg = PointNet2MSG(self.cfg, xyz, feature, self.use_xyz) + backbone_xyz, backbone_feature = msg.build() + self.outputs['backbone_xyz'] = backbone_xyz + self.outputs['backbone_feature'] = backbone_feature + + backbone_feature = fluid.layers.transpose(backbone_feature, perm=[0, 2, 1]) + cls_out = fluid.layers.unsqueeze(backbone_feature, axes=[-1]) + reg_out = cls_out + + # classification branch + for i in range(self.cfg.RPN.CLS_FC.__len__()): + cls_out = conv_bn(cls_out, self.cfg.RPN.CLS_FC[i], bn=self.cfg.RPN.USE_BN, name='rpn_cls_{}'.format(i)) + if i == 0 and self.cfg.RPN.DP_RATIO > 0: + cls_out = fluid.layers.dropout(cls_out, self.cfg.RPN.DP_RATIO, dropout_implementation="upscale_in_train") + cls_out = fluid.layers.conv2d(cls_out, + num_filters=1, + filter_size=1, + stride=1, + padding=0, + dilation=1, + param_attr=ParamAttr(name='rpn_cls_out_conv_weight'), + bias_attr=ParamAttr(name='rpn_cls_out_conv_bias', + initializer=Constant(-np.log(99)))) + cls_out = fluid.layers.squeeze(cls_out, axes=[1, 3]) + self.outputs['rpn_cls'] = cls_out + + # regression branch + per_loc_bin_num = int(self.cfg.RPN.LOC_SCOPE / self.cfg.RPN.LOC_BIN_SIZE) * 2 + if self.cfg.RPN.LOC_XZ_FINE: + reg_channel = per_loc_bin_num * 4 + self.cfg.RPN.NUM_HEAD_BIN * 2 + 3 + else: + reg_channel = per_loc_bin_num * 2 + self.cfg.RPN.NUM_HEAD_BIN * 2 + 3 + reg_channel += 1 # reg y + + for i in range(self.cfg.RPN.REG_FC.__len__()): + reg_out = conv_bn(reg_out, self.cfg.RPN.REG_FC[i], bn=self.cfg.RPN.USE_BN, name='rpn_reg_{}'.format(i)) + if i == 0 and self.cfg.RPN.DP_RATIO > 0: + reg_out = fluid.layers.dropout(reg_out, self.cfg.RPN.DP_RATIO, dropout_implementation="upscale_in_train") + reg_out = fluid.layers.conv2d(reg_out, + num_filters=reg_channel, + filter_size=1, + stride=1, + padding=0, + dilation=1, + param_attr=ParamAttr(name='rpn_reg_out_conv_weight', + initializer=Normal(0., 0.001),), + bias_attr=ParamAttr(name='rpn_reg_out_conv_bias')) + reg_out = fluid.layers.squeeze(reg_out, axes=[3]) + reg_out = fluid.layers.transpose(reg_out, [0, 2, 1]) + self.outputs['rpn_reg'] = reg_out + + if self.mode != 'TRAIN' or self.cfg.RCNN.ENABLED: + rpn_scores_row = cls_out + rpn_scores_norm = fluid.layers.sigmoid(rpn_scores_row) + seg_mask = fluid.layers.cast(rpn_scores_norm > self.cfg.RPN.SCORE_THRESH, dtype='float32') + pts_depth = fluid.layers.sqrt(fluid.layers.reduce_sum(backbone_xyz * backbone_xyz, dim=2)) + proposal_func = 
get_proposal_func(self.cfg, self.mode) + proposal_input = fluid.layers.concat([fluid.layers.unsqueeze(rpn_scores_row, axes=[-1]), + backbone_xyz, reg_out], axis=-1) + proposal = self.prog.current_block().create_var(name='proposal', + shape=[-1, proposal_input.shape[1], 8], + dtype='float32') + fluid.layers.py_func(proposal_func, proposal_input, proposal) + rois, roi_scores_row = proposal[:, :, :7], proposal[:, :, -1] + self.outputs['rois'] = rois + self.outputs['roi_scores_row'] = roi_scores_row + self.outputs['seg_mask'] = seg_mask + self.outputs['pts_depth'] = pts_depth + + def get_outputs(self): + return self.outputs + + def get_loss(self): + assert self.inputs is not None, \ + "please call build() first" + rpn_cls_label = self.inputs['rpn_cls_label'] + rpn_reg_label = self.inputs['rpn_reg_label'] + rpn_cls = self.outputs['rpn_cls'] + rpn_reg = self.outputs['rpn_reg'] + + # RPN classification loss + assert self.cfg.RPN.LOSS_CLS == "SigmoidFocalLoss", \ + "unsupported RPN cls loss type {}".format(self.cfg.RPN.LOSS_CLS) + cls_flat = fluid.layers.reshape(rpn_cls, shape=[-1]) + cls_label_flat = fluid.layers.reshape(rpn_cls_label, shape=[-1]) + cls_label_pos = fluid.layers.cast(cls_label_flat > 0, dtype=cls_flat.dtype) + pos_normalizer = fluid.layers.reduce_sum(cls_label_pos) + cls_weights = fluid.layers.cast(cls_label_flat >= 0, dtype=cls_flat.dtype) + cls_weights = cls_weights / fluid.layers.clip(pos_normalizer, min=1.0, max=1e10) + cls_weights.stop_gradient = True + cls_label_flat = fluid.layers.cast(cls_label_flat, dtype=cls_flat.dtype) + cls_label_flat.stop_gradient = True + rpn_loss_cls = sigmoid_focal_loss(cls_flat, cls_label_pos, cls_weights) + rpn_loss_cls = fluid.layers.reduce_sum(rpn_loss_cls) + + # RPN regression loss + rpn_reg = fluid.layers.reshape(rpn_reg, [-1, rpn_reg.shape[-1]]) + reg_label = fluid.layers.reshape(rpn_reg_label, [-1, rpn_reg_label.shape[-1]]) + fg_mask = fluid.layers.cast(cls_label_flat > 0, dtype=rpn_reg.dtype) + fg_mask = fluid.layers.unsqueeze(fg_mask, axes=[1]) + fg_mask.stop_gradient = True + loc_loss, angle_loss, size_loss, loss_dict = get_reg_loss( + rpn_reg * fg_mask, + reg_label, + fg_mask, + float(self.batch_size * self.cfg.RPN.NUM_POINTS), + loc_scope=self.cfg.RPN.LOC_SCOPE, + loc_bin_size=self.cfg.RPN.LOC_BIN_SIZE, + num_head_bin=self.cfg.RPN.NUM_HEAD_BIN, + anchor_size=self.cfg.CLS_MEAN_SIZE[0], + get_xz_fine=self.cfg.RPN.LOC_XZ_FINE, + get_y_by_bin=False, + get_ry_fine=False) + rpn_loss_reg = loc_loss + angle_loss + size_loss * 3 + + self.rpn_loss = rpn_loss_cls * self.cfg.RPN.LOSS_WEIGHT[0] \ + + rpn_loss_reg * self.cfg.RPN.LOSS_WEIGHT[1] + return self.rpn_loss, rpn_loss_cls, rpn_loss_reg + diff --git a/PaddleCV/Paddle3D/PointRCNN/requirement.txt b/PaddleCV/Paddle3D/PointRCNN/requirement.txt new file mode 100644 index 0000000000000000000000000000000000000000..6ff347ab06c588b507fd6b5f1442e2375afb032a --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/requirement.txt @@ -0,0 +1,6 @@ +Cython +opencv-python +shapely +scikit-image +Numba +fire diff --git a/PaddleCV/Paddle3D/PointRCNN/tools/generate_aug_scene.py b/PaddleCV/Paddle3D/PointRCNN/tools/generate_aug_scene.py new file mode 100644 index 0000000000000000000000000000000000000000..59cfa4abc0629c71d150f750e8f32400c6c361b9 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/tools/generate_aug_scene.py @@ -0,0 +1,330 @@ +""" +Generate GT database +This code is based on https://github.com/sshaoshuai/PointRCNN/blob/master/tools/generate_aug_scene.py +""" + +import os +import numpy as np +import pickle + 
+import pts_utils +import utils.cyops.kitti_utils as kitti_utils +from utils.box_utils import boxes_iou3d +from utils import calibration as calib +from data.kitti_dataset import KittiDataset +import argparse + +np.random.seed(1024) + +parser = argparse.ArgumentParser() +parser.add_argument('--mode', type=str, default='generator') +parser.add_argument('--class_name', type=str, default='Car') +parser.add_argument('--data_dir', type=str, default='./data') +parser.add_argument('--save_dir', type=str, default='./data/KITTI/aug_scene/training') +parser.add_argument('--split', type=str, default='train') +parser.add_argument('--gt_database_dir', type=str, default='./data/gt_database/train_gt_database_3level_Car.pkl') +parser.add_argument('--include_similar', action='store_true', default=False) +parser.add_argument('--aug_times', type=int, default=4) +args = parser.parse_args() + +PC_REDUCE_BY_RANGE = True +if args.class_name == 'Car': + PC_AREA_SCOPE = np.array([[-40, 40], [-1, 3], [0, 70.4]]) # x, y, z scope in rect camera coords +else: + PC_AREA_SCOPE = np.array([[-30, 30], [-1, 3], [0, 50]]) + + +def log_print(info, fp=None): + print(info) + if fp is not None: + # print(info, file=fp) + fp.write(info+"\n") + + +def save_kitti_format(calib, bbox3d, obj_list, img_shape, save_fp): + corners3d = kitti_utils.boxes3d_to_corners3d(bbox3d) + img_boxes, _ = calib.corners3d_to_img_boxes(corners3d) + + img_boxes[:, 0] = np.clip(img_boxes[:, 0], 0, img_shape[1] - 1) + img_boxes[:, 1] = np.clip(img_boxes[:, 1], 0, img_shape[0] - 1) + img_boxes[:, 2] = np.clip(img_boxes[:, 2], 0, img_shape[1] - 1) + img_boxes[:, 3] = np.clip(img_boxes[:, 3], 0, img_shape[0] - 1) + + # Discard boxes that are larger than 80% of the image width OR height + img_boxes_w = img_boxes[:, 2] - img_boxes[:, 0] + img_boxes_h = img_boxes[:, 3] - img_boxes[:, 1] + box_valid_mask = np.logical_and(img_boxes_w < img_shape[1] * 0.8, img_boxes_h < img_shape[0] * 0.8) + + for k in range(bbox3d.shape[0]): + if box_valid_mask[k] == 0: + continue + x, z, ry = bbox3d[k, 0], bbox3d[k, 2], bbox3d[k, 6] + beta = np.arctan2(z, x) + alpha = -np.sign(beta) * np.pi / 2 + beta + ry + + save_fp.write('%s %.2f %d %.4f %.4f %.4f %.4f %.4f %.4f %.4f %.4f %.4f %.4f %.4f %.4f\n' % + (args.class_name, obj_list[k].trucation, int(obj_list[k].occlusion), alpha, img_boxes[k, 0], img_boxes[k, 1], + img_boxes[k, 2], img_boxes[k, 3], + bbox3d[k, 3], bbox3d[k, 4], bbox3d[k, 5], bbox3d[k, 0], bbox3d[k, 1], bbox3d[k, 2], + bbox3d[k, 6])) + + +class AugSceneGenerator(KittiDataset): + def __init__(self, root_dir, gt_database=None, split='train', classes=args.class_name): + super(AugSceneGenerator, self).__init__(root_dir, split=split) + self.gt_database = None + if classes == 'Car': + self.classes = ('Background', 'Car') + elif classes == 'People': + self.classes = ('Background', 'Pedestrian', 'Cyclist') + elif classes == 'Pedestrian': + self.classes = ('Background', 'Pedestrian') + elif classes == 'Cyclist': + self.classes = ('Background', 'Cyclist') + else: + assert False, "Invalid classes: %s" % classes + + self.gt_database = gt_database + + def __len__(self): + raise NotImplementedError + + def __getitem__(self, item): + raise NotImplementedError + + def filtrate_dc_objects(self, obj_list): + valid_obj_list = [] + for obj in obj_list: + if obj.cls_type in ['DontCare']: + continue + valid_obj_list.append(obj) + + return valid_obj_list + + def filtrate_objects(self, obj_list): + valid_obj_list = [] + type_whitelist = self.classes + if args.include_similar: + 
type_whitelist = list(self.classes) + if 'Car' in self.classes: + type_whitelist.append('Van') + if 'Pedestrian' in self.classes or 'Cyclist' in self.classes: + type_whitelist.append('Person_sitting') + + for obj in obj_list: + if obj.cls_type in type_whitelist: + valid_obj_list.append(obj) + return valid_obj_list + + @staticmethod + def get_valid_flag(pts_rect, pts_img, pts_rect_depth, img_shape): + """ + Valid point should be in the image (and in the PC_AREA_SCOPE) + :param pts_rect: + :param pts_img: + :param pts_rect_depth: + :param img_shape: + :return: + """ + val_flag_1 = np.logical_and(pts_img[:, 0] >= 0, pts_img[:, 0] < img_shape[1]) + val_flag_2 = np.logical_and(pts_img[:, 1] >= 0, pts_img[:, 1] < img_shape[0]) + val_flag_merge = np.logical_and(val_flag_1, val_flag_2) + pts_valid_flag = np.logical_and(val_flag_merge, pts_rect_depth >= 0) + + if PC_REDUCE_BY_RANGE: + x_range, y_range, z_range = PC_AREA_SCOPE + pts_x, pts_y, pts_z = pts_rect[:, 0], pts_rect[:, 1], pts_rect[:, 2] + range_flag = (pts_x >= x_range[0]) & (pts_x <= x_range[1]) \ + & (pts_y >= y_range[0]) & (pts_y <= y_range[1]) \ + & (pts_z >= z_range[0]) & (pts_z <= z_range[1]) + pts_valid_flag = pts_valid_flag & range_flag + return pts_valid_flag + + @staticmethod + def check_pc_range(xyz): + """ + :param xyz: [x, y, z] + :return: + """ + x_range, y_range, z_range = PC_AREA_SCOPE + if (x_range[0] <= xyz[0] <= x_range[1]) and (y_range[0] <= xyz[1] <= y_range[1]) and \ + (z_range[0] <= xyz[2] <= z_range[1]): + return True + return False + + def aug_one_scene(self, sample_id, pts_rect, pts_intensity, all_gt_boxes3d): + """ + :param pts_rect: (N, 3) + :param gt_boxes3d: (M1, 7) + :param all_gt_boxex3d: (M2, 7) + :return: + """ + assert self.gt_database is not None + extra_gt_num = np.random.randint(10, 15) + try_times = 50 + cnt = 0 + cur_gt_boxes3d = all_gt_boxes3d.copy() + cur_gt_boxes3d[:, 4] += 0.5 + cur_gt_boxes3d[:, 5] += 0.5 # enlarge new added box to avoid too nearby boxes + + extra_gt_obj_list = [] + extra_gt_boxes3d_list = [] + new_pts_list, new_pts_intensity_list = [], [] + src_pts_flag = np.ones(pts_rect.shape[0], dtype=np.int32) + + road_plane = self.get_road_plane(sample_id) + a, b, c, d = road_plane + + while try_times > 0: + try_times -= 1 + + rand_idx = np.random.randint(0, self.gt_database.__len__() - 1) + + new_gt_dict = self.gt_database[rand_idx] + new_gt_box3d = new_gt_dict['gt_box3d'].copy() + new_gt_points = new_gt_dict['points'].copy() + new_gt_intensity = new_gt_dict['intensity'].copy() + new_gt_obj = new_gt_dict['obj'] + center = new_gt_box3d[0:3] + if PC_REDUCE_BY_RANGE and (self.check_pc_range(center) is False): + continue + if cnt > extra_gt_num: + break + if new_gt_points.__len__() < 5: # too few points + continue + + # put it on the road plane + cur_height = (-d - a * center[0] - c * center[2]) / b + move_height = new_gt_box3d[1] - cur_height + new_gt_box3d[1] -= move_height + new_gt_points[:, 1] -= move_height + + cnt += 1 + + iou3d = boxes_iou3d(new_gt_box3d.reshape(1, 7), cur_gt_boxes3d) + + valid_flag = iou3d.max() < 1e-8 + if not valid_flag: + continue + + enlarged_box3d = new_gt_box3d.copy() + enlarged_box3d[3] += 2 # remove the points above and below the object + boxes_pts_mask_list = pts_utils.pts_in_boxes3d(pts_rect, enlarged_box3d.reshape(1, 7)) + pt_mask_flag = (boxes_pts_mask_list[0] == 1) + src_pts_flag[pt_mask_flag] = 0 # remove the original points which are inside the new box + + new_pts_list.append(new_gt_points) + new_pts_intensity_list.append(new_gt_intensity) + 
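+                # Note on the road-plane placement earlier in this iteration:
+                # KITTI rect camera coordinates have y pointing downward and the
+                # ground plane is given as a*x + b*y + c*z + d = 0, so the plane
+                # height at the box center (x, z) is y = (-d - a*x - c*z) / b,
+                # which is what cur_height computes before the sampled box and
+                # its points are shifted onto the road.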
enlarged_box3d = new_gt_box3d.copy() + enlarged_box3d[4] += 0.5 + enlarged_box3d[5] += 0.5 # enlarge new added box to avoid too nearby boxes + cur_gt_boxes3d = np.concatenate((cur_gt_boxes3d, enlarged_box3d.reshape(1, 7)), axis=0) + extra_gt_boxes3d_list.append(new_gt_box3d.reshape(1, 7)) + extra_gt_obj_list.append(new_gt_obj) + + if new_pts_list.__len__() == 0: + return False, pts_rect, pts_intensity, None, None + + extra_gt_boxes3d = np.concatenate(extra_gt_boxes3d_list, axis=0) + # remove original points and add new points + pts_rect = pts_rect[src_pts_flag == 1] + pts_intensity = pts_intensity[src_pts_flag == 1] + new_pts_rect = np.concatenate(new_pts_list, axis=0) + new_pts_intensity = np.concatenate(new_pts_intensity_list, axis=0) + pts_rect = np.concatenate((pts_rect, new_pts_rect), axis=0) + pts_intensity = np.concatenate((pts_intensity, new_pts_intensity), axis=0) + + return True, pts_rect, pts_intensity, extra_gt_boxes3d, extra_gt_obj_list + + def aug_one_epoch_scene(self, base_id, data_save_dir, label_save_dir, split_list, log_fp=None): + for idx, sample_id in enumerate(self.image_idx_list): + sample_id = int(sample_id) + print('process gt sample (%s, id=%06d)' % (args.split, sample_id)) + + pts_lidar = self.get_lidar(sample_id) + calib = self.get_calib(sample_id) + pts_rect = calib.lidar_to_rect(pts_lidar[:, 0:3]) + pts_img, pts_rect_depth = calib.rect_to_img(pts_rect) + img_shape = self.get_image_shape(sample_id) + + pts_valid_flag = self.get_valid_flag(pts_rect, pts_img, pts_rect_depth, img_shape) + pts_rect = pts_rect[pts_valid_flag][:, 0:3] + pts_intensity = pts_lidar[pts_valid_flag][:, 3] + + # all labels for checking overlapping + all_obj_list = self.filtrate_dc_objects(self.get_label(sample_id)) + all_gt_boxes3d = np.zeros((all_obj_list.__len__(), 7), dtype=np.float32) + for k, obj in enumerate(all_obj_list): + all_gt_boxes3d[k, 0:3], all_gt_boxes3d[k, 3], all_gt_boxes3d[k, 4], all_gt_boxes3d[k, 5], \ + all_gt_boxes3d[k, 6] = obj.pos, obj.h, obj.w, obj.l, obj.ry + + # gt_boxes3d of current label + obj_list = self.filtrate_objects(self.get_label(sample_id)) + if args.class_name != 'Car' and obj_list.__len__() == 0: + continue + + # augment one scene + aug_flag, pts_rect, pts_intensity, extra_gt_boxes3d, extra_gt_obj_list = \ + self.aug_one_scene(sample_id, pts_rect, pts_intensity, all_gt_boxes3d) + + # save augment result to file + pts_info = np.concatenate((pts_rect, pts_intensity.reshape(-1, 1)), axis=1) + bin_file = os.path.join(data_save_dir, '%06d.bin' % (base_id + sample_id)) + pts_info.astype(np.float32).tofile(bin_file) + + # save filtered original gt_boxes3d + label_save_file = os.path.join(label_save_dir, '%06d.txt' % (base_id + sample_id)) + with open(label_save_file, 'w') as f: + for obj in obj_list: + f.write(obj.to_kitti_format() + '\n') + + if aug_flag: + # augment successfully + save_kitti_format(calib, extra_gt_boxes3d, extra_gt_obj_list, img_shape=img_shape, save_fp=f) + else: + extra_gt_boxes3d = np.zeros((0, 7), dtype=np.float32) + log_print('Save to file (new_obj: %s): %s' % (extra_gt_boxes3d.__len__(), label_save_file), fp=log_fp) + split_list.append('%06d' % (base_id + sample_id)) + + def generate_aug_scene(self, aug_times, log_fp=None): + data_save_dir = os.path.join(args.save_dir, 'rectified_data') + label_save_dir = os.path.join(args.save_dir, 'aug_label') + if not os.path.isdir(data_save_dir): + os.makedirs(data_save_dir) + if not os.path.isdir(label_save_dir): + os.makedirs(label_save_dir) + + split_file = os.path.join(args.save_dir, 
'%s_aug.txt' % args.split) + split_list = self.image_idx_list[:] + for epoch in range(aug_times): + base_id = (epoch + 1) * 10000 + self.aug_one_epoch_scene(base_id, data_save_dir, label_save_dir, split_list, log_fp=log_fp) + + with open(split_file, 'w') as f: + for idx, sample_id in enumerate(split_list): + f.write(str(sample_id) + '\n') + log_print('Save split file to %s' % split_file, fp=log_fp) + target_dir = os.path.join(args.data_dir, 'KITTI/ImageSets/') + os.system('cp %s %s' % (split_file, target_dir)) + log_print('Copy split file from %s to %s' % (split_file, target_dir), fp=log_fp) + + +if __name__ == '__main__': + if not os.path.isdir(args.save_dir): + os.makedirs(args.save_dir) + info_file = os.path.join(args.save_dir, 'log_info.txt') + + if args.mode == 'generator': + log_fp = open(info_file, 'w') + + gt_database = pickle.load(open(args.gt_database_dir, 'rb')) + log_print('Loading gt_database(%d) from %s' % (gt_database.__len__(), args.gt_database_dir), fp=log_fp) + + dataset = AugSceneGenerator(root_dir=args.data_dir, gt_database=gt_database, split=args.split) + dataset.generate_aug_scene(aug_times=args.aug_times, log_fp=log_fp) + + log_fp.close() + + else: + pass + diff --git a/PaddleCV/Paddle3D/PointRCNN/tools/generate_gt_database.py b/PaddleCV/Paddle3D/PointRCNN/tools/generate_gt_database.py new file mode 100644 index 0000000000000000000000000000000000000000..43290db734c9734fef8120031cab44a394f4323b --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/tools/generate_gt_database.py @@ -0,0 +1,104 @@ +""" +Generate GT database +This code is based on https://github.com/sshaoshuai/PointRCNN/blob/master/tools/generate_gt_database.py +""" + +import os +import numpy as np +import pickle + +from data.kitti_dataset import KittiDataset +import pts_utils +import argparse + +parser = argparse.ArgumentParser() +parser.add_argument('--data_dir', type=str, default='./data') +parser.add_argument('--save_dir', type=str, default='./data/gt_database') +parser.add_argument('--class_name', type=str, default='Car') +parser.add_argument('--split', type=str, default='train') +args = parser.parse_args() + + +class GTDatabaseGenerator(KittiDataset): + def __init__(self, root_dir, split='train', classes=args.class_name): + super(GTDatabaseGenerator, self).__init__(root_dir, split=split) + self.gt_database = None + if classes == 'Car': + self.classes = ('Background', 'Car') + elif classes == 'People': + self.classes = ('Background', 'Pedestrian', 'Cyclist') + elif classes == 'Pedestrian': + self.classes = ('Background', 'Pedestrian') + elif classes == 'Cyclist': + self.classes = ('Background', 'Cyclist') + else: + assert False, "Invalid classes: %s" % classes + + def __len__(self): + raise NotImplementedError + + def __getitem__(self, item): + raise NotImplementedError + + def filtrate_objects(self, obj_list): + valid_obj_list = [] + for obj in obj_list: + if obj.cls_type not in self.classes: + continue + if obj.level_str not in ['Easy', 'Moderate', 'Hard']: + continue + valid_obj_list.append(obj) + + return valid_obj_list + + def generate_gt_database(self): + gt_database = [] + for idx, sample_id in enumerate(self.image_idx_list): + sample_id = int(sample_id) + print('process gt sample (id=%06d)' % sample_id) + + pts_lidar = self.get_lidar(sample_id) + calib = self.get_calib(sample_id) + pts_rect = calib.lidar_to_rect(pts_lidar[:, 0:3]) + pts_intensity = pts_lidar[:, 3] + + obj_list = self.filtrate_objects(self.get_label(sample_id)) + + gt_boxes3d = np.zeros((obj_list.__len__(), 7), dtype=np.float32) + 
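+            # Each ground-truth box is encoded as 7 values in rect camera
+            # coordinates: the (x, y, z) location followed by height h, width w,
+            # length l and the rotation ry, copied from the label object below.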
for k, obj in enumerate(obj_list): + gt_boxes3d[k, 0:3], gt_boxes3d[k, 3], gt_boxes3d[k, 4], gt_boxes3d[k, 5], gt_boxes3d[k, 6] \ + = obj.pos, obj.h, obj.w, obj.l, obj.ry + + if gt_boxes3d.__len__() == 0: + print('No gt object') + continue + + boxes_pts_mask_list = pts_utils.pts_in_boxes3d(pts_rect, gt_boxes3d) + + for k in range(boxes_pts_mask_list.shape[0]): + pt_mask_flag = (boxes_pts_mask_list[k] == 1) + cur_pts = pts_rect[pt_mask_flag].astype(np.float32) + cur_pts_intensity = pts_intensity[pt_mask_flag].astype(np.float32) + sample_dict = {'sample_id': sample_id, + 'cls_type': obj_list[k].cls_type, + 'gt_box3d': gt_boxes3d[k], + 'points': cur_pts, + 'intensity': cur_pts_intensity, + 'obj': obj_list[k]} + gt_database.append(sample_dict) + + save_file_name = os.path.join(args.save_dir, '%s_gt_database_3level_%s.pkl' % (args.split, self.classes[-1])) + with open(save_file_name, 'wb') as f: + pickle.dump(gt_database, f) + + self.gt_database = gt_database + print('Save refine training sample info file to %s' % save_file_name) + + +if __name__ == '__main__': + dataset = GTDatabaseGenerator(root_dir=args.data_dir, split=args.split) + if not os.path.isdir(args.save_dir): + os.makedirs(args.save_dir) + + dataset.generate_gt_database() + diff --git a/PaddleCV/Paddle3D/PointRCNN/tools/kitti_eval.py b/PaddleCV/Paddle3D/PointRCNN/tools/kitti_eval.py new file mode 100644 index 0000000000000000000000000000000000000000..6d16ef487301fb7ba45b71c64cd3af337cef13c5 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/tools/kitti_eval.py @@ -0,0 +1,71 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
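+
+# Thin command-line wrapper around tools/kitti_object_eval_python that computes
+# KITTI mAP for a directory of detection results. A typical invocation
+# (arguments follow the defaults defined below and are only illustrative):
+#
+#   python tools/kitti_eval.py --result_dir ./result_dir --data_dir ./data \
+#       --split val --class_name Car
+#
+# Detection txt files are read from <result_dir>/final_result/data and ground
+# truth labels from <data_dir>/KITTI/object/training/label_2.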
+ +import os +import sys +import argparse + + +def parse_args(): + parser = argparse.ArgumentParser( + "KITTI mAP evaluation script") + parser.add_argument( + '--result_dir', + type=str, + default='./result_dir', + help='detection result directory to evaluate') + parser.add_argument( + '--data_dir', + type=str, + default='./data', + help='KITTI dataset root directory') + parser.add_argument( + '--split', + type=str, + default='val', + help='evaluation split, default val') + parser.add_argument( + '--class_name', + type=str, + default='Car', + help='evaluation class name, default Car') + args = parser.parse_args() + return args + + +def kitti_eval(): + if float(sys.version[:3]) < 3.6: + print("KITTI mAP evaluation can only run with python3.6+") + sys.exit(1) + + args = parse_args() + + label_dir = os.path.join(args.data_dir, 'KITTI/object/training', 'label_2') + split_file = os.path.join(args.data_dir, 'KITTI/ImageSets', + '{}.txt'.format(args.split)) + final_output_dir = os.path.join(args.result_dir, 'final_result', 'data') + name_to_class = {'Car': 0, 'Pedestrian': 1, 'Cyclist': 2} + + from tools.kitti_object_eval_python.evaluate import evaluate as kitti_evaluate + ap_result_str, ap_dict = kitti_evaluate( + label_dir, final_output_dir, label_split_file=split_file, + current_class=name_to_class[args.class_name]) + + print("KITTI evaluate: ", ap_result_str, ap_dict) + + +if __name__ == "__main__": + kitti_eval() + + diff --git a/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/LICENSE b/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/LICENSE new file mode 100644 index 0000000000000000000000000000000000000000..ab602974d200aa6849e6ad8220951ef9a78d9f08 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/LICENSE @@ -0,0 +1,21 @@ +MIT License + +Copyright (c) 2018 + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. diff --git a/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/README.md b/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/README.md new file mode 100644 index 0000000000000000000000000000000000000000..0e0e0c307c2db3f0486e594deae1c04ac49f55f3 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/README.md @@ -0,0 +1,32 @@ +# kitti-object-eval-python +**NOTE**: This is borrowed from [traveller59/kitti-object-eval-python](https://github.com/traveller59/kitti-object-eval-python) + +Fast kitti object detection eval in python(finish eval in less than 10 second), support 2d/bev/3d/aos. , support coco-style AP. 
If you use command line interface, numba need some time to compile jit functions. +## Dependencies +Only support python 3.6+, need `numpy`, `skimage`, `numba`, `fire`. If you have Anaconda, just install `cudatoolkit` in anaconda. Otherwise, please reference to this [page](https://github.com/numba/numba#custom-python-environments) to set up llvm and cuda for numba. +* Install by conda: +``` +conda install -c numba cudatoolkit=x.x (8.0, 9.0, 9.1, depend on your environment) +``` +## Usage +* commandline interface: +``` +python evaluate.py evaluate --label_path=/path/to/your_gt_label_folder --result_path=/path/to/your_result_folder --label_split_file=/path/to/val.txt --current_class=0 --coco=False +``` +* python interface: +```Python +import kitti_common as kitti +from eval import get_official_eval_result, get_coco_eval_result +def _read_imageset_file(path): + with open(path, 'r') as f: + lines = f.readlines() + return [int(line) for line in lines] +det_path = "/path/to/your_result_folder" +dt_annos = kitti.get_label_annos(det_path) +gt_path = "/path/to/your_gt_label_folder" +gt_split_file = "/path/to/val.txt" # from https://xiaozhichen.github.io/files/mv3d/imagesets.tar.gz +val_image_ids = _read_imageset_file(gt_split_file) +gt_annos = kitti.get_label_annos(gt_path, val_image_ids) +print(get_official_eval_result(gt_annos, dt_annos, 0)) # 6s in my computer +print(get_coco_eval_result(gt_annos, dt_annos, 0)) # 18s in my computer +``` diff --git a/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/eval.py b/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..38101ca69a59cdc0603ebc82cac0338432457550 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/eval.py @@ -0,0 +1,740 @@ +import numpy as np +import numba +import io as sysio +from tools.kitti_object_eval_python.rotate_iou import rotate_iou_gpu_eval + + +@numba.jit +def get_thresholds(scores: np.ndarray, num_gt, num_sample_pts=41): + scores.sort() + scores = scores[::-1] + current_recall = 0 + thresholds = [] + for i, score in enumerate(scores): + l_recall = (i + 1) / num_gt + if i < (len(scores) - 1): + r_recall = (i + 2) / num_gt + else: + r_recall = l_recall + if (((r_recall - current_recall) < (current_recall - l_recall)) + and (i < (len(scores) - 1))): + continue + # recall = l_recall + thresholds.append(score) + current_recall += 1 / (num_sample_pts - 1.0) + return thresholds + + +def clean_data(gt_anno, dt_anno, current_class, difficulty): + CLASS_NAMES = ['car', 'pedestrian', 'cyclist'] + MIN_HEIGHT = [40, 25, 25] + MAX_OCCLUSION = [0, 1, 2] + MAX_TRUNCATION = [0.15, 0.3, 0.5] + dc_bboxes, ignored_gt, ignored_dt = [], [], [] + current_cls_name = CLASS_NAMES[current_class].lower() + num_gt = len(gt_anno["name"]) + num_dt = len(dt_anno["name"]) + num_valid_gt = 0 + for i in range(num_gt): + bbox = gt_anno["bbox"][i] + gt_name = gt_anno["name"][i].lower() + height = bbox[3] - bbox[1] + valid_class = -1 + if (gt_name == current_cls_name): + valid_class = 1 + elif (current_cls_name == "Pedestrian".lower() + and "Person_sitting".lower() == gt_name): + valid_class = 0 + elif (current_cls_name == "Car".lower() and "Van".lower() == gt_name): + valid_class = 0 + else: + valid_class = -1 + ignore = False + if ((gt_anno["occluded"][i] > MAX_OCCLUSION[difficulty]) + or (gt_anno["truncated"][i] > MAX_TRUNCATION[difficulty]) + or (height <= MIN_HEIGHT[difficulty])): + # if gt_anno["difficulty"][i] > difficulty or 
gt_anno["difficulty"][i] == -1: + ignore = True + if valid_class == 1 and not ignore: + ignored_gt.append(0) + num_valid_gt += 1 + elif (valid_class == 0 or (ignore and (valid_class == 1))): + ignored_gt.append(1) + else: + ignored_gt.append(-1) + # for i in range(num_gt): + if gt_anno["name"][i] == "DontCare": + dc_bboxes.append(gt_anno["bbox"][i]) + for i in range(num_dt): + if (dt_anno["name"][i].lower() == current_cls_name): + valid_class = 1 + else: + valid_class = -1 + height = abs(dt_anno["bbox"][i, 3] - dt_anno["bbox"][i, 1]) + if height < MIN_HEIGHT[difficulty]: + ignored_dt.append(1) + elif valid_class == 1: + ignored_dt.append(0) + else: + ignored_dt.append(-1) + + return num_valid_gt, ignored_gt, ignored_dt, dc_bboxes + + +@numba.jit(nopython=True) +def image_box_overlap(boxes, query_boxes, criterion=-1): + N = boxes.shape[0] + K = query_boxes.shape[0] + overlaps = np.zeros((N, K), dtype=boxes.dtype) + for k in range(K): + qbox_area = ((query_boxes[k, 2] - query_boxes[k, 0]) * + (query_boxes[k, 3] - query_boxes[k, 1])) + for n in range(N): + iw = (min(boxes[n, 2], query_boxes[k, 2]) - + max(boxes[n, 0], query_boxes[k, 0])) + if iw > 0: + ih = (min(boxes[n, 3], query_boxes[k, 3]) - + max(boxes[n, 1], query_boxes[k, 1])) + if ih > 0: + if criterion == -1: + ua = ( + (boxes[n, 2] - boxes[n, 0]) * + (boxes[n, 3] - boxes[n, 1]) + qbox_area - iw * ih) + elif criterion == 0: + ua = ((boxes[n, 2] - boxes[n, 0]) * + (boxes[n, 3] - boxes[n, 1])) + elif criterion == 1: + ua = qbox_area + else: + ua = 1.0 + overlaps[n, k] = iw * ih / ua + return overlaps + + +def bev_box_overlap(boxes, qboxes, criterion=-1): + riou = rotate_iou_gpu_eval(boxes, qboxes, criterion) + return riou + + +@numba.jit(nopython=True, parallel=True) +def d3_box_overlap_kernel(boxes, qboxes, rinc, criterion=-1): + # ONLY support overlap in CAMERA, not lider. 
+ N, K = boxes.shape[0], qboxes.shape[0] + for i in range(N): + for j in range(K): + if rinc[i, j] > 0: + # iw = (min(boxes[i, 1] + boxes[i, 4], qboxes[j, 1] + + # qboxes[j, 4]) - max(boxes[i, 1], qboxes[j, 1])) + iw = (min(boxes[i, 1], qboxes[j, 1]) - max( + boxes[i, 1] - boxes[i, 4], qboxes[j, 1] - qboxes[j, 4])) + + if iw > 0: + area1 = boxes[i, 3] * boxes[i, 4] * boxes[i, 5] + area2 = qboxes[j, 3] * qboxes[j, 4] * qboxes[j, 5] + inc = iw * rinc[i, j] + if criterion == -1: + ua = (area1 + area2 - inc) + elif criterion == 0: + ua = area1 + elif criterion == 1: + ua = area2 + else: + ua = inc + rinc[i, j] = inc / ua + else: + rinc[i, j] = 0.0 + + +def d3_box_overlap(boxes, qboxes, criterion=-1): + rinc = rotate_iou_gpu_eval(boxes[:, [0, 2, 3, 5, 6]], + qboxes[:, [0, 2, 3, 5, 6]], 2) + d3_box_overlap_kernel(boxes, qboxes, rinc, criterion) + return rinc + + +@numba.jit(nopython=True) +def compute_statistics_jit(overlaps, + gt_datas, + dt_datas, + ignored_gt, + ignored_det, + dc_bboxes, + metric, + min_overlap, + thresh=0, + compute_fp=False, + compute_aos=False): + + det_size = dt_datas.shape[0] + gt_size = gt_datas.shape[0] + dt_scores = dt_datas[:, -1] + dt_alphas = dt_datas[:, 4] + gt_alphas = gt_datas[:, 4] + dt_bboxes = dt_datas[:, :4] + gt_bboxes = gt_datas[:, :4] + + assigned_detection = [False] * det_size + ignored_threshold = [False] * det_size + if compute_fp: + for i in range(det_size): + if (dt_scores[i] < thresh): + ignored_threshold[i] = True + NO_DETECTION = -10000000 + tp, fp, fn, similarity = 0, 0, 0, 0 + # thresholds = [0.0] + # delta = [0.0] + thresholds = np.zeros((gt_size, )) + thresh_idx = 0 + delta = np.zeros((gt_size, )) + delta_idx = 0 + for i in range(gt_size): + if ignored_gt[i] == -1: + continue + det_idx = -1 + valid_detection = NO_DETECTION + max_overlap = 0 + assigned_ignored_det = False + + for j in range(det_size): + if (ignored_det[j] == -1): + continue + if (assigned_detection[j]): + continue + if (ignored_threshold[j]): + continue + overlap = overlaps[j, i] + dt_score = dt_scores[j] + if (not compute_fp and (overlap > min_overlap) + and dt_score > valid_detection): + det_idx = j + valid_detection = dt_score + elif (compute_fp and (overlap > min_overlap) + and (overlap > max_overlap or assigned_ignored_det) + and ignored_det[j] == 0): + max_overlap = overlap + det_idx = j + valid_detection = 1 + assigned_ignored_det = False + elif (compute_fp and (overlap > min_overlap) + and (valid_detection == NO_DETECTION) + and ignored_det[j] == 1): + det_idx = j + valid_detection = 1 + assigned_ignored_det = True + + if (valid_detection == NO_DETECTION) and ignored_gt[i] == 0: + fn += 1 + elif ((valid_detection != NO_DETECTION) + and (ignored_gt[i] == 1 or ignored_det[det_idx] == 1)): + assigned_detection[det_idx] = True + elif valid_detection != NO_DETECTION: + tp += 1 + # thresholds.append(dt_scores[det_idx]) + thresholds[thresh_idx] = dt_scores[det_idx] + thresh_idx += 1 + if compute_aos: + # delta.append(gt_alphas[i] - dt_alphas[det_idx]) + delta[delta_idx] = gt_alphas[i] - dt_alphas[det_idx] + delta_idx += 1 + + assigned_detection[det_idx] = True + if compute_fp: + for i in range(det_size): + if (not (assigned_detection[i] or ignored_det[i] == -1 + or ignored_det[i] == 1 or ignored_threshold[i])): + fp += 1 + nstuff = 0 + if metric == 0: + overlaps_dt_dc = image_box_overlap(dt_bboxes, dc_bboxes, 0) + for i in range(dc_bboxes.shape[0]): + for j in range(det_size): + if (assigned_detection[j]): + continue + if (ignored_det[j] == -1 or ignored_det[j] == 1): + 
continue + if (ignored_threshold[j]): + continue + if overlaps_dt_dc[j, i] > min_overlap: + assigned_detection[j] = True + nstuff += 1 + fp -= nstuff + if compute_aos: + tmp = np.zeros((fp + delta_idx, )) + # tmp = [0] * fp + for i in range(delta_idx): + tmp[i + fp] = (1.0 + np.cos(delta[i])) / 2.0 + # tmp.append((1.0 + np.cos(delta[i])) / 2.0) + # assert len(tmp) == fp + tp + # assert len(delta) == tp + if tp > 0 or fp > 0: + similarity = np.sum(tmp) + else: + similarity = -1 + return tp, fp, fn, similarity, thresholds[:thresh_idx] + + +def get_split_parts(num, num_part): + same_part = num // num_part + remain_num = num % num_part + if remain_num == 0: + return [same_part] * num_part + else: + return [same_part] * num_part + [remain_num] + + +@numba.jit(nopython=True) +def fused_compute_statistics(overlaps, + pr, + gt_nums, + dt_nums, + dc_nums, + gt_datas, + dt_datas, + dontcares, + ignored_gts, + ignored_dets, + metric, + min_overlap, + thresholds, + compute_aos=False): + gt_num = 0 + dt_num = 0 + dc_num = 0 + for i in range(gt_nums.shape[0]): + for t, thresh in enumerate(thresholds): + overlap = overlaps[dt_num:dt_num + dt_nums[i], gt_num: + gt_num + gt_nums[i]] + + gt_data = gt_datas[gt_num:gt_num + gt_nums[i]] + dt_data = dt_datas[dt_num:dt_num + dt_nums[i]] + ignored_gt = ignored_gts[gt_num:gt_num + gt_nums[i]] + ignored_det = ignored_dets[dt_num:dt_num + dt_nums[i]] + dontcare = dontcares[dc_num:dc_num + dc_nums[i]] + tp, fp, fn, similarity, _ = compute_statistics_jit( + overlap, + gt_data, + dt_data, + ignored_gt, + ignored_det, + dontcare, + metric, + min_overlap=min_overlap, + thresh=thresh, + compute_fp=True, + compute_aos=compute_aos) + pr[t, 0] += tp + pr[t, 1] += fp + pr[t, 2] += fn + if similarity != -1: + pr[t, 3] += similarity + gt_num += gt_nums[i] + dt_num += dt_nums[i] + dc_num += dc_nums[i] + + +def calculate_iou_partly(gt_annos, dt_annos, metric, num_parts=50): + """fast iou algorithm. this function can be used independently to + do result analysis. Must be used in CAMERA coordinate system. + Args: + gt_annos: dict, must from get_label_annos() in kitti_common.py + dt_annos: dict, must from get_label_annos() in kitti_common.py + metric: eval type. 0: bbox, 1: bev, 2: 3d + num_parts: int. 
a parameter for fast calculate algorithm + """ + assert len(gt_annos) == len(dt_annos) + total_dt_num = np.stack([len(a["name"]) for a in dt_annos], 0) + total_gt_num = np.stack([len(a["name"]) for a in gt_annos], 0) + num_examples = len(gt_annos) + split_parts = get_split_parts(num_examples, num_parts) + parted_overlaps = [] + example_idx = 0 + + for num_part in split_parts: + gt_annos_part = gt_annos[example_idx:example_idx + num_part] + dt_annos_part = dt_annos[example_idx:example_idx + num_part] + if metric == 0: + gt_boxes = np.concatenate([a["bbox"] for a in gt_annos_part], 0) + dt_boxes = np.concatenate([a["bbox"] for a in dt_annos_part], 0) + overlap_part = image_box_overlap(gt_boxes, dt_boxes) + elif metric == 1: + loc = np.concatenate( + [a["location"][:, [0, 2]] for a in gt_annos_part], 0) + dims = np.concatenate( + [a["dimensions"][:, [0, 2]] for a in gt_annos_part], 0) + rots = np.concatenate([a["rotation_y"] for a in gt_annos_part], 0) + gt_boxes = np.concatenate( + [loc, dims, rots[..., np.newaxis]], axis=1) + loc = np.concatenate( + [a["location"][:, [0, 2]] for a in dt_annos_part], 0) + dims = np.concatenate( + [a["dimensions"][:, [0, 2]] for a in dt_annos_part], 0) + rots = np.concatenate([a["rotation_y"] for a in dt_annos_part], 0) + dt_boxes = np.concatenate( + [loc, dims, rots[..., np.newaxis]], axis=1) + overlap_part = bev_box_overlap(gt_boxes, dt_boxes).astype( + np.float64) + elif metric == 2: + loc = np.concatenate([a["location"] for a in gt_annos_part], 0) + dims = np.concatenate([a["dimensions"] for a in gt_annos_part], 0) + rots = np.concatenate([a["rotation_y"] for a in gt_annos_part], 0) + gt_boxes = np.concatenate( + [loc, dims, rots[..., np.newaxis]], axis=1) + loc = np.concatenate([a["location"] for a in dt_annos_part], 0) + dims = np.concatenate([a["dimensions"] for a in dt_annos_part], 0) + rots = np.concatenate([a["rotation_y"] for a in dt_annos_part], 0) + dt_boxes = np.concatenate( + [loc, dims, rots[..., np.newaxis]], axis=1) + overlap_part = d3_box_overlap(gt_boxes, dt_boxes).astype( + np.float64) + else: + raise ValueError("unknown metric") + parted_overlaps.append(overlap_part) + example_idx += num_part + overlaps = [] + example_idx = 0 + for j, num_part in enumerate(split_parts): + gt_annos_part = gt_annos[example_idx:example_idx + num_part] + dt_annos_part = dt_annos[example_idx:example_idx + num_part] + gt_num_idx, dt_num_idx = 0, 0 + for i in range(num_part): + gt_box_num = total_gt_num[example_idx + i] + dt_box_num = total_dt_num[example_idx + i] + overlaps.append( + parted_overlaps[j][gt_num_idx:gt_num_idx + gt_box_num, + dt_num_idx:dt_num_idx + dt_box_num]) + gt_num_idx += gt_box_num + dt_num_idx += dt_box_num + example_idx += num_part + + return overlaps, parted_overlaps, total_gt_num, total_dt_num + + +def _prepare_data(gt_annos, dt_annos, current_class, difficulty): + gt_datas_list = [] + dt_datas_list = [] + total_dc_num = [] + ignored_gts, ignored_dets, dontcares = [], [], [] + total_num_valid_gt = 0 + for i in range(len(gt_annos)): + rets = clean_data(gt_annos[i], dt_annos[i], current_class, difficulty) + num_valid_gt, ignored_gt, ignored_det, dc_bboxes = rets + ignored_gts.append(np.array(ignored_gt, dtype=np.int64)) + ignored_dets.append(np.array(ignored_det, dtype=np.int64)) + if len(dc_bboxes) == 0: + dc_bboxes = np.zeros((0, 4)).astype(np.float64) + else: + dc_bboxes = np.stack(dc_bboxes, 0).astype(np.float64) + total_dc_num.append(dc_bboxes.shape[0]) + dontcares.append(dc_bboxes) + total_num_valid_gt += num_valid_gt + gt_datas = 
np.concatenate( + [gt_annos[i]["bbox"], gt_annos[i]["alpha"][..., np.newaxis]], 1) + dt_datas = np.concatenate([ + dt_annos[i]["bbox"], dt_annos[i]["alpha"][..., np.newaxis], + dt_annos[i]["score"][..., np.newaxis] + ], 1) + gt_datas_list.append(gt_datas) + dt_datas_list.append(dt_datas) + total_dc_num = np.stack(total_dc_num, axis=0) + return (gt_datas_list, dt_datas_list, ignored_gts, ignored_dets, dontcares, + total_dc_num, total_num_valid_gt) + + +def eval_class(gt_annos, + dt_annos, + current_classes, + difficultys, + metric, + min_overlaps, + compute_aos=False, + num_parts=50): + """Kitti eval. support 2d/bev/3d/aos eval. support 0.5:0.05:0.95 coco AP. + Args: + gt_annos: dict, must from get_label_annos() in kitti_common.py + dt_annos: dict, must from get_label_annos() in kitti_common.py + current_classes: list of int, 0: car, 1: pedestrian, 2: cyclist + difficultys: list of int. eval difficulty, 0: easy, 1: normal, 2: hard + metric: eval type. 0: bbox, 1: bev, 2: 3d + min_overlaps: float, min overlap. format: [num_overlap, metric, class]. + num_parts: int. a parameter for fast calculate algorithm + + Returns: + dict of recall, precision and aos + """ + assert len(gt_annos) == len(dt_annos) + num_examples = len(gt_annos) + split_parts = get_split_parts(num_examples, num_parts) + + rets = calculate_iou_partly(dt_annos, gt_annos, metric, num_parts) + overlaps, parted_overlaps, total_dt_num, total_gt_num = rets + N_SAMPLE_PTS = 41 + num_minoverlap = len(min_overlaps) + num_class = len(current_classes) + num_difficulty = len(difficultys) + precision = np.zeros( + [num_class, num_difficulty, num_minoverlap, N_SAMPLE_PTS]) + recall = np.zeros( + [num_class, num_difficulty, num_minoverlap, N_SAMPLE_PTS]) + aos = np.zeros([num_class, num_difficulty, num_minoverlap, N_SAMPLE_PTS]) + for m, current_class in enumerate(current_classes): + for l, difficulty in enumerate(difficultys): + rets = _prepare_data(gt_annos, dt_annos, current_class, difficulty) + (gt_datas_list, dt_datas_list, ignored_gts, ignored_dets, + dontcares, total_dc_num, total_num_valid_gt) = rets + for k, min_overlap in enumerate(min_overlaps[:, metric, m]): + thresholdss = [] + for i in range(len(gt_annos)): + rets = compute_statistics_jit( + overlaps[i], + gt_datas_list[i], + dt_datas_list[i], + ignored_gts[i], + ignored_dets[i], + dontcares[i], + metric, + min_overlap=min_overlap, + thresh=0.0, + compute_fp=False) + tp, fp, fn, similarity, thresholds = rets + thresholdss += thresholds.tolist() + thresholdss = np.array(thresholdss) + thresholds = get_thresholds(thresholdss, total_num_valid_gt) + thresholds = np.array(thresholds) + pr = np.zeros([len(thresholds), 4]) + idx = 0 + for j, num_part in enumerate(split_parts): + gt_datas_part = np.concatenate( + gt_datas_list[idx:idx + num_part], 0) + dt_datas_part = np.concatenate( + dt_datas_list[idx:idx + num_part], 0) + dc_datas_part = np.concatenate( + dontcares[idx:idx + num_part], 0) + ignored_dets_part = np.concatenate( + ignored_dets[idx:idx + num_part], 0) + ignored_gts_part = np.concatenate( + ignored_gts[idx:idx + num_part], 0) + fused_compute_statistics( + parted_overlaps[j], + pr, + total_gt_num[idx:idx + num_part], + total_dt_num[idx:idx + num_part], + total_dc_num[idx:idx + num_part], + gt_datas_part, + dt_datas_part, + dc_datas_part, + ignored_gts_part, + ignored_dets_part, + metric, + min_overlap=min_overlap, + thresholds=thresholds, + compute_aos=compute_aos) + idx += num_part + for i in range(len(thresholds)): + recall[m, l, k, i] = pr[i, 0] / (pr[i, 0] + pr[i, 
2]) + precision[m, l, k, i] = pr[i, 0] / (pr[i, 0] + pr[i, 1]) + if compute_aos: + aos[m, l, k, i] = pr[i, 3] / (pr[i, 0] + pr[i, 1]) + for i in range(len(thresholds)): + precision[m, l, k, i] = np.max( + precision[m, l, k, i:], axis=-1) + recall[m, l, k, i] = np.max(recall[m, l, k, i:], axis=-1) + if compute_aos: + aos[m, l, k, i] = np.max(aos[m, l, k, i:], axis=-1) + ret_dict = { + "recall": recall, + "precision": precision, + "orientation": aos, + } + return ret_dict + + +def get_mAP(prec): + sums = 0 + for i in range(0, prec.shape[-1], 4): + sums = sums + prec[..., i] + return sums / 11 * 100 + + +def print_str(value, *arg, sstream=None): + if sstream is None: + sstream = sysio.StringIO() + sstream.truncate(0) + sstream.seek(0) + print(value, *arg, file=sstream) + return sstream.getvalue() + + +def do_eval(gt_annos, + dt_annos, + current_classes, + min_overlaps, + compute_aos=False): + # min_overlaps: [num_minoverlap, metric, num_class] + difficultys = [0, 1, 2] + ret = eval_class(gt_annos, dt_annos, current_classes, difficultys, 0, + min_overlaps, compute_aos) + # ret: [num_class, num_diff, num_minoverlap, num_sample_points] + mAP_bbox = get_mAP(ret["precision"]) + mAP_aos = None + if compute_aos: + mAP_aos = get_mAP(ret["orientation"]) + ret = eval_class(gt_annos, dt_annos, current_classes, difficultys, 1, + min_overlaps) + mAP_bev = get_mAP(ret["precision"]) + ret = eval_class(gt_annos, dt_annos, current_classes, difficultys, 2, + min_overlaps) + mAP_3d = get_mAP(ret["precision"]) + return mAP_bbox, mAP_bev, mAP_3d, mAP_aos + + +def do_coco_style_eval(gt_annos, dt_annos, current_classes, overlap_ranges, + compute_aos): + # overlap_ranges: [range, metric, num_class] + min_overlaps = np.zeros([10, *overlap_ranges.shape[1:]]) + for i in range(overlap_ranges.shape[1]): + for j in range(overlap_ranges.shape[2]): + min_overlaps[:, i, j] = np.linspace(*overlap_ranges[:, i, j]) + mAP_bbox, mAP_bev, mAP_3d, mAP_aos = do_eval( + gt_annos, dt_annos, current_classes, min_overlaps, compute_aos) + # ret: [num_class, num_diff, num_minoverlap] + mAP_bbox = mAP_bbox.mean(-1) + mAP_bev = mAP_bev.mean(-1) + mAP_3d = mAP_3d.mean(-1) + if mAP_aos is not None: + mAP_aos = mAP_aos.mean(-1) + return mAP_bbox, mAP_bev, mAP_3d, mAP_aos + + +def get_official_eval_result(gt_annos, dt_annos, current_classes): + overlap_0_7 = np.array([[0.7, 0.5, 0.5, 0.7, + 0.5], [0.7, 0.5, 0.5, 0.7, 0.5], + [0.7, 0.5, 0.5, 0.7, 0.5]]) + overlap_0_5 = np.array([[0.7, 0.5, 0.5, 0.7, + 0.5], [0.5, 0.25, 0.25, 0.5, 0.25], + [0.5, 0.25, 0.25, 0.5, 0.25]]) + min_overlaps = np.stack([overlap_0_7, overlap_0_5], axis=0) # [2, 3, 5] + class_to_name = { + 0: 'Car', + 1: 'Pedestrian', + 2: 'Cyclist', + 3: 'Van', + 4: 'Person_sitting', + } + name_to_class = {v: n for n, v in class_to_name.items()} + if not isinstance(current_classes, (list, tuple)): + current_classes = [current_classes] + current_classes_int = [] + for curcls in current_classes: + if isinstance(curcls, str): + current_classes_int.append(name_to_class[curcls]) + else: + current_classes_int.append(curcls) + current_classes = current_classes_int + min_overlaps = min_overlaps[:, :, current_classes] + result = '' + # check whether alpha is valid + compute_aos = False + for anno in dt_annos: + if anno['alpha'].shape[0] != 0: + if anno['alpha'][0] != -10: + compute_aos = True + break + mAPbbox, mAPbev, mAP3d, mAPaos = do_eval( + gt_annos, dt_annos, current_classes, min_overlaps, compute_aos) + + ret_dict = {} + for j, curcls in enumerate(current_classes): + # mAP threshold array: 
[num_minoverlap, metric, class] + # mAP result: [num_class, num_diff, num_minoverlap] + for i in range(min_overlaps.shape[0]): + result += print_str( + (f"{class_to_name[curcls]} " + "AP@{:.2f}, {:.2f}, {:.2f}:".format(*min_overlaps[i, :, j]))) + result += print_str((f"bbox AP:{mAPbbox[j, 0, i]:.4f}, " + f"{mAPbbox[j, 1, i]:.4f}, " + f"{mAPbbox[j, 2, i]:.4f}")) + result += print_str((f"bev AP:{mAPbev[j, 0, i]:.4f}, " + f"{mAPbev[j, 1, i]:.4f}, " + f"{mAPbev[j, 2, i]:.4f}")) + result += print_str((f"3d AP:{mAP3d[j, 0, i]:.4f}, " + f"{mAP3d[j, 1, i]:.4f}, " + f"{mAP3d[j, 2, i]:.4f}")) + + + if compute_aos: + result += print_str((f"aos AP:{mAPaos[j, 0, i]:.2f}, " + f"{mAPaos[j, 1, i]:.2f}, " + f"{mAPaos[j, 2, i]:.2f}")) + ret_dict['Car_3d_easy'] = mAP3d[0, 0, 0] + ret_dict['Car_3d_moderate'] = mAP3d[0, 1, 0] + ret_dict['Car_3d_hard'] = mAP3d[0, 2, 0] + ret_dict['Car_bev_easy'] = mAPbev[0, 0, 0] + ret_dict['Car_bev_moderate'] = mAPbev[0, 1, 0] + ret_dict['Car_bev_hard'] = mAPbev[0, 2, 0] + ret_dict['Car_image_easy'] = mAPbbox[0, 0, 0] + ret_dict['Car_image_moderate'] = mAPbbox[0, 1, 0] + ret_dict['Car_image_hard'] = mAPbbox[0, 2, 0] + + return result, ret_dict + + +def get_coco_eval_result(gt_annos, dt_annos, current_classes): + class_to_name = { + 0: 'Car', + 1: 'Pedestrian', + 2: 'Cyclist', + 3: 'Van', + 4: 'Person_sitting', + } + class_to_range = { + 0: [0.5, 0.95, 10], + 1: [0.25, 0.7, 10], + 2: [0.25, 0.7, 10], + 3: [0.5, 0.95, 10], + 4: [0.25, 0.7, 10], + } + name_to_class = {v: n for n, v in class_to_name.items()} + if not isinstance(current_classes, (list, tuple)): + current_classes = [current_classes] + current_classes_int = [] + for curcls in current_classes: + if isinstance(curcls, str): + current_classes_int.append(name_to_class[curcls]) + else: + current_classes_int.append(curcls) + current_classes = current_classes_int + overlap_ranges = np.zeros([3, 3, len(current_classes)]) + for i, curcls in enumerate(current_classes): + overlap_ranges[:, :, i] = np.array( + class_to_range[curcls])[:, np.newaxis] + result = '' + # check whether alpha is valid + compute_aos = False + for anno in dt_annos: + if anno['alpha'].shape[0] != 0: + if anno['alpha'][0] != -10: + compute_aos = True + break + mAPbbox, mAPbev, mAP3d, mAPaos = do_coco_style_eval( + gt_annos, dt_annos, current_classes, overlap_ranges, compute_aos) + for j, curcls in enumerate(current_classes): + # mAP threshold array: [num_minoverlap, metric, class] + # mAP result: [num_class, num_diff, num_minoverlap] + o_range = np.array(class_to_range[curcls])[[0, 2, 1]] + o_range[1] = (o_range[2] - o_range[0]) / (o_range[1] - 1) + result += print_str((f"{class_to_name[curcls]} " + "coco AP@{:.2f}:{:.2f}:{:.2f}:".format(*o_range))) + result += print_str((f"bbox AP:{mAPbbox[j, 0]:.2f}, " + f"{mAPbbox[j, 1]:.2f}, " + f"{mAPbbox[j, 2]:.2f}")) + result += print_str((f"bev AP:{mAPbev[j, 0]:.2f}, " + f"{mAPbev[j, 1]:.2f}, " + f"{mAPbev[j, 2]:.2f}")) + result += print_str((f"3d AP:{mAP3d[j, 0]:.2f}, " + f"{mAP3d[j, 1]:.2f}, " + f"{mAP3d[j, 2]:.2f}")) + if compute_aos: + result += print_str((f"aos AP:{mAPaos[j, 0]:.2f}, " + f"{mAPaos[j, 1]:.2f}, " + f"{mAPaos[j, 2]:.2f}")) + return result diff --git a/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/evaluate.py b/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/evaluate.py new file mode 100644 index 0000000000000000000000000000000000000000..e822ae464618eb05c4123b7bd05cec875a567b70 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/evaluate.py @@ 
-0,0 +1,32 @@ +import time +import fire + +import tools.kitti_object_eval_python.kitti_common as kitti +from tools.kitti_object_eval_python.eval import get_official_eval_result, get_coco_eval_result + + +def _read_imageset_file(path): + with open(path, 'r') as f: + lines = f.readlines() + return [int(line) for line in lines] + + +def evaluate(label_path, + result_path, + label_split_file, + current_class=0, + coco=False, + score_thresh=-1): + dt_annos = kitti.get_label_annos(result_path) + if score_thresh > 0: + dt_annos = kitti.filter_annos_low_score(dt_annos, score_thresh) + val_image_ids = _read_imageset_file(label_split_file) + gt_annos = kitti.get_label_annos(label_path, val_image_ids) + if coco: + return get_coco_eval_result(gt_annos, dt_annos, current_class) + else: + return get_official_eval_result(gt_annos, dt_annos, current_class) + + +if __name__ == '__main__': + fire.Fire() diff --git a/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/kitti_common.py b/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/kitti_common.py new file mode 100644 index 0000000000000000000000000000000000000000..e7e254ea4a27af9656757bbfb1f932c1348f59fe --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/kitti_common.py @@ -0,0 +1,411 @@ +import concurrent.futures as futures +import os +import pathlib +import re +from collections import OrderedDict + +import numpy as np +from skimage import io + +def get_image_index_str(img_idx): + return "{:06d}".format(img_idx) + + +def get_kitti_info_path(idx, + prefix, + info_type='image_2', + file_tail='.png', + training=True, + relative_path=True): + img_idx_str = get_image_index_str(idx) + img_idx_str += file_tail + prefix = pathlib.Path(prefix) + if training: + file_path = pathlib.Path('training') / info_type / img_idx_str + else: + file_path = pathlib.Path('testing') / info_type / img_idx_str + if not (prefix / file_path).exists(): + raise ValueError("file not exist: {}".format(file_path)) + if relative_path: + return str(file_path) + else: + return str(prefix / file_path) + + +def get_image_path(idx, prefix, training=True, relative_path=True): + return get_kitti_info_path(idx, prefix, 'image_2', '.png', training, + relative_path) + + +def get_label_path(idx, prefix, training=True, relative_path=True): + return get_kitti_info_path(idx, prefix, 'label_2', '.txt', training, + relative_path) + + +def get_velodyne_path(idx, prefix, training=True, relative_path=True): + return get_kitti_info_path(idx, prefix, 'velodyne', '.bin', training, + relative_path) + + +def get_calib_path(idx, prefix, training=True, relative_path=True): + return get_kitti_info_path(idx, prefix, 'calib', '.txt', training, + relative_path) + + +def _extend_matrix(mat): + mat = np.concatenate([mat, np.array([[0., 0., 0., 1.]])], axis=0) + return mat + + +def get_kitti_image_info(path, + training=True, + label_info=True, + velodyne=False, + calib=False, + image_ids=7481, + extend_matrix=True, + num_worker=8, + relative_path=True, + with_imageshape=True): + # image_infos = [] + root_path = pathlib.Path(path) + if not isinstance(image_ids, list): + image_ids = list(range(image_ids)) + + def map_func(idx): + image_info = {'image_idx': idx} + annotations = None + if velodyne: + image_info['velodyne_path'] = get_velodyne_path( + idx, path, training, relative_path) + image_info['img_path'] = get_image_path(idx, path, training, + relative_path) + if with_imageshape: + img_path = image_info['img_path'] + if relative_path: + img_path = str(root_path / img_path) + 
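+            # skimage.io.imread returns an array of shape (H, W) or (H, W, C);
+            # only the first two dimensions are stored as the image shape.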
image_info['img_shape'] = np.array( + io.imread(img_path).shape[:2], dtype=np.int32) + if label_info: + label_path = get_label_path(idx, path, training, relative_path) + if relative_path: + label_path = str(root_path / label_path) + annotations = get_label_anno(label_path) + if calib: + calib_path = get_calib_path( + idx, path, training, relative_path=False) + with open(calib_path, 'r') as f: + lines = f.readlines() + P0 = np.array( + [float(info) for info in lines[0].split(' ')[1:13]]).reshape( + [3, 4]) + P1 = np.array( + [float(info) for info in lines[1].split(' ')[1:13]]).reshape( + [3, 4]) + P2 = np.array( + [float(info) for info in lines[2].split(' ')[1:13]]).reshape( + [3, 4]) + P3 = np.array( + [float(info) for info in lines[3].split(' ')[1:13]]).reshape( + [3, 4]) + if extend_matrix: + P0 = _extend_matrix(P0) + P1 = _extend_matrix(P1) + P2 = _extend_matrix(P2) + P3 = _extend_matrix(P3) + image_info['calib/P0'] = P0 + image_info['calib/P1'] = P1 + image_info['calib/P2'] = P2 + image_info['calib/P3'] = P3 + R0_rect = np.array([ + float(info) for info in lines[4].split(' ')[1:10] + ]).reshape([3, 3]) + if extend_matrix: + rect_4x4 = np.zeros([4, 4], dtype=R0_rect.dtype) + rect_4x4[3, 3] = 1. + rect_4x4[:3, :3] = R0_rect + else: + rect_4x4 = R0_rect + image_info['calib/R0_rect'] = rect_4x4 + Tr_velo_to_cam = np.array([ + float(info) for info in lines[5].split(' ')[1:13] + ]).reshape([3, 4]) + Tr_imu_to_velo = np.array([ + float(info) for info in lines[6].split(' ')[1:13] + ]).reshape([3, 4]) + if extend_matrix: + Tr_velo_to_cam = _extend_matrix(Tr_velo_to_cam) + Tr_imu_to_velo = _extend_matrix(Tr_imu_to_velo) + image_info['calib/Tr_velo_to_cam'] = Tr_velo_to_cam + image_info['calib/Tr_imu_to_velo'] = Tr_imu_to_velo + if annotations is not None: + image_info['annos'] = annotations + add_difficulty_to_annos(image_info) + return image_info + + with futures.ThreadPoolExecutor(num_worker) as executor: + image_infos = executor.map(map_func, image_ids) + return list(image_infos) + + +def filter_kitti_anno(image_anno, + used_classes, + used_difficulty=None, + dontcare_iou=None): + if not isinstance(used_classes, (list, tuple)): + used_classes = [used_classes] + img_filtered_annotations = {} + relevant_annotation_indices = [ + i for i, x in enumerate(image_anno['name']) if x in used_classes + ] + for key in image_anno.keys(): + img_filtered_annotations[key] = ( + image_anno[key][relevant_annotation_indices]) + if used_difficulty is not None: + relevant_annotation_indices = [ + i for i, x in enumerate(img_filtered_annotations['difficulty']) + if x in used_difficulty + ] + for key in image_anno.keys(): + img_filtered_annotations[key] = ( + img_filtered_annotations[key][relevant_annotation_indices]) + + if 'DontCare' in used_classes and dontcare_iou is not None: + dont_care_indices = [ + i for i, x in enumerate(img_filtered_annotations['name']) + if x == 'DontCare' + ] + # bounding box format [y_min, x_min, y_max, x_max] + all_boxes = img_filtered_annotations['bbox'] + ious = iou(all_boxes, all_boxes[dont_care_indices]) + + # Remove all bounding boxes that overlap with a dontcare region. 
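+        # An annotation is dropped when its highest IoU with any DontCare box
+        # exceeds dontcare_iou; the surviving entries are kept for every field.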
+ if ious.size > 0: + boxes_to_remove = np.amax(ious, axis=1) > dontcare_iou + for key in image_anno.keys(): + img_filtered_annotations[key] = (img_filtered_annotations[key][ + np.logical_not(boxes_to_remove)]) + return img_filtered_annotations + +def filter_annos_low_score(image_annos, thresh): + new_image_annos = [] + for anno in image_annos: + img_filtered_annotations = {} + relevant_annotation_indices = [ + i for i, s in enumerate(anno['score']) if s >= thresh + ] + for key in anno.keys(): + img_filtered_annotations[key] = ( + anno[key][relevant_annotation_indices]) + new_image_annos.append(img_filtered_annotations) + return new_image_annos + +def kitti_result_line(result_dict, precision=4): + prec_float = "{" + ":.{}f".format(precision) + "}" + res_line = [] + all_field_default = OrderedDict([ + ('name', None), + ('truncated', -1), + ('occluded', -1), + ('alpha', -10), + ('bbox', None), + ('dimensions', [-1, -1, -1]), + ('location', [-1000, -1000, -1000]), + ('rotation_y', -10), + ('score', None), + ]) + res_dict = [(key, None) for key, val in all_field_default.items()] + res_dict = OrderedDict(res_dict) + for key, val in result_dict.items(): + if all_field_default[key] is None and val is None: + raise ValueError("you must specify a value for {}".format(key)) + res_dict[key] = val + + for key, val in res_dict.items(): + if key == 'name': + res_line.append(val) + elif key in ['truncated', 'alpha', 'rotation_y', 'score']: + if val is None: + res_line.append(str(all_field_default[key])) + else: + res_line.append(prec_float.format(val)) + elif key == 'occluded': + if val is None: + res_line.append(str(all_field_default[key])) + else: + res_line.append('{}'.format(val)) + elif key in ['bbox', 'dimensions', 'location']: + if val is None: + res_line += [str(v) for v in all_field_default[key]] + else: + res_line += [prec_float.format(v) for v in val] + else: + raise ValueError("unknown key. 
supported key:{}".format( + res_dict.keys())) + return ' '.join(res_line) + + +def add_difficulty_to_annos(info): + min_height = [40, 25, + 25] # minimum height for evaluated groundtruth/detections + max_occlusion = [ + 0, 1, 2 + ] # maximum occlusion level of the groundtruth used for evaluation + max_trunc = [ + 0.15, 0.3, 0.5 + ] # maximum truncation level of the groundtruth used for evaluation + annos = info['annos'] + dims = annos['dimensions'] # lhw format + bbox = annos['bbox'] + height = bbox[:, 3] - bbox[:, 1] + occlusion = annos['occluded'] + truncation = annos['truncated'] + diff = [] + easy_mask = np.ones((len(dims), ), dtype=np.bool) + moderate_mask = np.ones((len(dims), ), dtype=np.bool) + hard_mask = np.ones((len(dims), ), dtype=np.bool) + i = 0 + for h, o, t in zip(height, occlusion, truncation): + if o > max_occlusion[0] or h <= min_height[0] or t > max_trunc[0]: + easy_mask[i] = False + if o > max_occlusion[1] or h <= min_height[1] or t > max_trunc[1]: + moderate_mask[i] = False + if o > max_occlusion[2] or h <= min_height[2] or t > max_trunc[2]: + hard_mask[i] = False + i += 1 + is_easy = easy_mask + is_moderate = np.logical_xor(easy_mask, moderate_mask) + is_hard = np.logical_xor(hard_mask, moderate_mask) + + for i in range(len(dims)): + if is_easy[i]: + diff.append(0) + elif is_moderate[i]: + diff.append(1) + elif is_hard[i]: + diff.append(2) + else: + diff.append(-1) + annos["difficulty"] = np.array(diff, np.int32) + return diff + + +def get_label_anno(label_path): + annotations = {} + annotations.update({ + 'name': [], + 'truncated': [], + 'occluded': [], + 'alpha': [], + 'bbox': [], + 'dimensions': [], + 'location': [], + 'rotation_y': [] + }) + with open(label_path, 'r') as f: + lines = f.readlines() + # if len(lines) == 0 or len(lines[0]) < 15: + # content = [] + # else: + content = [line.strip().split(' ') for line in lines] + annotations['name'] = np.array([x[0] for x in content]) + annotations['truncated'] = np.array([float(x[1]) for x in content]) + annotations['occluded'] = np.array([int(x[2]) for x in content]) + annotations['alpha'] = np.array([float(x[3]) for x in content]) + annotations['bbox'] = np.array( + [[float(info) for info in x[4:8]] for x in content]).reshape(-1, 4) + # dimensions will convert hwl format to standard lhw(camera) format. + annotations['dimensions'] = np.array( + [[float(info) for info in x[8:11]] for x in content]).reshape( + -1, 3)[:, [2, 0, 1]] + annotations['location'] = np.array( + [[float(info) for info in x[11:14]] for x in content]).reshape(-1, 3) + annotations['rotation_y'] = np.array( + [float(x[14]) for x in content]).reshape(-1) + if len(content) != 0 and len(content[0]) == 16: # have score + annotations['score'] = np.array([float(x[15]) for x in content]) + else: + annotations['score'] = np.zeros([len(annotations['bbox'])]) + return annotations + +def get_label_annos(label_folder, image_ids=None): + if image_ids is None: + filepaths = pathlib.Path(label_folder).glob('*.txt') + prog = re.compile(r'^\d{6}.txt$') + filepaths = filter(lambda f: prog.match(f.name), filepaths) + image_ids = [int(p.stem) for p in filepaths] + image_ids = sorted(image_ids) + if not isinstance(image_ids, list): + image_ids = list(range(image_ids)) + annos = [] + label_folder = pathlib.Path(label_folder) + for idx in image_ids: + image_idx = get_image_index_str(idx) + label_filename = label_folder / (image_idx + '.txt') + annos.append(get_label_anno(label_filename)) + return annos + +def area(boxes, add1=False): + """Computes area of boxes. 
+ + Args: + boxes: Numpy array with shape [N, 4] holding N boxes + + Returns: + a numpy array with shape [N*1] representing box areas + """ + if add1: + return (boxes[:, 2] - boxes[:, 0] + 1.0) * ( + boxes[:, 3] - boxes[:, 1] + 1.0) + else: + return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) + + +def intersection(boxes1, boxes2, add1=False): + """Compute pairwise intersection areas between boxes. + + Args: + boxes1: a numpy array with shape [N, 4] holding N boxes + boxes2: a numpy array with shape [M, 4] holding M boxes + + Returns: + a numpy array with shape [N*M] representing pairwise intersection area + """ + [y_min1, x_min1, y_max1, x_max1] = np.split(boxes1, 4, axis=1) + [y_min2, x_min2, y_max2, x_max2] = np.split(boxes2, 4, axis=1) + + all_pairs_min_ymax = np.minimum(y_max1, np.transpose(y_max2)) + all_pairs_max_ymin = np.maximum(y_min1, np.transpose(y_min2)) + if add1: + all_pairs_min_ymax += 1.0 + intersect_heights = np.maximum( + np.zeros(all_pairs_max_ymin.shape), + all_pairs_min_ymax - all_pairs_max_ymin) + + all_pairs_min_xmax = np.minimum(x_max1, np.transpose(x_max2)) + all_pairs_max_xmin = np.maximum(x_min1, np.transpose(x_min2)) + if add1: + all_pairs_min_xmax += 1.0 + intersect_widths = np.maximum( + np.zeros(all_pairs_max_xmin.shape), + all_pairs_min_xmax - all_pairs_max_xmin) + return intersect_heights * intersect_widths + + +def iou(boxes1, boxes2, add1=False): + """Computes pairwise intersection-over-union between box collections. + + Args: + boxes1: a numpy array with shape [N, 4] holding N boxes. + boxes2: a numpy array with shape [M, 4] holding N boxes. + + Returns: + a numpy array with shape [N, M] representing pairwise iou scores. + """ + intersect = intersection(boxes1, boxes2, add1) + area1 = area(boxes1, add1) + area2 = area(boxes2, add1) + union = np.expand_dims( + area1, axis=1) + np.expand_dims( + area2, axis=0) - intersect + return intersect / union diff --git a/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/rotate_iou.py b/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/rotate_iou.py new file mode 100644 index 0000000000000000000000000000000000000000..cd694ef5c5a0c9fac9595a17743a35db37d48820 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/tools/kitti_object_eval_python/rotate_iou.py @@ -0,0 +1,329 @@ +##################### +# Based on https://github.com/hongzhenwang/RRPN-revise +# Licensed under The MIT License +# Author: yanyan, scrin@foxmail.com +##################### +import math + +import numba +import numpy as np +from numba import cuda + +@numba.jit(nopython=True) +def div_up(m, n): + return m // n + (m % n > 0) + +@cuda.jit('(float32[:], float32[:], float32[:])', device=True, inline=True) +def trangle_area(a, b, c): + return ((a[0] - c[0]) * (b[1] - c[1]) - (a[1] - c[1]) * + (b[0] - c[0])) / 2.0 + + +@cuda.jit('(float32[:], int32)', device=True, inline=True) +def area(int_pts, num_of_inter): + area_val = 0.0 + for i in range(num_of_inter - 2): + area_val += abs( + trangle_area(int_pts[:2], int_pts[2 * i + 2:2 * i + 4], + int_pts[2 * i + 4:2 * i + 6])) + return area_val + + +@cuda.jit('(float32[:], int32)', device=True, inline=True) +def sort_vertex_in_convex_polygon(int_pts, num_of_inter): + if num_of_inter > 0: + center = cuda.local.array((2, ), dtype=numba.float32) + center[:] = 0.0 + for i in range(num_of_inter): + center[0] += int_pts[2 * i] + center[1] += int_pts[2 * i + 1] + center[0] /= num_of_inter + center[1] /= num_of_inter + v = cuda.local.array((2, ), dtype=numba.float32) + vs = cuda.local.array((16, 
), dtype=numba.float32) + for i in range(num_of_inter): + v[0] = int_pts[2 * i] - center[0] + v[1] = int_pts[2 * i + 1] - center[1] + d = math.sqrt(v[0] * v[0] + v[1] * v[1]) + v[0] = v[0] / d + v[1] = v[1] / d + if v[1] < 0: + v[0] = -2 - v[0] + vs[i] = v[0] + j = 0 + temp = 0 + for i in range(1, num_of_inter): + if vs[i - 1] > vs[i]: + temp = vs[i] + tx = int_pts[2 * i] + ty = int_pts[2 * i + 1] + j = i + while j > 0 and vs[j - 1] > temp: + vs[j] = vs[j - 1] + int_pts[j * 2] = int_pts[j * 2 - 2] + int_pts[j * 2 + 1] = int_pts[j * 2 - 1] + j -= 1 + + vs[j] = temp + int_pts[j * 2] = tx + int_pts[j * 2 + 1] = ty + + +@cuda.jit( + '(float32[:], float32[:], int32, int32, float32[:])', + device=True, + inline=True) +def line_segment_intersection(pts1, pts2, i, j, temp_pts): + A = cuda.local.array((2, ), dtype=numba.float32) + B = cuda.local.array((2, ), dtype=numba.float32) + C = cuda.local.array((2, ), dtype=numba.float32) + D = cuda.local.array((2, ), dtype=numba.float32) + + A[0] = pts1[2 * i] + A[1] = pts1[2 * i + 1] + + B[0] = pts1[2 * ((i + 1) % 4)] + B[1] = pts1[2 * ((i + 1) % 4) + 1] + + C[0] = pts2[2 * j] + C[1] = pts2[2 * j + 1] + + D[0] = pts2[2 * ((j + 1) % 4)] + D[1] = pts2[2 * ((j + 1) % 4) + 1] + BA0 = B[0] - A[0] + BA1 = B[1] - A[1] + DA0 = D[0] - A[0] + CA0 = C[0] - A[0] + DA1 = D[1] - A[1] + CA1 = C[1] - A[1] + acd = DA1 * CA0 > CA1 * DA0 + bcd = (D[1] - B[1]) * (C[0] - B[0]) > (C[1] - B[1]) * (D[0] - B[0]) + if acd != bcd: + abc = CA1 * BA0 > BA1 * CA0 + abd = DA1 * BA0 > BA1 * DA0 + if abc != abd: + DC0 = D[0] - C[0] + DC1 = D[1] - C[1] + ABBA = A[0] * B[1] - B[0] * A[1] + CDDC = C[0] * D[1] - D[0] * C[1] + DH = BA1 * DC0 - BA0 * DC1 + Dx = ABBA * DC0 - BA0 * CDDC + Dy = ABBA * DC1 - BA1 * CDDC + temp_pts[0] = Dx / DH + temp_pts[1] = Dy / DH + return True + return False + + +@cuda.jit( + '(float32[:], float32[:], int32, int32, float32[:])', + device=True, + inline=True) +def line_segment_intersection_v1(pts1, pts2, i, j, temp_pts): + a = cuda.local.array((2, ), dtype=numba.float32) + b = cuda.local.array((2, ), dtype=numba.float32) + c = cuda.local.array((2, ), dtype=numba.float32) + d = cuda.local.array((2, ), dtype=numba.float32) + + a[0] = pts1[2 * i] + a[1] = pts1[2 * i + 1] + + b[0] = pts1[2 * ((i + 1) % 4)] + b[1] = pts1[2 * ((i + 1) % 4) + 1] + + c[0] = pts2[2 * j] + c[1] = pts2[2 * j + 1] + + d[0] = pts2[2 * ((j + 1) % 4)] + d[1] = pts2[2 * ((j + 1) % 4) + 1] + + area_abc = trangle_area(a, b, c) + area_abd = trangle_area(a, b, d) + + if area_abc * area_abd >= 0: + return False + + area_cda = trangle_area(c, d, a) + area_cdb = area_cda + area_abc - area_abd + + if area_cda * area_cdb >= 0: + return False + t = area_cda / (area_abd - area_abc) + + dx = t * (b[0] - a[0]) + dy = t * (b[1] - a[1]) + temp_pts[0] = a[0] + dx + temp_pts[1] = a[1] + dy + return True + + +@cuda.jit('(float32, float32, float32[:])', device=True, inline=True) +def point_in_quadrilateral(pt_x, pt_y, corners): + ab0 = corners[2] - corners[0] + ab1 = corners[3] - corners[1] + + ad0 = corners[6] - corners[0] + ad1 = corners[7] - corners[1] + + ap0 = pt_x - corners[0] + ap1 = pt_y - corners[1] + + abab = ab0 * ab0 + ab1 * ab1 + abap = ab0 * ap0 + ab1 * ap1 + adad = ad0 * ad0 + ad1 * ad1 + adap = ad0 * ap0 + ad1 * ap1 + + return abab >= abap and abap >= 0 and adad >= adap and adap >= 0 + + +@cuda.jit('(float32[:], float32[:], float32[:])', device=True, inline=True) +def quadrilateral_intersection(pts1, pts2, int_pts): + num_of_inter = 0 + for i in range(4): + if point_in_quadrilateral(pts1[2 * i], 
pts1[2 * i + 1], pts2): + int_pts[num_of_inter * 2] = pts1[2 * i] + int_pts[num_of_inter * 2 + 1] = pts1[2 * i + 1] + num_of_inter += 1 + if point_in_quadrilateral(pts2[2 * i], pts2[2 * i + 1], pts1): + int_pts[num_of_inter * 2] = pts2[2 * i] + int_pts[num_of_inter * 2 + 1] = pts2[2 * i + 1] + num_of_inter += 1 + temp_pts = cuda.local.array((2, ), dtype=numba.float32) + for i in range(4): + for j in range(4): + has_pts = line_segment_intersection(pts1, pts2, i, j, temp_pts) + if has_pts: + int_pts[num_of_inter * 2] = temp_pts[0] + int_pts[num_of_inter * 2 + 1] = temp_pts[1] + num_of_inter += 1 + + return num_of_inter + + +@cuda.jit('(float32[:], float32[:])', device=True, inline=True) +def rbbox_to_corners(corners, rbbox): + # generate clockwise corners and rotate it clockwise + angle = rbbox[4] + a_cos = math.cos(angle) + a_sin = math.sin(angle) + center_x = rbbox[0] + center_y = rbbox[1] + x_d = rbbox[2] + y_d = rbbox[3] + corners_x = cuda.local.array((4, ), dtype=numba.float32) + corners_y = cuda.local.array((4, ), dtype=numba.float32) + corners_x[0] = -x_d / 2 + corners_x[1] = -x_d / 2 + corners_x[2] = x_d / 2 + corners_x[3] = x_d / 2 + corners_y[0] = -y_d / 2 + corners_y[1] = y_d / 2 + corners_y[2] = y_d / 2 + corners_y[3] = -y_d / 2 + for i in range(4): + corners[2 * + i] = a_cos * corners_x[i] + a_sin * corners_y[i] + center_x + corners[2 * i + + 1] = -a_sin * corners_x[i] + a_cos * corners_y[i] + center_y + + +@cuda.jit('(float32[:], float32[:])', device=True, inline=True) +def inter(rbbox1, rbbox2): + corners1 = cuda.local.array((8, ), dtype=numba.float32) + corners2 = cuda.local.array((8, ), dtype=numba.float32) + intersection_corners = cuda.local.array((16, ), dtype=numba.float32) + + rbbox_to_corners(corners1, rbbox1) + rbbox_to_corners(corners2, rbbox2) + + num_intersection = quadrilateral_intersection(corners1, corners2, + intersection_corners) + sort_vertex_in_convex_polygon(intersection_corners, num_intersection) + # print(intersection_corners.reshape([-1, 2])[:num_intersection]) + + return area(intersection_corners, num_intersection) + + +@cuda.jit('(float32[:], float32[:], int32)', device=True, inline=True) +def devRotateIoUEval(rbox1, rbox2, criterion=-1): + area1 = rbox1[2] * rbox1[3] + area2 = rbox2[2] * rbox2[3] + area_inter = inter(rbox1, rbox2) + if criterion == -1: + return area_inter / (area1 + area2 - area_inter) + elif criterion == 0: + return area_inter / area1 + elif criterion == 1: + return area_inter / area2 + else: + return area_inter + +@cuda.jit('(int64, int64, float32[:], float32[:], float32[:], int32)', fastmath=False) +def rotate_iou_kernel_eval(N, K, dev_boxes, dev_query_boxes, dev_iou, criterion=-1): + threadsPerBlock = 8 * 8 + row_start = cuda.blockIdx.x + col_start = cuda.blockIdx.y + tx = cuda.threadIdx.x + row_size = min(N - row_start * threadsPerBlock, threadsPerBlock) + col_size = min(K - col_start * threadsPerBlock, threadsPerBlock) + block_boxes = cuda.shared.array(shape=(64 * 5, ), dtype=numba.float32) + block_qboxes = cuda.shared.array(shape=(64 * 5, ), dtype=numba.float32) + + dev_query_box_idx = threadsPerBlock * col_start + tx + dev_box_idx = threadsPerBlock * row_start + tx + if (tx < col_size): + block_qboxes[tx * 5 + 0] = dev_query_boxes[dev_query_box_idx * 5 + 0] + block_qboxes[tx * 5 + 1] = dev_query_boxes[dev_query_box_idx * 5 + 1] + block_qboxes[tx * 5 + 2] = dev_query_boxes[dev_query_box_idx * 5 + 2] + block_qboxes[tx * 5 + 3] = dev_query_boxes[dev_query_box_idx * 5 + 3] + block_qboxes[tx * 5 + 4] = 
dev_query_boxes[dev_query_box_idx * 5 + 4] + if (tx < row_size): + block_boxes[tx * 5 + 0] = dev_boxes[dev_box_idx * 5 + 0] + block_boxes[tx * 5 + 1] = dev_boxes[dev_box_idx * 5 + 1] + block_boxes[tx * 5 + 2] = dev_boxes[dev_box_idx * 5 + 2] + block_boxes[tx * 5 + 3] = dev_boxes[dev_box_idx * 5 + 3] + block_boxes[tx * 5 + 4] = dev_boxes[dev_box_idx * 5 + 4] + cuda.syncthreads() + if tx < row_size: + for i in range(col_size): + offset = row_start * threadsPerBlock * K + col_start * threadsPerBlock + tx * K + i + dev_iou[offset] = devRotateIoUEval(block_qboxes[i * 5:i * 5 + 5], + block_boxes[tx * 5:tx * 5 + 5], criterion) + + +def rotate_iou_gpu_eval(boxes, query_boxes, criterion=-1, device_id=0): + """rotated box iou running in gpu. 500x faster than cpu version + (take 5ms in one example with numba.cuda code). + convert from [this project]( + https://github.com/hongzhenwang/RRPN-revise/tree/master/lib/rotation). + + Args: + boxes (float tensor: [N, 5]): rbboxes. format: centers, dims, + angles(clockwise when positive) + query_boxes (float tensor: [K, 5]): [description] + device_id (int, optional): Defaults to 0. [description] + + Returns: + [type]: [description] + """ + box_dtype = boxes.dtype + boxes = boxes.astype(np.float32) + query_boxes = query_boxes.astype(np.float32) + N = boxes.shape[0] + K = query_boxes.shape[0] + iou = np.zeros((N, K), dtype=np.float32) + if N == 0 or K == 0: + return iou + threadsPerBlock = 8 * 8 + cuda.select_device(device_id) + blockspergrid = (div_up(N, threadsPerBlock), div_up(K, threadsPerBlock)) + + stream = cuda.stream() + with stream.auto_synchronize(): + boxes_dev = cuda.to_device(boxes.reshape([-1]), stream) + query_boxes_dev = cuda.to_device(query_boxes.reshape([-1]), stream) + iou_dev = cuda.to_device(iou.reshape([-1]), stream) + rotate_iou_kernel_eval[blockspergrid, threadsPerBlock, stream]( + N, K, boxes_dev, query_boxes_dev, iou_dev, criterion) + iou_dev.copy_to_host(iou.reshape([-1]), stream=stream) + return iou.astype(boxes.dtype) diff --git a/PaddleCV/Paddle3D/PointRCNN/train.py b/PaddleCV/Paddle3D/PointRCNN/train.py new file mode 100644 index 0000000000000000000000000000000000000000..41a6f0981b5222b940eb23aca548fbf0672723ba --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/train.py @@ -0,0 +1,248 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
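+# A typical invocation for the RPN stage looks like the sketch below; the flag
+# names match parse_args() further down, while the concrete paths are only
+# placeholders and depend on how the KITTI data and gt database were prepared:
+#
+#   python train.py --cfg cfgs/default.yml --train_mode rpn \
+#       --batch_size 16 --epoch 200 \
+#       --data_dir ./data \
+#       --gt_database data/gt_database/train_gt_database_3level_Car.pkl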
+ +import os +import sys +import time +import shutil +import argparse +import logging +import numpy as np +import paddle +import paddle.fluid as fluid +from paddle.fluid.layers import control_flow +from paddle.fluid.contrib.extend_optimizer import extend_with_decoupled_weight_decay +import paddle.fluid.layers.learning_rate_scheduler as lr_scheduler + +from models.point_rcnn import PointRCNN +from data.kitti_rcnn_reader import KittiRCNNReader +from utils.run_utils import * +from utils.config import cfg, load_config, set_config_from_list +from utils.optimizer import optimize + +logging.root.handlers = [] +FORMAT = '%(asctime)s-%(levelname)s: %(message)s' +logging.basicConfig(level=logging.INFO, format=FORMAT, stream=sys.stdout) +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser("PointRCNN semantic segmentation train script") + parser.add_argument( + '--cfg', + type=str, + default='cfgs/default.yml', + help='specify the config for training') + parser.add_argument( + '--train_mode', + type=str, + default='rpn', + required=True, + help='specify the training mode') + parser.add_argument( + '--batch_size', + type=int, + default=16, + required=True, + help='training batch size, default 16') + parser.add_argument( + '--epoch', + type=int, + default=200, + required=True, + help='epoch number. default 200.') + parser.add_argument( + '--save_dir', + type=str, + default='checkpoints', + help='directory name to save train snapshoot') + parser.add_argument( + '--resume', + type=str, + default=None, + help='path to resume training based on previous checkpoints. ' + 'None for not resuming any checkpoints.') + parser.add_argument( + '--resume_epoch', + type=int, + default=0, + help='resume epoch id') + parser.add_argument( + '--data_dir', + type=str, + default='./data', + help='KITTI dataset root directory') + parser.add_argument( + '--gt_database', + type=str, + default='data/gt_database/train_gt_database_3level_Car.pkl', + help='generated gt database for augmentation') + parser.add_argument( + '--rcnn_training_roi_dir', + type=str, + default=None, + help='specify the saved rois for rcnn training when using rcnn_offline mode') + parser.add_argument( + '--rcnn_training_feature_dir', + type=str, + default=None, + help='specify the saved features for rcnn training when using rcnn_offline mode') + parser.add_argument( + '--worker_num', + type=int, + default=16, + help='multiprocess reader process num, default 16') + parser.add_argument( + '--log_interval', + type=int, + default=1, + help='mini-batch interval to log.') + parser.add_argument( + '--set', + dest='set_cfgs', + default=None, + nargs=argparse.REMAINDER, + help='set extra config keys if needed.') + args = parser.parse_args() + return args + + +def train(): + args = parse_args() + print_arguments(args) + # check whether the installed paddle is compiled with GPU + # PointRCNN model can only run on GPU + check_gpu(True) + + load_config(args.cfg) + if args.set_cfgs is not None: + set_config_from_list(args.set_cfgs) + + if args.train_mode == 'rpn': + cfg.RPN.ENABLED = True + cfg.RCNN.ENABLED = False + elif args.train_mode == 'rcnn': + cfg.RCNN.ENABLED = True + cfg.RPN.ENABLED = cfg.RPN.FIXED = True + elif args.train_mode == 'rcnn_offline': + cfg.RCNN.ENABLED = True + cfg.RPN.ENABLED = False + else: + raise NotImplementedError("unknown train mode: {}".format(args.train_mode)) + + checkpoints_dir = os.path.join(args.save_dir, args.train_mode) + if not os.path.isdir(checkpoints_dir): + os.makedirs(checkpoints_dir) 
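+    # steps_per_epoch below is derived from the reader length and batch size;
+    # boundaries/values describe a piecewise decay of cfg.TRAIN.LR by
+    # cfg.TRAIN.LR_DECAY at each epoch listed in cfg.TRAIN.DECAY_STEP_LIST.
+    # With the defaults LR=0.002, LR_DECAY=0.5 and DECAY_STEP_LIST starting
+    # [50, 100, ...], the first values are [0.002, 0.001, 0.0005, ...].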
+ + kitti_rcnn_reader = KittiRCNNReader(data_dir=args.data_dir, + npoints=cfg.RPN.NUM_POINTS, + split=cfg.TRAIN.SPLIT, + mode='TRAIN', + classes=cfg.CLASSES, + rcnn_training_roi_dir=args.rcnn_training_roi_dir, + rcnn_training_feature_dir=args.rcnn_training_feature_dir, + gt_database_dir=args.gt_database) + num_samples = len(kitti_rcnn_reader) + steps_per_epoch = int(num_samples / args.batch_size) + logger.info("Total {} samples, {} batch per epoch.".format(num_samples, steps_per_epoch)) + boundaries = [i * steps_per_epoch for i in cfg.TRAIN.DECAY_STEP_LIST] + values = [cfg.TRAIN.LR * (cfg.TRAIN.LR_DECAY ** i) for i in range(len(boundaries) + 1)] + + place = fluid.CUDAPlace(0) + exe = fluid.Executor(place) + + # build model + startup = fluid.Program() + train_prog = fluid.Program() + with fluid.program_guard(train_prog, startup): + with fluid.unique_name.guard(): + train_model = PointRCNN(cfg, args.batch_size, True, 'TRAIN') + train_model.build() + train_pyreader = train_model.get_pyreader() + train_feeds = train_model.get_feeds() + train_outputs = train_model.get_outputs() + train_loss = train_outputs['loss'] + lr = optimize(train_loss, + learning_rate=cfg.TRAIN.LR, + warmup_factor=1. / cfg.TRAIN.DIV_FACTOR, + decay_factor=1e-5, + total_step=steps_per_epoch * args.epoch, + warmup_pct=cfg.TRAIN.PCT_START, + train_program=train_prog, + startup_prog=startup, + weight_decay=cfg.TRAIN.WEIGHT_DECAY, + clip_norm=cfg.TRAIN.GRAD_NORM_CLIP) + train_keys, train_values = parse_outputs(train_outputs, 'loss') + + exe.run(startup) + + if args.resume: + assert os.path.exists(args.resume), \ + "Given resume weight dir {} not exist.".format(args.resume) + def if_exist(var): + logger.debug("{}: {}".format(var.name, os.path.exists(os.path.join(args.resume, var.name)))) + return os.path.exists(os.path.join(args.resume, var.name)) + fluid.io.load_vars( + exe, args.resume, predicate=if_exist, main_program=train_prog) + + build_strategy = fluid.BuildStrategy() + build_strategy.memory_optimize = False + build_strategy.enable_inplace = False + build_strategy.fuse_all_optimizer_ops = False + train_compile_prog = fluid.compiler.CompiledProgram( + train_prog).with_data_parallel(loss_name=train_loss.name, + build_strategy=build_strategy) + + def save_model(exe, prog, path): + if os.path.isdir(path): + shutil.rmtree(path) + logger.info("Save model to {}".format(path)) + fluid.io.save_persistables(exe, path, prog) + + # get reader + train_reader = kitti_rcnn_reader.get_multiprocess_reader(args.batch_size, + train_feeds, + proc_num=args.worker_num, + drop_last=True) + train_pyreader.decorate_sample_list_generator(train_reader, place) + + train_stat = Stat() + for epoch_id in range(args.resume_epoch, args.epoch): + try: + train_pyreader.start() + train_iter = 0 + train_periods = [] + while True: + cur_time = time.time() + train_outs = exe.run(train_compile_prog, fetch_list=train_values + [lr.name]) + period = time.time() - cur_time + train_periods.append(period) + train_stat.update(train_keys, train_outs[:-1]) + if train_iter % args.log_interval == 0: + log_str = "" + for name, values in zip(train_keys + ['learning_rate'], train_outs): + log_str += "{}: {:.6f}, ".format(name, np.mean(values)) + logger.info("[TRAIN] Epoch {}, batch {}: {}time: {:.2f}".format(epoch_id, train_iter, log_str, period)) + train_iter += 1 + except fluid.core.EOFException: + logger.info("[TRAIN] Epoch {} finished, {}average time: {:.2f}".format(epoch_id, train_stat.get_mean_log(), np.mean(train_periods[2:]))) + save_model(exe, train_prog, 
os.path.join(checkpoints_dir, str(epoch_id))) + train_stat.reset() + train_periods = [] + finally: + train_pyreader.reset() + + +if __name__ == "__main__": + train() diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/__init__.py b/PaddleCV/Paddle3D/PointRCNN/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..cad1d5d9ab5b0e5ed0724ddfc65ef53d14044b76 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/__init__.py @@ -0,0 +1,14 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/box_utils.py b/PaddleCV/Paddle3D/PointRCNN/utils/box_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..49c9ee74a64634e1836d081220996919ffae16a4 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/box_utils.py @@ -0,0 +1,275 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
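+# BEV boxes in this module are rectangles in the camera-frame x/z plane plus a
+# yaw angle: a boxes3d row (x, y, z, h, w, l, ry) maps to
+# (x - l/2, z - w/2, x + l/2, z + w/2, ry), i.e. the height axis y is dropped.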
+""" +Contains proposal functions +""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np +import paddle.fluid as fluid + +from utils.config import cfg + +__all__ = ["boxes3d_to_bev", "box_overlap_rotate", "boxes3d_to_bev", "box_iou", "box_nms"] + + +def boxes3d_to_bev(boxes3d): + """ + Args: + boxes3d: [N, 7], (x, y, z, h, w, l, ry) + Return: + boxes_bev: [N, 5], (x1, y1, x2, y2, ry) + """ + boxes_bev = np.zeros((boxes3d.shape[0], 5), dtype='float32') + + cu, cv = boxes3d[:, 0], boxes3d[:, 2] + half_l, half_w = boxes3d[:, 5] / 2, boxes3d[:, 4] / 2 + boxes_bev[:, 0], boxes_bev[:, 1] = cu - half_l, cv - half_w + boxes_bev[:, 2], boxes_bev[:, 3] = cu + half_l, cv + half_w + boxes_bev[:, 4] = boxes3d[:, 6] + return boxes_bev + + +def rotate_around_center(center, angle_cos, angle_sin, corners): + new_x = (corners[:, 0] - center[0]) * angle_cos + \ + (corners[:, 1] - center[1]) * angle_sin + center[0] + new_y = -(corners[:, 0] - center[0]) * angle_sin + \ + (corners[:, 1] - center[1]) * angle_cos + center[1] + return np.concatenate([new_x[:, np.newaxis], new_y[:, np.newaxis]], axis=-1) + + +def check_rect_cross(p1, p2, q1, q2): + return min(p1[0], p2[0]) <= max(q1[0], q2[0]) and \ + min(q1[0], q2[0]) <= max(p1[0], p2[0]) and \ + min(p1[1], p2[1]) <= max(q1[1], q2[1]) and \ + min(q1[1], q2[1]) <= max(p1[1], p2[1]) + + +def cross(p1, p2, p0): + return (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p2[0] - p0[0]) * (p1[1] - p0[1]); + + +def cross_area(a, b): + return a[0] * b[1] - a[1] * b[0] + + +def intersection(p1, p0, q1, q0): + if not check_rect_cross(p1, p0, q1, q0): + return None + + s1 = cross(q0, p1, p0) + s2 = cross(p1, q1, p0) + s3 = cross(p0, q1, q0) + s4 = cross(q1, p1, q0) + if not (s1 * s2 > 0 and s3 * s4 > 0): + return None + + s5 = cross(q1, p1, p0) + if np.abs(s5 - s1) > 1e-8: + return np.array([(s5 * q0[0] - s1 * q1[0]) / (s5 - s1), + (s5 * q0[1] - s1 * q1[1]) / (s5 - s1)], dtype='float32') + else: + a0 = p0[1] - p1[1] + b0 = p1[0] - p0[0] + c0 = p0[0] * p1[1] - p1[0] * p0[1] + a0 = q0[1] - q1[1] + b0 = q1[0] - q0[0] + c0 = q0[0] * q1[1] - q1[0] * q0[1] + D = a0 * b1 - a1 * b0 + return np.array([(b0 * c1 - b1 * c0) / D, (a1 * c0 - a0 * c1) / D], dtype='float32') + + +def check_in_box2d(box, p): + center_x = (box[0] + box[2]) / 2. + center_y = (box[1] + box[3]) / 2. + angle_cos = np.cos(-box[4]) + angle_sin = np.sin(-box[4]) + rot_x = (p[0] - center_x) * angle_cos + (p[1] - center_y) * angle_sin + center_x + rot_y = -(p[0] - center_x) * angle_sin + (p[1] - center_y) * angle_cos + center_y + return rot_x > box[0] - 1e-5 and rot_x < box[2] + 1e-5 and \ + rot_y > box[1] - 1e-5 and rot_y < box[3] + 1e-5 + + +def point_cmp(a, b, center): + return np.arctan2(a[1] - center[1], a[0] - center[0]) > \ + np.arctan2(b[1] - center[1], b[0] - center[0]) + + +def box_overlap_rotate(cur_box, boxes): + """ + Calculate box overlap with rotate, box: [x1, y1, x2, y2, angle] + """ + areas = np.zeros((len(boxes), ), dtype='float32') + cur_center = [(cur_box[0] + cur_box[2]) / 2., (cur_box[1] + cur_box[3]) / 2.] 
+ cur_corners = np.array([ + [cur_box[0], cur_box[1]], # (x1, y1) + [cur_box[2], cur_box[1]], # (x2, y1) + [cur_box[2], cur_box[3]], # (x2, y2) + [cur_box[0], cur_box[3]], # (x1, y2) + [cur_box[0], cur_box[1]], # (x1, y1) + ], dtype='float32') + cur_angle_cos = np.cos(cur_box[4]) + cur_angle_sin = np.sin(cur_box[4]) + cur_corners = rotate_around_center(cur_center, cur_angle_cos, cur_angle_sin, cur_corners) + + for i, box in enumerate(boxes): + box_center = [(box[0] + box[2]) / 2., (box[1] + box[3]) / 2.] + box_corners = np.array([ + [box[0], box[1]], + [box[2], box[1]], + [box[2], box[3]], + [box[0], box[3]], + [box[0], box[1]], + ], dtype='float32') + box_angle_cos = np.cos(box[4]) + box_angle_sin = np.sin(box[4]) + box_corners = rotate_around_center(box_center, box_angle_cos, box_angle_sin, box_corners) + + cross_points = np.zeros((16, 2), dtype='float32') + cnt = 0 + # get intersection of lines + for j in range(4): + for k in range(4): + inters = intersection(cur_corners[j + 1], cur_corners[j], + box_corners[k + 1], box_corners[k]) + if inters is not None: + cross_points[cnt, :] = inters + cnt += 1 + # check corners + for l in range(4): + if check_in_box2d(cur_box, box_corners[l]): + cross_points[cnt, :] = box_corners[l] + cnt += 1 + if check_in_box2d(box, cur_corners[l]): + cross_points[cnt, :] = cur_corners[l] + cnt += 1 + + if cnt > 0: + poly_center = np.sum(cross_points[:cnt, :], axis=0) / cnt + else: + poly_center = np.zeros((2,)) + + # sort the points of polygon + for j in range(cnt - 1): + for k in range(cnt - j - 1): + if point_cmp(cross_points[k], cross_points[k + 1], poly_center): + cross_points[k], cross_points[k + 1] = \ + cross_points[k + 1].copy(), cross_points[k].copy() + + # get the overlap areas + area = 0. + for j in range(cnt - 1): + area += cross_area(cross_points[j] - cross_points[0], + cross_points[j + 1] - cross_points[0]) + areas[i] = np.abs(area) / 2. + + return areas + + +def box_iou(cur_box, boxes, box_type='normal'): + cur_S = (cur_box[2] - cur_box[0]) * (cur_box[3] - cur_box[1]) + boxes_S = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) + + if box_type == 'normal': + inter_x1 = np.maximum(cur_box[0], boxes[:, 0]) + inter_y1 = np.maximum(cur_box[1], boxes[:, 1]) + inter_x2 = np.minimum(cur_box[2], boxes[:, 2]) + inter_y2 = np.minimum(cur_box[3], boxes[:, 3]) + inter_w = np.maximum(inter_x2 - inter_x1, 0.) + inter_h = np.maximum(inter_y2 - inter_y1, 0.) 
+ inter_area = inter_w * inter_h + elif box_type == 'rotate': + inter_area = box_overlap_rotate(cur_box, boxes) + else: + raise NotImplementedError + + return inter_area / np.maximum(cur_S + boxes_S - inter_area, 1e-8) + + +def box_nms(boxes, scores, proposals, thresh, topk, nms_type='normal'): + assert nms_type in ['normal', 'rotate'], \ + "unknown nms type {}".format(nms_type) + order = np.argsort(-scores) + boxes = boxes[order] + scores = scores[order] + proposals = proposals[order] + + nmsed_scores = [] + nmsed_proposals = [] + cnt = 0 + while boxes.shape[0]: + nmsed_scores.append(scores[0]) + nmsed_proposals.append(proposals[0]) + cnt +=1 + if cnt >= topk or boxes.shape[0] == 1: + break + iou = box_iou(boxes[0], boxes[1:], nms_type) + boxes = boxes[1:][iou < thresh] + scores = scores[1:][iou < thresh] + proposals = proposals[1:][iou < thresh] + return nmsed_scores, nmsed_proposals + + +def box_nms_eval(boxes, scores, proposals, thresh, nms_type='rotate'): + assert nms_type in ['normal', 'rotate'], \ + "unknown nms type {}".format(nms_type) + order = np.argsort(-scores) + boxes = boxes[order] + scores = scores[order] + proposals = proposals[order] + + nmsed_scores = [] + nmsed_proposals = [] + while boxes.shape[0]: + nmsed_scores.append(scores[0]) + nmsed_proposals.append(proposals[0]) + iou = box_iou(boxes[0], boxes[1:], nms_type) + inds = iou < thresh + boxes = boxes[1:][inds] + scores = scores[1:][inds] + proposals = proposals[1:][inds] + nmsed_scores = np.asarray(nmsed_scores) + nmsed_proposals = np.asarray(nmsed_proposals) + return nmsed_scores, nmsed_proposals + +def boxes_iou3d(boxes1, boxes2): + boxes1_bev = boxes3d_to_bev(boxes1) + boxes2_bev = boxes3d_to_bev(boxes2) + + # bev overlap + overlaps_bev = np.zeros((boxes1_bev.shape[0], boxes2_bev.shape[0])) + for i in range(boxes1_bev.shape[0]): + overlaps_bev[i, :] = box_overlap_rotate(boxes1_bev[i], boxes2_bev) + + # height overlap + boxes1_height_min = (boxes1[:, 1] - boxes1[:, 3]).reshape(-1, 1) + boxes1_height_max = boxes1[:, 1].reshape(-1, 1) + boxes2_height_min = (boxes2[:, 1] - boxes2[:, 3]).reshape(1, -1) + boxes2_height_max = boxes2[:, 1].reshape(1, -1) + + max_of_min = np.maximum(boxes1_height_min, boxes2_height_min) + min_of_max = np.minimum(boxes1_height_max, boxes2_height_max) + overlaps_h = np.maximum(min_of_max - max_of_min, 0.) 
+ + # 3d iou + overlaps_3d = overlaps_bev * overlaps_h + + vol_a = (boxes1[:, 3] * boxes1[:, 4] * boxes1[:, 5]).reshape(-1, 1) + vol_b = (boxes2[:, 3] * boxes2[:, 4] * boxes2[:, 5]).reshape(1, -1) + iou3d = overlaps_3d / np.maximum(vol_a + vol_b - overlaps_3d, 1e-7) + + return iou3d diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/calibration.py b/PaddleCV/Paddle3D/PointRCNN/utils/calibration.py new file mode 100644 index 0000000000000000000000000000000000000000..41fcf279db5a194c5dcc81ae8dafa48b088a42bc --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/calibration.py @@ -0,0 +1,143 @@ +""" +This code is borrow from https://github.com/sshaoshuai/PointRCNN/blob/master/lib/utils/kitti_utils.py +""" +import numpy as np +import os + + +def get_calib_from_file(calib_file): + with open(calib_file) as f: + lines = f.readlines() + + obj = lines[2].strip().split(' ')[1:] + P2 = np.array(obj, dtype=np.float32) + obj = lines[3].strip().split(' ')[1:] + P3 = np.array(obj, dtype=np.float32) + obj = lines[4].strip().split(' ')[1:] + R0 = np.array(obj, dtype=np.float32) + obj = lines[5].strip().split(' ')[1:] + Tr_velo_to_cam = np.array(obj, dtype=np.float32) + + return {'P2': P2.reshape(3, 4), + 'P3': P3.reshape(3, 4), + 'R0': R0.reshape(3, 3), + 'Tr_velo2cam': Tr_velo_to_cam.reshape(3, 4)} + + +class Calibration(object): + def __init__(self, calib_file): + if isinstance(calib_file, str): + calib = get_calib_from_file(calib_file) + else: + calib = calib_file + + self.P2 = calib['P2'] # 3 x 4 + self.R0 = calib['R0'] # 3 x 3 + self.V2C = calib['Tr_velo2cam'] # 3 x 4 + + # Camera intrinsics and extrinsics + self.cu = self.P2[0, 2] + self.cv = self.P2[1, 2] + self.fu = self.P2[0, 0] + self.fv = self.P2[1, 1] + self.tx = self.P2[0, 3] / (-self.fu) + self.ty = self.P2[1, 3] / (-self.fv) + + def cart_to_hom(self, pts): + """ + :param pts: (N, 3 or 2) + :return pts_hom: (N, 4 or 3) + """ + pts_hom = np.hstack((pts, np.ones((pts.shape[0], 1), dtype=np.float32))) + return pts_hom + + def lidar_to_rect(self, pts_lidar): + """ + :param pts_lidar: (N, 3) + :return pts_rect: (N, 3) + """ + pts_lidar_hom = self.cart_to_hom(pts_lidar) + pts_rect = np.dot(pts_lidar_hom, np.dot(self.V2C.T, self.R0.T)) + # pts_rect = reduce(np.dot, (pts_lidar_hom, self.V2C.T, self.R0.T)) + return pts_rect + + def rect_to_img(self, pts_rect): + """ + :param pts_rect: (N, 3) + :return pts_img: (N, 2) + """ + pts_rect_hom = self.cart_to_hom(pts_rect) + pts_2d_hom = np.dot(pts_rect_hom, self.P2.T) + pts_img = (pts_2d_hom[:, 0:2].T / pts_rect_hom[:, 2]).T # (N, 2) + pts_rect_depth = pts_2d_hom[:, 2] - self.P2.T[3, 2] # depth in rect camera coord + return pts_img, pts_rect_depth + + def lidar_to_img(self, pts_lidar): + """ + :param pts_lidar: (N, 3) + :return pts_img: (N, 2) + """ + pts_rect = self.lidar_to_rect(pts_lidar) + pts_img, pts_depth = self.rect_to_img(pts_rect) + return pts_img, pts_depth + + def img_to_rect(self, u, v, depth_rect): + """ + :param u: (N) + :param v: (N) + :param depth_rect: (N) + :return: + """ + x = ((u - self.cu) * depth_rect) / self.fu + self.tx + y = ((v - self.cv) * depth_rect) / self.fv + self.ty + pts_rect = np.concatenate((x.reshape(-1, 1), y.reshape(-1, 1), depth_rect.reshape(-1, 1)), axis=1) + return pts_rect + + def depthmap_to_rect(self, depth_map): + """ + :param depth_map: (H, W), depth_map + :return: + """ + x_range = np.arange(0, depth_map.shape[1]) + y_range = np.arange(0, depth_map.shape[0]) + x_idxs, y_idxs = np.meshgrid(x_range, y_range) + x_idxs, y_idxs = x_idxs.reshape(-1), y_idxs.reshape(-1) + 
depth = depth_map[y_idxs, x_idxs] + pts_rect = self.img_to_rect(x_idxs, y_idxs, depth) + return pts_rect, x_idxs, y_idxs + + def corners3d_to_img_boxes(self, corners3d): + """ + :param corners3d: (N, 8, 3) corners in rect coordinate + :return: boxes: (None, 4) [x1, y1, x2, y2] in rgb coordinate + :return: boxes_corner: (None, 8) [xi, yi] in rgb coordinate + """ + sample_num = corners3d.shape[0] + corners3d_hom = np.concatenate((corners3d, np.ones((sample_num, 8, 1))), axis=2) # (N, 8, 4) + + img_pts = np.matmul(corners3d_hom, self.P2.T) # (N, 8, 3) + + x, y = img_pts[:, :, 0] / img_pts[:, :, 2], img_pts[:, :, 1] / img_pts[:, :, 2] + x1, y1 = np.min(x, axis=1), np.min(y, axis=1) + x2, y2 = np.max(x, axis=1), np.max(y, axis=1) + + boxes = np.concatenate((x1.reshape(-1, 1), y1.reshape(-1, 1), x2.reshape(-1, 1), y2.reshape(-1, 1)), axis=1) + boxes_corner = np.concatenate((x.reshape(-1, 8, 1), y.reshape(-1, 8, 1)), axis=2) + + return boxes, boxes_corner + + def camera_dis_to_rect(self, u, v, d): + """ + Can only process valid u, v, d, which means u, v can not beyond the image shape, reprojection error 0.02 + :param u: (N) + :param v: (N) + :param d: (N), the distance between camera and 3d points, d^2 = x^2 + y^2 + z^2 + :return: + """ + assert self.fu == self.fv, '%.8f != %.8f' % (self.fu, self.fv) + fd = np.sqrt((u - self.cu)**2 + (v - self.cv)**2 + self.fu**2) + x = ((u - self.cu) * d) / fd + self.tx + y = ((v - self.cv) * d) / fd + self.ty + z = np.sqrt(d**2 - x**2 - y**2) + pts_rect = np.concatenate((x.reshape(-1, 1), y.reshape(-1, 1), z.reshape(-1, 1)), axis=1) + return pts_rect diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/config.py b/PaddleCV/Paddle3D/PointRCNN/utils/config.py new file mode 100644 index 0000000000000000000000000000000000000000..dc24aee5253576e3e5f78b8ed246af51c06279ba --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/config.py @@ -0,0 +1,279 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +""" +This code is bases on https://github.com/sshaoshuai/PointRCNN/blob/master/lib/config.py +""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import yaml +import numpy as np +from ast import literal_eval + +__all__ = ["load_config", "cfg"] + + +class AttrDict(dict): + def __init__(self, *args, **kwargs): + for arg in args: + for k, v in arg.items(): + if isinstance(v, dict): + arg[k] = AttrDict(v) + else: + arg[k] = v + super(AttrDict, self).__init__(*args, **kwargs) + + def __getattr__(self, name): + if name in self.__dict__: + return self.__dict__[name] + elif name in self: + return self[name] + else: + raise AttributeError(name) + + def __setattr__(self, name, value): + if name in self.__dict__: + self.__dict__[name] = value + else: + self[name] = value + + +__C = AttrDict() +cfg = __C + +# 0. 
basic config +__C.TAG = 'default' +__C.CLASSES = 'Car' + +__C.INCLUDE_SIMILAR_TYPE = False + +# config of augmentation +__C.AUG_DATA = True +__C.AUG_METHOD_LIST = ['rotation', 'scaling', 'flip'] +__C.AUG_METHOD_PROB = [0.5, 0.5, 0.5] +__C.AUG_ROT_RANGE = 18 + +__C.GT_AUG_ENABLED = False +__C.GT_EXTRA_NUM = 15 +__C.GT_AUG_RAND_NUM = False +__C.GT_AUG_APPLY_PROB = 0.75 +__C.GT_AUG_HARD_RATIO = 0.6 + +__C.PC_REDUCE_BY_RANGE = True +__C.PC_AREA_SCOPE = np.array([[-40, 40], + [-1, 3], + [0, 70.4]]) # x, y, z scope in rect camera coords + +__C.CLS_MEAN_SIZE = np.array([[1.52, 1.63, 3.88]], dtype=np.float32) + + +# 1. config of rpn network +__C.RPN = AttrDict() +__C.RPN.ENABLED = True +__C.RPN.FIXED = False + +__C.RPN.USE_INTENSITY = True + +# config of bin-based loss +__C.RPN.LOC_XZ_FINE = False +__C.RPN.LOC_SCOPE = 3.0 +__C.RPN.LOC_BIN_SIZE = 0.5 +__C.RPN.NUM_HEAD_BIN = 12 + +# config of network structure +__C.RPN.BACKBONE = 'pointnet2_msg' + +__C.RPN.USE_BN = True +__C.RPN.NUM_POINTS = 16384 + +__C.RPN.SA_CONFIG = AttrDict() +__C.RPN.SA_CONFIG.NPOINTS = [4096, 1024, 256, 64] +__C.RPN.SA_CONFIG.RADIUS = [[0.1, 0.5], [0.5, 1.0], [1.0, 2.0], [2.0, 4.0]] +__C.RPN.SA_CONFIG.NSAMPLE = [[16, 32], [16, 32], [16, 32], [16, 32]] +__C.RPN.SA_CONFIG.MLPS = [[[16, 16, 32], [32, 32, 64]], + [[64, 64, 128], [64, 96, 128]], + [[128, 196, 256], [128, 196, 256]], + [[256, 256, 512], [256, 384, 512]]] +__C.RPN.FP_MLPS = [[128, 128], [256, 256], [512, 512], [512, 512]] +__C.RPN.CLS_FC = [128] +__C.RPN.REG_FC = [128] +__C.RPN.DP_RATIO = 0.5 + +# config of training +__C.RPN.LOSS_CLS = 'DiceLoss' +__C.RPN.FG_WEIGHT = 15 +__C.RPN.FOCAL_ALPHA = [0.25, 0.75] +__C.RPN.FOCAL_GAMMA = 2.0 +__C.RPN.REG_LOSS_WEIGHT = [1.0, 1.0, 1.0, 1.0] +__C.RPN.LOSS_WEIGHT = [1.0, 1.0] +__C.RPN.NMS_TYPE = 'normal' # normal, rotate + +# config of testing +__C.RPN.SCORE_THRESH = 0.3 + + +# 2. 
config of rcnn network +__C.RCNN = AttrDict() +__C.RCNN.ENABLED = False + +# config of input +__C.RCNN.USE_RPN_FEATURES = True +__C.RCNN.USE_MASK = True +__C.RCNN.MASK_TYPE = 'seg' +__C.RCNN.USE_INTENSITY = False +__C.RCNN.USE_DEPTH = True +__C.RCNN.USE_SEG_SCORE = False +__C.RCNN.ROI_SAMPLE_JIT = False +__C.RCNN.ROI_FG_AUG_TIMES = 10 + +__C.RCNN.REG_AUG_METHOD = 'multiple' # multiple, single, normal +__C.RCNN.POOL_EXTRA_WIDTH = 1.0 + +# config of bin-based loss +__C.RCNN.LOC_SCOPE = 1.5 +__C.RCNN.LOC_BIN_SIZE = 0.5 +__C.RCNN.NUM_HEAD_BIN = 9 +__C.RCNN.LOC_Y_BY_BIN = False +__C.RCNN.LOC_Y_SCOPE = 0.5 +__C.RCNN.LOC_Y_BIN_SIZE = 0.25 +__C.RCNN.SIZE_RES_ON_ROI = False + +# config of network structure +__C.RCNN.USE_BN = False +__C.RCNN.DP_RATIO = 0.0 + +__C.RCNN.BACKBONE = 'pointnet' # pointnet, pointsift +__C.RCNN.XYZ_UP_LAYER = [128, 128] + +__C.RCNN.NUM_POINTS = 512 +__C.RCNN.SA_CONFIG = AttrDict() +__C.RCNN.SA_CONFIG.NPOINTS = [128, 32, -1] +__C.RCNN.SA_CONFIG.RADIUS = [0.2, 0.4, 100] +__C.RCNN.SA_CONFIG.NSAMPLE = [64, 64, 64] +__C.RCNN.SA_CONFIG.MLPS = [[128, 128, 128], + [128, 128, 256], + [256, 256, 512]] +__C.RCNN.CLS_FC = [256, 256] +__C.RCNN.REG_FC = [256, 256] + +# config of training +__C.RCNN.LOSS_CLS = 'BinaryCrossEntropy' +__C.RCNN.FOCAL_ALPHA = [0.25, 0.75] +__C.RCNN.FOCAL_GAMMA = 2.0 +__C.RCNN.CLS_WEIGHT = np.array([1.0, 1.0, 1.0], dtype=np.float32) +__C.RCNN.CLS_FG_THRESH = 0.6 +__C.RCNN.CLS_BG_THRESH = 0.45 +__C.RCNN.CLS_BG_THRESH_LO = 0.05 +__C.RCNN.REG_FG_THRESH = 0.55 +__C.RCNN.FG_RATIO = 0.5 +__C.RCNN.ROI_PER_IMAGE = 64 +__C.RCNN.HARD_BG_RATIO = 0.6 + +# config of testing +__C.RCNN.SCORE_THRESH = 0.3 +__C.RCNN.NMS_THRESH = 0.1 + + +# general training config +__C.TRAIN = AttrDict() +__C.TRAIN.SPLIT = 'train' +__C.TRAIN.VAL_SPLIT = 'smallval' + +__C.TRAIN.LR = 0.002 +__C.TRAIN.LR_CLIP = 0.00001 +__C.TRAIN.LR_DECAY = 0.5 +__C.TRAIN.DECAY_STEP_LIST = [50, 100, 150, 200, 250, 300] +__C.TRAIN.LR_WARMUP = False +__C.TRAIN.WARMUP_MIN = 0.0002 +__C.TRAIN.WARMUP_EPOCH = 5 + +__C.TRAIN.BN_MOMENTUM = 0.9 +__C.TRAIN.BN_DECAY = 0.5 +__C.TRAIN.BNM_CLIP = 0.01 +__C.TRAIN.BN_DECAY_STEP_LIST = [50, 100, 150, 200, 250, 300] + +__C.TRAIN.OPTIMIZER = 'adam' +__C.TRAIN.WEIGHT_DECAY = 0.0 # "L2 regularization coeff [default: 0.0]" +__C.TRAIN.MOMENTUM = 0.9 + +__C.TRAIN.MOMS = [0.95, 0.85] +__C.TRAIN.DIV_FACTOR = 10.0 +__C.TRAIN.PCT_START = 0.4 + +__C.TRAIN.GRAD_NORM_CLIP = 1.0 + +__C.TRAIN.RPN_PRE_NMS_TOP_N = 12000 +__C.TRAIN.RPN_POST_NMS_TOP_N = 2048 +__C.TRAIN.RPN_NMS_THRESH = 0.85 +__C.TRAIN.RPN_DISTANCE_BASED_PROPOSE = True + + +__C.TEST = AttrDict() +__C.TEST.SPLIT = 'val' +__C.TEST.RPN_PRE_NMS_TOP_N = 9000 +__C.TEST.RPN_POST_NMS_TOP_N = 300 +__C.TEST.RPN_NMS_THRESH = 0.7 +__C.TEST.RPN_DISTANCE_BASED_PROPOSE = True + + +def load_config(fname): + """ + Load config from yaml file and merge into global cfg + """ + with open(fname) as f: + yml_cfg = AttrDict(yaml.load(f.read(), Loader=yaml.Loader)) + _merge_cfg_a_to_b(yml_cfg, __C) + + +def set_config_from_list(cfg_list): + assert len(cfg_list) % 2 == 0, "cfgs list length invalid" + for k, v in zip(cfg_list[0::2], cfg_list[1::2]): + key_list = k.split('.') + d = __C + for subkey in key_list[:-1]: + assert subkey in d + d = d[subkey] + subkey = key_list[-1] + assert subkey in d + try: + value = literal_eval(v) + except: + # handle the case when v is a string literal + value = v + assert type(value) == type(d[subkey]), \ + 'type {} does not match original type {}'.format(type(value), type(d[subkey])) + d[subkey] = value + + +def 
_merge_cfg_a_to_b(a, b): + assert isinstance(a, AttrDict), \ + "unknown type {}".format(type(a)) + + for k, v in a.items(): + assert k in b, "unknown key {}".format(k) + if type(v) is not type(b[k]): + if isinstance(b[k], np.ndarray): + b[k] = np.array(v, dtype=b[k].dtype) + else: + raise TypeError("Config type mismatch") + if isinstance(v, AttrDict): + _merge_cfg_a_to_b(v, b[k]) + else: + b[k] = v + + +if __name__ == "__main__": + load_config("./cfgs/default.yml") diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/cyops/__init__.py b/PaddleCV/Paddle3D/PointRCNN/utils/cyops/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e02c54922625934fe1ab74a8c29e435f44f4d302 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/cyops/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + + diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/cyops/iou3d_utils.pyx b/PaddleCV/Paddle3D/PointRCNN/utils/cyops/iou3d_utils.pyx new file mode 100644 index 0000000000000000000000000000000000000000..b2c7f3c7169c0a0f5da1adeeb029eec423daf39e --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/cyops/iou3d_utils.pyx @@ -0,0 +1,195 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
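+# Illustrative usage, assuming the extension has been built and is importable
+# as utils.cyops.iou3d_utils; boxes are (x, y, z, h, w, l, ry) rows in the
+# KITTI rect camera frame:
+#
+#   import numpy as np
+#   from utils.cyops.iou3d_utils import boxes_iou3d
+#
+#   boxes_a = np.array([[0.0, 1.5, 20.0, 1.5, 1.6, 3.9, 0.0]], dtype=np.float32)
+#   boxes_b = np.array([[0.5, 1.5, 20.0, 1.5, 1.6, 3.9, 0.1]], dtype=np.float32)
+#   iou3d = boxes_iou3d(boxes_a, boxes_b)   # (1, 1) array of 3D IoU values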
+ +import cython +from math import pi, cos, sin +import numpy as np +cimport numpy as np + + +cdef class Point: + cdef float x, y + def __cinit__(self, x, y): + self.x = x + self.y = y + + def __add__(self, v): + if not isinstance(v, Point): + return NotImplemented + return Point(self.x + v.x, self.y + v.y) + + def __sub__(self, v): + if not isinstance(v, Point): + return NotImplemented + return Point(self.x - v.x, self.y - v.y) + + def cross(self, v): + if not isinstance(v, Point): + return NotImplemented + return self.x*v.y - self.y*v.x + + +cdef class Line: + cdef float a, b, c + # ax + by + c = 0 + def __cinit__(self, v1, v2): + self.a = v2.y - v1.y + self.b = v1.x - v2.x + self.c = v2.cross(v1) + + def __call__(self, p): + return self.a*p.x + self.b*p.y + self.c + + def intersection(self, other): + if not isinstance(other, Line): + return NotImplemented + w = self.a*other.b - self.b*other.a + return Point( + (self.b*other.c - self.c*other.b)/w, + (self.c*other.a - self.a*other.c)/w + ) + + +@cython.boundscheck(False) +@cython.wraparound(False) +def rectangle_vertices_(x1, y1, x2, y2, r): + + cx = (x1 + x2) / 2 + cy = (y1 + y2) / 2 + angle = r + cr = cos(angle) + sr = sin(angle) + # rotate around center + return ( + Point( + x=(x1-cx)*cr+(y1-cy)*sr+cx, + y=-(x1-cx)*sr+(y1-cy)*cr+cy + ), + Point( + x=(x2-cx)*cr+(y1-cy)*sr+cx, + y=-(x2-cx)*sr+(y1-cy)*cr+cy + ), + Point( + x=(x2-cx)*cr+(y2-cy)*sr+cx, + y=-(x2-cx)*sr+(y2-cy)*cr+cy + ), + Point( + x=(x1-cx)*cr+(y2-cy)*sr+cx, + y=-(x1-cx)*sr+(y2-cy)*cr+cy + ) + ) + +@cython.boundscheck(False) +@cython.wraparound(False) +def intersection_area(r1, r2): + # r1 and r2 are in (center, width, height, rotation) representation + # First convert these into a sequence of vertices + + rect1 = rectangle_vertices_(*r1) + rect2 = rectangle_vertices_(*r2) + + # Use the vertices of the first rectangle as + # starting vertices of the intersection polygon. + intersection = rect1 + + # Loop over the edges of the second rectangle + for p, q in zip(rect2, rect2[1:] + rect2[:1]): + if len(intersection) <= 2: + break # No intersection + + line = Line(p, q) + + # Any point p with line(p) <= 0 is on the "inside" (or on the boundary), + # any point p with line(p) > 0 is on the "outside". + + # Loop over the edges of the intersection polygon, + # and determine which part is inside and which is outside. + new_intersection = [] + line_values = [line(t) for t in intersection] + for s, t, s_value, t_value in zip( + intersection, intersection[1:] + intersection[:1], + line_values, line_values[1:] + line_values[:1]): + if s_value <= 0: + new_intersection.append(s) + if s_value * t_value < 0: + # Points are on opposite sides. + # Add the intersection of the lines to new_intersection. 
+ intersection_point = line.intersection(Line(s, t)) + new_intersection.append(intersection_point) + + intersection = new_intersection + + # Calculate area + if len(intersection) <= 2: + return 0 + + return 0.5 * sum(p.x*q.y - p.y*q.x for p, q in zip(intersection, intersection[1:] + intersection[:1])) + + +def boxes3d_to_bev_(boxes3d): + """ + Args: + boxes3d: [N, 7], (x, y, z, h, w, l, ry) + Return: + boxes_bev: [N, 5], (x1, y1, x2, y2, ry) + """ + boxes_bev = np.zeros((boxes3d.shape[0], 5), dtype='float32') + cu, cv = boxes3d[:, 0], boxes3d[:, 2] + half_l, half_w = boxes3d[:, 5] / 2, boxes3d[:, 4] / 2 + boxes_bev[:, 0], boxes_bev[:, 1] = cu - half_l, cv - half_w + boxes_bev[:, 2], boxes_bev[:, 3] = cu + half_l, cv + half_w + boxes_bev[:, 4] = boxes3d[:, 6] + return boxes_bev + + +def boxes_iou3d(boxes_a, boxes_b): + """ + :param boxes_a: (N, 7) [x, y, z, h, w, l, ry] + :param boxes_b: (M, 7) [x, y, z, h, w, l, ry] + :return: + ans_iou: (M, N) + """ + boxes_a_bev = boxes3d_to_bev_(boxes_a) + boxes_b_bev = boxes3d_to_bev_(boxes_b) + # bev overlap + num_a = boxes_a_bev.shape[0] + num_b = boxes_b_bev.shape[0] + overlaps_bev = np.zeros((num_a, num_b), dtype=np.float32) + for i in range(num_a): + for j in range(num_b): + overlaps_bev[i][j] = intersection_area(boxes_a_bev[i], boxes_b_bev[j]) + + # height overlap + boxes_a_height_min = (boxes_a[:, 1] - boxes_a[:, 3]).reshape(-1, 1) + boxes_a_height_max = boxes_a[:, 1].reshape(-1, 1) + boxes_b_height_min = (boxes_b[:, 1] - boxes_b[:, 3]).reshape(1, -1) + boxes_b_height_max = boxes_b[:, 1].reshape(1, -1) + + max_of_min = np.maximum(boxes_a_height_min, boxes_b_height_min) + min_of_max = np.minimum(boxes_a_height_max, boxes_b_height_max) + overlaps_h = np.clip(min_of_max - max_of_min, a_min=0, a_max=np.inf) + # 3d iou + overlaps_3d = overlaps_bev * overlaps_h + + vol_a = (boxes_a[:, 3] * boxes_a[:, 4] * boxes_a[:, 5]).reshape(-1, 1) + vol_b = (boxes_b[:, 3] * boxes_b[:, 4] * boxes_b[:, 5]).reshape(1, -1) + + iou3d = overlaps_3d / np.clip(vol_a + vol_b - overlaps_3d, a_min=1e-7, a_max=np.inf) + return iou3d + +#if __name__ == '__main__': +# # (center, width, height, rotation) +# r1 = (10, 15, 15, 10, 30) +# r2 = (15, 15, 20, 10, 0) +# print(intersection_area(r1, r2)) diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/cyops/kitti_utils.pyx b/PaddleCV/Paddle3D/PointRCNN/utils/cyops/kitti_utils.pyx new file mode 100644 index 0000000000000000000000000000000000000000..593dd0c9354516a2861701c5103f8e9b10ae46b1 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/cyops/kitti_utils.pyx @@ -0,0 +1,346 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +import cython +import numpy as np +cimport numpy as np + +@cython.boundscheck(False) +@cython.wraparound(False) +def pts_in_boxes3d(np.ndarray pts_rect, np.ndarray boxes3d): + """ + :param pts: (N, 3) in rect-camera coords + :param boxes3d: (M, 7) + :return: boxes_pts_mask_list: (M), list with [(N), (N), ..] 
+ """ + cdef float MAX_DIS = 10.0 + cdef np.ndarray boxes_pts_mask_list = np.zeros((boxes3d.shape[0], pts_rect.shape[0]), dtype='int32') + cdef int boxes3d_num = boxes3d.shape[0] + cdef int pts_rect_num = pts_rect.shape[0] + cdef float cx, by, cz, h, w, l, angle, cy, cosa, sina, x_rot, z_rot + cdef int x, y, z + + for i in range(boxes3d_num): + cx, by, cz, h, w, l, angle = boxes3d[i, :] + cy = by - h / 2. + cosa = np.cos(angle) + sina = np.sin(angle) + for j in range(pts_rect_num): + x, y, z = pts_rect[j, :] + + if np.abs(x - cx) > MAX_DIS or np.abs(y - cy) > h / 2. or np.abs(z - cz) > MAX_DIS: + continue + + x_rot = (x - cx) * cosa + (z - cz) * (-sina) + z_rot = (x - cx) * sina + (z - cz) * cosa + boxes_pts_mask_list[i, j] = int(x_rot >= -l / 2. and x_rot <= l / 2. and + z_rot >= -w / 2. and z_rot <= w / 2.) + return boxes_pts_mask_list + + +@cython.boundscheck(False) +@cython.wraparound(False) +def rotate_pc_along_y(np.ndarray pc, float rot_angle): + """ + params pc: (N, 3+C), (N, 3) is in the rectified camera coordinate + params rot_angle: rad scalar + Output pc: updated pc with XYZ rotated + """ + cosval = np.cos(rot_angle) + sinval = np.sin(rot_angle) + rotmat = np.array([[cosval, -sinval], [sinval, cosval]]) + pc[:, [0, 2]] = np.dot(pc[:, [0, 2]], np.transpose(rotmat)) + return pc + + +@cython.boundscheck(False) +@cython.wraparound(False) +def rotate_pc_along_y_np(np.ndarray pc, np.ndarray rot_angle): + """ + :param pc: (N, 512, 3 + C) + :param rot_angle: (N) + :return: + TODO: merge with rotate_pc_along_y_torch in bbox_transform.py + """ + cdef np.ndarray cosa, sina, raw_1, raw_2, R, pc_temp + cosa = np.cos(rot_angle).reshape(-1, 1) + sina = np.sin(rot_angle).reshape(-1, 1) + raw_1 = np.concatenate([cosa, -sina], axis=1) + raw_2 = np.concatenate([sina, cosa], axis=1) + # # (N, 2, 2) + R = np.concatenate((np.expand_dims(raw_1, axis=1), np.expand_dims(raw_2, axis=1)), axis=1) + pc_temp = pc[:, :, [0, 2]] + pc[:, :, [0, 2]] = np.matmul(pc_temp, R.transpose(0, 2, 1)) + + return pc + + +@cython.boundscheck(False) +@cython.wraparound(False) +def enlarge_box3d(np.ndarray boxes3d, float extra_width): + """ + :param boxes3d: (N, 7) [x, y, z, h, w, l, ry] + """ + cdef np.ndarray large_boxes3d + if isinstance(boxes3d, np.ndarray): + large_boxes3d = boxes3d.copy() + else: + large_boxes3d = boxes3d.clone() + large_boxes3d[:, 3:6] += extra_width * 2 + large_boxes3d[:, 1] += extra_width + + return large_boxes3d + + +@cython.boundscheck(False) +@cython.wraparound(False) +def boxes3d_to_corners3d(np.ndarray boxes3d, bint rotate=True): + """ + :param boxes3d: (N, 7) [x, y, z, h, w, l, ry] + :param rotate: + :return: corners3d: (N, 8, 3) + """ + cdef int boxes_num = boxes3d.shape[0] + cdef np.ndarray h, w, l + h, w, l = boxes3d[:, 3], boxes3d[:, 4], boxes3d[:, 5] + cdef np.ndarray x_corners, y_corners + x_corners = np.array([l / 2., l / 2., -l / 2., -l / 2., l / 2., l / 2., -l / 2., -l / 2.], dtype=np.float32).T # (N, 8) + z_corners = np.array([w / 2., -w / 2., -w / 2., w / 2., w / 2., -w / 2., -w / 2., w / 2.], dtype=np.float32).T # (N, 8) + + y_corners = np.zeros((boxes_num, 8), dtype=np.float32) + y_corners[:, 4:8] = -h.reshape(boxes_num, 1).repeat(4, axis=1) # (N, 8) + + cdef np.ndarray ry, zeros, ones, rot_list, R_list, temp_corners, rotated_corners + if rotate: + ry = boxes3d[:, 6] + zeros, ones = np.zeros(ry.size, dtype=np.float32), np.ones(ry.size, dtype=np.float32) + rot_list = np.array([[np.cos(ry), zeros, -np.sin(ry)], + [zeros, ones, zeros], + [np.sin(ry), zeros, np.cos(ry)]]) # (3, 3, N) 
+ R_list = np.transpose(rot_list, (2, 0, 1)) # (N, 3, 3) + + temp_corners = np.concatenate((x_corners.reshape(-1, 8, 1), y_corners.reshape(-1, 8, 1), + z_corners.reshape(-1, 8, 1)), axis=2) # (N, 8, 3) + rotated_corners = np.matmul(temp_corners, R_list) # (N, 8, 3) + x_corners, y_corners, z_corners = rotated_corners[:, :, 0], rotated_corners[:, :, 1], rotated_corners[:, :, 2] + + cdef np.ndarray x_loc, y_loc, z_loc + x_loc, y_loc, z_loc = boxes3d[:, 0], boxes3d[:, 1], boxes3d[:, 2] + + cdef np.ndarray x, y, z, corners + x = x_loc.reshape(-1, 1) + x_corners.reshape(-1, 8) + y = y_loc.reshape(-1, 1) + y_corners.reshape(-1, 8) + z = z_loc.reshape(-1, 1) + z_corners.reshape(-1, 8) + + corners = np.concatenate((x.reshape(-1, 8, 1), y.reshape(-1, 8, 1), z.reshape(-1, 8, 1)), axis=2).astype(np.float32) + + return corners + + +@cython.boundscheck(False) +@cython.wraparound(False) +def objs_to_boxes3d(obj_list): + cdef np.ndarray boxes3d = np.zeros((obj_list.__len__(), 7), dtype=np.float32) + cdef int k + for k, obj in enumerate(obj_list): + boxes3d[k, 0:3], boxes3d[k, 3], boxes3d[k, 4], boxes3d[k, 5], boxes3d[k, 6] \ + = obj.pos, obj.h, obj.w, obj.l, obj.ry + return boxes3d + + +@cython.boundscheck(False) +@cython.wraparound(False) +def objs_to_scores(obj_list): + cdef np.ndarray scores = np.zeros((obj_list.__len__()), dtype=np.float32) + cdef int k + for k, obj in enumerate(obj_list): + scores[k] = obj.score + return scores + + +def get_iou3d(np.ndarray corners3d, np.ndarray query_corners3d, bint need_bev=False): + """ + :param corners3d: (N, 8, 3) in rect coords + :param query_corners3d: (M, 8, 3) + :return: + """ + from shapely.geometry import Polygon + A, B = corners3d, query_corners3d + N, M = A.shape[0], B.shape[0] + iou3d = np.zeros((N, M), dtype=np.float32) + iou_bev = np.zeros((N, M), dtype=np.float32) + + # for height overlap, since y face down, use the negative y + min_h_a = -A[:, 0:4, 1].sum(axis=1) / 4.0 + max_h_a = -A[:, 4:8, 1].sum(axis=1) / 4.0 + min_h_b = -B[:, 0:4, 1].sum(axis=1) / 4.0 + max_h_b = -B[:, 4:8, 1].sum(axis=1) / 4.0 + + for i in range(N): + for j in range(M): + max_of_min = np.max([min_h_a[i], min_h_b[j]]) + min_of_max = np.min([max_h_a[i], max_h_b[j]]) + h_overlap = np.max([0, min_of_max - max_of_min]) + if h_overlap == 0: + continue + + bottom_a, bottom_b = Polygon(A[i, 0:4, [0, 2]].T), Polygon(B[j, 0:4, [0, 2]].T) + if bottom_a.is_valid and bottom_b.is_valid: + # check is valid, A valid Polygon may not possess any overlapping exterior or interior rings. + bottom_overlap = bottom_a.intersection(bottom_b).area + else: + bottom_overlap = 0. 
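+            # Combine the BEV polygon overlap with the height overlap to get
+            # the 3-D intersection volume, then normalise by the union of the
+            # two box volumes.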
+ overlap3d = bottom_overlap * h_overlap + union3d = bottom_a.area * (max_h_a[i] - min_h_a[i]) + bottom_b.area * (max_h_b[j] - min_h_b[j]) - overlap3d + iou3d[i][j] = overlap3d / union3d + iou_bev[i][j] = bottom_overlap / (bottom_a.area + bottom_b.area - bottom_overlap) + + if need_bev: + return iou3d, iou_bev + + return iou3d + + +def get_objects_from_label(label_file): + import utils.object3d as object3d + + with open(label_file, 'r') as f: + lines = f.readlines() + objects = [object3d.Object3d(line) for line in lines] + return objects + + +@cython.boundscheck(False) +@cython.wraparound(False) +def _rotate_pc_along_y(np.ndarray pc, np.ndarray angle): + cdef np.ndarray cosa = np.cos(angle) + cosa=cosa.reshape(-1, 1) + cdef np.ndarray sina = np.sin(angle) + sina = sina.reshape(-1, 1) + + cdef np.ndarray R = np.concatenate([cosa, -sina, sina, cosa], axis=-1) + R = R.reshape(-1, 2, 2) + cdef np.ndarray pc_temp = pc[:, [0, 2]] + pc_temp = pc_temp.reshape(-1, 1, 2) + cdef np.ndarray pc_temp_1 = np.matmul(pc_temp, R.transpose(0, 2, 1)) + pc_temp_1 = pc_temp_1.reshape(-1, 2) + pc[:,[0,2]] = pc_temp_1 + + return pc + +@cython.boundscheck(False) +@cython.wraparound(False) +def decode_bbox_target( + np.ndarray roi_box3d, + np.ndarray pred_reg, + np.ndarray anchor_size, + float loc_scope, + float loc_bin_size, + int num_head_bin, + bint get_xz_fine=True, + float loc_y_scope=0.5, + float loc_y_bin_size=0.25, + bint get_y_by_bin=False, + bint get_ry_fine=False): + + cdef int per_loc_bin_num = int(loc_scope / loc_bin_size) * 2 + cdef int loc_y_bin_num = int(loc_y_scope / loc_y_bin_size) * 2 + + # recover xz localization + cdef int x_bin_l = 0 + cdef int x_bin_r = per_loc_bin_num + cdef int z_bin_l = per_loc_bin_num, + cdef int z_bin_r = per_loc_bin_num * 2 + cdef int start_offset = z_bin_r + cdef np.ndarray x_bin = np.argmax(pred_reg[:, x_bin_l: x_bin_r], axis=1) + cdef np.ndarray z_bin = np.argmax(pred_reg[:, z_bin_l: z_bin_r], axis=1) + + cdef np.ndarray pos_x = x_bin.astype('float32') * loc_bin_size + loc_bin_size / 2 - loc_scope + cdef np.ndarray pos_z = z_bin.astype('float32') * loc_bin_size + loc_bin_size / 2 - loc_scope + + if get_xz_fine: + x_res_l, x_res_r = per_loc_bin_num * 2, per_loc_bin_num * 3 + z_res_l, z_res_r = per_loc_bin_num * 3, per_loc_bin_num * 4 + start_offset = z_res_r + + x_res_norm = pred_reg[:, x_res_l:x_res_r][np.arange(len(x_bin)), x_bin] + z_res_norm = pred_reg[:, z_res_l:z_res_r][np.arange(len(z_bin)), z_bin] + + x_res = x_res_norm * loc_bin_size + z_res = z_res_norm * loc_bin_size + pos_x += x_res + pos_z += z_res + + # recover y localization + if get_y_by_bin: + y_bin_l, y_bin_r = start_offset, start_offset + loc_y_bin_num + y_res_l, y_res_r = y_bin_r, y_bin_r + loc_y_bin_num + start_offset = y_res_r + + y_bin = np.argmax(pred_reg[:, y_bin_l: y_bin_r], axis=1) + y_res_norm = pred_reg[:, y_res_l:y_res_r][np.arange(len(y_bin)), y_bin] + y_res = y_res_norm * loc_y_bin_size + pos_y = y_bin.astype('float32') * loc_y_bin_size + loc_y_bin_size / 2 - loc_y_scope + y_res + pos_y = pos_y + np.array(roi_box3d[:, 1]).reshape(-1) + else: + y_offset_l, y_offset_r = start_offset, start_offset + 1 + start_offset = y_offset_r + + pos_y = np.array(roi_box3d[:, 1]) + np.array(pred_reg[:, y_offset_l]) + pos_y = pos_y.reshape(-1) + + # recover ry rotation + cdef int ry_bin_l = start_offset, + cdef int ry_bin_r = start_offset + num_head_bin + cdef int ry_res_l = ry_bin_r, + cdef int ry_res_r = ry_bin_r + num_head_bin + + cdef np.ndarray ry_bin = np.argmax(pred_reg[:, ry_bin_l: ry_bin_r], 
axis=1) + cdef np.ndarray ry_res_norm = pred_reg[:, ry_res_l:ry_res_r][np.arange(len(ry_bin)), ry_bin] + if get_ry_fine: + # divide pi/2 into several bins + angle_per_class = (np.pi / 2) / num_head_bin + ry_res = ry_res_norm * (angle_per_class / 2) + ry = (ry_bin.astype('float32') * angle_per_class + angle_per_class / 2) + ry_res - np.pi / 4 + else: + angle_per_class = (2 * np.pi) / num_head_bin + ry_res = ry_res_norm * (angle_per_class / 2) + + # bin_center is (0, 30, 60, 90, 120, ..., 270, 300, 330) + ry = np.fmod(ry_bin.astype('float32') * angle_per_class + ry_res, 2 * np.pi) + ry[ry > np.pi] -= 2 * np.pi + + # recover size + cdef int size_res_l = ry_res_r + cdef int size_res_r = ry_res_r + 3 + assert size_res_r == pred_reg.shape[1] + + cdef np.ndarray size_res_norm = pred_reg[:, size_res_l: size_res_r] + cdef np.ndarray hwl = size_res_norm * anchor_size + anchor_size + + # shift to original coords + cdef np.ndarray roi_center = np.array(roi_box3d[:, 0:3]) + cdef np.ndarray shift_ret_box3d = np.concatenate(( + pos_x.reshape(-1, 1), + pos_y.reshape(-1, 1), + pos_z.reshape(-1, 1), + hwl, ry.reshape(-1, 1)), axis=1) + ret_box3d = shift_ret_box3d + if roi_box3d.shape[1] == 7: + roi_ry = np.array(roi_box3d[:, 6]).reshape(-1) + ret_box3d = _rotate_pc_along_y(np.array(shift_ret_box3d), -roi_ry) + ret_box3d[:, 6] += roi_ry + ret_box3d[:, [0, 2]] += roi_center[:, [0, 2]] + + return ret_box3d diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/cyops/object3d.py b/PaddleCV/Paddle3D/PointRCNN/utils/cyops/object3d.py new file mode 100644 index 0000000000000000000000000000000000000000..97d81421afa89a0e26daa4f956c4d835763cb966 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/cyops/object3d.py @@ -0,0 +1,107 @@ +""" +This code is borrow from https://github.com/sshaoshuai/PointRCNN/blob/master/lib/utils/object3d.py +""" +import numpy as np + + +def cls_type_to_id(cls_type): + type_to_id = {'Car': 1, 'Pedestrian': 2, 'Cyclist': 3, 'Van': 4} + if cls_type not in type_to_id.keys(): + return -1 + return type_to_id[cls_type] + + +class Object3d(object): + + def __init__(self, line): + label = line.strip().split(' ') + self.src = line + self.cls_type = label[0] + self.cls_id = cls_type_to_id(self.cls_type) + self.trucation = float(label[1]) + self.occlusion = float(label[2]) # 0:fully visible 1:partly occluded 2:largely occluded 3:unknown + self.alpha = float(label[3]) + self.box2d = np.array((float(label[4]), float(label[5]), float(label[6]), float(label[7])), dtype=np.float32) + self.h = float(label[8]) + self.w = float(label[9]) + self.l = float(label[10]) + self.pos = np.array((float(label[11]), float(label[12]), float(label[13])), dtype=np.float32) + self.dis_to_cam = np.linalg.norm(self.pos) + self.ry = float(label[14]) + self.score = float(label[15]) if label.__len__() == 16 else -1.0 + self.level_str = None + self.level = self.get_obj_level() + + def get_obj_level(self): + height = float(self.box2d[3]) - float(self.box2d[1]) + 1 + + if height >= 40 and self.trucation <= 0.15 and self.occlusion <= 0: + self.level_str = 'Easy' + return 1 # Easy + elif height >= 25 and self.trucation <= 0.3 and self.occlusion <= 1: + self.level_str = 'Moderate' + return 2 # Moderate + elif height >= 25 and self.trucation <= 0.5 and self.occlusion <= 2: + self.level_str = 'Hard' + return 3 # Hard + else: + self.level_str = 'UnKnown' + return 4 + + def generate_corners3d(self): + """ + generate corners3d representation for this object + :return corners_3d: (8, 3) corners of box3d in camera coord + """ + l, h, w = self.l, 
self.h, self.w + x_corners = [l / 2, l / 2, -l / 2, -l / 2, l / 2, l / 2, -l / 2, -l / 2] + y_corners = [0, 0, 0, 0, -h, -h, -h, -h] + z_corners = [w / 2, -w / 2, -w / 2, w / 2, w / 2, -w / 2, -w / 2, w / 2] + + R = np.array([[np.cos(self.ry), 0, np.sin(self.ry)], + [0, 1, 0], + [-np.sin(self.ry), 0, np.cos(self.ry)]]) + corners3d = np.vstack([x_corners, y_corners, z_corners]) # (3, 8) + corners3d = np.dot(R, corners3d).T + corners3d = corners3d + self.pos + return corners3d + + def to_bev_box2d(self, oblique=True, voxel_size=0.1): + """ + :param bev_shape: (2) for bev shape (h, w), => (y_max, x_max) in image + :param voxel_size: float, 0.1m + :param oblique: + :return: box2d (4, 2)/ (4) in image coordinate + """ + if oblique: + corners3d = self.generate_corners3d() + xz_corners = corners3d[0:4, [0, 2]] + box2d = np.zeros((4, 2), dtype=np.int32) + box2d[:, 0] = ((xz_corners[:, 0] - Object3d.MIN_XZ[0]) / voxel_size).astype(np.int32) + box2d[:, 1] = Object3d.BEV_SHAPE[0] - 1 - ((xz_corners[:, 1] - Object3d.MIN_XZ[1]) / voxel_size).astype(np.int32) + box2d[:, 0] = np.clip(box2d[:, 0], 0, Object3d.BEV_SHAPE[1]) + box2d[:, 1] = np.clip(box2d[:, 1], 0, Object3d.BEV_SHAPE[0]) + else: + box2d = np.zeros(4, dtype=np.int32) + # discrete_center = np.floor((self.pos / voxel_size)).astype(np.int32) + cu = np.floor((self.pos[0] - Object3d.MIN_XZ[0]) / voxel_size).astype(np.int32) + cv = Object3d.BEV_SHAPE[0] - 1 - ((self.pos[2] - Object3d.MIN_XZ[1]) / voxel_size).astype(np.int32) + half_l, half_w = int(self.l / voxel_size / 2), int(self.w / voxel_size / 2) + box2d[0], box2d[1] = cu - half_l, cv - half_w + box2d[2], box2d[3] = cu + half_l, cv + half_w + + return box2d + + def to_str(self): + print_str = '%s %.3f %.3f %.3f box2d: %s hwl: [%.3f %.3f %.3f] pos: %s ry: %.3f' \ + % (self.cls_type, self.trucation, self.occlusion, self.alpha, self.box2d, self.h, self.w, self.l, + self.pos, self.ry) + return print_str + + def to_kitti_format(self): + kitti_str = '%s %.2f %d %.2f %.2f %.2f %.2f %.2f %.2f %.2f %.2f %.2f %.2f %.2f %.2f' \ + % (self.cls_type, self.trucation, int(self.occlusion), self.alpha, self.box2d[0], self.box2d[1], + self.box2d[2], self.box2d[3], self.h, self.w, self.l, self.pos[0], self.pos[1], self.pos[2], + self.ry) + return kitti_str + diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/cyops/roipool3d_utils.pyx b/PaddleCV/Paddle3D/PointRCNN/utils/cyops/roipool3d_utils.pyx new file mode 100644 index 0000000000000000000000000000000000000000..3efa83135fed11d3e3a3daceb821c63424beb524 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/cyops/roipool3d_utils.pyx @@ -0,0 +1,160 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
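+#
+# CPU RoI pooling for PointRCNN: for every (slightly enlarged) 3D proposal box,
+# the points and per-point features falling inside it are gathered into a
+# fixed-size sample. A usage sketch (shapes follow the comments in
+# roipool3d_cpu below; the keyword values are only illustrative):
+#
+#   pooled_pts, pooled_features, empty_flag = roipool3d_cpu(
+#       pts,              # (16384, 3)  float32 xyz
+#       pts_feature,      # (16384, C)  float32 per-point features
+#       boxes3d,          # (64, 7)     float32 [x, y, z, h, w, l, ry]
+#       pts_extra_input,  # (16384, E)  float32 extra channels (mask, depth, ...)
+#       pool_extra_width=1, sampled_pt_num=512)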
+ +import numpy as np +cimport numpy as np +cimport cython +from libc.math cimport sin, cos + +@cython.boundscheck(False) +@cython.wraparound(False) +cdef enlarge_box3d(np.ndarray boxes3d, int extra_width): + """ + :param boxes3d: (N, 7) [x, y, z, h, w, l, ry] + """ + if isinstance(boxes3d, np.ndarray): + large_boxes3d = boxes3d.copy() + else: + large_boxes3d = boxes3d.clone() + large_boxes3d[:, 3:6] += extra_width * 2 + large_boxes3d[:, 1] += extra_width + return large_boxes3d + +@cython.boundscheck(False) +@cython.wraparound(False) +cdef pt_in_box(float x, float y, float z, float cx, float bottom_y, float cz, float h, float w, float l, float angle): + cdef float max_ids = 10.0 + cdef float cy = bottom_y - h / 2.0 + if ((abs(x - cx) > max_ids) or (abs(y - cy) > h / 2.0) or (abs(z - cz) > max_ids)): + return 0 + cdef float cosa = cos(angle) + cdef float sina = sin(angle) + cdef float x_rot = (x - cx) * cosa + (z - cz) * (-sina) + + cdef float z_rot = (x - cx) * sina + (z - cz) * cosa + + cdef float flag = (x_rot >= -l / 2.0) and (x_rot <= l / 2.0) and (z_rot >= -w / 2.0) and (z_rot <= w / 2.0) + return flag + +@cython.boundscheck(False) +@cython.wraparound(False) +cdef _rotate_pc_along_y(np.ndarray pc, float rot_angle): + """ + params pc: (N, 3+C), (N, 3) is in the rectified camera coordinate + params rot_angle: rad scalar + Output pc: updated pc with XYZ rotated + """ + cosval = np.cos(rot_angle) + sinval = np.sin(rot_angle) + rotmat = np.array([[cosval, -sinval], [sinval, cosval]]) + pc[:, [0, 2]] = np.dot(pc[:, [0, 2]], np.transpose(rotmat)) + return pc + +@cython.boundscheck(False) +@cython.wraparound(False) +def roipool3d_cpu( + np.ndarray[float, ndim=2] pts, + np.ndarray[float, ndim=2] pts_feature, + np.ndarray[float, ndim=2] boxes3d, + np.ndarray[float, ndim=2] pts_extra_input, + int pool_extra_width, int sampled_pt_num, int batch_size=1, bint canonical_transform=False): + cdef np.ndarray pts_feature_all = np.concatenate((pts_extra_input, pts_feature), axis=1) + + cdef np.ndarray larged_boxes3d = enlarge_box3d(boxes3d.reshape(-1, 7), pool_extra_width).reshape(batch_size, -1, 7) + + cdef int pts_num = pts.shape[0], + cdef int boxes_num = boxes3d.shape[0] + cdef int feature_len = pts_feature_all.shape[1] + cdef np.ndarray pts_data = np.zeros((batch_size, boxes_num, sampled_pt_num, 3)) + cdef np.ndarray features_data = np.zeros((batch_size, boxes_num, sampled_pt_num, feature_len)) + cdef np.ndarray empty_flag_data = np.zeros((batch_size, boxes_num)) + + cdef int cnt = 0 + cdef float cx = 0. + cdef float bottom_y = 0. + cdef float cz = 0. + cdef float h = 0. + cdef float w = 0. + cdef float l = 0. + cdef float ry = 0. + cdef float x = 0. + cdef float y = 0. + cdef float z = 0. 
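+    # The loop below gathers, for every (batch, box) pair, up to sampled_pt_num
+    # points lying inside the enlarged box together with their features; boxes
+    # with fewer points repeat them cyclically, and boxes with no points are
+    # flagged in empty_flag_data.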
+ cdef np.ndarray x_i + cdef np.ndarray feat_i + cdef int bs + cdef int i + cdef int j + for bs in range(batch_size): + # boxes: 64,7 + for i in range(boxes_num): + cnt = 0 + # box + box = larged_boxes3d[bs][i] + cx = box[0] + bottom_y = box[1] + cz = box[2] + h = box[3] + w = box[4] + l = box[5] + ry = box[6] + # points: 16384,3 + x_i = pts + # features: 16384, 128 + feat_i = pts_feature_all + + for j in range(pts_num): + x = x_i[j][0] + y = x_i[j][1] + z = x_i[j][2] + cur_in_flag = pt_in_box(x,y,z,cx,bottom_y,cz,h,w,l,ry) + if cur_in_flag: + if cnt < sampled_pt_num: + pts_data[bs][i][cnt][:] = x_i[j] + features_data[bs][i][cnt][:] = feat_i[j] + cnt += 1 + else: + break + + if cnt == 0: + empty_flag_data[bs][i] = 1 + elif (cnt < sampled_pt_num): + for k in range(cnt, sampled_pt_num): + pts_data[bs][i][k] = pts_data[bs][i][k % cnt] + features_data[bs][i][k] = features_data[bs][i][k % cnt] + + + pooled_pts = pts_data.astype("float32")[0] + pooled_features = features_data.astype('float32')[0] + pooled_empty_flag = empty_flag_data.astype('int64')[0] + + cdef int extra_input_len = pts_extra_input.shape[1] + pooled_pts = np.concatenate((pooled_pts, pooled_features[:,:,0:extra_input_len]),axis=2) + pooled_features = pooled_features[:,:,extra_input_len:] + + if canonical_transform: + # Translate to the roi coordinates + roi_ry = boxes3d[:, 6] % (2 * np.pi) # 0~2pi + roi_center = boxes3d[:, 0:3] + # shift to center + pooled_pts[:, :, 0:3] = pooled_pts[:, :, 0:3] - roi_center[:, np.newaxis, :] + for k in range(pooled_pts.shape[0]): + pooled_pts[k] = _rotate_pc_along_y(pooled_pts[k], roi_ry[k]) + return pooled_pts, pooled_features, pooled_empty_flag + + return pooled_pts, pooled_features, pooled_empty_flag + + +#def roipool3d_cpu(pts, pts_feature, boxes3d, pts_extra_input, pool_extra_width, sampled_pt_num=512, batch_size=1): +# return _roipool3d_cpu(pts, pts_feature, boxes3d, pts_extra_input, pool_extra_width, sampled_pt_num, batch_size) diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/cyops/setup.py b/PaddleCV/Paddle3D/PointRCNN/utils/cyops/setup.py new file mode 100644 index 0000000000000000000000000000000000000000..0d775017468bbb683d0ea0f0058062e5de12da73 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/cyops/setup.py @@ -0,0 +1,74 @@ +# Copyright (c) 2017-present, Facebook, Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
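+#
+# The extensions are typically built in place from the PointRCNN root (so the
+# relative source paths resolve), e.g.:
+#     python utils/cyops/setup.py build_ext --inplace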
+############################################################################## + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from Cython.Build import cythonize +from setuptools import Extension +from setuptools import setup + +import numpy as np + +_NP_INCLUDE_DIRS = np.get_include() + + +# Extension modules +ext_modules = [ + Extension( + name='utils.cyops.roipool3d_utils', + sources=[ + 'utils/cyops/roipool3d_utils.pyx' + ], + extra_compile_args=[ + '-Wno-cpp' + ], + include_dirs=[ + _NP_INCLUDE_DIRS + ] + ), + + Extension( + name='utils.cyops.iou3d_utils', + sources=[ + 'utils/cyops/iou3d_utils.pyx' + ], + extra_compile_args=[ + '-Wno-cpp' + ], + include_dirs=[ + _NP_INCLUDE_DIRS + ] + ), + + Extension( + name='utils.cyops.kitti_utils', + sources=[ + 'utils/cyops/kitti_utils.pyx' + ], + extra_compile_args=[ + '-Wno-cpp' + ], + include_dirs=[ + _NP_INCLUDE_DIRS + ] + ), +] + +setup( + name='pp_pointrcnn', + ext_modules=cythonize(ext_modules) +) diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/metric_utils.py b/PaddleCV/Paddle3D/PointRCNN/utils/metric_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..aa7ee70652ac4e76aef9f4d755ec057ef2bc9123 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/metric_utils.py @@ -0,0 +1,216 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import sys +import logging +import numpy as np +import utils.cyops.kitti_utils as kitti_utils +from utils.config import cfg +from utils.box_utils import boxes_iou3d, box_nms_eval, boxes3d_to_bev +from utils.save_utils import save_rpn_feature, save_kitti_result, save_kitti_format + +__all__ = ['calc_iou_recall', 'rpn_metric', 'rcnn_metric'] + +logging.root.handlers = [] +FORMAT = '%(asctime)s-%(levelname)s: %(message)s' +logging.basicConfig(level=logging.INFO, format=FORMAT, stream=sys.stdout) +logger = logging.getLogger(__name__) + + +def calc_iou_recall(rets, thresh_list): + rpn_cls_label = rets['rpn_cls_label'][0] + boxes3d = rets['rois'][0] + seg_mask = rets['seg_mask'][0] + sample_id = rets['sample_id'][0] + gt_boxes3d = rets['gt_boxes3d'][0] + gt_boxes3d_num = rets['gt_boxes3d'][1] + + gt_box_idx = 0 + recalled_bbox_list = [0] * len(thresh_list) + gt_box_num = 0 + rpn_iou_sum = 0. 
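+    # Per sample: strip the zero-padded gt boxes, count gt boxes recalled at
+    # each IoU threshold against the predicted rois, and accumulate the
+    # foreground IoU between the predicted seg_mask and the rpn_cls_label.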
+ for i in range(len(gt_boxes3d_num)): + cur_rpn_cls_label = rpn_cls_label[i] + cur_boxes3d = boxes3d[i] + cur_seg_mask = seg_mask[i] + cur_sample_id = sample_id[i] + cur_gt_boxes3d = gt_boxes3d[gt_box_idx: gt_box_idx + + gt_boxes3d_num[0][i]] + gt_box_idx += gt_boxes3d_num[0][i] + + k = cur_gt_boxes3d.__len__() - 1 + while k >= 0 and np.sum(cur_gt_boxes3d[k]) == 0: + k -= 1 + cur_gt_boxes3d = cur_gt_boxes3d[:k + 1] + + if cur_gt_boxes3d.shape[0] > 0: + iou3d = boxes_iou3d(cur_boxes3d, cur_gt_boxes3d[:, 0:7]) + gt_max_iou = iou3d.max(axis=0) + + for idx, thresh in enumerate(thresh_list): + recalled_bbox_list[idx] += np.sum(gt_max_iou > thresh) + gt_box_num += cur_gt_boxes3d.__len__() + + fg_mask = cur_rpn_cls_label > 0 + correct = np.sum(np.logical_and( + cur_seg_mask == cur_rpn_cls_label, fg_mask)) + union = np.sum(fg_mask) + np.sum(cur_seg_mask > 0) - correct + rpn_iou = float(correct) / max(float(union), 1.0) + rpn_iou_sum += rpn_iou + logger.debug('sample_id:{}, rpn_iou:{}, gt_box_num:{}, recalled_bbox_list:{}'.format( + sample_id, rpn_iou, gt_box_num, str(recalled_bbox_list))) + + return len(gt_boxes3d_num), gt_box_num, rpn_iou_sum, recalled_bbox_list + + +def rpn_metric(queue, mdict, lock, thresh_list, is_save_rpn_feature, kitti_feature_dir, + seg_output_dir, kitti_output_dir, kitti_rcnn_reader, classes): + while True: + rets_dict = queue.get() + if rets_dict is None: + lock.acquire() + mdict['exit_proc'] += 1 + lock.release() + return + + cnt, gt_box_num, rpn_iou_sum, recalled_bbox_list = calc_iou_recall( + rets_dict, thresh_list) + lock.acquire() + mdict['total_cnt'] += cnt + mdict['total_gt_bbox'] += gt_box_num + mdict['total_rpn_iou'] += rpn_iou_sum + for i, bbox_num in enumerate(recalled_bbox_list): + mdict['total_recalled_bbox_list_{}'.format(i)] += bbox_num + logger.debug("rpn_metric: {}".format(str(mdict))) + lock.release() + + if is_save_rpn_feature: + save_rpn_feature(rets_dict, kitti_feature_dir) + save_kitti_result( + rets_dict, seg_output_dir, kitti_output_dir, kitti_rcnn_reader, classes) + + +def rcnn_metric(queue, mdict, lock, thresh_list, kitti_rcnn_reader, roi_output_dir, + refine_output_dir, final_output_dir, is_save_result=False): + while True: + rets_dict = queue.get() + if rets_dict is None: + lock.acquire() + mdict['exit_proc'] += 1 + lock.release() + return + + for k,v in rets_dict.items(): + rets_dict[k] = v[0] + + rcnn_cls = rets_dict['rcnn_cls'] + rcnn_reg = rets_dict['rcnn_reg'] + roi_boxes3d = rets_dict['roi_boxes3d'] + roi_scores = rets_dict['roi_scores'] + + # bounding box regression + anchor_size = cfg.CLS_MEAN_SIZE[0] + pred_boxes3d = kitti_utils.decode_bbox_target( + roi_boxes3d, + rcnn_reg, + anchor_size=np.array(anchor_size), + loc_scope=cfg.RCNN.LOC_SCOPE, + loc_bin_size=cfg.RCNN.LOC_BIN_SIZE, + num_head_bin=cfg.RCNN.NUM_HEAD_BIN, + get_xz_fine=True, + get_y_by_bin=cfg.RCNN.LOC_Y_BY_BIN, + loc_y_scope=cfg.RCNN.LOC_Y_SCOPE, + loc_y_bin_size=cfg.RCNN.LOC_Y_BIN_SIZE, + get_ry_fine=True + ) + + # scoring + if rcnn_cls.shape[1] == 1: + raw_scores = rcnn_cls.reshape(-1) + norm_scores = rets_dict['norm_scores'] + pred_classes = norm_scores > cfg.RCNN.SCORE_THRESH + pred_classes = pred_classes.astype(np.float32) + else: + pred_classes = np.argmax(rcnn_cls, axis=1).reshape(-1) + raw_scores = rcnn_cls[:, pred_classes] + + # evaluation + gt_iou = rets_dict['gt_iou'] + gt_boxes3d = rets_dict['gt_boxes3d'] + + # recall + if gt_boxes3d.size > 0: + gt_num = gt_boxes3d.shape[1] + gt_boxes3d = gt_boxes3d.reshape((-1,7)) + iou3d = boxes_iou3d(pred_boxes3d, 
gt_boxes3d) + gt_max_iou = iou3d.max(axis=0) + refined_iou = iou3d.max(axis=1) + + recalled_num = (gt_max_iou > 0.7).sum() + roi_boxes3d = roi_boxes3d.reshape((-1,7)) + iou3d_in = boxes_iou3d(roi_boxes3d, gt_boxes3d) + gt_max_iou_in = iou3d_in.max(axis=0) + + lock.acquire() + mdict['total_gt_bbox'] += gt_num + for idx, thresh in enumerate(thresh_list): + recalled_bbox_num = (gt_max_iou > thresh).sum() + mdict['total_recalled_bbox_list_{}'.format(idx)] += recalled_bbox_num + for idx, thresh in enumerate(thresh_list): + roi_recalled_bbox_num = (gt_max_iou_in > thresh).sum() + mdict['total_roi_recalled_bbox_list_{}'.format(idx)] += roi_recalled_bbox_num + lock.release() + + # classification accuracy + cls_label = gt_iou > cfg.RCNN.CLS_FG_THRESH + cls_label = cls_label.astype(np.float32) + cls_valid_mask = (gt_iou >= cfg.RCNN.CLS_FG_THRESH) | (gt_iou <= cfg.RCNN.CLS_BG_THRESH) + cls_valid_mask = cls_valid_mask.astype(np.float32) + cls_acc = (pred_classes == cls_label).astype(np.float32) + cls_acc = (cls_acc * cls_valid_mask).sum() / max(cls_valid_mask.sum(), 1.0) * 1.0 + + iou_thresh = 0.7 if cfg.CLASSES == 'Car' else 0.5 + cls_label_refined = (gt_iou >= iou_thresh) + cls_label_refined = cls_label_refined.astype(np.float32) + cls_acc_refined = (pred_classes == cls_label_refined).astype(np.float32).sum() / max(cls_label_refined.shape[0], 1.0) + + sample_id = rets_dict['sample_id'] + image_shape = kitti_rcnn_reader.get_image_shape(sample_id) + + if is_save_result: + roi_boxes3d_np = roi_boxes3d + pred_boxes3d_np = pred_boxes3d + calib = kitti_rcnn_reader.get_calib(sample_id) + save_kitti_format(sample_id, calib, roi_boxes3d_np, roi_output_dir, roi_scores, image_shape) + save_kitti_format(sample_id, calib, pred_boxes3d_np, refine_output_dir, raw_scores, image_shape) + + inds = norm_scores > cfg.RCNN.SCORE_THRESH + if inds.astype(np.float32).sum() == 0: + logger.debug("The num of 'norm_scores > thresh' of sample {} is 0".format(sample_id)) + continue + pred_boxes3d_selected = pred_boxes3d[inds] + raw_scores_selected = raw_scores[inds] + # NMS thresh + boxes_bev_selected = boxes3d_to_bev(pred_boxes3d_selected) + scores_selected, pred_boxes3d_selected = box_nms_eval(boxes_bev_selected, raw_scores_selected, pred_boxes3d_selected, cfg.RCNN.NMS_THRESH) + calib = kitti_rcnn_reader.get_calib(sample_id) + save_kitti_format(sample_id, calib, pred_boxes3d_selected, final_output_dir, scores_selected, image_shape) + lock.acquire() + mdict['total_det_num'] += pred_boxes3d_selected.shape[0] + mdict['total_cls_acc'] += cls_acc + mdict['total_cls_acc_refined'] += cls_acc_refined + lock.release() + logger.debug("rcnn_metric: {}".format(str(mdict))) + diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/object3d.py b/PaddleCV/Paddle3D/PointRCNN/utils/object3d.py new file mode 100644 index 0000000000000000000000000000000000000000..7b5703bdbfba1c1bf239c2a2c9f2179ea908a7e5 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/object3d.py @@ -0,0 +1,113 @@ +""" +This code is borrow from https://github.com/sshaoshuai/PointRCNN/blob/master/lib/utils/object3d.py +""" +import numpy as np + + +def cls_type_to_id(cls_type): + type_to_id = {'Car': 1, 'Pedestrian': 2, 'Cyclist': 3, 'Van': 4} + if cls_type not in type_to_id.keys(): + return -1 + return type_to_id[cls_type] + + +def get_objects_from_label(label_file): + with open(label_file, 'r') as f: + lines = f.readlines() + objects = [Object3d(line) for line in lines] + return objects + + +class Object3d(object): + def __init__(self, line): + label = line.strip().split(' ') + 
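+        # KITTI label fields: type, truncation, occlusion, alpha, 2D box
+        # (left, top, right, bottom), h, w, l, location (x, y, z) in camera
+        # coordinates, rotation_y and an optional score, e.g.
+        # "Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59"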
self.src = line + self.cls_type = label[0] + self.cls_id = cls_type_to_id(self.cls_type) + self.trucation = float(label[1]) + self.occlusion = float(label[2]) # 0:fully visible 1:partly occluded 2:largely occluded 3:unknown + self.alpha = float(label[3]) + self.box2d = np.array((float(label[4]), float(label[5]), float(label[6]), float(label[7])), dtype=np.float32) + self.h = float(label[8]) + self.w = float(label[9]) + self.l = float(label[10]) + self.pos = np.array((float(label[11]), float(label[12]), float(label[13])), dtype=np.float32) + self.dis_to_cam = np.linalg.norm(self.pos) + self.ry = float(label[14]) + self.score = float(label[15]) if label.__len__() == 16 else -1.0 + self.level_str = None + self.level = self.get_obj_level() + + def get_obj_level(self): + height = float(self.box2d[3]) - float(self.box2d[1]) + 1 + + if height >= 40 and self.trucation <= 0.15 and self.occlusion <= 0: + self.level_str = 'Easy' + return 1 # Easy + elif height >= 25 and self.trucation <= 0.3 and self.occlusion <= 1: + self.level_str = 'Moderate' + return 2 # Moderate + elif height >= 25 and self.trucation <= 0.5 and self.occlusion <= 2: + self.level_str = 'Hard' + return 3 # Hard + else: + self.level_str = 'UnKnown' + return 4 + + def generate_corners3d(self): + """ + generate corners3d representation for this object + :return corners_3d: (8, 3) corners of box3d in camera coord + """ + l, h, w = self.l, self.h, self.w + x_corners = [l / 2, l / 2, -l / 2, -l / 2, l / 2, l / 2, -l / 2, -l / 2] + y_corners = [0, 0, 0, 0, -h, -h, -h, -h] + z_corners = [w / 2, -w / 2, -w / 2, w / 2, w / 2, -w / 2, -w / 2, w / 2] + + R = np.array([[np.cos(self.ry), 0, np.sin(self.ry)], + [0, 1, 0], + [-np.sin(self.ry), 0, np.cos(self.ry)]]) + corners3d = np.vstack([x_corners, y_corners, z_corners]) # (3, 8) + corners3d = np.dot(R, corners3d).T + corners3d = corners3d + self.pos + return corners3d + + def to_bev_box2d(self, oblique=True, voxel_size=0.1): + """ + :param bev_shape: (2) for bev shape (h, w), => (y_max, x_max) in image + :param voxel_size: float, 0.1m + :param oblique: + :return: box2d (4, 2)/ (4) in image coordinate + """ + if oblique: + corners3d = self.generate_corners3d() + xz_corners = corners3d[0:4, [0, 2]] + box2d = np.zeros((4, 2), dtype=np.int32) + box2d[:, 0] = ((xz_corners[:, 0] - Object3d.MIN_XZ[0]) / voxel_size).astype(np.int32) + box2d[:, 1] = Object3d.BEV_SHAPE[0] - 1 - ((xz_corners[:, 1] - Object3d.MIN_XZ[1]) / voxel_size).astype(np.int32) + box2d[:, 0] = np.clip(box2d[:, 0], 0, Object3d.BEV_SHAPE[1]) + box2d[:, 1] = np.clip(box2d[:, 1], 0, Object3d.BEV_SHAPE[0]) + else: + box2d = np.zeros(4, dtype=np.int32) + # discrete_center = np.floor((self.pos / voxel_size)).astype(np.int32) + cu = np.floor((self.pos[0] - Object3d.MIN_XZ[0]) / voxel_size).astype(np.int32) + cv = Object3d.BEV_SHAPE[0] - 1 - ((self.pos[2] - Object3d.MIN_XZ[1]) / voxel_size).astype(np.int32) + half_l, half_w = int(self.l / voxel_size / 2), int(self.w / voxel_size / 2) + box2d[0], box2d[1] = cu - half_l, cv - half_w + box2d[2], box2d[3] = cu + half_l, cv + half_w + + return box2d + + def to_str(self): + print_str = '%s %.3f %.3f %.3f box2d: %s hwl: [%.3f %.3f %.3f] pos: %s ry: %.3f' \ + % (self.cls_type, self.trucation, self.occlusion, self.alpha, self.box2d, self.h, self.w, self.l, + self.pos, self.ry) + return print_str + + def to_kitti_format(self): + kitti_str = '%s %.2f %d %.2f %.2f %.2f %.2f %.2f %.2f %.2f %.2f %.2f %.2f %.2f %.2f' \ + % (self.cls_type, self.trucation, int(self.occlusion), self.alpha, self.box2d[0], 
self.box2d[1], + self.box2d[2], self.box2d[3], self.h, self.w, self.l, self.pos[0], self.pos[1], self.pos[2], + self.ry) + return kitti_str + diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/optimizer.py b/PaddleCV/Paddle3D/PointRCNN/utils/optimizer.py new file mode 100644 index 0000000000000000000000000000000000000000..e32d1df862de7692e520168a2b35f482535f3ac6 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/optimizer.py @@ -0,0 +1,122 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Optimization and learning rate scheduling.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np +import paddle.fluid as fluid +import paddle.fluid.layers.learning_rate_scheduler as lr_scheduler +from paddle.fluid.layers import control_flow + +import logging +logger = logging.getLogger(__name__) + +def cosine_warmup_decay(learning_rate, betas, warmup_factor, decay_factor, + total_step, warmup_pct): + def annealing_cos(start, end, pct): + "Cosine anneal from `start` to `end` as pct goes from 0.0 to 1.0." + cos_out = fluid.layers.cos(pct * np.pi) + 1. + return cos_out * (start - end) / 2. + end + + warmup_start_lr = learning_rate * warmup_factor + decay_end_lr = learning_rate * decay_factor + warmup_step = total_step * warmup_pct + + global_step = lr_scheduler._decay_step_counter() + + lr = fluid.layers.create_global_var( + shape=[1], + value=float(learning_rate), + dtype='float32', + persistable=True, + name="learning_rate") + beta1 = fluid.layers.create_global_var( + shape=[1], + value=float(betas[0]), + dtype='float32', + persistable=True, + name="beta1") + + warmup_step_var = fluid.layers.fill_constant( + shape=[1], dtype='float32', value=float(warmup_step), force_cpu=True) + + with control_flow.Switch() as switch: + with switch.case(global_step < warmup_step_var): + cur_lr = annealing_cos(warmup_start_lr, learning_rate, + global_step / warmup_step_var) + fluid.layers.assign(cur_lr, lr) + cur_beta1 = annealing_cos(betas[0], betas[1], + global_step / warmup_step_var) + fluid.layers.assign(cur_beta1, beta1) + with switch.case(global_step >= warmup_step_var): + cur_lr = annealing_cos(learning_rate, decay_end_lr, + (global_step - warmup_step_var) / (total_step - warmup_step)) + fluid.layers.assign(cur_lr, lr) + cur_beta1 = annealing_cos(betas[1], betas[0], + (global_step - warmup_step_var) / (total_step - warmup_step)) + fluid.layers.assign(cur_beta1, beta1) + + return lr, beta1 + + +def optimize(loss, + learning_rate, + warmup_factor, + decay_factor, + total_step, + warmup_pct, + train_program, + startup_prog, + weight_decay, + clip_norm, + beta1=[0.95, 0.85], + beta2=0.99, + scheduler='cosine_warmup_decay'): + + scheduled_lr= None + if scheduler == 'cosine_warmup_decay': + scheduled_lr, scheduled_beta1 = cosine_warmup_decay(learning_rate, beta1, warmup_factor, + decay_factor, total_step, + warmup_pct) + else: + raise ValueError("Unkown learning rate 
scheduler, should be " + "'cosine_warmup_decay'") + + optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr, + beta1=scheduled_beta1, + beta2=beta2) + fluid.clip.set_gradient_clip( + clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=clip_norm)) + + param_list = dict() + + if weight_decay > 0: + for param in train_program.global_block().all_parameters(): + param_list[param.name] = param * 1.0 + param_list[param.name].stop_gradient = True + + _, param_grads = optimizer.minimize(loss) + + if weight_decay > 0: + for param, grad in param_grads: + with param.block.program._optimized_guard( + [param, grad]), fluid.framework.name_scope("weight_decay"): + updated_param = param - param_list[ + param.name] * weight_decay * scheduled_lr + fluid.layers.assign(output=param, input=updated_param) + + return scheduled_lr diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/proposal_target.py b/PaddleCV/Paddle3D/PointRCNN/utils/proposal_target.py new file mode 100644 index 0000000000000000000000000000000000000000..deda51180bfb9007f1dadd265c3f33f397b1cccf --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/proposal_target.py @@ -0,0 +1,369 @@ +import numpy as np +from utils.cyops import kitti_utils, roipool3d_utils, iou3d_utils + +CLOSE_RANDOM = False + +def get_proposal_target_func(cfg, mode='TRAIN'): + + def sample_rois_for_rcnn(roi_boxes3d, gt_boxes3d): + """ + :param roi_boxes3d: (B, M, 7) + :param gt_boxes3d: (B, N, 8) [x, y, z, h, w, l, ry, cls] + :return + batch_rois: (B, N, 7) + batch_gt_of_rois: (B, N, 8) + batch_roi_iou: (B, N) + """ + + batch_size = roi_boxes3d.shape[0] + + #batch_size = 1 + fg_rois_per_image = int(np.round(cfg.RCNN.FG_RATIO * cfg.RCNN.ROI_PER_IMAGE)) + + batch_rois = np.zeros((batch_size, cfg.RCNN.ROI_PER_IMAGE, 7)) + batch_gt_of_rois = np.zeros((batch_size, cfg.RCNN.ROI_PER_IMAGE, 7)) + batch_roi_iou = np.zeros((batch_size, cfg.RCNN.ROI_PER_IMAGE)) + for idx in range(batch_size): + cur_roi, cur_gt = roi_boxes3d[idx], gt_boxes3d[idx] + k = cur_gt.shape[0] - 1 + while cur_gt[k].sum() == 0: + k -= 1 + cur_gt = cur_gt[:k + 1] + # include gt boxes in the candidate rois + iou3d = iou3d_utils.boxes_iou3d(cur_roi, cur_gt[:, 0:7]) # (M, N) + max_overlaps = np.max(iou3d, axis=1) + gt_assignment = np.argmax(iou3d, axis=1) + # sample fg, easy_bg, hard_bg + fg_thresh = min(cfg.RCNN.REG_FG_THRESH, cfg.RCNN.CLS_FG_THRESH) + fg_inds = np.where(max_overlaps >= fg_thresh)[0].reshape(-1) + + # TODO: this will mix the fg and bg when CLS_BG_THRESH_LO < iou < CLS_BG_THRESH + # fg_inds = torch.cat((fg_inds, roi_assignment), dim=0) # consider the roi which has max_iou with gt as fg + easy_bg_inds = np.where(max_overlaps < cfg.RCNN.CLS_BG_THRESH_LO)[0].reshape(-1) + hard_bg_inds = np.where((max_overlaps < cfg.RCNN.CLS_BG_THRESH) & (max_overlaps >= cfg.RCNN.CLS_BG_THRESH_LO))[0].reshape(-1) + + fg_num_rois = fg_inds.shape[0] + bg_num_rois = hard_bg_inds.shape[0] + easy_bg_inds.shape[0] + + if fg_num_rois > 0 and bg_num_rois > 0: + # sampling fg + fg_rois_per_this_image = min(fg_rois_per_image, fg_num_rois) + if CLOSE_RANDOM: + fg_inds = fg_inds[:fg_rois_per_this_image] + else: + rand_num = np.random.permutation(fg_num_rois) + fg_inds = fg_inds[rand_num[:fg_rois_per_this_image]] + + # sampling bg + bg_rois_per_this_image = cfg.RCNN.ROI_PER_IMAGE - fg_rois_per_this_image + bg_inds = sample_bg_inds(hard_bg_inds, easy_bg_inds, bg_rois_per_this_image) + + elif fg_num_rois > 0 and bg_num_rois == 0: + # sampling fg + rand_num = np.floor(np.random.rand(cfg.RCNN.ROI_PER_IMAGE) * fg_num_rois) + # rand_num = 
torch.from_numpy(rand_num).type_as(gt_boxes3d).long() + fg_inds = fg_inds[rand_num] + fg_rois_per_this_image = cfg.RCNN.ROI_PER_IMAGE + bg_rois_per_this_image = 0 + elif bg_num_rois > 0 and fg_num_rois == 0: + # sampling bg + bg_rois_per_this_image = cfg.RCNN.ROI_PER_IMAGE + bg_inds = sample_bg_inds(hard_bg_inds, easy_bg_inds, bg_rois_per_this_image) + + fg_rois_per_this_image = 0 + else: + import pdb + pdb.set_trace() + raise NotImplementedError + # augment the rois by noise + roi_list, roi_iou_list, roi_gt_list = [], [], [] + if fg_rois_per_this_image > 0: + fg_rois_src = cur_roi[fg_inds] + gt_of_fg_rois = cur_gt[gt_assignment[fg_inds]] + iou3d_src = max_overlaps[fg_inds] + fg_rois, fg_iou3d = aug_roi_by_noise( + fg_rois_src, gt_of_fg_rois, iou3d_src, aug_times=cfg.RCNN.ROI_FG_AUG_TIMES) + roi_list.append(fg_rois) + roi_iou_list.append(fg_iou3d) + roi_gt_list.append(gt_of_fg_rois) + + if bg_rois_per_this_image > 0: + bg_rois_src = cur_roi[bg_inds] + gt_of_bg_rois = cur_gt[gt_assignment[bg_inds]] + iou3d_src = max_overlaps[bg_inds] + aug_times = 1 if cfg.RCNN.ROI_FG_AUG_TIMES > 0 else 0 + bg_rois, bg_iou3d = aug_roi_by_noise( + bg_rois_src, gt_of_bg_rois, iou3d_src, aug_times=aug_times) + roi_list.append(bg_rois) + roi_iou_list.append(bg_iou3d) + roi_gt_list.append(gt_of_bg_rois) + + + rois = np.concatenate(roi_list, axis=0) + iou_of_rois = np.concatenate(roi_iou_list, axis=0) + gt_of_rois = np.concatenate(roi_gt_list, axis=0) + batch_rois[idx] = rois + batch_gt_of_rois[idx] = gt_of_rois + batch_roi_iou[idx] = iou_of_rois + + return batch_rois, batch_gt_of_rois, batch_roi_iou + + def sample_bg_inds(hard_bg_inds, easy_bg_inds, bg_rois_per_this_image): + + if hard_bg_inds.shape[0] > 0 and easy_bg_inds.shape[0] > 0: + hard_bg_rois_num = int(bg_rois_per_this_image * cfg.RCNN.HARD_BG_RATIO) + easy_bg_rois_num = bg_rois_per_this_image - hard_bg_rois_num + # sampling hard bg + if CLOSE_RANDOM: + rand_idx = list(np.arange(0,hard_bg_inds.shape[0]))*hard_bg_rois_num + rand_idx = rand_idx[:hard_bg_rois_num] + else: + rand_idx = np.random.randint(low=0, high=hard_bg_inds.shape[0], size=(hard_bg_rois_num,)) + hard_bg_inds = hard_bg_inds[rand_idx] + # sampling easy bg + if CLOSE_RANDOM: + rand_idx = list(np.arange(0,easy_bg_inds.shape[0]))*easy_bg_rois_num + rand_idx = rand_idx[:easy_bg_rois_num] + else: + rand_idx = np.random.randint(low=0, high=easy_bg_inds.shape[0], size=(easy_bg_rois_num,)) + easy_bg_inds = easy_bg_inds[rand_idx] + bg_inds = np.concatenate([hard_bg_inds, easy_bg_inds], axis=0) + elif hard_bg_inds.shape[0] > 0 and easy_bg_inds.shape[0] == 0: + hard_bg_rois_num = bg_rois_per_this_image + # sampling hard bg + rand_idx = np.random.randint(low=0, high=hard_bg_inds.shape[0], size=(hard_bg_rois_num,)) + bg_inds = hard_bg_inds[rand_idx] + elif hard_bg_inds.shape[0] == 0 and easy_bg_inds.shape[0] > 0: + easy_bg_rois_num = bg_rois_per_this_image + # sampling easy bg + rand_idx = np.random.randint(low=0, high=easy_bg_inds.shape[0], size=(easy_bg_rois_num,)) + bg_inds = easy_bg_inds[rand_idx] + else: + raise NotImplementedError + + return bg_inds + + def aug_roi_by_noise(roi_boxes3d, gt_boxes3d, iou3d_src, aug_times=10): + iou_of_rois = np.zeros(roi_boxes3d.shape[0]).astype(gt_boxes3d.dtype) + pos_thresh = min(cfg.RCNN.REG_FG_THRESH, cfg.RCNN.CLS_FG_THRESH) + + for k in range(roi_boxes3d.shape[0]): + temp_iou = cnt = 0 + roi_box3d = roi_boxes3d[k] + + gt_box3d = gt_boxes3d[k].reshape(1, 7) + aug_box3d = roi_box3d + keep = True + while temp_iou < pos_thresh and cnt < aug_times: + if True: 
#np.random.rand() < 0.2: + aug_box3d = roi_box3d # p=0.2 to keep the original roi box + keep = True + else: + aug_box3d = random_aug_box3d(roi_box3d) + keep = False + aug_box3d = aug_box3d.reshape((1, 7)) + iou3d = iou3d_utils.boxes_iou3d(aug_box3d, gt_box3d) + temp_iou = iou3d[0][0] + cnt += 1 + roi_boxes3d[k] = aug_box3d.reshape(-1) + if cnt == 0 or keep: + iou_of_rois[k] = iou3d_src[k] + else: + iou_of_rois[k] = temp_iou + return roi_boxes3d, iou_of_rois + + def random_aug_box3d(box3d): + """ + :param box3d: (7) [x, y, z, h, w, l, ry] + random shift, scale, orientation + """ + if cfg.RCNN.REG_AUG_METHOD == 'single': + + pos_shift = (np.random.rand(3) - 0.5) # [-0.5 ~ 0.5] + hwl_scale = (np.random.rand(3) - 0.5) / (0.5 / 0.15) + 1.0 # + angle_rot = (np.random.rand(1) - 0.5) / (0.5 / (np.pi / 12)) # [-pi/12 ~ pi/12] + aug_box3d = np.concatenate([box3d[0:3] + pos_shift, box3d[3:6] * hwl_scale, box3d[6:7] + angle_rot], axis=0) + return aug_box3d + elif cfg.RCNN.REG_AUG_METHOD == 'multiple': + # pos_range, hwl_range, angle_range, mean_iou + range_config = [[0.2, 0.1, np.pi / 12, 0.7], + [0.3, 0.15, np.pi / 12, 0.6], + [0.5, 0.15, np.pi / 9, 0.5], + [0.8, 0.15, np.pi / 6, 0.3], + [1.0, 0.15, np.pi / 3, 0.2]] + idx = np.random.randint(low=0, high=len(range_config), size=(1,))[0] + pos_shift = ((np.random.rand(3) - 0.5) / 0.5) * range_config[idx][0] + hwl_scale = ((np.random.rand(3) - 0.5) / 0.5) * range_config[idx][1] + 1.0 + angle_rot = ((np.random.rand(1) - 0.5) / 0.5) * range_config[idx][2] + aug_box3d = np.concatenate([box3d[0:3] + pos_shift, box3d[3:6] * hwl_scale, box3d[6:7] + angle_rot], axis=0) + return aug_box3d + elif cfg.RCNN.REG_AUG_METHOD == 'normal': + x_shift = np.random.normal(loc=0, scale=0.3) + y_shift = np.random.normal(loc=0, scale=0.2) + z_shift = np.random.normal(loc=0, scale=0.3) + h_shift = np.random.normal(loc=0, scale=0.25) + w_shift = np.random.normal(loc=0, scale=0.15) + l_shift = np.random.normal(loc=0, scale=0.5) + ry_shift = ((np.random.rand() - 0.5) / 0.5) * np.pi / 12 + aug_box3d = np.array([box3d[0] + x_shift, box3d[1] + y_shift, box3d[2] + z_shift, box3d[3] + h_shift, + box3d[4] + w_shift, box3d[5] + l_shift, box3d[6] + ry_shift], dtype=np.float32) + aug_box3d = aug_box3d.astype(box3d.dtype) + return aug_box3d + else: + raise NotImplementedError + + def data_augmentation(pts, rois, gt_of_rois): + """ + :param pts: (B, M, 512, 3) + :param rois: (B, M. 
7) + :param gt_of_rois: (B, M, 7) + :return: + """ + batch_size, boxes_num = pts.shape[0], pts.shape[1] + + # rotation augmentation + angles = (np.random.rand(batch_size, boxes_num) - 0.5 / 0.5) * (np.pi / cfg.AUG_ROT_RANGE) + # calculate gt alpha from gt_of_rois + temp_x, temp_z, temp_ry = gt_of_rois[:, :, 0], gt_of_rois[:, :, 2], gt_of_rois[:, :, 6] + temp_beta = np.arctan2(temp_z, temp_x) + gt_alpha = -np.sign(temp_beta) * np.pi / 2 + temp_beta + temp_ry # (B, M) + + temp_x, temp_z, temp_ry = rois[:, :, 0], rois[:, :, 2], rois[:, :, 6] + temp_beta = np.arctan2(temp_z, temp_x) + roi_alpha = -np.sign(temp_beta) * np.pi / 2 + temp_beta + temp_ry # (B, M) + + for k in range(batch_size): + pts[k] = kitti_utils.rotate_pc_along_y_np(pts[k], angles[k]) + gt_of_rois[k] = np.squeeze(kitti_utils.rotate_pc_along_y_np( + np.expand_dims(gt_of_rois[k], axis=1), angles[k]), axis=1) + rois[k] = np.squeeze(kitti_utils.rotate_pc_along_y_np( + np.expand_dims(rois[k], axis=1), angles[k]),axis=1) + + # calculate the ry after rotation + temp_x, temp_z = gt_of_rois[:, :, 0], gt_of_rois[:, :, 2] + temp_beta = np.arctan2(temp_z, temp_x) + gt_of_rois[:, :, 6] = np.sign(temp_beta) * np.pi / 2 + gt_alpha - temp_beta + temp_x, temp_z = rois[:, :, 0], rois[:, :, 2] + temp_beta = np.arctan2(temp_z, temp_x) + rois[:, :, 6] = np.sign(temp_beta) * np.pi / 2 + roi_alpha - temp_beta + # scaling augmentation + scales = 1 + ((np.random.rand(batch_size, boxes_num) - 0.5) / 0.5) * 0.05 + pts = pts * np.expand_dims(np.expand_dims(scales, axis=2), axis=3) + gt_of_rois[:, :, 0:6] = gt_of_rois[:, :, 0:6] * np.expand_dims(scales, axis=2) + rois[:, :, 0:6] = rois[:, :, 0:6] * np.expand_dims(scales, axis=2) + + # flip augmentation + flip_flag = np.sign(np.random.rand(batch_size, boxes_num) - 0.5) + pts[:, :, :, 0] = pts[:, :, :, 0] * np.expand_dims(flip_flag, axis=2) + gt_of_rois[:, :, 0] = gt_of_rois[:, :, 0] * flip_flag + # flip orientation: ry > 0: pi - ry, ry < 0: -pi - ry + src_ry = gt_of_rois[:, :, 6] + ry = (flip_flag == 1).astype(np.float32) * src_ry + (flip_flag == -1).astype(np.float32) * (np.sign(src_ry) * np.pi - src_ry) + gt_of_rois[:, :, 6] = ry + + rois[:, :, 0] = rois[:, :, 0] * flip_flag + # flip orientation: ry > 0: pi - ry, ry < 0: -pi - ry + src_ry = rois[:, :, 6] + ry = (flip_flag == 1).astype(np.float32) * src_ry + (flip_flag == -1).astype(np.float32) * (np.sign(src_ry) * np.pi - src_ry) + rois[:, :, 6] = ry + + return pts, rois, gt_of_rois + + def generate_proposal_target(seg_mask,rpn_features,gt_boxes3d,rpn_xyz,pts_depth,roi_boxes3d,rpn_intensity): + seg_mask = np.array(seg_mask) + features = np.array(rpn_features) + gt_boxes3d = np.array(gt_boxes3d) + rpn_xyz = np.array(rpn_xyz) + pts_depth = np.array(pts_depth) + roi_boxes3d = np.array(roi_boxes3d) + rpn_intensity = np.array(rpn_intensity) + batch_rois, batch_gt_of_rois, batch_roi_iou = sample_rois_for_rcnn(roi_boxes3d, gt_boxes3d) + + if cfg.RCNN.USE_INTENSITY: + pts_extra_input_list = [np.expand_dims(rpn_intensity, axis=2), + np.expand_dims(seg_mask, axis=2)] + else: + pts_extra_input_list = [np.expand_dims(seg_mask, axis=2)] + + if cfg.RCNN.USE_DEPTH: + pts_depth = pts_depth / 70.0 - 0.5 + pts_extra_input_list.append(np.expand_dims(pts_depth, axis=2)) + pts_extra_input = np.concatenate(pts_extra_input_list, axis=2) + + # point cloud pooling + pts_feature = np.concatenate((pts_extra_input, rpn_features), axis=2) + + batch_rois = batch_rois.astype(np.float32) + + pooled_features, pooled_empty_flag = roipool3d_utils.roipool3d_gpu( + rpn_xyz, pts_feature, 
batch_rois, cfg.RCNN.POOL_EXTRA_WIDTH, + sampled_pt_num=cfg.RCNN.NUM_POINTS + ) + + sampled_pts, sampled_features = pooled_features[:, :, :, 0:3], pooled_features[:, :, :, 3:] + # data augmentation + if cfg.AUG_DATA: + # data augmentation + sampled_pts, batch_rois, batch_gt_of_rois = \ + data_augmentation(sampled_pts, batch_rois, batch_gt_of_rois) + + # canonical transformation + batch_size = batch_rois.shape[0] + roi_ry = batch_rois[:, :, 6] % (2 * np.pi) + roi_center = batch_rois[:, :, 0:3] + sampled_pts = sampled_pts - np.expand_dims(roi_center, axis=2) # (B, M, 512, 3) + batch_gt_of_rois[:, :, 0:3] = batch_gt_of_rois[:, :, 0:3] - roi_center + batch_gt_of_rois[:, :, 6] = batch_gt_of_rois[:, :, 6] - roi_ry + + for k in range(batch_size): + sampled_pts[k] = kitti_utils.rotate_pc_along_y_np(sampled_pts[k], batch_rois[k, :, 6]) + batch_gt_of_rois[k] = np.squeeze(kitti_utils.rotate_pc_along_y_np( + np.expand_dims(batch_gt_of_rois[k], axis=1), roi_ry[k]), axis=1) + + # regression valid mask + valid_mask = (pooled_empty_flag == 0) + reg_valid_mask = ((batch_roi_iou > cfg.RCNN.REG_FG_THRESH) & valid_mask).astype(np.float32) + + # classification label + batch_cls_label = (batch_roi_iou > cfg.RCNN.CLS_FG_THRESH).astype(np.int64) + invalid_mask = (batch_roi_iou > cfg.RCNN.CLS_BG_THRESH) & (batch_roi_iou < cfg.RCNN.CLS_FG_THRESH) + batch_cls_label[valid_mask == 0] = -1 + batch_cls_label[invalid_mask > 0] = -1 + + output_dict = {'sampled_pts': sampled_pts.reshape(-1, cfg.RCNN.NUM_POINTS, 3).astype(np.float32), + 'pts_feature': sampled_features.reshape(-1, cfg.RCNN.NUM_POINTS, sampled_features.shape[3]).astype(np.float32), + 'cls_label': batch_cls_label.reshape(-1), + 'reg_valid_mask': reg_valid_mask.reshape(-1).astype(np.float32), + 'gt_of_rois': batch_gt_of_rois.reshape(-1, 7).astype(np.float32), + 'gt_iou': batch_roi_iou.reshape(-1).astype(np.float32), + 'roi_boxes3d': batch_rois.reshape(-1, 7).astype(np.float32)} + + return output_dict.values() + + return generate_proposal_target + + +if __name__ == "__main__": + + input_dict = {} + input_dict['roi_boxes3d'] = np.load("models/rpn_data/roi_boxes3d.npy") + input_dict['gt_boxes3d'] = np.load("models/rpn_data/gt_boxes3d.npy") + input_dict['rpn_xyz'] = np.load("models/rpn_data/rpn_xyz.npy") + input_dict['rpn_features'] = np.load("models/rpn_data/rpn_features.npy") + input_dict['rpn_intensity'] = np.load("models/rpn_data/rpn_intensity.npy") + input_dict['seg_mask'] = np.load("models/rpn_data/seg_mask.npy") + input_dict['pts_depth'] = np.load("models/rpn_data/pts_depth.npy") + for k, v in input_dict.items(): + print(k, v.shape, np.sum(np.abs(v))) + input_dict[k] = np.expand_dims(v, axis=0) + + from utils.config import cfg + cfg.RPN.LOC_XZ_FINE = True + cfg.TEST.RPN_DISTANCE_BASED_PROPOSE = False + cfg.RPN.NMS_TYPE = 'rotate' + + proposal_target_func = get_proposal_target_func(cfg) + out_dict = proposal_target_func(input_dict['seg_mask'],input_dict['rpn_features'],input_dict['gt_boxes3d'], + input_dict['rpn_xyz'],input_dict['pts_depth'],input_dict['roi_boxes3d'],input_dict['rpn_intensity']) + for key in out_dict.keys(): + print("name:{}, shape{}".format(key,out_dict[key].shape)) diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/proposal_utils.py b/PaddleCV/Paddle3D/PointRCNN/utils/proposal_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..9160ffe8e4e4a1aff7f8e8984e5ddd3711d1ffb0 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/proposal_utils.py @@ -0,0 +1,270 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. 
+# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +""" +Contains proposal functions +""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np +import paddle.fluid as fluid + +import utils.box_utils as box_utils +from utils.config import cfg + +__all__ = ["get_proposal_func"] + + +def get_proposal_func(cfg, mode='TRAIN'): + def decode_bbox_target(roi_box3d, pred_reg, anchor_size, loc_scope, + loc_bin_size, num_head_bin, get_xz_fine=True, + loc_y_scope=0.5, loc_y_bin_size=0.25, + get_y_by_bin=False, get_ry_fine=False): + per_loc_bin_num = int(loc_scope / loc_bin_size) * 2 + loc_y_bin_num = int(loc_y_scope / loc_y_bin_size) * 2 + + # recover xz localization + x_bin_l, x_bin_r = 0, per_loc_bin_num + z_bin_l, z_bin_r = per_loc_bin_num, per_loc_bin_num * 2 + start_offset = z_bin_r + + x_bin = np.argmax(pred_reg[:, x_bin_l: x_bin_r], axis=1) + z_bin = np.argmax(pred_reg[:, z_bin_l: z_bin_r], axis=1) + + pos_x = x_bin.astype('float32') * loc_bin_size + loc_bin_size / 2 - loc_scope + pos_z = z_bin.astype('float32') * loc_bin_size + loc_bin_size / 2 - loc_scope + if get_xz_fine: + x_res_l, x_res_r = per_loc_bin_num * 2, per_loc_bin_num * 3 + z_res_l, z_res_r = per_loc_bin_num * 3, per_loc_bin_num * 4 + start_offset = z_res_r + + x_res_norm = pred_reg[:, x_res_l:x_res_r][np.arange(len(x_bin)), x_bin] + z_res_norm = pred_reg[:, z_res_l:z_res_r][np.arange(len(z_bin)), z_bin] + + x_res = x_res_norm * loc_bin_size + z_res = z_res_norm * loc_bin_size + pos_x += x_res + pos_z += z_res + + # recover y localization + if get_y_by_bin: + y_bin_l, y_bin_r = start_offset, start_offset + loc_y_bin_num + y_res_l, y_res_r = y_bin_r, y_bin_r + loc_y_bin_num + start_offset = y_res_r + + y_bin = np.argmax(pred_reg[:, y_bin_l: y_bin_r], axis=1) + y_res_norm = pred_reg[:, y_res_l:y_res_r][np.arange(len(y_bin)), y_bin] + y_res = y_res_norm * loc_y_bin_size + pos_y = y_bin.astype('float32') * loc_y_bin_size + loc_y_bin_size / 2 - loc_y_scope + y_res + pos_y = pos_y + np.array(roi_box3d[:, 1]).reshape(-1) + else: + y_offset_l, y_offset_r = start_offset, start_offset + 1 + start_offset = y_offset_r + + pos_y = np.array(roi_box3d[:, 1]) + np.array(pred_reg[:, y_offset_l]) + pos_y = pos_y.reshape(-1) + + # recover ry rotation + ry_bin_l, ry_bin_r = start_offset, start_offset + num_head_bin + ry_res_l, ry_res_r = ry_bin_r, ry_bin_r + num_head_bin + + ry_bin = np.argmax(pred_reg[:, ry_bin_l: ry_bin_r], axis=1) + ry_res_norm = pred_reg[:, ry_res_l:ry_res_r][np.arange(len(ry_bin)), ry_bin] + if get_ry_fine: + # divide pi/2 into several bins + angle_per_class = (np.pi / 2) / num_head_bin + ry_res = ry_res_norm * (angle_per_class / 2) + ry = (ry_bin.astype('float32') * angle_per_class + angle_per_class / 2) + ry_res - np.pi / 4 + else: + angle_per_class = (2 * np.pi) / num_head_bin + ry_res = ry_res_norm * (angle_per_class / 2) + + # bin_center is (0, 30, 60, 90, 120, ..., 270, 300, 330) + ry = np.fmod(ry_bin.astype('float32') * angle_per_class + ry_res, 2 * 
np.pi) + ry[ry > np.pi] -= 2 * np.pi + + # recover size + size_res_l, size_res_r = ry_res_r, ry_res_r + 3 + assert size_res_r == pred_reg.shape[1] + + size_res_norm = pred_reg[:, size_res_l: size_res_r] + hwl = size_res_norm * anchor_size + anchor_size + + def rotate_pc_along_y(pc, angle): + cosa = np.cos(angle).reshape(-1, 1) + sina = np.sin(angle).reshape(-1, 1) + + R = np.concatenate([cosa, -sina, sina, cosa], axis=-1).reshape(-1, 2, 2) + pc_temp = pc[:, [0, 2]].reshape(-1, 1, 2) + pc[:, [0, 2]] = np.matmul(pc_temp, R.transpose(0, 2, 1)).reshape(-1, 2) + + return pc + + # shift to original coords + roi_center = np.array(roi_box3d[:, 0:3]) + shift_ret_box3d = np.concatenate(( + pos_x.reshape(-1, 1), + pos_y.reshape(-1, 1), + pos_z.reshape(-1, 1), + hwl, ry.reshape(-1, 1)), axis=1) + ret_box3d = shift_ret_box3d + if roi_box3d.shape[1] == 7: + roi_ry = np.array(roi_box3d[:, 6]).reshape(-1) + ret_box3d = rotate_pc_along_y(np.array(shift_ret_box3d), -roi_ry) + ret_box3d[:, 6] += roi_ry + ret_box3d[:, [0, 2]] += roi_center[:, [0, 2]] + return ret_box3d + + def distance_based_proposal(scores, proposals, sorted_idxs): + nms_range_list = [0, 40.0, 80.0] + pre_tot_top_n = cfg[mode].RPN_PRE_NMS_TOP_N + pre_top_n_list = [0, int(pre_tot_top_n * 0.7), pre_tot_top_n - int(pre_tot_top_n * 0.7)] + post_tot_top_n = cfg[mode].RPN_POST_NMS_TOP_N + post_top_n_list = [0, int(post_tot_top_n * 0.7), post_tot_top_n - int(post_tot_top_n * 0.7)] + + batch_size = scores.shape[0] + ret_proposals = np.zeros((batch_size, cfg[mode].RPN_POST_NMS_TOP_N, 7), dtype='float32') + ret_scores= np.zeros((batch_size, cfg[mode].RPN_POST_NMS_TOP_N, 1), dtype='float32') + + for b, (score, proposal, sorted_idx) in enumerate(zip(scores, proposals, sorted_idxs)): + # sort by score + score_ord = score[sorted_idx] + proposal_ord = proposal[sorted_idx] + + dist = proposal_ord[:, 2] + first_mask = (dist > nms_range_list[0]) & (dist <= nms_range_list[1]) + + scores_single_list, proposals_single_list = [], [] + for i in range(1, len(nms_range_list)): + # get proposal distance mask + dist_mask = ((dist > nms_range_list[i - 1]) & (dist <= nms_range_list[i])) + + if dist_mask.sum() != 0: + # this area has points, reduce by mask + cur_scores = score_ord[dist_mask] + cur_proposals = proposal_ord[dist_mask] + + # fetch pre nms top K + cur_scores = cur_scores[:pre_top_n_list[i]] + cur_proposals = cur_proposals[:pre_top_n_list[i]] + else: + assert i == 2, '%d' % i + # this area doesn't have any points, so use rois of first area + cur_scores = score_ord[first_mask] + cur_proposals = proposal_ord[first_mask] + + # fetch top K of first area + cur_scores = cur_scores[pre_top_n_list[i - 1]:][:pre_top_n_list[i]] + cur_proposals = cur_proposals[pre_top_n_list[i - 1]:][:pre_top_n_list[i]] + + # oriented nms + boxes_bev = box_utils.boxes3d_to_bev(cur_proposals) + s_scores, s_proposals = box_utils.box_nms( + boxes_bev, cur_scores, cur_proposals, + cfg[mode].RPN_NMS_THRESH, post_top_n_list[i], + cfg.RPN.NMS_TYPE) + if len(s_scores) > 0: + scores_single_list.append(s_scores) + proposals_single_list.append(s_proposals) + + scores_single = np.concatenate(scores_single_list, axis=0) + proposals_single = np.concatenate(proposals_single_list, axis=0) + + prop_num = proposals_single.shape[0] + ret_scores[b, :prop_num, 0] = scores_single + ret_proposals[b, :prop_num] = proposals_single + # ret_proposals.tofile("proposal.data") + # ret_scores.tofile("score.data") + return np.concatenate([ret_proposals, ret_scores], axis=-1) + + def score_based_proposal(scores, 
proposals, sorted_idxs): + batch_size = scores.shape[0] + ret_proposals = np.zeros((batch_size, cfg[mode].RPN_POST_NMS_TOP_N, 7), dtype='float32') + ret_scores= np.zeros((batch_size, cfg[mode].RPN_POST_NMS_TOP_N, 1), dtype='float32') + for b, (score, proposal, sorted_idx) in enumerate(zip(scores, proposals, sorted_idxs)): + # sort by score + score_ord = score[sorted_idx] + proposal_ord = proposal[sorted_idx] + + # pre nms top K + cur_scores = score_ord[:cfg[mode].RPN_PRE_NMS_TOP_N] + cur_proposals = proposal_ord[:cfg[mode].RPN_PRE_NMS_TOP_N] + + boxes_bev = box_utils.boxes3d_to_bev(cur_proposals) + s_scores, s_proposals = box_utils.box_nms( + boxes_bev, cur_scores, cur_proposals, + cfg[mode].RPN_NMS_THRESH, + cfg[mode].RPN_POST_NMS_TOP_N, + 'rotate') + prop_num = len(s_proposals) + ret_scores[b, :prop_num, 0] = s_scores + ret_proposals[b, :prop_num] = s_proposals + # ret_proposals.tofile("proposal.data") + # ret_scores.tofile("score.data") + return np.concatenate([ret_proposals, ret_scores], axis=-1) + + def generate_proposal(x): + rpn_scores = np.array(x[:, :, 0])[:, :, 0] + roi_box3d = x[:, :, 1:4] + pred_reg = x[:, :, 4:] + + proposals = decode_bbox_target( + np.array(roi_box3d).reshape(-1, roi_box3d.shape()[-1]), + np.array(pred_reg).reshape(-1, pred_reg.shape()[-1]), + anchor_size=np.array(cfg.CLS_MEAN_SIZE[0], dtype='float32'), + loc_scope=cfg.RPN.LOC_SCOPE, + loc_bin_size=cfg.RPN.LOC_BIN_SIZE, + num_head_bin=cfg.RPN.NUM_HEAD_BIN, + get_xz_fine=cfg.RPN.LOC_XZ_FINE, + get_y_by_bin=False, + get_ry_fine=False) + proposals[:, 1] += proposals[:, 3] / 2 + proposals = proposals.reshape(rpn_scores.shape[0], -1, proposals.shape[-1]) + + sorted_idxs = np.argsort(-rpn_scores, axis=-1) + + if cfg.TEST.RPN_DISTANCE_BASED_PROPOSE: + ret = distance_based_proposal(rpn_scores, proposals, sorted_idxs) + else: + ret = score_based_proposal(rpn_scores, proposals, sorted_idxs) + + return ret + + + return generate_proposal + + +if __name__ == "__main__": + np.random.seed(3333) + x_np = np.random.random((4, 256, 84)).astype('float32') + + from config import cfg + cfg.RPN.LOC_XZ_FINE = True + # cfg.TEST.RPN_DISTANCE_BASED_PROPOSE = False + # cfg.RPN.NMS_TYPE = 'rotate' + proposal_func = get_proposal_func(cfg) + + x = fluid.layers.data(name="x", shape=[256, 84], dtype='float32') + proposal = fluid.default_main_program().current_block().create_var( + name="proposal", dtype='float32', shape=[256, 7]) + fluid.layers.py_func(proposal_func, x, proposal) + loss = fluid.layers.reduce_mean(proposal) + + place = fluid.CUDAPlace(0) + exe = fluid.Executor(place) + exe.run(fluid.default_startup_program()) + ret = exe.run(fetch_list=[proposal.name, loss.name], feed={'x': x_np}) + print(ret) diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/pts_utils/CMakeLists.txt b/PaddleCV/Paddle3D/PointRCNN/utils/pts_utils/CMakeLists.txt new file mode 100644 index 0000000000000000000000000000000000000000..044bbed5d020464250810601ec2dcdacdec0cd18 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/pts_utils/CMakeLists.txt @@ -0,0 +1,6 @@ + +cmake_minimum_required(VERSION 2.8.12) +project(pts_utils) + +add_subdirectory(pybind11) +pybind11_add_module(pts_utils pts_utils.cpp) diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/pts_utils/pts_utils.cpp b/PaddleCV/Paddle3D/PointRCNN/utils/pts_utils/pts_utils.cpp new file mode 100644 index 0000000000000000000000000000000000000000..356b02baa5288903e218c8fca1b17118ef8ea72b --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/pts_utils/pts_utils.cpp @@ -0,0 +1,62 @@ +#include +#include +#include + 
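+// NOTE: the include lines above are expected to pull in pybind11/pybind11.h,
+// pybind11/numpy.h and math.h, which this file needs for py::array_t,
+// PYBIND11_MODULE and fabsf/cosf/sinf.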
+namespace py = pybind11; + +int pt_in_box3d(float x, float y, float z, float cx, float cy, float cz, float h, float w, float l, float cosa, float sina) { + if ((fabsf(x - cx) > 10.) || (fabsf(y - cy) > h / 2.0) || (fabsf(z - cz) > 10.)){ + return 0; + } + + float x_rot = (x - cx) * cosa + (z - cz) * (-sina); + float z_rot = (x - cx) * sina + (z - cz) * cosa; + + int in_flag = static_cast((x_rot >= -l / 2.0) & (x_rot <= l / 2.0) & (z_rot >= -w / 2.0) & (z_rot <= w / 2.0)); + return in_flag; +} + +py::array_t pts_in_boxes3d(py::array_t pts, py::array_t boxes) { + py::buffer_info pts_buf= pts.request(), boxes_buf = boxes.request(); + + if (pts_buf.ndim != 2 || boxes_buf.ndim != 2) { + throw std::runtime_error("Number of dimensions must be 2"); + } + if (pts_buf.shape[1] != 3) { + throw std::runtime_error("pts 2nd dimension must be 3"); + } + if (boxes_buf.shape[1] != 7) { + throw std::runtime_error("boxes 2nd dimension must be 7"); + } + + auto pts_num = pts_buf.shape[0]; + auto boxes_num = boxes_buf.shape[0]; + auto mask = py::array_t(pts_num * boxes_num); + py::buffer_info mask_buf = mask.request(); + + float *pts_ptr = (float *) pts_buf.ptr, + *boxes_ptr = (float *) boxes_buf.ptr; + int *mask_ptr = (int *) mask_buf.ptr; + + for (ssize_t i = 0; i < boxes_num; i++) { + float cx = boxes_ptr[i * 7]; + float cy = boxes_ptr[i * 7 + 1] - boxes_ptr[i * 7 + 3] / 2.; + float cz = boxes_ptr[i * 7 + 2]; + float h = boxes_ptr[i * 7 + 3]; + float w = boxes_ptr[i * 7 + 4]; + float l = boxes_ptr[i * 7 + 5]; + float angle = boxes_ptr[i * 7 + 6]; + float cosa = cosf(angle); + float sina = sinf(angle); + for (ssize_t j = 0; j < pts_num; j++) { + mask_ptr[i * pts_num + j] = pt_in_box3d(pts_ptr[j * 3], pts_ptr[j * 3 + 1], pts_ptr[j * 3 + 2], cx, cy, cz, h, w, l, cosa, sina); + } + } + + mask.resize({boxes_num, pts_num}); + return mask; +} + +PYBIND11_MODULE(pts_utils, m) { + m.def("pts_in_boxes3d", &pts_in_boxes3d, "Calculate mask for whether points in boxes3d"); +} diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/pts_utils/setup.py b/PaddleCV/Paddle3D/PointRCNN/utils/pts_utils/setup.py new file mode 100644 index 0000000000000000000000000000000000000000..e44e80ea703c0b2b3d1938fadc3c1befadb1dad0 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/pts_utils/setup.py @@ -0,0 +1,12 @@ +from setuptools import setup +from setuptools import Extension + +setup( + name='pts_utils', + ext_modules = [Extension( + name='pts_utils', + sources=['pts_utils.cpp'], + include_dirs=[r'../../pybind11/include'], + extra_compile_args=['-std=c++11'] + )], +) diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/pts_utils/test.py b/PaddleCV/Paddle3D/PointRCNN/utils/pts_utils/test.py new file mode 100644 index 0000000000000000000000000000000000000000..e4e3be285e3363a2193102732f1c0d9894eb497d --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/pts_utils/test.py @@ -0,0 +1,7 @@ +import numpy as np +import pts_utils + +a = np.random.random((16384, 3)).astype('float32') +b = np.random.random((64, 7)).astype('float32') +c = pts_utils.pts_in_boxes3d(a, b) +print(a, b, c, c.shape, np.sum(c)) diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/run_utils.py b/PaddleCV/Paddle3D/PointRCNN/utils/run_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..0df37e5658f86c0cfc416e8a0185c5556bffe9f9 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/run_utils.py @@ -0,0 +1,110 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. 
+# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +""" +Contains common utility functions. +""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import sys +import six +import logging +import numpy as np +import paddle.fluid as fluid + +__all__ = ["check_gpu", "print_arguments", "parse_outputs", "Stat"] + +logger = logging.getLogger(__name__) + + +def check_gpu(use_gpu): + """ + Log error and exit when set use_gpu=True in paddlepaddle + cpu version. + """ + err = "Config use_gpu cannot be set as True while you are " \ + "using paddlepaddle cpu version ! \nPlease try: \n" \ + "\t1. Install paddlepaddle-gpu to run model on GPU \n" \ + "\t2. Set --use_gpu=False to run model on CPU" + + try: + if use_gpu and not fluid.is_compiled_with_cuda(): + logger.error(err) + sys.exit(1) + except Exception as e: + pass + + +def print_arguments(args): + """Print argparse's arguments. + + Usage: + + .. code-block:: python + + parser = argparse.ArgumentParser() + parser.add_argument("name", default="Jonh", type=str, help="User name.") + args = parser.parse_args() + print_arguments(args) + + :param args: Input argparse.Namespace for printing. + :type args: argparse.Namespace + """ + logger.info("----------- Configuration Arguments -----------") + for arg, value in sorted(six.iteritems(vars(args))): + logger.info("%s: %s" % (arg, value)) + logger.info("------------------------------------------------") + + +def parse_outputs(outputs, filter_key=None, extra_keys=None, prog=None): + keys, values = [], [] + for k, v in outputs.items(): + if filter_key is not None and k.find(filter_key) < 0: + continue + keys.append(k) + v.persistable = True + values.append(v.name) + + if prog is not None and extra_keys is not None: + for k in extra_keys: + try: + v = fluid.framework._get_var(k, prog) + keys.append(k) + v.persistable = True + values.append(v.name) + except: + pass + return keys, values + + +class Stat(object): + def __init__(self): + self.stats = {} + + def update(self, keys, values): + for k, v in zip(keys, values): + if k not in self.stats: + self.stats[k] = [] + self.stats[k].append(v) + + def reset(self): + self.stats = {} + + def get_mean_log(self): + log = "" + for k, v in self.stats.items(): + log += "avg_{}: {:.4f}, ".format(k, np.mean(v)) + return log diff --git a/PaddleCV/Paddle3D/PointRCNN/utils/save_utils.py b/PaddleCV/Paddle3D/PointRCNN/utils/save_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..c24a89a2429bd5f45386efa1176f8c8770500120 --- /dev/null +++ b/PaddleCV/Paddle3D/PointRCNN/utils/save_utils.py @@ -0,0 +1,132 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import numpy as np +from utils.config import cfg +from utils import calibration as calib +import utils.cyops.kitti_utils as kitti_utils + +__all__ = ['save_rpn_feature', 'save_kitti_result', 'save_kitti_format'] + + +def save_rpn_feature(rets, kitti_features_dir): + """ + save rpn features for RCNN offline training + """ + + sample_id = rets['sample_id'][0] + backbone_xyz = rets['backbone_xyz'][0] + backbone_feature = rets['backbone_feature'][0] + pts_features = rets['pts_features'][0] + seg_mask = rets['seg_mask'][0] + rpn_cls = rets['rpn_cls'][0] + + for i in range(len(sample_id)): + pts_intensity = pts_features[i, :, 0] + s_id = sample_id[i, 0] + + output_file = os.path.join(kitti_features_dir, '%06d.npy' % s_id) + xyz_file = os.path.join(kitti_features_dir, '%06d_xyz.npy' % s_id) + seg_file = os.path.join(kitti_features_dir, '%06d_seg.npy' % s_id) + intensity_file = os.path.join( + kitti_features_dir, '%06d_intensity.npy' % s_id) + np.save(output_file, backbone_feature[i]) + np.save(xyz_file, backbone_xyz[i]) + np.save(seg_file, seg_mask[i]) + np.save(intensity_file, pts_intensity) + rpn_scores_raw_file = os.path.join( + kitti_features_dir, '%06d_rawscore.npy' % s_id) + np.save(rpn_scores_raw_file, rpn_cls[i]) + + +def save_kitti_result(rets, seg_output_dir, kitti_output_dir, reader, classes): + sample_id = rets['sample_id'][0] + roi_scores_row = rets['roi_scores_row'][0] + bboxes3d = rets['rois'][0] + pts_rect = rets['pts_rect'][0] + seg_mask = rets['seg_mask'][0] + rpn_cls_label = rets['rpn_cls_label'][0] + gt_boxes3d = rets['gt_boxes3d'][0] + gt_boxes3d_num = rets['gt_boxes3d'][1] + + for i in range(len(sample_id)): + s_id = sample_id[i, 0] + + seg_result_data = np.concatenate((pts_rect[i].reshape(-1, 3), + rpn_cls_label[i].reshape(-1, 1), + seg_mask[i].reshape(-1, 1)), + axis=1).astype('float16') + seg_output_file = os.path.join(seg_output_dir, '%06d.npy' % s_id) + np.save(seg_output_file, seg_result_data) + + scores = roi_scores_row[i, :] + bbox3d = bboxes3d[i, :] + img_shape = reader.get_image_shape(s_id) + calib = reader.get_calib(s_id) + + corners3d = kitti_utils.boxes3d_to_corners3d(bbox3d) + img_boxes, _ = calib.corners3d_to_img_boxes(corners3d) + + img_boxes[:, 0] = np.clip(img_boxes[:, 0], 0, img_shape[1] - 1) + img_boxes[:, 1] = np.clip(img_boxes[:, 1], 0, img_shape[0] - 1) + img_boxes[:, 2] = np.clip(img_boxes[:, 2], 0, img_shape[1] - 1) + img_boxes[:, 3] = np.clip(img_boxes[:, 3], 0, img_shape[0] - 1) + + img_boxes_w = img_boxes[:, 2] - img_boxes[:, 0] + img_boxes_h = img_boxes[:, 3] - img_boxes[:, 1] + box_valid_mask = np.logical_and( + img_boxes_w < img_shape[1] * 0.8, img_boxes_h < img_shape[0] * 0.8) + + kitti_output_file = os.path.join(kitti_output_dir, '%06d.txt' % s_id) + with open(kitti_output_file, 'w') as f: + for k in range(bbox3d.shape[0]): + if box_valid_mask[k] == 0: + continue + x, z, ry = bbox3d[k, 0], bbox3d[k, 2], bbox3d[k, 6] + beta = np.arctan2(z, x) + alpha = -np.sign(beta) * np.pi / 2 + beta + ry + + f.write('{} -1 -1 {:.4f} {:.4f} {:.4f} {:.4f} {:.4f} {:.4f} {:.4f} {:.4f} {:.4f} {:.4f} {:.4f} {:.4f} {:.4f}\n'.format( + 
classes, alpha, img_boxes[k, 0], img_boxes[k, 1], img_boxes[k, 2], img_boxes[k, 3], + bbox3d[k, 3], bbox3d[k, 4], bbox3d[k, 5], bbox3d[k, 0], bbox3d[k, 1], bbox3d[k, 2], + bbox3d[k, 6], scores[k])) + + +def save_kitti_format(sample_id, calib, bbox3d, kitti_output_dir, scores, img_shape): + corners3d = kitti_utils.boxes3d_to_corners3d(bbox3d) + img_boxes, _ = calib.corners3d_to_img_boxes(corners3d) + img_boxes[:, 0] = np.clip(img_boxes[:, 0], 0, img_shape[1] - 1) + img_boxes[:, 1] = np.clip(img_boxes[:, 1], 0, img_shape[0] - 1) + img_boxes[:, 2] = np.clip(img_boxes[:, 2], 0, img_shape[1] - 1) + img_boxes[:, 3] = np.clip(img_boxes[:, 3], 0, img_shape[0] - 1) + + img_boxes_w = img_boxes[:, 2] - img_boxes[:, 0] + img_boxes_h = img_boxes[:, 3] - img_boxes[:, 1] + box_valid_mask = np.logical_and(img_boxes_w < img_shape[1] * 0.8, img_boxes_h < img_shape[0] * 0.8) + + kitti_output_file = os.path.join(kitti_output_dir, '%06d.txt' % sample_id) + with open(kitti_output_file, 'w') as f: + for k in range(bbox3d.shape[0]): + if box_valid_mask[k] == 0: + continue + x, z, ry = bbox3d[k, 0], bbox3d[k, 2], bbox3d[k, 6] + beta = np.arctan2(z, x) + alpha = -np.sign(beta) * np.pi / 2 + beta + ry + + f.write('%s -1 -1 %.4f %.4f %.4f %.4f %.4f %.4f %.4f %.4f %.4f %.4f %.4f %.4f %.4f\n' % + (cfg.CLASSES, alpha, img_boxes[k, 0], img_boxes[k, 1], img_boxes[k, 2], img_boxes[k, 3], + bbox3d[k, 3], bbox3d[k, 4], bbox3d[k, 5], bbox3d[k, 0], bbox3d[k, 1], bbox3d[k, 2], + bbox3d[k, 6], scores[k])) + diff --git a/PaddleCV/PaddleDetection/docs/GETTING_STARTED_cn.md b/PaddleCV/PaddleDetection/docs/GETTING_STARTED_cn.md index b5dd6041033e539e18d59dc8669a3658ba395da2..15cb8cdb239e72f204d598adbf78627000cb5bec 100644 --- a/PaddleCV/PaddleDetection/docs/GETTING_STARTED_cn.md +++ b/PaddleCV/PaddleDetection/docs/GETTING_STARTED_cn.md @@ -5,7 +5,7 @@ ## 训练/评估/推断 -PaddleDetection提供了训练/训练/评估三个功能的使用脚本,支持通过不同可选参数实现特定功能 +PaddleDetection提供了训练/评估/推断三个功能的使用脚本,支持通过不同可选参数实现特定功能 ```bash # 设置PYTHONPATH路径 diff --git a/PaddleCV/PaddleDetection/ppdet/data/data_feed.py b/PaddleCV/PaddleDetection/ppdet/data/data_feed.py index c384b2cb3241cc5c012cedc02dfc9cbeab524bf6..cbaebc2e4860e40481a8e1defdeea3edde22eb7e 100644 --- a/PaddleCV/PaddleDetection/ppdet/data/data_feed.py +++ b/PaddleCV/PaddleDetection/ppdet/data/data_feed.py @@ -452,7 +452,7 @@ class FasterRCNNTrainFeed(DataFeed): 'image', 'im_info', 'im_id', 'gt_box', 'gt_label', 'is_crowd' ], - image_shape=[3, 800, 1333], + image_shape=[None, 3, None, None], sample_transforms=[ DecodeImage(to_rgb=True), RandomFlipImage(prob=0.5), @@ -504,7 +504,7 @@ class FasterRCNNEvalFeed(DataFeed): COCO_VAL_IMAGE_DIR).__dict__, fields=['image', 'im_info', 'im_id', 'im_shape', 'gt_box', 'gt_label', 'is_difficult'], - image_shape=[3, 800, 1333], + image_shape=[None, 3, None, None], sample_transforms=[ DecodeImage(to_rgb=True), NormalizeImage(mean=[0.485, 0.456, 0.406], @@ -551,7 +551,7 @@ class FasterRCNNTestFeed(DataFeed): dataset=SimpleDataSet(COCO_VAL_ANNOTATION, COCO_VAL_IMAGE_DIR).__dict__, fields=['image', 'im_info', 'im_id', 'im_shape'], - image_shape=[3, 800, 1333], + image_shape=[None, 3, None, None], sample_transforms=[ DecodeImage(to_rgb=True), NormalizeImage(mean=[0.485, 0.456, 0.406], @@ -598,7 +598,7 @@ class MaskRCNNTrainFeed(DataFeed): 'image', 'im_info', 'im_id', 'gt_box', 'gt_label', 'is_crowd', 'gt_mask' ], - image_shape=[3, 800, 1333], + image_shape=[None, 3, None, None], sample_transforms=[ DecodeImage(to_rgb=True), RandomFlipImage(prob=0.5, is_mask_flip=True), @@ -644,7 +644,7 @@ class 
MaskRCNNEvalFeed(DataFeed): dataset=CocoDataSet(COCO_VAL_ANNOTATION, COCO_VAL_IMAGE_DIR).__dict__, fields=['image', 'im_info', 'im_id', 'im_shape'], - image_shape=[3, 800, 1333], + image_shape=[None, 3, None, None], sample_transforms=[ DecodeImage(to_rgb=True), NormalizeImage(mean=[0.485, 0.456, 0.406], @@ -696,7 +696,7 @@ class MaskRCNNTestFeed(DataFeed): dataset=SimpleDataSet(COCO_VAL_ANNOTATION, COCO_VAL_IMAGE_DIR).__dict__, fields=['image', 'im_info', 'im_id', 'im_shape'], - image_shape=[3, 800, 1333], + image_shape=[None, 3, None, None], sample_transforms=[ DecodeImage(to_rgb=True), NormalizeImage( diff --git a/PaddleCV/PaddleDetection/ppdet/data/tools/x2coco.py b/PaddleCV/PaddleDetection/ppdet/data/tools/x2coco.py index da8e4aef4011ef1a23e7459bc473301e171b9fea..0379fab6335cb7886da8fe9f5170717a4453c6d6 100644 --- a/PaddleCV/PaddleDetection/ppdet/data/tools/x2coco.py +++ b/PaddleCV/PaddleDetection/ppdet/data/tools/x2coco.py @@ -277,13 +277,16 @@ def main(): indent=4, cls=MyEncoder) if args.val_proportion != 0: - val_data_coco = deal_json(args.output_dir + '/val', args.json_input_dir) + val_data_coco = deal_json(args.dataset_type, + args.output_dir + '/val', + args.json_input_dir) val_json_path = osp.join(args.output_dir + '/annotations', 'instance_val.json') json.dump( val_data_coco, open(val_json_path, 'w'), indent=4, cls=MyEncoder) if args.test_proportion != 0: - test_data_coco = deal_json(args.output_dir + '/test', + test_data_coco = deal_json(args.dataset_type, + args.output_dir + '/test', args.json_input_dir) test_json_path = osp.join(args.output_dir + '/annotations', 'instance_test.json') diff --git a/PaddleCV/PaddleDetection/slim/distillation/README.md b/PaddleCV/PaddleDetection/slim/distillation/README.md index e970cc42b54c17a6131c4873662fb2be46767b60..e46e6a2c92ac502f48d7d929a81b61228ed10d7a 100755 --- a/PaddleCV/PaddleDetection/slim/distillation/README.md +++ b/PaddleCV/PaddleDetection/slim/distillation/README.md @@ -135,7 +135,7 @@ python ../infer.py \ | FLOPS |Box AP| |---|---| |baseline|76.2 | -|蒸馏后|- | +|蒸馏后|76.27 | ## FAQ diff --git a/PaddleCV/PaddleDetection/slim/quantization/README.md b/PaddleCV/PaddleDetection/slim/quantization/README.md index acb4c9efcbd49bccc4682c7eb7af294885e5d42a..d451e959a8828c24fcafb9ac52b8c5a2a3ce8de5 100644 --- a/PaddleCV/PaddleDetection/slim/quantization/README.md +++ b/PaddleCV/PaddleDetection/slim/quantization/README.md @@ -4,7 +4,7 @@ ## 概述 -该示例使用PaddleSlim提供的[量化压缩策略](https://github.com/PaddlePaddle/models/blob/develop/PaddleSlim/docs/tutorial.md#1-quantization-aware-training%E9%87%8F%E5%8C%96%E4%BB%8B%E7%BB%8D)对分类模型进行压缩。 +该示例使用PaddleSlim提供的[量化压缩策略](https://github.com/PaddlePaddle/models/blob/develop/PaddleSlim/docs/tutorial.md#1-quantization-aware-training%E9%87%8F%E5%8C%96%E4%BB%8B%E7%BB%8D)对检测模型进行压缩。 在阅读该示例前,建议您先了解以下内容: - [检测模型的常规训练方法](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/PaddleDetection) @@ -41,10 +41,11 @@ step1: 设置gpu卡 ``` -export CUDA_VISIBLE_DEVICES=0 +export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ``` step2: 开始训练 -使用PaddleDetection提供的配置文件在用8卡进行训练: + +使用PaddleDetection提供的配置文件用8卡进行训练: ``` python compress.py \ @@ -234,8 +235,11 @@ FP32模型可使用PaddleLite进行加载预测,可参见教程[Paddle-Lite如 |---|---|---|---|---| |baseline|- |76.2%|- |-| |abs_max|abs_max|- |- |-| -|abs_max|moving_average_abs_max|- |- |-| +|abs_max|moving_average_abs_max|74.48%|10.99|3348.68| |channel_wise_abs_max|abs_max|- |- |-| +> 注: lite端运行手机信息:Android手机, +型号:BKL-AL20,运行内存RAM:4GB 6GB,CPU核心数:八核 4*A73 2.36GHz+4*A53 1.8GHz,操作系统:EMUI 8.0,CPU品牌:麒麟970 + ## FAQ diff 
--git a/PaddleCV/PaddleDetection/slim/quantization/compress.py b/PaddleCV/PaddleDetection/slim/quantization/compress.py index b4a5553cf46eabcd25f7cc1ce6c50fccefd2e5df..0e145abcf70c54a3b7960243e20c0cb8cb6d39d9 100644 --- a/PaddleCV/PaddleDetection/slim/quantization/compress.py +++ b/PaddleCV/PaddleDetection/slim/quantization/compress.py @@ -49,7 +49,7 @@ from ppdet.data.data_feed import create_reader from ppdet.utils.eval_utils import parse_fetches, eval_results from ppdet.utils.stats import TrainingStats from ppdet.utils.cli import ArgsParser, print_total_cfg -from ppdet.utils.check import check_gpu, check_version +from ppdet.utils.check import check_gpu import ppdet.utils.checkpoint as checkpoint from ppdet.modeling.model_input import create_feed @@ -121,8 +121,7 @@ def main(): # check if set use_gpu=True in paddlepaddle cpu version check_gpu(cfg.use_gpu) - # print_total_cfg(cfg) - #check_version() + if cfg.use_gpu: devices_num = fluid.core.get_cuda_device_count() else: diff --git a/PaddleCV/PaddleDetection/slim/quantization/freeze.py b/PaddleCV/PaddleDetection/slim/quantization/freeze.py index 38c06578e3d22e1cc4f2bdcc933298553c1c1f37..42c7bc62fd771366430f3658d9446a0f12fe2125 100644 --- a/PaddleCV/PaddleDetection/slim/quantization/freeze.py +++ b/PaddleCV/PaddleDetection/slim/quantization/freeze.py @@ -195,19 +195,6 @@ def main(): model_filename='model', params_filename='weights') - logger.info("convert the freezed pass to paddle-lite execution") - mobile_pass = TransformForMobilePass() - mobile_pass.apply(test_graph) - mobile_program = test_graph.to_program() - fluid.io.save_inference_model( - dirname=os.path.join(FLAGS.save_path, 'mobile'), - feeded_var_names=feed_names, - target_vars=fetch_targets, - executor=exe, - main_program=mobile_program, - model_filename='model', - params_filename='weights') - if __name__ == '__main__': parser = ArgsParser() diff --git a/PaddleCV/PaddleDetection/slim/quantization/yolov3_mobilenet_v1_slim.yaml b/PaddleCV/PaddleDetection/slim/quantization/yolov3_mobilenet_v1_slim.yaml index 60a66f656f9e419cd862231654ab4eaca6057ea2..9d453450d91edf4d10c6aa5fd9fd29f21953e5d3 100644 --- a/PaddleCV/PaddleDetection/slim/quantization/yolov3_mobilenet_v1_slim.yaml +++ b/PaddleCV/PaddleDetection/slim/quantization/yolov3_mobilenet_v1_slim.yaml @@ -5,7 +5,6 @@ strategies: start_epoch: 0 end_epoch: 4 float_model_save_path: './output/yolov3/float' - mobile_model_save_path: './output/yolov3/mobile' int8_model_save_path: './output/yolov3/int8' weight_bits: 8 activation_bits: 8 diff --git a/PaddleCV/PaddleDetection/tools/train.py b/PaddleCV/PaddleDetection/tools/train.py index 08e1fc63437c78722e11429d94468dcf2e5eee2c..6d04c665ecbae873a043624a80661c385194fe9e 100644 --- a/PaddleCV/PaddleDetection/tools/train.py +++ b/PaddleCV/PaddleDetection/tools/train.py @@ -19,6 +19,7 @@ from __future__ import print_function import os import time import numpy as np +import random import datetime from collections import deque @@ -36,6 +37,7 @@ set_paddle_flags( ) from paddle import fluid +from paddle.fluid import profiler from ppdet.experimental import mixed_precision_context from ppdet.core.workspace import load_config, merge_config, create @@ -61,10 +63,13 @@ def main(): FLAGS.dist = 'PADDLE_TRAINER_ID' in env and 'PADDLE_TRAINERS_NUM' in env if FLAGS.dist: trainer_id = int(env['PADDLE_TRAINER_ID']) - import random local_seed = (99 + trainer_id) random.seed(local_seed) np.random.seed(local_seed) + + if FLAGS.enable_ce: + random.seed(0) + np.random.seed(0) cfg = load_config(FLAGS.config) if 
'architecture' in cfg: @@ -111,6 +116,9 @@ def main(): # build program startup_prog = fluid.Program() train_prog = fluid.Program() + if FLAGS.enable_ce: + startup_prog.random_seed = 1000 + train_prog.random_seed = 1000 with fluid.program_guard(train_prog, startup_prog): with fluid.unique_name.guard(): model = create(main_arch) @@ -257,6 +265,18 @@ def main(): strs = 'iter: {}, lr: {:.6f}, {}, time: {:.3f}, eta: {}'.format( it, np.mean(outs[-1]), logs, time_cost, eta) logger.info(strs) + + #only for continuous evaluation + if FLAGS.enable_ce and it == cfg.max_iters - 1: + print("kpis\t{}_train_loss\t{}".format(cfg.architecture, stats['loss'])) + print("kpis\t{}_train_time\t{}".format(cfg.architecture, time_cost)) + + # profiler tools, used for benchmark + if FLAGS.is_profiler and it == 5: + profiler.start_profiler("All") + elif FLAGS.is_profiler and it == 10: + profiler.stop_profiler("total", FLAGS.profiler_path) + return if (it > 0 and it % cfg.snapshot_iter == 0 or it == cfg.max_iters - 1) \ and (not FLAGS.dist or trainer_id == 0): @@ -334,5 +354,23 @@ if __name__ == '__main__': type=str, default="tb_log_dir/scalar", help='Tensorboard logging directory for scalar.') + parser.add_argument( + '--enable_ce', + type=bool, + default=False, + help="If set True, enable continuous evaluation job." + "This flag is only used for internal test.") + + #NOTE:args for profiler tools, used for benchmark + parser.add_argument( + '--is_profiler', + type=int, + default=0, + help='The switch of profiler tools. (used for benchmark)') + parser.add_argument( + '--profiler_path', + type=str, + default="./", + help='The profiler output file path. (used for benchmark)') FLAGS = parser.parse_args() main() diff --git a/PaddleCV/PaddleGAN/README.md b/PaddleCV/PaddleGAN/README.md index 97b0f3985b149b30bab7d1123032f428dd8bc5a0..cb5453984bef0b7bb46b320ff24172413e52c124 100644 --- a/PaddleCV/PaddleGAN/README.md +++ b/PaddleCV/PaddleGAN/README.md @@ -12,7 +12,6 @@ - [FAQ](#faq) - [参考论文](#参考论文) - [版本更新](#版本更新) -- [作者](#作者) ## 模型简介 @@ -312,9 +311,5 @@ SPADE整体的网络结构[10] - 6/2019 新增CGAN, DCGAN, Pix2Pix, CycleGAN,StarGAN, AttGAN, STGAN -## 作者 -- [ceci3](https://github.com/ceci3) -- [zhumanyu](https://github.com/zhumanyu) - ## 如何贡献代码 如果你可以修复某个issue或者增加一个新功能,欢迎给我们提交PR。如果对应的PR被接受了,我们将根据贡献的质量和难度进行打分(0-5分,越高越好)。如果你累计获得了10分,可以联系我们获得面试机会或者为你写推荐信。 diff --git a/PaddleCV/PaddleGAN/cycle_gan/train.py b/PaddleCV/PaddleGAN/cycle_gan/train.py index a85da0ae2c97e95aa7d8de5a6ef5661988c84971..5fadd201ef250f31023b3858c1dffeb992f3b19a 100644 --- a/PaddleCV/PaddleGAN/cycle_gan/train.py +++ b/PaddleCV/PaddleGAN/cycle_gan/train.py @@ -69,6 +69,11 @@ add_arg('save_checkpoints', bool, True, "Whether to save checkpoints.") add_arg('run_test', bool, True, "Whether to run test.") add_arg('use_gpu', bool, True, "Whether to use GPU to train.") add_arg('profile', bool, False, "Whether to profile.") + +# NOTE: args for profiler, used for benchmark +add_arg('profiler_path', str, './profiler_cyclegan', "the path of profiler output files. used for benchmark") +add_arg('max_iter', int, 0, "the max batch nums to train. 
used for benchmark") + add_arg('run_ce', bool, False, "Whether to run for model ce.") # yapf: enable @@ -214,9 +219,14 @@ def train(args): loss_name=d_A_trainer.d_loss_A.name, build_strategy=build_strategy, exec_strategy=exec_strategy) + + total_batch_num = 0 # this is for benchmark + for epoch in range(args.epoch): batch_id = 0 for i in range(max_images_num): + if args.max_iter and total_batch_num == args.max_iter: # this for benchmark + return data_A = next(A_reader) data_B = next(B_reader) tensor_A = fluid.LoDTensor() @@ -265,6 +275,12 @@ def train(args): losses[1].append(d_A_loss[0]) sys.stdout.flush() batch_id += 1 + total_batch_num = total_batch_num + 1 # this is for benchmark + # profiler tools for benchmark + if args.profile and epoch == 0 and batch_id == 10: + profiler.reset_profiler() + elif args.profile and epoch == 0 and batch_id == 15: + return if args.run_test and not args.run_ce: test(epoch) @@ -281,7 +297,7 @@ if __name__ == "__main__": print_arguments(args) if args.profile: if args.use_gpu: - with profiler.cuda_profiler("cuda_profiler.txt", 'csv') as nvprof: + with profiler.profiler('All', 'total', args.profiler_path) as prof: train(args) else: with profiler.profiler("CPU", sorted_key='total') as cpuprof: diff --git a/PaddleCV/PaddleGAN/data_reader.py b/PaddleCV/PaddleGAN/data_reader.py index ef18d7e05a70dfcef605597da55d80c1db794c94..407855abca1c3328e931841f404a3f80a9b6cc36 100644 --- a/PaddleCV/PaddleGAN/data_reader.py +++ b/PaddleCV/PaddleGAN/data_reader.py @@ -308,7 +308,7 @@ class triplex_reader_creator(reader_creator): input_label = np.zeros( (args.label_nc, index.shape[1], index.shape[2])) np.put_along_axis(input_label, index, 1.0, 0) - img1 = input_label + img1 = input_label.astype('float32') img2 = (np.array(img2).astype('float32') / 255.0 - 0.5) / 0.5 img2 = img2.transpose([2, 0, 1]) if not args.no_instance: @@ -630,6 +630,7 @@ class data_reader(object): batch_size=self.cfg.batch_size, mode="TRAIN") reader_test = None + id2name = None if self.cfg.run_test: test_list = os.path.join(dataset_dir, "test.txt") if self.cfg.test_list is not None: diff --git a/PaddleCV/PaddleGAN/infer.py b/PaddleCV/PaddleGAN/infer.py index 04c2fc9226ba27b59bbfc5da9cb625d29b08cdf9..4eed0a5deef604a68352985fe00fb0ba76b35c2f 100644 --- a/PaddleCV/PaddleGAN/infer.py +++ b/PaddleCV/PaddleGAN/infer.py @@ -305,16 +305,22 @@ def infer(args): id2name = test_reader.id2name for data in loader(): real_img, image_name = data[0]['input'], data[0]['image_name'] - image_name = id2name[np.array(image_name).astype('int32')[0]] - print("read: ", image_name) + image_names = [] + for name in image_name: + image_names.append(id2name[np.array(name).astype('int32')[0]]) + print("read: ", image_names) fake_temp = exe.run(fetch_list=[fake.name], feed={"input": real_img}) - fake_temp = np.squeeze(fake_temp[0]).transpose([1, 2, 0]) - input_temp = np.squeeze(np.array(real_img)[0]).transpose([1, 2, 0]) + fake_temp = save_batch_image(fake_temp[0]) + input_temp = save_batch_image(np.array(real_img)) - imageio.imwrite( - os.path.join(args.output, "fake_" + image_name), ( - (fake_temp + 1) * 127.5).astype(np.uint8)) + for i, name in enumerate(image_names): + imageio.imwrite( + os.path.join(args.output, "fake_" + name), ( + (fake_temp[i] + 1) * 127.5).astype(np.uint8)) + imageio.imwrite( + os.path.join(args.output, "input_" + name), ( + (input_temp[i] + 1) * 127.5).astype(np.uint8)) elif args.model_net == 'SPADE': test_reader = triplex_reader_creator( image_dir=args.dataset_dir, diff --git 
a/PaddleCV/PaddleGAN/network/AttGAN_network.py b/PaddleCV/PaddleGAN/network/AttGAN_network.py index 7d5640ec680730e612f5db39b076dcc0018b33f2..a447e66d16001c834f082e28fcee9da809ae6616 100755 --- a/PaddleCV/PaddleGAN/network/AttGAN_network.py +++ b/PaddleCV/PaddleGAN/network/AttGAN_network.py @@ -62,7 +62,7 @@ class AttGAN_model(object): """Concatenate attribute vector on feature map axis.""" ones = fluid.layers.fill_constant_batch_size_like( z, [-1, a.shape[1], z.shape[2], z.shape[3]], "float32", 1.0) - return fluid.layers.concat([z, ones * a], axis=1) + return fluid.layers.concat([z, fluid.layers.elementwise_mul(ones, a, axis=0)], axis=1) def Genc(self, input, dim=64, n_layers=5, name='G_enc_', is_test=False): z = input diff --git a/PaddleCV/PaddleGAN/network/DCGAN_network.py b/PaddleCV/PaddleGAN/network/DCGAN_network.py index c0e67bdc523fd2563aa1f9470db19ecae10607f5..13ba14d452f81ce5f931f1a343d5d60213e42eb6 100644 --- a/PaddleCV/PaddleGAN/network/DCGAN_network.py +++ b/PaddleCV/PaddleGAN/network/DCGAN_network.py @@ -89,5 +89,5 @@ class DCGAN_model(object): norm=self.norm, activation_fn='leaky_relu', name=name + '_l1') - out = linear(o_l1, 1, activation_fn='sigmoid', name=name + '_l2') + out = linear(o_l1, 1, activation_fn=None, name=name + '_l2') return out diff --git a/PaddleCV/PaddleGAN/network/STGAN_network.py b/PaddleCV/PaddleGAN/network/STGAN_network.py index 6ea82687d1f0bd51b0188a62f78fcf390b45b0dd..75da511ec817375269002c42ee4788d57e6bffad 100755 --- a/PaddleCV/PaddleGAN/network/STGAN_network.py +++ b/PaddleCV/PaddleGAN/network/STGAN_network.py @@ -84,7 +84,7 @@ class STGAN_model(object): """Concatenate attribute vector on feature map axis.""" ones = fluid.layers.fill_constant_batch_size_like( z, [-1, a.shape[1], z.shape[2], z.shape[3]], "float32", 1.0) - return fluid.layers.concat([z, ones * a], axis=1) + return fluid.layers.concat([z, fluid.layers.elementwise_mul(ones, a, axis=0)], axis=1) def Genc(self, input, dim=64, n_layers=5, name='G_enc_', is_test=False): z = input diff --git a/PaddleCV/PaddleGAN/network/base_network.py b/PaddleCV/PaddleGAN/network/base_network.py index e3125a32b08be0d2441d95a6e02fd63374d8619a..50b2d86449f878e1f480fb66e63c1dbe9ad2836b 100644 --- a/PaddleCV/PaddleGAN/network/base_network.py +++ b/PaddleCV/PaddleGAN/network/base_network.py @@ -64,12 +64,6 @@ def norm_layer(input, moving_variance_name=name + '_var') elif norm_type == 'instance_norm': - helper = fluid.layer_helper.LayerHelper("instance_norm", **locals()) - dtype = helper.input_dtype() - epsilon = 1e-5 - mean = fluid.layers.reduce_mean(input, dim=[2, 3], keep_dim=True) - var = fluid.layers.reduce_mean( - fluid.layers.square(input - mean), dim=[2, 3], keep_dim=True) if name is not None: scale_name = name + "_scale" offset_name = name + "_offset" @@ -91,15 +85,8 @@ def norm_layer(input, name=offset_name, initializer=fluid.initializer.Constant(0.0), trainable=False) - scale = helper.create_parameter( - attr=scale_param, shape=input.shape[1:2], dtype=dtype) - offset = helper.create_parameter( - attr=offset_param, shape=input.shape[1:2], dtype=dtype) - - tmp = fluid.layers.elementwise_mul(x=(input - mean), y=scale, axis=1) - tmp = tmp / fluid.layers.sqrt(var + epsilon) - tmp = fluid.layers.elementwise_add(tmp, offset, axis=1) - return tmp + return fluid.layers.instance_norm( + input, param_attr=scale_param, bias_attr=offset_param) else: raise NotImplementedError("norm type: [%s] is not support" % norm_type) diff --git a/PaddleCV/PaddleGAN/train.py b/PaddleCV/PaddleGAN/train.py index 
3008dd24873588e990d1a0e235719d7a13b988ca..a5339021baa039d2c22ea23f735e80b546848d2f 100644 --- a/PaddleCV/PaddleGAN/train.py +++ b/PaddleCV/PaddleGAN/train.py @@ -70,7 +70,7 @@ if __name__ == "__main__": if cfg.profile: if cfg.use_gpu: with fluid.profiler.profiler('All', 'total', - '/tmp/profile') as prof: + cfg.profiler_path) as prof: train(cfg) else: with fluid.profiler.profiler("CPU", sorted_key='total') as cpuprof: diff --git a/PaddleCV/PaddleGAN/trainer/AttGAN.py b/PaddleCV/PaddleGAN/trainer/AttGAN.py index 81fe56977e6d4e64776188801b1cb4d717b9d7b6..02d840d6f163bfe0d62e9631274b20d10d7cf192 100644 --- a/PaddleCV/PaddleGAN/trainer/AttGAN.py +++ b/PaddleCV/PaddleGAN/trainer/AttGAN.py @@ -156,8 +156,13 @@ class DTrainer(): def gradient_penalty(self, f, real, fake=None, cfg=None, name=None): def _interpolate(a, b=None): if b is None: - beta = fluid.layers.uniform_random_batch_size_like( - input=a, shape=a.shape, min=0.0, max=1.0) + if cfg.enable_ce: + beta = fluid.layers.uniform_random_batch_size_like( + input=a, shape=a.shape, min=0.0, max=1.0, seed=1) + else: + beta = fluid.layers.uniform_random_batch_size_like( + input=a, shape=a.shape, min=0.0, max=1.0) + mean = fluid.layers.reduce_mean( a, dim=list(range(len(a.shape))), keep_dim=True) input_sub_mean = fluid.layers.elementwise_sub(a, mean, axis=0) @@ -167,9 +172,14 @@ class DTrainer(): keep_dim=True) b = beta * fluid.layers.sqrt(var) * 0.5 + a shape = [a.shape[0]] - alpha = fluid.layers.uniform_random_batch_size_like( - input=a, shape=shape, min=0.0, max=1.0) - inner = (b - a) * alpha + a + if cfg.enable_ce: + alpha = fluid.layers.uniform_random_batch_size_like( + input=a, shape=shape, min=0.0, max=1.0, seed=1) + else: + alpha = fluid.layers.uniform_random_batch_size_like( + input=a, shape=shape, min=0.0, max=1.0) + + inner = fluid.layers.elementwise_mul((b-a), alpha, axis=0) + a return inner x = _interpolate(real, fake) @@ -254,6 +264,10 @@ class AttGAN(object): default=None, help="the normalization in discriminator, choose in [None, instance_norm]" ) + parser.add_argument( + '--enable_ce', + action='store_true', + help="if set, run the tasks with continuous evaluation logs") return parser @@ -282,6 +296,9 @@ class AttGAN(object): name='label_org_', shape=[None, self.cfg.c_dim], dtype='float32') label_trg_ = fluid.data( name='label_trg_', shape=[None, self.cfg.c_dim], dtype='float32') + # used for continuous evaluation + if self.cfg.enable_ce: + fluid.default_startup_program().random_seed = 90 py_reader = fluid.io.PyReader( feed_list=[image_real, label_org, label_trg], @@ -325,7 +342,11 @@ class AttGAN(object): dis_trainer.program).with_data_parallel( loss_name=dis_trainer.d_loss.name, build_strategy=build_strategy) - + # used for continuous evaluation + if self.cfg.enable_ce: + gen_trainer_program.random_seed = 90 + dis_trainer_program.random_seed = 90 + t_time = 0 for epoch_id in range(self.cfg.epoch): @@ -367,6 +388,8 @@ class AttGAN(object): d_loss_gp[0], batch_time)) sys.stdout.flush() batch_id += 1 + if self.cfg.enable_ce and batch_id == 100: + break if self.cfg.run_test: image_name = fluid.data( @@ -393,3 +416,13 @@ class AttGAN(object): "net_G") utility.checkpoints(epoch_id, self.cfg, exe, dis_trainer, "net_D") + # used for continuous evaluation + if self.cfg.enable_ce: + device_num = fluid.core.get_cuda_device_count() if self.cfg.use_gpu else 1 + print("kpis\tattgan_g_loss_fake_card{}\t{}".format(device_num, g_loss_fake[0])) + print("kpis\tattgan_g_loss_rec_card{}\t{}".format(device_num, g_loss_rec[0])) + 
print("kpis\tattgan_g_loss_cls_card{}\t{}".format(device_num, g_loss_cls[0])) + print("kpis\tattgan_d_loss_real_card{}\t{}".format(device_num, d_loss_real[0])) + print("kpis\tattgan_d_loss_fake_card{}\t{}".format(device_num,d_loss_fake[0])) + print("kpis\tattgan_d_loss_gp_card{}\t{}".format(device_num,d_loss_gp[0])) + print("kpis\tattgan_Batch_time_cost_card{}\t{}".format(device_num,batch_time)) diff --git a/PaddleCV/PaddleGAN/trainer/CycleGAN.py b/PaddleCV/PaddleGAN/trainer/CycleGAN.py index 5c6d7909114270b10d017425fa814d11bc477418..62b118eba21c8c2c7d5d221c9effd0c28da4fa54 100644 --- a/PaddleCV/PaddleGAN/trainer/CycleGAN.py +++ b/PaddleCV/PaddleGAN/trainer/CycleGAN.py @@ -207,7 +207,10 @@ class CycleGAN(object): type=int, default=3, help="only used when CycleGAN discriminator is nlayers") - + parser.add_argument( + '--enable_ce', + action='store_true', + help="if set, run the tasks with continuous evaluation logs") return parser def __init__(self, @@ -237,6 +240,9 @@ class CycleGAN(object): name='fake_pool_A', shape=data_shape, dtype='float32') fake_pool_B = fluid.data( name='fake_pool_B', shape=data_shape, dtype='float32') + # used for continuous evaluation + if self.cfg.enable_ce: + fluid.default_startup_program().random_seed = 90 A_py_reader = fluid.io.PyReader( feed_list=[input_A], @@ -317,6 +323,10 @@ class CycleGAN(object): fake_pool_B = B_pool.pool_image(fake_B_tmp) fake_pool_A = A_pool.pool_image(fake_A_tmp) + if self.cfg.enable_ce: + fake_pool_B = fake_B_tmp + fake_pool_A = fake_A_tmp + # optimize the d_A network d_A_loss = exe.run( d_A_trainer_program, @@ -344,6 +354,9 @@ class CycleGAN(object): sys.stdout.flush() batch_id += 1 + # used for continuous evaluation + if self.cfg.enable_ce and batch_id == 10: + break if self.cfg.run_test: A_image_name = fluid.data( @@ -390,3 +403,26 @@ class CycleGAN(object): "net_DA") utility.checkpoints(epoch_id, self.cfg, exe, d_B_trainer, "net_DB") + + # used for continuous evaluation + if self.cfg.enable_ce: + device_num = fluid.core.get_cuda_device_count( + ) if self.cfg.use_gpu else 1 + print("kpis\tcyclegan_g_A_loss_card{}\t{}".format(device_num, + g_A_loss[0])) + print("kpis\tcyclegan_g_A_cyc_loss_card{}\t{}".format( + device_num, g_A_cyc_loss[0])) + print("kpis\tcyclegan_g_A_idt_loss_card{}\t{}".format( + device_num, g_A_idt_loss[0])) + print("kpis\tcyclegan_d_A_loss_card{}\t{}".format(device_num, + d_A_loss[0])) + print("kpis\tcyclegan_g_B_loss_card{}\t{}".format(device_num, + g_B_loss[0])) + print("kpis\tcyclegan_g_B_cyc_loss_card{}\t{}".format( + device_num, g_B_cyc_loss[0])) + print("kpis\tcyclegan_g_B_idt_loss_card{}\t{}".format( + device_num, g_B_idt_loss[0])) + print("kpis\tcyclegan_d_B_loss_card{}\t{}".format(device_num, + d_B_loss[0])) + print("kpis\tcyclegan_Batch_time_cost_card{}\t{}".format( + device_num, batch_time)) diff --git a/PaddleCV/PaddleGAN/trainer/DCGAN.py b/PaddleCV/PaddleGAN/trainer/DCGAN.py index 4301f4d906ac46ce9f1540da11174321e2003258..a14ecb431fb7f912e27e0767b70d1c43b7780c79 100644 --- a/PaddleCV/PaddleGAN/trainer/DCGAN.py +++ b/PaddleCV/PaddleGAN/trainer/DCGAN.py @@ -27,6 +27,7 @@ import matplotlib matplotlib.use('agg') import matplotlib.pyplot as plt import paddle.fluid as fluid +import random class GTrainer(): @@ -78,7 +79,10 @@ class DCGAN(object): def add_special_args(self, parser): parser.add_argument( '--noise_size', type=int, default=100, help="the noise dimension") - + parser.add_argument( + '--enable_ce', + action='store_true', + help="if set, run the tasks with continuous evaluation logs") return 
parser def __init__(self, cfg=None, train_reader=None): @@ -90,6 +94,11 @@ class DCGAN(object): noise = fluid.data( name='noise', shape=[None, self.cfg.noise_size], dtype='float32') label = fluid.data(name='label', shape=[None, 1], dtype='float32') + # used for continuous evaluation + if self.cfg.enable_ce: + fluid.default_startup_program().random_seed = 90 + random.seed(0) + np.random.seed(0) g_trainer = GTrainer(noise, label, self.cfg) d_trainer = DTrainer(img, label, self.cfg) @@ -200,3 +209,11 @@ class DCGAN(object): if self.cfg.save_checkpoints: utility.checkpoints(epoch_id, self.cfg, exe, g_trainer, "net_G") utility.checkpoints(epoch_id, self.cfg, exe, d_trainer, "net_D") + # used for continuous evaluation + if self.cfg.enable_ce: + device_num = fluid.core.get_cuda_device_count( + ) if self.cfg.use_gpu else 1 + print("kpis\tdcgan_d_loss_card{}\t{}".format(device_num, d_loss[0])) + print("kpis\tdcgan_g_loss_card{}\t{}".format(device_num, g_loss[0])) + print("kpis\tdcgan_Batch_time_cost_card{}\t{}".format(device_num, + batch_time)) diff --git a/PaddleCV/PaddleGAN/trainer/Pix2pix.py b/PaddleCV/PaddleGAN/trainer/Pix2pix.py index 3595bea04fdab1f8cb76897151a57513c04ef1c5..1c340a57c3769de9448e4dae1c6352d7f424212f 100644 --- a/PaddleCV/PaddleGAN/trainer/Pix2pix.py +++ b/PaddleCV/PaddleGAN/trainer/Pix2pix.py @@ -18,8 +18,10 @@ from __future__ import print_function from network.Pix2pix_network import Pix2pix_model from util import utility import paddle.fluid as fluid +from paddle.fluid import profiler import sys import time +import numpy as np class GTrainer(): @@ -195,7 +197,10 @@ class Pix2pix(object): type=int, default=3, help="only used when Pix2pix discriminator is nlayers") - + parser.add_argument( + '--enable_ce', + action='store_true', + help="if set, run the tasks with continuous evaluation logs") return parser def __init__(self, @@ -217,6 +222,9 @@ class Pix2pix(object): input_B = fluid.data(name='input_B', shape=data_shape, dtype='float32') input_fake = fluid.data( name='input_fake', shape=data_shape, dtype='float32') + # used for continuous evaluation + if self.cfg.enable_ce: + fluid.default_startup_program().random_seed = 90 loader = fluid.io.DataLoader.from_generator( feed_list=[input_A, input_B], @@ -255,12 +263,15 @@ class Pix2pix(object): t_time = 0 + total_train_batch = 0 # used for benchmark + for epoch_id in range(self.cfg.epoch): batch_id = 0 for tensor in loader(): + if self.cfg.max_iter and total_train_batch == self.cfg.max_iter: # used for benchmark + return s_time = time.time() - tensor_A, tensor_B = tensor[0]['input_A'], tensor[0]['input_B'] # optimize the generator network g_loss_gan, g_loss_l1, fake_B_tmp = exe.run( gen_trainer_program, @@ -270,17 +281,18 @@ class Pix2pix(object): ], feed=tensor) + devices_num = utility.get_device_num(self.cfg) + fake_per_device = int(len(fake_B_tmp) / devices_num) + for dev in range(devices_num): + tensor[dev]['input_fake'] = fake_B_tmp[dev * fake_per_device : (dev+1) * fake_per_device] + # optimize the discriminator network d_loss_real, d_loss_fake = exe.run(dis_trainer_program, fetch_list=[ dis_trainer.d_loss_real, dis_trainer.d_loss_fake ], - feed={ - "input_A": tensor_A, - "input_B": tensor_B, - "input_fake": fake_B_tmp - }) + feed=tensor) batch_time = time.time() - s_time t_time += batch_time @@ -294,6 +306,12 @@ class Pix2pix(object): sys.stdout.flush() batch_id += 1 + total_train_batch += 1 # used for benchmark + # profiler tools + if self.cfg.profile and epoch_id == 0 and batch_id == self.cfg.print_freq: + 
profiler.reset_profiler() + elif self.cfg.profile and epoch_id == 0 and batch_id == self.cfg.print_freq + 5: + return if self.cfg.run_test: image_name = fluid.data( @@ -325,3 +343,16 @@ class Pix2pix(object): "net_G") utility.checkpoints(epoch_id, self.cfg, exe, dis_trainer, "net_D") + if self.cfg.enable_ce: + device_num = fluid.core.get_cuda_device_count( + ) if self.cfg.use_gpu else 1 + print("kpis\tpix2pix_g_loss_gan_card{}\t{}".format(device_num, + g_loss_gan[0])) + print("kpis\tpix2pix_g_loss_l1_card{}\t{}".format(device_num, + g_loss_l1[0])) + print("kpis\tpix2pix_d_loss_real_card{}\t{}".format(device_num, + d_loss_real[0])) + print("kpis\tpix2pix_d_loss_fake_card{}\t{}".format(device_num, + d_loss_fake[0])) + print("kpis\tpix2pix_Batch_time_cost_card{}\t{}".format(device_num, + batch_time)) diff --git a/PaddleCV/PaddleGAN/trainer/SPADE.py b/PaddleCV/PaddleGAN/trainer/SPADE.py index b11c9b6c556e90ce3db50bd5de8e22c506ec6b1a..59d9df64334e6ec2f254dfdb018266047f4ccd1f 100644 --- a/PaddleCV/PaddleGAN/trainer/SPADE.py +++ b/PaddleCV/PaddleGAN/trainer/SPADE.py @@ -268,7 +268,11 @@ class SPADE(object): type=bool, default=False, help="Whether to use instance label.") - + parser.add_argument( + '--enable_ce', + type=bool, + default=False, + help="If set True, enable continuous evaluation job.") return parser def __init__(self, @@ -298,6 +302,9 @@ class SPADE(object): name='input_ins', shape=edge_shape, dtype='float32') input_fake = fluid.data( name='input_fake', shape=data_shape, dtype='float32') + # used for continuous evaluation + if self.cfg.enable_ce: + fluid.default_startup_program().random_seed = 90 gen_trainer = GTrainer(input_A, input_B, input_C, self.cfg, self.batch_num) @@ -343,7 +350,11 @@ class SPADE(object): dis_trainer.program).with_data_parallel( loss_name=dis_trainer.d_loss.name, build_strategy=build_strategy) - + # used for continuous evaluation + if self.cfg.enable_ce: + gen_trainer_program.random_seed = 90 + dis_trainer_program.random_seed = 90 + t_time = 0 for epoch_id in range(self.cfg.epoch): @@ -391,7 +402,6 @@ class SPADE(object): sys.stdout.flush() batch_id += 1 - if self.cfg.run_test: test_program = gen_trainer.infer_program image_name = fluid.data( @@ -422,3 +432,12 @@ class SPADE(object): "net_G") utility.checkpoints(epoch_id, self.cfg, exe, dis_trainer, "net_D") + # used for continuous evaluation + if self.cfg.enable_ce: + device_num = fluid.core.get_cuda_device_count() if self.cfg.use_gpu else 1 + print("kpis\tspade_g_loss_gan_card{}\t{}".format(device_num, g_loss_gan[0])) + print("kpis\tspade_g_loss_vgg_card{}\t{}".format(device_num,g_loss_vgg[0])) + print("kpis\tspade_g_loss_feat_card{}\t{}".format(device_num,g_loss_feat[0])) + print("kpis\tspade_d_loss_real_card{}\t{}".format(device_num,d_loss_real[0])) + print("kpis\tspade_d_loss_fake_card{}\t{}".format(device_num,d_loss_fake[0])) + print("kpis\tspade_Batch_time_cost_card{}\t{}".format(device_num,batch_time)) diff --git a/PaddleCV/PaddleGAN/trainer/STGAN.py b/PaddleCV/PaddleGAN/trainer/STGAN.py index 6e3c6156ae2f19d277aae0a89d68ed6db40e05da..7d4275c0dd3f64ac1924b64bcc15d677f4e8a1e3 100644 --- a/PaddleCV/PaddleGAN/trainer/STGAN.py +++ b/PaddleCV/PaddleGAN/trainer/STGAN.py @@ -17,10 +17,12 @@ from __future__ import print_function from network.STGAN_network import STGAN_model from util import utility import paddle.fluid as fluid +from paddle.fluid import profiler import sys import time import copy import numpy as np +import ast class GTrainer(): @@ -162,8 +164,13 @@ class DTrainer(): def gradient_penalty(self, f, 
real, fake=None, cfg=None, name=None): def _interpolate(a, b=None): if b is None: - beta = fluid.layers.uniform_random_batch_size_like( - input=a, shape=a.shape, min=0.0, max=1.0) + if cfg.enable_ce: + beta = fluid.layers.uniform_random_batch_size_like( + input=a, shape=a.shape, min=0.0, max=1.0, seed=1) + else: + beta = fluid.layers.uniform_random_batch_size_like( + input=a, shape=a.shape, min=0.0, max=1.0) + mean = fluid.layers.reduce_mean( a, dim=list(range(len(a.shape))), keep_dim=True) input_sub_mean = fluid.layers.elementwise_sub(a, mean, axis=0) @@ -173,9 +180,14 @@ class DTrainer(): keep_dim=True) b = beta * fluid.layers.sqrt(var) * 0.5 + a shape = [a.shape[0]] - alpha = fluid.layers.uniform_random_batch_size_like( - input=a, shape=shape, min=0.0, max=1.0) - inner = (b - a) * alpha + a + if cfg.enable_ce: + alpha = fluid.layers.uniform_random_batch_size_like( + input=a, shape=shape, min=0.0, max=1.0, seed=1) + else: + alpha = fluid.layers.uniform_random_batch_size_like( + input=a, shape=shape, min=0.0, max=1.0) + + inner = fluid.layers.elementwise_mul((b-a), alpha, axis=0) + a return inner x = _interpolate(real, fake) @@ -223,7 +235,7 @@ class STGAN(object): default=1024, help="the base fc dim in discriminator") parser.add_argument( - '--use_gru', type=bool, default=True, help="whether to use GRU") + '--use_gru', type=ast.literal_eval, default=True, help="whether to use GRU") parser.add_argument( '--lambda_cls', type=float, @@ -267,7 +279,10 @@ class STGAN(object): default=None, help="the normalization in discriminator, choose in [None, instance_norm]" ) - + parser.add_argument( + '--enable_ce', + action='store_true', + help="if set, run the tasks with continuous evaluation logs") return parser def __init__(self, @@ -294,6 +309,9 @@ class STGAN(object): name='label_org_', shape=[None, self.cfg.c_dim], dtype='float32') label_trg_ = fluid.data( name='label_trg_', shape=[None, self.cfg.c_dim], dtype='float32') + # used for continuous evaluation + if self.cfg.enable_ce: + fluid.default_startup_program().random_seed = 90 test_gen_trainer = GTrainer(image_real, label_org, label_org_, label_trg, label_trg_, self.cfg, @@ -337,12 +355,20 @@ class STGAN(object): dis_trainer.program).with_data_parallel( loss_name=dis_trainer.d_loss.name, build_strategy=build_strategy) - + # used for continuous evaluation + if self.cfg.enable_ce: + gen_trainer_program.random_seed = 90 + dis_trainer_program.random_seed = 90 + t_time = 0 + total_train_batch = 0 # used for benchmark + for epoch_id in range(self.cfg.epoch): batch_id = 0 for data in py_reader(): + if self.cfg.max_iter and total_train_batch == self.cfg.max_iter: # used for benchmark + return s_time = time.time() # optimize the discriminator network fetches = [ @@ -376,6 +402,15 @@ class STGAN(object): d_loss_gp[0], batch_time)) sys.stdout.flush() batch_id += 1 + if self.cfg.enable_ce and batch_id == 100: + break + + total_train_batch += 1 # used for benchmark + # profiler tools + if self.cfg.profile and epoch_id == 0 and batch_id == self.cfg.print_freq: + profiler.reset_profiler() + elif self.cfg.profile and epoch_id == 0 and batch_id == self.cfg.print_freq + 5: + return if self.cfg.run_test: image_name = fluid.data( @@ -401,3 +436,15 @@ class STGAN(object): "net_G") utility.checkpoints(epoch_id, self.cfg, exe, dis_trainer, "net_D") + # used for continuous evaluation + if self.cfg.enable_ce: + device_num = fluid.core.get_cuda_device_count() if self.cfg.use_gpu else 1 + print("kpis\tstgan_g_loss_fake_card{}\t{}".format(device_num, g_loss_fake[0])) + 
print("kpis\tstgan_g_loss_rec_card{}\t{}".format(device_num, g_loss_rec[0])) + print("kpis\tstgan_g_loss_cls_card{}\t{}".format(device_num, g_loss_cls[0])) + print("kpis\tstgan_d_loss_card{}\t{}".format(device_num, d_loss[0])) + print("kpis\tstgan_d_loss_real_card{}\t{}".format(device_num, d_loss_real[0])) + print("kpis\tstgan_d_loss_fake_card{}\t{}".format(device_num,d_loss_fake[0])) + print("kpis\tstgan_d_loss_cls_card{}\t{}".format(device_num, d_loss_cls[0])) + print("kpis\tstgan_d_loss_gp_card{}\t{}".format(device_num,d_loss_gp[0])) + print("kpis\tstgan_Batch_time_cost_card{}\t{}".format(device_num,batch_time)) diff --git a/PaddleCV/PaddleGAN/trainer/StarGAN.py b/PaddleCV/PaddleGAN/trainer/StarGAN.py index b4fce5952fa0d41300e474e0dd9919b097ba5fd2..6fa72be7578b082b84fb2f7486ae7991981e9545 100644 --- a/PaddleCV/PaddleGAN/trainer/StarGAN.py +++ b/PaddleCV/PaddleGAN/trainer/StarGAN.py @@ -17,6 +17,7 @@ from __future__ import print_function from network.StarGAN_network import StarGAN_model from util import utility import paddle.fluid as fluid +from paddle.fluid import profiler import sys import time import copy @@ -158,10 +159,14 @@ class DTrainer(): def gradient_penalty(self, f, real, fake, cfg=None, name=None): def _interpolate(a, b): shape = [a.shape[0]] - alpha = fluid.layers.uniform_random_batch_size_like( - input=a, shape=shape, min=0.0, max=1.0) + if cfg.enable_ce: + alpha = fluid.layers.uniform_random_batch_size_like( + input=a, shape=shape, min=0.0, max=1.0, seed=1) + else: + alpha = fluid.layers.uniform_random_batch_size_like( + input=a, shape=shape, min=0.0, max=1.0) - inner = b * (1.0 - alpha) + a * alpha + inner = fluid.layers.elementwise_mul(b, (1.0-alpha), axis=0) + fluid.layers.elementwise_mul(a, alpha, axis=0) return inner x = _interpolate(real, fake) @@ -244,6 +249,10 @@ class StarGAN(object): help="the attributes we selected to change") parser.add_argument( '--n_samples', type=int, default=1, help="batch size when testing") + parser.add_argument( + '--enable_ce', + action='store_true', + help="if set, run the tasks with continuous evaluation logs") return parser @@ -267,6 +276,9 @@ class StarGAN(object): name='label_org', shape=[None, self.cfg.c_dim], dtype='float32') label_trg = fluid.data( name='label_trg', shape=[None, self.cfg.c_dim], dtype='float32') + # used for continuous evaluation + if self.cfg.enable_ce: + fluid.default_startup_program().random_seed = 90 py_reader = fluid.io.PyReader( feed_list=[image_real, label_org, label_trg], @@ -303,12 +315,18 @@ class StarGAN(object): dis_trainer.program).with_data_parallel( loss_name=dis_trainer.d_loss.name, build_strategy=build_strategy) + # used for continuous evaluation + if self.cfg.enable_ce: + gen_trainer_program.random_seed = 90 + dis_trainer_program.random_seed = 90 t_time = 0 - + total_train_batch = 0 # used for benchmark for epoch_id in range(self.cfg.epoch): batch_id = 0 for data in py_reader(): + if self.cfg.max_iter and total_train_batch == self.cfg.max_iter: # used for benchmark + return s_time = time.time() d_loss_real, d_loss_fake, d_loss, d_loss_cls, d_loss_gp = exe.run( dis_trainer_program, @@ -344,6 +362,16 @@ class StarGAN(object): sys.stdout.flush() batch_id += 1 + # used for ce + if self.cfg.enable_ce and batch_id == 100: + break + + total_train_batch += 1 # used for benchmark + # profiler tools + if self.cfg.profile and epoch_id == 0 and batch_id == self.cfg.print_freq: + profiler.reset_profiler() + elif self.cfg.profile and epoch_id == 0 and batch_id == self.cfg.print_freq + 5: + return if 
self.cfg.run_test: image_name = fluid.data( @@ -369,3 +397,14 @@ class StarGAN(object): "net_G") utility.checkpoints(epoch_id, self.cfg, exe, dis_trainer, "net_D") + # used for continuous evaluation + if self.cfg.enable_ce: + device_num = fluid.core.get_cuda_device_count() if self.cfg.use_gpu else 1 + print("kpis\tstargan_g_loss_fake_card{}\t{}".format(device_num, g_loss_fake[0])) + print("kpis\tstargan_g_loss_rec_card{}\t{}".format(device_num, g_loss_rec[0])) + print("kpis\tstargan_g_loss_cls_card{}\t{}".format(device_num, g_loss_cls[0])) + print("kpis\tstargan_d_loss_real_card{}\t{}".format(device_num, d_loss_real[0])) + print("kpis\tstargan_d_loss_fake_card{}\t{}".format(device_num,d_loss_fake[0])) + print("kpis\tstargan_d_loss_cls_card{}\t{}".format(device_num, d_loss_cls[0])) + print("kpis\tstargan_d_loss_gp_card{}\t{}".format(device_num,d_loss_gp[0])) + print("kpis\tstargan_Batch_time_cost_card{}\t{}".format(device_num,batch_time)) diff --git a/PaddleCV/PaddleGAN/util/config.py b/PaddleCV/PaddleGAN/util/config.py index 55666012515dfe2c631b1ac6ec4ed0909e12cc1d..3708cda9b8f7e64d26dcfb8d19abbbe9bbe0d701 100644 --- a/PaddleCV/PaddleGAN/util/config.py +++ b/PaddleCV/PaddleGAN/util/config.py @@ -85,6 +85,11 @@ def base_parse_args(parser): add_arg('run_test', bool, True, "Whether to run test.") add_arg('use_gpu', bool, True, "Whether to use GPU to train.") add_arg('profile', bool, False, "Whether to profile.") + + # NOTE: add args for profiler, used for benchmark + add_arg('profiler_path', str, '/tmp/profile', "the profiler output files. (used for benchmark)") + add_arg('max_iter', int, 0, "the max iter to train. (used for benchmark)") + add_arg('dropout', bool, False, "Whether to use drouput.") add_arg('drop_last', bool, False, "Whether to drop the last images that cannot form a batch") diff --git a/PaddleCV/PaddleGAN/util/utility.py b/PaddleCV/PaddleGAN/util/utility.py index d9465107f4451e1a2687910a8e646b5df32cf8d4..d28961a7c6c3026604fc893f3532be0a87a3308f 100644 --- a/PaddleCV/PaddleGAN/util/utility.py +++ b/PaddleCV/PaddleGAN/util/utility.py @@ -425,3 +425,12 @@ def check_version(): except Exception as e: print(err) sys.exit(1) + +def get_device_num(args): + if args.use_gpu: + gpus = os.environ.get("CUDA_VISIBLE_DEVICES", 1) + gpu_num = len(gpus.split(',')) + return gpu_num + else: + cpu_num = os.environ.get("CPU_NUM", 1) + return int(cpu_num) diff --git a/PaddleCV/PaddleVideo/models/bmn/bmn_utils.py b/PaddleCV/PaddleVideo/models/bmn/bmn_utils.py index da2ceb20e428c940a83270521852cd846eadf07f..d35dd43ba6b46ad07bcc58e4169244ddfd9d488c 100644 --- a/PaddleCV/PaddleVideo/models/bmn/bmn_utils.py +++ b/PaddleCV/PaddleVideo/models/bmn/bmn_utils.py @@ -100,6 +100,7 @@ def soft_nms(df, alpha, t1, t2): def video_process(video_list, video_dict, output_path, + result_dict, snms_alpha=0.4, snms_t1=0.55, snms_t2=0.9): @@ -134,15 +135,13 @@ def bmn_post_processing(video_dict, subset, output_path, result_path): num_videos_per_thread] p = mp.Process( target=video_process, - args=( - tmp_video_list, - video_dict, - output_path, )) + args=(tmp_video_list, video_dict, output_path, result_dict)) p.start() processes.append(p) tmp_video_list = video_list[(pp_num - 1) * num_videos_per_thread:] p = mp.Process( - target=video_process, args=(tmp_video_list, video_dict, output_path)) + target=video_process, + args=(tmp_video_list, video_dict, output_path, result_dict)) p.start() processes.append(p) for p in processes: diff --git a/PaddleCV/PaddleVideo/models/bsn/bsn_utils.py 
b/PaddleCV/PaddleVideo/models/bsn/bsn_utils.py index cee44ebfa20c3290921dc615ebef36c0ab353d0f..d8dc46af2ddea9d3305e04204bf27731452e34d2 100644 --- a/PaddleCV/PaddleVideo/models/bsn/bsn_utils.py +++ b/PaddleCV/PaddleVideo/models/bsn/bsn_utils.py @@ -104,6 +104,7 @@ def soft_nms(df, alpha, t1, t2): def video_process(video_list, video_dict, output_path_pem, + result_dict, snms_alpha=0.75, snms_t1=0.65, snms_t2=0.9): @@ -139,19 +140,13 @@ def bsn_post_processing(video_dict, subset, output_path_pem, result_path_pem): num_videos_per_thread] p = mp.Process( target=video_process, - args=( - tmp_video_list, - video_dict, - output_path_pem, )) + args=(tmp_video_list, video_dict, output_path_pem, result_dict)) p.start() processes.append(p) tmp_video_list = video_list[(pp_num - 1) * num_videos_per_thread:] p = mp.Process( target=video_process, - args=( - tmp_video_list, - video_dict, - output_path_pem, )) + args=(tmp_video_list, video_dict, output_path_pem, result_dict)) p.start() processes.append(p) for p in processes: diff --git a/PaddleCV/PaddleVideo/train.py b/PaddleCV/PaddleVideo/train.py index 467523d88d8878684ff217f741a0a85778d1327d..4adac34374bd8025f988569899e4cb45ed769ed1 100644 --- a/PaddleCV/PaddleVideo/train.py +++ b/PaddleCV/PaddleVideo/train.py @@ -104,6 +104,17 @@ def parse_args(): type=ast.literal_eval, default=False, help='If set True, enable continuous evaluation job.') + # NOTE: args for profiler, used for benchmark + parser.add_argument( + '--profiler_path', + type=str, + default='./', + help='the path to store profiler output file. used for benchmark.') + parser.add_argument( + '--is_profiler', + type=int, + default=0, + help='the switch profiler. used for benchmark.') args = parser.parse_args() return args @@ -236,7 +247,9 @@ def train(args): compiled_test_prog=compiled_valid_prog, #test_exe=valid_exe, test_dataloader=valid_dataloader, test_fetch_list=valid_fetch_list, - test_metrics=valid_metrics) + test_metrics=valid_metrics, + is_profiler=args.is_profiler, + profiler_path=args.profiler_path) if __name__ == "__main__": diff --git a/PaddleCV/PaddleVideo/utils/train_utils.py b/PaddleCV/PaddleVideo/utils/train_utils.py index 4168abbb86eb0675779d570457708b72b41089a7..f7e489183ef91fed9898fdd456932f8cc8264967 100644 --- a/PaddleCV/PaddleVideo/utils/train_utils.py +++ b/PaddleCV/PaddleVideo/utils/train_utils.py @@ -18,6 +18,7 @@ import time import numpy as np import paddle import paddle.fluid as fluid +from paddle.fluid import profiler import logging import shutil @@ -76,7 +77,8 @@ def train_with_dataloader(exe, train_prog, compiled_train_prog, train_dataloader log_interval = 0, valid_interval = 0, save_dir = './', \ save_model_name = 'model', fix_random_seed = False, \ compiled_test_prog = None, test_dataloader = None, \ - test_fetch_list = None, test_metrics = None): + test_fetch_list = None, test_metrics = None, \ + is_profiler = None, profiler_path = None): if not train_dataloader: logger.error("[TRAIN] get dataloader failed.") epoch_periods = [] @@ -98,6 +100,13 @@ def train_with_dataloader(exe, train_prog, compiled_train_prog, train_dataloader train_metrics.calculate_and_log_out(train_outs, \ info = '[TRAIN] Epoch {}, iter {} '.format(epoch, train_iter)) train_iter += 1 + + # NOTE: profiler tools, used for benchmark + if is_profiler and epoch == 0 and train_iter == log_interval: + profiler.start_profiler("All") + elif is_profiler and epoch == 0 and train_iter == log_interval + 5: + profiler.stop_profiler("total", profiler_path) + return if len(epoch_periods) < 1: logger.info( 
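The benchmark hooks added to `PaddleVideo/train.py` and `utils/train_utils.py` above profile only a short, warmed-up window: profiling starts after a few iterations, stops a few iterations later, writes the report, and the run exits early. A minimal sketch of that pattern over a generic training loop (the `run_one_iter` step below is a placeholder, not part of this change):

```python
from paddle.fluid import profiler


def run_one_iter(it):
    # placeholder for the real forward/backward/update step of one iteration
    pass


def train_loop(num_iters, is_profiler=0, profiler_path='./'):
    for it in range(num_iters):
        run_one_iter(it)

        # start profiling once a few warm-up iterations have passed ...
        if is_profiler and it == 5:
            profiler.start_profiler("All")
        # ... then stop shortly after, dump the timing report and bail out
        elif is_profiler and it == 10:
            profiler.stop_profiler("total", profiler_path)
            return
```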
diff --git a/PaddleCV/Research/PWCNet/AverageMeter.py b/PaddleCV/Research/PWCNet/AverageMeter.py new file mode 100644 index 0000000000000000000000000000000000000000..633e6c067d465559d2da61913342da2e521ac731 --- /dev/null +++ b/PaddleCV/Research/PWCNet/AverageMeter.py @@ -0,0 +1,18 @@ + + +class AverageMeter(object): + """Computes and stores the average and current value""" + def __init__(self): + self.reset() + + def reset(self): + self.val = 0 + self.avg = 0 + self.sum = 0 + self.count = 0 + + def update(self, val, n=1): + self.val = val + self.sum += val * n + self.count += n + self.avg = self.sum / self.count diff --git a/PaddleCV/Research/PWCNet/README.md b/PaddleCV/Research/PWCNet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b3335013b641836c47b61dd31f8a6f5459188254 --- /dev/null +++ b/PaddleCV/Research/PWCNet/README.md @@ -0,0 +1,86 @@ +# PWCNet reimplement using paddlepaddle DyGraph +PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume. +# Environment +``` +cenntos7 +paddle develop version (after 20191201) install from source +python3.7 +SciPy 1.1.0 +``` +code will update for paddle v1.7 later. +# Compile correlation op +``` +cd correlation_op +sh make.sh +``` +# Datasets +1.Please download the `FlyingChairs dataset` and `FlyingChairs_train_val.txt` from https://lmb.informatik.uni-freiburg.de/resources/datasets + +Or you can use `./data/download.sh` to download datasets. + +We split the data to train and val by using `FlyingChairs_train_val.txt` with `1 for train and 2 for val`. +# Inference +Note that the paddle models `pwc_net_paddle.pdparams` and `pwc_net_chairs_paddle.pdparams` are transferred from the pytorch pth files `pwc_net.pth.tar` and `pwc_net_chairs.pth.tar`. + +Run +``` +python infer.py +``` + +| Input img1 | Input img2 | +|-------|------------| +| | | + +|prediction with pwc_net_paddle.pdparams| prediction with pwc_net_chairs_paddle.pdparams| +|-------------|-------------| +| | | + +# First Train with L2 loss +A single gpu is supported. Multi gpus will be supported later. + +You should check parameters in `my_args.py` as you like. + +And change them in `train.sh`. +``` +--data_root +--train_val_txt +--batch_size +``` +Then run +``` +./train.sh +``` +Some results during training can be seen +``` +./img1.png +./img2.png +./hsv_pd.png # ground truth +./hsv_predict.png # output of model +``` + +# Finetune with L1 loss +finetune from your best pretrain model by adding --pretrained your_best_model_name eg. `--pretrained epoch_7_pwc_net_paddle` + +Run +``` +./finetune.sh +``` +# Note +This code reimplement PWCNet like the code of `https://github.com/NVlabs/PWC-Net` +If you want to want to train like the paper +``` +@InProceedings{Sun2018PWC-Net, + author = {Deqing Sun and Xiaodong Yang and Ming-Yu Liu and Jan Kautz}, + title = {{PWC-Net}: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume}, + booktitle = CVPR, + year = {2018}, +} +``` +Please use all the datasets in `./data/download.sh` if you like. And use the code in `./data/datasets.py`. 
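As noted in the Datasets section above, the FlyingChairs samples are assigned to train or val by `FlyingChairs_train_val.txt`, one label per sample, with `1` for train and `2` for val. A minimal sketch of that split, assuming the usual FlyingChairs naming (`NNNNN_img1.ppm`, `NNNNN_img2.ppm`, `NNNNN_flow.flo`); this helper is only illustrative and not part of the repo:

```python
import os


def split_flying_chairs(data_root, train_val_txt):
    """Return (train, val) lists of (img1, img2, flow) file triplets.

    Sample i (1-based) is assumed to be stored as NNNNN_img1.ppm,
    NNNNN_img2.ppm and NNNNN_flow.flo; label 1 -> train, 2 -> val.
    """
    train, val = [], []
    with open(train_val_txt) as f:
        labels = [line.strip() for line in f if line.strip()]
    for i, label in enumerate(labels, start=1):
        triplet = tuple(
            os.path.join(data_root, '%05d_%s' % (i, suffix))
            for suffix in ('img1.ppm', 'img2.ppm', 'flow.flo'))
        (train if label == '1' else val).append(triplet)
    return train, val
```

`./data/datasets.py` remains the reader actually used for training; the sketch only shows how the 1/2 labels map to the two subsets.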
+ +Reference works +``` +https://github.com/NVlabs/PWC-Net +https://github.com/ClementPinard/FlowNetPytorch +https://github.com/NVIDIA/flownet2-pytorch/blob/master/datasets.py +``` \ No newline at end of file diff --git a/PaddleCV/Research/PWCNet/__init__.py b/PaddleCV/Research/PWCNet/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/PaddleCV/Research/PWCNet/correlation_op/README.md b/PaddleCV/Research/PWCNet/correlation_op/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d83c6fe61d6fef1d01139289b69605628e689d72 --- /dev/null +++ b/PaddleCV/Research/PWCNet/correlation_op/README.md @@ -0,0 +1,14 @@ +自定义OP编译: +1. 使用paddle develop 12月1日之后的版本 +2. sh make.sh编译成correlation_lib.so动态库 +3. 添加动态库路径到LD_LIBRARY_PATH: +``` +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python3.7 -c 'import paddle; print(paddle.sysconfig.get_lib())'` +``` +4. 添加correlation op的python路径: +``` +export PYTHONPATH=$PYTHONPATH:`pwd` +``` +5. python test_correlation.py运行单测,验证是否加载成功。 + +PS: 如果paddle whl包是从官网上下载的,需要使用gcc 4.8,即把make.sh中的g++ 改为 g++-4.8 diff --git a/PaddleCV/Research/PWCNet/correlation_op/correlation.py b/PaddleCV/Research/PWCNet/correlation_op/correlation.py new file mode 100644 index 0000000000000000000000000000000000000000..05e9267d1fcb51344e096592ad86d22223b99f75 --- /dev/null +++ b/PaddleCV/Research/PWCNet/correlation_op/correlation.py @@ -0,0 +1,25 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import paddle.fluid as fluid +import os +file_dir = os.path.dirname(os.path.abspath(__file__)) +fluid.load_op_library(os.path.join(file_dir, 'correlation_lib.so')) + +from paddle.fluid.layer_helper import LayerHelper + +def correlation(input1, input2, pad_size, kernel_size, max_displacement, stride1, stride2, corr_type_multiply=1): + helper = LayerHelper("correlation", **locals()) + output = helper.create_variable_for_type_inference(dtype=input1.dtype) + helper.append_op(type="correlation", inputs={"Input1": input1, "Input2": input2}, attrs={"pad_size": pad_size, "kernel_size": kernel_size, "max_displacement": max_displacement, "stride1": stride1, "stride2": stride2, "corr_type_multiply": corr_type_multiply}, outputs = {"Output": output}) + return output diff --git a/PaddleCV/Research/PWCNet/correlation_op/correlation_op.cc b/PaddleCV/Research/PWCNet/correlation_op/correlation_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..4902db3ed7115d0d315ae2f2cbab5ea1a5ee6528 --- /dev/null +++ b/PaddleCV/Research/PWCNet/correlation_op/correlation_op.cc @@ -0,0 +1,140 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; + +inline std::vector CorrelationOutputSize(int batch, int input_height, int input_width, int stride1, int stride2, int kernel_size, int pad_size, int max_displacement) { + + std::vector output_shape({batch}); + int kernel_radius = (kernel_size - 1) / 2; + int border_radius = kernel_radius + max_displacement; + int padded_input_height = input_height + 2 * pad_size; + int padded_input_width = input_width + 2 * pad_size; + int output_channel = ((max_displacement/stride2) * 2 + 1) * ((max_displacement/stride2) * 2 + 1); + output_shape.push_back(output_channel); + int output_height = std::ceil(static_cast(padded_input_height - 2 * border_radius) / static_cast(stride1)); + int output_width = std::ceil(static_cast(padded_input_width - 2 * border_radius) / static_cast(stride1)); + output_shape.push_back(output_height); + output_shape.push_back(output_width); + return output_shape; +} + +class CorrelationOpMaker : public framework::OpProtoAndCheckerMaker { + public: + void Make() override{ + AddInput("Input1", "input1"); + AddInput("Input2", "input2"); + AddOutput("Output", "output"); + AddAttr("pad_size", "pad size for input1 and input2"); + AddAttr("kernel_size", "kernel size of input1 and input2"); + AddAttr("max_displacement", "max displacement of input1 and input2"); + AddAttr("stride1", "Input1 stride"); + AddAttr("stride2", "Input2 stride"); + AddAttr("corr_type_multiply", "correlation coefficient").SetDefault(1); + AddComment(R"DOC(Correlation of two feature map. 
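The output has ((max_displacement / stride2) * 2 + 1)^2 channels; its height and width
are ceil((input + 2 * pad_size - 2 * ((kernel_size - 1) / 2 + max_displacement)) / stride1),
as computed in CorrelationOutputSize above.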
Only support NCHW data format.)DOC"); + } +}; + +class CorrelationOp : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override{ + PADDLE_ENFORCE_EQ(ctx->HasInput("Input1"), true, "Input(input1) cannot be null"); + PADDLE_ENFORCE_EQ(ctx->HasInput("Input2"), true, "Input(input2) cannot be null"); + int stride1 = ctx->Attrs().Get("stride1"); + int stride2 = ctx->Attrs().Get("stride2"); + int max_displacement = ctx->Attrs().Get("max_displacement"); + int pad_size = ctx->Attrs().Get("pad_size"); + int kernel_size = ctx->Attrs().Get("kernel_size"); + + auto in_dims = ctx->GetInputDim("Input1"); + auto in2_dims = ctx->GetInputDim("Input2"); + PADDLE_ENFORCE_EQ(in_dims.size() == 4, true, "input1 must be 4-dims"); + PADDLE_ENFORCE_EQ(in2_dims.size() == 4, true, "input2 must be 4-dims"); + std::vector output_shape = CorrelationOutputSize(in_dims[0], in_dims[2], in_dims[3], stride1, stride2, kernel_size, pad_size, max_displacement); + ctx->SetOutputDim("Output", framework::make_ddim(output_shape)); + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override{ + auto input_data_type = OperatorWithKernel::IndicateVarDataType(ctx, "Input1"); + PADDLE_ENFORCE_EQ(input_data_type, ctx.Input("Input2")->type(), "Input1 and Input2 shoule have same type"); + return framework::OpKernelType(input_data_type, ctx.GetPlace()); + } +}; + +template +class CorrelationOpGradMaker : public framework::SingleGradOpMaker { + public: + using framework::SingleGradOpMaker::SingleGradOpMaker; + + protected: + std::unique_ptr Apply() const override { + auto* op = new T(); + op->SetType("correlation_grad"); + op->SetInput("Input1", this->Input("Input1")); + op->SetInput("Input2", this->Input("Input2")); + op->SetInput(framework::GradVarName("Output"), this->OutputGrad("Output")); + op->SetOutput(framework::GradVarName("Input1"), this->InputGrad("Input1")); + op->SetOutput(framework::GradVarName("Input2"), this->InputGrad("Input2")); + op->SetAttrMap(this->Attrs()); + + return std::unique_ptr(op); + } +}; + +class CorrelationOpGrad : public framework::OperatorWithKernel { + public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override{ + PADDLE_ENFORCE_EQ(ctx->HasInput("Input1"), true, "Input(Input1) should not be null"); + PADDLE_ENFORCE_EQ(ctx->HasInput("Input2"), true, "Input(Input2) should not be null"); + PADDLE_ENFORCE_EQ(ctx->HasInput(framework::GradVarName("Output")), true, "Input(Output@GRAD) should not be null"); + + auto in1_dims = ctx->GetInputDim("Input1"); + auto in2_dims = ctx->GetInputDim("Input2"); + ctx->SetOutputDim(framework::GradVarName("Input1"), in1_dims); + ctx->SetOutputDim(framework::GradVarName("Input2"), in1_dims); + } + + protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override{ + const auto* var = ctx.InputVar(framework::GradVarName("Output")); + if (var == nullptr) { + PADDLE_THROW("cannot find Output@GRAD"); + } + return framework::OpKernelType(OperatorWithKernel::IndicateVarDataType(ctx, "Input1"), ctx.GetPlace()); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR(correlation, ops::CorrelationOp, ops::CorrelationOpMaker, + ops::CorrelationOpGradMaker, + ops::CorrelationOpGradMaker); 
+REGISTER_OPERATOR(correlation_grad, ops::CorrelationOpGrad); diff --git a/PaddleCV/Research/PWCNet/correlation_op/correlation_op.cu b/PaddleCV/Research/PWCNet/correlation_op/correlation_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..161844430fe4b9dfeaf80dbe127d802d67a6de76 --- /dev/null +++ b/PaddleCV/Research/PWCNet/correlation_op/correlation_op.cu @@ -0,0 +1,434 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include "paddle/fluid/framework/op_registry.h" + +#define THREADS_PER_BLOCK 32 +#define FULL_MASK 0xffffffff + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; + +template +__forceinline__ __device__ T warpReduceSum(T val) { + for (int offset = 16; offset > 0; offset /= 2) { + val += __shfl_down_sync(FULL_MASK, val, offset); + } + return val; +} + +template +__forceinline__ __device__ T blockReduceSum(T val) { + static __shared__ T shared[32]; + int lane = threadIdx.x % warpSize; + int wid = threadIdx.x / warpSize; + + val = warpReduceSum(val); + if (lane == 0) + shared[wid] = val; + + __syncthreads(); + val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0; + + if (wid == 0) + val = warpReduceSum(val); + + return val; +} + +template +__global__ void set_zero(T *x, int num) { + for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < num; i += blockDim.x * gridDim.x) + x[i] = static_cast(0); +} + +template +__global__ void channel_first(const T *input, T *rinput, const int channel, const int height, const int width, const int pad_size) { + int n = blockIdx.x; + int h = blockIdx.y; + int w = blockIdx.z; + + int ch_off = threadIdx.x; + T value; + int dimchw = channel * height * width; + int dimhw = height * width; + + int p_dimw = (width + 2 * pad_size); + int p_dimh = (height + 2 * pad_size); + int p_dimchw = channel * p_dimw * p_dimh; + int p_dimcw = channel * p_dimw; + + for (int c = ch_off; c < channel; c += THREADS_PER_BLOCK) { + value = input[n * dimchw + c * dimhw + h * width + w]; + rinput[n * p_dimchw + (h + pad_size) * p_dimcw + (w + pad_size) * channel + c] = value; + } +} + +template +__global__ void correlation_forward(T *output, const int output_channel, const int output_height, const int output_width, const T *rinput1, const int input_channel, const int input_height, const int input_width, const T *rinput2, const int pad_size, const int kernel_size, const int max_displacement, const int stride1, const int stride2) { + + int p_input_width = input_width + 2 * pad_size; + int p_input_height = input_height + 2 * pad_size; + + int kernel_rad = (kernel_size - 1) / 2; + int displacement_rad = max_displacement / stride2; + + int displacement_size = 2 * displacement_rad + 1; + + int n = blockIdx.x; + int h1 = blockIdx.y * stride1 + max_displacement; + int w1 = blockIdx.z * stride1 + max_displacement; + int c = threadIdx.x; + + int p_dimchw = p_input_height * p_input_width * input_channel; + int p_dimcw = p_input_width * input_channel; + 
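  // Note (added): rinput1/rinput2 are the padded NHWC copies produced by the
  // channel_first kernel, so p_dimchw / p_dimcw / p_dimc below are the batch, row
  // and column strides used to index them.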
int p_dimc = input_channel; + + int t_dimchw = output_channel * output_height * output_width; + int t_dimhw = output_height * output_width; + int t_dimw = output_width; + + int nelems = kernel_size * kernel_size * p_dimc; + + for (int tj = -displacement_rad; tj <= displacement_rad; ++tj) { + for(int ti = -displacement_rad; ti <= displacement_rad; ++ti) { + int w2 = w1 + ti * stride2; + int h2 = h1 + tj * stride2; + + T acc0 = 0; + for(int j = -kernel_rad; j <= kernel_rad; ++j) { + for(int i = -kernel_rad; i <= kernel_rad; ++i) { + for(int ch = c; ch < p_dimc; ch += blockDim.x) { + int index1 = n * p_dimchw + (h1 + j) * p_dimcw + (w1 + i) * p_dimc + ch; + int index2 = n * p_dimchw + (h2 + j) * p_dimcw + (w2 + i) * p_dimc + ch; + acc0 += static_cast(rinput1[index1] * rinput2[index2]); + } + } + } + if (blockDim.x == warpSize) { + __syncwarp(); + acc0 = warpReduceSum(acc0); + } else { + __syncthreads(); + acc0 = blockReduceSum(acc0); + } + + if (threadIdx.x == 0) { + int tc = (tj + displacement_rad) * displacement_size + (ti + displacement_rad); + const int t_index = n * t_dimchw + tc * t_dimhw + blockIdx.y * t_dimw + blockIdx.z; + output[t_index] = static_cast(acc0 / nelems); + } + } + } + +} + +//class CorrelationKernel +template +class CorrelationKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext &ctx) const override { + PADDLE_ENFORCE_EQ(platform::is_gpu_place(ctx.GetPlace()), true, "It must be CUDAPlace"); + + auto *input1 = ctx.Input("Input1"); + auto *input2 = ctx.Input("Input2"); + int pad_size = ctx.Attr("pad_size"); + int kernel_size = ctx.Attr("kernel_size"); + int stride1 = ctx.Attr("stride1"); + int stride2 = ctx.Attr("stride2"); + int max_displacement = ctx.Attr("max_displacement"); + int corr_type_multiply = ctx.Attr("corr_type_multiply"); + + auto *output = ctx.Output("Output"); + output->mutable_data(ctx.GetPlace()); + auto &dev_ctx = ctx.template device_context(); + + // base on input1, NCHW + auto in_dims = input1->dims(); + int N = in_dims[0]; + int C = in_dims[1]; + int H = in_dims[2]; + int W = in_dims[3]; + + int padded_input_height = H + 2 * pad_size; + int padded_input_width = W + 2 * pad_size; + + Tensor rinput1 = ctx.AllocateTmpTensor({N, padded_input_height, padded_input_width, C}, dev_ctx); + rinput1.mutable_data(ctx.GetPlace()); + + Tensor rinput2 = ctx.AllocateTmpTensor({N, padded_input_height, padded_input_width, C}, dev_ctx); + rinput2.mutable_data(ctx.GetPlace()); + + set_zero<<<(rinput1.numel() + 512 - 1)/512, 512, 0, dev_ctx.stream()>>>(rinput1.data(), rinput1.numel()); + set_zero<<<(rinput2.numel() + 512 - 1)/512, 512, 0, dev_ctx.stream()>>>(rinput2.data(), rinput2.numel()); + set_zero<<<(output->numel() + 512 - 1)/512, 512, 0, dev_ctx.stream()>>>(output->data(), output->numel()); + + auto out_dims = output->dims(); + int OC = out_dims[1]; + int OH = out_dims[2]; + int OW = out_dims[3]; + + dim3 blocks_grid(N, H, W); + dim3 threads_block(THREADS_PER_BLOCK); + + channel_first<<>>(input1->data(), rinput1.data(), C, H, W, pad_size); + channel_first<<>>(input2->data(), rinput2.data(), C, H, W, pad_size); + + dim3 threadsPerBlock(THREADS_PER_BLOCK); + dim3 totalBlocksCorr(N, OH, OW); + + correlation_forward<<>>(output->data(), OC, OH, OW, rinput1.data(), +C, H, W, rinput2.data(), pad_size, kernel_size, max_displacement, stride1, stride2); + } +}; + +template +__global__ void correlation_backward_input1(int item, T *grad_input1, const int input_channel, const int input_height, const int input_width, const T 
*grad_output, const int output_channel, const int output_height, const int output_width, const T *rinput2, const int pad_size, const int kernel_size, const int max_displacement, const int stride1, const int stride2) { + + int n = item; + int h = blockIdx.x * stride1 + pad_size; + int w = blockIdx.y * stride1 + pad_size; + int c = blockIdx.z; + int tch_off = threadIdx.x; + + int kernel_rad = (kernel_size - 1) / 2; + int displacement_rad = max_displacement / stride2; + int displacement_size = 2 * displacement_rad + 1; + + int xmin = (w - kernel_rad - max_displacement) / stride1; + int ymin = (h - kernel_rad - max_displacement) / stride1; + + int xmax = (w + kernel_rad - max_displacement) / stride1; + int ymax = (h + kernel_rad - max_displacement) / stride1; + + if (xmax < 0 || ymax < 0 || xmin >= output_width || ymin >= output_height) { + return; + } + + if (xmin > xmax || ymin > ymax) { + return; + } + + xmin = max(0, xmin); + xmax = min(output_width - 1, xmax); + + ymin = max(0, ymin); + ymax = min(output_height - 1, ymax); + + int p_input_width = input_width + 2 * pad_size; + int p_input_height = input_height + 2 * pad_size; + int p_dimchw = input_channel * p_input_height * p_input_width; + int p_dimcw = input_channel * p_input_width; + int p_dimc = input_channel; + + int t_dimchw = output_channel * output_height * output_width; + int t_dimhw = output_height * output_width; + int t_dimw = output_width; + + int o_dimchw = input_channel * input_height * input_width; + int o_dimhw = input_height * input_width; + int o_dimw = input_width; + + int nelems = kernel_size * kernel_size * input_channel; + + __shared__ T prod_sum[THREADS_PER_BLOCK]; + prod_sum[tch_off] = 0; + + for (int tc = tch_off; tc < output_channel; tc += THREADS_PER_BLOCK) { + int i2 = (tc % displacement_size - displacement_rad) * stride2; + int j2 = (tc / displacement_size - displacement_rad) * stride2; + + int index2 = n * p_dimchw + (h + j2) * p_dimcw + (w + i2) * p_dimc + c; + + T val2 = rinput2[index2]; + for (int j = ymin; j <= ymax; ++j) { + for (int i = xmin; i <= xmax; ++i) { + int t_index = n * t_dimchw + tc * t_dimhw + j * t_dimw + i; + prod_sum[tch_off] += grad_output[t_index] * val2; + } + } + } + + __syncthreads(); + + if (tch_off == 0) { + T reduce_sum = 0; + for (int index = 0; index < THREADS_PER_BLOCK; index++) { + reduce_sum += prod_sum[index]; + } + const int index1 = n * o_dimchw + c * o_dimhw + (h - pad_size) * o_dimw + (w - pad_size); + grad_input1[index1] = static_cast(reduce_sum / nelems); + } + +} + +template +__global__ void correlation_backward_input2(int item, T *grad_input2, const int input_channel, const int input_height, const int input_width, const T *grad_output, const int output_channel, const int output_height, const int output_width, const T *rinput1, const int pad_size, const int kernel_size, const int max_displacement, const int stride1, const int stride2){ + + int n = item; + int h = blockIdx.x * stride1 + pad_size; + int w = blockIdx.y * stride1 + pad_size; + int c = blockIdx.z; + + int tch_off = threadIdx.x; + + int kernel_rad = (kernel_size - 1) / 2; + int displacement_rad = max_displacement / stride2; + int displacement_size = 2 * displacement_rad + 1; + + int p_input_width = input_width + 2 * pad_size; + int p_input_height = input_height + 2 * pad_size; + int p_dimchw = input_channel * p_input_height * p_input_width; + int p_dimcw = input_channel * p_input_width; + int p_dimc = input_channel; + + int t_dimchw = output_channel * output_height * output_width; + int t_dimhw = 
output_height * output_width; + int t_dimw = output_width; + + int o_dimchw = input_channel * input_height * input_width; + int o_dimhw = input_height * input_width; + int o_dimw = input_width; + + int nelems = kernel_size * kernel_size * input_channel; + + __shared__ T prod_sum[THREADS_PER_BLOCK]; + prod_sum[tch_off] = 0; + + for (int tc = tch_off; tc < output_channel; tc += THREADS_PER_BLOCK) { + int i2 = (tc % displacement_size - displacement_rad) * stride2; + int j2 = (tc / displacement_size - displacement_rad) * stride2; + + int xmin = (w - kernel_rad - max_displacement - i2) / stride1; + int ymin = (h - kernel_rad - max_displacement - j2) / stride1; + + int xmax = (w + kernel_rad - max_displacement - i2) / stride1; + int ymax = (h + kernel_rad - max_displacement - j2) / stride1; + + if (xmax < 0 || ymax < 0 || xmin >= output_width || ymin >= output_height) { + continue; + } + + if (xmin > xmax || ymin > ymax) { + continue; + } + + xmin = max(0, xmin); + xmax = min(output_width - 1, xmax); + + ymin = max(0, ymin); + ymax = min(output_height - 1, ymax); + + int index1 = n * p_dimchw + (h - j2) * p_dimcw + (w - i2) * p_dimc + c; + T val1 = rinput1[index1]; + for (int j = ymin; j <= ymax; ++j) { + for (int i = xmin; i <= xmax; ++i) { + int t_index = n * t_dimchw + tc * t_dimhw + j * t_dimw + i; + prod_sum[tch_off] += grad_output[t_index] * val1; + } + } + } + + __syncthreads(); + + if (tch_off == 0) { + T reduce_sum = 0; + for (int index = 0; index < THREADS_PER_BLOCK; index++) { + reduce_sum += prod_sum[index]; + } + const int index2 = n * o_dimchw + c * o_dimhw + (h - pad_size) * o_dimw + (w - pad_size); + grad_input2[index2] = static_cast(reduce_sum / nelems); + } +} + +template +class CorrelationGradKernel : public framework::OpKernel { + public: + void Compute(const framework::ExecutionContext &ctx) const override { + PADDLE_ENFORCE_EQ(platform::is_gpu_place(ctx.GetPlace()), true, "It must use CUDAPlace."); + const auto *input1 = ctx.Input("Input1"); + const auto *input2 = ctx.Input("Input2"); + const auto *grad_output = ctx.Input(framework::GradVarName("Output")); + const int pad_size = ctx.Attr("pad_size"); + const int kernel_size = ctx.Attr("kernel_size"); + const int stride1 = ctx.Attr("stride1"); + const int stride2 = ctx.Attr("stride2"); + const int max_displacement = ctx.Attr("max_displacement"); + const int corr_type_multiply = ctx.Attr("corr_type_multiply"); + + auto *grad_input1 = ctx.Output(framework::GradVarName("Input1")); + grad_input1->mutable_data(ctx.GetPlace()); + auto *grad_input2 = ctx.Output(framework::GradVarName("Input2")); + grad_input2->mutable_data(ctx.GetPlace()); + auto &dev_ctx = ctx.template device_context(); + + auto in_dims = input1->dims(); + int N = in_dims[0]; + int C = in_dims[1]; + int H = in_dims[2]; + int W = in_dims[3]; + + int padded_input_height = H + 2 * pad_size; + int padded_input_width = W + 2 * pad_size; + + Tensor rinput1 = ctx.AllocateTmpTensor({N, padded_input_height, padded_input_width, C}, dev_ctx); + rinput1.mutable_data(ctx.GetPlace()); + + Tensor rinput2 = ctx.AllocateTmpTensor({N, padded_input_height, padded_input_width, C}, dev_ctx); + rinput2.mutable_data(ctx.GetPlace()); + + set_zero<<<(rinput1.numel() + 512 - 1)/512, 512, 0, dev_ctx.stream()>>>(rinput1.data(), rinput1.numel()); + set_zero<<<(rinput2.numel() + 512 - 1)/512, 512, 0, dev_ctx.stream()>>>(rinput2.data(), rinput2.numel()); + set_zero<<<(grad_input1->numel() + 512 - 1)/512, 512, 0, dev_ctx.stream()>>>(grad_input1->data(), grad_input1->numel()); + 
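    // Note (added): the zero fill of grad_input1 above is required because
    // correlation_backward_input1 returns early for positions whose sampling window
    // lies entirely outside the output and never writes those gradient entries.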
set_zero<<<(grad_input2->numel() + 512 - 1)/512, 512, 0, dev_ctx.stream()>>>(grad_input2->data(), grad_input2->numel()); + + auto grad_out_dims = grad_output->dims(); + int GOC = grad_out_dims[1]; + int GOH = grad_out_dims[2]; + int GOW = grad_out_dims[3]; + + dim3 blocks_grid(N, H, W); + dim3 threads_block(THREADS_PER_BLOCK); + + channel_first<<>>(input1->data(), rinput1.data(), C, H, W, pad_size); + channel_first<<>>(input2->data(), rinput2.data(), C, H, W, pad_size); + + dim3 threadsPerBlock(THREADS_PER_BLOCK); + dim3 totalBlocksCorr(H, W, C); + + for (int n = 0; n < N; n++) { + correlation_backward_input1<<>>(n, grad_input1->data(), C, H, W, grad_output->data(), GOC, GOH, GOW, rinput2.data(), pad_size, kernel_size, max_displacement, stride1, stride2); + } + + for (int n = 0; n < N; n++) { + correlation_backward_input2<<>>(n, grad_input2->data(), C, H, W, grad_output->data(), GOC, GOH, GOW, rinput1.data(), pad_size, kernel_size, max_displacement, stride1, stride2); + } + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL( + correlation, ops::CorrelationKernel, + ops::CorrelationKernel); +REGISTER_OP_CUDA_KERNEL( + correlation_grad, ops::CorrelationGradKernel, + ops::CorrelationGradKernel); + diff --git a/PaddleCV/Research/PWCNet/correlation_op/make.sh b/PaddleCV/Research/PWCNet/correlation_op/make.sh new file mode 100644 index 0000000000000000000000000000000000000000..0aa8deb6b3db2908838dbba10b976e37979bf231 --- /dev/null +++ b/PaddleCV/Research/PWCNet/correlation_op/make.sh @@ -0,0 +1,22 @@ +include_dir=$( python3.7 -c 'import paddle; print(paddle.sysconfig.get_include())' ) +lib_dir=$( python3.7 -c 'import paddle; print(paddle.sysconfig.get_lib())' ) + +echo $include_dir +echo $lib_dir + +OPS='correlation_op' +for op in ${OPS} +do +nvcc ${op}.cu -c -o ${op}.cu.o -ccbin cc -DPADDLE_WITH_CUDA -DEIGEN_USE_GPU -DPADDLE_USE_DSO -DPADDLE_WITH_MKLDNN -Xcompiler -fPIC -std=c++11 -Xcompiler -fPIC -w --expt-relaxed-constexpr -O0 -g -DNVCC \ + -I ${include_dir}/third_party/ \ + -I ${include_dir} +done + +##g++-4.8 correlation_op.cu.o correlation_op.cc -o correlation_lib.so -DPADDLE_WITH_MKLDNN -shared -fPIC -std=c++11 -O0 -g \ +g++ correlation_op.cu.o correlation_op.cc -o correlation_lib.so -DPADDLE_WITH_MKLDNN -shared -fPIC -std=c++11 -O0 -g \ + -I ${include_dir}/third_party/ \ + -I ${include_dir} \ + -L ${lib_dir} \ + -L /usr/local/cuda/lib64 -lpaddle_framework -lcudart + +rm *.cu.o diff --git a/PaddleCV/Research/PWCNet/correlation_op/test_correlation.py b/PaddleCV/Research/PWCNet/correlation_op/test_correlation.py new file mode 100644 index 0000000000000000000000000000000000000000..89e254adafe41465be93f98cef837cc6514bf9db --- /dev/null +++ b/PaddleCV/Research/PWCNet/correlation_op/test_correlation.py @@ -0,0 +1,88 @@ +import unittest +from correlation import correlation +import numpy as np +import paddle.fluid as fluid +from paddle.fluid.dygraph.base import to_variable + +def corr(x_1, x_2, pad_size=4, kernel_size=1, max_displacement=4, stride1=1, stride2=1, corr_multiply=1): + K = kernel_size + # rinput1 = np.pad(x_1, tuple([pad_size for _ in range(4)]), mode='constant').transpose(1, 2).transpose(2, 3) + # rinput2 = np.pad(x_2, tuple([pad_size for _ in range(4)]), mode='constant').transpose(1, 2).transpose(2, 3) + + rinput1 = np.pad(x_1, ((0, 0), (0, 0), (pad_size, pad_size), (pad_size, pad_size)), mode='constant') + rinput2 = np.pad(x_2, ((0, 0), (0, 0), (pad_size, pad_size), (pad_size, pad_size)), mode='constant') + 
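    # Note (added): this NumPy reference pads both inputs, moves them to NHWC, and for
    # every displacement (k, l) in [-d, d]^2 averages the elementwise product of the two
    # K x K patches over all channels, mirroring the CUDA kernel's sum / (K * K * C).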
rinput1 = np.transpose(rinput1, (0, 2, 3, 1)) + rinput2 = np.transpose(rinput2, (0, 2, 3, 1)) + B = int(rinput1.shape[0]) + H = int(x_1.shape[2]) + W = int(x_2.shape[3]) + d = max_displacement + D = 2 * d + 1 + output = np.zeros((B, D * D, H, W), dtype=np.float32) + + for b in range(B): + for i in range(H): + for j in range(W): + for k in range(-d, d + 1): + for l in range(-d, d + 1): + x1_index = i + pad_size + y1_index = j + pad_size + x2_index = x1_index + k + y2_index = y1_index + l + output[b, l + d + D * (k + d), i, j] = np.mean( + rinput1[b, x1_index:x1_index + K, y1_index:y1_index + K] * rinput2[b, + x2_index:x2_index + K, + y2_index:y2_index + K]) + + return output + +class TestCorrelationOp(unittest.TestCase): + def test_check_output(self): + #x_shape = (1, 196, 3, 3) + np.random.seed(13) + np.set_printoptions(threshold=np.inf) + x_shape = (2, 10, 3, 3) + x_type = 'float32' + x1 = fluid.layers.data(name='x1', shape=x_shape, dtype=x_type, append_batch_size=False) + x2 = fluid.layers.data(name='x2', shape=x_shape, dtype=x_type, append_batch_size=False) + + x1_np = np.random.randn(2,3,4,5).astype(x_type) + x2_np = np.random.randn(2,3,4,5).astype(x_type) + out_np = corr(x1_np, x2_np, pad_size=4, kernel_size=1, max_displacement=4, stride1=1, stride2=1) + + out = correlation(x1, x2, pad_size=4, kernel_size=1, max_displacement=4, stride1=1, stride2=1) + + place = fluid.CUDAPlace(0) + exe = fluid.Executor(place) + res = exe.run(feed={'x1':x1_np, 'x2':x2_np}, fetch_list=[out.name]) + + self.assertTrue(np.allclose(res[0], out_np)) + +class Net(fluid.dygraph.Layer): + def __init__(self, name_scope): + super(Net, self).__init__(name_scope) + def forward(self, x1, x2): + y = correlation(x1, x2, pad_size=4, kernel_size=1, max_displacement=4, stride1=1, stride2=1) + return y + +class TestCorrelationOpDyGraph(unittest.TestCase): + def test_check_output(self): + np.random.seed(13) + np.set_printoptions(threshold=np.inf) + x_shape = (2, 10, 3, 3) + x_type = 'float32' + place = fluid.CUDAPlace(0) + with fluid.dygraph.guard(place): + x1_np = np.random.randn(2,3,4,5).astype(x_type) + x2_np = np.random.randn(2,3,4,5).astype(x_type) + out_np = corr(x1_np, x2_np, pad_size=4, kernel_size=1, max_displacement=4, stride1=1, stride2=1) + + x1 = to_variable(x1_np) + x2 = to_variable(x2_np) + corr_pd = Net('corr_pd') + y = corr_pd(x1, x2) + out = y.numpy() + self.assertTrue(np.allclose(out, out_np)) + +if __name__ == '__main__': + unittest.main() diff --git a/PaddleCV/Research/PWCNet/data/__init__.py b/PaddleCV/Research/PWCNet/data/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/PaddleCV/Research/PWCNet/data/datasets.py b/PaddleCV/Research/PWCNet/data/datasets.py new file mode 100644 index 0000000000000000000000000000000000000000..080e875df614c6ad8499822b492c85555321b338 --- /dev/null +++ b/PaddleCV/Research/PWCNet/data/datasets.py @@ -0,0 +1,475 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +# @FileName: datasets.py reference https://github.com/NVIDIA/flownet2-pytorch/blob/master/datasets.py +import paddle +import paddle.fluid as fluid +import numpy as np +import argparse +import os, math, random +import sys +from os.path import * +import numpy as np +from glob import glob +sys.path.append('../') +import data.utils.frame_utils as frame_utils +from scipy.misc import imsave +from src import flow_vis +from src.read_files import read_txt_to_index + + +class StaticRandomCrop(object): + def __init__(self, image_size, crop_size): + self.th, self.tw = crop_size + h, w = image_size + self.h1 = random.randint(0, h - self.th) + self.w1 = random.randint(0, w - self.tw) + + def __call__(self, img): + return img[self.h1:(self.h1 + self.th), self.w1:(self.w1 + self.tw), :] + + +class StaticCenterCrop(object): + def __init__(self, image_size, crop_size): + self.th, self.tw = crop_size + self.h, self.w = image_size + + def __call__(self, img): + return img[(self.h - self.th) // 2:(self.h + self.th) // 2, (self.w - self.tw) // 2:(self.w + self.tw) // 2, :] + + +class MpiSintel(object): + def __init__(self, args, is_cropped=False, root='', dstype='clean', replicates=1): + self.args = args + self.is_cropped = is_cropped + self.crop_size = args.crop_size + self.render_size = args.inference_size + self.replicates = replicates + + flow_root = join(root, 'flow') + image_root = join(root, dstype) + + file_list = sorted(glob(join(flow_root, '*/*.flo'))) + + self.flow_list = [] + self.image_list = [] + + for file in file_list: + if 'test' in file: + # print file + continue + + fbase = file[len(flow_root) + 1:] + fprefix = fbase[:-8] + fnum = int(fbase[-8:-4]) + + img1 = join(image_root, fprefix + "%04d" % (fnum + 0) + '.png') + img2 = join(image_root, fprefix + "%04d" % (fnum + 1) + '.png') + + if not isfile(img1) or not isfile(img2) or not isfile(file): + continue + + self.image_list += [[img1, img2]] + self.flow_list += [file] + + self.size = len(self.image_list) + + self.frame_size = frame_utils.read_gen(self.image_list[0][0]).shape + + if (self.render_size[0] < 0) or (self.render_size[1] < 0) or (self.frame_size[0] % 64) or ( + self.frame_size[1] % 64): + self.render_size[0] = ((self.frame_size[0]) // 64) * 64 + self.render_size[1] = ((self.frame_size[1]) // 64) * 64 + + args.inference_size = self.render_size + + assert (len(self.image_list) == len(self.flow_list)) + + def __getitem__(self, index): + + index = index % self.size + + img1 = frame_utils.read_gen(self.image_list[index][0]) + img2 = frame_utils.read_gen(self.image_list[index][1]) + + flow = frame_utils.read_gen(self.flow_list[index]) + + images = [img1, img2] + image_size = img1.shape[:2] + + if self.is_cropped: + cropper = StaticRandomCrop(image_size, self.crop_size) + else: + cropper = StaticCenterCrop(image_size, self.render_size) + images = list(map(cropper, images)) + flow = cropper(flow) + + images = np.array(images).transpose(3, 0, 1, 2) + flow = flow.transpose(2, 0, 1) + return [images], [flow] + + def __len__(self): + return self.size * self.replicates + + +class MpiSintelClean(MpiSintel): + def __init__(self, args, is_cropped=False, root='', replicates=1): + super(MpiSintelClean, self).__init__(args, is_cropped=is_cropped, root=root, dstype='clean', + replicates=replicates) + + +class MpiSintelFinal(MpiSintel): + def __init__(self, args, is_cropped=False, root='', replicates=1): + super(MpiSintelFinal, self).__init__(args, 
is_cropped=is_cropped, root=root, dstype='final', + replicates=replicates) + + +class FlyingChairs(object): + def __init__(self, train_val, args, is_cropped, txt_file, root='/path/to/FlyingChairs_release/data', replicates=1): + self.args = args + self.is_cropped = is_cropped + self.crop_size = args.crop_size + self.render_size = args.inference_size + self.replicates = replicates + + images = sorted(glob(join(root, '*.ppm'))) + + flow_list = sorted(glob(join(root, '*.flo'))) + + assert (len(images) // 2 == len(flow_list)) + + image_list = [] + for i in range(len(flow_list)): + im1 = images[2 * i] + im2 = images[2 * i + 1] + image_list += [[im1, im2]] + + assert len(image_list) == len(flow_list) + if train_val == 'train': + intindex = np.array(read_txt_to_index(txt_file)) + image_list = np.array(image_list) + image_list = image_list[intindex == 1] + image_list = image_list.tolist() + flow_list = np.array(flow_list) + flow_list = flow_list[intindex == 1] + flow_list = flow_list.tolist() + assert len(image_list) == len(flow_list) + elif train_val == 'val': + intindex = np.array(read_txt_to_index(txt_file)) + image_list = np.array(image_list) + image_list = image_list[intindex == 2] + image_list = image_list.tolist() + flow_list = np.array(flow_list) + flow_list = flow_list[intindex == 2] + flow_list = flow_list.tolist() + assert len(image_list) == len(flow_list) + else: + raise ValueError('FlyingChairs_train_val.txt not found for txt_file ......') + self.flow_list = flow_list + self.image_list = image_list + + self.size = len(self.image_list) + + self.frame_size = frame_utils.read_gen(self.image_list[0][0]).shape + + if (self.render_size[0] < 0) or (self.render_size[1] < 0) or (self.frame_size[0] % 64) or ( + self.frame_size[1] % 64): + self.render_size[0] = ((self.frame_size[0]) // 64) * 64 + self.render_size[1] = ((self.frame_size[1]) // 64) * 64 + + args.inference_size = self.render_size + + def __getitem__(self, index): + index = index % self.size + + img1 = frame_utils.read_gen(self.image_list[index][0]) + img2 = frame_utils.read_gen(self.image_list[index][1]) + + flow = frame_utils.read_gen(self.flow_list[index]) + + images = [img1, img2] + image_size = img1.shape[:2] + if self.is_cropped: + cropper = StaticRandomCrop(image_size, self.crop_size) + else: + cropper = StaticCenterCrop(image_size, self.render_size) + images = list(map(cropper, images)) + flow = cropper(flow) + + images = np.array(images).transpose(3, 0, 1, 2) + flow = flow.transpose(2, 0, 1) + return [images], [flow] + + def __len__(self): + return self.size * self.replicates + + +def reader_flyingchairs(dataset): + n = len(dataset) + + def reader(): + for i in range(n): + a, b = dataset[i] + yield a[0][:,0,:,:].transpose(1,2,0), a[0][:,1,:,:].transpose(1,2,0), b[0].transpose(1, 2, 0)# a single entry of data is created each time + return reader + + +class FlyingThings(object): + def __init__(self, args, is_cropped, root='/path/to/flyingthings3d', dstype='frames_cleanpass', replicates=1): + self.args = args + self.is_cropped = is_cropped + self.crop_size = args.crop_size + self.render_size = args.inference_size + self.replicates = replicates + + image_dirs = sorted(glob(join(root, dstype, 'TRAIN/*/*'))) + image_dirs = sorted([join(f, 'left') for f in image_dirs] + [join(f, 'right') for f in image_dirs]) + + flow_dirs = sorted(glob(join(root, 'optical_flow_flo_format/TRAIN/*/*'))) + flow_dirs = sorted( + [join(f, 'into_future/left') for f in flow_dirs] + [join(f, 'into_future/right') for f in flow_dirs]) + + assert 
(len(image_dirs) == len(flow_dirs)) + + self.image_list = [] + self.flow_list = [] + + for idir, fdir in zip(image_dirs, flow_dirs): + images = sorted(glob(join(idir, '*.png'))) + flows = sorted(glob(join(fdir, '*.flo'))) + for i in range(len(flows)): + self.image_list += [[images[i], images[i + 1]]] + self.flow_list += [flows[i]] + + assert len(self.image_list) == len(self.flow_list) + + self.size = len(self.image_list) + + self.frame_size = frame_utils.read_gen(self.image_list[0][0]).shape + + if (self.render_size[0] < 0) or (self.render_size[1] < 0) or (self.frame_size[0] % 64) or ( + self.frame_size[1] % 64): + self.render_size[0] = ((self.frame_size[0]) // 64) * 64 + self.render_size[1] = ((self.frame_size[1]) // 64) * 64 + + args.inference_size = self.render_size + + def __getitem__(self, index): + index = index % self.size + + img1 = frame_utils.read_gen(self.image_list[index][0]) + img2 = frame_utils.read_gen(self.image_list[index][1]) + + flow = frame_utils.read_gen(self.flow_list[index]) + + images = [img1, img2] + image_size = img1.shape[:2] + if self.is_cropped: + cropper = StaticRandomCrop(image_size, self.crop_size) + else: + cropper = StaticCenterCrop(image_size, self.render_size) + images = list(map(cropper, images)) + flow = cropper(flow) + + images = np.array(images).transpose(3, 0, 1, 2) + flow = flow.transpose(2, 0, 1) + return [images], [flow] + + def __len__(self): + return self.size * self.replicates + + +class FlyingThingsClean(FlyingThings): + def __init__(self, args, is_cropped=False, root='', replicates=1): + super(FlyingThingsClean, self).__init__(args, is_cropped=is_cropped, root=root, dstype='frames_cleanpass', + replicates=replicates) + + +class FlyingThingsFinal(FlyingThings): + def __init__(self, args, is_cropped=False, root='', replicates=1): + super(FlyingThingsFinal, self).__init__(args, is_cropped=is_cropped, root=root, dstype='frames_finalpass', + replicates=replicates) + + +class ChairsSDHom(object): + def __init__(self, args, is_cropped, root='/path/to/chairssdhom/data', dstype='train', replicates=1): + self.args = args + self.is_cropped = is_cropped + self.crop_size = args.crop_size + self.render_size = args.inference_size + self.replicates = replicates + + image1 = sorted(glob(join(root, dstype, 't0/*.png'))) + image2 = sorted(glob(join(root, dstype, 't1/*.png'))) + self.flow_list = sorted(glob(join(root, dstype, 'flow/*.flo'))) + + assert (len(image1) == len(self.flow_list)) + + self.image_list = [] + for i in range(len(self.flow_list)): + im1 = image1[i] + im2 = image2[i] + self.image_list += [[im1, im2]] + + assert len(self.image_list) == len(self.flow_list) + + self.size = len(self.image_list) + + self.frame_size = frame_utils.read_gen(self.image_list[0][0]).shape + + if (self.render_size[0] < 0) or (self.render_size[1] < 0) or (self.frame_size[0] % 64) or ( + self.frame_size[1] % 64): + self.render_size[0] = ((self.frame_size[0]) // 64) * 64 + self.render_size[1] = ((self.frame_size[1]) // 64) * 64 + + args.inference_size = self.render_size + + def __getitem__(self, index): + index = index % self.size + + img1 = frame_utils.read_gen(self.image_list[index][0]) + img2 = frame_utils.read_gen(self.image_list[index][1]) + + flow = frame_utils.read_gen(self.flow_list[index]) + flow = flow[::-1, :, :] + + images = [img1, img2] + image_size = img1.shape[:2] + if self.is_cropped: + cropper = StaticRandomCrop(image_size, self.crop_size) + else: + cropper = StaticCenterCrop(image_size, self.render_size) + images = list(map(cropper, images)) + flow = 
cropper(flow) + + images = np.array(images).transpose(3, 0, 1, 2) + flow = flow.transpose(2, 0, 1) + return [images], [flow] + + def __len__(self): + return self.size * self.replicates + + +class ChairsSDHomTrain(ChairsSDHom): + def __init__(self, args, is_cropped=False, root='', replicates=1): + super(ChairsSDHomTrain, self).__init__(args, is_cropped=is_cropped, root=root, dstype='train', + replicates=replicates) + + +class ChairsSDHomTest(ChairsSDHom): + def __init__(self, args, is_cropped=False, root='', replicates=1): + super(ChairsSDHomTest, self).__init__(args, is_cropped=is_cropped, root=root, dstype='test', + replicates=replicates) + + +class ImagesFromFolder(object): + def __init__(self, args, is_cropped, root='/path/to/frames/only/folder', iext='png', replicates=1): + self.args = args + self.is_cropped = is_cropped + self.crop_size = args.crop_size + self.render_size = args.inference_size + self.replicates = replicates + + images = sorted(glob(join(root, '*.' + iext))) + self.image_list = [] + for i in range(len(images) - 1): + im1 = images[i] + im2 = images[i + 1] + self.image_list += [[im1, im2]] + + self.size = len(self.image_list) + + self.frame_size = frame_utils.read_gen(self.image_list[0][0]).shape + + if (self.render_size[0] < 0) or (self.render_size[1] < 0) or (self.frame_size[0] % 64) or ( + self.frame_size[1] % 64): + self.render_size[0] = ((self.frame_size[0]) // 64) * 64 + self.render_size[1] = ((self.frame_size[1]) // 64) * 64 + + args.inference_size = self.render_size + + def __getitem__(self, index): + index = index % self.size + + img1 = frame_utils.read_gen(self.image_list[index][0]) + img2 = frame_utils.read_gen(self.image_list[index][1]) + + images = [img1, img2] + image_size = img1.shape[:2] + if self.is_cropped: + cropper = StaticRandomCrop(image_size, self.crop_size) + else: + cropper = StaticCenterCrop(image_size, self.render_size) + images = list(map(cropper, images)) + + images = np.array(images).transpose(3, 0, 1, 2) + return [images], [np.zeros(images.size()[0:1] + (2,) + images.size()[-2:])] + + def __len__(self): + return self.size * self.replicates + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + args = parser.parse_args() + args.inference_size = [1080, 1920] + args.crop_size = [384, 512] + + index = 50 + flyingchairs_dataset = FlyingChairs(args, True, root='/ssd2/zhenghe/DATA/FlyingChairs_release/data') + # a, b = flyingchairs_dataset[index] + # im1 = a[0][:,0,:,:].transpose(1,2,0) + # im2 = a[0][:,1,:,:].transpose(1,2,0) + # flo = b[0].transpose(1, 2, 0) / 20.0 + # flow_color = flow_vis.flow_to_color(flo, convert_to_bgr=False) + # imsave('./hsv_pd.png', flow_color) + sample_num = len(flyingchairs_dataset) + reader = reader_flyingchairs(flyingchairs_dataset) + BATCH_SIZE = 8 + train_batch_reader = paddle.batch(reader, BATCH_SIZE, drop_last=True) + epoch_num = 1 + + with fluid.dygraph.guard(): + for epoch in range(epoch_num): + for batch_id, data in enumerate(train_batch_reader()): + im1_data = np.array( + [x[0] for x in data]).astype('float32') + im2_data = np.array( + [x[1] for x in data]).astype('float32') + flo_data = np.array( + [x[2] for x in data]).astype('float32') + if batch_id % 500 == 0: + # if batch_id < 10: + print(batch_id) + print(im1_data.shape) + print(im2_data.shape) + print(flo_data.shape) + im1 = im1_data[0, :, :, :] + im2 = im2_data[0, :, :, :] + flo = flo_data[0, :, :, :] + print(im1.shape) + print(im2.shape) + print(flo.shape) + imsave('./img1.png', im1) + imsave('./img2.png', im2) + flow_color = 
flow_vis.flow_to_color(flo, convert_to_bgr=False) + imsave('./hsv_pd.png', flow_color) + print("batch_id:", batch_id) + print(batch_id * BATCH_SIZE) + print(sample_num) + # img = fluid.dygraph.to_variable(dy_x_data) + + + + + diff --git a/PaddleCV/Research/PWCNet/data/download.sh b/PaddleCV/Research/PWCNet/data/download.sh new file mode 100755 index 0000000000000000000000000000000000000000..8a0c5dad4d5fb233be56050983bf1f0b293944d0 --- /dev/null +++ b/PaddleCV/Research/PWCNet/data/download.sh @@ -0,0 +1,17 @@ +#!/bin/bash + +#mkdir FlyingThings3D_release +#cd FlyingThings3D_release +# +#wget http://lmb.informatik.uni-freiburg.de/data/SceneFlowDatasets_CVPR16/Release_april16/data/FlyingThings3D/raw_data/flyingthings3d__frames_cleanpass.tar +#wget http://lmb.informatik.uni-freiburg.de/data/SceneFlowDatasets_CVPR16/Release_april16/data/FlyingThings3D/derived_data/flyingthings3d__optical_flow.tar.bz2 +# +#tar xvf flyingthings3d__frames_cleanpass.tar +#tar xvf flyingthings3d__optical_flow.tar.bz2 +# +#cd .. +wget http://lmb.informatik.uni-freiburg.de/resources/datasets/FlyingChairs/FlyingChairs.zip +unzip FlyingChairs.zip + +#wget https://lmb.informatik.uni-freiburg.de/data/FlowNet2/ChairsSDHom/ChairsSDHom.tar.gz +#tar xvzf ChairsSDHom.tar.gz diff --git a/PaddleCV/Research/PWCNet/data/frame_0010.png b/PaddleCV/Research/PWCNet/data/frame_0010.png new file mode 100644 index 0000000000000000000000000000000000000000..80df246723859bb1e0aaca2f41944537cdc18d70 Binary files /dev/null and b/PaddleCV/Research/PWCNet/data/frame_0010.png differ diff --git a/PaddleCV/Research/PWCNet/data/frame_0011.png b/PaddleCV/Research/PWCNet/data/frame_0011.png new file mode 100644 index 0000000000000000000000000000000000000000..0ee97e97a7eba203eb6f67f032f81a8fbdb2c3ed Binary files /dev/null and b/PaddleCV/Research/PWCNet/data/frame_0011.png differ diff --git a/PaddleCV/Research/PWCNet/data/utils/__init__.py b/PaddleCV/Research/PWCNet/data/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..139597f9cb07c5d48bed18984ec4747f4b4f3438 --- /dev/null +++ b/PaddleCV/Research/PWCNet/data/utils/__init__.py @@ -0,0 +1,2 @@ + + diff --git a/PaddleCV/Research/PWCNet/data/utils/flow_utils.py b/PaddleCV/Research/PWCNet/data/utils/flow_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..4ee0ecbb16a92bb9f738d278b61a18862ad518a5 --- /dev/null +++ b/PaddleCV/Research/PWCNet/data/utils/flow_utils.py @@ -0,0 +1,57 @@ +import numpy as np + +TAG_CHAR = np.array([202021.25], np.float32) + + +def readFlow(fn): + """ Read .flo file in Middlebury format""" + # Code adapted from: + # http://stackoverflow.com/questions/28013200/reading-middlebury-flow-files-with-python-bytes-array-numpy + + # WARNING: this will work on little-endian architectures (eg Intel x86) only! + # print 'fn = %s'%(fn) + with open(fn, 'rb') as f: + magic = np.fromfile(f, np.float32, count=1) + if 202021.25 != magic: + print('Magic number incorrect. Invalid .flo file') + return None + else: + w = np.fromfile(f, np.int32, count=1) + h = np.fromfile(f, np.int32, count=1) + # print 'Reading %d x %d flo file\n' % (w, h) + data = np.fromfile(f, np.float32, count=2 * int(w) * int(h)) + # Reshape data into 3D array (columns, rows, bands) + # The reshape here is for visualization, the original code is (w,h,2) + return np.resize(data, (int(h), int(w), 2)) + + +def writeFlow(filename, uv, v=None): + """ Write optical flow to file. + + If v is None, uv is assumed to contain both u and v channels, + stacked in depth. 
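+    The layout written below is: float32 magic number 202021.25, int32 width,
+    int32 height, then row-major interleaved float32 (u, v) pairs.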
+ Original code by Deqing Sun, adapted from Daniel Scharstein. + """ + nBands = 2 + + if v is None: + assert (uv.ndim == 3) + assert (uv.shape[2] == 2) + u = uv[:, :, 0] + v = uv[:, :, 1] + else: + u = uv + + assert (u.shape == v.shape) + height, width = u.shape + f = open(filename, 'wb') + # write the header + f.write(TAG_CHAR) + np.array(width).astype(np.int32).tofile(f) + np.array(height).astype(np.int32).tofile(f) + # arrange into matrix form + tmp = np.zeros((height, width * nBands)) + tmp[:, np.arange(width) * 2] = u + tmp[:, np.arange(width) * 2 + 1] = v + tmp.astype(np.float32).tofile(f) + f.close() \ No newline at end of file diff --git a/PaddleCV/Research/PWCNet/data/utils/frame_utils.py b/PaddleCV/Research/PWCNet/data/utils/frame_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..40a8ea5a206aec428241ac7674de83a1a4099de0 --- /dev/null +++ b/PaddleCV/Research/PWCNet/data/utils/frame_utils.py @@ -0,0 +1,18 @@ +import numpy as np +from os.path import * +from scipy.misc import imread +from . import flow_utils + +def read_gen(file_name): + ext = splitext(file_name)[-1] + if ext == '.png' or ext == '.jpeg' or ext == '.ppm' or ext == '.jpg': + im = imread(file_name) + if im.shape[2] > 3: + return im[:,:,:3] + else: + return im + elif ext == '.bin' or ext == '.raw': + return np.load(file_name) + elif ext == '.flo': + return flow_utils.readFlow(file_name).astype(np.float32) + return [] \ No newline at end of file diff --git a/PaddleCV/Research/PWCNet/finetune.sh b/PaddleCV/Research/PWCNet/finetune.sh new file mode 100755 index 0000000000000000000000000000000000000000..29d2e802da3cc3fa13413ab768071e19d59e3147 --- /dev/null +++ b/PaddleCV/Research/PWCNet/finetune.sh @@ -0,0 +1,5 @@ +#!/usr/bin/env bash +python3 train.py --loss l1 --pretrained ./out/pwc_net_paddle --dataset FlyingChairs --train_val_txt data_dir/FlyingChairs_release/FlyingChairs_train_val.txt --data_root data_dir/FlyingChairs_release/data + +# use multi gpus NEED TO DO LATER +#python3 -m paddle.distributed.launch --selected_gpus=0,1 train.py --use_multi_gpu --batch_size 40 --loss l1 --pretrained ./out/pwc_net_paddle --dataset FlyingChairs --train_val_txt data_dir/FlyingChairs_release/FlyingChairs_train_val.txt --data_root data_dir/FlyingChairs_release/data diff --git a/PaddleCV/Research/PWCNet/infer.py b/PaddleCV/Research/PWCNet/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..717c18f02c017e910b4a86e09616386668822e8a --- /dev/null +++ b/PaddleCV/Research/PWCNet/infer.py @@ -0,0 +1,148 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Infer for PWCNet.""" +import sys +import pickle +import time +import cv2 +import numpy as np +from math import ceil +from scipy.ndimage import imread +from scipy.misc import imsave +import paddle.fluid as fluid +from models.model import PWCDCNet +from src import flow_vis + + + +def writeFlowFile(filename, uv): + """ + According to the matlab code of Deqing Sun and c++ source code of Daniel Scharstein + Contact: dqsun@cs.brown.edu + Contact: schar@middlebury.edu + """ + TAG_STRING = np.array(202021.25, dtype=np.float32) + if uv.shape[2] != 2: + sys.exit("writeFlowFile: flow must have two bands!"); + H = np.array(uv.shape[0], dtype=np.int32) + W = np.array(uv.shape[1], dtype=np.int32) + with open(filename, 'wb') as f: + f.write(TAG_STRING.tobytes()) + f.write(W.tobytes()) + f.write(H.tobytes()) + f.write(uv.tobytes()) + + +def load_dict(filename_): + with open(filename_, 'rb') as f: + ret_di = pickle.load(f) + return ret_di + + +def pad_input(x0): + intWidth = x0.shape[2] + intHeight = x0.shape[3] + if intWidth != ((intWidth >> 6) << 6): + intWidth_pad = (((intWidth >> 6) + 1) << 6) # more than necessary + intPaddingLeft = int((intWidth_pad - intWidth) / 2) + intPaddingRight = intWidth_pad - intWidth - intPaddingLeft + else: + intWidth_pad = intWidth + intPaddingLeft = 0 + intPaddingRight = 0 + + if intHeight != ((intHeight >> 6) << 6): + intHeight_pad = (((intHeight >> 6) + 1) << 6) # more than necessary + intPaddingTop = int((intHeight_pad - intHeight) / 2) + intPaddingBottom = intHeight_pad - intHeight - intPaddingTop + else: + intHeight_pad = intHeight + intPaddingTop = 0 + intPaddingBottom = 0 + + out = fluid.layers.pad2d(input=x0, + paddings=[intPaddingLeft, intPaddingRight, intPaddingTop, intPaddingBottom], + mode='edge') + + return out, [intPaddingLeft, intPaddingRight, intPaddingTop, intPaddingBottom, intWidth, intHeight] + + +def main(): + im1_fn = 'data/frame_0010.png' + im2_fn = 'data/frame_0011.png' + flow_fn = './tmp/frame_0010_pd.flo' + if len(sys.argv) > 1: + im1_fn = sys.argv[1] + if len(sys.argv) > 2: + im2_fn = sys.argv[2] + if len(sys.argv) > 3: + flow_fn = sys.argv[3] + + im_all = [imread(img) for img in [im1_fn, im2_fn]] + im_all = [im[:, :, :3] for im in im_all] + + # rescale the image size to be multiples of 64 + divisor = 64. 
+ H = im_all[0].shape[0] + W = im_all[0].shape[1] + print('origin shape : ', H, W) + + H_ = int(ceil(H / divisor) * divisor) + W_ = int(ceil(W / divisor) * divisor) + print('resize shape: ', H_, W_) + for i in range(len(im_all)): + im_all[i] = cv2.resize(im_all[i], (W_, H_)) + + for _i, _inputs in enumerate(im_all): + im_all[_i] = im_all[_i][:, :, ::-1] + im_all[_i] = 1.0 * im_all[_i] / 255.0 + im_all[_i] = np.transpose(im_all[_i], (2, 0, 1)) + im_all = np.concatenate((im_all[0], im_all[1]), axis=0).astype(np.float32) + im_all = im_all[np.newaxis, :, :, :] + + with fluid.dygraph.guard(place=fluid.CUDAPlace(0)): + im_all = fluid.dygraph.to_variable(im_all) + im_all, [intPaddingLeft, intPaddingRight, intPaddingTop, intPaddingBottom, intWidth, intHeight] = pad_input( + im_all) + + model = PWCDCNet("pwcnet") + model.eval() + pd_pretrain, _ = fluid.dygraph.load_dygraph("paddle_model/pwc_net_paddle") + model.set_dict(pd_pretrain) + start = time.time() + flo = model(im_all) + end = time.time() + print('Time of PWCNet model for one infer step: ', end - start) + flo = flo[0].numpy() * 20.0 + # scale the flow back to the input size + flo = np.swapaxes(np.swapaxes(flo, 0, 1), 1, 2) + flo = flo[intPaddingTop * 2:intPaddingTop * 2 + intHeight * 2, + intPaddingLeft * 2: intPaddingLeft * 2 + intWidth * 2, :] + u_ = cv2.resize(flo[:, :, 0], (W, H)) + v_ = cv2.resize(flo[:, :, 1], (W, H)) + u_ *= W / float(W_) + v_ *= H / float(H_) + flo = np.dstack((u_, v_)) + + # # Apply the coloring (for OpenCV, set convert_to_bgr=True) + flow_color = flow_vis.flow_to_color(flo, convert_to_bgr=False) + imsave('./tmp/hsv_pd.png', flow_color) + + writeFlowFile(flow_fn, flo) + + +if __name__ == '__main__': + main() + + diff --git a/PaddleCV/Research/PWCNet/models/__init__.py b/PaddleCV/Research/PWCNet/models/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..44a41a91f24512697caec6068c7ce1f4101c93b5 --- /dev/null +++ b/PaddleCV/Research/PWCNet/models/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import models.model diff --git a/PaddleCV/Research/PWCNet/models/model.py b/PaddleCV/Research/PWCNet/models/model.py new file mode 100644 index 0000000000000000000000000000000000000000..435e9f4dbc375251468906ca0f33ac3c79701804 --- /dev/null +++ b/PaddleCV/Research/PWCNet/models/model.py @@ -0,0 +1,277 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +import paddle.fluid as fluid +from paddle.fluid.dygraph import Conv2D, Conv2DTranspose +from correlation_op.correlation import correlation + + +class PWCDCNet(fluid.dygraph.Layer): + def __init__(self, name_scope, md=4): + super(PWCDCNet, self).__init__(name_scope) + self.param_attr = fluid.ParamAttr( + name='conv_weights', + regularizer=fluid.regularizer.L2DecayRegularizer( + regularization_coeff=0.0004), + initializer=fluid.initializer.MSRAInitializer(uniform=True, fan_in=None, seed=0)) + self.md = md + self.conv1a = Conv2D("conv1a", 16, filter_size=3, stride=2, padding=1, param_attr=self.param_attr) + self.conv1aa = Conv2D("conv1aa", 16, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv1b = Conv2D("conv1b", 16, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv2a = Conv2D("conv2a", 32, filter_size=3, stride=2, padding=1, param_attr=self.param_attr) + self.conv2aa = Conv2D("conv2aa", 32, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv2b = Conv2D("conv2b", 32, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv3a = Conv2D("conv3a", 64, filter_size=3, stride=2, padding=1, param_attr=self.param_attr) + self.conv3aa = Conv2D("conv3aa", 64, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv3b = Conv2D("conv3b", 64, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv4a = Conv2D("conv4a", 96, filter_size=3, stride=2, padding=1, param_attr=self.param_attr) + self.conv4aa = Conv2D("conv4aa", 96, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv4b = Conv2D("conv4b", 96, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv5a = Conv2D("conv5a", 128, filter_size=3, stride=2, padding=1, param_attr=self.param_attr) + self.conv5aa = Conv2D("conv5aa", 128, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv5b = Conv2D("conv5b", 128, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv6aa = Conv2D("conv6aa", 196, filter_size=3, stride=2, padding=1, param_attr=self.param_attr) + self.conv6a = Conv2D("conv6a", 196, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv6b = Conv2D("conv6b", 196, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + + self.conv6_0 = Conv2D("conv6_0", 128, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv6_1 = Conv2D("conv6_1", 128, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv6_2 = Conv2D("conv6_2", 96, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv6_3 = Conv2D("conv6_3", 64, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv6_4 = Conv2D("conv6_4", 32, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.predict_flow6 = Conv2D("predict_flow6", 2, filter_size=3,stride=1,padding=1, param_attr=self.param_attr) + self.deconv6 = Conv2DTranspose("deconv6", 2, filter_size=4, stride=2, padding=1, param_attr=self.param_attr) + self.upfeat6 = Conv2DTranspose("upfeat6", 2, filter_size=4, stride=2, padding=1, param_attr=self.param_attr) + + self.conv5_0 = Conv2D("conv5_0", 128, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv5_1 = Conv2D("conv5_1", 128, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv5_2 = Conv2D("conv5_2", 96, 
filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv5_3 = Conv2D("conv5_3", 64, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv5_4 = Conv2D("conv5_4", 32, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.predict_flow5 = Conv2D("predict_flow5", 2, filter_size=3,stride=1,padding=1, param_attr=self.param_attr) + self.deconv5 = Conv2DTranspose("deconv5", 2, filter_size=4, stride=2, padding=1, param_attr=self.param_attr) + self.upfeat5 = Conv2DTranspose("upfeat5", 2, filter_size=4, stride=2, padding=1, param_attr=self.param_attr) + + self.conv4_0 = Conv2D("conv4_0", 128, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv4_1 = Conv2D("conv4_1", 128, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv4_2 = Conv2D("conv4_2", 96, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv4_3 = Conv2D("conv4_3", 64, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv4_4 = Conv2D("conv4_4", 32, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.predict_flow4 = Conv2D("predict_flow4", 2, filter_size=3,stride=1,padding=1, param_attr=self.param_attr) + self.deconv4 = Conv2DTranspose("deconv4", 2, filter_size=4, stride=2, padding=1, param_attr=self.param_attr) + self.upfeat4 = Conv2DTranspose("upfeat4", 2, filter_size=4, stride=2, padding=1, param_attr=self.param_attr) + + self.conv3_0 = Conv2D("conv3_0", 128, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv3_1 = Conv2D("conv3_1", 128, filter_size=3, stride=1, padding=1 ,param_attr=self.param_attr) + self.conv3_2 = Conv2D("conv3_2", 96, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv3_3 = Conv2D("conv3_3", 64, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv3_4 = Conv2D("conv3_4", 32, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.predict_flow3 = Conv2D("predict_flow3", 2, filter_size=3,stride=1,padding=1, param_attr=self.param_attr) + self.deconv3 = Conv2DTranspose("deconv3", 2, filter_size=4, stride=2, padding=1, param_attr=self.param_attr) + self.upfeat3 = Conv2DTranspose("upfeat3", 2, filter_size=4, stride=2, padding=1, param_attr=self.param_attr) + + self.conv2_0 = Conv2D("conv2_0", 128, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv2_1 = Conv2D("conv2_1", 128, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv2_2 = Conv2D("conv2_2", 96, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv2_3 = Conv2D("conv2_3", 64, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.conv2_4 = Conv2D("conv2_4", 32, filter_size=3, stride=1, padding=1, param_attr=self.param_attr) + self.predict_flow2 = Conv2D("predict_flow2", 2, filter_size=3,stride=1,padding=1, param_attr=self.param_attr) + self.deconv2 = Conv2DTranspose("deconv2", 2, filter_size=4, stride=2, padding=1, param_attr=self.param_attr) + + self.dc_conv1 = Conv2D("dc_conv1", 128, filter_size=3, stride=1, padding=1, dilation=1, param_attr=self.param_attr) + self.dc_conv2 = Conv2D("dc_conv2", 128, filter_size=3, stride=1, padding=2, dilation=2, param_attr=self.param_attr) + self.dc_conv3 = Conv2D("dc_conv3", 128, filter_size=3, stride=1, padding=4, dilation=4, param_attr=self.param_attr) + self.dc_conv4 = Conv2D("dc_conv4", 96, filter_size=3, stride=1, padding=8, dilation=8, param_attr=self.param_attr) + self.dc_conv5 = 
Conv2D("dc_conv5", 64, filter_size=3, stride=1, padding=16, dilation=16, param_attr=self.param_attr) + self.dc_conv6 = Conv2D("dc_conv6", 32, filter_size=3, stride=1, padding=1, dilation=1, param_attr=self.param_attr) + self.dc_conv7 = Conv2D("dc_conv7", 2, filter_size=3,stride=1,padding=1, param_attr=self.param_attr) + + def warp(self, x, flo): + """ + warp an image/tensor (im2) back to im1, according to the optical flow + + x: [B, C, H, W] (im2) + flo: [B, 2, H, W] flow + + """ + + B, C, H, W = x.shape + # mesh grid + xx_pd = fluid.layers.range(0, W, 1, 'float32') + xx_pd = fluid.layers.reshape(xx_pd, shape=[1, -1]) + xx_pd = fluid.layers.expand(x=xx_pd, expand_times=[H, 1]) + xx_pd = fluid.layers.reshape(xx_pd, shape=[1, 1, H, W]) + xx_pd = fluid.layers.expand(x=xx_pd, expand_times=[B, 1, 1, 1]) + + yy_pd = fluid.layers.range(0, H, 1, 'float32') + yy_pd = fluid.layers.reshape(yy_pd, shape=[-1, 1]) + yy_pd = fluid.layers.expand(x=yy_pd, expand_times=[1, W]) + yy_pd = fluid.layers.reshape(x=yy_pd, shape=[1, 1, H, W]) + yy_pd = fluid.layers.expand(x=yy_pd, expand_times=[B, 1, 1, 1]) + grid_pd = fluid.layers.concat(input=[xx_pd, yy_pd], axis=1) + flo_pd = flo + vgrid_pd = fluid.layers.elementwise_add(grid_pd, flo_pd) + vgrid_pd_0 = 2.0 * fluid.layers.slice(vgrid_pd, axes=[1], starts=[0], ends=[1]) / max(W - 1, 1) - 1.0 + vgrid_pd_1 = 2.0 * fluid.layers.slice(vgrid_pd, axes=[1], starts=[1], ends=[2]) / max(H - 1, 1) - 1.0 + vgrid_pd = fluid.layers.concat(input=[vgrid_pd_0, vgrid_pd_1], axis=1) + vgrid_pd = fluid.layers.transpose(vgrid_pd, [0, 2, 3, 1]) + output = fluid.layers.grid_sampler(name='grid_sample', x=x, grid=vgrid_pd) + + mask = fluid.layers.zeros_like(x) + mask = mask + 1.0 + mask = fluid.layers.grid_sampler(name='grid_sample', x=mask, grid=vgrid_pd) + mask_temp1 = fluid.layers.cast(mask < 0.9990, 'float32') + mask = mask * (1 - mask_temp1) + mask = fluid.layers.cast(mask > 0, 'float32') + outwarp = fluid.layers.elementwise_mul(output, mask) + + return outwarp + + def corr(self, x_1, x_2): + out = correlation(x_1, x_2, pad_size=self.md, kernel_size=1, max_displacement=self.md, + stride1=1, stride2=1, corr_type_multiply=1) + return out + + def forward(self, x, output_more=False): + im1 = fluid.layers.slice(x, axes=[1], starts=[0], ends=[3]) + im2 = fluid.layers.slice(x, axes=[1], starts=[3], ends=[6]) + # print("\n\n***************************PWC Net details *************** \n\n") + c11 = fluid.layers.leaky_relu(self.conv1a(im1), 0.1) + c11 = fluid.layers.leaky_relu(self.conv1aa(c11), 0.1) + c11 = fluid.layers.leaky_relu(self.conv1b(c11), 0.1) + + c21 = fluid.layers.leaky_relu(self.conv1a(im2), 0.1) + c21 = fluid.layers.leaky_relu(self.conv1aa(c21), 0.1) + c21 = fluid.layers.leaky_relu(self.conv1b(c21), 0.1) + + c12 = fluid.layers.leaky_relu(self.conv2a(c11), 0.1) + c12 = fluid.layers.leaky_relu(self.conv2aa(c12), 0.1) + c12 = fluid.layers.leaky_relu(self.conv2b(c12), 0.1) + + c22 = fluid.layers.leaky_relu(self.conv2a(c21), 0.1) + c22 = fluid.layers.leaky_relu(self.conv2aa(c22), 0.1) + c22 = fluid.layers.leaky_relu(self.conv2b(c22), 0.1) + + c13 = fluid.layers.leaky_relu(self.conv3a(c12), 0.1) + c13 = fluid.layers.leaky_relu(self.conv3aa(c13), 0.1) + c13 = fluid.layers.leaky_relu(self.conv3b(c13), 0.1) + + c23 = fluid.layers.leaky_relu(self.conv3a(c22), 0.1) + c23 = fluid.layers.leaky_relu(self.conv3aa(c23), 0.1) + c23 = fluid.layers.leaky_relu(self.conv3b(c23), 0.1) + + c14 = fluid.layers.leaky_relu(self.conv4a(c13), 0.1) + c14 = fluid.layers.leaky_relu(self.conv4aa(c14), 0.1) + 
c14 = fluid.layers.leaky_relu(self.conv4b(c14), 0.1) + + c24 = fluid.layers.leaky_relu(self.conv4a(c23), 0.1) + c24 = fluid.layers.leaky_relu(self.conv4aa(c24), 0.1) + c24 = fluid.layers.leaky_relu(self.conv4b(c24), 0.1) + + c15 = fluid.layers.leaky_relu(self.conv5a(c14), 0.1) + c15 = fluid.layers.leaky_relu(self.conv5aa(c15), 0.1) + c15 = fluid.layers.leaky_relu(self.conv5b(c15), 0.1) + + c25 = fluid.layers.leaky_relu(self.conv5a(c24), 0.1) + c25 = fluid.layers.leaky_relu(self.conv5aa(c25), 0.1) + c25 = fluid.layers.leaky_relu(self.conv5b(c25), 0.1) + + c16 = fluid.layers.leaky_relu(self.conv6aa(c15), 0.1) + c16 = fluid.layers.leaky_relu(self.conv6a(c16), 0.1) + c16 = fluid.layers.leaky_relu(self.conv6b(c16), 0.1) + + c26 = fluid.layers.leaky_relu(self.conv6aa(c25), 0.1) + c26 = fluid.layers.leaky_relu(self.conv6a(c26), 0.1) + c26 = fluid.layers.leaky_relu(self.conv6b(c26), 0.1) + + corr6 = self.corr(c16, c26) + corr6 = fluid.layers.leaky_relu(corr6, alpha=0.1) + + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv6_0(corr6), 0.1), corr6], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv6_1(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv6_2(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv6_3(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv6_4(x), 0.1), x], axis=1) + + flow6 = self.predict_flow6(x) + up_flow6 = self.deconv6(flow6) + up_feat6 = self.upfeat6(x) + + warp5 = self.warp(c25, up_flow6 * 0.625) + corr5 = self.corr(c15, warp5) + corr5 = fluid.layers.leaky_relu(corr5, alpha=0.1) + + x = fluid.layers.concat(input=[corr5, c15, up_flow6, up_feat6], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv5_0(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv5_1(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv5_2(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv5_3(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv5_4(x), 0.1), x], axis=1) + + flow5 = self.predict_flow5(x) + up_flow5 = self.deconv5(flow5) + up_feat5 = self.upfeat5(x) + + warp4 = self.warp(c24, up_flow5 * 1.25) + corr4 = self.corr(c14, warp4) + corr4 = fluid.layers.leaky_relu(corr4, alpha=0.1) + + x = fluid.layers.concat(input=[corr4, c14, up_flow5, up_feat5], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv4_0(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv4_1(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv4_2(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv4_3(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv4_4(x), 0.1), x], axis=1) + + flow4 = self.predict_flow4(x) + up_flow4 = self.deconv4(flow4) + up_feat4 = self.upfeat4(x) + + warp3 = self.warp(c23, up_flow4 * 2.5) + corr3 = self.corr(c13, warp3) + corr3 = fluid.layers.leaky_relu(corr3, alpha=0.1) + + x = fluid.layers.concat(input=[corr3, c13, up_flow4, up_feat4], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv3_0(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv3_1(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv3_2(x), 0.1), x], axis=1) + x = 
fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv3_3(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv3_4(x), 0.1), x], axis=1) + + flow3 = self.predict_flow3(x) + up_flow3 = self.deconv3(flow3) + up_feat3 = self.upfeat3(x) + + warp2 = self.warp(c22, up_flow3 * 5.0) + corr2 = self.corr(c12, warp2) + corr2 = fluid.layers.leaky_relu(corr2, alpha=0.1) + + x = fluid.layers.concat(input=[corr2, c12, up_flow3, up_feat3], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv2_0(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv2_1(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv2_2(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv2_3(x), 0.1), x], axis=1) + x = fluid.layers.concat(input=[fluid.layers.leaky_relu(self.conv2_4(x), 0.1), x], axis=1) + + flow2 = self.predict_flow2(x) + + x = fluid.layers.leaky_relu(self.dc_conv4(fluid.layers.leaky_relu( + self.dc_conv3(fluid.layers.leaky_relu(self.dc_conv2(fluid.layers.leaky_relu(self.dc_conv1(x), 0.1)), 0.1)), + 0.1)), 0.1) + flow2 += self.dc_conv7( + fluid.layers.leaky_relu(self.dc_conv6(fluid.layers.leaky_relu(self.dc_conv5(x), 0.1)), 0.1)) + if not output_more: + return flow2 + else: + return [flow2, flow3, flow4, flow5, flow6] + diff --git a/PaddleCV/Research/PWCNet/my_args.py b/PaddleCV/Research/PWCNet/my_args.py new file mode 100644 index 0000000000000000000000000000000000000000..bb673efe10534ba319fa240c09f05d044be76d4b --- /dev/null +++ b/PaddleCV/Research/PWCNet/my_args.py @@ -0,0 +1,17 @@ +import argparse + +parser = argparse.ArgumentParser(description='PWCNet_paddle') +parser.add_argument('--dataset', default='FlyingChairs', help='dataset type : FlyingChairs') +parser.add_argument('--data_root', default='', help='the path of selected datasets') +parser.add_argument('--model_out_dir', default='./out', help='the path of selected datasets') +parser.add_argument('--loss', default='l2', help='loss type : first train with l2 and finetune with l1') +parser.add_argument('--train_val_txt', default='', help='the path of selected train_val_txt of dataset') +parser.add_argument('--numEpoch', '-e', type=int, default=100, help='Number of epochs to train') +parser.add_argument('--batch_size', '-b', type=int, default=40, help='batch size') +parser.add_argument('--pretrained', default=None, help='path to the pretrained model weights') +parser.add_argument('--optimize', default=None, help='path to the pretrained optimize weights') +parser.add_argument('--use_multi_gpu',action = 'store_true', help='Enable multi gpu mode') + +args = parser.parse_args() +args.inference_size = [384, 512] +args.crop_size = [384, 448] \ No newline at end of file diff --git a/PaddleCV/Research/PWCNet/paddle_model/pwc_net_chairs_paddle.pdparams b/PaddleCV/Research/PWCNet/paddle_model/pwc_net_chairs_paddle.pdparams new file mode 100755 index 0000000000000000000000000000000000000000..1b8a626b6bd1c5d30e65154bc6bb54f336716b25 Binary files /dev/null and b/PaddleCV/Research/PWCNet/paddle_model/pwc_net_chairs_paddle.pdparams differ diff --git a/PaddleCV/Research/PWCNet/paddle_model/pwc_net_paddle.pdparams b/PaddleCV/Research/PWCNet/paddle_model/pwc_net_paddle.pdparams new file mode 100755 index 0000000000000000000000000000000000000000..6e947b41ca33f8871bb72d3ad1e8f0b709c8f354 Binary files /dev/null and b/PaddleCV/Research/PWCNet/paddle_model/pwc_net_paddle.pdparams differ diff --git 
a/PaddleCV/Research/PWCNet/src/__init__.py b/PaddleCV/Research/PWCNet/src/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/PaddleCV/Research/PWCNet/src/flow_vis.py b/PaddleCV/Research/PWCNet/src/flow_vis.py new file mode 100644 index 0000000000000000000000000000000000000000..d2fe36828f829151ec307f1b4e1dc687b4ecc8b3 --- /dev/null +++ b/PaddleCV/Research/PWCNet/src/flow_vis.py @@ -0,0 +1,163 @@ +# MIT License +# +# Copyright (c) 2018 Tom Runia +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to conditions. +# +# Author: Tom Runia +# Date Created: 2018-08-03 + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import numpy as np + + +def make_colorwheel(): + ''' + Generates a color wheel for optical flow visualization as presented in: + Baker et al. "A Database and Evaluation Methodology for Optical Flow" (ICCV, 2007) + URL: http://vision.middlebury.edu/flow/flowEval-iccv07.pdf + + According to the C++ source code of Daniel Scharstein + According to the Matlab source code of Deqing Sun + ''' + + RY = 15 + YG = 6 + GC = 4 + CB = 11 + BM = 13 + MR = 6 + + ncols = RY + YG + GC + CB + BM + MR + colorwheel = np.zeros((ncols, 3)) + col = 0 + + # RY + colorwheel[0:RY, 0] = 255 + colorwheel[0:RY, 1] = np.floor(255*np.arange(0,RY)/RY) + col = col+RY + # YG + colorwheel[col:col+YG, 0] = 255 - np.floor(255*np.arange(0,YG)/YG) + colorwheel[col:col+YG, 1] = 255 + col = col+YG + # GC + colorwheel[col:col+GC, 1] = 255 + colorwheel[col:col+GC, 2] = np.floor(255*np.arange(0,GC)/GC) + col = col+GC + # CB + colorwheel[col:col+CB, 1] = 255 - np.floor(255*np.arange(CB)/CB) + colorwheel[col:col+CB, 2] = 255 + col = col+CB + # BM + colorwheel[col:col+BM, 2] = 255 + colorwheel[col:col+BM, 0] = np.floor(255*np.arange(0,BM)/BM) + col = col+BM + # MR + colorwheel[col:col+MR, 2] = 255 - np.floor(255*np.arange(MR)/MR) + colorwheel[col:col+MR, 0] = 255 + return colorwheel + + +def flow_compute_color(u, v, convert_to_bgr=False): + ''' + Applies the flow color wheel to (possibly clipped) flow components u and v. + + According to the C++ source code of Daniel Scharstein + According to the Matlab source code of Deqing Sun + + :param u: np.ndarray, input horizontal flow + :param v: np.ndarray, input vertical flow + :param convert_to_bgr: bool, whether to change ordering and output BGR instead of RGB + :return: + ''' + + flow_image = np.zeros((u.shape[0], u.shape[1], 3), np.uint8) + + colorwheel = make_colorwheel() # shape [55x3] + ncols = colorwheel.shape[0] + + rad = np.sqrt(np.square(u) + np.square(v)) + a = np.arctan2(-v, -u)/np.pi + + fk = (a+1) / 2*(ncols-1) + k0 = np.floor(fk).astype(np.int32) + k1 = k0 + 1 + k1[k1 == ncols] = 0 + f = fk - k0 + + for i in range(colorwheel.shape[1]): + + tmp = colorwheel[:,i] + col0 = tmp[k0] / 255.0 + col1 = tmp[k1] / 255.0 + col = (1-f)*col0 + f*col1 + + idx = (rad <= 1) + col[idx] = 1 - rad[idx] * (1-col[idx]) + col[~idx] = col[~idx] * 0.75 # out of range? 
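+        # Hue is selected by the flow direction (the wheel index computed from the
+        # angle above); saturation follows the flow magnitude `rad`: radii <= 1 fade
+        # toward white as the magnitude shrinks, while out-of-range radii (> 1, e.g.
+        # unnormalised flow) are simply dimmed by a fixed 0.75 factor.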
+ + # Note the 2-i => BGR instead of RGB + ch_idx = 2-i if convert_to_bgr else i + flow_image[:,:,ch_idx] = np.floor(255 * col) + + return flow_image + + +def flow_to_color(flow_uv, clip_flow=None, convert_to_bgr=False): + ''' + Expects a two dimensional flow image of shape [H,W,2] + + According to the C++ source code of Daniel Scharstein + According to the Matlab source code of Deqing Sun + + :param flow_uv: np.ndarray of shape [H,W,2] + :param clip_flow: float, maximum clipping value for flow + :return: + ''' + assert flow_uv.ndim == 3, 'input flow must have three dimensions' + assert flow_uv.shape[2] == 2, 'input flow must have shape [H,W,2]' + + if clip_flow is not None: + flow_uv = np.clip(flow_uv, 0, clip_flow) + + u = flow_uv[:,:,0] + v = flow_uv[:,:,1] + + rad = np.sqrt(np.square(u) + np.square(v)) + rad_max = np.max(rad) + + epsilon = 1e-5 + u = u / (rad_max + epsilon) + v = v / (rad_max + epsilon) + return flow_compute_color(u, v, convert_to_bgr) + + +def read_flow(filename): + """ + https://github.com/sampepose/flownet2-tf/blob/master/src/flowlib.py + read optical flow from Middlebury .flo file + :param filename: name of the flow file + :return: optical flow data in matrix + """ + f = open(filename, 'rb') + magic = np.fromfile(f, np.float32, count=1) + data2d = None + + if 202021.25 != magic: + print('Magic number incorrect. Invalid .flo file') + else: + w = np.fromfile(f, np.int32, count=1) + h = np.fromfile(f, np.int32, count=1) + print("Reading %d x %d flo file" % (h, w)) + data2d = np.fromfile(f, np.float32, count=2 * w[0] * h[0]) + # reshape data into 3D array (columns, rows, channels) + data2d = np.resize(data2d, (h[0], w[0], 2)) + f.close() + return data2d diff --git a/PaddleCV/Research/PWCNet/src/multiscaleloss.py b/PaddleCV/Research/PWCNet/src/multiscaleloss.py new file mode 100644 index 0000000000000000000000000000000000000000..a52a74acf278fde4a99335af21459050fd28a7ef --- /dev/null +++ b/PaddleCV/Research/PWCNet/src/multiscaleloss.py @@ -0,0 +1,85 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
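+"""Multi-scale endpoint-error (EPE) losses used to train PWCNet.
+
+`multiscaleEPE` sums the per-level EPE of the five predicted flow maps
+(flow2..flow6), weighting them by default with [0.005, 0.01, 0.02, 0.08, 0.32];
+the ground-truth flow is resized (or sparse-max-pooled) to each prediction's
+resolution before the error is computed. `realEPE` instead upsamples a single
+prediction to the label resolution and is used as the validation metric.
+A rough usage sketch, mirroring the calls made in train.py:
+
+    flows = model(im_all, output_more=True)      # [flow2, ..., flow6]
+    loss = multiscaleEPE(flows, label, 'l2')     # weighted training loss
+    epe = realEPE(model(im_all), label)          # validation metric
+"""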
+import paddle +import paddle.fluid as fluid + + +def EPE(input_flow, target_flow, loss_type, sparse=False, mean=True): + if loss_type == 'l1': + EPE_map = fluid.layers.abs(input_flow - target_flow) + else: + EPE_map = fluid.layers.square(input_flow - target_flow) + if sparse: #TODO mask = (target_flow[:,0] == 0) & (target_flow[:,1] == 0) EPE_map = EPE_map[~mask] + mask_temp1 = fluid.layers.cast(target_flow[:, 0] == 0, 'float32') + mask_temp2 = fluid.layers.cast(target_flow[:, 1] == 0, 'float32') + mask = 1 - fluid.layers.elementwise_mul(mask_temp1, mask_temp2) + mask = fluid.layers.reshape(mask, [mask.shape[0], 1, mask.shape[1], mask.shape[2]]) + mask = fluid.layers.concat([mask, mask], 1) + EPE_map = EPE_map * mask + + if mean: + return fluid.layers.mean(EPE_map) + else: + batch_size = EPE_map.shape[0] + res_sum = fluid.layers.reduce_sum(EPE_map) + res = res_sum / batch_size + return res + + +def sparse_max_pool(input, size): + '''Downsample the input by considering 0 values as invalid. + + Unfortunately, no generic interpolation mode can resize a sparse map correctly, + the strategy here is to use max pooling for positive values and "min pooling" + for negative values, the two results are then summed. + This technique allows sparsity to be minized, contrary to nearest interpolation, + which could potentially lose information for isolated data points.''' + + positive = fluid.layers.cast(input > 0, 'float32') + negative = fluid.layers.cast(input < 0, 'float32') + output = fluid.layers.adaptive_pool2d(input * positive, size) - fluid.layers.adaptive_pool2d(-input * negative, + size) + return output + + +def multiscaleEPE(network_output, target_flow, loss_type, weights=None, sparse=False): + def one_scale(output, target, sparse, loss_type): + if sparse: + h = output.shape[2] + w = output.shape[3] + target_scaled = sparse_max_pool(target, [h, w]) + else: + target_scaled = fluid.layers.resize_bilinear(target, out_shape=[output.shape[2], + output.shape[3]], + align_corners=False, align_mode=False) + return EPE(output, target_scaled, loss_type=loss_type, sparse=sparse, mean=False) + + if type(network_output) not in [tuple, list]: + network_output = [network_output] + if weights is None: + weights = [0.005, 0.01, 0.02, 0.08, 0.32] # as in original article + assert(len(weights) == len(network_output)) + + loss = 0 + for output, weight in zip(network_output, weights): + loss += weight * one_scale(output, target_flow, sparse, loss_type) + return loss + + +def realEPE(output, target, sparse=False): + upsampled_output = fluid.layers.resize_bilinear(output, out_shape=[target.shape[2], + target.shape[3]], + align_corners=False, align_mode=False) + return EPE(upsampled_output, target, sparse, mean=True) + diff --git a/PaddleCV/Research/PWCNet/src/read_files.py b/PaddleCV/Research/PWCNet/src/read_files.py new file mode 100644 index 0000000000000000000000000000000000000000..743a57ddc2552c668c5a76b3511659c861ab160f --- /dev/null +++ b/PaddleCV/Research/PWCNet/src/read_files.py @@ -0,0 +1,22 @@ +def read_txt(videoTxt): + with open(videoTxt, 'r') as f: + videolist = f.readlines() + return videolist + + +def read_txt_to_index(file): + data = read_txt(file) + data = list(map(int, data)) + return data + + +def main(): + file = 'data_dir/FlyingChairs_release/FlyingChairs_train_val.txt' + data = read_txt_to_index(file) + data = list(map(int, data)) + print(data) + print(len(data)) + + +if __name__ == '__main__': + main() diff --git a/PaddleCV/Research/PWCNet/tmp/hsv_pd.png b/PaddleCV/Research/PWCNet/tmp/hsv_pd.png 
new file mode 100755 index 0000000000000000000000000000000000000000..0ebc10300e6d3e93260ddbec59bc9d002958c01a Binary files /dev/null and b/PaddleCV/Research/PWCNet/tmp/hsv_pd.png differ diff --git a/PaddleCV/Research/PWCNet/tmp/hsv_pd_chairs.png b/PaddleCV/Research/PWCNet/tmp/hsv_pd_chairs.png new file mode 100755 index 0000000000000000000000000000000000000000..cc3249bf0ca991502f715f26e29a723b9319b8ce Binary files /dev/null and b/PaddleCV/Research/PWCNet/tmp/hsv_pd_chairs.png differ diff --git a/PaddleCV/Research/PWCNet/train.py b/PaddleCV/Research/PWCNet/train.py new file mode 100644 index 0000000000000000000000000000000000000000..7dc3b05edf1ccd2b59594e5c4a157e90b9390735 --- /dev/null +++ b/PaddleCV/Research/PWCNet/train.py @@ -0,0 +1,275 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Trainer for PWCNet.""" +import sys +import os +os.environ['FLAGS_fraction_of_gpu_memory_to_use'] = "0.99999" +os.environ["FLAGS_eager_delete_tensor_gb"] = "0" +import pickle +import time +import cv2 +import numpy as np +import paddle +import paddle.fluid as fluid +from scipy.misc import imsave +from src import flow_vis +from models.model import PWCDCNet +from data.datasets import FlyingChairs, reader_flyingchairs +from src.multiscaleloss import multiscaleEPE, realEPE +from AverageMeter import * +from my_args import args + + +def writeFlowFile(filename, uv): + """ + According to the matlab code of Deqing Sun and c++ source code of Daniel Scharstein + Contact: dqsun@cs.brown.edu + Contact: schar@middlebury.edu + """ + TAG_STRING = np.array(202021.25, dtype=np.float32) + if uv.shape[2] != 2: + sys.exit("writeFlowFile: flow must have two bands!"); + H = np.array(uv.shape[0], dtype=np.int32) + W = np.array(uv.shape[1], dtype=np.int32) + with open(filename, 'wb') as f: + f.write(TAG_STRING.tobytes()) + f.write(W.tobytes()) + f.write(H.tobytes()) + f.write(uv.tobytes()) + + +def load_dict(filename_): + with open(filename_, 'rb') as f: + ret_di = pickle.load(f) + return ret_di + + +def pad_input(x0): + intWidth = x0.shape[2] + intHeight = x0.shape[3] + if intWidth != ((intWidth >> 6) << 6): + intWidth_pad = (((intWidth >> 6) + 1) << 6) # more than necessary + intPaddingLeft = int((intWidth_pad - intWidth) / 2) + intPaddingRight = intWidth_pad - intWidth - intPaddingLeft + else: + intWidth_pad = intWidth + intPaddingLeft = 0 + intPaddingRight = 0 + + if intHeight != ((intHeight >> 6) << 6): + intHeight_pad = (((intHeight >> 6) + 1) << 6) # more than necessary + intPaddingTop = int((intHeight_pad - intHeight) / 2) + intPaddingBottom = intHeight_pad - intHeight - intPaddingTop + else: + intHeight_pad = intHeight + intPaddingTop = 0 + intPaddingBottom = 0 + + out = fluid.layers.pad2d(input=x0, + paddings=[intPaddingLeft, intPaddingRight, intPaddingTop, intPaddingBottom], + mode='edge') + + return out, [intPaddingLeft, intPaddingRight, intPaddingTop, intPaddingBottom, intWidth, intHeight] + + +def val(model, batch_reader, epoch, batch_num): + 
model.eval() + loss_cnt = AverageMeter() + for batch_id, data in enumerate(batch_reader()): + start = time.time() + im1_data = np.array( + [x[0] for x in data]).astype('float32') + im2_data = np.array( + [x[1] for x in data]).astype('float32') + flo_data = np.array( + [x[2] for x in data]).astype('float32') + step = im1_data.shape[0] + + im_all = np.concatenate((im1_data, im2_data), axis=3).astype(np.float32) + im_all = im_all / 255.0 + im_all = np.swapaxes(np.swapaxes(im_all, 1, 2), 1, 3) + label = flo_data / 20.0 + label = np.swapaxes(np.swapaxes(label, 1, 2), 1, 3) + + im_all = fluid.dygraph.to_variable(im_all) + label = fluid.dygraph.to_variable(label) + # im_all, [intPaddingLeft, intPaddingRight, intPaddingTop, intPaddingBottom, intWidth, intHeight] = pad_input( + # im_all) + + end = time.time() + read_data_time = end - start + start = time.time() + network_output = model(im_all, output_more=False) + loss = realEPE(network_output, label) + end = time.time() + loss_cnt.update(loss.numpy()[0], step) + print('val epoch {} batch {}/{} run time: {}s read data time {}s loss {}'.format(epoch, batch_id, batch_num, + round(end - start, 2), + round(read_data_time, 2), + loss.numpy())) + return round(loss_cnt.avg, 4) + + +def train(model, train_batch_reader, adam, epoch, batch_num, args): + loss_type = args.loss + model.train() + for batch_id, data in enumerate(train_batch_reader()): + start = time.time() + im1_data = np.array( + [x[0] for x in data]).astype('float32') + im2_data = np.array( + [x[1] for x in data]).astype('float32') + flo_data = np.array( + [x[2] for x in data]).astype('float32') + im_all = np.concatenate((im1_data, im2_data), axis=3).astype(np.float32) + im_all = im_all / 255.0 + im_all = np.swapaxes(np.swapaxes(im_all, 1, 2), 1, 3) + label = flo_data / 20.0 + label = np.swapaxes(np.swapaxes(label, 1, 2), 1, 3) + if batch_id % 10 == 0: + im1 = im_all[0, :3, :, :] * 255 + im2 = im_all[0, 3:, :, :] * 255 + im1 = np.swapaxes(np.swapaxes(im1, 0, 1), 1, 2).astype(np.uint8) + im2 = np.swapaxes(np.swapaxes(im2, 0, 1), 1, 2).astype(np.uint8) + + flo = label[0, :, :, :] * 20 + flo = np.swapaxes(np.swapaxes(flo, 0, 1), 1, 2) + imsave('./img1.png', im1) + imsave('./img2.png', im2) + flow_color = flow_vis.flow_to_color(flo, convert_to_bgr=False) + imsave('./hsv_pd.png', flow_color) + H = im_all[0].shape[1] + W = im_all[0].shape[2] + + im_all = fluid.dygraph.to_variable(im_all) + label = fluid.dygraph.to_variable(label) + im_all, [intPaddingLeft, intPaddingRight, intPaddingTop, intPaddingBottom, intWidth, intHeight] = pad_input( + im_all) + + label, _ = pad_input(label) + end = time.time() + read_data_time = end - start + start = time.time() + network_output = model(im_all, output_more=True) + if batch_id % 10 == 0: + flo = network_output[0][0].numpy() * 20.0 + # scale the flow back to the input size + flo = np.swapaxes(np.swapaxes(flo, 0, 1), 1, 2) + flo = flo[intPaddingTop * 2:intPaddingTop * 2 + intHeight * 2, + intPaddingLeft * 2: intPaddingLeft * 2 + intWidth * 2, :] + + u_ = cv2.resize(flo[:, :, 0], (W, H)) + v_ = cv2.resize(flo[:, :, 1], (W, H)) + flo = np.dstack((u_, v_)) + flow_color = flow_vis.flow_to_color(flo, convert_to_bgr=False) + imsave('./hsv_predict.png', flow_color) + loss = multiscaleEPE(network_output, label, loss_type, weights=None, sparse=False) + + end = time.time() + loss.backward() + if args.use_multi_gpu: + model.apply_collective_grads() + adam.minimize(loss) + model.clear_gradients() + print('epoch {} batch {}/{} run time: {}s read data time {}s loss 
{}'.format(epoch, batch_id, batch_num, + round(end - start, 2), + round(read_data_time, 2), + loss.numpy())) + + +def main(): + print(args) + if args.use_multi_gpu: + place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id) + else: + place = fluid.CUDAPlace(0) + + with fluid.dygraph.guard(place=place): + if args.use_multi_gpu: + strategy = fluid.dygraph.parallel.prepare_context() + model = PWCDCNet("pwcnet") + if args.pretrained: + print('-----------load pretrained model:', args.pretrained) + pd_pretrain, _ = fluid.dygraph.load_dygraph(args.pretrained) + model.set_dict(pd_pretrain) + + adam = fluid.optimizer.AdamOptimizer(learning_rate=0.0001, regularization=fluid.regularizer.L2DecayRegularizer( + regularization_coeff=0.0004)) + if args.optimize: + print('--------------load pretrained model:', args.optimize) + adam_pretrain, _ = fluid.dygraph.load_dygraph(args.optimize) + adam.set_dict(adam_pretrain) + if args.use_multi_gpu: + model = fluid.dygraph.parallel.DataParallel(model, strategy) + + if args.dataset == 'FlyingChairs': + train_flyingchairs_dataset = FlyingChairs('train', args, is_cropped=True, txt_file=args.train_val_txt, + root=args.data_root) + val_flyingchairs_dataset = FlyingChairs('val', args, is_cropped=False, txt_file=args.train_val_txt, + root=args.data_root) + else: + raise ValueError('dataset name is wrong, please fix it by using args.dataset') + + train_sample_num = len(train_flyingchairs_dataset) + val_sample_num = len(val_flyingchairs_dataset) + print('train sample num: ', train_sample_num) + print('val sample num: ', val_sample_num) + train_reader = reader_flyingchairs(train_flyingchairs_dataset) + val_reader = reader_flyingchairs(val_flyingchairs_dataset) + if args.use_multi_gpu: + train_reader = fluid.contrib.reader.distributed_batch_reader( + train_reader) + val_reader = fluid.contrib.reader.distributed_batch_reader( + val_reader) + BATCH_SIZE = args.batch_size + train_batch_num = round(train_sample_num / BATCH_SIZE) + val_batch_num = round(val_sample_num / BATCH_SIZE) + train_batch_reader = paddle.batch(paddle.reader.shuffle(train_reader, buf_size=BATCH_SIZE * 100), BATCH_SIZE, + drop_last=True) + val_batch_reader = paddle.batch(val_reader, BATCH_SIZE, drop_last=False) + epoch_num = args.numEpoch + val_value = 100000000 + rm_best_model = "" + + for epoch in range(epoch_num): + train(model, train_batch_reader, adam, epoch, train_batch_num, args) + pd_save_dir = args.model_out_dir + if not os.path.exists(pd_save_dir): + os.makedirs(pd_save_dir) + pd_model_save = os.path.join(pd_save_dir, 'epoch_' + str(epoch) + "_pwc_net_paddle") + rm_dir = os.path.join(pd_save_dir, 'epoch_' + str(epoch - 1) + "_pwc_net_paddle.pdparams") + if os.path.exists(rm_dir): + os.remove(rm_dir) + if args.use_multi_gpu: + if fluid.dygraph.parallel.Env().local_rank == 0: + fluid.dygraph.save_dygraph(model.state_dict(), pd_model_save) + fluid.dygraph.save_dygraph(adam.state_dict(), os.path.join(pd_save_dir, 'adam')) + else: + fluid.dygraph.save_dygraph(model.state_dict(), pd_model_save) + fluid.dygraph.save_dygraph(adam.state_dict(), os.path.join(pd_save_dir, 'adam')) + val_loss_value = val(model, val_batch_reader, epoch, val_batch_num) + if val_loss_value < val_value: + best_model = os.path.join(pd_save_dir, "pwc_net_paddle_" + str(val_loss_value) + '.pdparams') + os.link(pd_model_save + '.pdparams', best_model) + if os.path.exists(rm_best_model): + os.remove(rm_best_model) + rm_best_model = best_model + val_value = val_loss_value + + +if __name__ == '__main__': + main() + + + diff --git 
a/PaddleCV/Research/PWCNet/train.sh b/PaddleCV/Research/PWCNet/train.sh
new file mode 100755
index 0000000000000000000000000000000000000000..7c2b7226bef96ebdbbe6c768255a8419e0d32de0
--- /dev/null
+++ b/PaddleCV/Research/PWCNet/train.sh
@@ -0,0 +1,4 @@
+#!/usr/bin/env bash
+python3 train.py --dataset FlyingChairs --train_val_txt data_dir/FlyingChairs_release/FlyingChairs_train_val.txt --data_root data_dir/FlyingChairs_release/data
+# multi-GPU training (TODO, not enabled yet):
+#python3 -m paddle.distributed.launch --selected_gpus=0,1 --log_dir ./mylog train.py --use_multi_gpu --batch_size 20 --dataset FlyingChairs --train_val_txt data_dir/FlyingChairs_release/FlyingChairs_train_val.txt --data_root data_dir/FlyingChairs_release/data
diff --git a/PaddleCV/Research/SemSegPaddle/README.md b/PaddleCV/Research/SemSegPaddle/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..f4fd9a731b947c95accbaff6686e8d272debf076
--- /dev/null
+++ b/PaddleCV/Research/SemSegPaddle/README.md
@@ -0,0 +1,139 @@
+# SemSegPaddle: A Paddle-based Framework for Deep Learning in Semantic Segmentation
+
+This is a Paddle implementation of semantic segmentation models on multiple datasets, including Cityscapes, Pascal Context, and ADE20K.
+
+## Updates
+
+- [**2020/01/08**] We release ***PSPNet-ResNet101*** and ***GloRe-ResNet101*** models on the Pascal Context and Cityscapes datasets.
+
+## Highlights
+
+Synchronized Batch Normalization is important for segmentation.
+ - The implementation is easy to use: it is pure Python, with no extra C++ extension libraries.
+
+ - Paddle provides sync_batch_norm.
+
+
+## Supported models
+
+We split our models into a backbone and a decoder network, where the backbone networks are transferred from classification networks.
+
+Backbone:
+ - ResNet
+ - ResNeXt
+ - HRNet
+ - EfficientNet
+
+Decoder:
+ - PSPNet: [Pyramid Scene Parsing Network](http://openaccess.thecvf.com/content_cvpr_2017/papers/Zhao_Pyramid_Scene_Parsing_CVPR_2017_paper.pdf)
+ - DeepLabv3: [Rethinking Atrous Convolution for Semantic Image Segmentation](https://arxiv.org/abs/1706.05587)
+ - GloRe: [Graph-Based Global Reasoning Networks](http://openaccess.thecvf.com/content_CVPR_2019/papers/Chen_Graph-Based_Global_Reasoning_Networks_CVPR_2019_paper.pdf)
+ - GINet: [GINet: Graph Interaction Network for Scene Parsing]()
+
+
+
+## Performance
+
+ - Performance on the Cityscapes validation set.
+
+**Method** | **Backbone** | **lr** | **BatchSize** | **epoch** | **mean IoU (Single-scale)** | **Trained weights** |
+------------|:------------:|:----------:|:--------------:|:------------:|:---------------------------:|------------------------|
+PSPNet | resnet101 | 0.01 | 8 | 80 | 78.1 | [pspnet_resnet_cityscapes_epoch_80.pdparams](https://pan.baidu.com/s/1adfvtq2JnLKRv_j7lOmW1A)|
+GloRe | resnet101 | 0.01 | 8 | 80 | 78.4 | [pspnet_resnet_pascalcontext_epoch_80.pdparams](https://pan.baidu.com/s/1r4SbrYKbVk38c0dXZLAi9w) |
+
+
+ - Performance on the Pascal-Context validation set.
+ +**Method** | **Backbone** | **lr** | **BatchSize** | **epoch** | **mean IoU (Single-scale)** | **Trained weights** | +------------|:------------:|:----------:|:--------------:|:------------:|:---------------------------:|:----------------------:| +PSPNet | resnet101 | 0.005 | 16 | 80 | 48.9 | [glore_resnet_cityscapes_epoch_80.pdparams](https://pan.baidu.com/s/1l7-sqt2DsUunD9l4YivgQw) | +GloRe | resnet101 | 0.005 | 16 | 80 | 48.4 | [glore_resnet_pascalcontext_epoch_80.pdparams](https://pan.baidu.com/s/1rVuk7OfSj-AXR3ZCFGNmKg) | + + +## Environment + +This repo is developed under the following configurations: + + - Hardware: 4 GPUs for training, 1 GPU for testing + - Software: Centos 6.10, ***CUDA>=9.2 Python>=3.6, Paddle>=1.6*** + + +## Quick start: training and testing models + +### 1. Preparing data + +Download the [Cityscapes](https://www.cityscapes-dataset.com/) dataset. It should have this basic structure: + + cityscapes/ + ├── cityscapes_list + │ ├── test.lst + │ ├── train.lst + │ ├── train+.lst + │ ├── train++.lst + │ ├── trainval.lst + │ └── val.lst + ├── gtFine + │ ├── test + │ ├── train + │ └── val + ├── leftImg8bit + │ ├── test + │ ├── train + │ └── val + ├── license.txt + └── README + + Download Pascal-Context dataset. It should have this basic structure: + + pascalContext/ + ├── GroundTruth_trainval_mat + ├── GroundTruth_trainval_png + ├── JPEGImages + ├── pascal_context_train.txt + ├── pascal_context_val.txt + ├── README.md + └── VOCdevkit + + Then, create symlinks for the Cityscapes and Pascal-Context datasets + ``` + cd SemSegPaddle/data + ln -s $cityscapes ./ + ln -s $pascalContext ./ + ``` + +### 2. Download pretrained weights + Downlaod pretrained [resnet-101](https://pan.baidu.com/s/1niXBDZnLlUIulB7FY068DQ) weights file, and put it into the directory: ***./pretrained_model*** + + Then, run the following command: +``` + tar -zxvf ./repretrained/resnet101_v2.tgz -C pretrained_model +``` + +### 3. Training + +select confiure file for training according to the DECODER\_NAME, BACKBONE\_NAME and DATASET\_NAME. +``` +CUDA_VISIBLE_DEVICES=0,1,2,3 python -m paddle.distributed.launch train.py --use_gpu --use_mpio \ + --cfg ./configs/pspnet_res101_cityscapes.yaml +``` + +### 4. Testing +select confiure file for testing according to the DECODER\_NAME, BACKBONE\_NAME and DATASET\_NAME. + +Single-scale testing: +``` +CUDA_VISIBLE_DEVICES=0 python eval.py --use_gpu \ + --use_mpio \ + --cfg ./configs/pspnet_res101_cityscapes.yaml +``` + +Multi-scale testing: +``` +CUDA_VISIBLE_DEVICES=0 python eval.py --use_gpu \ + --use_mpio \ + --multi_scales \ + --cfg ./configs/pspnet_res101_cityscapes.yaml +``` + +## Contact +If you have any questions regarding the repo, please create an issue. 
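+The commands above only pass a YAML file to `--cfg`; internally the entry points merge that file into a global `cfg` object before building the model (see `eval.py`). A minimal sketch, using only the helpers `eval.py` in this patch actually calls (`update_from_file`, `check_and_infer`) and keys taken from `configs/pspnet_res101_cityscapes.yaml`:
+
+```
+from src.utils.config import cfg
+
+# Merge one YAML config into the global configuration, then derive dependent fields.
+cfg.update_from_file('./configs/pspnet_res101_cityscapes.yaml')
+cfg.check_and_infer()
+
+# The merged values drive the dataset and the sliding-window evaluation settings.
+print(cfg.DATASET.NUM_CLASSES)                 # 19 classes for Cityscapes
+print(cfg.TEST.BASE_SIZE, cfg.TEST.CROP_SIZE)  # 2048 and 769 in this config
+```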
diff --git a/PaddleCV/Research/SemSegPaddle/configs/deeplabv3_res101_cityscapes.yaml b/PaddleCV/Research/SemSegPaddle/configs/deeplabv3_res101_cityscapes.yaml new file mode 100644 index 0000000000000000000000000000000000000000..093758b507bf5a8bc963ca20d3e4fff56adf7fdd --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/configs/deeplabv3_res101_cityscapes.yaml @@ -0,0 +1,46 @@ +DATAAUG: + RAND_SCALE_MIN: 0.75 + RAND_SCALE_MAX: 2.0 + BASE_SIZE: 1024 + CROP_SIZE: 769 + EXTRA: True +TRAIN_BATCH_SIZE_PER_GPU: 2 +NUM_TRAINERS: 4 +EVAL_BATCH_SIZE: 1 +DATASET: + DATASET_NAME: "cityscapes" + DATA_DIR: "./data/cityscapes/" + IMAGE_TYPE: "rgb" # choice rgb or rgba + NUM_CLASSES: 19 + TEST_FILE_LIST: "./data/cityscapes/cityscapes_list/test.lst" + TRAIN_FILE_LIST: "./data/cityscapes/cityscapes_list/train.lst" + VAL_FILE_LIST: "./data/cityscapes/cityscapes_list/val.lst" + IGNORE_INDEX: 255 + DATA_DIM: 3 +MODEL: + MODEL_NAME: "deeplabv3" + DEFAULT_NORM_TYPE: "bn" + MULTI_LOSS_WEIGHT: [1.0,0.4] + BACKBONE: "resnet" + BACKBONE_LAYERS: 101 + BACKBONE_MULTI_GRID: True + DEEPLABv3: + DEPTH_MULTIPLIER: 1 + ASPP_WITH_SEP_CONV: True + AuxHead: True +TRAIN: + PRETRAINED_MODEL_DIR: "./pretrained_model/resnet101_v2/" + MODEL_SAVE_DIR: "./snapshots/deeplabv3_resnet_cityscapes/" + SNAPSHOT_EPOCH: 1 +TEST: + TEST_MODEL: "./snapshots/deeplabv3_resnet_cityscapes" + BASE_SIZE: 2048 + CROP_SIZE: 769 + SLIDE_WINDOW: True +SOLVER: + LR: 0.01 + LR_POLICY: "poly" + OPTIMIZER: "sgd" + NUM_EPOCHS: 80 + LOSS: "['softmax_loss']" + diff --git a/PaddleCV/Research/SemSegPaddle/configs/deeplabv3_res101_pascalcontext.yaml b/PaddleCV/Research/SemSegPaddle/configs/deeplabv3_res101_pascalcontext.yaml new file mode 100644 index 0000000000000000000000000000000000000000..fa41bfb02844293390df3ce6c2e271cb5a2e80ee --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/configs/deeplabv3_res101_pascalcontext.yaml @@ -0,0 +1,47 @@ +DATAAUG: + RAND_SCALE_MIN: 0.5 + RAND_SCALE_MAX: 2.0 + BASE_SIZE: 520 + CROP_SIZE: 520 + EXTRA: True +TRAIN_BATCH_SIZE_PER_GPU: 4 +NUM_TRAINERS: 4 +EVAL_BATCH_SIZE: 1 +DATASET: + DATASET_NAME: "pascalContext" + DATA_DIR: "./data/pascalContext/" + IMAGE_TYPE: "rgb" # choice rgb or rgba + NUM_CLASSES: 59 + TEST_FILE_LIST: "./data/pascalContext/pascal_context_val.txt" + TRAIN_FILE_LIST: "./data/pascalContext/pascal_context_train.txt" + VAL_FILE_LIST: "./data/pascalContext/pascal_context_val.txt" + IGNORE_INDEX: -1 + DATA_DIM: 3 + SEPARATOR: ' ' +MODEL: + MODEL_NAME: "deeplabv3" + DEFAULT_NORM_TYPE: "bn" + MULTI_LOSS_WEIGHT: [1.0,0.4] + BACKBONE: "resnet" + BACKBONE_LAYERS: 101 + BACKBONE_MULTI_GRID: True + DEEPLABv3: + DEPTH_MULTIPLIER: 1 + ASPP_WITH_SEP_CONV: True + AuxHead: True +TRAIN: + PRETRAINED_MODEL_DIR: "./pretrained_model/resnet101_v2/" + MODEL_SAVE_DIR: "./snapshots/deeplabv3_resnet_pascalcontext/" + SNAPSHOT_EPOCH: 1 +TEST: + TEST_MODEL: "./snapshots/deeplabv3_resnet_pascalcontext" + BASE_SIZE: 520 + CROP_SIZE: 520 + SLIDE_WINDOW: True +SOLVER: + LR: 0.005 + LR_POLICY: "poly" + OPTIMIZER: "sgd" + NUM_EPOCHS: 80 + LOSS: "['softmax_loss']" + diff --git a/PaddleCV/Research/SemSegPaddle/configs/glore_res101_cityscapes.yaml b/PaddleCV/Research/SemSegPaddle/configs/glore_res101_cityscapes.yaml new file mode 100644 index 0000000000000000000000000000000000000000..fa26415584f1f6391b2562981bbfdbaa06d02354 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/configs/glore_res101_cityscapes.yaml @@ -0,0 +1,45 @@ +DATAAUG: + RAND_SCALE_MIN: 0.5 + RAND_SCALE_MAX: 2.0 + BASE_SIZE: 1024 + CROP_SIZE: 769 + EXTRA: True 
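+# DATAAUG: RAND_SCALE_MIN/RAND_SCALE_MAX bound the random rescaling used for
+# training-time augmentation, and BASE_SIZE/CROP_SIZE set the training resize/crop
+# resolution; the TEST section below keeps its own BASE_SIZE and CROP_SIZE for the
+# sliding-window evaluation in eval.py.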
+TRAIN_BATCH_SIZE_PER_GPU: 2 +NUM_TRAINERS: 4 +EVAL_BATCH_SIZE: 1 +DATASET: + DATASET_NAME: "cityscapes" + DATA_DIR: "./data/cityscapes/" + IMAGE_TYPE: "rgb" # choice rgb or rgba + NUM_CLASSES: 19 + TEST_FILE_LIST: "./data/cityscapes/cityscapes_list/test.lst" + TRAIN_FILE_LIST: "./data/cityscapes/cityscapes_list/train.lst" + VAL_FILE_LIST: "./data/cityscapes/cityscapes_list/val.lst" + IGNORE_INDEX: 255 + DATA_DIM: 3 +MODEL: + MODEL_NAME: "glore" + DEFAULT_NORM_TYPE: "bn" + MULTI_LOSS_WEIGHT: [1.0, 0.4] + BACKBONE: "resnet" + BACKBONE_LAYERS: 101 + BACKBONE_MULTI_GRID: True + GLORE: + DEPTH_MULTIPLIER: 1 + AuxHead: True +TRAIN: + MODEL_SAVE_DIR: "snapshots/glore_res101_cityscapes/" + PRETRAINED_MODEL_DIR: "./pretrained_model/resnet101_v2/" + SNAPSHOT_EPOCH: 1 +TEST: + TEST_MODEL: "snapshots/glore_res101_cityscapes" + BASE_SIZE: 2048 + CROP_SIZE: 769 + SLIDE_WINDOW: True +SOLVER: + LR: 0.01 + LR_POLICY: "poly" + OPTIMIZER: "sgd" + NUM_EPOCHS: 80 + LOSS: "['softmax_loss']" + diff --git a/PaddleCV/Research/SemSegPaddle/configs/glore_res101_pascalcontext.yaml b/PaddleCV/Research/SemSegPaddle/configs/glore_res101_pascalcontext.yaml new file mode 100644 index 0000000000000000000000000000000000000000..9dd26b95aba4d14a33af2751182c9c2ff485416a --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/configs/glore_res101_pascalcontext.yaml @@ -0,0 +1,45 @@ +DATAAUG: + RAND_SCALE_MIN: 0.5 + RAND_SCALE_MAX: 2.0 + BASE_SIZE: 520 + CROP_SIZE: 520 + EXTRA: True +TRAIN_BATCH_SIZE_PER_GPU: 4 +NUM_TRAINERS: 4 +EVAL_BATCH_SIZE: 1 +DATASET: + DATASET_NAME: "pascalContext" + DATA_DIR: "./data/pascalContext/" + IMAGE_TYPE: "rgb" # choice rgb or rgba + NUM_CLASSES: 59 + TEST_FILE_LIST: "./data/pascalContext/pascal_context_val.txt" + TRAIN_FILE_LIST: "./data/pascalContext/pascal_context_train.txt" + VAL_FILE_LIST: "./data/pascalContext/pascal_context_val.txt" + IGNORE_INDEX: -1 + DATA_DIM: 3 + SEPARATOR: ' ' +MODEL: + MODEL_NAME: "glore" + DEFAULT_NORM_TYPE: "bn" + MULTI_LOSS_WEIGHT: [1.0,0.4] + BACKBONE: "resnet" + BACKBONE_LAYERS: 101 + BACKBONE_MULTI_GRID: True + GLORE: + DEPTH_MULTIPLIER: 1 + AuxHead: True +TEST: + TEST_MODEL: "snapshots/glore_res101_pascalContext" + BASE_SIZE: 520 + CROP_SIZE: 520 + SLIDE_WINDOW: True +TRAIN: + MODEL_SAVE_DIR: "snapshots/glore_res101_pascalContext/" + PRETRAINED_MODEL_DIR: "./pretrained_model/resnet101_v2/" + SNAPSHOT_EPOCH: 1 +SOLVER: + LR: 0.005 + LR_POLICY: "poly" + OPTIMIZER: "sgd" + NUM_EPOCHS: 80 + LOSS: "['softmax_loss']" diff --git a/PaddleCV/Research/SemSegPaddle/configs/pspnet_hrnet_cityscapes.yaml b/PaddleCV/Research/SemSegPaddle/configs/pspnet_hrnet_cityscapes.yaml new file mode 100644 index 0000000000000000000000000000000000000000..8aa5a38c869811fac76cc3e0e994d3140ea0b012 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/configs/pspnet_hrnet_cityscapes.yaml @@ -0,0 +1,43 @@ +DATAAUG: + RAND_SCALE_MIN: 0.75 + RAND_SCALE_MAX: 2.0 + BASE_SIZE: 2048 + CROP_SIZE: 769 + SLIDE_WINDOW: True +TRAIN_BATCH_SIZE_PER_GPU: 2 +EVAL_BATCH_SIZE: 1 +NUM_TRAINERS: 4 +DATASET: + DATASET_NAME: "cityscapes" + DATA_DIR: "./data/cityscapes/" + IMAGE_TYPE: "rgb" # choice rgb or rgba + NUM_CLASSES: 19 + TEST_FILE_LIST: "./data/cityscapes/cityscapes_list/test.lst" + TRAIN_FILE_LIST: "./data/cityscapes/cityscapes_list/train.lst" + VAL_FILE_LIST: "./data/cityscapes/cityscapes_list/val.lst" + IGNORE_INDEX: 255 + DATA_DIM: 3 +MODEL: + MODEL_NAME: "pspnet" + DEFAULT_NORM_TYPE: "bn" + MULTI_LOSS_WEIGHT: [1.0,] + BACKBONE: "hrnet" + PSPNET: + DEPTH_MULTIPLIER: 1 + AuxHead: False +TRAIN: + 
MODEL_SAVE_DIR: "snapshots/pspnet_hrnet_cityscapes/" + PRETRAINED_MODEL_DIR: "./pretrained_model/HRNet_W40_C_pretrained/" + SNAPSHOT_EPOCH: 1 +TEST: + TEST_MODEL: "snapshots/pspnet_hrnet_cityscapes" + BASE_SIZE: 2048 + CROP_SIZE: 769 + SLIDE_WINDOW: True +SOLVER: + LR: 0.001 + LR_POLICY: "poly" + OPTIMIZER: "sgd" + NUM_EPOCHS: 240 + LOSS: "['softmax_loss']" + diff --git a/PaddleCV/Research/SemSegPaddle/configs/pspnet_res101_ade.yaml b/PaddleCV/Research/SemSegPaddle/configs/pspnet_res101_ade.yaml new file mode 100644 index 0000000000000000000000000000000000000000..02423c5bbd99b5bc5188b9299088fb5b05a7294f --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/configs/pspnet_res101_ade.yaml @@ -0,0 +1,44 @@ +DATAAUG: + RAND_SCALE_MIN: 0.5 + RAND_SCALE_MAX: 2.0 + BASE_SIZE: 520 + CROP_SIZE: 520 + EXTRA: True +TRAIN_BATCH_SIZE_PER_GPU: 2 +NUM_TRAINERS: 4 +EVAL_BATCH_SIZE: 1 +DATASET: + DATASET_NAME: "ade" + DATA_DIR: "./data/ade/" + IMAGE_TYPE: "rgb" # choice rgb or rgba + NUM_CLASSES: 150 + TEST_FILE_LIST: "./data/ade/ade_val.lst" + TRAIN_FILE_LIST: "./data/ade/ade_train.lst" + VAL_FILE_LIST: "./data/ade/ade_val.lst" + IGNORE_INDEX: -1 + DATA_DIM: 3 +MODEL: + MODEL_NAME: "pspnet" + DEFAULT_NORM_TYPE: "bn" + MULTI_LOSS_WEIGHT: [1.0, 0.4] + BACKBONE: "resnet" + BACKBONE_LAYERS: 101 + BACKBONE_MULTI_GRID: True + PSPNET: + DEPTH_MULTIPLIER: 1 + AuxHead: True +TEST: + TEST_MODEL: "snapshots/pspnet_res101_ade/" + BASE_SIZE: 520 + CROP_SIZE: 520 + SLIDE_WINDOW: True +TRAIN: + MODEL_SAVE_DIR: "snapshots/pspnet_res101_ade/" + PRETRAINED_MODEL_DIR: "./pretrained_model/resnet101_v2/" + SNAPSHOT_EPOCH: 10 +SOLVER: + LR: 0.01 + LR_POLICY: "poly" + OPTIMIZER: "sgd" + NUM_EPOCHS: 120 + LOSS: "['softmax_loss']" diff --git a/PaddleCV/Research/SemSegPaddle/configs/pspnet_res101_cityscapes.yaml b/PaddleCV/Research/SemSegPaddle/configs/pspnet_res101_cityscapes.yaml new file mode 100644 index 0000000000000000000000000000000000000000..a759677e92398993a8f052de6f20bd5fc65e8984 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/configs/pspnet_res101_cityscapes.yaml @@ -0,0 +1,45 @@ +DATAAUG: + RAND_SCALE_MIN: 0.5 + RAND_SCALE_MAX: 2.0 + BASE_SIZE: 1024 + CROP_SIZE: 769 + EXTRA: True +TRAIN_BATCH_SIZE_PER_GPU: 2 +NUM_TRAINERS: 4 +EVAL_BATCH_SIZE: 1 +DATASET: + DATASET_NAME: "cityscapes" + DATA_DIR: "./data/cityscapes/" + IMAGE_TYPE: "rgb" # choice rgb or rgba + NUM_CLASSES: 19 + TEST_FILE_LIST: "./data/cityscapes/cityscapes_list/test.lst" + TRAIN_FILE_LIST: "./data/cityscapes/cityscapes_list/train.lst" + VAL_FILE_LIST: "./data/cityscapes/cityscapes_list/val.lst" + IGNORE_INDEX: 255 + DATA_DIM: 3 +MODEL: + MODEL_NAME: "pspnet" + DEFAULT_NORM_TYPE: "bn" + MULTI_LOSS_WEIGHT: [1.0, 0.4] + BACKBONE: "resnet" + BACKBONE_LAYERS: 101 + BACKBONE_MULTI_GRID: True + PSPNET: + DEPTH_MULTIPLIER: 1 + AuxHead: True +TRAIN: + MODEL_SAVE_DIR: "snapshots/pspnet_res101_cityscapes/" + PRETRAINED_MODEL_DIR: "./pretrained_model/resnet101_v2/" + SNAPSHOT_EPOCH: 1 +TEST: + TEST_MODEL: "snapshots/pspnet_res101_cityscapes" + BASE_SIZE: 2048 + CROP_SIZE: 769 + SLIDE_WINDOW: True +SOLVER: + LR: 0.01 + LR_POLICY: "poly" + OPTIMIZER: "sgd" + NUM_EPOCHS: 80 + LOSS: "['softmax_loss']" + diff --git a/PaddleCV/Research/SemSegPaddle/configs/pspnet_res101_pascalcontext.yaml b/PaddleCV/Research/SemSegPaddle/configs/pspnet_res101_pascalcontext.yaml new file mode 100644 index 0000000000000000000000000000000000000000..111a6768b78ef0459a5e15b9c40526f9499915f9 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/configs/pspnet_res101_pascalcontext.yaml @@ -0,0 +1,45 
@@ +DATAAUG: + RAND_SCALE_MIN: 0.5 + RAND_SCALE_MAX: 2.0 + BASE_SIZE: 520 + CROP_SIZE: 520 + EXTRA: True +TRAIN_BATCH_SIZE_PER_GPU: 4 +NUM_TRAINERS: 4 +EVAL_BATCH_SIZE: 1 +DATASET: + DATASET_NAME: "pascalContext" + DATA_DIR: "./data/pascalContext/" + IMAGE_TYPE: "rgb" # choice rgb or rgba + NUM_CLASSES: 59 + TEST_FILE_LIST: "./data/pascalContext/pascal_context_val.txt" + TRAIN_FILE_LIST: "./data/pascalContext/pascal_context_train.txt" + VAL_FILE_LIST: "./data/pascalContext/pascal_context_val.txt" + IGNORE_INDEX: -1 + DATA_DIM: 3 + SEPARATOR: ' ' +MODEL: + MODEL_NAME: "pspnet" + DEFAULT_NORM_TYPE: "bn" + MULTI_LOSS_WEIGHT: [1.0,0.4] + BACKBONE: "resnet" + BACKBONE_LAYERS: 101 + BACKBONE_MULTI_GRID: True + PSPNET: + DEPTH_MULTIPLIER: 1 + AuxHead: True +TEST: + TEST_MODEL: "snapshots/pspnet_res101_pascalContext" + BASE_SIZE: 520 + CROP_SIZE: 520 + SLIDE_WINDOW: True +TRAIN: + MODEL_SAVE_DIR: "snapshots/pspnet_res101_pascalContext/" + PRETRAINED_MODEL_DIR: "./pretrained_model/resnet101_v2/" + SNAPSHOT_EPOCH: 1 +SOLVER: + LR: 0.005 + LR_POLICY: "poly" + OPTIMIZER: "sgd" + NUM_EPOCHS: 80 + LOSS: "['softmax_loss']" diff --git a/PaddleCV/Research/SemSegPaddle/data/note.txt b/PaddleCV/Research/SemSegPaddle/data/note.txt new file mode 100644 index 0000000000000000000000000000000000000000..08a033d979d1aee4253a7b48b6aff911616f38f4 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/data/note.txt @@ -0,0 +1 @@ +please create symlinks for datasets diff --git a/PaddleCV/Research/SemSegPaddle/eval.py b/PaddleCV/Research/SemSegPaddle/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..4195be40e939ecf4bc67adf2c393fde86a01f15c --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/eval.py @@ -0,0 +1,311 @@ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import os +# GPU memory garbage collection optimization flags +os.environ['FLAGS_eager_delete_tensor_gb'] = "0.0" +import sys +import time +import argparse +import functools +import pprint +import cv2 +import numpy as np +import paddle +import paddle.fluid as fluid +import math + +from src.utils.config import cfg +from src.utils.timer import Timer, calculate_eta +from src.models.model_builder import build_model +from src.models.model_builder import ModelPhase +from src.datasets import build_dataset +from src.utils.metrics import ConfusionMatrix + + +def parse_args(): + parser = argparse.ArgumentParser(description='SemsegPaddle') + parser.add_argument( + '--cfg', + dest='cfg_file', + help='Config file for training (and optionally testing)', + default=None, + type=str) + parser.add_argument( + '--use_gpu', + dest='use_gpu', + help='Use gpu or cpu', + action='store_true', + default=False) + parser.add_argument( + '--use_mpio', + dest='use_mpio', + help='Use multiprocess IO or not', + action='store_true', + default=False) + parser.add_argument( + 'opts', + help='See utils/config.py for all options', + default=None, + nargs=argparse.REMAINDER) + parser.add_argument( + '--multi_scales', + dest='multi_scales', + help='Use multi_scales for eval', + action='store_true', + default=False) + parser.add_argument( + '--flip', + dest='flip', + help='flip the image or not', + action='store_true', + default=False) + if len(sys.argv) == 1: + parser.print_help() + sys.exit(1) + return parser.parse_args() + + +def evaluate(cfg, ckpt_dir=None, use_gpu=False, use_mpio=False, multi_scales=False, flip=False, **kwargs): + np.set_printoptions(precision=5, suppress=True) + + num_classes = 
cfg.DATASET.NUM_CLASSES + base_size = cfg.TEST.BASE_SIZE + crop_size = cfg.TEST.CROP_SIZE + startup_prog = fluid.Program() + test_prog = fluid.Program() + dataset = build_dataset(cfg.DATASET.DATASET_NAME, + file_list=cfg.DATASET.VAL_FILE_LIST, + mode=ModelPhase.EVAL, + data_dir=cfg.DATASET.DATA_DIR) + + def data_generator(): + #TODO: check is batch reader compatitable with Windows + if use_mpio: + data_gen = dataset.multiprocess_generator( + num_processes=cfg.DATALOADER.NUM_WORKERS, + max_queue_size=cfg.DATALOADER.BUF_SIZE) + else: + data_gen = dataset.generator() + + for b in data_gen: + yield b[0], b[1], b[2] + + py_reader, avg_loss, out, grts, masks = build_model( + test_prog, startup_prog, phase=ModelPhase.EVAL) + + py_reader.decorate_sample_generator( + data_generator, drop_last=False, batch_size=cfg.EVAL_BATCH_SIZE, places=fluid.cuda_places()) + + # Get device environment + places = fluid.cuda_places() if use_gpu else fluid.cpu_places() + place = places[0] + dev_count = len(places) + print("#Device count: {}".format(dev_count)) + + exe = fluid.Executor(place) + exe.run(startup_prog) + + test_prog = test_prog.clone(for_test=True) + + ckpt_dir = cfg.TEST.TEST_MODEL if not ckpt_dir else ckpt_dir + + if ckpt_dir is not None: + filename= '{}_{}_{}_epoch_{}.pdparams'.format(str(cfg.MODEL.MODEL_NAME), + str(cfg.MODEL.BACKBONE), str(cfg.DATASET.DATASET_NAME), cfg.SOLVER.NUM_EPOCHS) + print("loading testing model file: {}/{}".format(ckpt_dir, filename)) + fluid.io.load_params(exe, ckpt_dir, main_program=test_prog, filename=filename) + + # Use streaming confusion matrix to calculate mean_iou + np.set_printoptions( + precision=4, suppress=True, linewidth=160, floatmode="fixed") + conf_mat = ConfusionMatrix(cfg.DATASET.NUM_CLASSES, streaming=True) + + #fetch_list: return of the model + fetch_list = [avg_loss.name, out.name] + num_images = 0 + step = 0 + all_step = cfg.DATASET.VAL_TOTAL_IMAGES // cfg.EVAL_BATCH_SIZE + timer = Timer() + timer.start() + for data in py_reader(): + mask = np.array(data[0]['mask']) + label = np.array(data[0]['label']) + image_org = np.array(data[0]['image']) + image = np.transpose(image_org, (0, 2, 3, 1)) # BCHW->BHWC + image = np.squeeze(image) + + if cfg.TEST.SLIDE_WINDOW: + if not multi_scales: + scales = [1.0] + else: + scales = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.25] if cfg.DATASET.DATASET_NAME == 'cityscapes' else [0.5, 0.75, 1.0, 1.25, 1.5, 1.75] + #scales = [0.75, 1.0, 1.25] # fast multi-scale testing + + #strides + stride = int(crop_size *1.0 / 3) # 1/3 > 2/3 > 1/2 for input_size: 769 x 769 + h, w = image.shape[0:2] + scores = np.zeros(shape=[num_classes, h, w], dtype='float32') + + for scale in scales: + long_size = int(math.ceil(base_size * scale)) + if h > w: + height = long_size + width = int(1.0 * w * long_size / h + 0.5) + short_size = width + else: + width = long_size + height = int(1.0 * h * long_size / w + 0.5) + short_size = height + # print('org_img_size: {}x{}, rescale_img_size: {}x{}'.format(h, w, height, width)) + cur_img = image_resize(image, height, width) + # pading + if long_size <= crop_size: + pad_img = pad_single_image(cur_img, crop_size) + label_feed, mask_feed = get_feed(pad_img) + pad_img = mapper_image(pad_img) + loss, pred1 = exe.run( + test_prog, + feed={'image':pad_img, 'label':label_feed, 'mask':mask_feed}, + fetch_list = fetch_list, + return_numpy=True) + pred1 = np.array(pred1) + outputs = pred1[:, :, :height, :width] + if flip: + pad_img_flip = flip_left_right_image(cur_img) + pad_img_flip = pad_single_image(pad_img_flip, 
crop_size) + label_feed, mask_feed = get_feed(pad_img_flip) + + pad_img_flip = mapper_image(pad_img_flip) + loss, pred1 = exe.run( + test_prog, + feed={'image':pad_img_flip, 'label':label_feed, 'mask':mask_feed}, + fetch_list = fetch_list, + return_numpy=True) + pred1 = np.flip(pred1, 3) + outputs += pred1[:, :, :height, :width] + else: + if short_size < crop_size: + pad_img = pad_single_image(cur_img, crop_size) + else: + pad_img = cur_img + ph, pw = pad_img.shape[0:2] + + #slid window + h_grids = int(math.ceil(1.0 * (ph - crop_size) / stride)) + 1 + w_grids = int(math.ceil(1.0 * (pw - crop_size) / stride)) + 1 + outputs = np.zeros(shape=[1, num_classes, ph, pw], dtype='float32') + count_norm = np.zeros(shape=[1, 1, ph, pw], dtype='int32') + for idh in range(h_grids): + for idw in range(w_grids): + h0 = idh * stride + w0 = idw * stride + h1 = min(h0 + crop_size, ph) + w1 = min(w0 + crop_size, pw) + #print('(h0,w0,h1,w1):({},{},{},{})'.format(h0, w0, h1, w1)) + crop_img = crop_image(pad_img, h0, w0, h1, w1) + pad_crop_img = pad_single_image(crop_img, crop_size) + label_feed, mask_feed = get_feed(pad_crop_img) + pad_crop_img = mapper_image(pad_crop_img) + loss, pred1 = exe.run( + test_prog, + feed={'image':pad_crop_img, 'label':label_feed, 'mask':mask_feed}, + fetch_list = fetch_list, + return_numpy=True) + pred1 = np.array(pred1) + outputs[:, :, h0:h1, w0:w1] += pred1[:, :, 0:h1-h0, 0:w1-w0] + count_norm[:, :, h0:h1, w0:w1] += 1 + if flip: + pad_img_flip = flip_left_right_image(crop_img) + pad_img_flip = pad_single_image(pad_img_flip, crop_size) + label_feed, mask_feed = get_feed(pad_img_flip) + pad_img_flip = mapper_image(pad_img_flip) + loss, pred1 = exe.run( + test_prog, + feed={'image':pad_img_flip, 'label':label_feed, 'mask':mask_feed}, + fetch_list = fetch_list, + return_numpy = True) + pred1 = np.flip(pred1, 3) + outputs[:, :, h0:h1, w0:w1] += pred1[:, :, 0:h1-h0, 0:w1-w0] + count_norm[:, :, h0:h1, w0:w1] += 1 + + outputs = 1.0 * outputs / count_norm + outputs = outputs[:, :, :height, :width] + with fluid.dygraph.guard(): + outputs = fluid.dygraph.to_variable(outputs) + outputs = fluid.layers.resize_bilinear(outputs, out_shape=[h, w]) + score = outputs.numpy()[0] + scores += score + else: + # taking the original image as the model input + loss, pred = exe.run( + test_prog, + feed={'image':image_org, 'label':label, 'mask':mask}, + fetch_list = fetch_list, + return_numpy = True) + scores = pred[0] + # computing IoU with all scale result + pred = np.argmax(scores, axis=0).astype('int64') + pred = pred[np.newaxis, :, :, np.newaxis] + step += 1 + num_images += pred.shape[0] + conf_mat.calculate(pred, label, mask) + _, iou = conf_mat.mean_iou() + _, acc = conf_mat.accuracy() + + print("[EVAL] step={}/{} acc={:.4f} IoU={:.4f}".format(step, all_step, acc, iou)) + + category_iou, avg_iou = conf_mat.mean_iou() + category_acc, avg_acc = conf_mat.accuracy() + print("[EVAL] #image={} acc={:.4f} IoU={:.4f}".format(num_images, avg_acc, avg_iou)) + print("[EVAL] Category IoU:", category_iou) + print("[EVAL] Category Acc:", category_acc) + print("[EVAL] Kappa:{:.4f}".format(conf_mat.kappa())) + print("flip = ", flip) + print("scales = ", scales) + + return category_iou, avg_iou, category_acc, avg_acc + +def image_resize(image, height, width): + if image.shape[0] == 3: + image = np.transpose(image, (1, 2, 0)) + image = cv2.resize(image, (width, height), interpolation=cv2.INTER_LINEAR) + return image + +def pad_single_image(image, crop_size): + h, w = image.shape[0:2] + pad_h = crop_size - h if h < 
crop_size else 0 + pad_w = crop_size - w if w < crop_size else 0 + image = cv2.copyMakeBorder(image, 0, pad_h, 0, pad_w, cv2.BORDER_CONSTANT,value=0) + return image + +def mapper_image(image): + # HxWx3 -> 3xHxW -> 1x3xHxW + image_array = np.transpose(image, (2, 0, 1)) + image_array = image_array.astype('float32') + image_array = image_array[np.newaxis, :] + return image_array + +def flip_left_right_image(image): + return cv2.flip(image, 1) + +def get_feed(image): + h, w = image.shape[0:2] + return np.zeros([1, 1, h, w], dtype='int32'), np.zeros([1, 1, h, w], dtype='int32') + +def crop_image(image, h0, w0, h1, w1): + return image[h0:h1, w0:w1, :] + +def main(): + args = parse_args() + if args.cfg_file is not None: + cfg.update_from_file(args.cfg_file) + if args.opts: + cfg.update_from_list(args.opts) + cfg.check_and_infer() + print(pprint.pformat(cfg)) + evaluate(cfg, **args.__dict__) + + +if __name__ == '__main__': + main() diff --git a/PaddleCV/Research/SemSegPaddle/expes/deeplabv3_res101_cityscapes.sh b/PaddleCV/Research/SemSegPaddle/expes/deeplabv3_res101_cityscapes.sh new file mode 100755 index 0000000000000000000000000000000000000000..98e44a0892d2e4be7c9505864d7b459d8a20bbe6 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/expes/deeplabv3_res101_cityscapes.sh @@ -0,0 +1,17 @@ +#!/bin/bash + + +# Deeplabv3_Res101_Cityscapes +# 1.1 Training +CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --use_gpu \ + --use_mpio \ + --cfg ./configs/deeplabv3_res101_cityscapes.yaml | tee -a train.log 2>&1 +# 1.2 single-scale testing +CUDA_VISIBLE_DEVICES=0 python eval.py --use_gpu \ + --cfg ./configs/deeplabv3_res101_cityscapes.yaml +# 1.3 multi-scale testing +CUDA_VISIBLE_DEVICES=0 python eval.py --use_gpu \ + --multi_scales \ + --cfg ./configs/deeplabv3_res101_cityscapes.yaml + + diff --git a/PaddleCV/Research/SemSegPaddle/expes/deeplabv3_res101_pascalcontext.sh b/PaddleCV/Research/SemSegPaddle/expes/deeplabv3_res101_pascalcontext.sh new file mode 100755 index 0000000000000000000000000000000000000000..a356a18d6483bf613975e3808c3e63a3db25d543 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/expes/deeplabv3_res101_pascalcontext.sh @@ -0,0 +1,19 @@ +#!/bin/bash + + +# Deeplabv3_Res101_PascalContext +# 1.1 Training +CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --use_gpu \ + --use_mpio \ + --cfg ./configs/deeplabv3_res101_pascalcontext.yaml | tee -a train.log 2>&1 +# 1.2 single-scale testing +CUDA_VISIBLE_DEVICES=0 python eval.py --use_gpu \ + --cfg ./configs/deeplabv3_res101_pascalcontext.yaml +# 1.3 multi-scale testing +CUDA_VISIBLE_DEVICES=0 python eval.py --use_gpu \ + --multi_scales \ + --cfg ./configs/deeplabv3_res101_pascalcontext.yaml + + + + diff --git a/PaddleCV/Research/SemSegPaddle/expes/glore_res101_cityscapes.sh b/PaddleCV/Research/SemSegPaddle/expes/glore_res101_cityscapes.sh new file mode 100755 index 0000000000000000000000000000000000000000..075b0b974da19641b32c8410bcda9c897dabcab0 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/expes/glore_res101_cityscapes.sh @@ -0,0 +1,16 @@ +#!/bin/bash + + +# GloRe_Res101_Cityscapes +# 1.1 Training +CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --use_gpu \ + --use_mpio \ + --cfg ./configs/glore_res101_cityscapes.yaml | tee -a train.log 2>&1 +# 1.2 single-scale testing +CUDA_VISIBLE_DEVICES=0 python eval.py --use_gpu \ + --cfg ./configs/glore_res101_cityscapes.yaml +# 1.3 multi-scale testing +CUDA_VISIBLE_DEVICES=0 python eval.py --use_gpu \ + --multi_scales \ + --cfg ./configs/glore_res101_cityscapes.yaml + diff --git 
a/PaddleCV/Research/SemSegPaddle/expes/glore_res101_pascalcontext.sh b/PaddleCV/Research/SemSegPaddle/expes/glore_res101_pascalcontext.sh new file mode 100755 index 0000000000000000000000000000000000000000..452df25133713ac1c864f06dbbe9f21c3e162781 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/expes/glore_res101_pascalcontext.sh @@ -0,0 +1,20 @@ +#!/bin/bash + + +# GloRe_Res101_PascalContext +:<&1 +! +# 1.2 single-scale testing +CUDA_VISIBLE_DEVICES=0 python eval.py --use_gpu \ + --cfg ./configs/glore_res101_pascalcontext.yaml +:<&1 +# 1.2 single-scale testing +CUDA_VISIBLE_DEVICES=0 python eval.py --use_gpu \ + --cfg ./configs/pspnet_res101_cityscapes.yaml +# 1.3 multi-scale testing +CUDA_VISIBLE_DEVICES=0 python eval.py --use_gpu \ + --multi_scales \ + --cfg ./configs/pspnet_res101_cityscapes.yaml diff --git a/PaddleCV/Research/SemSegPaddle/expes/pspnet_res101_pascalcontext.sh b/PaddleCV/Research/SemSegPaddle/expes/pspnet_res101_pascalcontext.sh new file mode 100755 index 0000000000000000000000000000000000000000..1959e6e01a0c59c45f30b9e0bb1ec0dbf7c4dde5 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/expes/pspnet_res101_pascalcontext.sh @@ -0,0 +1,14 @@ +#!/bin/bash + + +#PSPNet_Res101_PascalContext +# 1.1 training +CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py --use_gpu \ + --cfg ./configs/pspnet_res101_pascalcontext.yaml | tee -a train.log 2>&1 +# 1.2 single-scale testing +CUDA_VISIBLE_DEVICES=0 python eval.py --use_gpu \ + --cfg ./configs/pspnet_res101_pascalcontext.yaml +# 1.3 multi-scale testing +CUDA_VISIBLE_DEVICES=0 python eval.py --use_gpu \ + --multi_scales \ + --cfg ./configs/pspnet_res101_pascalcontext.yaml diff --git a/PaddleCV/Research/SemSegPaddle/pretrained_model/note.txt b/PaddleCV/Research/SemSegPaddle/pretrained_model/note.txt new file mode 100644 index 0000000000000000000000000000000000000000..6580b2565e4d36e172b75abf55e41752620a80ee --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/pretrained_model/note.txt @@ -0,0 +1 @@ +please put the pretrained weights of backbone here diff --git a/PaddleCV/Research/SemSegPaddle/snapshots/note.txt b/PaddleCV/Research/SemSegPaddle/snapshots/note.txt new file mode 100644 index 0000000000000000000000000000000000000000..9fe80f1163d114bb2bfc9c8033de12576b7a7c3e --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/snapshots/note.txt @@ -0,0 +1 @@ +please put the trained model here diff --git a/PaddleCV/Research/SemSegPaddle/src/__init__.py b/PaddleCV/Research/SemSegPaddle/src/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..a65af8351df2131361501fc0dce51af3b3252313 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/__init__.py @@ -0,0 +1 @@ +from . import datasets, models, utils diff --git a/PaddleCV/Research/SemSegPaddle/src/datasets/__init__.py b/PaddleCV/Research/SemSegPaddle/src/datasets/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..7c1c1a7255c4c7fbb58906bf96fd8061b14abc34 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/datasets/__init__.py @@ -0,0 +1,27 @@ +# coding: utf8 +# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from .cityscapes import CityscapesSeg +from .pascal_context import PascalContextSeg +from .ade import AdeSeg +datasets ={ + 'cityscapes': CityscapesSeg, + 'pascalcontext': PascalContextSeg, + 'adechallengedata2016': AdeSeg, +} +def build_dataset(name, **kwargs): + return datasets[name.lower()](**kwargs) + + diff --git a/PaddleCV/Research/SemSegPaddle/src/datasets/ade.py b/PaddleCV/Research/SemSegPaddle/src/datasets/ade.py new file mode 100644 index 0000000000000000000000000000000000000000..6b63d8e297ff1ced2705c7b484cff9d3f2b61a35 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/datasets/ade.py @@ -0,0 +1,105 @@ +# coding: utf8 +# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function +import sys +import os +import math +import random +import functools +import io +import time +import codecs +import numpy as np +import paddle +import paddle.fluid as fluid +import cv2 +from PIL import Image +import copy + +from src.utils.config import cfg +from src.models.model_builder import ModelPhase +from .baseseg import BaseSeg + + +class AdeSeg(BaseSeg): + def __init__(self, + file_list, + data_dir, + shuffle=False, + mode=ModelPhase.TRAIN, base_size=520, crop_size=520, rand_scale=True): + super(AdeSeg, self).__init__(file_list, data_dir, shuffle, mode, base_size, crop_size, rand_scale) + + def _mask_transform(self, mask): + target = np.array(mask).astype('int32') - 1 + return target + + + + def load_image(self, line, src_dir, mode=ModelPhase.TRAIN): + # original image cv2.imread flag setting + cv2_imread_flag = cv2.IMREAD_COLOR + if cfg.DATASET.IMAGE_TYPE == "rgba": + # If use RBGA 4 channel ImageType, use IMREAD_UNCHANGED flags to + # reserver alpha channel + cv2_imread_flag = cv2.IMREAD_UNCHANGED + #print("line: ", line) + parts = line.strip().split(cfg.DATASET.SEPARATOR) + if len(parts) != 2: + if mode == ModelPhase.TRAIN or mode == ModelPhase.EVAL: + raise Exception("File list format incorrect! 
It should be" + " image_name{}label_name\\n".format( + cfg.DATASET.SEPARATOR)) + img_name, grt_name = parts[0], None + else: + img_name, grt_name = parts[0], parts[1] + + img_path = os.path.join(src_dir, img_name) + img = self.cv2_imread(img_path, cv2_imread_flag) + + if grt_name is not None: + grt_path = os.path.join(src_dir, grt_name) + grt = self.pil_imread(grt_path) + else: + grt = None + + if img is None: + raise Exception( + "Empty image, src_dir: {}, img: {} & lab: {}".format( + src_dir, img_path, grt_path)) + + img_height = img.shape[0] + img_width = img.shape[1] + #print('img.shape',img.shape) + if grt is not None: + grt_height = grt.shape[0] + grt_width = grt.shape[1] + + if img_height != grt_height or img_width != grt_width: + raise Exception( + "source img and label img must has the same size") + else: + if mode == ModelPhase.TRAIN or mode == ModelPhase.EVAL: + raise Exception( + "Empty image, src_dir: {}, img: {} & lab: {}".format( + src_dir, img_path, grt_path)) + + if len(img.shape) < 3: + img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR) + + grt = self._mask_transform(grt) + + return img, grt, img_name, grt_name + diff --git a/PaddleCV/Research/SemSegPaddle/src/datasets/baseseg.py b/PaddleCV/Research/SemSegPaddle/src/datasets/baseseg.py new file mode 100644 index 0000000000000000000000000000000000000000..5433342c8c0d54b2b45bd2fd9a4e01e0b135d0b4 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/datasets/baseseg.py @@ -0,0 +1,239 @@ +from __future__ import print_function +import sys +import os +import math +import random +import functools +import io +import time +import codecs +import numpy as np +import paddle +import paddle.fluid as fluid +import cv2 +import copy +from PIL import Image, ImageOps, ImageFilter, ImageEnhance + +from src.models.model_builder import ModelPhase +from src.utils.config import cfg +from .data_utils import GeneratorEnqueuer + + +class BaseSeg(object): + def __init__(self, file_list, data_dir, shuffle=False, mode=ModelPhase.TRAIN, base_size=1024, crop_size=769, rand_scale=True): + self.mode = mode + self.shuffle = shuffle + self.data_dir = data_dir + self.shuffle_seed = 0 + + self.crop_size = crop_size + self.base_size = base_size # short edge when training + self.rand_scale = rand_scale + + # NOTE: Please ensure file list was save in UTF-8 coding format + with codecs.open(file_list, 'r', 'utf-8') as flist: + self.lines = [line.strip() for line in flist] + self.all_lines = copy.deepcopy(self.lines) + if shuffle and cfg.NUM_TRAINERS > 1: + np.random.RandomState(self.shuffle_seed).shuffle(self.all_lines) + elif shuffle: + np.random.shuffle(self.lines) + self.num_trainers= cfg.NUM_TRAINERS + self.trainer_id=cfg.TRAINER_ID + + def generator(self): + if self.shuffle and cfg.NUM_TRAINERS > 1: + np.random.RandomState(self.shuffle_seed).shuffle(self.all_lines) + num_lines = len(self.all_lines) // cfg.NUM_TRAINERS + self.lines = self.all_lines[num_lines * cfg.TRAINER_ID: num_lines * (cfg.TRAINER_ID + 1)] + self.shuffle_seed += 1 + elif self.shuffle: + np.random.shuffle(self.lines) + + for line in self.lines: + yield self.process_image(line, self.data_dir, self.mode) + + def sharding_generator(self, pid=0, num_processes=1): + """ + Use line id as shard key for multiprocess io + It's a normal generator if pid=0, num_processes=1 + """ + for index, line in enumerate(self.lines): + # Use index and pid to shard file list + if index % num_processes == pid: + yield self.process_image(line, self.data_dir, self.mode) + + def batch_reader(self, batch_size): + br = 
self.batch(self.reader, batch_size) + for batch in br: + yield batch[0], batch[1], batch[2] + + def multiprocess_generator(self, max_queue_size=32, num_processes=8): + # Re-shuffle file list + if self.shuffle and cfg.NUM_TRAINERS > 1: + np.random.RandomState(self.shuffle_seed).shuffle(self.all_lines) + num_lines = len(self.all_lines) // self.num_trainers + self.lines = self.all_lines[num_lines * self.trainer_id: num_lines * (self.trainer_id + 1)] + self.shuffle_seed += 1 + elif self.shuffle: + np.random.shuffle(self.lines) + + # Create multiple sharding generators according to num_processes for multiple processes + generators = [] + for pid in range(num_processes): + generators.append(self.sharding_generator(pid, num_processes)) + + try: + enqueuer = GeneratorEnqueuer(generators) + enqueuer.start(max_queue_size=max_queue_size, workers=num_processes) + while True: + generator_out = None + while enqueuer.is_running(): + if not enqueuer.queue.empty(): + generator_out = enqueuer.queue.get(timeout=5) + break + else: + time.sleep(0.01) + if generator_out is None: + break + yield generator_out + finally: + if enqueuer is not None: + enqueuer.stop() + + def batch(self, reader, batch_size, is_test=False, drop_last=False): + def batch_reader(is_test=False, drop_last=drop_last): + if is_test: + imgs, grts, img_names, valid_shapes, org_shapes = [], [], [], [], [] + for img, grt, img_name, valid_shape, org_shape in reader(): + imgs.append(img) + grts.append(grt) + img_names.append(img_name) + valid_shapes.append(valid_shape) + org_shapes.append(org_shape) + if len(imgs) == batch_size: + yield np.array(imgs), np.array( + grts), img_names, np.array(valid_shapes), np.array( + org_shapes) + imgs, grts, img_names, valid_shapes, org_shapes = [], [], [], [], [] + + if not drop_last and len(imgs) > 0: + yield np.array(imgs), np.array(grts), img_names, np.array( + valid_shapes), np.array(org_shapes) + else: + imgs, labs, ignore = [], [], [] + bs = 0 + for img, lab, ig in reader(): + imgs.append(img) + labs.append(lab) + ignore.append(ig) + bs += 1 + if bs == batch_size: + yield np.array(imgs), np.array(labs), np.array(ignore) + bs = 0 + imgs, labs, ignore = [], [], [] + + if not drop_last and bs > 0: + yield np.array(imgs), np.array(labs), np.array(ignore) + + return batch_reader(is_test, drop_last) + + def load_image(self, line, src_dir, mode=ModelPhase.TRAIN): + raise NotImplemented + + def pil_imread(self, file_path): + """read pseudo-color label""" + im = Image.open(file_path) + return np.asarray(im) + + def cv2_imread(self, file_path, flag=cv2.IMREAD_COLOR): + # resolve cv2.imread open Chinese file path issues on Windows Platform. 
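+        # cv2.imread cannot open paths that contain non-ASCII (e.g. Chinese)
+        # characters on Windows; reading the raw bytes with np.fromfile and
+        # decoding them with cv2.imdecode avoids the path-encoding problem.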
+ return cv2.imdecode(np.fromfile(file_path, dtype=np.uint8), flag) + + def normalize_image(self, img): + img = img.transpose((2, 0, 1)).astype('float32') / 255.0 + img_mean = np.array(cfg.MEAN).reshape((len(cfg.MEAN), 1, 1)) + img_std = np.array(cfg.STD).reshape((len(cfg.STD), 1, 1)) + img -= img_mean + img /= img_std + + return img + + def process_image(self, line, data_dir, mode): + """ process_image """ + img, grt, img_name, grt_name = self.load_image( line, data_dir, mode=mode) # img.type: numpy.array, grt.type: numpy.array + if mode == ModelPhase.TRAIN: + # numpy.array convert to PIL.Image + img = Image.fromarray(cv2.cvtColor(img, cv2.COLOR_BGR2RGB)) + grt = Image.fromarray(grt.astype('uint8')).convert('L') + + crop_size = self.crop_size + # random scale + if self.rand_scale: + short_size = random.randint(int(self.base_size * cfg.DATAAUG.RAND_SCALE_MIN), int(self.base_size * cfg.DATAAUG.RAND_SCALE_MAX)) + else: + short_size = self.base_size + w, h = img.size + if h > w: + out_w = short_size + out_h = int(1.0 * h / w * out_w) + else: + out_h = short_size + out_w = int(1.0 * w / h * out_h) + img = img.resize((out_w, out_h), Image.BILINEAR) + grt = grt.resize((out_w, out_h), Image.NEAREST) + + # rand flip + if random.random() > 0.5: + img = img.transpose(Image.FLIP_LEFT_RIGHT) + grt = grt.transpose(Image.FLIP_LEFT_RIGHT) + + # padding + if short_size < crop_size: + pad_h = crop_size - out_h if out_h < crop_size else 0 + pad_w = crop_size - out_w if out_w < crop_size else 0 + img = ImageOps.expand(img, border=(pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2), fill=0) + grt = ImageOps.expand(grt, border=(pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2), fill=cfg.DATASET.IGNORE_INDEX) + + # random crop + w, h = img.size + x = random.randint(0, w - crop_size) + y = random.randint(0, h - crop_size) + img = img.crop((x, y, x + crop_size, y + crop_size)) + grt = grt.crop((x, y, x + crop_size, y + crop_size)) + + + # gaussian blur + if cfg.DATAAUG_EXTRA: + if random.random() > 0.7: + img = img.filter(ImageFilter.GaussianBlur(radius=random.random())) + + # PIL.Image -> cv2 + img = cv2.cvtColor(np.asarray(img),cv2.COLOR_RGB2BGR) + grt = np.array(grt) + + elif ModelPhase.is_eval(mode): + org_shape = [img.shape[0], img.shape[1]] # 1024 x 2048 for cityscapes + + elif ModelPhase.is_visual(mode): + org_shape = [img.shape[0], img.shape[1]] + #img, grt = resize(img, grt, mode=mode) + valid_shape = [img.shape[0], img.shape[1]] + #img, grt = rand_crop(img, grt, mode=mode) + else: + raise ValueError("Dataset mode={} Error!".format(mode)) + + # Normalize image + img = self.normalize_image(img) + + if ModelPhase.is_train(mode) or ModelPhase.is_eval(mode): + grt = np.expand_dims(np.array(grt).astype('int32'), axis=0) + ignore = (grt != cfg.DATASET.IGNORE_INDEX).astype('int32') + + + if ModelPhase.is_train(mode): + return (img, grt, ignore) + elif ModelPhase.is_eval(mode): + return (img, grt, ignore) + elif ModelPhase.is_visual(mode): + return (img, grt, img_name, valid_shape, org_shape) diff --git a/PaddleCV/Research/SemSegPaddle/src/datasets/cityscapes.py b/PaddleCV/Research/SemSegPaddle/src/datasets/cityscapes.py new file mode 100644 index 0000000000000000000000000000000000000000..e9487bf40c180b96564342badeea2e7599c56377 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/datasets/cityscapes.py @@ -0,0 +1,79 @@ +from __future__ import print_function +import sys +import os +import math +import random +import functools +import io +import time +import codecs +import numpy as np 
+import paddle +import paddle.fluid as fluid +import cv2 +from PIL import Image +import copy + +from src.utils.config import cfg +from src.models.model_builder import ModelPhase +from .baseseg import BaseSeg + + +class CityscapesSeg(BaseSeg): + def __init__(self, file_list, data_dir, shuffle=False, mode=ModelPhase.TRAIN, base_size=1024, crop_size=769, rand_scale=True): + + super(CityscapesSeg, self).__init__(file_list, data_dir, shuffle, mode, base_size, crop_size, rand_scale) + + def load_image(self, line, src_dir, mode=ModelPhase.TRAIN): + # original image cv2.imread flag setting + cv2_imread_flag = cv2.IMREAD_COLOR + if cfg.DATASET.IMAGE_TYPE == "rgba": + # If use RBGA 4 channel ImageType, use IMREAD_UNCHANGED flags to + # reserver alpha channel + cv2_imread_flag = cv2.IMREAD_UNCHANGED + + parts = line.strip().split(cfg.DATASET.SEPARATOR) + if len(parts) != 2: + if mode == ModelPhase.TRAIN or mode == ModelPhase.EVAL: + raise Exception("File list format incorrect! It should be image_name {} label_name\\n".format(cfg.DATASET.SEPARATOR)) + img_name, grt_name = parts[0], None + else: + img_name, grt_name = parts[0], parts[1] + + img_path = os.path.join(src_dir, img_name) + img = self.cv2_imread(img_path, cv2_imread_flag) + + if grt_name is not None: + grt_path = os.path.join(src_dir, grt_name) + grt = self.pil_imread(grt_path) + else: + grt = None + + img_height = img.shape[0] + img_width = img.shape[1] + if grt is not None: + grt_height = grt.shape[0] + grt_width = grt.shape[1] + id_to_trainid = [255, 255, 255, 255, 255, + 255, 255, 255, 0, 1, + 255, 255, 2, 3, 4, + 255, 255, 255, 5, 255, + 6, 7, 8, 9, 10, + 11, 12, 13, 14, 15, + 255, 255, 16, 17, 18] + grt_ = np.zeros([grt_height, grt_width]) + + for h in range(grt_height): + for w in range(grt_width): + grt_[h][w] = id_to_trainid[int(grt[h][w])+1] + + if img_height != grt_height or img_width != grt_width: + raise Exception("source img and label img must has the same size") + else: + if mode == ModelPhase.TRAIN or mode == ModelPhase.EVAL: + raise Exception("Empty image, src_dir: {}, img: {} & lab: {}".format(src_dir, img_path, grt_path)) + + if len(img.shape) < 3: + img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR) + + return img, grt_, img_name, grt_name diff --git a/PaddleCV/Research/SemSegPaddle/src/datasets/data_utils.py b/PaddleCV/Research/SemSegPaddle/src/datasets/data_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..65bea35f1ade62289e704271a6a37af27c8c2c7c --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/datasets/data_utils.py @@ -0,0 +1,115 @@ +""" +This code is based on https://github.com/fchollet/keras/blob/master/keras/utils/data_utils.py +""" + +import time +import numpy as np +import threading +import multiprocessing +try: + import queue +except ImportError: + import Queue as queue + + +class GeneratorEnqueuer(object): + """ + Multiple generators + + Args: + generators: + wait_time (float): time to sleep in-between calls to `put()`. + """ + + def __init__(self, generators, wait_time=0.05): + self.wait_time = wait_time + self._generators = generators + self._threads = [] + self._stop_events = [] + self.queue = None + self._manager = None + self.workers = 1 + + def start(self, workers=1, max_queue_size=16): + """ + Start worker threads which add data from the generator into the queue. 
+ + Args: + workers (int): number of worker threads + max_queue_size (int): queue size + (when full, threads could block on `put()`) + """ + + self.workers = workers + + def data_generator_task(pid): + """ + Data generator task. + """ + + def task(pid): + if (self.queue is not None + and self.queue.qsize() < max_queue_size): + generator_output = next(self._generators[pid]) + self.queue.put((generator_output)) + else: + time.sleep(self.wait_time) + + while not self._stop_events[pid].is_set(): + try: + task(pid) + except Exception: + self._stop_events[pid].set() + break + + try: + self._manager = multiprocessing.Manager() + self.queue = self._manager.Queue(maxsize=max_queue_size) + for pid in range(self.workers): + self._stop_events.append(multiprocessing.Event()) + thread = multiprocessing.Process( + target=data_generator_task, args=(pid, )) + thread.daemon = True + self._threads.append(thread) + thread.start() + except: + self.stop() + raise + + def is_running(self): + """ + Returns: + bool: Whether the worker theads are running. + """ + + # If queue is not empty then still in runing state wait for consumer + if not self.queue.empty(): + return True + + for pid in range(self.workers): + if not self._stop_events[pid].is_set(): + return True + + return False + + def stop(self, timeout=None): + """ + Stops running threads and wait for them to exit, if necessary. + Should be called by the same thread which called `start()`. + + Args: + timeout(int|None): maximum time to wait on `thread.join()`. + """ + if self.is_running(): + for pid in range(self.workers): + self._stop_events[pid].set() + + for thread in self._threads: + if thread.is_alive(): + thread.join(timeout) + if self._manager: + self._manager.shutdown() + + self._threads = [] + self._stop_events = [] + self.queue = None diff --git a/PaddleCV/Research/SemSegPaddle/src/datasets/pascal_context.py b/PaddleCV/Research/SemSegPaddle/src/datasets/pascal_context.py new file mode 100644 index 0000000000000000000000000000000000000000..44cbe4065d8f51588be8f24ee46ffddb9b02e368 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/datasets/pascal_context.py @@ -0,0 +1,65 @@ +from __future__ import print_function +import sys +import os +import math +import random +import functools +import io +import time +import codecs +import numpy as np +import paddle +import paddle.fluid as fluid +import cv2 +from PIL import Image +import copy + +from src.utils.config import cfg +from src.models.model_builder import ModelPhase +from .baseseg import BaseSeg + + +class PascalContextSeg(BaseSeg): + def __init__(self, + file_list, + data_dir, + shuffle=False, + mode=ModelPhase.TRAIN, base_size=520, crop_size=520, rand_scale=True): + super(PascalContextSeg, self).__init__(file_list, data_dir, shuffle, mode, base_size, crop_size, rand_scale) + + def _mask_transform(self, mask): + target = np.array(mask).astype('int32') - 1 + return target + + def load_image(self, line, src_dir, mode=ModelPhase.TRAIN): + # original image cv2.imread flag setting + cv2_imread_flag = cv2.IMREAD_COLOR + if cfg.DATASET.IMAGE_TYPE == "rgba": + # If use RBGA 4 channel ImageType, use IMREAD_UNCHANGED flags to + # reserver alpha channel + cv2_imread_flag = cv2.IMREAD_UNCHANGED + parts = line.strip().split(cfg.DATASET.SEPARATOR) + if len(parts) != 2: + if mode == ModelPhase.TRAIN or mode == ModelPhase.EVAL: + raise Exception("File list format incorrect! 
It should be" + " image_name{}label_name\\n".format( + cfg.DATASET.SEPARATOR)) + img_name, grt_name = parts[0], None + else: + img_name, grt_name = parts[0], parts[1] + + img_path = os.path.join(src_dir, img_name) + img = self.cv2_imread(img_path, cv2_imread_flag) + + if grt_name is not None: + grt_path = os.path.join(src_dir, grt_name) + grt = self.pil_imread(grt_path) + else: + grt = None + + if len(img.shape) < 3: + img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR) + + grt = self._mask_transform(grt) + return img, grt, img_name, grt_name + diff --git a/PaddleCV/Research/SemSegPaddle/src/models/__init__.py b/PaddleCV/Research/SemSegPaddle/src/models/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..0d80c0d8102d2e8160af9b66d0609a2973209e51 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/models/__init__.py @@ -0,0 +1,19 @@ +# coding: utf8 +# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +#import models.modeling +#import models.libs +#import models.backbone +from . import modeling, libs, backbone diff --git a/PaddleCV/Research/SemSegPaddle/src/models/backbone/__init__.py b/PaddleCV/Research/SemSegPaddle/src/models/backbone/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/PaddleCV/Research/SemSegPaddle/src/models/backbone/hrnet.py b/PaddleCV/Research/SemSegPaddle/src/models/backbone/hrnet.py new file mode 100644 index 0000000000000000000000000000000000000000..a8c3ab91e6d925791480ec2c2d4163a19a72b831 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/models/backbone/hrnet.py @@ -0,0 +1,220 @@ +# coding: utf8 +# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import paddle +import paddle.fluid as fluid +from paddle.fluid.initializer import MSRA +from paddle.fluid.param_attr import ParamAttr + +from src.utils.config import cfg + +class HRNet(): + """ + Reference: + Sun, Ke, et al. 
"Deep High-Resolution Representation Learning for Human Pose Estimation.", In CVPR 2019 + """ + def __init__(self, stride=4, seg_flag=False): + self.stride= stride + self.seg_flag=seg_flag + + def conv_bn_layer(self, input, filter_size, num_filters, stride=1, padding=1, num_groups=1, if_act=True, name=None): + conv = fluid.layers.conv2d( + input=input, + num_filters=num_filters, + filter_size=filter_size, + stride=stride, + padding=(filter_size - 1) // 2, + groups=num_groups, + act=None, + param_attr=ParamAttr(initializer=MSRA(), name=name + '_weights'), + bias_attr=False) + bn_name = name + '_bn' + bn = fluid.layers.batch_norm(input=conv, + param_attr=ParamAttr(name=bn_name + "_scale", + initializer=fluid.initializer.Constant(1.0)), + bias_attr=ParamAttr(name=bn_name + "_offset", + initializer=fluid.initializer.Constant(0.0)), + moving_mean_name=bn_name + '_mean', + moving_variance_name=bn_name + '_variance') + if if_act: + bn = fluid.layers.relu(bn) + return bn + + + def basic_block(self, input, num_filters, stride=1, downsample=False, name=None): + residual = input + conv = self.conv_bn_layer(input=input, filter_size=3, num_filters=num_filters, stride=stride, name=name + '_conv1') + conv = self.conv_bn_layer(input=conv, filter_size=3, num_filters=num_filters, if_act=False, name=name + '_conv2') + if downsample: + residual = self.conv_bn_layer(input=input, filter_size=1, num_filters=num_filters, if_act=False, + name=name + '_downsample') + return fluid.layers.elementwise_add(x=residual, y=conv, act='relu') + + + + def bottleneck_block(self, input, num_filters, stride=1, downsample=False, name=None): + residual = input + conv = self.conv_bn_layer(input=input, filter_size=1, num_filters=num_filters, name=name + '_conv1') + conv = self.conv_bn_layer(input=conv, filter_size=3, num_filters=num_filters, stride=stride, name=name + '_conv2') + conv = self.conv_bn_layer(input=conv, filter_size=1, num_filters=num_filters * 4, if_act=False, + name=name + '_conv3') + if downsample: + residual = self.conv_bn_layer(input=input, filter_size=1, num_filters=num_filters * 4, if_act=False, + name=name + '_downsample') + return fluid.layers.elementwise_add(x=residual, y=conv, act='relu') + + def fuse_layers(self, x, channels, multi_scale_output=True, name=None): + out = [] + for i in range(len(channels) if multi_scale_output else 1): + residual = x[i] + shape = residual.shape + width = shape[-1] + height = shape[-2] + for j in range(len(channels)): + if j > i: + y = self.conv_bn_layer(x[j], filter_size=1, num_filters=channels[i], if_act=False, + name=name + '_layer_' + str(i + 1) + '_' + str(j + 1)) + y = fluid.layers.resize_bilinear(input=y, out_shape=[height, width]) + residual = fluid.layers.elementwise_add(x=residual, y=y, act=None) + elif j < i: + y = x[j] + for k in range(i - j): + if k == i - j - 1: + y = self.conv_bn_layer(y, filter_size=3, num_filters=channels[i], stride=2, if_act=False, + name=name + '_layer_' + str(i + 1) + '_' + str(j + 1) + '_' + str(k + 1)) + else: + y = self.conv_bn_layer(y, filter_size=3, num_filters=channels[j], stride=2, + name=name + '_layer_' + str(i + 1) + '_' + str(j + 1) + '_' + str(k + 1)) + residual = fluid.layers.elementwise_add(x=residual, y=y, act=None) + + residual = fluid.layers.relu(residual) + out.append(residual) + return out + + def branches(self, x, block_num, channels, name=None): + out = [] + for i in range(len(channels)): + residual = x[i] + for j in range(block_num): + residual = self.basic_block(residual, channels[i], + name=name + '_branch_layer_' + 
str(i + 1) + '_' + str(j + 1)) + out.append(residual) + return out + + def high_resolution_module(self, x, channels, multi_scale_output=True, name=None): + residual = self.branches(x, 4, channels, name=name) + out = self.fuse_layers(residual, channels, multi_scale_output=multi_scale_output, name=name) + return out + + def transition_layer(self, x, in_channels, out_channels, name=None): + num_in = len(in_channels) + num_out = len(out_channels) + out = [] + for i in range(num_out): + if i < num_in: + if in_channels[i] != out_channels[i]: + residual = self.conv_bn_layer(x[i], filter_size=3, num_filters=out_channels[i], + name=name + '_layer_' + str(i + 1)) + out.append(residual) + else: + out.append(x[i]) + else: + residual = self.conv_bn_layer(x[-1], filter_size=3, num_filters=out_channels[i], stride=2, + name=name + '_layer_' + str(i + 1)) + out.append(residual) + return out + + def stage(self, x, num_modules, channels, multi_scale_output=True, name=None): + out = x + for i in range(num_modules): + if i == num_modules - 1 and multi_scale_output == False: + out = self.high_resolution_module(out, channels, multi_scale_output=False, name=name + '_' + str(i + 1)) + else: + out = self.high_resolution_module(out, channels, name=name + '_' + str(i + 1)) + + return out + + def layer1(self, input, name=None): + conv = input + for i in range(4): + conv = self.bottleneck_block(conv, num_filters=64, downsample=True if i == 0 else False, + name=name + '_' + str(i + 1)) + return conv + + #def highResolutionNet(input, num_classes): + def net(self, input, num_classes=1000): + + channels_2 = cfg.MODEL.HRNET.STAGE2.NUM_CHANNELS + channels_3 = cfg.MODEL.HRNET.STAGE3.NUM_CHANNELS + channels_4 = cfg.MODEL.HRNET.STAGE4.NUM_CHANNELS + + num_modules_2 = cfg.MODEL.HRNET.STAGE2.NUM_MODULES + num_modules_3 = cfg.MODEL.HRNET.STAGE3.NUM_MODULES + num_modules_4 = cfg.MODEL.HRNET.STAGE4.NUM_MODULES + + x = self.conv_bn_layer(input=input, filter_size=3, num_filters=64, stride=2, if_act=True, name='layer1_1') + x = self.conv_bn_layer(input=x, filter_size=3, num_filters=64, stride=2, if_act=True, name='layer1_2') + + la1 = self.layer1(x, name='layer2') + tr1 = self.transition_layer([la1], [256], channels_2, name='tr1') + st2 = self.stage(tr1, num_modules_2, channels_2, name='st2') + tr2 = self.transition_layer(st2, channels_2, channels_3, name='tr2') + st3 = self.stage(tr2, num_modules_3, channels_3, name='st3') + tr3 = self.transition_layer(st3, channels_3, channels_4, name='tr3') + st4 = self.stage(tr3, num_modules_4, channels_4, name='st4') + + # upsample + shape = st4[0].shape + height, width = shape[-2], shape[-1] + st4[1] = fluid.layers.resize_bilinear(st4[1], out_shape=[height, width]) + st4[2] = fluid.layers.resize_bilinear(st4[2], out_shape=[height, width]) + st4[3] = fluid.layers.resize_bilinear(st4[3], out_shape=[height, width]) + + out = fluid.layers.concat(st4, axis=1) + if self.seg_flag and self.stride==4: + return out + + last_channels = sum(channels_4) + + out = conv_bn_layer(input=out, filter_size=1, num_filters=last_channels, stride=1, if_act=True, name='conv-2') + out= fluid.layers.conv2d( + input=out, + num_filters=num_classes, + filter_size=1, + stride=1, + padding=0, + act=None, + param_attr=ParamAttr(initializer=MSRA(), name='conv-1_weights'), + bias_attr=False) + + out = fluid.layers.resize_bilinear(out, input.shape[2:]) + + + return out + + +def hrnet(): + model = HRNet(stride=4, seg_flag=True) + return model + +if __name__ == '__main__': + image_shape = [3, 769, 769] + image = 
fluid.layers.data(name='image', shape=image_shape, dtype='float32') + logit = hrnet(image, 4) + print("logit:", logit.shape) diff --git a/PaddleCV/Research/SemSegPaddle/src/models/backbone/mobilenet_v2.py b/PaddleCV/Research/SemSegPaddle/src/models/backbone/mobilenet_v2.py new file mode 100644 index 0000000000000000000000000000000000000000..5022a7826d16ef2cb24237b3f11cfb23e5189ff3 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/models/backbone/mobilenet_v2.py @@ -0,0 +1,302 @@ +# coding: utf8 +# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import paddle.fluid as fluid +from paddle.fluid.initializer import MSRA +from paddle.fluid.param_attr import ParamAttr + +__all__ = [ + 'MobileNetV2', 'MobileNetV2_x0_25', 'MobileNetV2_x0_5', 'MobileNetV2_x1_0', + 'MobileNetV2_x1_5', 'MobileNetV2_x2_0', 'MobileNetV2_scale' +] + + + +class MobileNetV2(): + def __init__(self, scale=1.0, change_depth=False, output_stride=None): + self.scale = scale + self.change_depth = change_depth + self.bottleneck_params_list = [ + (1, 16, 1, 1), + (6, 24, 2, 2), + (6, 32, 3, 2), + (6, 64, 4, 2), + (6, 96, 3, 1), + (6, 160, 3, 2), + (6, 320, 1, 1), + ] if change_depth == False else [ + (1, 16, 1, 1), + (6, 24, 2, 2), + (6, 32, 5, 2), + (6, 64, 7, 2), + (6, 96, 5, 1), + (6, 160, 3, 2), + (6, 320, 1, 1), + ] + self.modify_bottle_params(output_stride) + + def modify_bottle_params(self, output_stride=None): + if output_stride is not None and output_stride % 2 != 0: + raise Exception("output stride must to be even number") + if output_stride is None: + return + else: + stride = 2 + for i, layer_setting in enumerate(self.bottleneck_params_list): + t, c, n, s = layer_setting + stride = stride * s + if stride > output_stride: + s = 1 + self.bottleneck_params_list[i] = (t, c, n, s) + + def net(self, input, class_dim=1000, end_points=None, decode_points=None): + scale = self.scale + change_depth = self.change_depth + #if change_depth is True, the new depth is 1.4 times as deep as before. 
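+        # Each entry in bottleneck_params_list is a tuple (t, c, n, s):
+        # expansion factor t, output channels c, number of repeated inverted
+        # residual units n, and stride s of the first unit in the group.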
+ bottleneck_params_list = self.bottleneck_params_list + decode_ends = dict() + + def check_points(count, points): + if points is None: + return False + else: + if isinstance(points, list): + return (True if count in points else False) + else: + return (True if count == points else False) + + #conv1 + input = self.conv_bn_layer( + input, + num_filters=int(32 * scale), + filter_size=3, + stride=2, + padding=1, + if_act=True, + name='conv1_1') + layer_count = 1 + + #print("node test:", layer_count, input.shape) + + if check_points(layer_count, decode_points): + decode_ends[layer_count] = input + + if check_points(layer_count, end_points): + return input, decode_ends + + # bottleneck sequences + i = 1 + in_c = int(32 * scale) + for layer_setting in bottleneck_params_list: + t, c, n, s = layer_setting + i += 1 + input, depthwise_output = self.invresi_blocks( + input=input, + in_c=in_c, + t=t, + c=int(c * scale), + n=n, + s=s, + name='conv' + str(i)) + in_c = int(c * scale) + layer_count += n + + #print("node test:", layer_count, input.shape) + if check_points(layer_count, decode_points): + decode_ends[layer_count] = depthwise_output + + if check_points(layer_count, end_points): + return input, decode_ends + + #last_conv + input = self.conv_bn_layer( + input=input, + num_filters=int(1280 * scale) if scale > 1.0 else 1280, + filter_size=1, + stride=1, + padding=0, + if_act=True, + name='conv9') + + input = fluid.layers.pool2d( + input=input, + pool_size=7, + pool_stride=1, + pool_type='avg', + global_pooling=True) + + output = fluid.layers.fc( + input=input, + size=class_dim, + param_attr=ParamAttr(name='fc10_weights'), + bias_attr=ParamAttr(name='fc10_offset')) + return output + + def conv_bn_layer(self, + input, + filter_size, + num_filters, + stride, + padding, + channels=None, + num_groups=1, + if_act=True, + name=None, + use_cudnn=True): + conv = fluid.layers.conv2d( + input=input, + num_filters=num_filters, + filter_size=filter_size, + stride=stride, + padding=padding, + groups=num_groups, + act=None, + use_cudnn=use_cudnn, + param_attr=ParamAttr(name=name + '_weights'), + bias_attr=False) + bn_name = name + '_bn' + bn = fluid.layers.batch_norm( + input=conv, + param_attr=ParamAttr(name=bn_name + "_scale"), + bias_attr=ParamAttr(name=bn_name + "_offset"), + moving_mean_name=bn_name + '_mean', + moving_variance_name=bn_name + '_variance') + if if_act: + return fluid.layers.relu6(bn) + else: + return bn + + def shortcut(self, input, data_residual): + return fluid.layers.elementwise_add(input, data_residual) + + def inverted_residual_unit(self, + input, + num_in_filter, + num_filters, + ifshortcut, + stride, + filter_size, + padding, + expansion_factor, + name=None): + num_expfilter = int(round(num_in_filter * expansion_factor)) + + channel_expand = self.conv_bn_layer( + input=input, + num_filters=num_expfilter, + filter_size=1, + stride=1, + padding=0, + num_groups=1, + if_act=True, + name=name + '_expand') + + bottleneck_conv = self.conv_bn_layer( + input=channel_expand, + num_filters=num_expfilter, + filter_size=filter_size, + stride=stride, + padding=padding, + num_groups=num_expfilter, + if_act=True, + name=name + '_dwise', + use_cudnn=False) + + depthwise_output = bottleneck_conv + + linear_out = self.conv_bn_layer( + input=bottleneck_conv, + num_filters=num_filters, + filter_size=1, + stride=1, + padding=0, + num_groups=1, + if_act=False, + name=name + '_linear') + + if ifshortcut: + out = self.shortcut(input=input, data_residual=linear_out) + return out, depthwise_output + else: + 
return linear_out, depthwise_output + + def invresi_blocks(self, input, in_c, t, c, n, s, name=None): + first_block, depthwise_output = self.inverted_residual_unit( + input=input, + num_in_filter=in_c, + num_filters=c, + ifshortcut=False, + stride=s, + filter_size=3, + padding=1, + expansion_factor=t, + name=name + '_1') + + last_residual_block = first_block + last_c = c + + for i in range(1, n): + last_residual_block, depthwise_output = self.inverted_residual_unit( + input=last_residual_block, + num_in_filter=last_c, + num_filters=c, + ifshortcut=True, + stride=1, + filter_size=3, + padding=1, + expansion_factor=t, + name=name + '_' + str(i + 1)) + return last_residual_block, depthwise_output + + +def MobileNetV2_x0_25(): + model = MobileNetV2(scale=0.25) + return model + + +def MobileNetV2_x0_5(): + model = MobileNetV2(scale=0.5) + return model + + +def MobileNetV2_x1_0(): + model = MobileNetV2(scale=1.0) + return model + + +def MobileNetV2_x1_5(): + model = MobileNetV2(scale=1.5) + return model + + +def MobileNetV2_x2_0(): + model = MobileNetV2(scale=2.0) + return model + + +def MobileNetV2_scale(): + model = MobileNetV2(scale=1.2, change_depth=True) + return model + + +if __name__ == '__main__': + image_shape = [3, 224, 224] + image = fluid.layers.data(name='image', shape=image_shape, dtype='float32') + model = MobileNetV2_x1_0() + logit, decode_ends = model.net(image) + #print("logit:", logit.shape) diff --git a/PaddleCV/Research/SemSegPaddle/src/models/backbone/resnet.py b/PaddleCV/Research/SemSegPaddle/src/models/backbone/resnet.py new file mode 100644 index 0000000000000000000000000000000000000000..d988ee9687cbe40f72453ede5d7af7f935b2cafa --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/models/backbone/resnet.py @@ -0,0 +1,303 @@ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import math +import numpy as np +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr +from src.utils.config import cfg + +__all__ = [ + "ResNet", "ResNet18", "ResNet34", "ResNet50", "ResNet101", "ResNet152" +] + +class ResNet(): + def __init__(self, layers=50, scale=1.0): + self.layers = layers + self.scale = scale + + def net(self, + input, + class_dim=1000, + end_points=None, + decode_points=None, + resize_points=None, + dilation_dict=None): + layers = self.layers + supported_layers = [18, 34, 50, 101, 152] + assert layers in supported_layers, \ + "supported layers are {} but input layer is {}".format(supported_layers, layers) + + decode_ends = dict() + + def check_points(count, points): + if points is None: + return False + else: + if isinstance(points, list): + return (True if count in points else False) + else: + return (True if count == points else False) + + def get_dilated_rate(dilation_dict, idx): + if dilation_dict is None or idx not in dilation_dict: + return 1 + else: + return dilation_dict[idx] + + if layers == 18: + depth = [2, 2, 2, 2] + elif layers == 34 or layers == 50: + depth = [3, 4, 6, 3] + elif layers == 101: + depth = [3, 4, 23, 3] + elif layers == 152: + depth = [3, 8, 36, 3] + num_filters = [64, 128, 256, 512] + + # stage_1: 3 3x3_Conv + conv = self.conv_bn_layer( + input=input, + num_filters=int(64 * self.scale), + filter_size=3, + stride=2, + act='relu', + name="conv1_1") + conv = self.conv_bn_layer( + input=conv, + num_filters=int(64 * self.scale), + filter_size=3, + stride=1, + act='relu', + name="conv1_2") + conv = self.conv_bn_layer( + input=conv, + num_filters=int(128 * self.scale), 
+ filter_size=3, + stride=1, + act='relu', + name="conv1_3") + + conv = fluid.layers.pool2d( + input=conv, + pool_size=3, + pool_stride=2, + pool_padding=1, + pool_type='max') + + layer_count = 1 + if check_points(layer_count, decode_points): + decode_ends[layer_count] = conv + + if check_points(layer_count, end_points): + return conv, decode_ends + + if layers >= 50: + for block in range(len(depth)): + for i in range(depth[block]): #depth = [3, 4, 23, 3] + if layers in [101, 152] and block == 2: + if i == 0: + conv_name = "res" + str(block + 2) + "a" + else: + conv_name = "res" + str(block + 2) + "b" + str(i) + else: + conv_name = "res" + str(block + 2) + chr(97 + i) + dilation_rate = get_dilated_rate(dilation_dict, block) + # added by Rosun, employ multi-grid + if cfg.MODEL.BACKBONE_MULTI_GRID== True and block==3: + if i==0: + dilation_rate = dilation_rate*(i+1) + else: + dilation_rate = dilation_rate*(2*i) # 2, 4 + print("employ multi-grid for resnet backbone network: dilation_rate={}\n".format(dilation_rate)) + + conv = self.bottleneck_block( + input=conv, + num_filters=int(num_filters[block] * self.scale), + stride=2 + if i == 0 and block != 0 and dilation_rate == 1 else 1, + name=conv_name, + dilation=dilation_rate) + layer_count += 3 + + if check_points(layer_count, decode_points): + decode_ends[layer_count] = conv + + if check_points(layer_count, end_points): + return conv, decode_ends + + if check_points(layer_count, resize_points): + conv = self.interp( + conv, + np.ceil( + np.array(conv.shape[2:]).astype('int32') / 2)) + + pool = fluid.layers.pool2d(input=conv, pool_size=7, pool_type='avg', global_pooling=True) + stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0) + out = fluid.layers.fc(input=pool, + size=class_dim, + param_attr=fluid.param_attr.ParamAttr(initializer=fluid.initializer.Uniform(-stdv, stdv))) + else: + for block in range(len(depth)): + for i in range(depth[block]): + conv_name = "res" + str(block + 2) + chr(97 + i) + conv = self.basic_block( + input=conv, + num_filters=num_filters[block], + stride=2 if i == 0 and block != 0 else 1, + is_first=block == i == 0, + name=conv_name) + layer_count += 2 + if check_points(layer_count, decode_points): + decode_ends[layer_count] = conv + + if check_points(layer_count, end_points): + return conv, decode_ends + + pool = fluid.layers.pool2d( + input=conv, pool_size=7, pool_type='avg', global_pooling=True) + stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0) + out = fluid.layers.fc( + input=pool, + size=class_dim, + param_attr=fluid.param_attr.ParamAttr( + initializer=fluid.initializer.Uniform(-stdv, stdv))) + return out + + def zero_padding(self, input, padding): + return fluid.layers.pad( + input, [0, 0, 0, 0, padding, padding, padding, padding]) + + def interp(self, input, out_shape): + out_shape = list(out_shape.astype("int32")) + return fluid.layers.resize_bilinear(input, out_shape=out_shape) + + def conv_bn_layer(self, + input, + num_filters, + filter_size, + stride=1, + dilation=1, + groups=1, + act=None, + name=None): + + bias_attr=False + + conv = fluid.layers.conv2d( + input=input, + num_filters=num_filters, + filter_size=filter_size, + stride=stride, + padding=(filter_size - 1) // 2 if dilation == 1 else 0, + dilation=dilation, + groups=groups, + act=None, + param_attr=ParamAttr(name=name + "_weights"), + bias_attr=bias_attr, + name=name + '.conv2d.output.1') + + if name == "conv1": + bn_name = "bn_" + name + else: + bn_name = "bn" + name[3:] + return fluid.layers.batch_norm(input=conv, + act=act, + name=bn_name + '.output.1', + 
param_attr=ParamAttr(name=bn_name + '_scale'), + bias_attr=ParamAttr(bn_name + '_offset'), + moving_mean_name=bn_name + '_mean', + moving_variance_name=bn_name + '_variance', ) + + def shortcut(self, input, ch_out, stride, is_first, name): + ch_in = input.shape[1] + if ch_in != ch_out or stride != 1 or is_first == True: + return self.conv_bn_layer(input, ch_out, 1, stride, name=name) + else: + return input + + def bottleneck_block(self, input, num_filters, stride, name, dilation=1): + if self.layers == 101: + strides = [1, stride] + else: + strides = [stride, 1] + + conv0 = self.conv_bn_layer( + input=input, + num_filters=num_filters, + filter_size=1, + dilation=1, + stride=strides[0], + act='relu', + name=name + "_branch2a") + if dilation > 1: + conv0 = self.zero_padding(conv0, dilation) + conv1 = self.conv_bn_layer( + input=conv0, + num_filters=num_filters, + filter_size=3, + dilation=dilation, + stride=strides[1], + act='relu', + name=name + "_branch2b") + conv2 = self.conv_bn_layer( + input=conv1, + num_filters=num_filters * 4, + dilation=1, + filter_size=1, + act=None, + name=name + "_branch2c") + + short = self.shortcut( + input, + num_filters * 4, + stride, + is_first=False, + name=name + "_branch1") + + return fluid.layers.elementwise_add( + x=short, y=conv2, act='relu', name=name + ".add.output.5") + + def basic_block(self, input, num_filters, stride, is_first, name): + conv0 = self.conv_bn_layer( + input=input, + num_filters=num_filters, + filter_size=3, + act='relu', + stride=stride, + name=name + "_branch2a") + conv1 = self.conv_bn_layer( + input=conv0, + num_filters=num_filters, + filter_size=3, + act=None, + name=name + "_branch2b") + short = self.shortcut( + input, num_filters, stride, is_first, name=name + "_branch1") + return fluid.layers.elementwise_add(x=short, y=conv1, act='relu') + + +def ResNet18(): + model = ResNet(layers=18) + return model + + +def ResNet34(): + model = ResNet(layers=34) + return model + + +def ResNet50(): + model = ResNet(layers=50) + return model + + +def ResNet101(): + model = ResNet(layers=101) + return model + + +def ResNet152(): + model = ResNet(layers=152) + return model diff --git a/PaddleCV/Research/SemSegPaddle/src/models/backbone/xception.py b/PaddleCV/Research/SemSegPaddle/src/models/backbone/xception.py new file mode 100644 index 0000000000000000000000000000000000000000..be84e3ba0a83f0650ab35e5653722a8da0de4bd2 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/models/backbone/xception.py @@ -0,0 +1,317 @@ +# coding: utf8 +# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
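+
+# Note: each Xception variant below is described by a bottleneck_params dict with
+# "entry_flow", "middle_flow" and "exit_flow" entries of the form
+# (number of blocks, stride(s), channel(s)); e.g. xception_65 uses
+# (3, [2, 2, 2], [128, 256, 728]) for its entry flow.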
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import contextlib +import paddle +import math +import paddle.fluid as fluid +from src.models.libs.model_libs import scope, name_scope +from src.models.libs.model_libs import bn, bn_relu, relu +from src.models.libs.model_libs import conv +from src.models.libs.model_libs import separate_conv + +__all__ = ['xception_65', 'xception_41', 'xception_71'] + + +def check_data(data, number): + if type(data) == int: + return [data] * number + assert len(data) == number + return data + + +def check_stride(s, os): + if s <= os: + return True + else: + return False + + +def check_points(count, points): + if points is None: + return False + else: + if isinstance(points, list): + return (True if count in points else False) + else: + return (True if count == points else False) + + +class Xception(): + def __init__(self, backbone="xception_65"): + self.bottleneck_params = self.gen_bottleneck_params(backbone) + self.backbone = backbone + + def gen_bottleneck_params(self, backbone='xception_65'): + if backbone == 'xception_65': + bottleneck_params = { + "entry_flow": (3, [2, 2, 2], [128, 256, 728]), + "middle_flow": (16, 1, 728), + "exit_flow": (2, [2, 1], [[728, 1024, 1024], [1536, 1536, + 2048]]) + } + elif backbone == 'xception_41': + bottleneck_params = { + "entry_flow": (3, [2, 2, 2], [128, 256, 728]), + "middle_flow": (8, 1, 728), + "exit_flow": (2, [2, 1], [[728, 1024, 1024], [1536, 1536, + 2048]]) + } + elif backbone == 'xception_71': + bottleneck_params = { + "entry_flow": (5, [2, 1, 2, 1, 2], [128, 256, 256, 728, 728]), + "middle_flow": (16, 1, 728), + "exit_flow": (2, [2, 1], [[728, 1024, 1024], [1536, 1536, + 2048]]) + } + else: + raise Exception( + "xception backbont only support xception_41/xception_65/xception_71" + ) + return bottleneck_params + + def net(self, + input, + output_stride=32, + num_classes=1000, + end_points=None, + decode_points=None): + self.stride = 2 + self.block_point = 0 + self.output_stride = output_stride + self.decode_points = decode_points + self.short_cuts = dict() + with scope(self.backbone): + # Entry flow + data = self.entry_flow(input) + if check_points(self.block_point, end_points): + return data, self.short_cuts + + # Middle flow + data = self.middle_flow(data) + if check_points(self.block_point, end_points): + return data, self.short_cuts + + # Exit flow + data = self.exit_flow(data) + if check_points(self.block_point, end_points): + return data, self.short_cuts + + data = fluid.layers.reduce_mean(data, [2, 3], keep_dim=True) + data = fluid.layers.dropout(data, 0.5) + stdv = 1.0 / math.sqrt(data.shape[1] * 1.0) + with scope("logit"): + out = fluid.layers.fc( + input=data, + size=num_classes, + act='softmax', + param_attr=fluid.param_attr.ParamAttr( + name='weights', + initializer=fluid.initializer.Uniform(-stdv, stdv)), + bias_attr=fluid.param_attr.ParamAttr(name='bias')) + + return out + + def entry_flow(self, data): + param_attr = fluid.ParamAttr( + name=name_scope + 'weights', + regularizer=None, + initializer=fluid.initializer.TruncatedNormal(loc=0.0, scale=0.09)) + with scope("entry_flow"): + with scope("conv1"): + data = bn_relu( + conv( + data, 32, 3, stride=2, padding=1, + param_attr=param_attr)) + with scope("conv2"): + data = bn_relu( + conv( + data, 64, 3, stride=1, padding=1, + param_attr=param_attr)) + + # get entry flow params + block_num = self.bottleneck_params["entry_flow"][0] + strides = self.bottleneck_params["entry_flow"][1] + 
chns = self.bottleneck_params["entry_flow"][2] + strides = check_data(strides, block_num) + chns = check_data(chns, block_num) + + # params to control your flow + s = self.stride + block_point = self.block_point + output_stride = self.output_stride + with scope("entry_flow"): + for i in range(block_num): + block_point = block_point + 1 + with scope("block" + str(i + 1)): + stride = strides[i] if check_stride(s * strides[i], + output_stride) else 1 + data, short_cuts = self.xception_block( + data, chns[i], [1, 1, stride]) + s = s * stride + if check_points(block_point, self.decode_points): + self.short_cuts[block_point] = short_cuts[1] + + self.stride = s + self.block_point = block_point + return data + + def middle_flow(self, data): + block_num = self.bottleneck_params["middle_flow"][0] + strides = self.bottleneck_params["middle_flow"][1] + chns = self.bottleneck_params["middle_flow"][2] + strides = check_data(strides, block_num) + chns = check_data(chns, block_num) + + # params to control your flow + s = self.stride + block_point = self.block_point + output_stride = self.output_stride + with scope("middle_flow"): + for i in range(block_num): + block_point = block_point + 1 + with scope("block" + str(i + 1)): + stride = strides[i] if check_stride(s * strides[i], + output_stride) else 1 + data, short_cuts = self.xception_block( + data, chns[i], [1, 1, strides[i]], skip_conv=False) + s = s * stride + if check_points(block_point, self.decode_points): + self.short_cuts[block_point] = short_cuts[1] + + self.stride = s + self.block_point = block_point + return data + + def exit_flow(self, data): + block_num = self.bottleneck_params["exit_flow"][0] + strides = self.bottleneck_params["exit_flow"][1] + chns = self.bottleneck_params["exit_flow"][2] + strides = check_data(strides, block_num) + chns = check_data(chns, block_num) + + assert (block_num == 2) + # params to control your flow + s = self.stride + block_point = self.block_point + output_stride = self.output_stride + with scope("exit_flow"): + with scope('block1'): + block_point += 1 + stride = strides[0] if check_stride(s * strides[0], + output_stride) else 1 + data, short_cuts = self.xception_block(data, chns[0], + [1, 1, stride]) + s = s * stride + if check_points(block_point, self.decode_points): + self.short_cuts[block_point] = short_cuts[1] + with scope('block2'): + block_point += 1 + stride = strides[1] if check_stride(s * strides[1], + output_stride) else 1 + data, short_cuts = self.xception_block( + data, + chns[1], [1, 1, stride], + dilation=2, + has_skip=False, + activation_fn_in_separable_conv=True) + s = s * stride + if check_points(block_point, self.decode_points): + self.short_cuts[block_point] = short_cuts[1] + + self.stride = s + self.block_point = block_point + return data + + def xception_block(self, + input, + channels, + strides=1, + filters=3, + dilation=1, + skip_conv=True, + has_skip=True, + activation_fn_in_separable_conv=False): + repeat_number = 3 + channels = check_data(channels, repeat_number) + filters = check_data(filters, repeat_number) + strides = check_data(strides, repeat_number) + data = input + results = [] + for i in range(repeat_number): + with scope('separable_conv' + str(i + 1)): + if not activation_fn_in_separable_conv: + data = relu(data) + data = separate_conv( + data, + channels[i], + strides[i], + filters[i], + dilation=dilation) + else: + data = separate_conv( + data, + channels[i], + strides[i], + filters[i], + dilation=dilation, + act=relu) + results.append(data) + if not has_skip: + return 
data, results + if skip_conv: + param_attr = fluid.ParamAttr( + name=name_scope + 'weights', + regularizer=None, + initializer=fluid.initializer.TruncatedNormal( + loc=0.0, scale=0.09)) + with scope('shortcut'): + skip = bn( + conv( + input, + channels[-1], + 1, + strides[-1], + groups=1, + padding=0, + param_attr=param_attr)) + else: + skip = input + return data + skip, results + + +def xception_65(): + model = Xception("xception_65") + return model + + +def xception_41(): + model = Xception("xception_41") + return model + + +def xception_71(): + model = Xception("xception_71") + return model + + +if __name__ == '__main__': + image_shape = [3, 224, 224] + image = fluid.layers.data(name='image', shape=image_shape, dtype='float32') + model = xception_65() + logit = model.net(image) diff --git a/PaddleCV/Research/SemSegPaddle/src/models/libs/__init__.py b/PaddleCV/Research/SemSegPaddle/src/models/libs/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/PaddleCV/Research/SemSegPaddle/src/models/libs/model_libs.py b/PaddleCV/Research/SemSegPaddle/src/models/libs/model_libs.py new file mode 100644 index 0000000000000000000000000000000000000000..cae973a2262fbcdd58c5eda85c0a8d981bfd98fe --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/models/libs/model_libs.py @@ -0,0 +1,219 @@ +# coding: utf8 +# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
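+# Shared layer helpers for SemSegPaddle. `scope(name)` appends to the global
+# `name_scope`, and `conv` / `bn` / `deconv` derive their parameter names from
+# it, so nested scopes yield hierarchical variable names; `bn` switches between
+# batch norm and group norm via cfg.MODEL.DEFAULT_NORM_TYPE, `separate_conv`
+# chains a depthwise and a pointwise convolution, and `conv1d` reshapes a
+# (B, C, N) tensor so it can reuse conv2d. Illustrative naming sketch:
+#
+#   with scope("entry_flow"):
+#       with scope("conv1"):
+#           y = conv(x, 32, 3)   # weight parameter named "entry_flow/conv1/weights"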
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import sys +import paddle +import paddle.fluid as fluid +from src.utils.config import cfg +import contextlib + +bn_regularizer = fluid.regularizer.L2DecayRegularizer(regularization_coeff=0.0) +name_scope = "" + + +@contextlib.contextmanager +def scope(name): + global name_scope + bk = name_scope + name_scope = name_scope + name + '/' + yield + name_scope = bk + + +def max_pool(input, kernel, stride, padding): + data = fluid.layers.pool2d( + input, + pool_size=kernel, + pool_type='max', + pool_stride=stride, + pool_padding=padding) + return data + + +def avg_pool(input, kernel, stride, padding=0): + data = fluid.layers.pool2d( + input, + pool_size=kernel, + pool_type='avg', + pool_stride=stride, + pool_padding=padding) + return data + + +def group_norm(input, G, eps=1e-5, param_attr=None, bias_attr=None): + N, C, H, W = input.shape + if C % G != 0: + # print "group can not divide channle:", C, G + for d in range(10): + for t in [d, -d]: + if G + t <= 0: continue + if C % (G + t) == 0: + G = G + t + break + if C % G == 0: + # print "use group size:", G + break + assert C % G == 0 + x = fluid.layers.group_norm( + input, + groups=G, + param_attr=param_attr, + bias_attr=bias_attr, + name=name_scope + 'group_norm') + return x + + +def bn(*args, **kargs): + if cfg.MODEL.DEFAULT_NORM_TYPE == 'bn': + with scope('BatchNorm'): + return fluid.layers.batch_norm( + *args, + epsilon=cfg.MODEL.DEFAULT_EPSILON, + momentum=cfg.MODEL.BN_MOMENTUM, + param_attr=fluid.ParamAttr( + name=name_scope + 'gamma', regularizer=bn_regularizer), + bias_attr=fluid.ParamAttr( + name=name_scope + 'beta', regularizer=bn_regularizer), + moving_mean_name=name_scope + 'moving_mean', + moving_variance_name=name_scope + 'moving_variance', + **kargs) + elif cfg.MODEL.DEFAULT_NORM_TYPE == 'gn': + with scope('GroupNorm'): + return group_norm( + args[0], + cfg.MODEL.DEFAULT_GROUP_NUMBER, + eps=cfg.MODEL.DEFAULT_EPSILON, + param_attr=fluid.ParamAttr( + name=name_scope + 'gamma', regularizer=bn_regularizer), + bias_attr=fluid.ParamAttr( + name=name_scope + 'beta', regularizer=bn_regularizer)) + else: + raise Exception("Unsupport norm type:" + cfg.MODEL.DEFAULT_NORM_TYPE) + +def bn_zero(*args, **kargs): + if cfg.MODEL.DEFAULT_NORM_TYPE == 'bn': + with scope('BatchNormZeroInit'): + return fluid.layers.batch_norm( + *args, + epsilon=cfg.MODEL.DEFAULT_EPSILON, + momentum=cfg.MODEL.BN_MOMENTUM, + param_attr=fluid.ParamAttr( + name=name_scope + 'gamma', regularizer=bn_regularizer, + initializer=fluid.initializer.ConstantInitializer(value=0.0)), + bias_attr=fluid.ParamAttr( + name=name_scope + 'beta', regularizer=bn_regularizer, + initializer=fluid.initializer.ConstantInitializer(value=0.0)), + moving_mean_name=name_scope + 'moving_mean', + moving_variance_name=name_scope + 'moving_variance', + **kargs) + + +def bn_relu(data): + return fluid.layers.relu(bn(data)) + + +def relu(data): + return fluid.layers.relu(data) + + +def conv(*args, **kargs): + kargs['param_attr'] = name_scope + 'weights' + if 'bias_attr' in kargs and kargs['bias_attr']: + kargs['bias_attr'] = fluid.ParamAttr( + name=name_scope + 'biases', + regularizer=None, + initializer=fluid.initializer.ConstantInitializer(value=0.0)) + else: + kargs['bias_attr'] = False + return fluid.layers.conv2d(*args, **kargs) + + +def deconv(*args, **kargs): + kargs['param_attr'] = name_scope + 'weights' + if 'bias_attr' in kargs and kargs['bias_attr']: + kargs['bias_attr'] = 
name_scope + 'biases' + else: + kargs['bias_attr'] = False + return fluid.layers.conv2d_transpose(*args, **kargs) + + +def separate_conv(input, channel, stride, filter, dilation=1, act=None): + param_attr = fluid.ParamAttr( + name=name_scope + 'weights', + regularizer=fluid.regularizer.L2DecayRegularizer( + regularization_coeff=0.0), + initializer=fluid.initializer.TruncatedNormal(loc=0.0, scale=0.33)) + with scope('depthwise'): + input = conv( + input, + input.shape[1], + filter, + stride, + groups=input.shape[1], + padding=(filter // 2) * dilation, + dilation=dilation, + use_cudnn=False, + param_attr=param_attr) + input = bn(input) + if act: input = act(input) + + param_attr = fluid.ParamAttr( + name=name_scope + 'weights', + regularizer=None, + initializer=fluid.initializer.TruncatedNormal(loc=0.0, scale=0.06)) + with scope('pointwise'): + input = conv( + input, channel, 1, 1, groups=1, padding=0, param_attr=param_attr) + input = bn(input) + if act: input = act(input) + return input + + +def FCNHead(input, mid_feat_channel, num_classes, output_shape): + # Arch: Conv_3x3 + BN + ReLU + Dropout + Conv_1x1 + + # Conv_3x3 + BN + ReLU + aux_seg_name= "Aux_layer1" + with scope(aux_seg_name): + conv_feat= conv(input, mid_feat_channel, filter_size=3, padding=1, bias_attr=False, name=aux_seg_name + '_conv') + bn_feat = bn(conv_feat, act='relu') + # Dropout + dropout_out = fluid.layers.dropout(bn_feat, dropout_prob=0.1, name="Aux_dropout") + + # Conv_1x1 + bilinear_upsample + aux_seg_name= "Aux_layer2" + with scope(aux_seg_name): + aux_logit = conv(dropout_out, num_classes, filter_size=1, bias_attr=True, name= aux_seg_name + '_logit_conv') + aux_logit_interp = fluid.layers.resize_bilinear(aux_logit, out_shape=output_shape, name= aux_seg_name + '_logit_interp') + + return aux_logit_interp + +def conv1d(x, output_channels, name_scope, bias_attr=False): + ''' + x:B, C, N + reshape to 4D --> conv2d --> reshape to 3D + ''' + B, C, N = x.shape + with scope(name_scope): + x = fluid.layers.reshape(x, shape=[B, C, N, 1]) + if bias_attr: + x = conv(x, output_channels, filter_size=1, name=name_scope, bias_attr=bias_attr) + else: + x = conv(x, output_channels, filter_size=1, name=name_scope) + x = fluid.layers.reshape(x, shape=[B, C, N]) + return x diff --git a/PaddleCV/Research/SemSegPaddle/src/models/model_builder.py b/PaddleCV/Research/SemSegPaddle/src/models/model_builder.py new file mode 100644 index 0000000000000000000000000000000000000000..6ffc133fc862783c72825fc38c82ce4d271d3c41 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/models/model_builder.py @@ -0,0 +1,273 @@ +import sys +import struct +import importlib + +import paddle.fluid as fluid +import numpy as np +from paddle.fluid.proto.framework_pb2 import VarType + +from src.utils import solver +from src.utils.config import cfg +from src.utils.loss import multi_softmax_with_loss, multi_dice_loss, multi_bce_loss + + +class ModelPhase(object): + """ + Standard name for model phase in PaddleSeg + + The following standard keys are defined: + * `TRAIN`: training mode. + * `EVAL`: testing/evaluation mode. + * `PREDICT`: prediction/inference mode. 
+ * `VISUAL` : visualization mode + """ + + TRAIN = 'train' + EVAL = 'eval' + PREDICT = 'predict' + VISUAL = 'visual' + + @staticmethod + def is_train(phase): + return phase == ModelPhase.TRAIN + + @staticmethod + def is_predict(phase): + return phase == ModelPhase.PREDICT + + @staticmethod + def is_eval(phase): + return phase == ModelPhase.EVAL + + @staticmethod + def is_visual(phase): + return phase == ModelPhase.VISUAL + + @staticmethod + def is_valid_phase(phase): + """ Check valid phase """ + if ModelPhase.is_train(phase) or ModelPhase.is_predict(phase) \ + or ModelPhase.is_eval(phase) or ModelPhase.is_visual(phase): + return True + + return False + + +def map_model_name(model_name): + name_dict = { + "deeplabv3": "deeplabv3.deeplabv3", + "pspnet": "pspnet.pspnet", + "glore": "glore.glore", + } + if model_name in name_dict.keys(): + return name_dict[model_name] + else: + raise Exception( + "unknow model name, only support unet, deeplabv3p, icnet") + + +def get_func(func_name): + """Helper to return a function object by name. func_name must identify a + function in this module or the path to a function relative to the base + 'modeling' module. + """ + print("func_name: ", func_name) + if func_name == '': + return None + try: + parts = func_name.split('.') + # Refers to a function in this module + if len(parts) == 1: + return globals()[parts[0]] + # Otherwise, assume we're referencing a module under modeling + module_name = 'src.models.' + '.'.join(parts[:-1]) + print("module_name: ", module_name) + # method 1 + #from src.models.modeling import pspnet + # method 2 + module = importlib.import_module(module_name) + return getattr(module, parts[-1]) + except Exception: + print('Failed to find function: {}'.format(func_name)) + return module + + +def softmax(logit): + logit = fluid.layers.transpose(logit, [0, 2, 3, 1]) + logit = fluid.layers.softmax(logit) + logit = fluid.layers.transpose(logit, [0, 3, 1, 2]) + return logit + +def sigmoid_to_softmax(logit): + """ + one channel to two channel + """ + logit = fluid.layers.transpose(logit, [0, 2, 3, 1]) + logit = fluid.layers.sigmoid(logit) + logit_back = 1 - logit + logit = fluid.layers.concat([logit_back, logit], axis=-1) + logit = fluid.layers.transpose(logit, [0, 3, 1, 2]) + return logit + + +def build_model(main_prog, start_prog, phase=ModelPhase.TRAIN): + if not ModelPhase.is_valid_phase(phase): + raise ValueError("ModelPhase {} is not valid!".format(phase)) + if ModelPhase.is_train(phase): + width = cfg.DATAAUG.CROP_SIZE + height = cfg.DATAAUG.CROP_SIZE + else: + width = cfg.TEST.CROP_SIZE + height = cfg.TEST.CROP_SIZE + + image_shape = [cfg.DATASET.DATA_DIM, height, width] + grt_shape = [1, height, width] + class_num = cfg.DATASET.NUM_CLASSES + + with fluid.program_guard(main_prog, start_prog): + with fluid.unique_name.guard(): + # 在导出模型的时候,增加图像标准化预处理,减小预测部署时图像的处理流程 + # 预测部署时只须对输入图像增加batch_size维度即可 + if ModelPhase.is_predict(phase): + origin_image = fluid.layers.data(name='image', + shape=[ -1, 1, 1, cfg.DATASET.DATA_DIM], + dtype='float32', + append_batch_size=False) + image = fluid.layers.transpose(origin_image, [0, 3, 1, 2]) + origin_shape = fluid.layers.shape(image)[-2:] + mean = np.array(cfg.MEAN).reshape(1, len(cfg.MEAN), 1, 1) + mean = fluid.layers.assign(mean.astype('float32')) + std = np.array(cfg.STD).reshape(1, len(cfg.STD), 1, 1) + std = fluid.layers.assign(std.astype('float32')) + image = (image/255 - mean)/std + image = fluid.layers.resize_bilinear(image, + out_shape=[height, width], align_corners=False, align_mode=0) + 
else: + image = fluid.layers.data( name='image', shape=image_shape, dtype='float32') + label = fluid.layers.data( name='label', shape=grt_shape, dtype='int32') + mask = fluid.layers.data( name='mask', shape=grt_shape, dtype='int32') + + # use PyReader when doing traning and evaluation + if ModelPhase.is_train(phase) or ModelPhase.is_eval(phase): + iterable = True if ModelPhase.is_eval(phase) else False + print("iterable: ", iterable) + py_reader = fluid.io.PyReader( + feed_list=[image, label, mask], + capacity=cfg.DATALOADER.BUF_SIZE, + iterable=iterable, + use_double_buffer=True, + return_list=False) + + model_name = map_model_name(cfg.MODEL.MODEL_NAME) + model_func = get_func("modeling." + model_name) + + loss_type = cfg.SOLVER.LOSS + if not isinstance(loss_type, list): + loss_type = list(loss_type) + + # dice_loss或bce_loss只适用两类分割中 + if class_num > 2 and (("dice_loss" in loss_type) or ("bce_loss" in loss_type)): + raise Exception("dice loss and bce loss is only applicable to binary classfication") + + # 在两类分割情况下,当loss函数选择dice_loss或bce_loss的时候,最后logit输出通道数设置为1 + if ("dice_loss" in loss_type) or ("bce_loss" in loss_type): + class_num = 1 + if "softmax_loss" in loss_type: + raise Exception("softmax loss can not combine with dice loss or bce loss") + + logits = model_func(image, class_num) + + # 根据选择的loss函数计算相应的损失函数 + if ModelPhase.is_train(phase) or ModelPhase.is_eval(phase): + loss_valid = False + avg_loss_list = [] + valid_loss = [] + if "softmax_loss" in loss_type: + avg_loss_list.append(multi_softmax_with_loss(logits, + label, mask,class_num)) + loss_valid = True + valid_loss.append("softmax_loss") + if "dice_loss" in loss_type: + avg_loss_list.append(multi_dice_loss(logits, label, mask)) + loss_valid = True + valid_loss.append("dice_loss") + if "bce_loss" in loss_type: + avg_loss_list.append(multi_bce_loss(logits, label, mask)) + loss_valid = True + valid_loss.append("bce_loss") + if not loss_valid: + raise Exception("SOLVER.LOSS: {} is set wrong. it should " + "include one of (softmax_loss, bce_loss, dice_loss) at least" + " example: ['softmax_loss'], ['dice_loss'], ['bce_loss', 'dice_loss']".format(cfg.SOLVER.LOSS)) + + invalid_loss = [x for x in loss_type if x not in valid_loss] + if len(invalid_loss) > 0: + print("Warning: the loss {} you set is invalid. 
it will not be included in loss computed.".format(invalid_loss)) + + avg_loss = 0 + for i in range(0, len(avg_loss_list)): + avg_loss += avg_loss_list[i] + + #get pred result in original size + if isinstance(logits, tuple): + logit = logits[0] + else: + logit = logits + + if logit.shape[2:] != label.shape[2:]: + logit = fluid.layers.resize_bilinear(logit, label.shape[2:]) + + # return image input and logit output for inference graph prune + if ModelPhase.is_predict(phase): + # 两类分割中,使用dice_loss或bce_loss返回的logit为单通道,进行到两通道的变换 + if class_num == 1: + logit = sigmoid_to_softmax(logit) + else: + logit = softmax(logit) + logit = fluid.layers.resize_bilinear(logit, out_shape=origin_shape, align_corners=False, align_mode=0) + logit = fluid.layers.transpose(logit, [0, 2, 3, 1]) + logit = fluid.layers.argmax(logit, axis=3) + return origin_image, logit + + if class_num == 1: + out = sigmoid_to_softmax(logit) + out = fluid.layers.transpose(out, [0, 2, 3, 1]) + else: + out = fluid.layers.transpose(logit, [0, 2, 3, 1]) + + pred = fluid.layers.argmax(out, axis=3) + pred = fluid.layers.unsqueeze(pred, axes=[3]) + if ModelPhase.is_visual(phase): + if class_num == 1: + logit = sigmoid_to_softmax(logit) + else: + logit = softmax(logit) + return pred, logit + + if ModelPhase.is_eval(phase): + out = fluid.layers.transpose(out, [0, 3, 1, 2]) #unnormalized probability + #return py_reader, avg_loss, pred, label, mask + return py_reader, avg_loss, out, label, mask + + if ModelPhase.is_train(phase): + optimizer = solver.Solver(main_prog, start_prog) + decayed_lr = optimizer.optimise(avg_loss) + return py_reader, avg_loss, decayed_lr, pred, label, mask + + +def to_int(string, dest="I"): + return struct.unpack(dest, string)[0] + + +def parse_shape_from_file(filename): + with open(filename, "rb") as file: + version = file.read(4) + lod_level = to_int(file.read(8), dest="Q") + for i in range(lod_level): + _size = to_int(file.read(8), dest="Q") + _ = file.read(_size) + version = file.read(4) + tensor_desc_size = to_int(file.read(4)) + tensor_desc = VarType.TensorDesc() + tensor_desc.ParseFromString(file.read(tensor_desc_size)) + return tuple(tensor_desc.dims) diff --git a/PaddleCV/Research/SemSegPaddle/src/models/modeling/__init__.py b/PaddleCV/Research/SemSegPaddle/src/models/modeling/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/PaddleCV/Research/SemSegPaddle/src/models/modeling/deeplabv3.py b/PaddleCV/Research/SemSegPaddle/src/models/modeling/deeplabv3.py new file mode 100644 index 0000000000000000000000000000000000000000..4a4fd31554d71ad38a5f5a8a4494146f9cfbf4f8 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/models/modeling/deeplabv3.py @@ -0,0 +1,174 @@ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import contextlib +import paddle +import paddle.fluid as fluid +from src.utils.config import cfg +from src.models.libs.model_libs import scope, name_scope +from src.models.libs.model_libs import bn, bn_relu, relu, FCNHead +from src.models.libs.model_libs import conv +from src.models.libs.model_libs import separate_conv +from src.models.backbone.mobilenet_v2 import MobileNetV2 as mobilenet_backbone +from src.models.backbone.xception import Xception as xception_backbone +from src.models.backbone.resnet import ResNet as resnet_backbone +from src.models.backbone.hrnet import HRNet as hrnet_backbone + + + +def ASPPHead(input, mid_channel, num_classes, 
output_shape): + # Arch of Atorus Spatial Pyramid Pooling Module: + # + # |----> ImagePool + Conv_1x1 + BN + ReLU + bilinear_interp-------->|————————| + # | | | + # |----> Conv_1x1 + BN + ReLU -------->| | + # | | | + # x----->|----> AtrousConv_3x3 + BN + ReLU -------->| concat |----> Conv_1x1 + BN + ReLU -->Dropout --> Conv_1x1 + # | | | + # |----> AtrousConv_3x3 + BN + ReLU -------->| | + # | | | + # |----> AtorusConv_3x3 + BN + ReLU -------->|________| + # + # + + if cfg.MODEL.BACKBONE_OUTPUT_STRIDE == 16: + aspp_ratios = [6, 12, 18] + elif cfg.MODEL.BACKBONE_OUTPUT_STRIDE == 8: + aspp_ratios = [12, 24, 36] + else: + raise Exception("deeplab only support stride 8 or 16") + + param_attr = fluid.ParamAttr(name=name_scope + 'weights', regularizer=None, + initializer=fluid.initializer.TruncatedNormal(loc=0.0, scale=0.06)) + with scope('ASPPHead'): + with scope("image_pool"): + image_avg = fluid.layers.reduce_mean( input, [2, 3], keep_dim=True) + image_avg = bn_relu( conv( image_avg, mid_channel, 1, 1, groups=1, padding=0, param_attr=param_attr)) + image_avg = fluid.layers.resize_bilinear(image_avg, input.shape[2:]) + + with scope("aspp0"): + aspp0 = bn_relu( conv( input, mid_channel, 1, 1, groups=1, padding=0, param_attr=param_attr)) + with scope("aspp1"): + if cfg.MODEL.DEEPLAB.ASPP_WITH_SEP_CONV: + aspp1 = separate_conv( input, mid_channel, 1, 3, dilation=aspp_ratios[0], act=relu) + else: + aspp1 = bn_relu( conv( input, mid_channel, stride=1, filter_size=3, dilation=aspp_ratios[0], + padding=aspp_ratios[0], param_attr=param_attr)) + with scope("aspp2"): + if cfg.MODEL.DEEPLAB.ASPP_WITH_SEP_CONV: + aspp2 = separate_conv( input, mid_channel, 1, 3, dilation=aspp_ratios[1], act=relu) + else: + aspp2 = bn_relu( conv( input, mid_channel, stride=1, filter_size=3, dilation=aspp_ratios[1], + padding=aspp_ratios[1], param_attr=param_attr)) + with scope("aspp3"): + if cfg.MODEL.DEEPLAB.ASPP_WITH_SEP_CONV: + aspp3 = separate_conv( input, mid_channel, 1, 3, dilation=aspp_ratios[2], act=relu) + else: + aspp3 = bn_relu( conv( input, mid_channel, stride=1, filter_size=3, dilation=aspp_ratios[2], + padding=aspp_ratios[2], param_attr=param_attr)) + with scope("concat"): + feat = fluid.layers.concat([image_avg, aspp0, aspp1, aspp2, aspp3], axis=1) + feat = bn_relu( conv( feat, 2*mid_channel, 1, 1, groups=1, padding=0, param_attr=param_attr)) + feat = fluid.layers.dropout(feat, 0.1) + + # Conv_1x1 + bilinear_upsample + seg_name = "logit" + with scope(seg_name): + param_attr = fluid.ParamAttr( name= seg_name+'_weights', + regularizer=fluid.regularizer.L2DecayRegularizer(regularization_coeff=0.0), + initializer=fluid.initializer.TruncatedNormal(loc=0.0, scale=0.01)) + logit = conv(feat, num_classes, filter_size=1, param_attr=param_attr, bias_attr=True, name=seg_name+'_conv') + logit_interp = fluid.layers.resize_bilinear(logit, out_shape=output_shape, name=seg_name+'_interp') + + return logit_interp + + + +def mobilenetv2(input): + # Backbone: mobilenetv2结构配置 + # DEPTH_MULTIPLIER: mobilenetv2的scale设置,默认1.0 + # OUTPUT_STRIDE:下采样倍数 + # end_points: mobilenetv2的block数 + # decode_point: 从mobilenetv2中引出分支所在block数, 作为decoder输入 + scale = cfg.MODEL.DEEPLABv3.DEPTH_MULTIPLIER + output_stride = cfg.MODEL.DEEPLABv3.OUTPUT_STRIDE + model = mobilenet_backbone(scale=scale, output_stride=output_stride) + end_points = 18 + decode_point = 4 + data, decode_shortcuts = model.net( + input, end_points=end_points, decode_points=decode_point) + decode_shortcut = decode_shortcuts[decode_point] + return data, decode_shortcut + + +def 
xception(input): + # Backbone: Xception结构配置, xception_65, xception_41, xception_71三种可选 + # decode_point: 从Xception中引出分支所在block数,作为decoder输入 + # end_point:Xception的block数 + cfg.MODEL.DEFAULT_EPSILON = 1e-3 + model = xception_backbone(cfg.MODEL.BACKBONE) + backbone = cfg.MODEL.BACKBONE + output_stride = cfg.MODEL.DEEPLABv3.OUTPUT_STRIDE + if '65' in backbone: + decode_point = 2 + end_points = 21 + if '41' in backbone: + decode_point = 2 + end_points = 13 + if '71' in backbone: + decode_point = 3 + end_points = 23 + data, decode_shortcuts = model.net( + input, + output_stride=output_stride, + end_points=end_points, + decode_points=decode_point) + decode_shortcut = decode_shortcuts[decode_point] + return data, decode_shortcut + + +def resnet(input): + # dilation_dict: + # key: stage num + # value: dilation factor + + scale = cfg.MODEL.DEEPLABv3.DEPTH_MULTIPLIER + layers = cfg.MODEL.BACKBONE_LAYERS + end_points = layers - 1 + decode_points = [91,100 ] # [10, 22, 91, 100], for obtaining feature maps of res2,res3, res4, and res5 + dilation_dict = {2:2, 3:4} + model = resnet_backbone(layers, scale) + res5, feat_dict = model.net(input, + end_points=end_points, + dilation_dict=dilation_dict, + decode_points=decode_points) + return res5, feat_dict + + +def hrnet(input): + model = hrnet_backbone(stride=4, seg_flag=True) + feats = model.net(input) + return feats + +def deeplabv3(input, num_classes): + """ + Chen, Liang-Chieh, et al. "Rethinking atrous convolution for semantic image segmentation", in arXiv:1706:05587 + """ + if 'xception' in cfg.MODEL.BACKBONE: + data, decode_shortcut = xception(input) + elif 'mobilenet' in cfg.MODEL.BACKBONE: + data, decode_shortcut = mobilenetv2(input) + elif 'resnet' in cfg.MODEL.BACKBONE: + res5, feat_dict = resnet(input) + res4 = feat_dict[91] + elif 'hrnet' in cfg.MODEL.BACKBONE: + res5 = hrnet(input) + else: + raise Exception("deeplabv3 only support xception, mobilenet, resnet, and hrnet backbone") + + logit = ASPPHead(res5, mid_channel= 256, num_classes= num_classes, output_shape= input.shape[2:]) + if cfg.MODEL.DEEPLABv3.AuxHead: + aux_logit = FCNHead(res4, 256, num_classes, input.shape[2:]) + return logit, aux_logit + return logit + diff --git a/PaddleCV/Research/SemSegPaddle/src/models/modeling/glore.py b/PaddleCV/Research/SemSegPaddle/src/models/modeling/glore.py new file mode 100644 index 0000000000000000000000000000000000000000..9b909a1c1865f507f0f533f7c85391840e8eac9b --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/models/modeling/glore.py @@ -0,0 +1,126 @@ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import sys +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr +from src.models.libs.model_libs import scope, name_scope +from src.models.libs.model_libs import avg_pool, conv, bn, bn_zero, conv1d, FCNHead +from src.models.backbone.resnet import ResNet as resnet_backbone +from src.utils.config import cfg + + +def get_logit_interp(input, num_classes, out_shape, name="logit"): + # 1x1_Conv + param_attr = fluid.ParamAttr( + name= name + 'weights', + regularizer= fluid.regularizer.L2DecayRegularizer(regularization_coeff=0.0), + initializer= fluid.initializer.TruncatedNormal(loc=0.0, scale=0.01)) + + with scope(name): + logit = conv(input, num_classes, filter_size=1, param_attr=param_attr, bias_attr=True, name=name+'_conv') + logit_interp = fluid.layers.resize_bilinear( logit, out_shape=out_shape, name=name+'_interp') + return logit_interp + + +def 
gcn_module(name_scope, x, num_node, num_state): + ''' + input: any tensor of 3D, B,C,N + ''' + print(x.shape) + h = fluid.layers.transpose(x, perm=[0, 2, 1]) #B,C,N-->B,N,C + h = conv1d(h, num_node, name_scope+'_conv1d1', bias_attr=True) + h = fluid.layers.transpose(h, perm=[0, 2, 1]) #B,C,N + h = fluid.layers.elementwise_add(h, x, act='relu') + h = conv1d(h, num_state, name_scope+'_conv1d2', bias_attr= False) + return h + +def gru_module(x, num_state, num_node): + ''' + Global Reasoning Unit: projection --> graph reasoning --> reverse projection + params: + x: B x C x H x W + num_state: the dimension of each vertex feature + num_node: the number of vertet + output: B x C x H x W + feature trans: + B, C, H, W --> B, N, H, W --> B, N, H*W -->B, N, C1 -->B, C1, N-->B, C1, N-->B, C1, H*W-->B, C, H, W + --> B, C1,H, W -->B, C1,H*W -->B, H*W, C1 + ''' + # generate B + num_batch, C, H, W = x.shape + with scope('projection'): + B = conv(x, num_node, + filter_size=1, + bias_attr=True, + name='projection'+'_conv') #num_batch, node, H, W + B = fluid.layers.reshape(B, shape=[num_batch, num_node, H*W]) # Projection Matrix: num_batch, node, L=H*W + # reduce dimension + with scope('reduce_channel'): + x_reduce = conv(x, num_state, + filter_size=1, + bias_attr=True, + name='reduce_channel'+'_conv') #num_batch, num_state, H, W + x_reduce = fluid.layers.reshape(x_reduce, shape=[num_batch, num_state, H*W]) #num_batch, num_state, L + x_reduce = fluid.layers.transpose(x_reduce, perm=[0, 2, 1]) #num_batch, L, num_state + + V = fluid.layers.transpose(fluid.layers.matmul(B, x_reduce), perm=[0,2,1]) #num_batch, num_state, num_node + #L = fluid.layers.fill_constant(shape=[1], value=H*W, dtype='float32') + #V = fluid.layers.elementwise_div(V, L) + new_V = gcn_module('gru'+'_gcn', V, num_node, num_state) + + B = fluid.layers.reshape(B, shape= [num_batch, num_node, H*W]) + D = fluid.layers.transpose(B, perm=[0, 2, 1]) + Y = fluid.layers.matmul(D, fluid.layers.transpose(new_V, perm=[0, 2, 1])) + Y = fluid.layers.transpose(Y, perm=[0, 2, 1]) + Y = fluid.layers.reshape(Y, shape=[num_batch, num_state, H, W]) + with scope('extend_dim'): + Y = conv(Y, C, filter_size=1, bias_attr=False, name='extend_dim'+'_conv') + #Y = bn_zero(Y) + Y = bn(Y) + out = fluid.layers.elementwise_add(Y, x) + return out + +def resnet(input): + # end_points: end_layer of resnet backbone + # dilation_dict: dilation factor for stages_key + scale = cfg.MODEL.GLORE.DEPTH_MULTIPLIER + layers = cfg.MODEL.BACKBONE_LAYERS + end_points = layers - 1 + dilation_dict = {2:2, 3:4} + decode_points= [91, 100] + model = resnet_backbone(layers, scale) + res5, feat_dict = model.net(input, + end_points=end_points, + dilation_dict=dilation_dict, + decode_points= decode_points) + + return res5, feat_dict + +def glore(input, num_classes): + """ + Reference: + Chen, Yunpeng, et al. "Graph-Based Global Reasoning Networks", In CVPR 2019 + """ + + # Backbone: ResNet + res5, feat_dict = resnet(input) + res4= feat_dict[91] + # 3x3 Conv. 
2048 -> 512 + reduce_kernel=3 + if cfg.DATASET.DATASET_NAME=='cityscapes': + reduce_kernel=1 + with scope('feature'): + feature = conv(res5, 512, filter_size=reduce_kernel, bias_attr=False, name='feature_conv') + feature = bn(feature, act='relu') + # GRU Module + gru_output = gru_module(feature, num_state= 128, num_node = 64) + dropout = fluid.layers.dropout(gru_output, dropout_prob=0.1, name="dropout") + + logit = get_logit_interp(dropout, num_classes, input.shape[2:]) + if cfg.MODEL.GLORE.AuxHead: + aux_logit = FCNHead(res4, 256, num_classes, input.shape[2:]) + return logit, aux_logit + + return logit + diff --git a/PaddleCV/Research/SemSegPaddle/src/models/modeling/pspnet.py b/PaddleCV/Research/SemSegPaddle/src/models/modeling/pspnet.py new file mode 100644 index 0000000000000000000000000000000000000000..286c48bd733cf3a8c3aa3a49a0d5e8f0d18a2e1f --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/models/modeling/pspnet.py @@ -0,0 +1,100 @@ +from __future__ import division +from __future__ import print_function +import sys +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr +from src.models.libs.model_libs import scope, name_scope +from src.models.libs.model_libs import avg_pool, conv, bn, FCNHead +from src.models.backbone.resnet import ResNet as resnet_backbone +from src.models.backbone.hrnet import HRNet as hrnet_backbone +from src.utils.config import cfg + + +def PSPHead(input, out_features, num_classes, output_shape): + # Arch of Pyramid Scene Parsing Module: + # + # |----> Pool_1x1 + Conv_1x1 + BN + ReLU + bilinear_interp-------->|————————| + # | | | + # |----> Pool_2x2 + Conv_1x1 + BN + ReLU + bilinear_interp-------->| | + # x ------>| | concat |----> Conv_3x3 + BN + ReLU -->Dropout --> Conv_1x1 + # | |----> Pool_3x3 + Conv_1x1 + BN + ReLU + bilinear_interp-------->| | + # | | | | + # | |----> Pool_6x6 + Conv_1x1 + BN + ReLU + bilinear_interp-------->|________| + # | ^ + # |——————————————————————————————————————————————————————————————————————————————| + # + cat_layers = [] + sizes = (1,2,3,6) + # 4 parallel pooling branches + for size in sizes: + psp_name = "psp" + str(size) + with scope(psp_name): + pool_feat = fluid.layers.adaptive_pool2d(input, pool_size=[size, size], pool_type='avg', + name=psp_name+'_adapool') + conv_feat = conv(pool_feat, out_features, filter_size=1, bias_attr=True, + name= psp_name + '_conv') + bn_feat = bn(conv_feat, act='relu') + interp = fluid.layers.resize_bilinear(bn_feat, out_shape=input.shape[2:], name=psp_name+'_interp') + cat_layers.append(interp) + cat_layers = [input] + cat_layers[::-1] + cat = fluid.layers.concat(cat_layers, axis=1, name='psp_cat') + + # Conv_3x3 + BN + ReLU + psp_end_name = "psp_end" + with scope(psp_end_name): + data = conv(cat, out_features, filter_size=3, padding=1, bias_attr=True, name=psp_end_name) + out = bn(data, act='relu') + # Dropout + dropout_out = fluid.layers.dropout(out, dropout_prob=0.1, name="dropout") + + # Conv_1x1 + bilinear_upsample + seg_name = "logit" + with scope(seg_name): + param_attr = fluid.ParamAttr( name= seg_name+'_weights', + regularizer=fluid.regularizer.L2DecayRegularizer(regularization_coeff=0.0), + initializer=fluid.initializer.TruncatedNormal(loc=0.0, scale=0.01)) + logit = conv(dropout_out, num_classes, filter_size=1, param_attr=param_attr, bias_attr=True, name=seg_name+'_conv') + logit_interp = fluid.layers.resize_bilinear(logit, out_shape=output_shape, name=seg_name+'_interp') + + return logit_interp + +def resnet(input): + # dilation_dict: + # key: stage num + # 
value: dilation factor + + scale = cfg.MODEL.PSPNET.DEPTH_MULTIPLIER + layers = cfg.MODEL.BACKBONE_LAYERS + end_points = layers - 1 + decode_points = [91, 100] # [10, 22, 91, 100], for obtaining feature maps of res2,res3, res4, and res5 + dilation_dict = {2:2, 3:4} + model = resnet_backbone(layers, scale) + res5, feat_dict = model.net(input, + end_points=end_points, + dilation_dict=dilation_dict, + decode_points=decode_points) + return res5, feat_dict + +def hrnet(input): + model = hrnet_backbone(stride=4, seg_flag=True) + feats = model.net(input) + return feats + +def pspnet(input, num_classes): + """ + Reference: + Zhao, Hengshuang, et al. "Pyramid scene parsing network.", In CVPR 2017 + """ + if 'resnet' in cfg.MODEL.BACKBONE: + res5, feat_dict = resnet(input) + res4 = feat_dict[91] + elif 'hrnet' in cfg.MODEL.BACKBONE: + res5 = hrnet(input) + else: + raise Exception("pspnet only support resnet and hrnet backbone") + logit = PSPHead(res5, 512, num_classes, input.shape[2:]) + if cfg.MODEL.PSPNET.AuxHead: + aux_logit = FCNHead(res4, 256, num_classes, input.shape[2:]) + return logit, aux_logit + return logit + diff --git a/PaddleCV/Research/SemSegPaddle/src/utils/__init__.py b/PaddleCV/Research/SemSegPaddle/src/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/PaddleCV/Research/SemSegPaddle/src/utils/collect.py b/PaddleCV/Research/SemSegPaddle/src/utils/collect.py new file mode 100644 index 0000000000000000000000000000000000000000..c434bf47a443e03dbd4ef352cbf7ceacd152cd4a --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/utils/collect.py @@ -0,0 +1,149 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""A simple attribute dictionary used for representing configuration options.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals + +import copy +import codecs +from ast import literal_eval + +import yaml +import six + + +class SegConfig(dict): + def __init__(self, *args, **kwargs): + super(SegConfig, self).__init__(*args, **kwargs) + self.immutable = False + + def __setattr__(self, key, value, create_if_not_exist=True): + if key in ["immutable"]: + self.__dict__[key] = value + return + + t = self + keylist = key.split(".") + for k in keylist[:-1]: + t = t.__getattr__(k, create_if_not_exist) + + t.__getattr__(keylist[-1], create_if_not_exist) + t[keylist[-1]] = value + + def __getattr__(self, key, create_if_not_exist=True): + if key in ["immutable"]: + return self.__dict__[key] + + if not key in self: + if not create_if_not_exist: + raise KeyError + self[key] = SegConfig() + return self[key] + + def __setitem__(self, key, value): + # + if self.immutable: + raise AttributeError( + 'Attempted to set "{}" to "{}", but SegConfig is immutable'. 
+ format(key, value)) + # + if isinstance(value, six.string_types): + try: + value = literal_eval(value) + except ValueError: + pass + except SyntaxError: + pass + super(SegConfig, self).__setitem__(key, value) + + def update_from_segconfig(self, other): + if isinstance(other, dict): + other = SegConfig(other) + assert isinstance(other, SegConfig) + diclist = [("", other)] + while len(diclist): + prefix, tdic = diclist[0] + diclist = diclist[1:] + for key, value in tdic.items(): + key = "{}.{}".format(prefix, key) if prefix else key + if isinstance(value, dict): + diclist.append((key, value)) + continue + try: + self.__setattr__(key, value, create_if_not_exist=False) + except KeyError: + raise KeyError('Non-existent config key: {}'.format(key)) + + def check_and_infer(self): + if self.DATASET.IMAGE_TYPE in ['rgb', 'gray']: + self.DATASET.DATA_DIM = 3 + elif self.DATASET.IMAGE_TYPE in ['rgba']: + self.DATASET.DATA_DIM = 4 + else: + raise KeyError( + 'DATASET.IMAGE_TYPE config error, only support `rgb`, `gray` and `rgba`' + ) + if self.MEAN is not None: + self.DATASET.PADDING_VALUE = [x*255.0 for x in self.MEAN] + """ + if not self.TRAIN_CROP_SIZE: + raise ValueError( + 'TRAIN_CROP_SIZE is empty! Please set a pair of values in format (width, height)' + ) + + if not self.EVAL_CROP_SIZE: + raise ValueError( + 'EVAL_CROP_SIZE is empty! Please set a pair of values in format (width, height)' + ) + """ + + # Ensure file list is use UTF-8 encoding + train_sets = codecs.open(self.DATASET.TRAIN_FILE_LIST, 'r', 'utf-8').readlines() + val_sets = codecs.open(self.DATASET.VAL_FILE_LIST, 'r', 'utf-8').readlines() + test_sets = codecs.open(self.DATASET.TEST_FILE_LIST, 'r', 'utf-8').readlines() + self.DATASET.TRAIN_TOTAL_IMAGES = len(train_sets) + self.DATASET.VAL_TOTAL_IMAGES = len(val_sets) + self.DATASET.TEST_TOTAL_IMAGES = len(test_sets) + + if self.MODEL.MODEL_NAME == 'icnet' and \ + len(self.MODEL.MULTI_LOSS_WEIGHT) != 3: + self.MODEL.MULTI_LOSS_WEIGHT = [1.0, 0.4, 0.16] + + def update_from_list(self, config_list): + if len(config_list) % 2 != 0: + raise ValueError( + "Command line options config format error! Please check it: {}". + format(config_list)) + for key, value in zip(config_list[0::2], config_list[1::2]): + try: + self.__setattr__(key, value, create_if_not_exist=False) + except KeyError: + raise KeyError('Non-existent config key: {}'.format(key)) + + def update_from_file(self, config_file): + with codecs.open(config_file, 'r', 'utf-8') as file: + dic = yaml.load(file, Loader=yaml.FullLoader) + self.update_from_segconfig(dic) + + def set_immutable(self, immutable): + self.immutable = immutable + for value in self.values(): + if isinstance(value, SegConfig): + value.set_immutable(immutable) + + def is_immutable(self): + return self.immutable diff --git a/PaddleCV/Research/SemSegPaddle/src/utils/config.py b/PaddleCV/Research/SemSegPaddle/src/utils/config.py new file mode 100644 index 0000000000000000000000000000000000000000..9bec393bc63f9e69e4c4546233075915077835e4 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/utils/config.py @@ -0,0 +1,192 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function +from __future__ import unicode_literals +from .collect import SegConfig +import numpy as np + +cfg = SegConfig() + +########################## 基本配置 ########################################### +# 均值,图像预处理减去的均值 +#cfg.MEAN = [0.5, 0.5, 0.5] +cfg.MEAN = [0.485, 0.456, 0.406] +# 标准差,图像预处理除以标准差· +cfg.STD = [0.229, 0.224, 0.225] +# 批处理大小 +cfg.TRAIN_BATCH_SIZE_PER_GPU = 2 +cfg.TRAIN_BATCH_SIZE= 8 +cfg.EVAL_BATCH_SIZE= 8 +# 多进程训练总进程数 +cfg.NUM_TRAINERS = 1 +# 多进程训练进程ID +cfg.TRAINER_ID = 0 +########################## 数据载入配置 ####################################### +# 数据载入时的并发数, 建议值8 +cfg.DATALOADER.NUM_WORKERS = 8 +# 数据载入时缓存队列大小, 建议值256 +cfg.DATALOADER.BUF_SIZE = 256 + +########################## 数据集配置 ######################################### +cfg.DATASET.DATASET_NAME = 'cityscapes' +# 数据主目录目录 +cfg.DATASET.DATA_DIR = './data_local/cityscapes/' +# 训练集列表 +cfg.DATASET.TRAIN_FILE_LIST = './data_local/cityscapes/train.list' +# 训练集数量 +cfg.DATASET.TRAIN_TOTAL_IMAGES = 5 +# 验证集列表 +cfg.DATASET.VAL_FILE_LIST = './data_local/cityscapes/val.list' +# 验证数据数量 +cfg.DATASET.VAL_TOTAL_IMAGES = 50 +# 测试数据列表 +cfg.DATASET.TEST_FILE_LIST = './data_local/cityscapes/test.list' +# 测试数据数量 +cfg.DATASET.TEST_TOTAL_IMAGES = 1525 +# Tensorboard 可视化的数据集 +cfg.DATASET.VIS_FILE_LIST = None +# 类别数(需包括背景类) +cfg.DATASET.NUM_CLASSES = 19 +# 输入图像类型, 支持三通道'rgb',四通道'rgba',单通道灰度图'gray' +cfg.DATASET.IMAGE_TYPE = 'rgb' +# 输入图片的通道数 +cfg.DATASET.DATA_DIM = 3 +# 数据列表分割符, 默认为空格 +cfg.DATASET.SEPARATOR = '\t' +# 忽略的像素标签值, 默认为255,一般无需改动 +cfg.DATASET.IGNORE_INDEX = 255 +# 数据增强是图像的padding值 +cfg.DATASET.PADDING_VALUE = [127.5, 127.5, 127.5] + +########################### 数据增强配置 ###################################### +cfg.DATAAUG.EXTRA = True +cfg.DATAAUG.BASE_SIZE = 1024 +cfg.DATAAUG.CROP_SIZE = 769 +cfg.DATAAUG.RAND_SCALE_MIN = 0.75 +cfg.DATAAUG.RAND_SCALE_MAX = 2.0 + + +########################### 训练配置 ########################################## +# 模型保存路径 +cfg.TRAIN.MODEL_SAVE_DIR = '' +# 预训练模型路径 +cfg.TRAIN.PRETRAINED_MODEL_DIR = '' +# 是否resume,继续训练 +cfg.TRAIN.RESUME_MODEL_DIR = '' +# 是否使用多卡间同步BatchNorm均值和方差 +cfg.TRAIN.SYNC_BATCH_NORM = True +# 模型参数保存的epoch间隔数,可用来继续训练中断的模型 +cfg.TRAIN.SNAPSHOT_EPOCH = 10 + +########################### 模型优化相关配置 ################################## +# 初始学习率 +cfg.SOLVER.LR = 0.001 +# 学习率下降方法, 支持poly piecewise cosine 三种 +cfg.SOLVER.LR_POLICY = "poly" +# 优化算法, 支持SGD和Adam两种算法 +cfg.SOLVER.OPTIMIZER = "sgd" +# 动量参数 +cfg.SOLVER.MOMENTUM = 0.9 +# 二阶矩估计的指数衰减率 +cfg.SOLVER.MOMENTUM2 = 0.999 +# 学习率Poly下降指数 +cfg.SOLVER.POWER = 0.9 +# step下降指数 +cfg.SOLVER.GAMMA = 0.1 +# step下降间隔 +cfg.SOLVER.DECAY_EPOCH = [10, 20] +# 学习率权重衰减,0-1 +#cfg.SOLVER.WEIGHT_DECAY = 0.0001 +cfg.SOLVER.WEIGHT_DECAY = 0.00004 +# 训练开始epoch数,默认为1 +cfg.SOLVER.BEGIN_EPOCH = 1 +# 训练epoch数,正整数 +cfg.SOLVER.NUM_EPOCHS = 30 +# loss的选择,支持softmax_loss, bce_loss, dice_loss +cfg.SOLVER.LOSS = ["softmax_loss"] +# 是否开启warmup学习策略 +cfg.SOLVER.LR_WARMUP = False +# warmup的迭代次数 +cfg.SOLVER.LR_WARMUP_STEPS = 2000 + +########################## 测试配置 ########################################### +# 测试模型路径 +cfg.TEST.TEST_MODEL = '' 
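+# Test-time base resize size, crop size, and the sliding-window inference switch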
+cfg.TEST.BASE_SIZE = 2048 +cfg.TEST.CROP_SIZE = 769 +cfg.TEST.SLIDE_WINDOW = True + +########################## 模型通用配置 ####################################### +# 模型名称, 支持pspnet, deeplabv3, glore, ginet +cfg.MODEL.MODEL_NAME = '' +# BatchNorm类型: bn、gn(group_norm) +cfg.MODEL.DEFAULT_NORM_TYPE = 'bn' +# 多路损失加权值 +cfg.MODEL.MULTI_LOSS_WEIGHT = [1.0, 0.4] +# DEFAULT_NORM_TYPE为gn时group数 +cfg.MODEL.DEFAULT_GROUP_NUMBER = 32 +# 极小值, 防止分母除0溢出,一般无需改动 +cfg.MODEL.DEFAULT_EPSILON = 1e-5 +# BatchNorm动量, 一般无需改动 +cfg.MODEL.BN_MOMENTUM = 0.99 +# 是否使用FP16训练 +cfg.MODEL.FP16 = False +# 混合精度训练需对LOSS进行scale, 默认为动态scale,静态scale可以设置为512.0 +cfg.MODEL.SCALE_LOSS = "DYNAMIC" +# backbone network, (resnet, hrnet, xception_65, mobilenetv2) +cfg.MODEL.BACKBONE= "resnet" +# backbone_layer: 101 and 50 for resnet +cfg.MODEL.BACKBONE_LAYERS=101 +# strides= input.size / feature_maps.size +cfg.MODEL.BACKBONE_OUTPUT_STRIDE=8 +cfg.MODEL.BACKBONE_MULTI_GRID = False + + + +########################## PSPNET模型配置 ###################################### +# RESNET backbone scale 设置 +cfg.MODEL.PSPNET.DEPTH_MULTIPLIER = 1 +# Aux loss +cfg.MODEL.PSPNET.AuxHead= True + + +########################## GloRe模型配置 ###################################### +# RESNET backbone scale 设置 +cfg.MODEL.GLORE.DEPTH_MULTIPLIER = 1 +# Aux loss +cfg.MODEL.GLORE.AuxHead= True + +########################## DeepLabv3模型配置 #################################### +# MobileNet v2 backbone scale 设置 +cfg.MODEL.DEEPLABv3.DEPTH_MULTIPLIER = 1.0 +# ASPP是否使用可分离卷积 +cfg.MODEL.DEEPLABv3.ASPP_WITH_SEP_CONV = True +cfg.MODEL.DEEPLABv3.AuxHead= True + + + +########################## HRNET模型配置 ###################################### +# HRNET STAGE2 设置 +cfg.MODEL.HRNET.STAGE2.NUM_MODULES = 1 +cfg.MODEL.HRNET.STAGE2.NUM_CHANNELS = [40, 80] +# HRNET STAGE3 设置 +cfg.MODEL.HRNET.STAGE3.NUM_MODULES = 4 +cfg.MODEL.HRNET.STAGE3.NUM_CHANNELS = [40, 80, 160] +# HRNET STAGE4 设置 +cfg.MODEL.HRNET.STAGE4.NUM_MODULES = 3 +cfg.MODEL.HRNET.STAGE4.NUM_CHANNELS = [40, 80, 160, 320] + + diff --git a/PaddleCV/Research/SemSegPaddle/src/utils/dist_utils.py b/PaddleCV/Research/SemSegPaddle/src/utils/dist_utils.py new file mode 100755 index 0000000000000000000000000000000000000000..64c8800fd2010d4e1e5def6cc4ea2e1ad673b4a3 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/utils/dist_utils.py @@ -0,0 +1,92 @@ +#copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
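+# Distributed-training helpers: transpile the program for NCCL2 or
+# parameter-server mode and configure BuildStrategy for multi-process
+# (one GPU per process) training using PADDLE_TRAINER_ID / PADDLE_TRAINERS_NUM.
+# Illustrative call from a training script:
+#
+#   build_strategy = fluid.BuildStrategy()
+#   prepare_for_multi_process(exe, build_strategy, train_prog)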
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import os +import paddle.fluid as fluid + + +def nccl2_prepare(args, startup_prog, main_prog): + config = fluid.DistributeTranspilerConfig() + config.mode = "nccl2" + t = fluid.DistributeTranspiler(config=config) + + envs = args.dist_env + + t.transpile( + envs["trainer_id"], + trainers=','.join(envs["trainer_endpoints"]), + current_endpoint=envs["current_endpoint"], + startup_program=startup_prog, + program=main_prog) + + +def pserver_prepare(args, train_prog, startup_prog): + config = fluid.DistributeTranspilerConfig() + config.slice_var_up = args.split_var + t = fluid.DistributeTranspiler(config=config) + envs = args.dist_env + training_role = envs["training_role"] + + t.transpile( + envs["trainer_id"], + program=train_prog, + pservers=envs["pserver_endpoints"], + trainers=envs["num_trainers"], + sync_mode=not args.async_mode, + startup_program=startup_prog) + if training_role == "PSERVER": + pserver_program = t.get_pserver_program(envs["current_endpoint"]) + pserver_startup_program = t.get_startup_program( + envs["current_endpoint"], + pserver_program, + startup_program=startup_prog) + return pserver_program, pserver_startup_program + elif training_role == "TRAINER": + train_program = t.get_trainer_program() + return train_program, startup_prog + else: + raise ValueError( + 'PADDLE_TRAINING_ROLE environment variable must be either TRAINER or PSERVER' + ) + + +def nccl2_prepare_paddle(trainer_id, startup_prog, main_prog): + config = fluid.DistributeTranspilerConfig() + config.mode = "nccl2" + t = fluid.DistributeTranspiler(config=config) + t.transpile( + trainer_id, + trainers=os.environ.get('PADDLE_TRAINER_ENDPOINTS'), + current_endpoint=os.environ.get('PADDLE_CURRENT_ENDPOINT'), + startup_program=startup_prog, + program=main_prog) + + +def prepare_for_multi_process(exe, build_strategy, train_prog): + # prepare for multi-process + trainer_id = int(os.environ.get('PADDLE_TRAINER_ID', 0)) + num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) + if num_trainers < 2: return + + build_strategy.num_trainers = num_trainers + build_strategy.trainer_id = trainer_id + # NOTE(zcd): use multi processes to train the model, + # and each process use one GPU card. + startup_prog = fluid.Program() + nccl2_prepare_paddle(trainer_id, startup_prog, train_prog) + # the startup_prog are run two times, but it doesn't matter. 
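+    # Running this freshly created startup program executes the initialization
+    # ops added by nccl2_prepare_paddle, so each worker process sets up its
+    # NCCL2 communication before training starts.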
+ exe.run(startup_prog) diff --git a/PaddleCV/Research/SemSegPaddle/src/utils/fp16_utils.py b/PaddleCV/Research/SemSegPaddle/src/utils/fp16_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..38edda500c17aefba4f8c9c59284a40c03c99843 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/utils/fp16_utils.py @@ -0,0 +1,31 @@ +import os +from paddle import fluid + +def load_fp16_vars(executor, dirname, program): + load_dirname = os.path.normpath(dirname) + + def _if_exist(var): + name = var.name[:-7] if var.name.endswith('.master') else var.name + b = os.path.exists(os.path.join(load_dirname, name)) + if not b and isinstance(var, fluid.framework.Parameter): + print("===== {} not found ====".format(var.name)) + return b + + load_prog = fluid.Program() + load_block = load_prog.global_block() + vars = list(filter(_if_exist, program.list_vars())) + + for var in vars: + new_var = fluid.io._clone_var_in_block_(load_block, var) + name = var.name[:-7] if var.name.endswith('.master') else var.name + file_path = os.path.join(load_dirname, name) + load_block.append_op( + type='load', + inputs={}, + outputs={'Out': [new_var]}, + attrs={ + 'file_path': file_path, + 'load_as_fp16': var.dtype == fluid.core.VarDesc.VarType.FP16 + }) + + executor.run(load_prog) \ No newline at end of file diff --git a/PaddleCV/Research/SemSegPaddle/src/utils/loss.py b/PaddleCV/Research/SemSegPaddle/src/utils/loss.py new file mode 100644 index 0000000000000000000000000000000000000000..6bb6e98332912770b794ee6a84d849fef81773d6 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/utils/loss.py @@ -0,0 +1,121 @@ +# coding: utf8 +# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
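+# Segmentation losses. `softmax_with_loss` masks out pixels labelled
+# cfg.DATASET.IGNORE_INDEX; `dice_loss` / `bce_loss` only accept single-channel
+# (binary) logits. The `multi_*` wrappers take either a single logit tensor or
+# a (main, aux) tuple and weight each term with cfg.MODEL.MULTI_LOSS_WEIGHT.
+# Illustrative call for a model with an auxiliary head:
+#
+#   avg_loss = multi_softmax_with_loss((logit, aux_logit), label, mask, num_classes)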
+ +import sys +import paddle.fluid as fluid +import numpy as np +import importlib +from src.utils.config import cfg + + +def softmax_with_loss(logit, label, ignore_mask=None, num_classes=2): + ignore_mask = fluid.layers.cast(ignore_mask, 'float32') + label = fluid.layers.elementwise_min( label, fluid.layers.assign(np.array([num_classes - 1], dtype=np.int32))) + logit = fluid.layers.transpose(logit, [0, 2, 3, 1]) + logit = fluid.layers.reshape(logit, [-1, num_classes]) + label = fluid.layers.reshape(label, [-1, 1]) + label = fluid.layers.cast(label, 'int64') + ignore_mask = fluid.layers.reshape(ignore_mask, [-1, 1]) + + loss, probs = fluid.layers.softmax_with_cross_entropy( + logit, + label, + ignore_index=cfg.DATASET.IGNORE_INDEX, + return_softmax=True) + + loss = loss * ignore_mask + avg_loss = fluid.layers.mean(loss) / fluid.layers.mean(ignore_mask) + + label.stop_gradient = True + ignore_mask.stop_gradient = True + return avg_loss + +# to change, how to appicate ignore index and ignore mask +def dice_loss(logit, label, ignore_mask=None, epsilon=0.00001): + if logit.shape[1] != 1 or label.shape[1] != 1 or ignore_mask.shape[1] != 1: + raise Exception("dice loss is only applicable to one channel classfication") + ignore_mask = fluid.layers.cast(ignore_mask, 'float32') + logit = fluid.layers.transpose(logit, [0, 2, 3, 1]) + label = fluid.layers.transpose(label, [0, 2, 3, 1]) + label = fluid.layers.cast(label, 'int64') + ignore_mask = fluid.layers.transpose(ignore_mask, [0, 2, 3, 1]) + logit = fluid.layers.sigmoid(logit) + logit = logit * ignore_mask + label = label * ignore_mask + reduce_dim = list(range(1, len(logit.shape))) + inse = fluid.layers.reduce_sum(logit * label, dim=reduce_dim) + dice_denominator = fluid.layers.reduce_sum( + logit, dim=reduce_dim) + fluid.layers.reduce_sum( + label, dim=reduce_dim) + dice_score = 1 - inse * 2 / (dice_denominator + epsilon) + label.stop_gradient = True + ignore_mask.stop_gradient = True + return fluid.layers.reduce_mean(dice_score) + +def bce_loss(logit, label, ignore_mask=None): + if logit.shape[1] != 1 or label.shape[1] != 1 or ignore_mask.shape[1] != 1: + raise Exception("bce loss is only applicable to binary classfication") + label = fluid.layers.cast(label, 'float32') + loss = fluid.layers.sigmoid_cross_entropy_with_logits( + x=logit, + label=label, + ignore_index=cfg.DATASET.IGNORE_INDEX, + normalize=True) # or False + loss = fluid.layers.reduce_sum(loss) + label.stop_gradient = True + ignore_mask.stop_gradient = True + return loss + + +def multi_softmax_with_loss(logits, label, ignore_mask=None, num_classes=2): + if isinstance(logits, tuple): + print("logits.type: ",type(logits)) + avg_loss = 0 + for i, logit in enumerate(logits): + logit_label = fluid.layers.resize_nearest(label, logit.shape[2:]) + logit_mask = (logit_label.astype('int32') != + cfg.DATASET.IGNORE_INDEX).astype('int32') + loss = softmax_with_loss(logit, logit_label, logit_mask, + num_classes) + avg_loss += cfg.MODEL.MULTI_LOSS_WEIGHT[i] * loss + else: + avg_loss = softmax_with_loss(logits, label, ignore_mask, num_classes) + return avg_loss + +def multi_dice_loss(logits, label, ignore_mask=None): + if isinstance(logits, tuple): + avg_loss = 0 + for i, logit in enumerate(logits): + logit_label = fluid.layers.resize_nearest(label, logit.shape[2:]) + logit_mask = (logit_label.astype('int32') != + cfg.DATASET.IGNORE_INDEX).astype('int32') + loss = dice_loss(logit, logit_label, logit_mask) + avg_loss += cfg.MODEL.MULTI_LOSS_WEIGHT[i] * loss + else: + avg_loss = dice_loss(logits, 
label, ignore_mask) + return avg_loss + +def multi_bce_loss(logits, label, ignore_mask=None): + if isinstance(logits, tuple): + avg_loss = 0 + for i, logit in enumerate(logits): + logit_label = fluid.layers.resize_nearest(label, logit.shape[2:]) + logit_mask = (logit_label.astype('int32') != + cfg.DATASET.IGNORE_INDEX).astype('int32') + loss = bce_loss(logit, logit_label, logit_mask) + avg_loss += cfg.MODEL.MULTI_LOSS_WEIGHT[i] * loss + else: + avg_loss = bce_loss(logits, label, ignore_mask) + return avg_loss diff --git a/PaddleCV/Research/SemSegPaddle/src/utils/metrics.py b/PaddleCV/Research/SemSegPaddle/src/utils/metrics.py new file mode 100644 index 0000000000000000000000000000000000000000..2898be028f3dfa03ad9892310da89f7695829542 --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/utils/metrics.py @@ -0,0 +1,145 @@ +# coding: utf8 +# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import sys +import numpy as np +from scipy.sparse import csr_matrix + + +class ConfusionMatrix(object): + """ + Confusion Matrix for segmentation evaluation + """ + + def __init__(self, num_classes=2, streaming=False): + self.confusion_matrix = np.zeros([num_classes, num_classes], + dtype='int64') + self.num_classes = num_classes + self.streaming = streaming + + def calculate(self, pred, label, ignore=None): + # If not in streaming mode, clear matrix everytime when call `calculate` + if not self.streaming: + self.zero_matrix() + + label = np.transpose(label, (0, 2, 3, 1)) + ignore = np.transpose(ignore, (0, 2, 3, 1)) + mask = np.array(ignore) == 1 + + label = np.asarray(label)[mask] + pred = np.asarray(pred)[mask] + one = np.ones_like(pred) + # Accumuate ([row=label, col=pred], 1) into sparse matrix + spm = csr_matrix((one, (label, pred)), + shape=(self.num_classes, self.num_classes)) + spm = spm.todense() + self.confusion_matrix += spm + + def zero_matrix(self): + """ Clear confusion matrix """ + self.confusion_matrix = np.zeros([self.num_classes, self.num_classes], + dtype='int64') + + def mean_iou(self): + iou_list = [] + avg_iou = 0 + # TODO: use numpy sum axis api to simpliy + vji = np.zeros(self.num_classes, dtype=int) + vij = np.zeros(self.num_classes, dtype=int) + for j in range(self.num_classes): + v_j = 0 + for i in range(self.num_classes): + v_j += self.confusion_matrix[j][i] + vji[j] = v_j + + for i in range(self.num_classes): + v_i = 0 + for j in range(self.num_classes): + v_i += self.confusion_matrix[j][i] + vij[i] = v_i + + for c in range(self.num_classes): + total = vji[c] + vij[c] - self.confusion_matrix[c][c] + if total == 0: + iou = 0 + else: + iou = float(self.confusion_matrix[c][c]) / total + avg_iou += iou + iou_list.append(iou) + avg_iou = float(avg_iou) / float(self.num_classes) + return np.array(iou_list), avg_iou + + def accuracy(self): + total = self.confusion_matrix.sum() + total_right = 0 + for c in range(self.num_classes): + total_right += self.confusion_matrix[c][c] + if total == 0: + avg_acc = 0 + else: + 
avg_acc = float(total_right) / total + + vij = np.zeros(self.num_classes, dtype=int) + for i in range(self.num_classes): + v_i = 0 + for j in range(self.num_classes): + v_i += self.confusion_matrix[j][i] + vij[i] = v_i + + acc_list = [] + for c in range(self.num_classes): + if vij[c] == 0: + acc = 0 + else: + acc = self.confusion_matrix[c][c] / float(vij[c]) + acc_list.append(acc) + return np.array(acc_list), avg_acc + + def kappa(self): + vji = np.zeros(self.num_classes) + vij = np.zeros(self.num_classes) + for j in range(self.num_classes): + v_j = 0 + for i in range(self.num_classes): + v_j += self.confusion_matrix[j][i] + vji[j] = v_j + + for i in range(self.num_classes): + v_i = 0 + for j in range(self.num_classes): + v_i += self.confusion_matrix[j][i] + vij[i] = v_i + + total = self.confusion_matrix.sum() + + # avoid spillovers + # TODO: is it reasonable to hard code 10000.0? + total = float(total) / 10000.0 + vji = vji / 10000.0 + vij = vij / 10000.0 + + tp = 0 + tc = 0 + for c in range(self.num_classes): + tp += vji[c] * vij[c] + tc += self.confusion_matrix[c][c] + + tc = tc / 10000.0 + pe = tp / (total * total) + po = tc / total + + kappa = (po - pe) / (1 - pe) + return kappa diff --git a/PaddleCV/Research/SemSegPaddle/src/utils/palette.py b/PaddleCV/Research/SemSegPaddle/src/utils/palette.py new file mode 100644 index 0000000000000000000000000000000000000000..16f59602e1cc0b37d5c770df33c820934553c2ff --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/utils/palette.py @@ -0,0 +1,66 @@ +def get_cityscapes_palette(num_cls=19): + """ Returns the color map for visualizing the segmentation mask. + Args: + num_cls: Number of classes + Returns: + The color map + """ + + palette = [0] * (num_cls * 3) + palette[0:3] = (128, 64, 128) # 0: 'road' + palette[3:6] = (244, 35,232) # 1 'sidewalk' + palette[6:9] = (70, 70, 70) # 2''building' + palette[9:12] = (102,102,156) # 3 wall + palette[12:15] = (190,153,153) # 4 fence + palette[15:18] = (153,153,153) # 5 pole + palette[18:21] = (250,170, 30) # 6 'traffic light' + palette[21:24] = (220,220, 0) # 7 'traffic sign' + palette[24:27] = (107,142, 35) # 8 'vegetation' + palette[27:30] = (152,251,152) # 9 'terrain' + palette[30:33] = ( 70,130,180) # 10 sky + palette[33:36] = (220, 20, 60) # 11 person + palette[36:39] = (255, 0, 0) # 12 rider + palette[39:42] = (0, 0, 142) # 13 car + palette[42:45] = (0, 0, 70) # 14 truck + palette[45:48] = (0, 60,100) # 15 bus + palette[48:51] = (0, 80,100) # 16 train + palette[51:54] = (0, 0,230) # 17 'motorcycle' + palette[54:57] = (119, 11, 32) # 18 'bicycle' + palette[57:60] = (105, 105, 105) + + return palette + + +def get_gene_palette(num_cls=182): #Ref: CCNet + """ Returns the color map for visualizing the segmentation mask. 
+ Args: + num_cls: Number of classes + Returns: + The color map + """ + + n = num_cls + palette = [0] * (n * 3) + for j in range(0, n): + lab = j + palette[j * 3 + 0] = 0 + palette[j * 3 + 1] = 0 + palette[j * 3 + 2] = 0 + i = 0 + while lab: + palette[j * 3 + 0] |= (((lab >> 0) & 1) << (7 - i)) + palette[j * 3 + 1] |= (((lab >> 1) & 1) << (7 - i)) + palette[j * 3 + 2] |= (((lab >> 2) & 1) << (7 - i)) + i += 1 + lab >>= 3 + return palette + +def get_palette(dataset): + if dataset == 'cityscapes': + palette = get_cityscapes_palette(19) + elif dataset == 'pascalContext': + palette = get_gene_palette(num_cls=59) + else: + raise RuntimeError("unkonw dataset :{}".format(dataset)) + return palette + diff --git a/PaddleCV/Research/SemSegPaddle/src/utils/solver.py b/PaddleCV/Research/SemSegPaddle/src/utils/solver.py new file mode 100644 index 0000000000000000000000000000000000000000..62baf9a610244b3a20bf976cec52727ec684ab8b --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/utils/solver.py @@ -0,0 +1,159 @@ +# coding: utf8 +# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys +import paddle.fluid as fluid +import numpy as np +import importlib +from src.utils.config import cfg +from paddle.fluid.contrib.mixed_precision.decorator import OptimizerWithMixedPrecison, decorate, AutoMixedPrecisionLists + + +class Solver(object): + def __init__(self, main_prog, start_prog): + total_images = cfg.DATASET.TRAIN_TOTAL_IMAGES + self.weight_decay = cfg.SOLVER.WEIGHT_DECAY + self.momentum = cfg.SOLVER.MOMENTUM + self.momentum2 = cfg.SOLVER.MOMENTUM2 + self.step_per_epoch = total_images // cfg.TRAIN_BATCH_SIZE + if total_images % cfg.TRAIN_BATCH_SIZE != 0: + self.step_per_epoch += 1 + self.total_step = cfg.SOLVER.NUM_EPOCHS * self.step_per_epoch + self.main_prog = main_prog + self.start_prog = start_prog + self.warmup_step = cfg.SOLVER.LR_WARMUP_STEPS if cfg.SOLVER.LR_WARMUP else -1 + self.decay_step = self.total_step - self.warmup_step + self.decay_epochs = cfg.SOLVER.NUM_EPOCHS - self.warmup_step / self.step_per_epoch + + def lr_warmup(self, learning_rate, start_lr, end_lr): + linear_step = end_lr - start_lr + lr = fluid.layers.tensor.create_global_var( + shape=[1], + value=0.0, + dtype='float32', + persistable=True, + name="learning_rate_warmup") + + global_step = fluid.layers.learning_rate_scheduler._decay_step_counter() + warmup_counter = fluid.layers.autoincreased_step_counter( + counter_name='@LR_DECAY_COUNTER_WARMUP_IN_SEG@', begin=1, step=1) + global_counter = fluid.default_main_program().global_block( + ).vars['@LR_DECAY_COUNTER@'] + warmup_counter = fluid.layers.cast(warmup_counter, 'float32') + + with fluid.layers.control_flow.Switch() as switch: + with switch.case(warmup_counter <= self.warmup_step): + decayed_lr = start_lr + linear_step * ( + warmup_counter / self.warmup_step) + fluid.layers.tensor.assign(decayed_lr, lr) + # hold the global_step to 0 during the warm-up phase + fluid.layers.increment(global_counter, value=-1) + 
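# after the warm-up steps finish, fall back to the decayed learning rate schedule passed in from get_lr()
+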
with switch.default(): + fluid.layers.tensor.assign(learning_rate, lr) + return lr + + def piecewise_decay(self): + gamma = cfg.SOLVER.GAMMA + bd = [self.step_per_epoch * e for e in cfg.SOLVER.DECAY_EPOCH] + lr = [cfg.SOLVER.LR * (gamma**i) for i in range(len(bd) + 1)] + decayed_lr = fluid.layers.piecewise_decay(boundaries=bd, values=lr) + return decayed_lr + + def poly_decay(self): + power = cfg.SOLVER.POWER + decayed_lr = fluid.layers.polynomial_decay( + cfg.SOLVER.LR, self.decay_step, end_learning_rate=0, power=power) + return decayed_lr + + def cosine_decay(self): + decayed_lr = fluid.layers.cosine_decay( + cfg.SOLVER.LR, self.step_per_epoch, self.decay_epochs) + return decayed_lr + + def get_lr(self, lr_policy): + if lr_policy.lower() == 'poly': + decayed_lr = self.poly_decay() + elif lr_policy.lower() == 'piecewise': + decayed_lr = self.piecewise_decay() + elif lr_policy.lower() == 'cosine': + decayed_lr = self.cosine_decay() + else: + raise Exception( + "unsupport learning decay policy! only support poly,piecewise,cosine" + ) + + decayed_lr = self.lr_warmup(decayed_lr, 0, cfg.SOLVER.LR) + return decayed_lr + + def sgd_optimizer(self, lr_policy, loss): + decayed_lr = self.get_lr(lr_policy) + optimizer = fluid.optimizer.Momentum( + learning_rate=decayed_lr, + momentum=self.momentum, + regularization=fluid.regularizer.L2Decay( + regularization_coeff=self.weight_decay), + ) + if cfg.MODEL.FP16: + if cfg.MODEL.MODEL_NAME in ["pspnet"]: + custom_black_list = {"pool2d"} + else: + custom_black_list = {} + amp_lists = AutoMixedPrecisionLists( + custom_black_list=custom_black_list) + assert isinstance(cfg.MODEL.SCALE_LOSS, float) or isinstance(cfg.MODEL.SCALE_LOSS, str), \ + "data type of MODEL.SCALE_LOSS must be float or str" + if isinstance(cfg.MODEL.SCALE_LOSS, float): + optimizer = decorate( + optimizer, + amp_lists=amp_lists, + init_loss_scaling=cfg.MODEL.SCALE_LOSS, + use_dynamic_loss_scaling=False) + else: + assert cfg.MODEL.SCALE_LOSS.lower() in [ + 'dynamic' + ], "if MODEL.SCALE_LOSS is a string,\ + must be set as 'DYNAMIC'!" + + optimizer = decorate( + optimizer, + amp_lists=amp_lists, + use_dynamic_loss_scaling=True) + + optimizer.minimize(loss) + return decayed_lr + + def adam_optimizer(self, lr_policy, loss): + decayed_lr = self.get_lr(lr_policy) + optimizer = fluid.optimizer.Adam( + learning_rate=decayed_lr, + beta1=self.momentum, + beta2=self.momentum2, + regularization=fluid.regularizer.L2Decay( + regularization_coeff=self.weight_decay), + ) + optimizer.minimize(loss) + return decayed_lr + + def optimise(self, loss): + lr_policy = cfg.SOLVER.LR_POLICY + opt = cfg.SOLVER.OPTIMIZER + + if opt.lower() == 'adam': + return self.adam_optimizer(lr_policy, loss) + elif opt.lower() == 'sgd': + return self.sgd_optimizer(lr_policy, loss) + else: + raise Exception( + "unsupport optimizer solver, only support adam and sgd") diff --git a/PaddleCV/Research/SemSegPaddle/src/utils/timer.py b/PaddleCV/Research/SemSegPaddle/src/utils/timer.py new file mode 100644 index 0000000000000000000000000000000000000000..8e32c343def6e7cab81c6447a090b796d3ce00eb --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/src/utils/timer.py @@ -0,0 +1,60 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import time + + +def calculate_eta(remaining_step, speed): + if remaining_step < 0: + remaining_step = 0 + remaining_time = int(remaining_step / speed) + result = "{:0>2}:{:0>2}:{:0>2}" + arr = [] + for i in range(2, -1, -1): + arr.append(int(remaining_time / 60**i)) + remaining_time %= 60**i + return result.format(*arr) + + +class Timer(object): + """ Simple timer class for measuring time consuming """ + + def __init__(self): + self._start_time = 0.0 + self._end_time = 0.0 + self._elapsed_time = 0.0 + self._is_running = False + + def start(self): + self._is_running = True + self._start_time = time.time() + + def restart(self): + self.start() + + def stop(self): + self._is_running = False + self._end_time = time.time() + + def elapsed_time(self): + self._end_time = time.time() + self._elapsed_time = self._end_time - self._start_time + if not self.is_running: + return 0.0 + + return self._elapsed_time + + @property + def is_running(self): + return self._is_running diff --git a/PaddleCV/Research/SemSegPaddle/train.py b/PaddleCV/Research/SemSegPaddle/train.py new file mode 100644 index 0000000000000000000000000000000000000000..e91113e5de996a6f19988fa44fa0c8d32d37620d --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/train.py @@ -0,0 +1,429 @@ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os +# GPU memory garbage collection optimization flags +os.environ['FLAGS_eager_delete_tensor_gb'] = "0.0" + +import sys +import timeit +import argparse +import pprint +import shutil +import functools +import paddle +import numpy as np +import paddle.fluid as fluid + +from src.utils.metrics import ConfusionMatrix +from src.utils.config import cfg +from src.utils.timer import Timer, calculate_eta +from src.utils import dist_utils +from src.datasets import build_dataset +from src.models.model_builder import build_model +from src.models.model_builder import ModelPhase +from src.models.model_builder import parse_shape_from_file +from eval import evaluate +from vis import visualize + + +def parse_args(): + parser = argparse.ArgumentParser(description='semseg-paddle') + parser.add_argument( + '--cfg', + dest='cfg_file', + help='Config file for training (and optionally testing)', + default=None, + type=str) + parser.add_argument( + '--use_gpu', + dest='use_gpu', + help='Use gpu or cpu', + action='store_true', + default=False) + parser.add_argument( + '--use_mpio', + dest='use_mpio', + help='Use multiprocess I/O or not', + action='store_true', + default=False) + parser.add_argument( + '--log_steps', + dest='log_steps', + help='Display logging information at every log_steps', + default=10, + type=int) + parser.add_argument( + '--debug', + dest='debug', + help='debug mode, display detail information of training', + action='store_true') + parser.add_argument( + '--use_tb', + dest='use_tb', + help='whether to record the data during training to Tensorboard', + action='store_true') + parser.add_argument( + '--tb_log_dir', + dest='tb_log_dir', + help='Tensorboard logging directory', + default=None, + type=str) + parser.add_argument( + 
'--do_eval', + dest='do_eval', + help='Evaluation models result on every new checkpoint', + action='store_true') + parser.add_argument( + 'opts', + help='See utils/config.py for all options', + default=None, + nargs=argparse.REMAINDER) + return parser.parse_args() + + + + +def save_checkpoint(exe, program, ckpt_name): + """ + Save checkpoint for evaluation or resume training + """ + filename= '{}_{}_{}_epoch_{}.pdparams'.format(str(cfg.MODEL.MODEL_NAME), + str(cfg.MODEL.BACKBONE), str(cfg.DATASET.DATASET_NAME), ckpt_name) + ckpt_dir = cfg.TRAIN.MODEL_SAVE_DIR + + print("Save model checkpoint to {}".format(ckpt_dir)) + if not os.path.isdir(ckpt_dir): + os.makedirs(ckpt_dir) + + fluid.io.save_params(exe, ckpt_dir, program, filename) + return ckpt_dir + + +def load_checkpoint(exe, program): + """ + Load checkpoiont from pretrained model directory for resume training + """ + + print('Resume model training from:', cfg.TRAIN.RESUME_MODEL_DIR) + if not os.path.exists(cfg.TRAIN.RESUME_MODEL_DIR): + raise ValueError("TRAIN.PRETRAIN_MODEL {} not exist!".format( + cfg.TRAIN.RESUME_MODEL_DIR)) + + fluid.io.load_persistables( + exe, cfg.TRAIN.RESUME_MODEL_DIR, main_program=program) + + model_path = cfg.TRAIN.RESUME_MODEL_DIR + # Check is path ended by path spearator + if model_path[-1] == os.sep: + model_path = model_path[0:-1] + epoch_name = os.path.basename(model_path) + # If resume model is final model + if epoch_name == 'final': + begin_epoch = cfg.SOLVER.NUM_EPOCHS + # If resume model path is end of digit, restore epoch status + elif epoch_name.isdigit(): + epoch = int(epoch_name) + begin_epoch = epoch + 1 + else: + raise ValueError("Resume model path is not valid!") + print("Model checkpoint loaded successfully!") + + return begin_epoch + + +def print_info(*msg): + if cfg.TRAINER_ID == 0: + print(*msg) + + +def train(cfg): + startup_prog = fluid.Program() + train_prog = fluid.Program() + drop_last = True + dataset = build_dataset(cfg.DATASET.DATASET_NAME, + file_list=cfg.DATASET.TRAIN_FILE_LIST, + mode=ModelPhase.TRAIN, + shuffle=True, + data_dir=cfg.DATASET.DATA_DIR, + base_size= cfg.DATAAUG.BASE_SIZE, crop_size= cfg.DATAAUG.CROP_SIZE, rand_scale=True) + + def data_generator(): + if args.use_mpio: + data_gen = dataset.multiprocess_generator( + num_processes=cfg.DATALOADER.NUM_WORKERS, + max_queue_size=cfg.DATALOADER.BUF_SIZE) + else: + data_gen = dataset.generator() + + batch_data = [] + for b in data_gen: + batch_data.append(b) + if len(batch_data) == (cfg.TRAIN_BATCH_SIZE // cfg.NUM_TRAINERS): + for item in batch_data: + yield item[0], item[1], item[2] + batch_data = [] + # If use sync batch norm strategy, drop last batch if number of samples + # in batch_data is less then cfg.BATCH_SIZE to avoid NCCL hang issues + if not cfg.TRAIN.SYNC_BATCH_NORM: + for item in batch_data: + yield item[0], item[1], item[2] + + # Get device environment + gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0)) + place = fluid.CUDAPlace(gpu_id) if args.use_gpu else fluid.CPUPlace() + places = fluid.cuda_places() if args.use_gpu else fluid.cpu_places() + + # Get number of GPU + dev_count = cfg.NUM_TRAINERS if cfg.NUM_TRAINERS > 1 else len(places) + print_info("#device count: {}".format(dev_count)) + cfg.TRAIN_BATCH_SIZE = dev_count * int(cfg.TRAIN_BATCH_SIZE_PER_GPU) + print_info("#train_batch_size: {}".format(cfg.TRAIN_BATCH_SIZE)) + print_info("#batch_size_per_dev: {}".format(cfg.TRAIN_BATCH_SIZE_PER_GPU)) + + py_reader, avg_loss, lr, pred, grts, masks = build_model( + train_prog, startup_prog, 
phase=ModelPhase.TRAIN) + py_reader.decorate_sample_generator( + data_generator, batch_size=cfg.TRAIN_BATCH_SIZE_PER_GPU, drop_last=drop_last) + + exe = fluid.Executor(place) + exe.run(startup_prog) + + exec_strategy = fluid.ExecutionStrategy() + # Clear temporary variables every 100 iteration + if args.use_gpu: + exec_strategy.num_threads = fluid.core.get_cuda_device_count() + exec_strategy.num_iteration_per_drop_scope = 100 + build_strategy = fluid.BuildStrategy() + + if cfg.NUM_TRAINERS > 1 and args.use_gpu: + dist_utils.prepare_for_multi_process(exe, build_strategy, train_prog) + exec_strategy.num_threads = 1 + + if cfg.TRAIN.SYNC_BATCH_NORM and args.use_gpu: + if dev_count > 1: + # Apply sync batch norm strategy + print_info("Sync BatchNorm strategy is effective.") + build_strategy.sync_batch_norm = True + else: + print_info( + "Sync BatchNorm strategy will not be effective if GPU device" + " count <= 1") + compiled_train_prog = fluid.CompiledProgram(train_prog).with_data_parallel( + loss_name=avg_loss.name, + exec_strategy=exec_strategy, + build_strategy=build_strategy) + + # Resume training + begin_epoch = cfg.SOLVER.BEGIN_EPOCH + if cfg.TRAIN.RESUME_MODEL_DIR: + begin_epoch = load_checkpoint(exe, train_prog) + # Load pretrained model + elif os.path.exists(cfg.TRAIN.PRETRAINED_MODEL_DIR): + print_info('Pretrained model dir: ', cfg.TRAIN.PRETRAINED_MODEL_DIR) + load_vars = [] + load_fail_vars = [] + + def var_shape_matched(var, shape): + """ + Check whehter persitable variable shape is match with current network + """ + var_exist = os.path.exists( + os.path.join(cfg.TRAIN.PRETRAINED_MODEL_DIR, var.name)) + if var_exist: + var_shape = parse_shape_from_file( + os.path.join(cfg.TRAIN.PRETRAINED_MODEL_DIR, var.name)) + return var_shape == shape + return False + + for x in train_prog.list_vars(): + if isinstance(x, fluid.framework.Parameter): + shape = tuple(fluid.global_scope().find_var( + x.name).get_tensor().shape()) + if var_shape_matched(x, shape): + load_vars.append(x) + else: + load_fail_vars.append(x) + + fluid.io.load_vars( + exe, dirname=cfg.TRAIN.PRETRAINED_MODEL_DIR, vars=load_vars) + for var in load_vars: + print_info("Parameter[{}] loaded sucessfully!".format(var.name)) + for var in load_fail_vars: + print_info( + "Parameter[{}] don't exist or shape does not match current network, skip" + " to load it.".format(var.name)) + print_info("{}/{} pretrained parameters loaded successfully!".format( + len(load_vars), + len(load_vars) + len(load_fail_vars))) + else: + print_info( + 'Pretrained model dir {} not exists, training from scratch...'. 
+ format(cfg.TRAIN.PRETRAINED_MODEL_DIR)) + + fetch_list = [avg_loss.name, lr.name] + if args.debug: + # Fetch more variable info and use streaming confusion matrix to + # calculate IoU results if in debug mode + np.set_printoptions( + precision=4, suppress=True, linewidth=160, floatmode="fixed") + fetch_list.extend([pred.name, grts.name, masks.name]) + cm = ConfusionMatrix(cfg.DATASET.NUM_CLASSES, streaming=True) + + if args.use_tb: + if not args.tb_log_dir: + print_info("Please specify the log directory by --tb_log_dir.") + exit(1) + + from tb_paddle import SummaryWriter + log_writer = SummaryWriter(args.tb_log_dir) + + # trainer_id = int(os.getenv("PADDLE_TRAINER_ID", 0)) + # num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) + global_step = 0 + all_step = cfg.DATASET.TRAIN_TOTAL_IMAGES // cfg.TRAIN_BATCH_SIZE + if cfg.DATASET.TRAIN_TOTAL_IMAGES % cfg.TRAIN_BATCH_SIZE and drop_last != True: + all_step += 1 + all_step *= (cfg.SOLVER.NUM_EPOCHS - begin_epoch + 1) + + avg_loss = 0.0 + timer = Timer() + timer.start() + if begin_epoch > cfg.SOLVER.NUM_EPOCHS: + raise ValueError( + ("begin epoch[{}] is larger than cfg.SOLVER.NUM_EPOCHS[{}]").format( + begin_epoch, cfg.SOLVER.NUM_EPOCHS)) + + if args.use_mpio: + print_info("Use multiprocess reader") + else: + print_info("Use multi-thread reader") + + for epoch in range(begin_epoch, cfg.SOLVER.NUM_EPOCHS + 1): + py_reader.start() + while True: + try: + if args.debug: + # Print category IoU and accuracy to check whether the + # traning process is corresponed to expectation + loss, lr, pred, grts, masks = exe.run( + program=compiled_train_prog, + fetch_list=fetch_list, + return_numpy=True) + cm.calculate(pred, grts, masks) + avg_loss += np.mean(np.array(loss)) + global_step += 1 + + if global_step % args.log_steps == 0: + speed = args.log_steps / timer.elapsed_time() + avg_loss /= args.log_steps + category_acc, mean_acc = cm.accuracy() + category_iou, mean_iou = cm.mean_iou() + + print_info(( + "epoch={}/{} step={}/{} lr={:.5f} loss={:.4f} acc={:.5f} mIoU={:.5f} step/sec={:.3f} | ETA {}" + ).format(epoch, cfg.SOLVER.NUM_EPOCHS, global_step, all_step, lr[0], avg_loss, mean_acc, + mean_iou, speed, + calculate_eta(all_step - global_step, speed))) + print_info("Category IoU: ", category_iou) + print_info("Category Acc: ", category_acc) + if args.use_tb: + log_writer.add_scalar('Train/mean_iou', mean_iou, + global_step) + log_writer.add_scalar('Train/mean_acc', mean_acc, + global_step) + log_writer.add_scalar('Train/loss', avg_loss, + global_step) + log_writer.add_scalar('Train/lr', lr[0], + global_step) + log_writer.add_scalar('Train/step/sec', speed, + global_step) + sys.stdout.flush() + avg_loss = 0.0 + cm.zero_matrix() + timer.restart() + else: + # If not in debug mode, avoid unnessary log and calculate + loss, lr = exe.run( + program=compiled_train_prog, + fetch_list=fetch_list, + return_numpy=True) + avg_loss += np.mean(np.array(loss)) + global_step += 1 + + if global_step % args.log_steps == 0 and cfg.TRAINER_ID == 0: + avg_loss /= args.log_steps + speed = args.log_steps / timer.elapsed_time() + print(( + "epoch={}/{} step={}/{} lr={:.5f} loss={:.4f} step/sec={:.3f} | ETA {}" + ).format(epoch, cfg.SOLVER.NUM_EPOCHS, global_step, all_step, lr[0], avg_loss, speed, + calculate_eta(all_step - global_step, speed))) + if args.use_tb: + log_writer.add_scalar('Train/loss', avg_loss, + global_step) + log_writer.add_scalar('Train/lr', lr[0], + global_step) + log_writer.add_scalar('Train/speed', speed, + global_step) + sys.stdout.flush() + 
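# reset the running loss accumulator and the timer so the next logging window starts from zero
+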
avg_loss = 0.0 + timer.restart() + + except fluid.core.EOFException: + py_reader.reset() + break + except Exception as e: + print(e) + + if epoch % cfg.TRAIN.SNAPSHOT_EPOCH == 0 and cfg.TRAINER_ID == 0: + ckpt_dir = save_checkpoint(exe, train_prog, epoch) + + if args.do_eval: + print("Evaluation start") + _, mean_iou, _, mean_acc = evaluate( + cfg=cfg, + ckpt_dir=ckpt_dir, + use_gpu=args.use_gpu, + use_mpio=args.use_mpio) + if args.use_tb: + log_writer.add_scalar('Evaluate/mean_iou', mean_iou, + global_step) + log_writer.add_scalar('Evaluate/mean_acc', mean_acc, + global_step) + + # Use Tensorboard to visualize results + if args.use_tb and cfg.DATASET.VIS_FILE_LIST is not None: + visualize( + cfg=cfg, + use_gpu=args.use_gpu, + vis_file_list=cfg.DATASET.VIS_FILE_LIST, + vis_dir="visual", + ckpt_dir=ckpt_dir, + log_writer=log_writer) + + # save final model + if cfg.TRAINER_ID == 0: + save_checkpoint(exe, train_prog, 'final') + + +def main(args): + if args.cfg_file is not None: + cfg.update_from_file(args.cfg_file) + if args.opts: + cfg.update_from_list(args.opts) + + cfg.TRAINER_ID = int(os.getenv("PADDLE_TRAINER_ID", 0)) + cfg.NUM_TRAINERS = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) + + cfg.check_and_infer() + print_info(pprint.pformat(cfg)) + train(cfg) + + +if __name__ == '__main__': + args = parse_args() + start = timeit.default_timer() + main(args) + end = timeit.default_timer() + print("training time: {} h".format(1.0*(end-start)/3600)) diff --git a/PaddleCV/Research/SemSegPaddle/vis.py b/PaddleCV/Research/SemSegPaddle/vis.py new file mode 100644 index 0000000000000000000000000000000000000000..b32998b79b544da47a93879fa3a733fa2d5b170b --- /dev/null +++ b/PaddleCV/Research/SemSegPaddle/vis.py @@ -0,0 +1,235 @@ +# coding: utf8 +# copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
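+
+# vis.py: runs inference over a visualization file list, colorizes the predicted masks
+# with a palette, saves them as PNG files, and optionally logs them to Tensorboard.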
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os + +# GPU memory garbage collection optimization flags +os.environ['FLAGS_eager_delete_tensor_gb'] = "0.0" + +import sys +import argparse +import pprint +import cv2 +import numpy as np +import paddle.fluid as fluid + +from PIL import Image as PILImage +from src.utils.config import cfg +from src.datasets.cityscapes import CityscapesSeg +from src.models.model_builder import build_model +from src.models.model_builder import ModelPhase + + +def parse_args(): + parser = argparse.ArgumentParser(description='PaddeSeg visualization tools') + parser.add_argument( + '--cfg', + dest='cfg_file', + help='Config file for training (and optionally testing)', + default=None, + type=str) + parser.add_argument( + '--use_gpu', dest='use_gpu', help='Use gpu or cpu', action='store_true') + parser.add_argument( + '--vis_dir', + dest='vis_dir', + help='visual save dir', + type=str, + default='visual') + parser.add_argument( + '--local_test', + dest='local_test', + help='if in local test mode, only visualize 5 images for testing', + action='store_true') + parser.add_argument( + 'opts', + help='See config.py for all options', + default=None, + nargs=argparse.REMAINDER) + if len(sys.argv) == 1: + parser.print_help() + sys.exit(1) + return parser.parse_args() + + +def makedirs(directory): + if not os.path.exists(directory): + os.makedirs(directory) + + +def get_color_map_list(num_classes): + """ Returns the color map for visualizing the segmentation mask, + which can support arbitrary number of classes. + Args: + num_classes: Number of classes + Returns: + The color map + """ + color_map = num_classes * [0, 0, 0] + for i in range(0, num_classes): + j = 0 + lab = i + while lab: + color_map[i * 3] |= (((lab >> 0) & 1) << (7 - j)) + color_map[i * 3 + 1] |= (((lab >> 1) & 1) << (7 - j)) + color_map[i * 3 + 2] |= (((lab >> 2) & 1) << (7 - j)) + j += 1 + lab >>= 3 + + return color_map + + +def to_png_fn(fn): + """ + Append png as filename postfix + """ + directory, filename = os.path.split(fn) + basename, ext = os.path.splitext(filename) + + return basename + ".png" + + +def visualize(cfg, + vis_file_list=None, + use_gpu=False, + vis_dir="visual_predict", + ckpt_dir=None, + log_writer=None, + local_test=False, + **kwargs): + if vis_file_list is None: + vis_file_list = cfg.DATASET.TEST_FILE_LIST + dataset = SegDataset( + file_list=vis_file_list, + mode=ModelPhase.VISUAL, + data_dir=cfg.DATASET.DATA_DIR) + + startup_prog = fluid.Program() + test_prog = fluid.Program() + pred, logit = build_model(test_prog, startup_prog, phase=ModelPhase.VISUAL) + # Clone forward graph + test_prog = test_prog.clone(for_test=True) + + # Generator full colormap for maximum 256 classes + color_map = get_color_map_list(256) + + # Get device environment + place = fluid.CUDAPlace(0) if use_gpu else fluid.CPUPlace() + exe = fluid.Executor(place) + exe.run(startup_prog) + + ckpt_dir = cfg.TEST.TEST_MODEL if not ckpt_dir else ckpt_dir + + fluid.io.load_params(exe, ckpt_dir, main_program=test_prog) + + save_dir = os.path.join('visual', vis_dir) + makedirs(save_dir) + + fetch_list = [pred.name] + test_reader = dataset.batch(dataset.generator, batch_size=1, is_test=True) + img_cnt = 0 + for imgs, grts, img_names, valid_shapes, org_shapes in test_reader: + pred_shape = (imgs.shape[2], imgs.shape[3]) + pred, = exe.run( + program=test_prog, + feed={'image': imgs}, + fetch_list=fetch_list, + return_numpy=True) + + num_imgs = 
pred.shape[0] + # TODO: use multi-thread to write images + for i in range(num_imgs): + # Add more comments + res_map = np.squeeze(pred[i, :, :, :]).astype(np.uint8) + img_name = img_names[i] + res_shape = (res_map.shape[0], res_map.shape[1]) + if res_shape[0] != pred_shape[0] or res_shape[1] != pred_shape[1]: + res_map = cv2.resize( + res_map, pred_shape, interpolation=cv2.INTER_NEAREST) + valid_shape = (valid_shapes[i, 0], valid_shapes[i, 1]) + res_map = res_map[0:valid_shape[0], 0:valid_shape[1]] + org_shape = (org_shapes[i, 0], org_shapes[i, 1]) + res_map = cv2.resize( + res_map, (org_shape[1], org_shape[0]), + interpolation=cv2.INTER_NEAREST) + + png_fn = to_png_fn(img_name) + + # colorful segment result visualization + vis_fn = os.path.join(save_dir, png_fn) + dirname = os.path.dirname(vis_fn) + makedirs(dirname) + + pred_mask = PILImage.fromarray(res_map.astype(np.uint8), mode='P') + pred_mask.putpalette(color_map) + pred_mask.save(vis_fn) + + img_cnt += 1 + print("#{} visualize image path: {}".format(img_cnt, vis_fn)) + + # Use Tensorboard to visualize image + if log_writer is not None: + # Calulate epoch from ckpt_dir folder name + epoch = int(os.path.split(ckpt_dir)[-1]) + print("Tensorboard visualization epoch", epoch) + + pred_mask_np = np.array(pred_mask.convert("RGB")) + log_writer.add_image( + "Predict/{}".format(img_name), + pred_mask_np, + epoch, + dataformats='HWC') + # Original image + # BGR->RGB + img = cv2.imread( + os.path.join(cfg.DATASET.DATA_DIR, img_name))[..., ::-1] + log_writer.add_image( + "Images/{}".format(img_name), + img, + epoch, + dataformats='HWC') + # add ground truth (label) images + grt = grts[i] + if grt is not None: + grt = grt[0:valid_shape[0], 0:valid_shape[1]] + grt_pil = PILImage.fromarray(grt.astype(np.uint8), mode='P') + grt_pil.putpalette(color_map) + grt_pil = grt_pil.resize((org_shape[1], org_shape[0])) + grt = np.array(grt_pil.convert("RGB")) + log_writer.add_image( + "Label/{}".format(img_name), + grt, + epoch, + dataformats='HWC') + + # If in local_test mode, only visualize 5 images just for testing + # procedure + if local_test and img_cnt >= 5: + break + + +if __name__ == '__main__': + args = parse_args() + if args.cfg_file is not None: + cfg.update_from_file(args.cfg_file) + if args.opts: + cfg.update_from_list(args.opts) + cfg.check_and_infer() + print(pprint.pformat(cfg)) + visualize(cfg, **args.__dict__) diff --git a/PaddleCV/Research/danet/README.md b/PaddleCV/Research/danet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..02348ae2d2c06a548ac6ad9987d398eb83dce49d --- /dev/null +++ b/PaddleCV/Research/danet/README.md @@ -0,0 +1,155 @@ +# [Dual Attention Network for Scene Segmentation (CVPR2019)](https://arxiv.org/pdf/1809.02983.pdf) + +本项目是[DANet](https://arxiv.org/pdf/1809.02983.pdf)的 PaddlePaddle(>=1.5.2) 实现, 包含模型训练,验证等内容。 + +## 模型简介 +![net](img/Network.png) +骨干网络使用ResNet,为更好地进行语义分割任务,作者对ResNet做出以下改动: + + 1、将最后两个layer的downsampling取消,使得特征图是原图的1/8,保持较高空间分辨率。 + 2、最后两个layer采用空洞卷积扩大感受野。 +然后接上两个并行的注意力模块(位置注意力和通道注意力),最终将两个模块的结果进行elementwise操作,之后再接一层卷积输出分割图。 + +### 位置注意力 + +![position](img/position.png) + +A是骨干网络ResNet输出经过一层卷积生成的特征图,维度为CHW; +A经过3个卷积操作输出维度均为CHW的B、C、D。将B、C、D都reshape到CN(N = H*W); +然后将B reshape后的结果转置与C相乘,得到N * N的矩阵, 对于矩阵的每一个点进行softmax; +然后将D与softmax后的结果相乘并reshape到CHW,再与A进行elementwise。 + +### 通道注意力 +![channel](img/channel.png) + + +A是骨干网络ResNet输出经过一层卷积生成的特征图,维度为CHW; +A经过3个reshape操作输出维度均为CN(N = H*W)的B、C、D; +然后将B转置与C相乘,得到C * C的矩阵,对于矩阵的每一个点进行softmax; +然后将D与softmax后的结果相乘并reshape到CHW,再与A进行elementwise。 
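+
+上述通道注意力的计算流程可以用下面的简化示意来理解(仅作说明,是基于 numpy 的假设性实现,函数名 `channel_attention` 与参数 `gamma` 均为示意用途,大致对应本仓库 `danet.py` 中 `CAM` 模块的计算方式,并非训练代码):
+
+```python
+import numpy as np
+
+def channel_attention(A, gamma=0.0):
+    """通道注意力简化示意:A 为骨干网络输出的特征图,形状 (C, H, W)。"""
+    C, H, W = A.shape
+    feat = A.reshape(C, H * W)                      # B/C/D 均由 A reshape 得到,形状 (C, N)
+    energy = feat @ feat.T                          # (C, C) 通道间相关性矩阵
+    # 用每行最大值减去 energy 后再按行做 softmax,得到通道注意力
+    energy = energy.max(axis=-1, keepdims=True) - energy
+    attention = np.exp(energy) / np.exp(energy).sum(axis=-1, keepdims=True)
+    out = (attention @ feat).reshape(C, H, W)       # 注意力加权后的特征
+    return gamma * out + A                          # gamma 为可学习系数(初始为 0),再与 A 做 elementwise 相加
+```
+
+位置注意力的流程与之类似,区别在于注意力矩阵的形状为 N * N(N = H*W),刻画的是空间位置之间的相关性。
+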
+ + + +## 数据准备 + +公开数据集:Cityscapes + +训练集2975张,验证集500张,测试集1525张,图片分辨率都是1024*2048。 + +数据集来源:AIstudio数据集页面上[下载](https://aistudio.baidu.com/aistudio/datasetDetail/11503), cityscapes.zip解压至dataset文件夹下,train.zip解压缩到cityscapes/leftImg8bit,其目录结构如下: +```text +dataset + ├── cityscapes # Cityscapes数据集 + ├── gtFine # 精细化标注的label + ├── leftImg8bit # 训练,验证,测试图片 + ├── trainLabels.txt # 训练图片路径 + ├── valLabels.txt # 验证图片路径 + ... ... +``` +## 训练说明 + +#### 数据增强策略 + 1、随机尺度缩放:尺度范围0.75到2.0 + 2、随机左右翻转:发生概率0.5 + 3、同比例缩放:缩放的大小由选项1决定。 + 4、随机裁剪: + 5、高斯模糊:发生概率0.3(可选) + 6、颜色抖动,对比度,锐度,亮度; 发生概率0.3(可选) +###### 默认1、2、3、4、5、6都开启 + +#### 学习率调节策略 + 1、使用热身策略,学习率由0递增到base_lr,热身轮数(epoch)是5 + 2、在热身策略之后使用学习率衰减策略(poly),学习率由base_lr递减到0 + +#### 优化器选择 + Momentum: 动量0.9,正则化系数1e-4 + +#### 加载预训练模型 + 设置 --load_pretrained_model(默认为False) + 预训练文件: + checkpoint/DANet50_pretrained_model_paddle1.6.pdparams + checkpoint/DANet101_pretrained_model_paddle1.6.pdparams + +#### 加载训练好的模型 + 设置 --load_better_model(默认为False) + 训练好的文件: + checkpoint/DANet101_better_model_paddle1.6.pdparams +##### 【注】 + 训练时paddle版本是1.5.2,代码已转为1.6版本(兼容1.6版本),预训练参数、训练好的参数来自1.5.2版本 + +#### 配置模型文件路径 +[预训练参数、最优模型参数下载](https://paddlemodels.bj.bcebos.com/DANet/DANet_models.tar) + +其目录结构如下: +```text +checkpoint + ├── DANet50_pretrained_model_paddle1.6.pdparams # DANet50预训练模型,需要paddle >=1.6.0 + ├── DANet101_pretrained_model_paddle1.6.pdparams # DANet101预训练模型,需要paddle >=1.6.0 + ├── DANet101_better_model_paddle1.6.pdparams # DANet101训练最优模型,需要paddle >=1.6.0 + ├── DANet101_better_model_paddle1.5.2 # DANet101在1.5.2版本训练的最优模型,需要paddle >= 1.5.2 + +``` + +## 模型训练 + +```sh +cd danet +export PYTHONPATH=`pwd`:$PYTHONPATH +# open garbage collection to save memory +export FLAGS_eager_delete_tensor_gb=0.0 +# setting visible devices for train +export CUDA_VISIBLE_DEVICES=0,1,2,3 +``` + +executor执行以下命令进行训练 +```sh +python train_executor.py --backbone resnet101 --base_size 1024 --crop_size 768 --epoch_num 350 --batch_size 2 --lr 0.003 --lr_scheduler poly --warm_up --warmup_epoch 2 --cuda --use_data_parallel --load_pretrained_model --save_model checkpoint/DANet101_better_model_paddle1.5.2 --multi_scales --flip --dilated --multi_grid --scale --multi_dilation 4 8 16 +``` +参数含义: 使用ResNet101骨干网络,训练图片基础大小是1024,裁剪大小是768,训练轮数是350次,batch size是2 +学习率是0.003,学习率衰减策略是poly,使用学习率热身,热身轮数是2轮,使用GPU,使用数据并行, 加载预训练模型,设置加载的模型地址,使用多尺度测试, 使用图片左右翻转测试,使用空洞卷积,使用multi_grid,multi_dilation设置为4 8 16,使用多尺度训练 +##### Windows下训练需要去掉 --use_data_parallel +#### 或者 +dygraph执行以下命令进行训练 +```sh +python train_dygraph.py --backbone resnet101 --base_size 1024 --crop_size 768 --epoch_num 350 --batch_size 2 --lr 0.003 --lr_scheduler poly --cuda --use_data_parallel --load_pretrained_model --save_model checkpoint/DANet101_better_model_paddle1.6 --multi_scales --flip --dilated --multi_grid --scale --multi_dilation 4 8 16 +``` +参数含义: 使用ResNet101骨干网络,训练图片基础大小是1024,裁剪大小是768,训练轮数是350次,batch size是2,学习率是0.003,学习率衰减策略是poly,使用GPU, 使用数据并行,加载预训练模型,设置加载的模型地址,使用多尺度测试,使用图片左右翻转测试,使用空洞卷积,使用multi_grid,multi_dilation设置4 8 16,使用多尺度训练 + +#### 【注】 +##### train_executor.py使用executor方式训练(适合paddle >= 1.5.2),train_dygraph.py使用动态图方式训练(适合paddle >= 1.6.0),两种方式都可以 +##### 动态图方式训练暂时不支持学习率热身 + +#### 在训练阶段,输出的验证结果不是真实的,需要使用eval.py来获得验证的最终结果。 + + ## 模型验证 +```sh +# open garbage collection to save memory +export FLAGS_eager_delete_tensor_gb=0.0 +# setting visible devices for prediction +export CUDA_VISIBLE_DEVICES=0 + +python eval.py --backbone resnet101 --base_size 2048 --crop_size 1024 --cuda --use_data_parallel --load_better_model --save_model 
checkpoint/DANet101_better_model_paddle1.6 --multi_scales --flip --dilated --multi_grid --multi_dilation 4 8 16 +``` +##### 如果需要把executor训练的参数转成dygraph模式下进行验证的话,请在命令行加上--change_executor_to_dygraph + +## 验证结果 +评测指标:mean IOU(平均交并比) + + +| 模型 | 单尺度 | 多尺度 | +| :---:|:---:| :---:| +|DANet101|0.8043836|0.8138021 + +##### 具体数值 +| 模型 | cls1 | cls2 | cls3 | cls4 | cls5 | cls6 | cls7 | cls8 | cls9 | cls10 | cls11 | cls12 | cls13 | cls14 | cls15 | cls16 |cls17 | cls18 | cls19 | +| :---:|:---: | :---:| :---:|:---: | :---:| :---:|:---: | :---:| :---:|:---: |:---: |:---: |:---: | :---: | :---: |:---: | :---:| :---: |:---: | +|DANet101-SS|0.98212|0.85372|0.92799|0.59976|0.63318|0.65819|0.72023|0.80000|0.92605|0.65788|0.94841|0.83377|0.65206|0.95566|0.87148|0.91233|0.84352|0.71948|0.78737| +|DANet101-MS|0.98047|0.84637|0.93084|0.62699|0.64839|0.67769|0.73650|0.81343|0.92942|0.67010|0.95127|0.84466|0.66635|0.95749|0.87755|0.92370|0.85344|0.73007|0.79742| + +## 输出结果可视化 +![val_1](img/val_1.png) +###### 输入图片 +![val_gt](img/val_gt.png) +###### 图片label +![val_output](img/val_output.png) +###### DANet101模型输出 diff --git a/PaddleCV/Research/danet/checkpoint/.gitkeep b/PaddleCV/Research/danet/checkpoint/.gitkeep new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/PaddleCV/Research/danet/danet.py b/PaddleCV/Research/danet/danet.py new file mode 100644 index 0000000000000000000000000000000000000000..566a13e5cb7c9079de704db86647bcf2a5cabf1b --- /dev/null +++ b/PaddleCV/Research/danet/danet.py @@ -0,0 +1,641 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
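+
+# danet.py: DANet implemented with Paddle dygraph layers: a (dilated) ResNet backbone
+# followed by a dual attention head (PAM: position attention, CAM: channel attention).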
+from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import shutil +import paddle.fluid as fluid +import os + + +__all__ = ['DANet'] + + +class ConvBN(fluid.dygraph.Layer): + + def __init__(self, + name_scope, + num_filters, + filter_size=3, + stride=1, + dilation=1, + act=None, + learning_rate=1.0, + dtype='float32', + bias_attr=False): + super(ConvBN, self).__init__(name_scope) + + if dilation != 1: + padding = dilation + else: + padding = (filter_size - 1) // 2 + + self._conv = fluid.dygraph.Conv2D(name_scope, + num_filters=num_filters, + filter_size=filter_size, + stride=stride, + padding=padding, + dilation=dilation, + act=None, + dtype=dtype, + bias_attr=bias_attr if bias_attr is False else fluid.ParamAttr( + learning_rate=learning_rate, + name='bias'), + param_attr=fluid.ParamAttr( + learning_rate=learning_rate, + name='weight') + ) + self._bn = fluid.dygraph.BatchNorm(name_scope, + num_channels=num_filters, + act=act, + dtype=dtype, + momentum=0.9, + epsilon=1e-5, + bias_attr=fluid.ParamAttr( + learning_rate=learning_rate, + name='bias'), + param_attr=fluid.ParamAttr( + learning_rate=learning_rate, + name='weight'), + moving_mean_name='running_mean', + moving_variance_name='running_var' + ) + + def forward(self, inputs): + x = self._conv(inputs) + x = self._bn(x) + return x + + +class BasicBlock(fluid.dygraph.Layer): + + def __init__(self, + name_scope, + num_filters, + stride=1, + dilation=1, + same=False): + super(BasicBlock, self).__init__(name_scope) + self._conv0 = ConvBN(self.full_name(), + num_filters=num_filters, + filter_size=3, + stride=stride, + dilation=dilation, + act='relu') + self._conv1 = ConvBN(self.full_name(), + num_filters=num_filters, + filter_size=3, + stride=1, + dilation=dilation, + act=None) + + self.same = same + + if not same: + self._skip = ConvBN(self.full_name(), + num_filters=num_filters, + filter_size=1, + stride=stride, + act=None) + + def forward(self, inputs): + x = self._conv0(inputs) + x = self._conv1(x) + if self.same: + skip = inputs + else: + skip = self._skip(inputs) + x = fluid.layers.elementwise_add(x, skip, act='relu') + return x + + +class BottleneckBlock(fluid.dygraph.Layer): + def __init__(self, name_scope, num_filters, stride, dilation=1, same=False): + super(BottleneckBlock, self).__init__(name_scope) + self.expansion = 4 + + self._conv0 = ConvBN(name_scope, + num_filters=num_filters, + filter_size=1, + stride=1, + act='relu') + self._conv1 = ConvBN(name_scope, + num_filters=num_filters, + filter_size=3, + stride=stride, + dilation=dilation, + act='relu') + self._conv2 = ConvBN(name_scope, + num_filters=num_filters * self.expansion, + filter_size=1, + stride=1, + act=None) + self.same = same + + if not same: + self._skip = ConvBN(name_scope, + num_filters=num_filters * self.expansion, + filter_size=1, + stride=stride, + act=None) + + def forward(self, inputs): + x = self._conv0(inputs) + x = self._conv1(x) + x = self._conv2(x) + if self.same: + skip = inputs + else: + skip = self._skip(inputs) + x = fluid.layers.elementwise_add(x, skip, act='relu') + return x + + +class ResNet(fluid.dygraph.Layer): + def __init__(self, + name_scope, + layer=152, + num_class=1000, + dilated=True, + multi_grid=True, + multi_dilation=[4, 8, 16], + need_fc=False): + super(ResNet, self).__init__(name_scope) + + support_layer = [18, 34, 50, 101, 152] + assert layer in support_layer, 'layer({}) not in {}'.format(layer, support_layer) + self.need_fc = need_fc + self.num_filters_list = [64, 128, 
256, 512] + if layer == 18: + self.depth = [2, 2, 2, 2] + elif layer == 34: + self.depth = [3, 4, 6, 3] + elif layer == 50: + self.depth = [3, 4, 6, 3] + elif layer == 101: + self.depth = [3, 4, 23, 3] + elif layer == 152: + self.depth = [3, 8, 36, 3] + + if multi_grid: + assert multi_dilation is not None + self.multi_dilation = multi_dilation + + self._conv = ConvBN(name_scope, 64, 7, 2, act='relu') + self._pool = fluid.dygraph.Pool2D(name_scope, + pool_size=3, + pool_stride=2, + pool_padding=1, + pool_type='max') + if layer >= 50: + self.layer1 = self._make_layer(block=BottleneckBlock, + depth=self.depth[0], + num_filters=self.num_filters_list[0], + stride=1, + same=False, + name='layer1') + self.layer2 = self._make_layer(block=BottleneckBlock, + depth=self.depth[1], + num_filters=self.num_filters_list[1], + stride=2, + same=False, + name='layer2') + if dilated: + self.layer3 = self._make_layer(block=BottleneckBlock, + depth=self.depth[2], + num_filters=self.num_filters_list[2], + stride=2, + dilation=2, + same=False, + name='layer3') + if multi_grid: # layer4 采用不同的采样率 + self.layer4 = self._make_layer(block=BottleneckBlock, + depth=self.depth[3], + num_filters=self.num_filters_list[3], + stride=2, + dilation=4, + multi_grid=multi_grid, + multi_dilation=self.multi_dilation, + same=False, + name='layer4') + else: + self.layer4 = self._make_layer(block=BottleneckBlock, + depth=self.depth[3], + num_filters=self.num_filters_list[3], + stride=2, + dilation=4, + same=False, + name='layer4') + else: + self.layer3 = self._make_layer(block=BottleneckBlock, + depth=self.depth[2], + num_filters=self.num_filters_list[2], + stride=2, + dilation=1, + same=False, + name='layer3') + self.layer4 = self._make_layer(block=BottleneckBlock, + depth=self.depth[3], + num_filters=self.num_filters_list[3], + stride=2, + dilation=1, + same=False, + name='layer4') + + else: # layer=18 or layer=34 + self.layer1 = self._make_layer(block=BasicBlock, + depth=self.depth[0], + num_filters=self.num_filters_list[0], + stride=1, + same=True, + name=name_scope) + self.layer2 = self._make_layer(block=BasicBlock, + depth=self.depth[1], + num_filters=self.num_filters_list[1], + stride=2, + same=False, + name=name_scope) + self.layer3 = self._make_layer(block=BasicBlock, + depth=self.depth[2], + num_filters=self.num_filters_list[2], + stride=2, + dilation=1, + same=False, + name=name_scope) + self.layer4 = self._make_layer(block=BasicBlock, + depth=self.depth[3], + num_filters=self.num_filters_list[3], + stride=2, + dilation=1, + same=False, + name=name_scope) + + self._avgpool = fluid.dygraph.Pool2D(name_scope, + global_pooling=True, + pool_type='avg') + self.fc = fluid.dygraph.FC(name_scope, + size=num_class, + act='softmax') + + def _make_layer(self, block, depth, num_filters, stride=1, dilation=1, same=False, multi_grid=False, + multi_dilation=None, name=None): + layers = [] + if dilation != 1: + # stride(2x2) with a dilated convolution instead + stride = 1 + + if multi_grid: + assert len(multi_dilation) == 3 + for depth in range(depth): + temp = block(name + '.{}'.format(depth), + num_filters=num_filters, + stride=stride, + dilation=multi_dilation[depth], + same=same) + stride = 1 + same = True + layers.append(self.add_sublayer('_{}_{}'.format(name, depth + 1), temp)) + else: + for depth in range(depth): + temp = block(name + '.{}'.format(depth), + num_filters=num_filters, + stride=stride, + dilation=dilation if depth > 0 else 1, + same=same) + stride = 1 + same = True + layers.append(self.add_sublayer('_{}_{}'.format(name, 
depth + 1), temp)) + return layers + + def forward(self, inputs): + x = self._conv(inputs) + + x = self._pool(x) + for layer in self.layer1: + x = layer(x) + c1 = x + + for layer in self.layer2: + x = layer(x) + c2 = x + + for layer in self.layer3: + x = layer(x) + c3 = x + + for layer in self.layer4: + x = layer(x) + c4 = x + + if self.need_fc: + x = self._avgpool(x) + x = self.fc(x) + return x + else: + return c1, c2, c3, c4 + + +class CAM(fluid.dygraph.Layer): + def __init__(self, + name_scope, + in_channels=512, + default_value=0): + """ + channel_attention_module + """ + super(CAM, self).__init__(name_scope) + self.in_channels = in_channels + self.gamma = fluid.layers.create_parameter(shape=[1], + dtype='float32', + is_bias=True, + attr=fluid.ParamAttr( + learning_rate=10.0, + name='cam_gamma'), + default_initializer=fluid.initializer.ConstantInitializer( + value=default_value) + ) + + def forward(self, inputs): + batch_size, c, h, w = inputs.shape + out_b = fluid.layers.reshape(inputs, shape=[batch_size, self.in_channels, h * w]) + out_c = fluid.layers.reshape(inputs, shape=[batch_size, self.in_channels, h * w]) + out_c_t = fluid.layers.transpose(out_c, perm=[0, 2, 1]) + mul_bc = fluid.layers.matmul(out_b, out_c_t) + + mul_bc_max = fluid.layers.reduce_max(mul_bc, dim=-1, keep_dim=True) + mul_bc_max = fluid.layers.expand(mul_bc_max, expand_times=[1, 1, c]) + x = fluid.layers.elementwise_sub(mul_bc_max, mul_bc) + + attention = fluid.layers.softmax(x, use_cudnn=True, axis=-1) + + out_d = fluid.layers.reshape(inputs, shape=[batch_size, self.in_channels, h * w]) + attention_mul = fluid.layers.matmul(attention, out_d) + + attention_reshape = fluid.layers.reshape(attention_mul, shape=[batch_size, self.in_channels, h, w]) + gamma_attention = fluid.layers.elementwise_mul(attention_reshape, self.gamma) + out = fluid.layers.elementwise_add(gamma_attention, inputs) + return out + + +class PAM(fluid.dygraph.Layer): + def __init__(self, + name_scope, + in_channels=512, + default_value=0): + """ + position_attention_module + """ + super(PAM, self).__init__(name_scope) + + assert in_channels // 8, 'in_channel // 8 > 0 ' + self.channel_in = in_channels // 8 + self._convB = fluid.dygraph.Conv2D(name_scope, + num_filters=in_channels // 8, + filter_size=1, + bias_attr=fluid.ParamAttr( + learning_rate=10.0, + name='bias'), + param_attr=fluid.ParamAttr( + learning_rate=10.0, + name='weight') + ) + self._convC = fluid.dygraph.Conv2D(name_scope, + num_filters=in_channels // 8, + filter_size=1, + bias_attr=fluid.ParamAttr( + learning_rate=10.0, + name='bias'), + param_attr=fluid.ParamAttr( + learning_rate=10.0, + name='weight') + ) + self._convD = fluid.dygraph.Conv2D(name_scope, + num_filters=in_channels, + filter_size=1, + bias_attr=fluid.ParamAttr( + learning_rate=10.0, + name='bias'), + param_attr=fluid.ParamAttr( + learning_rate=10.0, + name='weight') + ) + self.gamma = fluid.layers.create_parameter(shape=[1], + dtype='float32', + is_bias=True, + attr=fluid.ParamAttr( + learning_rate=10.0, + name='pam_gamma'), + default_initializer=fluid.initializer.ConstantInitializer( + value=default_value)) + + def forward(self, inputs): + batch_size, c, h, w = inputs.shape + out_b = self._convB(inputs) + out_b_reshape = fluid.layers.reshape(out_b, shape=[batch_size, self.channel_in, h * w]) + out_b_reshape_t = fluid.layers.transpose(out_b_reshape, perm=[0, 2, 1]) + out_c = self._convC(inputs) + out_c_reshape = fluid.layers.reshape(out_c, shape=[batch_size, self.channel_in, h * w]) + + mul_bc = 
fluid.layers.matmul(out_b_reshape_t, out_c_reshape) + soft_max_bc = fluid.layers.softmax(mul_bc, use_cudnn=True, axis=-1) + + out_d = self._convD(inputs) + out_d_reshape = fluid.layers.reshape(out_d, shape=[batch_size, self.channel_in * 8, h * w]) + attention = fluid.layers.matmul(out_d_reshape, fluid.layers.transpose(soft_max_bc, perm=[0, 2, 1])) + attention = fluid.layers.reshape(attention, shape=[batch_size, self.channel_in * 8, h, w]) + + gamma_attention = fluid.layers.elementwise_mul(attention, self.gamma) + out = fluid.layers.elementwise_add(gamma_attention, inputs) + return out + + +class DAHead(fluid.dygraph.Layer): + def __init__(self, + name_scope, + in_channels, + out_channels, + batch_size): + super(DAHead, self).__init__(name_scope) + self.in_channel = in_channels // 4 + self.batch_size = batch_size + self._conv_bn_relu0 = ConvBN(name_scope, + num_filters=self.in_channel, + filter_size=3, + stride=1, + act='relu', + learning_rate=10.0, + bias_attr=False) + + self._conv_bn_relu1 = ConvBN(name_scope, + num_filters=self.in_channel, + filter_size=3, + stride=1, + act='relu', + learning_rate=10.0, + bias_attr=False) + + self._pam = PAM('pam', in_channels=self.in_channel, default_value=0.0) + self._cam = CAM('cam', in_channels=self.in_channel, default_value=0.0) + + self._conv_bn_relu2 = ConvBN(name_scope, + num_filters=self.in_channel, + filter_size=3, + stride=1, + act='relu', + learning_rate=10.0, + bias_attr=False) + + self._conv_bn_relu3 = ConvBN(name_scope, + num_filters=self.in_channel, + filter_size=3, + stride=1, + act='relu', + learning_rate=10.0, + bias_attr=False) + self._pam_last_conv = fluid.dygraph.Conv2D(name_scope, + num_filters=out_channels, + filter_size=1, + bias_attr=fluid.ParamAttr( + learning_rate=10.0, + name='bias'), + param_attr=fluid.ParamAttr( + learning_rate=10.0, + name='weight') + ) + self._cam_last_conv = fluid.dygraph.Conv2D(name_scope, + num_filters=out_channels, + filter_size=1, + bias_attr=fluid.ParamAttr( + learning_rate=10.0, + name='bias'), + param_attr=fluid.ParamAttr( + learning_rate=10.0, + name='weight') + ) + self._last_conv = fluid.dygraph.Conv2D(name_scope, + num_filters=out_channels, + filter_size=1, + bias_attr=fluid.ParamAttr( + learning_rate=10.0, + name='bias'), + param_attr=fluid.ParamAttr( + learning_rate=10.0, + name='weight') + ) + + def forward(self, inputs): + out = [] + inputs_pam = self._conv_bn_relu0(inputs) + pam = self._pam(inputs_pam) + position = self._conv_bn_relu2(pam) + + batch_size, num_channels = position.shape[:2] + + # dropout2d + ones = fluid.layers.ones(shape=[self.batch_size, num_channels], dtype='float32') + dropout1d_P = fluid.layers.dropout(ones, 0.1, dropout_implementation='upscale_in_train') + out_position_drop2d = fluid.layers.elementwise_mul(position, dropout1d_P, axis=0) + dropout1d_P.stop_gradient = True + + inputs_cam = self._conv_bn_relu1(inputs) + cam = self._cam(inputs_cam) + channel = self._conv_bn_relu3(cam) + + # dropout2d + ones2 = fluid.layers.ones(shape=[self.batch_size, num_channels], dtype='float32') + dropout1d_C = fluid.layers.dropout(ones2, 0.1, dropout_implementation='upscale_in_train') + out_channel_drop2d = fluid.layers.elementwise_mul(channel, dropout1d_C, axis=0) + dropout1d_C.stop_gradient = True + position_out = self._pam_last_conv(out_position_drop2d) + channel_out = self._cam_last_conv(out_channel_drop2d) + + feat_sum = fluid.layers.elementwise_add(position, channel, axis=1) + feat_sum_batch_size, feat_sum_num_channels = feat_sum.shape[:2] + + # dropout2d + feat_sum_ones = 
fluid.layers.ones(shape=[self.batch_size, feat_sum_num_channels], dtype='float32') + dropout1d_sum = fluid.layers.dropout(feat_sum_ones, 0.1, dropout_implementation='upscale_in_train') + dropout2d_feat_sum = fluid.layers.elementwise_mul(feat_sum, dropout1d_sum, axis=0) + dropout1d_sum.stop_gradient = True + feat_sum_out = self._last_conv(dropout2d_feat_sum) + + out.append(feat_sum_out) + out.append(position_out) + out.append(channel_out) + return tuple(out) + + +class DANet(fluid.dygraph.Layer): + def __init__(self, + name_scope, + backbone='resnet50', + num_classes=19, + batch_size=1, + dilated=True, + multi_grid=True, + multi_dilation=[4, 8, 16]): + super(DANet, self).__init__(name_scope) + if backbone == 'resnet50': + print('backbone resnet50, dilated={}, multi_grid={}, ' + 'multi_dilation={}'.format(dilated, multi_grid, multi_dilation)) + self._backone = ResNet('resnet50', layer=50, dilated=dilated, + multi_grid=multi_grid, multi_dilation=multi_dilation) + elif backbone == 'resnet101': + print('backbone resnet101, dilated={}, multi_grid={}, ' + 'multi_dilation={}'.format(dilated, multi_grid, multi_dilation)) + self._backone = ResNet('resnet101', layer=101, dilated=dilated, + multi_grid=multi_grid, multi_dilation=multi_dilation) + elif backbone == 'resnet152': + print('backbone resnet152, dilated={}, multi_grid={}, ' + 'multi_dilation={}'.format(dilated, multi_grid, multi_dilation)) + self._backone = ResNet('resnet152', layer=152, dilated=dilated, + multi_grid=multi_grid, multi_dilation=multi_dilation) + else: + raise ValueError('unknown backbone: {}'.format(backbone)) + + self._head = DAHead('DA_head', in_channels=2048, out_channels=num_classes, batch_size=batch_size) + + def forward(self, inputs): + h, w = inputs.shape[2:] + _, _, c3, c4 = self._backone(inputs) + x1, x2, x3 = self._head(c4) + out = [] + out1 = fluid.layers.resize_bilinear(x1, out_shape=[h, w]) + out2 = fluid.layers.resize_bilinear(x2, out_shape=[h, w]) + out3 = fluid.layers.resize_bilinear(x3, out_shape=[h, w]) + out.append(out1) + out.append(out2) + out.append(out3) + return out + + +def copy_model(path, new_path): + shutil.rmtree(new_path, ignore_errors=True) + shutil.copytree(path, new_path) + model_path = os.path.join(new_path, '__model__') + if os.path.exists(model_path): + os.remove(model_path) + + +if __name__ == '__main__': + import numpy as np + + with fluid.dygraph.guard(fluid.CPUPlace()): + x = np.random.randn(2, 3, 224, 224).astype('float32') + x = fluid.dygraph.to_variable(x) + model = DANet('test', backbone='resnet101', num_classes=19, batch_size=2) + y = model(x) + print(y[0].shape) diff --git a/PaddleCV/Research/danet/dataset/.gitkeep b/PaddleCV/Research/danet/dataset/.gitkeep new file mode 100644 index 0000000000000000000000000000000000000000..8b137891791fe96927ad78e64b0aad7bded08bdc --- /dev/null +++ b/PaddleCV/Research/danet/dataset/.gitkeep @@ -0,0 +1 @@ + diff --git a/PaddleCV/Research/danet/eval.py b/PaddleCV/Research/danet/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..46c825fabb71e8fee5834d2c09d5a2332833e007 --- /dev/null +++ b/PaddleCV/Research/danet/eval.py @@ -0,0 +1,410 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os + +os.environ['FLAGS_eager_delete_tensor_gb'] = "0.0" +os.environ['FLAGS_fraction_of_gpu_memory_to_use'] = "0.99" + +import paddle.fluid as fluid +import paddle +import logging +import math +import numpy as np +import shutil +import os + +from PIL import ImageOps, Image, ImageEnhance, ImageFilter +from datetime import datetime + +from danet import DANet +from options import Options +from utils.cityscapes_data import cityscapes_train +from utils.cityscapes_data import cityscapes_val +from utils.cityscapes_data import cityscapes_test +from utils.lr_scheduler import Lr +from iou import IOUMetric + +# globals +data_mean = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1) +data_std = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1) + + +def pad_single_image(image, crop_size): + w, h = image.size + pad_h = crop_size - h if h < crop_size else 0 + pad_w = crop_size - w if w < crop_size else 0 + image = ImageOps.expand(image, border=(0, 0, pad_w, pad_h), fill=0) + assert (image.size[0] >= crop_size and image.size[1] >= crop_size) + return image + + +def crop_image(image, h0, w0, h1, w1): + return image.crop((w0, h0, w1, h1)) + + +def flip_left_right_image(image): + return image.transpose(Image.FLIP_LEFT_RIGHT) + + +def resize_image(image, out_h, out_w, mode=Image.BILINEAR): + return image.resize((out_w, out_h), mode) + + +def mapper_image(image): + image_array = np.array(image) + image_array = image_array.transpose((2, 0, 1)) + image_array = image_array / 255.0 + image_array = (image_array - data_mean) / data_std + image_array = image_array.astype('float32') + image_array = image_array[np.newaxis, :] + return image_array + + +def get_model(args): + model = DANet('DANet', + backbone=args.backbone, + num_classes=args.num_classes, + batch_size=1, + dilated=args.dilated, + multi_grid=args.multi_grid, + multi_dilation=args.multi_dilation) + return model + + +def copy_model(path, new_path): + shutil.rmtree(new_path, ignore_errors=True) + shutil.copytree(path, new_path) + model_path = os.path.join(new_path, '__model__') + if os.path.exists(model_path): + os.remove(model_path) + + +def mean_iou(pred, label, num_classes=19): + label = fluid.layers.elementwise_min(fluid.layers.cast(label, np.int32), + fluid.layers.assign(np.array([num_classes], dtype=np.int32))) + label_ig = (label == num_classes).astype('int32') + label_ng = (label != num_classes).astype('int32') + pred = fluid.layers.cast(fluid.layers.argmax(pred, axis=1), 'int32') + pred = pred * label_ng + label_ig * num_classes + miou, wrong, correct = fluid.layers.mean_iou(pred, label, num_classes + 1) + label.stop_gradient = True + return miou, wrong, correct + + +def change_model_executor_to_dygraph(args): + temp_image = fluid.layers.data(name='temp_image', shape=[3, 224, 224], dtype='float32') + model = get_model(args) + y = model(temp_image) + if args.cuda: + gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0)) + place = fluid.CUDAPlace(gpu_id) if args.cuda else fluid.CPUPlace() + exe = fluid.Executor(place) + 
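# run the startup program first so all persistable variables exist before loading the executor-style checkpoint
+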
exe.run(fluid.default_startup_program()) + model_path = args.save_model + assert os.path.exists(model_path), "Please check whether the executor model file address {} exists. " \ + "Note: the executor model file is multiple files.".format(model_path) + fluid.io.load_persistables(exe, model_path, fluid.default_main_program()) + print('load executor train model successful, start change!') + param_list = fluid.default_main_program().block(0).all_parameters() + param_name_list = [p.name for p in param_list] + temp_dict = {} + for name in param_name_list: + tensor = fluid.global_scope().find_var(name).get_tensor() + npt = np.asarray(tensor) + temp_dict[name] = npt + del model + with fluid.dygraph.guard(): + x = np.random.randn(1, 3, 224, 224).astype('float32') + x = fluid.dygraph.to_variable(x) + model = get_model(args) + y = model(x) + new_param_dict = {} + for k, v in temp_dict.items(): + value = v + value_shape = value.shape + name = k + tensor = fluid.layers.create_parameter(shape=value_shape, + name=name, + dtype='float32', + default_initializer=fluid.initializer.NumpyArrayInitializer(value)) + new_param_dict[name] = tensor + assert len(new_param_dict) == len( + model.state_dict()), "The number of parameters is not equal. Loading parameters failed, " \ + "Please check whether the model is consistent!" + model.set_dict(new_param_dict) + fluid.save_dygraph(model.state_dict(), model_path) + del model + del temp_dict + print('change executor model to dygraph successful!') + + +def eval(args): + if args.change_executor_to_dygraph: + change_model_executor_to_dygraph(args) + with fluid.dygraph.guard(): + num_classes = args.num_classes + base_size = args.base_size + crop_size = args.crop_size + multi_scales = args.multi_scales + flip = args.flip + + if not multi_scales: + scales = [1.0] + else: + # scales = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.2] + scales = [0.5, 0.75, 1.0, 1.25, 1.35, 1.5, 1.75, 2.0, 2.2] # It might work better + + if len(scales) == 1: # single scale + # stride_rate = 2.0 / 3.0 + stride_rate = 1.0 / 2.0 # It might work better + else: + stride_rate = 1.0 / 2.0 + stride = int(crop_size * stride_rate) # slid stride + + model = get_model(args) + x = np.random.randn(1, 3, 224, 224).astype('float32') + x = fluid.dygraph.to_variable(x) + y = model(x) + iou = IOUMetric(num_classes) + model_path = args.save_model + # load_better_model + if paddle.__version__ == '1.5.2' and args.load_better_model: + assert os.path.exists(model_path), "your input save_model: {} ,but '{}' is not exists".format( + model_path, model_path) + print('better model exist!') + new_model_path = 'dygraph/' + model_path + copy_model(model_path, new_model_path) + model_param, _ = fluid.dygraph.load_persistables(new_model_path) + model.load_dict(model_param) + elif args.load_better_model: + assert os.path.exists(model_path + '.pdparams'), "your input save_model: {} ,but '{}' is not exists".format( + model_path, model_path + '.pdparams') + print('better model exist!') + model_param, _ = fluid.dygraph.load_dygraph(model_path) + model.load_dict(model_param) + else: + raise ValueError('Please set --load_better_model!') + + assert len(model_param) == len( + model.state_dict()), "The number of parameters is not equal. Loading parameters failed, " \ + "Please check whether the model is consistent!" 
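+ # Evaluation below runs multi-scale sliding-window inference on the validation set:
+ # each image is resized per scale, tiled into crop_size windows (with an optional
+ # horizontal flip), window logits are averaged over overlaps and accumulated across
+ # scales, and the final argmax is scored with IOUMetric.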
+ model.eval() + + prev_time = datetime.now() + # reader = cityscapes_test(split='test', base_size=2048, crop_size=1024, scale=True, xmap=True) + reader = cityscapes_test(split='val', base_size=2048, crop_size=1024, scale=True, xmap=True) + + print('MultiEvalModule: base_size {}, crop_size {}'. + format(base_size, crop_size)) + print('scales: {}'.format(scales)) + print('val ing...') + logging.basicConfig(level=logging.INFO, + filename='DANet_{}_eval_dygraph.log'.format(args.backbone), + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s') + logging.info('DANet') + logging.info(args) + palette = pat() + for data in reader(): + image = data[0] + label_path = data[1] # val_label is a picture, test_label is a path + label = Image.open(label_path, mode='r') # val_label is a picture, test_label is a path + save_png_path = label_path.replace('val', '{}_val'.format(args.backbone)).replace('test', '{}_test'.format( + args.backbone)) + label_np = np.array(label) + w, h = image.size # h 1024, w 2048 + scores = np.zeros(shape=[num_classes, h, w], dtype='float32') + for scale in scales: + long_size = int(math.ceil(base_size * scale)) # long_size + if h > w: + height = long_size + width = int(1.0 * w * long_size / h + 0.5) + short_size = width + else: + width = long_size + height = int(1.0 * h * long_size / w + 0.5) + short_size = height + + cur_img = resize_image(image, height, width) + # pad + if long_size <= crop_size: + pad_img = pad_single_image(cur_img, crop_size) + pad_img = mapper_image(pad_img) + pad_img = fluid.dygraph.to_variable(pad_img) + pred1, pred2, pred3 = model(pad_img) + pred1 = pred1.numpy() + outputs = pred1[:, :, :height, :width] + if flip: + pad_img_filp = flip_left_right_image(cur_img) + pad_img_filp = pad_single_image(pad_img_filp, crop_size) # pad + pad_img_filp = mapper_image(pad_img_filp) + pad_img_filp = fluid.dygraph.to_variable(pad_img_filp) + pred1, pred2, pred3 = model(pad_img_filp) + pred1 = fluid.layers.reverse(pred1, axis=3) + pred1 = pred1.numpy() + outputs += pred1[:, :, :height, :width] + else: + if short_size < crop_size: + # pad if needed + pad_img = pad_single_image(cur_img, crop_size) + else: + pad_img = cur_img + pw, ph = pad_img.size + assert (ph >= height and pw >= width) + + # slid window + h_grids = int(math.ceil(1.0 * (ph - crop_size) / stride)) + 1 + w_grids = int(math.ceil(1.0 * (pw - crop_size) / stride)) + 1 + outputs = np.zeros(shape=[1, num_classes, ph, pw], dtype='float32') + count_norm = np.zeros(shape=[1, 1, ph, pw], dtype='int32') + for idh in range(h_grids): + for idw in range(w_grids): + h0 = idh * stride + w0 = idw * stride + h1 = min(h0 + crop_size, ph) + w1 = min(w0 + crop_size, pw) + crop_img = crop_image(pad_img, h0, w0, h1, w1) + pad_crop_img = pad_single_image(crop_img, crop_size) + pad_crop_img = mapper_image(pad_crop_img) + pad_crop_img = fluid.dygraph.to_variable(pad_crop_img) + pred1, pred2, pred3 = model(pad_crop_img) # shape [1, num_class, h, w] + pred = pred1.numpy() # channel, h, w + outputs[:, :, h0:h1, w0:w1] += pred[:, :, 0:h1 - h0, 0:w1 - w0] + count_norm[:, :, h0:h1, w0:w1] += 1 + if flip: + pad_img_filp = flip_left_right_image(crop_img) + pad_img_filp = pad_single_image(pad_img_filp, crop_size) # pad + pad_img_array = mapper_image(pad_img_filp) + pad_img_array = fluid.dygraph.to_variable(pad_img_array) + pred1, pred2, pred3 = model(pad_img_array) + pred1 = fluid.layers.reverse(pred1, axis=3) + pred = pred1.numpy() + outputs[:, :, h0:h1, w0:w1] += pred[:, :, 0:h1 - h0, 0:w1 - w0] + count_norm[:, :, h0:h1, w0:w1] 
+= 1 + assert ((count_norm == 0).sum() == 0) + outputs = outputs / count_norm + outputs = outputs[:, :, :height, :width] + outputs = fluid.dygraph.to_variable(outputs) + outputs = fluid.layers.resize_bilinear(outputs, out_shape=[h, w]) + score = outputs.numpy()[0] + scores += score # the sum of all scales, shape: [channel, h, w] + pred = np.argmax(score, axis=0).astype('uint8') + picture_path = '{}'.format(save_png_path).replace('.png', '_scale_{}'.format(scale)) + save_png(pred, palette, picture_path) + pred = np.argmax(scores, axis=0).astype('uint8') + picture_path = '{}'.format(save_png_path).replace('.png', '_scores') + save_png(pred, palette, picture_path) + iou.add_batch(pred, label_np) # cal iou + print('eval done!') + logging.info('eval done!') + acc, acc_cls, iu, mean_iu, fwavacc, kappa = iou.evaluate() + print('acc = {}'.format(acc)) + logging.info('acc = {}'.format(acc)) + print('acc_cls = {}'.format(acc_cls)) + logging.info('acc_cls = {}'.format(acc_cls)) + print('iu = {}'.format(iu)) + logging.info('iu = {}'.format(iu)) + print('mean_iou -- 255 = {}'.format(mean_iu)) + logging.info('mean_iou --255 = {}'.format(mean_iu)) + print('mean_iou = {}'.format(np.nanmean(iu[:-1]))) # realy iou + logging.info('mean_iou = {}'.format(np.nanmean(iu[:-1]))) + print('fwavacc = {}'.format(fwavacc)) + logging.info('fwavacc = {}'.format(fwavacc)) + print('kappa = {}'.format(kappa)) + logging.info('kappa = {}'.format(kappa)) + cur_time = datetime.now() + h, remainder = divmod((cur_time - prev_time).seconds, 3600) + m, s = divmod(remainder, 60) + time_str = "Time %02d:%02d:%02d" % (h, m, s) + print('val ' + time_str) + logging.info('val ' + time_str) + + +def save_png(pred_value, palette, name): + if isinstance(pred_value, np.ndarray): + if pred_value.ndim == 3: + batch_size = pred_value.shape[0] + if batch_size == 1: + pred_value = pred_value.squeeze(axis=0) + image = Image.fromarray(pred_value).convert('P') + image.putpalette(palette) + save_path = '{}.png'.format(name) + save_dir = os.path.dirname(save_path) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + image.save(save_path) + else: + for batch_id in range(batch_size): + value = pred_value[batch_id] + image = Image.fromarray(value).convert('P') + image.putpalette(palette) + save_path = '{}.png'.format(name[batch_id]) + save_dir = os.path.dirname(save_path) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + image.save(save_path) + elif pred_value.ndim == 2: + image = Image.fromarray(pred_value).convert('P') + image.putpalette(palette) + save_path = '{}.png'.format(name) + save_dir = os.path.dirname(save_path) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + image.save(save_path) + else: + raise ValueError('Only support nd-array') + + +def save_png_test(path): + im = Image.open(path) + im_array = np.array(im).astype('uint8') + save_png(im_array, pat(), 'save_png_test') + + +def pat(): + palette = [] + for i in range(256): + palette.extend((i, i, i)) + palette[:3 * 19] = np.array([[128, 64, 128], + [244, 35, 232], + [70, 70, 70], + [102, 102, 156], + [190, 153, 153], + [153, 153, 153], + [250, 170, 30], + [220, 220, 0], + [107, 142, 35], + [152, 251, 152], + [70, 130, 180], + [220, 20, 60], + [255, 0, 0], + [0, 0, 142], + [0, 0, 70], + [0, 60, 100], + [0, 80, 100], + [0, 0, 230], + [119, 11, 32]], dtype='uint8').flatten() + return palette + + +if __name__ == '__main__': + options = Options() + args = options.parse() + options.print_args() + eval(args) + diff --git a/PaddleCV/Research/danet/img/Network.png 
b/PaddleCV/Research/danet/img/Network.png new file mode 100644 index 0000000000000000000000000000000000000000..ac109b403a122a0241cb391c2d17b45ca43cb41b Binary files /dev/null and b/PaddleCV/Research/danet/img/Network.png differ diff --git a/PaddleCV/Research/danet/img/channel.png b/PaddleCV/Research/danet/img/channel.png new file mode 100644 index 0000000000000000000000000000000000000000..eae8854c4252dec561f0b71febf5ddf1372b428c Binary files /dev/null and b/PaddleCV/Research/danet/img/channel.png differ diff --git a/PaddleCV/Research/danet/img/position.png b/PaddleCV/Research/danet/img/position.png new file mode 100644 index 0000000000000000000000000000000000000000..b46f9e1751783eb338b4554da70696df5e411457 Binary files /dev/null and b/PaddleCV/Research/danet/img/position.png differ diff --git a/PaddleCV/Research/danet/img/val_1.png b/PaddleCV/Research/danet/img/val_1.png new file mode 100644 index 0000000000000000000000000000000000000000..4f4610d36f3d16ec669a89aaaf6ee71b24982435 Binary files /dev/null and b/PaddleCV/Research/danet/img/val_1.png differ diff --git a/PaddleCV/Research/danet/img/val_gt.png b/PaddleCV/Research/danet/img/val_gt.png new file mode 100644 index 0000000000000000000000000000000000000000..5a0d27351a66a0cab3f885f86e42141a7f96b06d Binary files /dev/null and b/PaddleCV/Research/danet/img/val_gt.png differ diff --git a/PaddleCV/Research/danet/img/val_output.png b/PaddleCV/Research/danet/img/val_output.png new file mode 100644 index 0000000000000000000000000000000000000000..3d9ee2191629b8dad656672e99716e1bcb6f720c Binary files /dev/null and b/PaddleCV/Research/danet/img/val_output.png differ diff --git a/PaddleCV/Research/danet/iou.py b/PaddleCV/Research/danet/iou.py new file mode 100644 index 0000000000000000000000000000000000000000..1f560a3041c29f47deb70a7eecbe937d1d096317 --- /dev/null +++ b/PaddleCV/Research/danet/iou.py @@ -0,0 +1,74 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
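+# IOUMetric accumulates a (num_classes + 1) x (num_classes + 1) confusion matrix over
+# batches (the extra row/column collects ignored pixels labelled 255) and reports pixel
+# accuracy, per-class and mean IoU, frequency-weighted accuracy and the kappa coefficient.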
+from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np + + +class IOUMetric(object): + + def __init__(self, num_classes): + self.num_classes = num_classes + 1 + self.hist = np.zeros((num_classes + 1, num_classes + 1)) + + def _fast_hist(self, label_pred, label_true): + mask = (label_true >= 0) & (label_true < self.num_classes) + hist = np.bincount( + self.num_classes * label_true[mask].astype(int) + + label_pred[mask], minlength=self.num_classes ** 2).reshape(self.num_classes, self.num_classes) + return hist + + def add_batch(self, predictions, gts): + # gts = BHW + # predictions = BHW + if isinstance(gts, np.ndarray): + gts_ig = (gts == 255).astype(np.int32) + gts_nig = (gts != 255).astype(np.int32) + # print(predictions) + gts[gts == 255] = self.num_classes - 1 # 19 + predictions = gts_nig * predictions + gts_ig * (self.num_classes - 1) + # print(predictions) + for lp, lt in zip(predictions, gts): + self.hist += self._fast_hist(lp.flatten(), lt.flatten()) + + def evaluate(self): + acc = np.diag(self.hist).sum() / self.hist.sum() + acc_cls = np.nanmean(np.diag(self.hist) / self.hist.sum(axis=1)) + iu = np.diag(self.hist) / (self.hist.sum(axis=1) + self.hist.sum(axis=0) - np.diag(self.hist)) + mean_iu = np.nanmean(iu) + freq = self.hist.sum(axis=1) / self.hist.sum() + fwavacc = (freq[freq > 0] * iu[freq > 0]).sum() + kappa = (self.hist.sum() * np.diag(self.hist).sum() - (self.hist.sum(axis=0) * self.hist.sum(axis=1)).sum()) / ( + self.hist.sum() ** 2 - (self.hist.sum(axis=0) * self.hist.sum(axis=1)).sum()) + return acc, acc_cls, iu, mean_iu, fwavacc, kappa + + def evaluate_kappa(self): + kappa = (self.hist.sum() * np.diag(self.hist).sum() - (self.hist.sum(axis=0) * self.hist.sum(axis=1)).sum()) / ( + self.hist.sum() ** 2 - (self.hist.sum(axis=0) * self.hist.sum(axis=1)).sum()) + return kappa + + def evaluate_iou_kappa(self): + iu = np.diag(self.hist) / (self.hist.sum(axis=1) + self.hist.sum(axis=0) - np.diag(self.hist)) + mean_iu = np.nanmean(iu) + kappa = (self.hist.sum() * np.diag(self.hist).sum() - (self.hist.sum(axis=0) * self.hist.sum(axis=1)).sum()) / ( + self.hist.sum() ** 2 - (self.hist.sum(axis=0) * self.hist.sum(axis=1)).sum()) + return mean_iu, kappa + + def evaluate_iu(self): + iu = np.diag(self.hist) / (self.hist.sum(axis=1) + self.hist.sum(axis=0) - np.diag(self.hist)) + return iu + diff --git a/PaddleCV/Research/danet/options.py b/PaddleCV/Research/danet/options.py new file mode 100644 index 0000000000000000000000000000000000000000..40f73feef8ae2ee53491c506cba8cb5232e0e4c8 --- /dev/null +++ b/PaddleCV/Research/danet/options.py @@ -0,0 +1,176 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
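+# Command-line configuration for the DANet scripts: model, backbone and dataset choices,
+# training hyper-parameters (epochs, batch size, LR schedule, warm-up), device and
+# checkpoint flags; parse() fills in per-dataset defaults for values left unset.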
+from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os +import argparse + + +class Options(object): + def __init__(self): + parser = argparse.ArgumentParser(description='Paddle DANet Segmentation') + + # model and dataset + parser.add_argument('--model', type=str, default='danet', + help='model name (default: danet)') + parser.add_argument('--backbone', type=str, default='resnet101', + help='backbone name (default: resnet101)') + parser.add_argument('--dataset', type=str, default='cityscapes', + help='dataset name (default: cityscapes)') + parser.add_argument('--num_classes', type=int, default=19, + help='num_classes (default: cityscapes = 19)') + parser.add_argument('--data_folder', type=str, + default='./dataset', + help='training dataset folder (default: ./dataset') + parser.add_argument('--base_size', type=int, default=1024, + help='base image size') + parser.add_argument('--crop_size', type=int, default=768, + help='crop image size') + + # training hyper params + parser.add_argument('--epoch_num', type=int, default=None, metavar='N', + help='number of epochs to train (default: auto)') + parser.add_argument('--start_epoch', type=int, default=0, + metavar='N', help='start epochs (default:0)') + parser.add_argument('--batch_size', type=int, default=None, + metavar='N', help='input batch size for \ + training (default: auto)') + parser.add_argument('--test_batch_size', type=int, default=None, + metavar='N', help='input batch size for \ + testing (default: same as batch size)') + + # optimizer params + parser.add_argument('--lr', type=float, default=None, metavar='LR', + help='learning rate (default: auto)') + parser.add_argument('--lr_scheduler', type=str, default='poly', + help='learning rate scheduler (default: poly)') + parser.add_argument('--lr_pow', type=float, default=0.9, + help='learning rate scheduler (default: 0.9)') + parser.add_argument('--lr_step', type=int, default=None, + help='lr step to change lr') + parser.add_argument('--warm_up', action='store_true', default=False, + help='warm_up (default: False)') + parser.add_argument('--warmup_epoch', type=int, default=5, + help='warmup_epoch (default: 5)') + parser.add_argument('--total_step', type=int, default=None, + metavar='N', help='total_step (default: auto)') + parser.add_argument('--step_per_epoch', type=int, default=None, + metavar='N', help='step_per_epoch (default: auto)') + parser.add_argument('--momentum', type=float, default=0.9, + metavar='M', help='momentum (default: 0.9)') + parser.add_argument('--weight_decay', type=float, default=1e-4, + metavar='M', help='w-decay (default: 1e-4)') + + # cuda, seed and logging + parser.add_argument('--cuda', action='store_true', default=False, + help='use CUDA training, (default: False)') + parser.add_argument('--use_data_parallel', action='store_true', default=False, + help='use data_parallel training, (default: False)') + parser.add_argument('--seed', type=int, default=1, metavar='S', + help='random seed (default: 1)') + parser.add_argument('--log_root', type=str, + default='./', help='set a log path folder') + + # checkpoint + parser.add_argument("--save_model", default='checkpoint/DANet101_better_model_paddle1.6', type=str, + help="model path, (default: checkpoint/DANet101_better_model_paddle1.6)") + + # change executor model params to dygraph model params + parser.add_argument("--change_executor_to_dygraph", action='store_true', default=False, + help="change executor model params to dygraph model params 
(default:False)") + + # finetuning pre-trained models + parser.add_argument("--load_pretrained_model", action='store_true', default=False, + help="load pretrained model (default: False)") + # load better models + parser.add_argument("--load_better_model", action='store_true', default=False, + help="load better model (default: False)") + parser.add_argument('--multi_scales', action='store_true', default=False, + help="testing scale, (default: False)") + parser.add_argument('--flip', action='store_true', default=False, + help="testing flip image, (default: False)") + + # multi grid dilation option + parser.add_argument("--dilated", action='store_true', default=False, + help="use dilation policy, (default: False)") + parser.add_argument("--multi_grid", action='store_true', default=False, + help="use multi grid dilation policy, default: False") + parser.add_argument('--multi_dilation', nargs='+', type=int, default=None, + help="multi grid dilation list, (default: None), can use --mutil_dilation 4 8 16") + parser.add_argument('--scale', action='store_true', default=False, + help='choose to use random scale transform(0.75-2.0) for train, (default: False)') + + # the parser + self.parser = parser + + def parse(self): + args = self.parser.parse_args() + # default settings for epochs, batch_size and lr + if args.epoch_num is None: + epoches = { + 'pascal_voc': 180, + 'pascal_aug': 180, + 'pcontext': 180, + 'ade20k': 180, + 'cityscapes': 350, + } + num_class_dict = { + 'pascal_voc': 21, + 'pascal_aug': 21, + 'pcontext': 21, + 'ade20k': None, + 'cityscapes': 19, + } + total_steps = { + 'pascal_voc': 200000, + 'pascal_aug': 500000, + 'pcontext': 500000, + 'ade20k': 500000, + 'cityscapes': 150000, + } + args.epoch_num = epoches[args.dataset.lower()] + args.num_classes = num_class_dict[args.dataset.lower()] + args.total_step = total_steps[args.dataset.lower()] + if args.batch_size is None: + args.batch_size = 2 + if args.test_batch_size is None: + args.test_batch_size = args.batch_size + if args.step_per_epoch is None: + step_per_epoch = { + 'pascal_voc': 185, + 'pascal_aug': 185, + 'pcontext': 185, + 'ade20k': 185, + 'cityscapes': 371, # 2975 // batch_size // GPU_num + } + args.step_per_epoch = step_per_epoch[args.dataset.lower()] + if args.lr is None: + lrs = { + 'pascal_voc': 0.0001, + 'pascal_aug': 0.001, + 'pcontext': 0.001, + 'ade20k': 0.01, + 'cityscapes': 0.003, + } + args.lr = lrs[args.dataset.lower()] / 8 * args.batch_size + return args + + def print_args(self): + arg_dict = self.parse().__dict__ + for k, v in arg_dict.items(): + print('{:30s}: {}'.format(k, v)) + diff --git a/PaddleCV/Research/danet/train_dygraph.py b/PaddleCV/Research/danet/train_dygraph.py new file mode 100644 index 0000000000000000000000000000000000000000..df610999e5eaff47aafe3e53b18833b1eb73b576 --- /dev/null +++ b/PaddleCV/Research/danet/train_dygraph.py @@ -0,0 +1,353 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os + +os.environ['FLAGS_eager_delete_tensor_gb'] = "0.0" +os.environ['FLAGS_fraction_of_gpu_memory_to_use'] = "0.99" + +import paddle.fluid as fluid +import numpy as np +import random +import paddle +import logging +import shutil +import multiprocessing +import sys +from datetime import datetime +from paddle.utils import Ploter + +from danet import DANet +from options import Options +from utils.cityscapes_data import cityscapes_train +from utils.cityscapes_data import cityscapes_val +from utils.lr_scheduler import Lr +import matplotlib + +matplotlib.use('Agg') + + +def get_model(args): + model = DANet('DANet', + backbone=args.backbone, + num_classes=args.num_classes, + batch_size=args.batch_size, + dilated=args.dilated, + multi_grid=args.multi_grid, + multi_dilation=args.multi_dilation) + return model + + +def _cpu_num(): + if "CPU_NUM" not in os.environ.keys(): + if multiprocessing.cpu_count() > 1: + sys.stderr.write( + '!!! The CPU_NUM is not specified, you should set CPU_NUM in the environment variable list.\n' + 'CPU_NUM indicates that how many CPUPlace are used in the current task.\n' + 'And if this parameter are set as N (equal to the number of physical CPU core) the program may be faster.\n\n' + 'export CPU_NUM={} # for example, set CPU_NUM as number of physical CPU core which is {}.\n\n' + '!!! The default number of CPU_NUM=1.\n'.format( + multiprocessing.cpu_count(), multiprocessing.cpu_count())) + os.environ['CPU_NUM'] = str(1) + cpu_num = os.environ.get('CPU_NUM') + return int(cpu_num) + + +def mean_iou(pred, label, num_classes=19): + label = fluid.layers.elementwise_min(fluid.layers.cast(label, np.int32), + fluid.layers.assign(np.array([num_classes], dtype=np.int32))) + label_ig = (label == num_classes).astype('int32') + label_ng = (label != num_classes).astype('int32') + pred = fluid.layers.cast(fluid.layers.argmax(pred, axis=1), 'int32') + pred = pred * label_ng + label_ig * num_classes + miou, wrong, correct = fluid.layers.mean_iou(pred, label, num_classes + 1) + label.stop_gradient = True + return miou, wrong, correct + + +def loss_fn(pred, pred2, pred3, label, num_classes=19): + pred = fluid.layers.transpose(pred, perm=[0, 2, 3, 1]) + pred = fluid.layers.reshape(pred, [-1, num_classes]) + + pred2 = fluid.layers.transpose(pred2, perm=[0, 2, 3, 1]) + pred2 = fluid.layers.reshape(pred2, [-1, num_classes]) + + pred3 = fluid.layers.transpose(pred3, perm=[0, 2, 3, 1]) + pred3 = fluid.layers.reshape(pred3, [-1, num_classes]) + + label = fluid.layers.reshape(label, [-1, 1]) + + pred = fluid.layers.softmax(pred, use_cudnn=False) + loss1 = fluid.layers.cross_entropy(pred, label, ignore_index=255) + + pred2 = fluid.layers.softmax(pred2, use_cudnn=False) + loss2 = fluid.layers.cross_entropy(pred2, label, ignore_index=255) + + pred3 = fluid.layers.softmax(pred3, use_cudnn=False) + loss3 = fluid.layers.cross_entropy(pred3, label, ignore_index=255) + + label.stop_gradient = True + return loss1 + loss2 + loss3 + + +def optimizer_setting(args): + if args.weight_decay is not None: + regular = fluid.regularizer.L2Decay(regularization_coeff=args.weight_decay) + else: + regular = None + if args.lr_scheduler == 'poly': + lr_scheduler = Lr(lr_policy='poly', + base_lr=args.lr, + epoch_nums=args.epoch_num, + step_per_epoch=args.step_per_epoch, + power=args.lr_pow, + warm_up=args.warm_up, + warmup_epoch=args.warmup_epoch) + decayed_lr = lr_scheduler.get_lr() + elif 
args.lr_scheduler == 'cosine': + lr_scheduler = Lr(lr_policy='cosine', + base_lr=args.lr, + epoch_nums=args.epoch_num, + step_per_epoch=args.step_per_epoch, + warm_up=args.warm_up, + warmup_epoch=args.warmup_epoch) + decayed_lr = lr_scheduler.get_lr() + elif args.lr_scheduler == 'piecewise': + lr_scheduler = Lr(lr_policy='piecewise', + base_lr=args.lr, + epoch_nums=args.epoch_num, + step_per_epoch=args.step_per_epoch, + warm_up=args.warm_up, + warmup_epoch=args.warmup_epoch, + decay_epoch=[50, 100, 150], + gamma=0.1) + decayed_lr = lr_scheduler.get_lr() + else: + decayed_lr = args.lr + return fluid.optimizer.MomentumOptimizer(learning_rate=decayed_lr, + momentum=args.momentum, + regularization=regular) + + +def main(args): + batch_size = args.batch_size + num_epochs = args.epoch_num + num_classes = args.num_classes + data_root = args.data_folder + if args.cuda: + num = fluid.core.get_cuda_device_count() + print('The number of GPU: {}'.format(num)) + else: + num = _cpu_num() + print('The number of CPU: {}'.format(num)) + + # program + start_prog = fluid.default_startup_program() + train_prog = fluid.default_main_program() + + start_prog.random_seed = args.seed + train_prog.random_seed = args.seed + np.random.seed(args.seed) + random.seed(args.seed) + + logging.basicConfig(level=logging.INFO, + filename='DANet_{}_train_dygraph.log'.format(args.backbone), + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s') + logging.info('DANet') + logging.info(args) + + if args.cuda: + gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0)) + + place = fluid.CUDAPlace(gpu_id) if args.cuda else fluid.CPUPlace() + train_loss_title = 'Train_loss' + test_loss_title = 'Test_loss' + + train_iou_title = 'Train_mIOU' + test_iou_title = 'Test_mIOU' + + plot_loss = Ploter(train_loss_title, test_loss_title) + plot_iou = Ploter(train_iou_title, test_iou_title) + + with fluid.dygraph.guard(place): + + model = get_model(args) + x = np.random.randn(batch_size, 3, 224, 224).astype('float32') + x = fluid.dygraph.to_variable(x) + model(x) + + # load_pretrained_model + if args.load_pretrained_model: + save_dir = args.save_model + assert os.path.exists(save_dir + '.pdparams'), "your input save_model: {} ,but '{}' is not exists".format( + save_dir, save_dir + '.pdparams') + param, _ = fluid.load_dygraph(save_dir) + model.set_dict(param) + assert len(param) == len( + model.state_dict()), "The number of parameters is not equal. Loading parameters failed, " \ + "Please check whether the model is consistent!" + print('load pretrained model!') + + # load_better_model + if args.load_better_model: + save_dir = args.save_model + assert os.path.exists(save_dir + '.pdparams'), "your input save_model: {} ,but '{}' is not exists".format( + save_dir, save_dir + '.pdparams') + param, _ = fluid.load_dygraph(save_dir) + model.set_dict(param) + assert len(param) == len( + model.state_dict()), "The number of parameters is not equal. Loading parameters failed, " \ + "Please check whether the model is consistent!" 
+ print('load better model!') + + optimizer = optimizer_setting(args) + train_data = cityscapes_train(data_root=data_root, + base_size=args.base_size, + crop_size=args.crop_size, + scale=args.scale, + xmap=True, + batch_size=batch_size, + gpu_num=num) + batch_train_data = paddle.batch(paddle.reader.shuffle( + train_data, buf_size=batch_size * 64), + batch_size=batch_size, + drop_last=True) + + val_data = cityscapes_val(data_root=data_root, + base_size=args.base_size, + crop_size=args.crop_size, + scale=args.scale, + xmap=True) + batch_test_data = paddle.batch(val_data, + batch_size=batch_size, + drop_last=True) + + train_iou_manager = fluid.metrics.Accuracy() + train_avg_loss_manager = fluid.metrics.Accuracy() + test_iou_manager = fluid.metrics.Accuracy() + test_avg_loss_manager = fluid.metrics.Accuracy() + + better_miou_train = 0 + better_miou_test = 0 + + for epoch in range(num_epochs): + prev_time = datetime.now() + train_avg_loss_manager.reset() + train_iou_manager.reset() + for batch_id, data in enumerate(batch_train_data()): + image = np.array([x[0] for x in data]).astype('float32') + label = np.array([x[1] for x in data]).astype('int64') + + image = fluid.dygraph.to_variable(image) + label = fluid.dygraph.to_variable(label) + label.stop_gradient = True + pred, pred2, pred3 = model(image) + train_loss = loss_fn(pred, pred2, pred3, label, num_classes=num_classes) + train_avg_loss = fluid.layers.mean(train_loss) + miou, wrong, correct = mean_iou(pred, label, num_classes=num_classes) + train_avg_loss.backward() + optimizer.minimize(train_avg_loss) + model.clear_gradients() + train_iou_manager.update(miou.numpy(), weight=int(batch_size * num)) + train_avg_loss_manager.update(train_avg_loss.numpy(), weight=int(batch_size * num)) + batch_train_str = "epoch: {}, batch: {}, train_avg_loss: {:.6f}, " \ + "train_miou: {:.6f}.".format(epoch + 1, + batch_id + 1, + train_avg_loss.numpy()[0], + miou.numpy()[0]) + if batch_id % 100 == 0: + logging.info(batch_train_str) + print(batch_train_str) + cur_time = datetime.now() + h, remainder = divmod((cur_time - prev_time).seconds, 3600) + m, s = divmod(remainder, 60) + time_str = " Time %02d:%02d:%02d" % (h, m, s) + train_str = "\nepoch: {}, train_avg_loss: {:.6f}, " \ + "train_miou: {:.6f}.".format(epoch + 1, + train_avg_loss_manager.eval()[0], + train_iou_manager.eval()[0]) + print(train_str + time_str + '\n') + logging.info(train_str + time_str + '\n') + plot_loss.append(train_loss_title, epoch, train_avg_loss_manager.eval()[0]) + plot_loss.plot('./DANet_loss_dygraph.jpg') + plot_iou.append(train_iou_title, epoch, train_iou_manager.eval()[0]) + plot_iou.plot('./DANet_miou_dygraph.jpg') + fluid.dygraph.save_dygraph(model.state_dict(), 'checkpoint/DANet_epoch_new') + # save_model + if better_miou_train < train_iou_manager.eval()[0]: + shutil.rmtree('checkpoint/DANet_better_train_{:.4f}.pdparams'.format(better_miou_train), + ignore_errors=True) + better_miou_train = train_iou_manager.eval()[0] + fluid.dygraph.save_dygraph(model.state_dict(), + 'checkpoint/DANet_better_train_{:.4f}'.format(better_miou_train)) + + ########## test ############ + model.eval() + test_iou_manager.reset() + test_avg_loss_manager.reset() + prev_time = datetime.now() + for (batch_id, data) in enumerate(batch_test_data()): + image = np.array([x[0] for x in data]).astype('float32') + label = np.array([x[1] for x in data]).astype('int64') + + image = fluid.dygraph.to_variable(image) + label = fluid.dygraph.to_variable(label) + + label.stop_gradient = True + pred, pred2, pred3 = 
model(image) + test_loss = loss_fn(pred, pred2, pred3, label, num_classes=num_classes) + test_avg_loss = fluid.layers.mean(test_loss) + miou, wrong, correct = mean_iou(pred, label, num_classes=num_classes) + test_iou_manager.update(miou.numpy(), weight=int(batch_size * num)) + test_avg_loss_manager.update(test_avg_loss.numpy(), weight=int(batch_size * num)) + batch_test_str = "epoch: {}, batch: {}, test_avg_loss: {:.6f}, " \ + "test_miou: {:.6f}.".format(epoch + 1, batch_id + 1, + test_avg_loss.numpy()[0], + miou.numpy()[0]) + if batch_id % 20 == 0: + logging.info(batch_test_str) + print(batch_test_str) + cur_time = datetime.now() + h, remainder = divmod((cur_time - prev_time).seconds, 3600) + m, s = divmod(remainder, 60) + time_str = " Time %02d:%02d:%02d" % (h, m, s) + test_str = "\nepoch: {}, test_avg_loss: {:.6f}, " \ + "test_miou: {:.6f}.".format(epoch + 1, + test_avg_loss_manager.eval()[0], + test_iou_manager.eval()[0]) + print(test_str + time_str + '\n') + logging.info(test_str + time_str + '\n') + plot_loss.append(test_loss_title, epoch, test_avg_loss_manager.eval()[0]) + plot_loss.plot('./DANet_loss_dygraph.jpg') + plot_iou.append(test_iou_title, epoch, test_iou_manager.eval()[0]) + plot_iou.plot('./DANet_miou_dygraph.jpg') + model.train() + # save_model + if better_miou_test < test_iou_manager.eval()[0]: + shutil.rmtree('checkpoint/DANet_better_test_{:.4f}.pdparams'.format(better_miou_test), + ignore_errors=True) + better_miou_test = test_iou_manager.eval()[0] + fluid.dygraph.save_dygraph(model.state_dict(), + 'checkpoint/DANet_better_test_{:.4f}'.format(better_miou_test)) + + +if __name__ == '__main__': + options = Options() + args = options.parse() + options.print_args() + main(args) diff --git a/PaddleCV/Research/danet/train_executor.py b/PaddleCV/Research/danet/train_executor.py new file mode 100644 index 0000000000000000000000000000000000000000..82f451dd168527886081c684686dd271a9e3c38c --- /dev/null +++ b/PaddleCV/Research/danet/train_executor.py @@ -0,0 +1,423 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
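+# Static-graph (executor) training script: the same DANet model and losses are built inside
+# program/unique_name guards with PyReader input pipelines, optionally compiled with data
+# parallelism and sync batch norm, and the best validation model is also exported with
+# save_inference_model.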
+from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os + +os.environ['FLAGS_eager_delete_tensor_gb'] = "0.0" +os.environ['FLAGS_fraction_of_gpu_memory_to_use'] = "0.99" + +import paddle.fluid as fluid +import numpy as np +import random +import paddle +import logging +import shutil +import multiprocessing +import sys +from datetime import datetime +from paddle.utils import Ploter + +from danet import DANet +from options import Options +from utils.cityscapes_data import cityscapes_train +from utils.cityscapes_data import cityscapes_val +from utils.lr_scheduler import Lr +import matplotlib + +matplotlib.use('Agg') + + +def get_model(args): + model = DANet('DANet', + backbone=args.backbone, + num_classes=args.num_classes, + batch_size=args.batch_size, + dilated=args.dilated, + multi_grid=args.multi_grid, + multi_dilation=args.multi_dilation) + return model + + +def _cpu_num(): + if "CPU_NUM" not in os.environ.keys(): + if multiprocessing.cpu_count() > 1: + sys.stderr.write( + '!!! The CPU_NUM is not specified, you should set CPU_NUM in the environment variable list.\n' + 'CPU_NUM indicates that how many CPUPlace are used in the current task.\n' + 'And if this parameter are set as N (equal to the number of physical CPU core) the program may be faster.\n\n' + 'export CPU_NUM={} # for example, set CPU_NUM as number of physical CPU core which is {}.\n\n' + '!!! The default number of CPU_NUM=1.\n'.format( + multiprocessing.cpu_count(), multiprocessing.cpu_count())) + os.environ['CPU_NUM'] = str(1) + cpu_num = os.environ.get('CPU_NUM') + return int(cpu_num) + + +def mean_iou(pred, label, num_classes=19): + label = fluid.layers.elementwise_min(fluid.layers.cast(label, np.int32), + fluid.layers.assign(np.array([num_classes], dtype=np.int32))) + label_ig = (label == num_classes).astype('int32') + label_ng = (label != num_classes).astype('int32') + pred = fluid.layers.cast(fluid.layers.argmax(pred, axis=1), 'int32') + pred = pred * label_ng + label_ig * num_classes + miou, wrong, correct = fluid.layers.mean_iou(pred, label, num_classes + 1) + label.stop_gradient = True + return miou, wrong, correct + + +def loss_fn(pred, pred2, pred3, label, num_classes=19): + pred = fluid.layers.transpose(pred, perm=[0, 2, 3, 1]) + pred = fluid.layers.reshape(pred, [-1, num_classes]) + + pred2 = fluid.layers.transpose(pred2, perm=[0, 2, 3, 1]) + pred2 = fluid.layers.reshape(pred2, [-1, num_classes]) + + pred3 = fluid.layers.transpose(pred3, perm=[0, 2, 3, 1]) + pred3 = fluid.layers.reshape(pred3, [-1, num_classes]) + + label = fluid.layers.reshape(label, [-1, 1]) + + # loss1 = fluid.layers.softmax_with_cross_entropy(pred, label, ignore_index=255) + # 以上方式会出现loss为NaN的情况 + pred = fluid.layers.softmax(pred, use_cudnn=False) + loss1 = fluid.layers.cross_entropy(pred, label, ignore_index=255) + + pred2 = fluid.layers.softmax(pred2, use_cudnn=False) + loss2 = fluid.layers.cross_entropy(pred2, label, ignore_index=255) + + pred3 = fluid.layers.softmax(pred3, use_cudnn=False) + loss3 = fluid.layers.cross_entropy(pred3, label, ignore_index=255) + + label.stop_gradient = True + return loss1 + loss2 + loss3 + + +def save_model(save_dir, exe, program=None): + if os.path.exists(save_dir): + shutil.rmtree(save_dir, ignore_errors=True) + os.makedirs(save_dir) + # fluid.io.save_persistables(exe, save_dir, program) + fluid.io.save_params(exe, save_dir, program) + print('save: {}'.format(os.path.basename(save_dir))) + else: + os.makedirs(save_dir) + 
fluid.io.save_persistables(exe, save_dir, program) + print('create: {}'.format(os.path.basename(save_dir))) + + +def load_model(save_dir, exe, program=None): + if os.path.exists(save_dir): + # fluid.io.load_persistables(exe, save_dir, program) + fluid.io.load_params(exe, save_dir, program) + print('Load successful!') + else: + raise Exception('Please check the model path!') + + +def optimizer_setting(args): + if args.weight_decay is not None: + regular = fluid.regularizer.L2Decay(regularization_coeff=args.weight_decay) + else: + regular = None + if args.lr_scheduler == 'poly': + lr_scheduler = Lr(lr_policy='poly', + base_lr=args.lr, + epoch_nums=args.epoch_num, + step_per_epoch=args.step_per_epoch, + power=args.lr_pow, + warm_up=args.warm_up, + warmup_epoch=args.warmup_epoch) + decayed_lr = lr_scheduler.get_lr() + elif args.lr_scheduler == 'cosine': + lr_scheduler = Lr(lr_policy='cosine', + base_lr=args.lr, + epoch_nums=args.epoch_num, + step_per_epoch=args.step_per_epoch, + warm_up=args.warm_up, + warmup_epoch=args.warmup_epoch) + decayed_lr = lr_scheduler.get_lr() + elif args.lr_scheduler == 'piecewise': + lr_scheduler = Lr(lr_policy='piecewise', + base_lr=args.lr, + epoch_nums=args.epoch_num, + step_per_epoch=args.step_per_epoch, + warm_up=args.warm_up, + warmup_epoch=args.warmup_epoch, + decay_epoch=[50, 100, 150], + gamma=0.1) + decayed_lr = lr_scheduler.get_lr() + else: + decayed_lr = args.lr + return fluid.optimizer.MomentumOptimizer(learning_rate=decayed_lr, + momentum=args.momentum, + regularization=regular) + + +def main(args): + image_shape = args.crop_size + image = fluid.layers.data(name='image', shape=[3, image_shape, image_shape], dtype='float32') + label = fluid.layers.data(name='label', shape=[image_shape, image_shape], dtype='int64') + + batch_size = args.batch_size + epoch_num = args.epoch_num + num_classes = args.num_classes + data_root = args.data_folder + if args.cuda: + num = fluid.core.get_cuda_device_count() + print('The number of GPU: {}'.format(num)) + else: + num = _cpu_num() + print('The number of CPU: {}'.format(num)) + + # program + start_prog = fluid.default_startup_program() + train_prog = fluid.default_main_program() + + start_prog.random_seed = args.seed + train_prog.random_seed = args.seed + np.random.seed(args.seed) + random.seed(args.seed) + + # clone + test_prog = train_prog.clone(for_test=True) + + logging.basicConfig(level=logging.INFO, + filename='DANet_{}_train_executor.log'.format(args.backbone), + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s') + logging.info('DANet') + logging.info(args) + + with fluid.program_guard(train_prog, start_prog): + with fluid.unique_name.guard(): + train_py_reader = fluid.io.PyReader(feed_list=[image, label], + capacity=64, + use_double_buffer=True, + iterable=False) + train_data = cityscapes_train(data_root=data_root, + base_size=args.base_size, + crop_size=args.crop_size, + scale=args.scale, + xmap=True, + batch_size=batch_size, + gpu_num=num) + batch_train_data = paddle.batch(paddle.reader.shuffle( + train_data, buf_size=batch_size * 16), + batch_size=batch_size, + drop_last=True) + train_py_reader.decorate_sample_list_generator(batch_train_data) + + model = get_model(args) + pred, pred2, pred3 = model(image) + train_loss = loss_fn(pred, pred2, pred3, label, num_classes=num_classes) + train_avg_loss = fluid.layers.mean(train_loss) + optimizer = optimizer_setting(args) + optimizer.minimize(train_avg_loss) + # miou不是真实的 + miou, wrong, correct = mean_iou(pred, label, num_classes=num_classes) + + with 
fluid.program_guard(test_prog, start_prog): + with fluid.unique_name.guard(): + test_py_reader = fluid.io.PyReader(feed_list=[image, label], + capacity=64, + iterable=False, + use_double_buffer=True) + val_data = cityscapes_val(data_root=data_root, + base_size=args.base_size, + crop_size=args.crop_size, + scale=args.scale, + xmap=True) + batch_test_data = paddle.batch(val_data, + batch_size=batch_size, + drop_last=True) + test_py_reader.decorate_sample_list_generator(batch_test_data) + + model = get_model(args) + pred, pred2, pred3 = model(image) + test_loss = loss_fn(pred, pred2, pred3, label, num_classes=num_classes) + test_avg_loss = fluid.layers.mean(test_loss) + # miou不是真实的 + miou, wrong, correct = mean_iou(pred, label, num_classes=num_classes) + + place = fluid.CUDAPlace(0) if args.cuda else fluid.CPUPlace() + exe = fluid.Executor(place) + exe.run(start_prog) + + if args.use_data_parallel and args.cuda: + exec_strategy = fluid.ExecutionStrategy() + exec_strategy.num_threads = fluid.core.get_cuda_device_count() + exec_strategy.num_iteration_per_drop_scope = 100 + build_strategy = fluid.BuildStrategy() + build_strategy.sync_batch_norm = True + print("sync_batch_norm = True!") + compiled_train_prog = fluid.compiler.CompiledProgram(train_prog).with_data_parallel( + loss_name=train_avg_loss.name, + build_strategy=build_strategy, + exec_strategy=exec_strategy) + else: + compiled_train_prog = fluid.compiler.CompiledProgram(train_prog) + + # 加载预训练模型 + if args.load_pretrained_model: + assert os.path.exists(args.save_model), "your input save_model: {} ,but '{}' is not exists".format( + args.save_model, args.save_model) + load_model(args.save_model, exe, program=train_prog) + print('load pretrained model!') + + # 加载最优模型 + if args.load_better_model: + assert os.path.exists(args.save_model), "your input save_model: {} ,but '{}' is not exists".format( + args.save_model, args.save_model) + load_model(args.save_model, exe, program=train_prog) + print('load better model!') + + train_iou_manager = fluid.metrics.Accuracy() + train_avg_loss_manager = fluid.metrics.Accuracy() + test_iou_manager = fluid.metrics.Accuracy() + test_avg_loss_manager = fluid.metrics.Accuracy() + better_miou_train = 0 + better_miou_test = 0 + + train_loss_title = 'Train_loss' + test_loss_title = 'Test_loss' + + train_iou_title = 'Train_mIOU' + test_iou_title = 'Test_mIOU' + + plot_loss = Ploter(train_loss_title, test_loss_title) + plot_iou = Ploter(train_iou_title, test_iou_title) + + for epoch in range(epoch_num): + prev_time = datetime.now() + train_avg_loss_manager.reset() + train_iou_manager.reset() + logging.info('training, epoch = {}'.format(epoch + 1)) + train_py_reader.start() + batch_id = 0 + while True: + try: + train_fetch_list = [train_avg_loss, miou, wrong, correct] + train_avg_loss_value, train_iou_value, w, c = exe.run( + program=compiled_train_prog, + fetch_list=train_fetch_list) + + train_iou_manager.update(train_iou_value, weight=int(batch_size * num)) + train_avg_loss_manager.update(train_avg_loss_value, weight=int(batch_size * num)) + batch_train_str = "epoch: {}, batch: {}, train_avg_loss: {:.6f}, " \ + "train_miou: {:.6f}.".format(epoch + 1, + batch_id + 1, + train_avg_loss_value[0], + train_iou_value[0]) + if batch_id % 40 == 0: + logging.info(batch_train_str) + print(batch_train_str) + batch_id += 1 + except fluid.core.EOFException: + train_py_reader.reset() + break + cur_time = datetime.now() + h, remainder = divmod((cur_time - prev_time).seconds, 3600) + m, s = divmod(remainder, 60) + time_str = " Time 
%02d:%02d:%02d" % (h, m, s) + train_str = "epoch: {}, train_avg_loss: {:.6f}, " \ + "train_miou: {:.6f}.".format(epoch + 1, + train_avg_loss_manager.eval()[0], + train_iou_manager.eval()[0]) + print(train_str + time_str + '\n') + logging.info(train_str + time_str) + plot_loss.append(train_loss_title, epoch, train_avg_loss_manager.eval()[0]) + plot_loss.plot('./DANet_loss_executor.jpg') + plot_iou.append(train_iou_title, epoch, train_iou_manager.eval()[0]) + plot_iou.plot('./DANet_miou_executor.jpg') + + # save_model + if better_miou_train < train_iou_manager.eval()[0]: + shutil.rmtree('./checkpoint/DANet_better_train_{:.4f}'.format(better_miou_train), + ignore_errors=True) + better_miou_train = train_iou_manager.eval()[0] + logging.warning( + '-----------train---------------better_train: {:.6f}, epoch: {}, -----------Train model saved successfully!\n'.format( + better_miou_train, epoch + 1)) + save_dir = './checkpoint/DANet_better_train_{:.4f}'.format(better_miou_train) + save_model(save_dir, exe, program=train_prog) + if (epoch + 1) % 5 == 0: + save_dir = './checkpoint/DANet_epoch_train' + save_model(save_dir, exe, program=train_prog) + + # test + test_py_reader.start() + test_iou_manager.reset() + test_avg_loss_manager.reset() + prev_time = datetime.now() + logging.info('testing, epoch = {}'.format(epoch + 1)) + batch_id = 0 + while True: + try: + test_fetch_list = [test_avg_loss, miou, wrong, correct] + test_avg_loss_value, test_iou_value, _, _ = exe.run(program=test_prog, + fetch_list=test_fetch_list) + test_iou_manager.update(test_iou_value, weight=int(batch_size * num)) + test_avg_loss_manager.update(test_avg_loss_value, weight=int(batch_size * num)) + batch_test_str = "epoch: {}, batch: {}, test_avg_loss: {:.6f}, " \ + "test_miou: {:.6f}. ".format(epoch + 1, + batch_id + 1, + test_avg_loss_value[0], + test_iou_value[0]) + if batch_id % 40 == 0: + logging.info(batch_test_str) + print(batch_test_str) + batch_id += 1 + except fluid.core.EOFException: + test_py_reader.reset() + break + cur_time = datetime.now() + h, remainder = divmod((cur_time - prev_time).seconds, 3600) + m, s = divmod(remainder, 60) + time_str = " Time %02d:%02d:%02d" % (h, m, s) + test_str = "epoch: {}, test_avg_loss: {:.6f}, " \ + "test_miou: {:.6f}.".format(epoch + 1, + test_avg_loss_manager.eval()[0], + test_iou_manager.eval()[0]) + print(test_str + time_str + '\n') + logging.info(test_str + time_str) + plot_loss.append(test_loss_title, epoch, test_avg_loss_manager.eval()[0]) + plot_loss.plot('./DANet_loss_executor.jpg') + plot_iou.append(test_iou_title, epoch, test_iou_manager.eval()[0]) + plot_iou.plot('./DANet_miou_executor.jpg') + + # save_model_infer + if better_miou_test < test_iou_manager.eval()[0]: + shutil.rmtree('./checkpoint/infer/DANet_better_test_{:.4f}'.format(better_miou_test), + ignore_errors=True) + better_miou_test = test_iou_manager.eval()[0] + logging.warning( + '------------test-------------infer better_test: {:.6f}, epoch: {}, ----------------Inference model saved successfully!\n'.format( + better_miou_test, epoch + 1)) + save_dir = './checkpoint/infer/DANet_better_test_{:.4f}'.format(better_miou_test) + # save_model(save_dir, exe, program=test_prog) + fluid.io.save_inference_model(save_dir, [image.name], [pred, pred2, pred3], exe) + print('Inference model saved successfully') + + +if __name__ == '__main__': + options = Options() + args = options.parse() + options.print_args() + main(args) + + + diff --git a/PaddleCV/Research/danet/utils/__init__.py 
b/PaddleCV/Research/danet/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..8469aa22358578b58a9161761d987815064060f2 --- /dev/null +++ b/PaddleCV/Research/danet/utils/__init__.py @@ -0,0 +1,24 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from .base import BaseDataSet +from .cityscapes import CityScapes +from .lr_scheduler import Lr +from .cityscapes_data import * +from .voc import VOC +from .voc_data import * diff --git a/PaddleCV/Research/danet/utils/base.py b/PaddleCV/Research/danet/utils/base.py new file mode 100644 index 0000000000000000000000000000000000000000..b1528917f870b5965634b3b31f803866036b539f --- /dev/null +++ b/PaddleCV/Research/danet/utils/base.py @@ -0,0 +1,132 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
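+# BaseDataSet implements the shared image/label preprocessing: synchronized random flip,
+# rescale, pad and crop (plus optional blur and colour jitter) for training, and a fixed
+# rescale with centre crop for validation.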
+from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import random +import numpy as np +from PIL import Image, ImageOps, ImageFilter, ImageEnhance +import os +import sys + +curPath = os.path.abspath(os.path.dirname(__file__)) +parentPath = os.path.split(curPath)[0] +rootPath = os.path.split(parentPath)[0] +sys.path.append(rootPath) + + +class BaseDataSet(object): + + def __init__(self, root, split, base_size=1024, crop_size=768, scale=True): + self.root = root + support = ['train', 'train_val', 'val', 'test'] + assert split in support, "split= \'{}\' not in {}".format(split, support) + self.split = split + self.crop_size = crop_size # 裁剪大小 + self.base_size = base_size # 图片最短边 + self.scale = scale + self.image_path = None + self.label_path = None + + def sync_transform(self, image, label, aug=True): + crop_size = self.crop_size + if self.scale: + short_size = random.randint(int(self.base_size * 0.75), int(self.base_size * 2.0)) + else: + short_size = self.base_size + + # 随机左右翻转 + if random.random() > 0.5: + image = image.transpose(Image.FLIP_LEFT_RIGHT) + label = label.transpose(Image.FLIP_LEFT_RIGHT) + w, h = image.size + + # 同比例缩放 + if h > w: + out_w = short_size + out_h = int(1.0 * h / w * out_w) + else: + out_h = short_size + out_w = int(1.0 * w / h * out_h) + image = image.resize((out_w, out_h), Image.BILINEAR) + label = label.resize((out_w, out_h), Image.NEAREST) + + # 四周填充 + if short_size < crop_size: + pad_h = crop_size - out_h if out_h < crop_size else 0 + pad_w = crop_size - out_w if out_w < crop_size else 0 + image = ImageOps.expand(image, border=(pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2), + fill=0) + label = ImageOps.expand(label, border=(pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2), + fill=255) + + # 随机裁剪 + w, h = image.size + x = random.randint(0, w - crop_size) + y = random.randint(0, h - crop_size) + image = image.crop((x, y, x + crop_size, y + crop_size)) + label = label.crop((x, y, x + crop_size, y + crop_size)) + + if aug: + # 高斯模糊,可选 + if random.random() > 0.7: + image = image.filter(ImageFilter.GaussianBlur(radius=random.random())) + + # 可选 + if random.random() > 0.7: + # 随机亮度 + factor = np.random.uniform(0.75, 1.25) + image = ImageEnhance.Brightness(image).enhance(factor) + + # 颜色抖动 + factor = np.random.uniform(0.75, 1.25) + image = ImageEnhance.Color(image).enhance(factor) + + # 随机对比度 + factor = np.random.uniform(0.75, 1.25) + image = ImageEnhance.Contrast(image).enhance(factor) + + # 随机锐度 + factor = np.random.uniform(0.75, 1.25) + image = ImageEnhance.Sharpness(image).enhance(factor) + return image, label + + def sync_val_transform(self, image, label): + crop_size = self.crop_size + short_size = self.base_size + + w, h = image.size + + # 同比例缩放 + if h > w: + out_w = short_size + out_h = int(1.0 * h / w * out_w) + else: + out_h = short_size + out_w = int(1.0 * w / h * out_h) + image = image.resize((out_w, out_h), Image.BILINEAR) + label = label.resize((out_w, out_h), Image.NEAREST) + + # 中心裁剪 + w, h = image.size + x1 = int(round((w - crop_size) / 2.)) + y1 = int(round((h - crop_size) / 2.)) + image = image.crop((x1, y1, x1 + crop_size, y1 + crop_size)) + label = label.crop((x1, y1, x1 + crop_size, y1 + crop_size)) + return image, label + + def eval(self, image): + pass diff --git a/PaddleCV/Research/danet/utils/cityscapes.py b/PaddleCV/Research/danet/utils/cityscapes.py new file mode 100644 index 
0000000000000000000000000000000000000000..5c7ee431c11c8d9eb0cdd9d3bd976156d5883766 --- /dev/null +++ b/PaddleCV/Research/danet/utils/cityscapes.py @@ -0,0 +1,79 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os +from utils.base import BaseDataSet + + +class CityScapes(BaseDataSet): + """prepare cityscapes path_pairs""" + + BASE_DIR = 'cityscapes' + NUM_CLASS = 19 + + def __init__(self, root='./dataset', split='train', **kwargs): + super(CityScapes, self).__init__(root, split, **kwargs) + if os.sep == '\\': # windows + root = root.replace('/', '\\') + + root = os.path.join(root, self.BASE_DIR) + assert os.path.exists(root), "please download cityscapes data_set, put in dataset(dir),or check root" + self.image_path, self.label_path = self._get_cityscapes_pairs(root, split) + assert len(self.image_path) == len(self.label_path), "please check image_length = label_length" + self.print_param() + + def print_param(self): # 用于核对当前数据集的信息 + print('INFO: dataset_root: {}, split: {}, ' + 'base_size: {}, crop_size: {}, scale: {}, ' + 'image_length: {}, label_length: {}'.format(self.root, self.split, self.base_size, + self.crop_size, self.scale, len(self.image_path), + len(self.label_path))) + + @staticmethod + def _get_cityscapes_pairs(root, split): + + def get_pairs(root, file_image, file_label): + file_image = os.path.join(root, file_image) + file_label = os.path.join(root, file_label) + with open(file_image, 'r') as f: + file_list_image = f.read().split() + with open(file_label, 'r') as f: + file_list_label = f.read().split() + if os.sep == '\\': # for windows + image_path = [os.path.join(root, x.replace('/', '\\')) for x in file_list_image] + label_path = [os.path.join(root, x.replace('/', '\\')) for x in file_list_label] + else: + image_path = [os.path.join(root, x) for x in file_list_image] + label_path = [os.path.join(root, x) for x in file_list_label] + return image_path, label_path + + if split == 'train': + image_path, label_path = get_pairs(root, 'trainImages.txt', 'trainLabels.txt') + elif split == 'val': + image_path, label_path = get_pairs(root, 'valImages.txt', 'valLabels.txt') + elif split == 'test': + image_path, label_path = get_pairs(root, 'testImages.txt', 'testLabels.txt') # 返回文件路径,test_label并不存在 + else: # 'train_val' + image_path1, label_path1 = get_pairs(root, 'trainImages.txt', 'trainLabels.txt') + image_path2, label_path2 = get_pairs(root, 'valImages.txt', 'valLabels.txt') + image_path, label_path = image_path1+image_path2, label_path1+label_path2 + return image_path, label_path + + def get_path_pairs(self): + return self.image_path, self.label_path + diff --git a/PaddleCV/Research/danet/utils/cityscapes_data.py b/PaddleCV/Research/danet/utils/cityscapes_data.py new file mode 100644 index 0000000000000000000000000000000000000000..e96534cf31d5f5a2e226435d527515bef7bd8f03 
--- /dev/null +++ b/PaddleCV/Research/danet/utils/cityscapes_data.py @@ -0,0 +1,144 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import random +import paddle +import numpy as np + +from PIL import Image + +from utils.cityscapes import CityScapes + +__all__ = ['cityscapes_train', 'cityscapes_val', 'cityscapes_train_val', 'cityscapes_test'] + +# globals +data_mean = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1) +data_std = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1) + + +def mapper_train(sample): + image_path, label_path, city = sample + image = Image.open(image_path, mode='r').convert('RGB') + label = Image.open(label_path, mode='r') + + image, label = city.sync_transform(image, label) + image_array = np.array(image) # HWC + label_array = np.array(label) # HW + + image_array = image_array.transpose((2, 0, 1)) # CHW + image_array = image_array / 255.0 + image_array = (image_array - data_mean) / data_std + image_array = image_array.astype('float32') + label_array = label_array.astype('int64') + return image_array, label_array + + +def mapper_val(sample): + image_path, label_path, city = sample + image = Image.open(image_path, mode='r').convert('RGB') + label = Image.open(label_path, mode='r') + + image, label = city.sync_val_transform(image, label) + image_array = np.array(image) # HWC + label_array = np.array(label) # HW + + image_array = image_array.transpose((2, 0, 1)) # CHW + image_array = image_array / 255.0 + image_array = (image_array - data_mean) / data_std + image_array = image_array.astype('float32') + label_array = label_array.astype('int64') + return image_array, label_array + + +def mapper_test(sample): + image_path, label_path = sample # label is path + image = Image.open(image_path, mode='r').convert('RGB') + image_array = image + return image_array, label_path # image is a picture, label is path + + +# root, base_size, crop_size; gpu_num必须设置,否则syncBN会出现某些卡没有数据的情况 +def cityscapes_train(data_root='./dataset', base_size=1024, crop_size=768, scale=True, xmap=True, batch_size=1, gpu_num=1): + city = CityScapes(root=data_root, split='train', base_size=base_size, crop_size=crop_size, scale=scale) + image_path, label_path = city.get_path_pairs() + + def reader(): + if len(image_path) % (batch_size * gpu_num) != 0: + length = (len(image_path) // (batch_size * gpu_num)) * (batch_size * gpu_num) + else: + length = len(image_path) + for i in range(length): + if i == 0: + cc = list(zip(image_path, label_path)) + random.shuffle(cc) + image_path[:], label_path[:] = zip(*cc) + yield image_path[i], label_path[i], city + if xmap: + return paddle.reader.xmap_readers(mapper_train, reader, 4, 32) + else: + return paddle.reader.map_readers(mapper_train, reader) + + +def cityscapes_val(data_root='./dataset', base_size=1024, crop_size=768, scale=True, xmap=True): + city = 
CityScapes(root=data_root, split='val', base_size=base_size, crop_size=crop_size, scale=scale) + image_path, label_path = city.get_path_pairs() + + def reader(): + for i in range(len(image_path)): + yield image_path[i], label_path[i], city + + if xmap: + return paddle.reader.xmap_readers(mapper_val, reader, 4, 32) + else: + return paddle.reader.map_readers(mapper_val, reader) + + +def cityscapes_train_val(data_root='./dataset', base_size=1024, crop_size=768, scale=True, xmap=True, batch_size=1, gpu_num=1): + city = CityScapes(root=data_root, split='train_val', base_size=base_size, crop_size=crop_size, scale=scale) + image_path, label_path = city.get_path_pairs() + + def reader(): + if len(image_path) % (batch_size * gpu_num) != 0: + length = (len(image_path) // (batch_size * gpu_num)) * (batch_size * gpu_num) + else: + length = len(image_path) + for i in range(length): + if i == 0: + cc = list(zip(image_path, label_path)) + random.shuffle(cc) + image_path[:], label_path[:] = zip(*cc) + yield image_path[i], label_path[i], city + + if xmap: + return paddle.reader.xmap_readers(mapper_train, reader, 4, 32) + else: + return paddle.reader.map_readers(mapper_train, reader) + + +def cityscapes_test(split='test', base_size=2048, crop_size=1024, scale=True, xmap=True): + # 实际未使用base_size, crop_size, scale + city = CityScapes(split=split, base_size=base_size, crop_size=crop_size, scale=scale) + image_path, label_path = city.get_path_pairs() + + def reader(): + for i in range(len(image_path)): + yield image_path[i], label_path[i] + if xmap: + return paddle.reader.xmap_readers(mapper_test, reader, 4, 32) + else: + return paddle.reader.map_readers(mapper_test, reader) diff --git a/PaddleCV/Research/danet/utils/lr_scheduler.py b/PaddleCV/Research/danet/utils/lr_scheduler.py new file mode 100644 index 0000000000000000000000000000000000000000..4ce8316a43536aef9414ca5e40a4e8a5ccb63aba --- /dev/null +++ b/PaddleCV/Research/danet/utils/lr_scheduler.py @@ -0,0 +1,152 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
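As an aside on the readers defined in `cityscapes_data.py` above: they yield single `(CHW float32 image, HW int64 label)` samples, so a training script still has to batch them. A minimal consumption sketch follows; the batch size and GPU count are illustrative assumptions, not values from the patch.

```python
# Minimal sketch, not part of the patch: wrap the sample-level reader in
# paddle.batch. batch_size / gpu_num are illustrative; gpu_num should match
# the number of devices so that every card receives data (see the comment
# on syncBN in cityscapes_data.py).
import paddle
from utils.cityscapes_data import cityscapes_train

train_reader = paddle.batch(
    cityscapes_train(data_root='./dataset', base_size=1024, crop_size=768,
                     scale=True, xmap=True, batch_size=2, gpu_num=4),
    batch_size=2 * 4, drop_last=True)

for batch in train_reader():
    image, label = batch[0]          # first sample of the batch
    print(image.shape, label.shape)  # e.g. (3, 768, 768) and (768, 768) for crop_size=768
    break
```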
+from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import paddle.fluid as fluid +import math + + +class Lr(object): + """ + 示例:使用poly策略, 有热身, + lr_scheduler = Lr(lr_policy='poly', base_lr=0.003, epoch_nums=200, step_per_epoch=20, + warm_up=True, warmup_epoch=11) + lr = lr_scheduler.get_lr() + + 示例:使用cosine策略, 有热身, + lr_scheduler = Lr(lr_policy='cosine', base_lr=0.003, epoch_nums=200, step_per_epoch=20, + warm_up=True, warmup_epoch=11) + lr = lr_scheduler.get_lr() + + 示例:使用piecewise策略, 有热身,必须设置边界(decay_epoch list), gamma系数默认0.1 + lr_scheduler = Lr(lr_policy='piecewise', base_lr=0.003, epoch_nums=200, step_per_epoch=20, + warm_up=True, warmup_epoch=11, decay_epoch=[50], gamma=0.1) + lr = lr_scheduler.get_lr() + """ + def __init__(self, lr_policy, base_lr, epoch_nums, step_per_epoch, + power=0.9, end_lr=0.0, gamma=0.1, decay_epoch=[], + warm_up=False, warmup_epoch=0): + support_lr_policy = ['poly', 'piecewise', 'cosine'] + assert lr_policy in support_lr_policy, "Only support poly, piecewise, cosine" + self.lr_policy = lr_policy # 学习率衰减策略 : str(`cosine`, `poly`, `piecewise`) + + assert base_lr >= 0, "Start learning rate should greater than 0" + self.base_lr = base_lr # 基础学习率: float + + assert end_lr >= 0, "End learning rate should greater than 0" + self.end_lr = end_lr # 学习率终点: float + + assert epoch_nums, "epoch_nums should greater than 0" + assert step_per_epoch, "step_per_epoch should greater than 0" + + self.epoch_nums = epoch_nums # epoch数: int + self.step_per_epoch = step_per_epoch # 每个epoch的迭代数: int + self.total_step = epoch_nums * step_per_epoch # 总的迭代数 :auto + self.power = power # 指数: float + self.gamma = gamma # 分段衰减的系数: float + self.decay_epoch = decay_epoch # 分段衰减的epoch: list + if self.lr_policy == 'piecewise': + assert len(decay_epoch) >= 1, "use piecewise policy, should set decay_epoch list" + self.warm_up = warm_up # 是否热身:bool + if self.warm_up: + assert warmup_epoch, "warmup_epoch should greater than 0" + assert warmup_epoch < epoch_nums, "warmup_epoch should less than epoch_nums" + self.warmup_epoch = warmup_epoch + self.warmup_steps = warmup_epoch * step_per_epoch # 热身steps:int(epoch*step_per_epoch) + + def _piecewise_decay(self): + gamma = self.gamma + bd = [self.step_per_epoch * e for e in self.decay_epoch] + lr = [self.base_lr * (gamma ** i) for i in range(len(bd) + 1)] + decayed_lr = fluid.layers.piecewise_decay(boundaries=bd, values=lr) + return decayed_lr + + def _poly_decay(self): + decayed_lr = fluid.layers.polynomial_decay( + self.base_lr, self.total_step, end_learning_rate=self.end_lr, power=self.power) + return decayed_lr + + def _cosine_decay(self): + decayed_lr = fluid.layers.cosine_decay( + self.base_lr, self.step_per_epoch, self.epoch_nums) + return decayed_lr + + def get_lr(self): + if self.lr_policy.lower() == 'poly': + if self.warm_up: + warm_up_end_lr = (self.base_lr - self.end_lr) * pow( + (1 - self.warmup_steps / self.total_step), self.power) + self.end_lr + print('poly warm_up_end_lr:', warm_up_end_lr) + decayed_lr = fluid.layers.linear_lr_warmup(self._poly_decay(), + warmup_steps=self.warmup_steps, + start_lr=0.0, + end_lr=warm_up_end_lr) + else: + decayed_lr = self._poly_decay() + elif self.lr_policy.lower() == 'piecewise': + if self.warm_up: + assert self.warmup_steps < self.decay_epoch[0] * self.step_per_epoch + warm_up_end_lr = self.base_lr + print('piecewise warm_up_end_lr:', warm_up_end_lr) + decayed_lr = fluid.layers.linear_lr_warmup(self._piecewise_decay(), + 
warmup_steps=self.warmup_steps, + start_lr=0.0, + end_lr=warm_up_end_lr) + else: + decayed_lr = self._piecewise_decay() + elif self.lr_policy.lower() == 'cosine': + if self.warm_up: + warm_up_end_lr = self.base_lr*0.5*(math.cos(self.warmup_epoch*math.pi/self.epoch_nums)+1) + print('cosine warm_up_end_lr:', warm_up_end_lr) + decayed_lr = fluid.layers.linear_lr_warmup(self._cosine_decay(), + warmup_steps=self.warmup_steps, + start_lr=0.0, + end_lr=warm_up_end_lr) + else: + decayed_lr = self._cosine_decay() + else: + raise Exception( + "unsupport learning decay policy! only support poly,piecewise,cosine" + ) + return decayed_lr + + +if __name__ == '__main__': + epoch_nums = 200 + step_per_epoch = 180 + base_lr = 0.003 + warmup_epoch = 5 # 热身数 + lr_scheduler = Lr(lr_policy='poly', base_lr=base_lr, epoch_nums=epoch_nums, step_per_epoch=step_per_epoch, + warm_up=True, warmup_epoch=warmup_epoch, decay_epoch=[50]) + lr = lr_scheduler.get_lr() + exe = fluid.Executor(fluid.CPUPlace()) + exe.run(fluid.default_startup_program()) + + lr_list = [] + for epoch in range(epoch_nums): + for i in range(step_per_epoch): + x = exe.run(fluid.default_main_program(), + fetch_list=[lr]) + lr_list.append(x[0]) + # print(x[0]) + # 绘图 + from matplotlib import pyplot as plt + plt.plot(range(epoch_nums*step_per_epoch), lr_list) + plt.xlabel('step') + plt.ylabel('lr') + plt.show() + diff --git a/PaddleCV/Research/danet/utils/voc.py b/PaddleCV/Research/danet/utils/voc.py new file mode 100644 index 0000000000000000000000000000000000000000..01021ec01f6e0e96df65e6af50863db96e400eef --- /dev/null +++ b/PaddleCV/Research/danet/utils/voc.py @@ -0,0 +1,101 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
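A note on the `Lr` scheduler above: `get_lr()` returns a learning-rate Variable built from `fluid.layers.*` schedules, so in a real training script it would normally be handed straight to an optimizer rather than fetched on its own as in the `__main__` demo. The sketch below assumes a Momentum optimizer; any Fluid optimizer accepts the variable in the same way.

```python
# Minimal sketch, not part of the patch: plug the scheduled learning rate
# into an optimizer. The optimizer choice and hyper-parameters are
# illustrative assumptions.
import paddle.fluid as fluid
from utils.lr_scheduler import Lr

lr_scheduler = Lr(lr_policy='poly', base_lr=0.003, epoch_nums=200,
                  step_per_epoch=180, warm_up=True, warmup_epoch=5)
optimizer = fluid.optimizer.Momentum(
    learning_rate=lr_scheduler.get_lr(), momentum=0.9)
# optimizer.minimize(loss) would follow inside the training program guard.
```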
+from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os + +from utils.base import BaseDataSet + + +class VOC(BaseDataSet): + """prepare pascalVOC path_pairs""" + BASE_DIR = 'VOC2012_SBD' + NUM_CLASS = 21 + + def __init__(self, root='../dataset', split='train', **kwargs): + super(VOC, self).__init__(root, split, **kwargs) + if os.sep == '\\': # windows + root = root.replace('/', '\\') + + root = os.path.join(root, self.BASE_DIR) + assert os.path.exists(root), "please download voc2012 data_set, put in dataset(dir)" + if split == 'test': + self.image_path = self._get_cityscapes_pairs(root, split) + else: + self.image_path, self.label_path = self._get_cityscapes_pairs(root, split) + if self.label_path is None: + pass + else: + assert len(self.image_path) == len(self.label_path), "please check image_length = label_length" + self.print_param() + + def print_param(self): # 用于核对当前数据集的信息 + if self.label_path is None: + print('INFO: dataset_root: {}, split: {}, ' + 'base_size: {}, crop_size: {}, scale: {}, ' + 'image_length: {}'.format(self.root, self.split, self.base_size, + self.crop_size, self.scale, len(self.image_path))) + else: + print('INFO: dataset_root: {}, split: {}, ' + 'base_size: {}, crop_size: {}, scale: {}, ' + 'image_length: {}, label_length: {}'.format(self.root, self.split, self.base_size, + self.crop_size, self.scale, len(self.image_path), + len(self.label_path))) + + @staticmethod + def _get_cityscapes_pairs(root, split): + + def get_pairs(root, file): + if file.find('test') == -1: + file = os.path.join(root, file) + with open(file, 'r') as f: + file_list = f.readlines() + if os.sep == '\\': # for windows + image_path = [ + os.path.join(root, 'pascal', 'VOC2012', x.split()[0][1:].replace('/', '\\').replace('\n', '')) + for x in file_list] + label_path = [os.path.join(root, 'pascal', 'VOC2012', x.split()[1][1:].replace('/', '\\')) for x in + file_list] + else: + image_path = [os.path.join(root, 'pascal', 'VOC2012', x.split()[0][1:]) for x in file_list] + label_path = [os.path.join(root, 'pascal', 'VOC2012', x.split()[1][1:]) for x in file_list] + return image_path, label_path + else: + file = os.path.join(root, file) + with open(file, 'r') as f: + file_list = f.readlines() + if os.sep == '\\': # for windows + image_path = [ + os.path.join(root, 'pascal', 'VOC2012', x.split()[0][1:].replace('/', '\\').replace('\n', '')) + for x in file_list] + else: + image_path = [os.path.join(root, 'pascal', 'VOC2012', x.split()[0][1:]) for x in file_list] + return image_path + + if split == 'train': + image_path, label_path = get_pairs(root, 'list/train_aug.txt') + elif split == 'val': + image_path, label_path = get_pairs(root, 'list/val.txt') + elif split == 'test': + image_path = get_pairs(root, 'list/test.txt') # 返回文件路径,test_label并不存在 + return image_path + else: # 'train_val' + image_path, label_path = get_pairs(root, 'list/trainval_aug.txt') + return image_path, label_path + + def get_path_pairs(self): + return self.image_path, self.label_path diff --git a/PaddleCV/Research/danet/utils/voc_data.py b/PaddleCV/Research/danet/utils/voc_data.py new file mode 100644 index 0000000000000000000000000000000000000000..d2dba4f9135dc80fd9c015ea7a7c3bde1af5b0e1 --- /dev/null +++ b/PaddleCV/Research/danet/utils/voc_data.py @@ -0,0 +1,144 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import random +import paddle +import numpy as np + +from PIL import Image + +from utils.voc import VOC + +__all__ = ['voc_train', 'voc_val', 'voc_train_val', 'voc_test'] + +# globals +data_mean = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1) +data_std = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1) + + +def mapper_train(sample): + image_path, label_path, voc = sample + image = Image.open(image_path, mode='r').convert('RGB') + label = Image.open(label_path, mode='r') + + image, label = voc.sync_transform(image, label) + image_array = np.array(image) # HWC + label_array = np.array(label) # HW + + image_array = image_array.transpose((2, 0, 1)) # CHW + image_array = image_array / 255.0 + image_array = (image_array - data_mean) / data_std + image_array = image_array.astype('float32') + label_array = label_array.astype('int64') + return image_array, label_array + + +def mapper_val(sample): + image_path, label_path, city = sample + image = Image.open(image_path, mode='r').convert('RGB') + label = Image.open(label_path, mode='r') + + image, label = city.sync_val_transform(image, label) + image_array = np.array(image) + label_array = np.array(label) + + image_array = image_array.transpose((2, 0, 1)) + image_array = image_array / 255.0 + image_array = (image_array - data_mean) / data_std + image_array = image_array.astype('float32') + label_array = label_array.astype('int64') + return image_array, label_array + + +def mapper_test(sample): + image_path, label_path = sample # label is path + image = Image.open(image_path, mode='r').convert('RGB') + image_array = image + return image_array, label_path # label is path + + +# 已完成, 引用时记得传入参数,root, base_size, crop_size等, gpu_num必须设置,否则syncBN会出现某些卡没有数据的情况 +def voc_train(data_root='../dataset', base_size=768, crop_size=576, scale=True, xmap=True, batch_size=1, gpu_num=1): + voc = VOC(root=data_root, split='train', base_size=base_size, crop_size=crop_size, scale=scale) + image_path, label_path = voc.get_path_pairs() + + def reader(): + if len(image_path) % (batch_size * gpu_num) != 0: + length = (len(image_path) // (batch_size * gpu_num)) * (batch_size * gpu_num) + else: + length = len(image_path) + for i in range(length): + if i == 0: + cc = list(zip(image_path, label_path)) + random.shuffle(cc) + image_path[:], label_path[:] = zip(*cc) + yield image_path[i], label_path[i], voc + if xmap: + return paddle.reader.xmap_readers(mapper_train, reader, 4, 32) + else: + return paddle.reader.map_readers(mapper_train, reader) + + +def voc_val(data_root='../dataset', base_size=768, crop_size=576, scale=True, xmap=True): + voc = VOC(root=data_root, split='val', base_size=base_size, crop_size=crop_size, scale=scale) + image_path, label_path = voc.get_path_pairs() + + def reader(): + for i in range(len(image_path)): + yield image_path[i], label_path[i], voc + + if xmap: + return 
paddle.reader.xmap_readers(mapper_val, reader, 4, 32) + else: + return paddle.reader.map_readers(mapper_val, reader) + + +def voc_train_val(data_root='./dataset', base_size=768, crop_size=576, scale=True, xmap=True, batch_size=1, gpu_num=1): + voc = VOC(root=data_root, split='train_val', base_size=base_size, crop_size=crop_size, scale=scale) + image_path, label_path = voc.get_path_pairs() + + def reader(): + if len(image_path) % (batch_size * gpu_num) != 0: + length = (len(image_path) // (batch_size * gpu_num)) * (batch_size * gpu_num) + else: + length = len(image_path) + for i in range(length): + if i == 0: + cc = list(zip(image_path, label_path)) + random.shuffle(cc) + image_path[:], label_path[:] = zip(*cc) + yield image_path[i], label_path[i] + + if xmap: + return paddle.reader.xmap_readers(mapper_train, reader, 4, 32) + else: + return paddle.reader.map_readers(mapper_train, reader) + + +def voc_test(split='test', base_size=2048, crop_size=1024, scale=True, xmap=True): + # 实际未使用base_size, crop_size, scale + voc = VOC(split=split, base_size=base_size, crop_size=crop_size, scale=scale) + image_path = voc.get_path_pairs() + + def reader(): + for i in range(len(image_path[:1])): + yield image_path[i], image_path[i] + if xmap: + return paddle.reader.xmap_readers(mapper_test, reader, 4, 32) + else: + return paddle.reader.map_readers(mapper_test, reader) diff --git a/PaddleCV/deeplabv3+/reader.py b/PaddleCV/deeplabv3+/reader.py index 810b5d2d6840e295f98ff6a388bd45e89d9fa938..2ebc7e9084ad481de58bfdc790a56006b7e9c62e 100644 --- a/PaddleCV/deeplabv3+/reader.py +++ b/PaddleCV/deeplabv3+/reader.py @@ -28,6 +28,10 @@ default_config = { "crop_size": 769, } +# used for ce +if 'ce_mode' in os.environ: + np.random.seed(0) + def slice_with_pad(a, s, value=0): pads = [] diff --git a/PaddleCV/deeplabv3+/train.py b/PaddleCV/deeplabv3+/train.py index e0a1f10b8d3caecf504b95230d4fa8ac9e2e9fd9..06860048d2d543836c1fd3d1941e4207eb4e4dde 100755 --- a/PaddleCV/deeplabv3+/train.py +++ b/PaddleCV/deeplabv3+/train.py @@ -39,6 +39,7 @@ set_paddle_flags({ import paddle import paddle.fluid as fluid +from paddle.fluid import profiler import numpy as np import argparse from reader import CityscapeDataset @@ -70,6 +71,8 @@ add_arg('profile', bool, False, "Enable profiler.") add_arg('use_py_reader', bool, True, "Use py reader.") add_arg('use_multiprocessing', bool, False, "Use multiprocessing.") add_arg("num_workers", int, 8, "The number of python processes used to read and preprocess data.") +# NOTE: args for profiler, used for benchmark +add_arg("profiler_path", str, '/tmp/profile_file2', "the profiler output file path. 
(used for benchmark)") parser.add_argument( '--enable_ce', action='store_true', @@ -79,7 +82,7 @@ parser.add_argument( @contextlib.contextmanager def profile_context(profile=True): if profile: - with profiler.profiler('All', 'total', '/tmp/profile_file2'): + with profiler.profiler('All', 'total', args.profiler_path): yield else: yield @@ -142,12 +145,6 @@ deeplabv3p = models.deeplabv3p sp = fluid.Program() tp = fluid.Program() -# only for ce -if args.enable_ce: - SEED = 102 - sp.random_seed = SEED - tp.random_seed = SEED - crop_size = args.train_crop_size batch_size = args.batch_size image_shape = [crop_size, crop_size] @@ -159,9 +156,16 @@ weight_decay = 0.00004 base_lr = args.base_lr total_step = args.total_step +# only for ce +if args.enable_ce: + SEED = 102 + sp.random_seed = SEED + tp.random_seed = SEED + reader.default_config['shuffle'] = False + with fluid.program_guard(tp, sp): if args.use_py_reader: - batch_size_each = batch_size // fluid.core.get_cuda_device_count() + batch_size_each = batch_size // utility.get_device_count() py_reader = fluid.layers.py_reader(capacity=64, shapes=[[batch_size_each, 3] + image_shape, [batch_size_each] + image_shape], dtypes=['float32', 'int32']) @@ -194,7 +198,7 @@ with fluid.program_guard(tp, sp): exec_strategy = fluid.ExecutionStrategy() -exec_strategy.num_threads = fluid.core.get_cuda_device_count() +exec_strategy.num_threads = utility.get_device_count() exec_strategy.num_iteration_per_drop_scope = 100 build_strategy = fluid.BuildStrategy() if args.memory_optimize: @@ -222,11 +226,11 @@ else: binary = fluid.compiler.CompiledProgram(tp) if args.use_py_reader: - assert(batch_size % fluid.core.get_cuda_device_count() == 0) + assert(batch_size % utility.get_device_count() == 0) def data_gen(): batches = dataset.get_batch_generator( - batch_size // fluid.core.get_cuda_device_count(), - total_step * fluid.core.get_cuda_device_count(), + batch_size // utility.get_device_count(), + total_step * utility.get_device_count(), use_multiprocessing=args.use_multiprocessing, num_workers=args.num_workers) for b in batches: yield b[0], b[1] @@ -252,6 +256,7 @@ with profile_context(args.profile): train_loss = np.mean(train_loss) end_time = time.time() total_time += end_time - begin_time + if i % 100 == 0: print("Model is saved to", args.save_weights_path) save_model() @@ -262,7 +267,7 @@ print("Training done. 
Model is saved to", args.save_weights_path) save_model() if args.enable_ce: - gpu_num = fluid.core.get_cuda_device_count() + gpu_num = utility.get_device_count() print("kpis\teach_pass_duration_card%s\t%s" % (gpu_num, total_time / epoch_idx)) print("kpis\ttrain_loss_card%s\t%s" % (gpu_num, train_loss)) diff --git a/PaddleCV/deeplabv3+/utility.py b/PaddleCV/deeplabv3+/utility.py index 8d9bcfb3b4cde1f0c54aca4c8d0c61ca2e955d7b..ce1bd1e683870560684212b89375b6f0f893c4b4 100644 --- a/PaddleCV/deeplabv3+/utility.py +++ b/PaddleCV/deeplabv3+/utility.py @@ -78,3 +78,12 @@ def check_gpu(use_gpu): sys.exit(1) except Exception as e: pass + + +def get_device_count(): + try: + device_num = max(fluid.core.get_cuda_device_count(), 1) + except: + device_num = 1 + + return device_num diff --git a/PaddleCV/face_detection/reader.py b/PaddleCV/face_detection/reader.py index 970a88be9288ad973ab8a2e2c27ff775c8147675..4a1e5fe6d5ba1a68b80ac7ab82b269878a45be20 100644 --- a/PaddleCV/face_detection/reader.py +++ b/PaddleCV/face_detection/reader.py @@ -97,7 +97,10 @@ def preprocess(img, bbox_labels, mode, settings, image_path): # sampling batch_sampler = [] - + # used for continuous evaluation + if 'ce_mode' in os.environ: + random.seed(0) + np.random.seed(0) prob = np.random.uniform(0., 1.) if prob > settings.data_anchor_sampling_prob: scale_array = np.array([16, 32, 64, 128, 256, 512]) @@ -229,7 +232,7 @@ def expand_bboxes(bboxes, def train_generator(settings, file_list, batch_size, shuffle=True): def reader(): - if shuffle: + if shuffle and 'ce_mode' not in os.environ: np.random.shuffle(file_list) batch_out = [] for item in file_list: diff --git a/PaddleCV/face_detection/train.py b/PaddleCV/face_detection/train.py index c74cc8c62c63a0aca7c831f05e4b5cfaca1f7b92..721c52b4df9e96425a884e48d9eba549b750242a 100644 --- a/PaddleCV/face_detection/train.py +++ b/PaddleCV/face_detection/train.py @@ -150,6 +150,7 @@ def train(args, config, train_params, train_file_list): #only for ce if args.enable_ce: + is_shuffle = False SEED = 102 startup_prog.random_seed = SEED train_prog.random_seed = SEED diff --git a/PaddleCV/face_detection/widerface_eval.py b/PaddleCV/face_detection/widerface_eval.py index 27a0d28562ff3172517d2ce5e679f1ad18d72927..80201486b70d49689a9063b518efd6d4223edefd 100644 --- a/PaddleCV/face_detection/widerface_eval.py +++ b/PaddleCV/face_detection/widerface_eval.py @@ -22,11 +22,13 @@ import argparse import functools from PIL import Image + def set_paddle_flags(**kwargs): for key, value in kwargs.items(): if os.environ.get(key, None) is None: os.environ[key] = str(value) + # NOTE(paddle-dev): All of these flags should be # set before `import paddle`. Otherwise, it would # not take any effect. 
@@ -34,7 +36,6 @@ set_paddle_flags( FLAGS_eager_delete_tensor_gb=0, # enable GC to save memory ) - import paddle.fluid as fluid import reader from pyramidbox import PyramidBox @@ -315,6 +316,8 @@ def get_shrink(height, width): max_shrink = max_shrink - 0.4 elif max_shrink >= 5: max_shrink = max_shrink - 0.5 + elif max_shrink <= 0.1: + max_shrink = 0.1 shrink = max_shrink if max_shrink < 1 else 1 return shrink, max_shrink diff --git a/PaddleCV/human_pose_estimation/lib/coco_reader.py b/PaddleCV/human_pose_estimation/lib/coco_reader.py index 91b26c6dc3e170982f3fa1c054f4ea0d484460d3..8640d7f14b605402feb123d03a4cf8e88644e081 100644 --- a/PaddleCV/human_pose_estimation/lib/coco_reader.py +++ b/PaddleCV/human_pose_estimation/lib/coco_reader.py @@ -12,7 +12,6 @@ # See the License for the specific language governing permissions and # limitations under the License. ############################################################################## - """Data reader for COCO dataset.""" from __future__ import absolute_import @@ -60,6 +59,7 @@ from pycocotools.coco import COCO # [7,9],[8,10],[9,11],[2,3],[1,2],[1,3],[2,4],[3,5],[4,6],[5,7] # ] + class Config: """Configurations for COCO dataset. """ @@ -68,13 +68,14 @@ class Config: # For reader BUF_SIZE = 102400 - THREAD = 1 if DEBUG else 8 # have to be larger than 0 + THREAD = 1 if DEBUG else 8 # have to be larger than 0 # Fixed infos of dataset DATAROOT = 'data/coco' IMAGEDIR = 'images' NUM_JOINTS = 17 - FLIP_PAIRS = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14], [15, 16]] + FLIP_PAIRS = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14], + [15, 16]] PARENT_IDS = None # CFGS @@ -90,12 +91,15 @@ class Config: STD = [0.229, 0.224, 0.225] PIXEL_STD = 200 + cfg = Config() + def _box2cs(box): x, y, w, h = box[:4] return _xywh2cs(x, y, w, h) + def _xywh2cs(x, y, w, h): center = np.zeros((2), dtype=np.float32) center[0] = x + w * 0.5 @@ -106,21 +110,20 @@ def _xywh2cs(x, y, w, h): elif w < cfg.ASPECT_RATIO * h: w = h * cfg.ASPECT_RATIO scale = np.array( - [w * 1.0 / cfg.PIXEL_STD, h * 1.0 / cfg.PIXEL_STD], - dtype=np.float32) + [w * 1.0 / cfg.PIXEL_STD, h * 1.0 / cfg.PIXEL_STD], dtype=np.float32) if center[0] != -1: scale = scale * 1.25 return center, scale + def _select_data(db): db_selected = [] for rec in db: num_vis = 0 joints_x = 0.0 joints_y = 0.0 - for joint, joint_vis in zip( - rec['joints_3d'], rec['joints_3d_vis']): + for joint, joint_vis in zip(rec['joints_3d'], rec['joints_3d_vis']): if joint_vis[0] <= 0: continue num_vis += 1 @@ -135,8 +138,8 @@ def _select_data(db): area = rec['scale'][0] * rec['scale'][1] * (cfg.PIXEL_STD**2) joints_center = np.array([joints_x, joints_y]) bbox_center = np.array(rec['center']) - diff_norm2 = np.linalg.norm((joints_center-bbox_center), 2) - ks = np.exp(-1.0*(diff_norm2**2) / ((0.2)**2*2.0*area)) + diff_norm2 = np.linalg.norm((joints_center - bbox_center), 2) + ks = np.exp(-1.0 * (diff_norm2**2) / ((0.2)**2 * 2.0 * area)) metric = (0.2 / 16) * num_vis + 0.45 - 0.2 / 16 if ks > metric: @@ -146,7 +149,9 @@ def _select_data(db): print('=> num selected db: {}'.format(len(db_selected))) return db_selected -def _load_coco_keypoint_annotation(image_set_index, coco, _coco_ind_to_class_ind, image_set): + +def _load_coco_keypoint_annotation(image_set_index, coco, + _coco_ind_to_class_ind, image_set): """Ground truth bbox and keypoints. 
""" print('generating coco gt_db...') @@ -168,7 +173,7 @@ def _load_coco_keypoint_annotation(image_set_index, coco, _coco_ind_to_class_ind x2 = np.min((width - 1, x1 + np.max((0, w - 1)))) y2 = np.min((height - 1, y1 + np.max((0, h - 1)))) if obj['area'] > 0 and x2 >= x1 and y2 >= y1: - obj['clean_bbox'] = [x1, y1, x2-x1, y2-y1] + obj['clean_bbox'] = [x1, y1, x2 - x1, y2 - y1] valid_objs.append(obj) objs = valid_objs @@ -197,7 +202,8 @@ def _load_coco_keypoint_annotation(image_set_index, coco, _coco_ind_to_class_ind center, scale = _box2cs(obj['clean_bbox'][:4]) rec.append({ - 'image': os.path.join(cfg.DATAROOT, cfg.IMAGEDIR, image_set+'2017', '%012d.jpg' % index), + 'image': os.path.join(cfg.DATAROOT, cfg.IMAGEDIR, + image_set + '2017', '%012d.jpg' % index), 'center': center, 'scale': scale, 'joints_3d': joints_3d, @@ -209,6 +215,7 @@ def _load_coco_keypoint_annotation(image_set_index, coco, _coco_ind_to_class_ind gt_db.extend(rec) return gt_db + def data_augmentation(sample, is_train): image_file = sample['image'] filename = sample['filename'] if 'filename' in sample else '' @@ -220,28 +227,32 @@ def data_augmentation(sample, is_train): # imgnum = sample['imgnum'] if 'imgnum' in sample else '' r = 0 - data_numpy = cv2.imread( - image_file, cv2.IMREAD_COLOR | cv2.IMREAD_IGNORE_ORIENTATION) + # used for ce + if 'ce_mode' in os.environ: + random.seed(0) + np.random.seed(0) + + data_numpy = cv2.imread(image_file, cv2.IMREAD_COLOR | + cv2.IMREAD_IGNORE_ORIENTATION) if is_train: sf = cfg.SCALE_FACTOR rf = cfg.ROT_FACTOR - s = s * np.clip(np.random.randn()*sf + 1, 1 - sf, 1 + sf) + s = s * np.clip(np.random.randn() * sf + 1, 1 - sf, 1 + sf) r = np.clip(np.random.randn()*rf, -rf*2, rf*2) \ if random.random() <= 0.6 else 0 if cfg.FLIP and random.random() <= 0.5: data_numpy = data_numpy[:, ::-1, :] joints, joints_vis = fliplr_joints( - joints, joints_vis, data_numpy.shape[1], cfg.FLIP_PAIRS) + joints, joints_vis, data_numpy.shape[1], cfg.FLIP_PAIRS) c[0] = data_numpy.shape[1] - c[0] - 1 trans = get_affine_transform(c, s, r, cfg.IMAGE_SIZE) input = cv2.warpAffine( - data_numpy, - trans, - (int(cfg.IMAGE_SIZE[0]), int(cfg.IMAGE_SIZE[1])), - flags=cv2.INTER_LINEAR) + data_numpy, + trans, (int(cfg.IMAGE_SIZE[0]), int(cfg.IMAGE_SIZE[1])), + flags=cv2.INTER_LINEAR) for i in range(cfg.NUM_JOINTS): if joints_vis[i, 0] > 0.0: @@ -263,23 +274,30 @@ def data_augmentation(sample, is_train): else: return input, target, target_weight, c, s, score, image_file -# Create a reader -def _reader_creator(root, image_set, shuffle=False, is_train=False, use_gt_bbox=False): +# Create a reader +def _reader_creator(root, + image_set, + shuffle=False, + is_train=False, + use_gt_bbox=False): def reader(): if image_set in ['train', 'val']: - file_name = os.path.join(root, 'annotations', 'person_keypoints_'+image_set+'2017.json') + file_name = os.path.join( + root, 'annotations', + 'person_keypoints_' + image_set + '2017.json') elif image_set in ['test', 'test-dev']: - file_name = os.path.join(root, 'annotations', 'image_info_'+image_set+'2017.json') + file_name = os.path.join(root, 'annotations', + 'image_info_' + image_set + '2017.json') else: - raise ValueError("The dataset '{}' is not supported".format(image_set)) + raise ValueError("The dataset '{}' is not supported".format( + image_set)) # Load annotations coco = COCO(file_name) # Deal with class names - cats = [cat['name'] - for cat in coco.loadCats(coco.getCatIds())] + cats = [cat['name'] for cat in coco.loadCats(coco.getCatIds())] classes = ['__background__'] + cats 
print('=> classes: {}'.format(classes)) num_classes = len(classes) @@ -287,7 +305,7 @@ def _reader_creator(root, image_set, shuffle=False, is_train=False, use_gt_bbox= _class_to_coco_ind = dict(zip(cats, coco.getCatIds())) _coco_ind_to_class_ind = dict([(_class_to_coco_ind[cls], _class_to_ind[cls]) - for cls in classes[1:]]) + for cls in classes[1:]]) # Load image file names image_set_index = coco.getImgIds() @@ -296,7 +314,7 @@ def _reader_creator(root, image_set, shuffle=False, is_train=False, use_gt_bbox= if is_train or use_gt_bbox: gt_db = _load_coco_keypoint_annotation( - image_set_index, coco, _coco_ind_to_class_ind, image_set) + image_set_index, coco, _coco_ind_to_class_ind, image_set) gt_db = _select_data(gt_db) if shuffle: @@ -308,23 +326,40 @@ def _reader_creator(root, image_set, shuffle=False, is_train=False, use_gt_bbox= mapper = functools.partial(data_augmentation, is_train=is_train) return reader, mapper + def train(): - reader, mapper = _reader_creator(cfg.DATAROOT, 'train', shuffle=True, is_train=True) + reader, mapper = _reader_creator( + cfg.DATAROOT, 'train', shuffle=True, is_train=True) + + # used for ce + if 'ce_mode' in os.environ: + reader, mapper = _reader_creator( + cfg.DATAROOT, 'train', shuffle=False, is_train=True) + def pop(): for i, x in enumerate(reader()): yield mapper(x) + return pop + def valid(): - reader, mapper = _reader_creator(cfg.DATAROOT, 'val', shuffle=False, is_train=False, use_gt_bbox=True) + reader, mapper = _reader_creator( + cfg.DATAROOT, 'val', shuffle=False, is_train=False, use_gt_bbox=True) + def pop(): for i, x in enumerate(reader()): yield mapper(x) + return pop + def test(): - reader, mapper = _reader_creator(cfg.DATAROOT, 'test', shuffle=False, is_train=False, use_gt_bbox=True) + reader, mapper = _reader_creator( + cfg.DATAROOT, 'test', shuffle=False, is_train=False, use_gt_bbox=True) + def pop(): for i, x in enumerate(reader()): yield mapper(x) + return pop diff --git a/PaddleCV/human_pose_estimation/train.py b/PaddleCV/human_pose_estimation/train.py index b6fe463d9e31f8d5582ee4256be962e62cbeb025..335846c71a2221dc41b4e98746a01b8acdf4a9ed 100644 --- a/PaddleCV/human_pose_estimation/train.py +++ b/PaddleCV/human_pose_estimation/train.py @@ -12,7 +12,6 @@ # See the License for the specific language governing permissions and # limitations under the License. 
############################################################################## - """Functions for training.""" import os @@ -42,8 +41,10 @@ add_arg('pretrained_model', str, "pretrained/resnet_50/115", "Whether to use add_arg('checkpoint', str, None, "Whether to resume checkpoint.") add_arg('lr', float, 0.001, "Set learning rate.") add_arg('lr_strategy', str, "piecewise_decay", "Set the learning rate decay strategy.") +add_arg('enable_ce', bool, False, "If set True, enable continuous evaluation job.") # yapf: enable + def optimizer_setting(args, params): lr_drop_ratio = 0.1 @@ -64,8 +65,8 @@ def optimizer_setting(args, params): # AdamOptimizer optimizer = paddle.fluid.optimizer.AdamOptimizer( - learning_rate=fluid.layers.piecewise_decay( - boundaries=bd, values=lr)) + learning_rate=fluid.layers.piecewise_decay( + boundaries=bd, values=lr)) else: lr = params["lr"] optimizer = fluid.optimizer.Momentum( @@ -87,28 +88,41 @@ def train(args): IMAGE_SIZE = [288, 384] HEATMAP_SIZE = [72, 96] args.kp_dim = 17 - args.total_images = 144406 # 149813 + args.total_images = 144406 # 149813 elif args.dataset == 'mpii': import lib.mpii_reader as reader IMAGE_SIZE = [384, 384] - HEATMAP_SIZE = [96, 96] + HEATMAP_SIZE = [96, 96] args.kp_dim = 16 args.total_images = 22246 else: - raise ValueError('The dataset {} is not supported yet.'.format(args.dataset)) + raise ValueError('The dataset {} is not supported yet.'.format( + args.dataset)) print_arguments(args) # Image and target - image = layers.data(name='image', shape=[3, IMAGE_SIZE[1], IMAGE_SIZE[0]], dtype='float32') - target = layers.data(name='target', shape=[args.kp_dim, HEATMAP_SIZE[1], HEATMAP_SIZE[0]], dtype='float32') - target_weight = layers.data(name='target_weight', shape=[args.kp_dim, 1], dtype='float32') + image = layers.data( + name='image', shape=[3, IMAGE_SIZE[1], IMAGE_SIZE[0]], dtype='float32') + target = layers.data( + name='target', + shape=[args.kp_dim, HEATMAP_SIZE[1], HEATMAP_SIZE[0]], + dtype='float32') + target_weight = layers.data( + name='target_weight', shape=[args.kp_dim, 1], dtype='float32') + + # used for ce + if args.enable_ce: + fluid.default_startup_program().random_seed = 90 + fluid.default_main_program().random_seed = 90 # Build model model = pose_resnet.ResNet(layers=50, kps_num=args.kp_dim) # Output - loss, output = model.net(input=image, target=target, target_weight=target_weight) + loss, output = model.net(input=image, + target=target, + target_weight=target_weight) # Parameters from model and arguments params = {} @@ -127,11 +141,13 @@ def train(args): exe = fluid.Executor(place) exe.run(fluid.default_startup_program()) - if args.pretrained_model: + def if_exist(var): - exist_flag = os.path.exists(os.path.join(args.pretrained_model, var.name)) + exist_flag = os.path.exists( + os.path.join(args.pretrained_model, var.name)) return exist_flag + fluid.io.load_vars(exe, args.pretrained_model, predicate=if_exist) if args.checkpoint is not None: @@ -139,7 +155,8 @@ def train(args): # Dataloader train_reader = paddle.batch(reader.train(), batch_size=args.batch_size) - feeder = fluid.DataFeeder(place=place, feed_list=[image, target, target_weight]) + feeder = fluid.DataFeeder( + place=place, feed_list=[image, target, target_weight]) train_exe = fluid.ParallelExecutor( use_cuda=True if args.use_gpu else False, loss_name=loss.name) @@ -147,29 +164,40 @@ def train(args): for pass_id in range(params["num_epochs"]): for batch_id, data in enumerate(train_reader()): - current_lr = 
np.array(paddle.fluid.global_scope().find_var('learning_rate').get_tensor()) + current_lr = np.array(paddle.fluid.global_scope().find_var( + 'learning_rate').get_tensor()) input_image, loss, out_heatmaps = train_exe.run( - fetch_list, feed=feeder.feed(data)) + fetch_list, feed=feeder.feed(data)) loss = np.mean(np.array(loss)) print_immediately('Epoch [{:4d}/{:3d}] LR: {:.10f} ' - 'Loss = {:.5f}'.format( - batch_id, pass_id, current_lr[0], loss)) + 'Loss = {:.5f}'.format(batch_id, pass_id, + current_lr[0], loss)) if batch_id % 10 == 0: - save_batch_heatmaps(input_image, out_heatmaps, file_name='visualization@train.jpg', normalize=True) - - model_path = os.path.join(args.model_save_dir + '/' + 'simplebase-{}'.format(args.dataset), - str(pass_id)) + save_batch_heatmaps( + input_image, + out_heatmaps, + file_name='visualization@train.jpg', + normalize=True) + + model_path = os.path.join( + args.model_save_dir + '/' + 'simplebase-{}'.format(args.dataset), + str(pass_id)) if not os.path.isdir(model_path): os.makedirs(model_path) fluid.io.save_persistables(exe, model_path) + # used for ce + if args.enable_ce: + device_num = fluid.core.get_cuda_device_count() if args.use_gpu else 1 + print("kpis\t{}_train_cost_card{}\t{:.5f}".format(args.dataset, + device_num, loss)) + if __name__ == '__main__': args = parser.parse_args() check_cuda(args.use_gpu) train(args) - diff --git a/PaddleCV/icnet/eval.py b/PaddleCV/icnet/eval.py index c3a4b8b2c5e94298942f0652d4a519a0a0c3ef14..a21ceb1bb6ce9eef1f1fd53a04d0077813f9686a 100644 --- a/PaddleCV/icnet/eval.py +++ b/PaddleCV/icnet/eval.py @@ -16,7 +16,6 @@ import paddle.fluid as fluid import numpy as np from utils import add_arguments, print_arguments, get_feeder_data, check_gpu from paddle.fluid.layers.learning_rate_scheduler import _decay_step_counter -from paddle.fluid.initializer import init_on_cpu from icnet import icnet import cityscape import argparse diff --git a/PaddleCV/icnet/infer.py b/PaddleCV/icnet/infer.py index f24fb4df88eeee7423b01d409cd52adf046af99c..f1d7db512b99975135cbfd28c97899a5f8db3745 100644 --- a/PaddleCV/icnet/infer.py +++ b/PaddleCV/icnet/infer.py @@ -25,7 +25,6 @@ import paddle from icnet import icnet from utils import add_arguments, print_arguments, get_feeder_data, check_gpu from paddle.fluid.layers.learning_rate_scheduler import _decay_step_counter -from paddle.fluid.initializer import init_on_cpu import numpy as np IMG_MEAN = np.array((103.939, 116.779, 123.68), dtype=np.float32) diff --git a/PaddleCV/icnet/train.py b/PaddleCV/icnet/train.py index 30a423a45c529beef405fd27453e74fe5e40e308..ae616eee603c221b8bea475467a02ecf3785464a 100644 --- a/PaddleCV/icnet/train.py +++ b/PaddleCV/icnet/train.py @@ -26,7 +26,6 @@ import paddle.fluid as fluid import numpy as np from utils import add_arguments, print_arguments, get_feeder_data, check_gpu from paddle.fluid.layers.learning_rate_scheduler import _decay_step_counter -from paddle.fluid.initializer import init_on_cpu if 'ce_mode' in os.environ: np.random.seed(10) @@ -71,9 +70,8 @@ def create_loss(predict, label, mask, num_classes): def poly_decay(): global_step = _decay_step_counter() - with init_on_cpu(): - decayed_lr = LEARNING_RATE * (fluid.layers.pow( - (1 - global_step / TOTAL_STEP), POWER)) + decayed_lr = LEARNING_RATE * (fluid.layers.pow( + (1 - global_step / TOTAL_STEP), POWER)) return decayed_lr diff --git a/PaddleCV/image_classification/README.md b/PaddleCV/image_classification/README.md index 1359b038aca4444b4701cb1eb695bd07bab38601..96468b456042d46bbca2011fcd92d13b739ab2d6 100644 
--- a/PaddleCV/image_classification/README.md +++ b/PaddleCV/image_classification/README.md @@ -14,7 +14,8 @@ - [进阶使用](#进阶使用) - [Mixup训练](#mixup训练) - [混合精度训练](#混合精度训练) - - [自定义数据集](#自定义数据集) + - [性能分析](#性能分析) + - [DALI预处理](#DALI预处理) - [已发布模型及其性能](#已发布模型及其性能) - [FAQ](#faq) - [参考文献](#参考文献) @@ -32,11 +33,11 @@ ### 安装说明 -在当前目录下运行样例代码需要python 2.7及以上版本,PadddlePaddle Fluid v1.6或以上的版本。如果你的运行环境中的PaddlePaddle低于此版本,请根据 [安装文档](http://paddlepaddle.org/documentation/docs/zh/1.6/beginners_guide/install/index_cn.html) 中的说明来更新PaddlePaddle。 +在当前目录下运行样例代码需要python 2.7及以上版本,PadddlePaddle Fluid v1.6或以上的版本。如果你的运行环境中的PaddlePaddle低于此版本,请根据 [安装文档](https://www.paddlepaddle.org.cn/install/quick) 中的说明来更新PaddlePaddle。 #### 环境依赖 -python >= 2.7,CUDA >= 8.0,CUDNN >= 7.0 +python >= 2.7 运行训练代码需要安装numpy,cv2 ```bash @@ -75,19 +76,34 @@ val/ILSVRC2012_val_00000001.jpeg 65 ### 模型训练 数据准备完毕后,可以通过如下的方式启动训练: -``` + +```bash +export CUDA_VISIBLE_DEVICES=0,1,2,3 +export FLAGS_fraction_of_gpu_memory_to_use=0.98 + python train.py \ - --model=ResNet50 \ - --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ - --model_save_dir=output/ \ - --lr_strategy=piecewise_decay \ - --lr=0.1 + --data_dir=./data/ILSVRC2012/ \ + --total_images=1281167 \ + --class_dim=1000 \ + --validate=True \ + --model=ResNet50_vd \ + --batch_size=256 \ + --lr_strategy=cosine_decay \ + --lr=0.1 \ + --num_epochs=200 \ + --model_save_dir=output/ \ + --l2_decay=7e-5 \ + --use_mixup=True \ + --use_label_smoothing=True \ + --label_smoothing_epsilon=0.1 ``` -注意: 当添加如step_epochs这种列表型参数,需要去掉"=",如:--step_epochs 10 20 30 + +注意: +- 当添加如step_epochs这种列表型参数,需要去掉"=",如:--step_epochs 10 20 30 +- 如果需要训练自己的数据集,则需要修改根据自己的数据集修改`data_dir`, `total_images`, `class_dim`参数;如果因为GPU显存不够而需要调整`batch_size`,则参数`lr`也需要根据`batch_size`进行线性调整。 +- 如果需要使用其他模型进行训练,则需要修改`model`参数,也可以在`scripts/train/`文件夹中根据对应模型的默认运行脚本进行修改并训练。 + 或通过run.sh 启动训练 @@ -98,13 +114,14 @@ bash run.sh train 模型名 **多进程模型训练:** 如果你有多张GPU卡的话,我们强烈建议你使用多进程模式来训练模型,这会极大的提升训练速度。启动方式如下: -``` + +```bash CUDA_VISIBLE_DEVICES=0,1,2,3 python -m paddle.distributed.launch train.py \ --model=ResNet50 \ --batch_size=256 \ --total_images=1281167 \ --class_dim=1000 \ - --image_shape=3,224,224 \ + --image_shape 3 224 224 \ --model_save_dir=output/ \ --lr_strategy=piecewise_decay \ --reader_thread=4 \ @@ -128,7 +145,7 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m paddle.distributed.launch train.py \ * **model**: 模型名称, 默认值: "ResNet50" * **total_images**: 图片数,ImageNet2012,默认值: 1281167 * **class_dim**: 类别数,默认值: 1000 -* **image_shape**: 图片大小,默认值: "3,224,224" +* **image_shape**: 图片大小,默认值: [3,224,224] * **num_epochs**: 训练回合数,默认值: 120 * **batch_size**: batch size大小(所有设备),默认值: 8 * **test_batch_size**: 测试batch大小,默认值:16 @@ -146,7 +163,6 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m paddle.distributed.launch train.py \ * **lower_ratio**: 数据随机裁剪处理时的lower ratio值,默认值:3./4. * **upper_ratio**: 数据随机裁剪处理时的upper ratio值,默认值:4./3. * **resize_short_size**: 指定数据处理时改变图像大小的短边值,默认值: 256 -* **crop_size**: 指定裁剪的大小,默认值:224 * **use_mixup**: 是否对数据进行mixup处理,默认值: False * **mixup_alpha**: 指定mixup处理时的alpha值,默认值: 0.2 * **use_aa**: 是否对数据进行auto augment处理. 默认值: False. 
@@ -159,46 +175,87 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m paddle.distributed.launch train.py \ 一些开关: +* **validate**: 是否在模型训练过程中启动模型测试,默认值: True * **use_gpu**: 是否在GPU上运行,默认值: True * **use_label_smoothing**: 是否对数据进行label smoothing处理,默认值: False * **label_smoothing_epsilon**: label_smoothing的epsilon, 默认值:0.1 -* **random_seed**: 随机数种子, 默认值: 1000 * **padding_type**: efficientNet中卷积操作的padding方式, 默认值: "SAME". * **use_se**: efficientNet中是否使用Squeeze-and-Excitation模块, 默认值: True. * **use_ema**: 是否在更新模型参数时使用ExponentialMovingAverage. 默认值: False. * **ema_decay**: ExponentialMovingAverage的decay rate. 默认值: 0.9999. + +性能分析: + +* **enable_ce**: 是否开启CE,默认值: False +* **random_seed**: 随机数种子,当设置数值后,所有随机化会被固定,默认值: None +* **is_profiler**: 是否开启性能分析,默认值: 0 +* **profilier_path**: 分析文件保存位置,默认值: 'profiler_path/' +* **max_iter**: 最大训练batch数,默认值: 0 +* **same_feed**: 是否feed相同数据进入网络,设定具体数值来指定数据数量,默认值:0 + + **数据读取器说明:** 数据读取器定义在```reader.py```文件中,现在默认基于cv2的数据读取器, 在[训练阶段](#模型训练),默认采用的增广方式是随机裁剪与水平翻转, 而在[模型评估](#模型评估)与[模型预测](#模型预测)阶段用的默认方式是中心裁剪。当前支持的数据增广方式有: * 旋转 -* 颜色抖动(暂未实现) +* 颜色抖动 * 随机裁剪 * 中心裁剪 * 长宽调整 * 水平翻转 * 自动增广 + ### 参数微调 参数微调(Finetune)是指在特定任务上微调已训练模型的参数。可以下载[已发布模型及其性能](#已发布模型及其性能)并且设置```path_to_pretrain_model```为模型所在路径,微调一个模型可以采用如下的命令: ```bash +export CUDA_VISIBLE_DEVICES=0,1,2,3 +export FLAGS_fraction_of_gpu_memory_to_use=0.98 + python train.py \ - --model=model_name \ - --pretrained_model=${path_to_pretrain_model} + --data_dir=./data/ILSVRC2012/ \ + --total_images=1281167 \ + --class_dim=1000 \ + --validate=True \ + --model=ResNet50_vd \ + --batch_size=256 \ + --lr=0.1 \ + --num_epochs=200 \ + --model_save_dir=output/ \ + --l2_decay=7e-5 \ + --pretrained_model=${path_to_pretrain_model} \ + --finetune_exclude_pretrained_params=fc_0.w_0,fc_0.b_0 ``` -注意:根据具体模型和任务添加并调整其他参数 + +注意: +- 在自己的数据集上进行微调时,则需要修改根据自己的数据集修改`data_dir`, `total_images`, `class_dim`参数。 +- 加载的参数是ImageNet1000的预训练模型参数,对于相同模型,最后的类别数或者含义可能不同,因此在加载预训练模型参数时,需要过滤掉最后的FC层,否则可能会因为**维度不匹配**而报错。 + ### 模型评估 -模型评估(Eval)是指对训练完毕的模型评估各类性能指标。可以下载[已发布模型及其性能](#已发布模型及其性能)并且设置```path_to_pretrain_model```为模型所在路径。运行如下的命令,可以获得模型top-1/top-5精度: +模型评估(Eval)是指对训练完毕的模型评估各类性能指标。可以下载[已发布模型及其性能](#已发布模型及其性能)并且设置```path_to_pretrain_model```为模型所在路径,```json_path```为保存指标的路径。运行如下的命令,可以获得模型top-1/top-5精度。 + +**参数说明** + +* **save_json_path**: 是否将eval结果保存到json文件中,默认值:None +* `model`: 模型名称,与预训练模型需保持一致。 +* `batch_size`: 每个minibatch评测的图片个数。 +* `data_dir`: 数据路径。注意:该路径下需要同时包括待评估的**图片文件**以及图片和对应类别标注的**映射文本文件**,文本文件名称需为`val.txt`。 ```bash +export CUDA_VISIBLE_DEVICES=0,1,2,3 +export FLAGS_fraction_of_gpu_memory_to_use=0.98 + python eval.py \ - --model=model_name \ - --pretrained_model=${path_to_pretrain_model} + --model=ResNet50_vd \ + --pretrained_model=${path_to_pretrain_model} \ + --data_dir=./data/ILSVRC2012/ \ + --save_json_path=${json_path} \ + --batch_size=256 ``` -注意:根据具体模型和任务添加并调整其他参数 ### 指数滑动平均的模型评估 @@ -210,29 +267,93 @@ python ema_clean.py \ --cleaned_model_dir=your_cleaned_model_dir python eval.py \ - --model=model_name \ + --model=ResNet50_vd \ --pretrained_model=your_cleaned_model_dir ``` -### 模型预测 +### 模型fluid预测 -模型预测(Infer)可以获取一个模型的预测分数或者图像的特征,可以下载[已发布模型及其性能](#已发布模型及其性能)并且设置```path_to_pretrain_model```为模型所在路径。运行如下的命令获得预测结果: +模型预测(Infer)可以获取一个模型的预测分数或者图像的特征,可以下载[已发布模型及其性能](#已发布模型及其性能)并且设置```path_to_pretrain_model```为模型所在路径,```test_res_json_path```为模型预测结果保存的文本路径,```image_path```为模型预测的图片路径或者图片列表所在的文件夹路径。 **参数说明:** -* **save_inference**: 是否保存模型,默认值:False +* **data_dir**: 数据存储位置,默认值:`/data/ILSVRC2012/val/` +* **save_inference**: 是否保存二进制模型,默认值:`False` * **topk**: 
按照置信由高到低排序标签结果,返回的结果数量,默认值:1 -* **label_path**: 可读标签文件路径,默认值:"./utils/tools/readable_label.txt" +* **class_map_path**: 可读标签文件路径,默认值:`./utils/tools/readable_label.txt` +* **image_path**: 指定单文件进行预测,默认值:`None` +* **save_json_path**: 将预测结果保存到json文件中,默认值: `test_res.json` + +#### 单张图片预测 + +```bash +export CUDA_VISIBLE_DEVICES=0 + +python infer.py \ + --model=ResNet50_vd \ + --class_dim=1000 \ + --pretrained_model=${path_to_pretrain_model} \ + --class_map_path=./utils/tools/readable_label.txt \ + --image_path=${image_path} \ + --save_json_path=${test_res_json_path} +``` + +#### 图片列表预测 +* 该种情况下,需要指定```data_dir```路径和```batch_size```。 ```bash +export CUDA_VISIBLE_DEVICES=0,1,2,3 + python infer.py \ - --model=model_name \ - --pretrained_model=${path_to_pretrain_model} + --model=ResNet50_vd \ + --class_dim=1000 \ + --pretrained_model=${path_to_pretrain_model} \ + --class_map_path=./utils/tools/readable_label.txt \ + --data_dir=${data_dir} \ + --save_json_path=${test_res_json_path} \ + --batch_size=${batch_size} ``` -注意:根据具体模型和任务添加并调整其他参数 -模型预测默认ImageNet1000类类别,标签文件存储在/utils/tools/readable_label.txt中,如果使用自定义数据,请指定--label_path参数 +注意: +- 模型名称需要与该模型训练时保持一致。 +- 模型预测默认ImageNet1000类类别,预测数值和可读标签的map文件存储在`./utils/tools/readable_label.txt`中,如果使用自定义数据,请指定`--class_map_path`参数。 + +### Python预测API +* Fluid提供了高度优化的C++预测库,为了方便使用,Paddle也提供了C++预测库对应的Python接口,更加具体的Python预测API介绍可以参考这里:[https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_usage/deploy/inference/python_infer_cn.html](https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_usage/deploy/inference/python_infer_cn.html) +* 使用Python预测API进行模型预测的步骤有模型转换和模型预测,详细介绍如下。 + +#### 模型转换 +* 首先将保存的fluid模型转换为二进制模型,转换方法如下,其中```path_to_pretrain_model```表示预训练模型的路径。 + +```bash +python infer.py \ + --model=ResNet50_vd \ + --pretrained_model=${path_to_pretrain_model} \ + --save_inference=True +``` + +注意: +- 预训练模型和模型名称需要保持一致。 +- 在转换模型时,使用`save_inference_model`函数进行模型转换,参数`feeded_var_names`表示模型预测时所需提供数据的所有变量名称;参数`target_vars`表示模型的所有输出变量,通过这些输出变量即可得到模型的预测结果。 +- 转换完成后,会在`ResNet50_vd`文件下生成`model`和`params`文件。 + +#### 模型预测 + +根据转换的模型二进制文件,基于Python API的预测方法如下,其中```model_path```表示model文件的路径,```params_path```表示params文件的路径,```image_path```表示图片文件的路径。 + +```bash +python predict.py \ + --model_file=./ResNet50_vd/model \ + --params_file=./ResNet50_vd/params \ + --image_path=${image_path} \ + --gpu_id=0 \ + --gpu_mem=1024 +``` + +注意: +- 这里只提供了预测单张图片的脚本,如果需要预测文件夹内的多张图片,需要自己修改预测文件`predict.py`。 +- 参数`gpu_id`指定了当前使用的GPU ID号,参数`gpu_mem`指定了初始化的GPU显存。 ## 进阶使用 @@ -244,38 +365,212 @@ Mixup相关介绍参考[mixup: Beyond Empirical Risk Minimization](https://arxiv ### 混合精度训练 -FP16相关内容已经迁移至PaddlePaddle/Fleet 中 +通过指定--use_fp16=True 启动混合精度训练,在训练过程中会使用float16数据类型,并输出float32的模型参数。您可能需要同时传入--scale_loss来解决fp16训练的精度问题,如传入--scale_loss=128.0。 + +在配置好数据集路径后(修改[scripts/train/ResNet50_fp16.sh](scripts/train/ResNet50_fp16.sh)文件中`DATA_DIR`的值),对ResNet50模型进行混合精度训练可通过运行`bash run.sh train ResNet50_fp16`命令完成。 + +多机多卡ResNet50模型的混合精度训练请参考[PaddlePaddle/Fleet](https://github.com/PaddlePaddle/Fleet/tree/develop/benchmark/collective/resnet)。 + +使用Tesla V100单机8卡、2机器16卡、4机器32卡,对ResNet50模型进行混合精度训练的结果如下(开启DALI): + +* BatchSize = 256 + +节点数*卡数|吞吐|加速比|test\_acc1|test\_acc5 +---|---|---|---|--- +1*1|1035 ins/s|1|0.75333|0.92702 +1*8|7840 ins/s|7.57|0.75603|0.92771 +2*8|14277 ins/s|13.79|0.75872|0.92793 +4*8|28594 ins/s|27.63|0.75253|0.92713 + +* BatchSize = 128 + +节点数*卡数|吞吐|加速比|test\_acc1|test\_acc5 +---|---|---|---|--- +1*1|936 ins/s|1|0.75280|0.92531 +1*8|7108 ins/s|7.59|0.75832|0.92771 +2*8|12343 ins/s|13.18|0.75766|0.92723 
+4*8|24407 ins/s|26.07|0.75859|0.92871 + +### 性能分析 + +注意:本部分主要为内部测试功能。 +其中包括启动CE以监测模型运行的稳定性,启动profiler以测试benchmark,启动same_feed来进行快速调试。 + +启动CE会固定随机初始化,其中包括数据读取器中的shuffle和program的[random_seed](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api_cn/fluid_cn/Program_cn.html#random_seed) + +``` bash +python train.py \ + --enable_ce=True \ + --data_dir=${path_to_a_smaller_dataset} +``` + +启动profiler进行性能分析 + +``` bash +python train.py \ + --is_profiler=True +``` + +设置same_feed参数以进行快速调试, 相同的图片(same_feed张图片)将传入网络中 +```bash +python train.py \ + --same_feed=8 \ + --batch_size=4 \ + --print_step=1 +``` + +### DALI预处理 -### 自定义数据集 +使用[Nvidia DALI](https://github.com/NVIDIA/DALI)预处理类库可以加速训练并提高GPU利用率。 -PaddlePaddle/Models ImageClassification 支持自定义数据 +DALI预处理目前支持标准ImageNet处理步骤( random crop -> resize -> flip -> normalize),并且支持列表文件或者文件夹方式的数据集格式。 -1. 组织自定义数据,调整数据读取器以正确的传入数据 -2. 注意更改训练脚本中 ---data_dim 类别数为自定义数据类别数 ---total_image 图片数量 -3. 当进行finetune时, -指定--pretrained_model 加载预训练模型,注意:本模型库提供的是基于ImageNet 1000类数据的预训练模型,当使用不同类别数的数据时,请删除预训练模型中fc_weight 和fc_offset参数 +指定`--use_dali=True`即可开启DALI预处理,如下面的例子中,使用DALI训练ShuffleNet v2 0.25x,在8卡v100上,图片吞吐可以达到10000张/秒以上,GPU利用率在85%以上。 + +``` bash +export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 +export FLAGS_fraction_of_gpu_memory_to_use=0.80 + +python -m paddle.distributed.launch train.py \ + --model=ShuffleNetV2_x0_25 \ + --batch_size=2048 \ + --lr_strategy=cosine_decay_warmup \ + --num_epochs=240 \ + --lr=0.5 \ + --l2_decay=3e-5 \ + --lower_scale=0.64 \ + --lower_ratio=0.8 \ + --upper_ratio=1.2 \ + --use_dali=True +``` + +更多DALI相关用例请参考[DALI Paddle插件文档](https://docs.nvidia.com/deeplearning/sdk/dali-master-branch-user-guide/docs/plugins/paddle_tutorials.html)。 + +#### 注意事项 + +1. 请务必使用GCC5.4以上编译器[编译安装](https://www.paddlepaddle.org.cn/install/doc/source/ubuntu)的1.6或以上版本paddlepaddle, 另外,请在编译过程中指定-DWITH_DISTRIBUTE=ON 来启动多进程训练模式。注意:官方的paddlepaddle是GCC4.8编译的,请务必检查此项,或参考使用[已经编译好的whl包](https://github.com/NVIDIA/DALI/blob/master/qa/setup_packages.py#L38) +2. Nvidia DALI需要使用[#1371](https://github.com/NVIDIA/DALI/pull/1371)以后的git版本。请参考[此文档](https://docs.nvidia.com/deeplearning/sdk/dali-master-branch-user-guide/docs/installation.html)安装nightly版本或从源码安装。 +3. 
因为DALI使用GPU进行图片预处理,需要占用部分显存,请适当调整 `FLAGS_fraction_of_gpu_memory_to_use`环境变量(如`0.8`)来预留部分显存供DALI使用。 ## 已发布模型及其性能 表格中列出了在models目录下目前支持的图像分类模型,并且给出了已完成训练的模型在ImageNet-2012验证集合上的top-1和top-5精度,以及Paddle Fluid和Paddle TensorRT基于动态链接库的预测时间(测试GPU型号为NVIDIA® Tesla® P4)。 可以通过点击相应模型的名称下载对应的预训练模型。 -- 注意 - - 1:ResNet50_vd_v2是ResNet50_vd蒸馏版本。 - - 2:除EfficientNet外,InceptionV4和Xception采用的输入图像的分辨率为299x299,DarkNet53为256x256,Fix_ResNeXt101_32x48d_wsl为320x320,其余模型使用的分辨率均为224x224。在预测时,DarkNet53与Fix_ResNeXt101_32x48d_wsl系列网络resize_short_size与输入的图像分辨率的宽或高相同,InceptionV4和Xception网络resize_short_size为320,其余网络resize_short_size均为256。 - - 3: EfficientNetB0~B7的分辨率大小分别为224x224,240x240,260x260,300x300,380x380,456x456,528x528,600x600,预测时的resize_short_size在其分辨率的长或高的基础上加32,如EfficientNetB1的resize_short_size为272,在该系列模型训练和预测的过程中,图片resize参数interpolation的值设置为2(cubic插值方式),该模型在训练过程中使用了指数滑动平均策略,具体请参考[指数滑动平均](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/api_cn/optimizer_cn.html#exponentialmovingaverage)。 - - 4:调用动态链接库预测时需要将训练模型转换为二进制模型。 - - ```bash - python infer.py \ - --model=model_name \ - --pretrained_model=${path_to_pretrain_model} \ - --save_inference=True - ``` - - - 5: ResNeXt101_wsl系列的预训练模型转自pytorch模型,详情见[ResNeXt wsl](https://pytorch.org/hub/facebookresearch_WSL-Images_resnext/)。 +#### 注意事项 + +- 特殊参数配置 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Model | 输入图像分辨率 | 参数 resize_short_size |
+|- |:-: |:-: |
+| Inception, Xception | 299 | 320 |
+| DarkNet53 | 256 | 256 |
+| Fix_ResNeXt101_32x48d_wsl | 320 | 320 |
+| EfficientNetB0 | 224 | 256 |
+| EfficientNetB1 | 240 | 272 |
+| EfficientNetB2 | 260 | 292 |
+| EfficientNetB3 | 300 | 332 |
+| EfficientNetB4 | 380 | 412 |
+| EfficientNetB5 | 456 | 488 |
+| EfficientNetB6 | 528 | 560 |
+| EfficientNetB7 | 600 | 632 |
+| 其余分类模型 | 224 | 256 |
+
+  注:EfficientNet系列预测时的resize_short_size在其分辨率的长或高的基础上加32;在该系列模型训练和预测的过程中,图片resize参数interpolation的值设置为2(cubic插值方式);该系列模型在训练过程中使用了指数滑动平均策略,具体请参考[指数滑动平均](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/api_cn/optimizer_cn.html#exponentialmovingaverage)。
+ + +- 调用动态链接库预测时需要将训练模型转换为二进制模型。 + + ```bash + python infer.py \ + --model=ResNet50_vd \ + --pretrained_model=${path_to_pretrain_model} \ + --save_inference=True + ``` + +- ResNeXt101_wsl系列的预训练模型转自pytorch模型,详情见[ResNeXt wsl](https://pytorch.org/hub/facebookresearch_WSL-Images_resnext/)。 ### AlexNet @@ -323,6 +618,13 @@ PaddlePaddle/Models ImageClassification 支持自定义数据 |[ShuffleNetV2_x2_0](https://paddle-imagenet-models-name.bj.bcebos.com/ShuffleNetV2_x2_0_pretrained.tar) | 73.15% | 91.20% | 6.430 | 3.954 | |[ShuffleNetV2_swish](https://paddle-imagenet-models-name.bj.bcebos.com/ShuffleNetV2_swish_pretrained.tar) | 70.03% | 89.17% | 6.078 | 4.976 | +### AutoDL Series +|Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | +|- |:-: |:-: |:-: |:-: | +|[DARTS_4M](https://paddle-imagenet-models-name.bj.bcebos.com/DARTS_GS_4M_pretrained.tar) | 75.23% | 92.15% | 13.572 | 6.335 | +|[DARTS_6M](https://paddle-imagenet-models-name.bj.bcebos.com/DARTS_GS_6M_pretrained.tar) | 76.03% | 92.79% | 16.406 | 6.864 | +- AutoDL基于可微结构搜索思路DARTS改进,引入Local Rademacher Complexity控制过拟合,并通过Resource Constraining灵活调整模型大小。 + ### ResNet Series |Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | |- |:-: |:-: |:-: |:-: | @@ -333,13 +635,24 @@ PaddlePaddle/Models ImageClassification 支持自定义数据 |[ResNet50](http://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_pretrained.tar) | 76.50% | 93.00% | 8.787 | 5.137 | |[ResNet50_vc](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vc_pretrained.tar) |78.35% | 94.03% | 9.013 | 5.285 | |[ResNet50_vd](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_pretrained.tar) | 79.12% | 94.44% | 9.058 | 5.259 | -|[ResNet50_vd_v2](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_v2_pretrained.tar) | 79.84% | 94.93% | 9.058 | 5.259 | +|[ResNet50_vd_v2](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_v2_pretrained.tar)[1](#trans1) | 79.84% | 94.93% | 9.058 | 5.259 | |[ResNet101](http://paddle-imagenet-models-name.bj.bcebos.com/ResNet101_pretrained.tar) | 77.56% | 93.64% | 15.447 | 8.473 | |[ResNet101_vd](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet101_vd_pretrained.tar) | 80.17% | 94.97% | 15.685 | 8.574 | |[ResNet152](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet152_pretrained.tar) | 78.26% | 93.96% | 21.816 | 11.646 | |[ResNet152_vd](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet152_vd_pretrained.tar) | 80.59% | 95.30% | 22.041 | 11.858 | |[ResNet200_vd](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet200_vd_pretrained.tar) | 80.93% | 95.33% | 28.015 | 14.896 | +[1] 该预训练模型是在ResNet50_vd的预训练模型继续蒸馏得到的,用户可以通过ResNet50_vd的结构直接加载该预训练模型。 + +### Res2Net Series +|Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | +|- |:-: |:-: |:-: |:-: | +|[Res2Net50_26w_4s](https://paddle-imagenet-models-name.bj.bcebos.com/Res2Net50_26w_4s_pretrained.tar) | 79.33% | 94.57% | 10.731 | 8.274 | +|[Res2Net50_vd_26w_4s](https://paddle-imagenet-models-name.bj.bcebos.com/Res2Net50_vd_26w_4s_pretrained.tar) | 79.75% | 94.91% | 11.012 | 8.493 | +|[Res2Net50_14w_8s](https://paddle-imagenet-models-name.bj.bcebos.com/Res2Net50_14w_8s_pretrained.tar) | 79.46% | 94.70% | 16.937 | 10.205 | +|[Res2Net101_vd_26w_4s](https://paddle-imagenet-models-name.bj.bcebos.com/Res2Net101_vd_26w_4s_pretrained.tar) | 80.64% | 95.22% | 19.612 | 14.651 | 
+|[Res2Net200_vd_26w_4s](https://paddle-imagenet-models-name.bj.bcebos.com/Res2Net200_vd_26w_4s_pretrained.tar) | 81.21% | 95.71% | 35.809 | 26.479 | + ### ResNeXt Series |Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | |- |:-: |:-: |:-: |:-: | @@ -349,9 +662,10 @@ PaddlePaddle/Models ImageClassification 支持自定义数据 |[ResNeXt50_vd_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt50_vd_64x4d_pretrained.tar) | 80.12% | 94.86% | 20.888 | 15.938 | |[ResNeXt101_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt101_32x4d_pretrained.tar) | 78.65% | 94.19% | 24.154 | 17.661 | |[ResNeXt101_vd_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt101_vd_32x4d_pretrained.tar) | 80.33% | 95.12% | 24.701 | 17.249 | -|[ResNeXt101_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt50_64x4d_pretrained.tar) | 78.43% | 94.13% | 41.073 | 31.288 | +|[ResNeXt101_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt50_64x4d_pretrained.tar) | 78.35% | 94.52% | 41.073 | 31.288 | |[ResNeXt101_vd_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt101_vd_64x4d_pretrained.tar) | 80.78% | 95.20% | 42.277 | 32.620 | |[ResNeXt152_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt152_32x4d_pretrained.tar) | 78.98% | 94.33% | 37.007 | 26.981 | +|[ResNeXt152_vd_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt152_vd_32x4d_pretrained.tar) | 80.72% | 95.20% | 35.783 | 26.081 | |[ResNeXt152_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt152_64x4d_pretrained.tar) | 79.51% | 94.71% | 58.966 | 47.915 | |[ResNeXt152_vd_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt152_vd_64x4d_pretrained.tar) | 81.08% | 95.34% | 60.947 | 47.406 | @@ -376,8 +690,11 @@ PaddlePaddle/Models ImageClassification 支持自定义数据 ### SENet Series |Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | |- |:-: |:-: |:-: |:-: | +|[SE_ResNet18_vd](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNet18_vd_pretrained.tar) | 73.33% | 91.38% | 4.715 | 3.061 | +|[SE_ResNet34_vd](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNet34_vd_pretrained.tar) | 76.51% | 93.20% | 7.475 | 4.299 | |[SE_ResNet50_vd](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNet50_vd_pretrained.tar) | 79.52% | 94.75% | 10.345 | 7.631 | |[SE_ResNeXt50_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNeXt50_32x4d_pretrained.tar) | 78.44% | 93.96% | 14.916 | 12.305 | +|[SE_ResNeXt50_vd_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNeXt50_vd_32x4d_pretrained.tar) | 80.24% | 94.89% | 15.155 | 12.687 | |[SE_ResNeXt101_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNeXt101_32x4d_pretrained.tar) | 79.12% | 94.20% | 30.085 | 23.218 | |[SENet154_vd](https://paddle-imagenet-models-name.bj.bcebos.com/SENet154_vd_pretrained.tar) | 81.40% | 95.48% | 71.892 | 53.131 | @@ -410,18 +727,41 @@ PaddlePaddle/Models ImageClassification 支持自定义数据 |Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | |- |:-: |:-: |:-: |:-: | |[EfficientNetB0](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB0_pretrained.tar) | 77.38% | 93.31% | 10.303 | 4.334 | -|[EfficientNetB1](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB1_pretrained.tar)[1](#trans) | 79.15% | 94.41% | 15.626 | 6.502 | 
-|[EfficientNetB2](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB2_pretrained.tar)[1](#trans) | 79.85% | 94.74% | 17.847 | 7.558 | -|[EfficientNetB3](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB3_pretrained.tar)[1](#trans) | 81.15% | 95.41% | 25.993 | 10.937 | -|[EfficientNetB4](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB4_pretrained.tar)[1](#trans) | 82.85% | 96.23% | 47.734 | 18.536 | -|[EfficientNetB5](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB5_pretrained.tar)[1](#trans) | 83.62% | 96.72% | 88.578 | 32.102 | -|[EfficientNetB6](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB6_pretrained.tar)[1](#trans) | 84.00% | 96.88% | 138.670 | 51.059 | -|[EfficientNetB7](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB7_pretrained.tar)[1](#trans) | 84.30% | 96.89% | 234.364 | 82.107 | -|[EfficientNetB0_small](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB0_Small_pretrained.tar)[2](#trans) | 75.80% | 92.58% | 3.342 | 2.729 | +|[EfficientNetB1](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB1_pretrained.tar)[2](#trans2) | 79.15% | 94.41% | 15.626 | 6.502 | +|[EfficientNetB2](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB2_pretrained.tar)[2](#trans2) | 79.85% | 94.74% | 17.847 | 7.558 | +|[EfficientNetB3](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB3_pretrained.tar)[2](#trans2) | 81.15% | 95.41% | 25.993 | 10.937 | +|[EfficientNetB4](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB4_pretrained.tar)[2](#trans2) | 82.85% | 96.23% | 47.734 | 18.536 | +|[EfficientNetB5](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB5_pretrained.tar)[2](#trans2) | 83.62% | 96.72% | 88.578 | 32.102 | +|[EfficientNetB6](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB6_pretrained.tar)[2](#trans2) | 84.00% | 96.88% | 138.670 | 51.059 | +|[EfficientNetB7](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB7_pretrained.tar)[2](#trans2) | 84.30% | 96.89% | 234.364 | 82.107 | +|[EfficientNetB0_small](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB0_Small_pretrained.tar)[3](#trans3) | 75.80% | 92.58% | 3.342 | 2.729 | -[1] 表示该预训练权重是由[官方的代码仓库](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet)转换来的。 +[2] 表示该预训练权重是由[官方的代码仓库](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet)转换来的。 + +[3] 表示该预训练权重是在EfficientNetB0的基础上去除se模块,并使用通用的卷积训练的,精度稍稍下降,但是速度大幅提升。 + +### HRNet Series +|Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | +|- |:-: |:-: |:-: |:-: | +|[HRNet_W18_C](https://paddle-imagenet-models-name.bj.bcebos.com/HRNet_W18_C_pretrained.tar) | 76.92% | 93.39% | 23.013 | 11.601 | +|[HRNet_W30_C](https://paddle-imagenet-models-name.bj.bcebos.com/HRNet_W30_C_pretrained.tar) | 78.04% | 94.02% | 25.793 | 14.367 | +|[HRNet_W32_C](https://paddle-imagenet-models-name.bj.bcebos.com/HRNet_W32_C_pretrained.tar) | 78.28% | 94.24% | 29.564 | 14.328 | +|[HRNet_W40_C](https://paddle-imagenet-models-name.bj.bcebos.com/HRNet_W40_C_pretrained.tar) | 78.77% | 94.47% | 33.880 | 17.616 | +|[HRNet_W44_C](https://paddle-imagenet-models-name.bj.bcebos.com/HRNet_W44_C_pretrained.tar) | 79.00% | 94.51% | 36.021 | 18.990 | +|[HRNet_W48_C](https://paddle-imagenet-models-name.bj.bcebos.com/HRNet_W48_C_pretrained.tar) | 78.95% | 94.42% | 30.064 | 19.963 | 
+|[HRNet_W64_C](https://paddle-imagenet-models-name.bj.bcebos.com/HRNet_W64_C_pretrained.tar) | 79.30% | 94.61% | 38.921 | 24.742 | + + +### ResNet_ACNet Series +|Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | +|- |:-: |:-: |:-: |:-: | +|[ResNet50_ACNet](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_ACNet_pretrained.tar)1 | 76.71% | 93.24% | 13.205 | 8.804 | +|[ResNet50_ACNet](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_ACNet_deploy_pretrained.tar)2 | 76.71% | 93.24% | 7.418 | 5.950 | -[2] 表示该预训练权重是在EfficientNetB0的基础上去除se模块,并使用通用的卷积训练的,精度稍稍下降,但是速度大幅提升。 +* 注: + * `1`. 不对训练模型结果进行参数转换,进行评估。 + * `2`. 使用`sh ./utils/acnet/convert_model.sh`命令对训练模型结果进行参数转换,并设置`deploy mode=True`,进行评估。 + * `./utils/acnet/convert_model.sh`包含4个参数,分别是模型名称、输入的模型地址、输出的模型地址以及类别数量。 ## FAQ @@ -462,6 +802,10 @@ PaddlePaddle/Models ImageClassification 支持自定义数据 - ResNeXt101_wsl: [Exploring the Limits of Weakly Supervised Pretraining](https://arxiv.org/abs/1805.00932), Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, Laurens van der Maaten - Fix_ResNeXt101_wsl: [Fixing the train-test resolution discrepancy](https://arxiv.org/abs/1906.06423), Hugo Touvron, Andrea Vedaldi, Matthijs Douze, Herve ́ Je ́gou - EfficientNet: [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946), Mingxing Tan, Quoc V. Le +- Res2Net: [Res2Net: A New Multi-scale Backbone Architecture](https://arxiv.org/abs/1904.01169), Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, Philip Torr +- HRNet: [Deep High-Resolution Representation Learning for Visual Recognition](https://arxiv.org/abs/1908.07919), Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, Bin Xiao +- DARTS: [DARTS: Differentiable Architecture Search](https://arxiv.org/pdf/1806.09055.pdf), Hanxiao Liu, Karen Simonyan, Yiming Yang +- ACNet: [ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks](https://arxiv.org/abs/1908.03930), Xiaohan Ding, Yuchen Guo, Guiguang Ding, Jungong Han ## 版本更新 - 2018/12/03 **Stage1**: 更新AlexNet,ResNet50,ResNet101,MobileNetV1 @@ -475,6 +819,9 @@ PaddlePaddle/Models ImageClassification 支持自定义数据 - 2019/08/01 **Stage7**: 更新DarkNet53,DenseNet121,Densenet161,DenseNet169,DenseNet201,DenseNet264,SqueezeNet1_0,SqueezeNet1_1,ResNeXt50_vd_32x4d,ResNeXt152_64x4d,ResNeXt101_32x8d_wsl,ResNeXt101_32x16d_wsl,ResNeXt101_32x32d_wsl,ResNeXt101_32x48d_wsl,Fix_ResNeXt101_32x48d_wsl - 2019/09/11 **Stage8**: 更新ResNet18_vd,ResNet34_vd,MobileNetV1_x0_25,MobileNetV1_x0_5,MobileNetV1_x0_75,MobileNetV2_x0_75,MobilenNetV3_small_x1_0,DPN68,DPN92,DPN98,DPN107,DPN131,ResNeXt101_vd_32x4d,ResNeXt152_vd_64x4d,Xception65,Xception71,Xception41_deeplab,Xception65_deeplab,SE_ResNet50_vd - 2019/09/20 更新EfficientNet +- 2019/11/28 **Stage9**: 更新SE_ResNet18_vd,SE_ResNet34_vd,SE_ResNeXt50_vd_32x4d,ResNeXt152_vd_32x4d,Res2Net50_26w_4s,Res2Net50_14w_8s,Res2Net50_vd_26w_4s,HRNet_W18_C,HRNet_W30_C,HRNet_W32_C,HRNet_W40_C,HRNet_W44_C,HRNet_W48_C,HRNet_W64_C +- 2020/01/07 **Stage10**: 更新AutoDL Series +- 2020/01/09 **Stage11**: 更新Res2Net101_vd_26w_4s, Res2Net200_vd_26w_4s ## 如何贡献代码 diff --git a/PaddleCV/image_classification/README_en.md b/PaddleCV/image_classification/README_en.md index bcc92ff1afa26890f090cd405b44c0a991ba8f8f..1baa86d963ccf8f3df39095085ba2fdd2eb796fb 100644 --- 
a/PaddleCV/image_classification/README_en.md +++ b/PaddleCV/image_classification/README_en.md @@ -15,6 +15,8 @@ English | [中文](README.md) - [Advanced Usage](#advanced-usage) - [Mixup Training](#mixup-training) - [Using Mixed-Precision Training](#using-mixed-precision-training) + - [Profiling](#profiling) + - [Preprocessing with Nvidia DALI](#preprocessing-with-nvidia-dali) - [Custom Dataset](#custom-dataset) - [Supported Models and Performances](#supported-models-and-performances) - [Reference](#reference) @@ -33,7 +35,7 @@ We also recommend users to take a look at the  [IPython Notebook demo](https:/ ### Installation -Running samples in this directory requires Python 2.7 and later, CUDA 8.0 and later, CUDNN 7.0 and later, python package: numpy and opencv-python, PaddelPaddle Fluid v1.6 and later, the latest release version is recommended, If the PaddlePaddle on your device is lower than v1.6, please follow the instructions in [installation document](http://paddlepaddle.org/documentation/docs/zh/1.6/beginners_guide/install/index_cn.html) and make an update. +Running samples in this directory requires Python 2.7 and later, CUDA 8.0 and later, CUDNN 7.0 and later, python package: numpy and opencv-python, PaddelPaddle Fluid v1.6 and later, the latest release version is recommended, If the PaddlePaddle on your device is lower than v1.6, please follow the instructions in [installation document](https://www.paddlepaddle.org.cn/install/quick) and make an update. ### Data preparation @@ -121,7 +123,7 @@ Solver and hyperparameters: * **model**: name model to use. Default: "ResNet50". * **total_images**: total number of images in the training set. Default: 1281167. * **class_dim**: the class number of the classification task. Default: 1000. -* **image_shape**: input size of the network. Default: "3,224,224". +* **image_shape**: input size of the network. Default: 3 224 224 . * **num_epochs**: the number of epochs. Default: 120. * **batch_size**: the batch size of all devices. Default: 8. * **test_batch_size**: the test batch size, Default: 16 @@ -139,7 +141,6 @@ Reader and preprocess: * **lower_ratio**: the lower ratio in ramdom crop. Default:3./4. . * **upper_ration**: the upper ratio in ramdom crop. Default:4./3. . * **resize_short_size**: the resize_short_size. Default: 256. -* **crop_size**: the crop size, Default: 224. * **use_mixup**: whether to use mixup data processing or not. Default:False. * **mixup_alpha**: the mixup_alpha parameter. Default: 0.2. * **use_aa**: whether to use auto augment data processing or not. Default:False. @@ -152,19 +153,29 @@ Reader and preprocess: Switch: +* **validate**: whether to validate when training. Default: True. * **use_gpu**: whether to use GPU or not. Default: True. * **use_label_smoothing**: whether to use label_smoothing or not. Default:False. * **label_smoothing_epsilon**: the label_smoothing_epsilon. Default:0.1. -* **random_seed**: random seed for debugging, Default: 1000. * **padding_type**: padding type of convolution for efficientNet, Default: "SAME". * **use_se**: whether to use Squeeze-and-Excitation module in efficientNet, Default: True. * **use_ema**: whether to use ExponentialMovingAverage or not. Default: False. * **ema_decay**: the value of ExponentialMovingAverage decay rate. Default: 0.9999. 
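+
+`--use_label_smoothing` and `--label_smoothing_epsilon` control how hard one-hot targets are softened before the loss is computed. The repository implements this in `_calc_label_smoothing_loss` in `build_model.py`; the snippet below is only a minimal sketch of the idea (the function name and exact composition here are illustrative, not copied from the repository):
+
+``` python
+import paddle.fluid as fluid
+
+def label_smoothing_loss(softmax_out, label, class_dim, epsilon):
+    # Expand the integer label to one-hot, then smooth it so every class
+    # receives probability epsilon / class_dim and the true class keeps
+    # 1 - epsilon plus its share of the uniform mass.
+    one_hot = fluid.layers.one_hot(input=label, depth=class_dim)
+    smooth_label = fluid.layers.label_smooth(
+        label=one_hot, epsilon=epsilon, dtype="float32")
+    # Cross entropy against the soft target distribution.
+    loss = fluid.layers.cross_entropy(
+        input=softmax_out, label=smooth_label, soft_label=True)
+    return fluid.layers.mean(loss)
+```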
+Profiling: + +* **enable_ce**: whether to start CE, Default: False +* **random_seed**: random seed, Default: None +* **is_profiler**: whether to start profilier, Default: 0 +* **profilier_path**: path to save profilier output, Default: 'profilier_path' +* **max_iter**: maximum training batch, Default: 0 +* **same_feed**: whether to feed same data in the net, Default: 0 + + **data reader introduction:** Data reader is defined in ```reader.py```, default reader is implemented by opencv. In the [Training](#training) Stage, random crop and flipping are applied, while center crop is applied in the [Evaluation](#evaluation) and [Inference](#inference) stages. Supported data augmentation includes: * rotation -* color jitter (haven't implemented in cv2_reader) +* color jitter * random crop * center crop * resize @@ -187,6 +198,10 @@ Note: Add and adjust other parameters accroding to specific models and tasks. Evaluation is to evaluate the performance of a trained model. One can download [pretrained models](#supported-models-and-performances) and set its path to ```path_to_pretrain_model```. Then top1/top5 accuracy can be obtained by running the following command: +**parameters** + +* **save_json_path**: whether to save output, default: None + ``` python eval.py \ --model=model_name \ @@ -215,7 +230,9 @@ python eval.py \ * **save_inference**: whether to save binary model, Default: False * **topk**: the number of sorted predicated labels to show, Default: 1 -* **label_path**: readable label filepath, Default: "/utils/tools/readable_label.txt" +* **class_map_path**: readable label filepath, Default: "/utils/tools/readable_label.txt" +* **save_json_path**: whether to save output, Default: None +* **image_path**: whether to indicate the single image path to predict, Default: None Inference is used to get prediction score or image features based on trained models. One can download [pretrained models](#supported-models-and-performances) and set its path to ```path_to_pretrain_model```. Run following command then obtain prediction score. @@ -236,8 +253,71 @@ Refer to [mixup: Beyond Empirical Risk Minimization](https://arxiv.org/abs/1710. ### Using Mixed-Precision Training -Mixed-precision part is moving to PaddlePaddle/Fleet now. +Set --use_fp16=True to sart Automatic Mixed Precision (AMP) Training. During the training process, the float16 data type will be used to speed up the training performance. You may need to use the --scale_loss parameter to avoid the accuracy dropping, such as setting --scale_loss=128.0. + +After configuring the data path (modify the value of `DATA_DIR` in [scripts/train/ResNet50_fp16.sh](scripts/train/ResNet50_fp16.sh)), you can enable ResNet50 to start AMP Training by executing the command of `bash run.sh train ResNet50_fp16`. + +Refer to [PaddlePaddle/Fleet](https://github.com/PaddlePaddle/Fleet/tree/develop/benchmark/collective/resnet) for the multi-machine and multi-card training. + +Performing on Tesla V100 single machine with 8 cards, two machines with 16 cards and four machines with 32 cards, the performance of ResNet50 AMP training is shown as below (enable DALI). 
+ +* BatchSize = 256 + +nodes*crads|throughput|speedup|test\_acc1|test\_acc5 +---|---|---|---|--- +1*1|1035 ins/s|1|0.75333|0.92702 +1*8|7840 ins/s|7.57|0.75603|0.92771 +2*8|14277 ins/s|13.79|0.75872|0.92793 +4*8|28594 ins/s|27.63|0.75253|0.92713 + +* BatchSize = 128 + +nodes*crads|throughput|speedup|test\_acc1|test\_acc5 +---|---|---|---|--- +1*1|936 ins/s|1|0.75280|0.92531 +1*8|7108 ins/s|7.59|0.75832|0.92771 +2*8|12343 ins/s|13.18|0.75766|0.92723 +4*8|24407 ins/s|26.07|0.75859|0.92871 +### Preprocessing with Nvidia DALI + +[Nvidia DALI](https://github.com/NVIDIA/DALI) can be used to preprocess input images, which could speed up training and achieve higher GPU utilization. + +At present, DALI preprocessing supports the standard ImageNet pipeline (random crop -> resize -> flip -> normalize), it supports dataset in both file list or plain folder format. + +DALI preprocessing can be enabled with the `--use_dali=True` command line flag. +For example, training ShuffleNet v2 0.25x with the following command should +reach a throughput of over 10000 images/second, and GPU utilization should be +above 85%. + +``` bash +export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 +export FLAGS_fraction_of_gpu_memory_to_use=0.80 + +python -m paddle.distributed.launch train.py \ + --model=ShuffleNetV2_x0_25 \ + --batch_size=2048 \ + --lr_strategy=cosine_decay_warmup \ + --num_epochs=240 \ + --lr=0.5 \ + --l2_decay=3e-5 \ + --lower_scale=0.64 \ + --lower_ratio=0.8 \ + --upper_ratio=1.2 \ + --use_dali=True + +``` + +For more details please refer to [Documentation on DALI Paddle Plugin](https://docs.nvidia.com/deeplearning/sdk/dali-master-branch-user-guide/docs/plugins/paddle_tutorials.html). + +#### NOTES +1. PaddlePaddle with version 1.6 or above is required, and it must be compiled +with GCC 5.4 and up. +2. Nvidia DALI should include this PR [#1371](https://github.com/NVIDIA/DALI/pull/1371). Please refer to [this doc](https://docs.nvidia.com/deeplearning/sdk/dali-master-branch-user-guide/docs/installation.html) and install nightly version or build from source. +3. Since DALI utilize the GPU for preprocessing, it will take up some GPU + memory. Please reduce the memory used by paddle by setting the + `FLAGS_fraction_of_gpu_memory_to_use` environment variable to a smaller + number (e.g., `0.8`) ### Custom Dataset @@ -248,18 +328,118 @@ Mixed-precision part is moving to PaddlePaddle/Fleet now. The image classification models currently supported by PaddlePaddle are listed in the table. It shows the top-1/top-5 accuracy on the ImageNet-2012 validation set of these models, the inference time of Paddle Fluid and Paddle TensorRT based on dynamic link library(test GPU model: Tesla P4). Pretrained models can be downloaded by clicking related model names. -- Note - - 1: ResNet50_vd_v2 is the distilled version of ResNet50_vd. - - 2: In addition to EfficientNet, the image resolution feeded in InceptionV4 and Xception net is ```299x299```, Fix_ResNeXt101_32x48d_wsl is ```320x320```, DarkNet is ```256x256```, others are ```224x224```.In test time, the resize_short_size of the DarkNet53 and Fix_ResNeXt101_32x48d_wsl series networks is the same as the width or height of the input image resolution, the InceptionV4 and Xception network resize_short_size is 320, and the other networks resize_short_size are 256. 
- - 3: The resolutions of EfficientNetB0~B7 are ```224x224```,```240x240```,```260x260```,```300x300```,```380x380```,```456x456```,```528x528```,```600x600``` respectively, the resize_short_size in the inference phase is increased by 32 on the basis of the length or height of the resolution, for example, the resize_short_size of EfficientNetB1 is 272.In the process of training and inference phase of these series of models, the value of the resize parameter interpolation is set to 2 (cubic interpolation mode). Besides, the model uses ExponentialMovingAverage during the training process, this trick please refer to [ExponentialMovingAverage](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/api/optimizer.html#exponentialmovingaverage). - - 4: It's necessary to convert the train model to a binary model when appling dynamic link library to infer, One can do it by running following command: - ```bash - python infer.py\ - --model=model_name \ - --pretrained_model=${path_to_pretrained_model} \ - --save_inference=True - ``` - - 5: The pretrained model of the ResNeXt101_wsl series network is converted from the pytorch model. Please refer to [RESNEXT WSL](https://pytorch.org/hub/facebookresearch_WSL-Images_resnext/) for details. +#### Note + +- Some special settings + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Model | Resolution | Parameter: resize_short_size |
+|- |:-: |:-: |
+| Inception, Xception | 299 | 320 |
+| DarkNet53 | 256 | 256 |
+| Fix_ResNeXt101_32x48d_wsl | 320 | 320 |
+| EfficientNetB0 | 224 | 256 |
+| EfficientNetB1 | 240 | 272 |
+| EfficientNetB2 | 260 | 292 |
+| EfficientNetB3 | 300 | 332 |
+| EfficientNetB4 | 380 | 412 |
+| EfficientNetB5 | 456 | 488 |
+| EfficientNetB6 | 528 | 560 |
+| EfficientNetB7 | 600 | 632 |
+| Other models | 224 | 256 |
+
+  Note: for the EfficientNet series, the resize_short_size used in the inference phase is the input resolution plus 32, the resize interpolation parameter is set to 2 (cubic interpolation), and ExponentialMovingAverage is applied during training; please refer to [ExponentialMovingAverage](https://www.paddlepaddle.org.cn/documentation/docs/en/1.5/api/optimizer.html#exponentialmovingaverage).
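+
+At evaluation time the two columns relate as follows: the short side of the image is first resized to `resize_short_size`, and a center crop of the model's input resolution is then taken from the resized image. Below is a minimal, self-contained sketch of that step (this is not the repository's `reader.py`; the function name and the use of OpenCV here are assumptions):
+
+``` python
+import cv2
+
+def center_crop_preprocess(img_path, resize_short_size, crop_size):
+    """Resize the short side to resize_short_size, then center-crop to
+    crop_size x crop_size (the 'Resolution' column above)."""
+    img = cv2.imread(img_path)                       # HWC, BGR, uint8
+    h, w = img.shape[:2]
+    scale = float(resize_short_size) / min(h, w)
+    img = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))))
+    h, w = img.shape[:2]
+    top = (h - crop_size) // 2
+    left = (w - crop_size) // 2
+    return img[top:top + crop_size, left:left + crop_size]
+
+# e.g. ResNet50: 224 crop from a 256 short side; EfficientNetB7: 600 crop from 632
+crop = center_crop_preprocess("demo.jpg", resize_short_size=256, crop_size=224)
+print(crop.shape)  # (224, 224, 3)
+```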
+ +- It's necessary to convert the train model to a binary model when appling dynamic link library to infer, One can do it by running following command: + + ```bash + python infer.py\ + --model=model_name \ + --pretrained_model=${path_to_pretrained_model} \ + --save_inference=True + ``` + +- The pretrained model of the ResNeXt101_wsl series network is converted from the pytorch model. Please refer to [RESNEXT WSL](https://pytorch.org/hub/facebookresearch_WSL-Images_resnext/) for details. ### AlexNet |Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | @@ -306,6 +486,13 @@ Pretrained models can be downloaded by clicking related model names. |[ShuffleNetV2_x2_0](https://paddle-imagenet-models-name.bj.bcebos.com/ShuffleNetV2_x2_0_pretrained.tar) | 73.15% | 91.20% | 6.430 | 3.954 | |[ShuffleNetV2_swish](https://paddle-imagenet-models-name.bj.bcebos.com/ShuffleNetV2_swish_pretrained.tar) | 70.03% | 89.17% | 6.078 | 4.976 | +### AutoDL Series +|Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | +|- |:-: |:-: |:-: |:-: | +|[DARTS_4M](https://paddle-imagenet-models-name.bj.bcebos.com/DARTS_GS_4M_pretrained.tar) | 75.23% | 92.15% | 13.572 | 6.335 | +|[DARTS_6M](https://paddle-imagenet-models-name.bj.bcebos.com/DARTS_GS_6M_pretrained.tar) | 76.03% | 92.79% | 16.406 | 6.864 | +- AutoDL is improved based on DARTS, Local Rademacher Complexity is introduced to control overfitting, and model size is flexibly adjusted through Resource Constraining. + ### ResNet Series |Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | |- |:-: |:-: |:-: |:-: | @@ -316,13 +503,24 @@ Pretrained models can be downloaded by clicking related model names. |[ResNet50](http://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_pretrained.tar) | 76.50% | 93.00% | 8.787 | 5.137 | |[ResNet50_vc](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vc_pretrained.tar) |78.35% | 94.03% | 9.013 | 5.285 | |[ResNet50_vd](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_pretrained.tar) | 79.12% | 94.44% | 9.058 | 5.259 | -|[ResNet50_vd_v2](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_v2_pretrained.tar) | 79.84% | 94.93% | 9.058 | 5.259 | +|[ResNet50_vd_v2](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_v2_pretrained.tar)[1](#trans1) | 79.84% | 94.93% | 9.058 | 5.259 | |[ResNet101](http://paddle-imagenet-models-name.bj.bcebos.com/ResNet101_pretrained.tar) | 77.56% | 93.64% | 15.447 | 8.473 | |[ResNet101_vd](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet101_vd_pretrained.tar) | 80.17% | 94.97% | 15.685 | 8.574 | |[ResNet152](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet152_pretrained.tar) | 78.26% | 93.96% | 21.816 | 11.646 | |[ResNet152_vd](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet152_vd_pretrained.tar) | 80.59% | 95.30% | 22.041 | 11.858 | |[ResNet200_vd](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet200_vd_pretrained.tar) | 80.93% | 95.33% | 28.015 | 14.896 | +[1] The pretrained model is distilled based on the pretrained model of ResNet50_vd. Users can directly load the pretrained model through the structure of ResNet50_vd. 
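+
+For example, loading the distilled ResNet50_vd_v2 weights only requires building the ordinary ResNet50_vd network first. A minimal sketch, assuming the `models` package from this directory is importable and the pretrained tar has been extracted to a local `ResNet50_vd_v2_pretrained` directory (`eval.py` performs the same steps through its command-line arguments):
+
+``` python
+import paddle.fluid as fluid
+import models  # the models package shipped in this directory
+
+# Build the ResNet50_vd topology; the distilled v2 weights share this structure.
+image = fluid.data(name='image', shape=[None, 3, 224, 224], dtype='float32')
+net_out = models.ResNet50_vd().net(input=image, class_dim=1000)
+
+exe = fluid.Executor(fluid.CUDAPlace(0))
+exe.run(fluid.default_startup_program())
+
+# Load the ResNet50_vd_v2 pretrained parameters into the ResNet50_vd program.
+fluid.io.load_persistables(exe, 'ResNet50_vd_v2_pretrained')
+```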
+ +### Res2Net Series +|Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | +|- |:-: |:-: |:-: |:-: | +|[Res2Net50_26w_4s](https://paddle-imagenet-models-name.bj.bcebos.com/Res2Net50_26w_4s_pretrained.tar) | 79.33% | 94.57% | 10.731 | 8.274 | +|[Res2Net50_vd_26w_4s](https://paddle-imagenet-models-name.bj.bcebos.com/Res2Net50_vd_26w_4s_pretrained.tar) | 79.75% | 94.91% | 11.012 | 8.493 | +|[Res2Net50_14w_8s](https://paddle-imagenet-models-name.bj.bcebos.com/Res2Net50_14w_8s_pretrained.tar) | 79.46% | 94.70% | 16.937 | 10.205 | +|[Res2Net101_vd_26w_4s](https://paddle-imagenet-models-name.bj.bcebos.com/Res2Net101_vd_26w_4s_pretrained.tar) | 80.64% | 95.22% | 19.612 | 14.651 | +|[Res2Net200_vd_26w_4s](https://paddle-imagenet-models-name.bj.bcebos.com/Res2Net200_vd_26w_4s_pretrained.tar) | 81.21% | 95.71% | 35.809 | 26.479 | + ### ResNeXt Series |Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | |- |:-: |:-: |:-: |:-: | @@ -332,9 +530,10 @@ Pretrained models can be downloaded by clicking related model names. |[ResNeXt50_vd_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt50_vd_64x4d_pretrained.tar) | 80.12% | 94.86% | 20.888 | 15.938 | |[ResNeXt101_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt101_32x4d_pretrained.tar) | 78.65% | 94.19% | 24.154 | 17.661 | |[ResNeXt101_vd_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt101_vd_32x4d_pretrained.tar) | 80.33% | 95.12% | 24.701 | 17.249 | -|[ResNeXt101_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt50_64x4d_pretrained.tar) | 78.43% | 94.13% | 41.073 | 31.288 | +|[ResNeXt101_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt50_64x4d_pretrained.tar) | 79.35% | 94.52% | 41.073 | 31.288 | |[ResNeXt101_vd_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt101_vd_64x4d_pretrained.tar) | 80.78% | 95.20% | 42.277 | 32.620 | |[ResNeXt152_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt152_32x4d_pretrained.tar) | 78.98% | 94.33% | 37.007 | 26.981 | +|[ResNeXt152_vd_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt152_vd_32x4d_pretrained.tar) | 80.72% | 95.20% | 35.783 | 26.081 | |[ResNeXt152_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt152_64x4d_pretrained.tar) | 79.51% | 94.71% | 58.966 | 47.915 | |[ResNeXt152_vd_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt152_vd_64x4d_pretrained.tar) | 81.08% | 95.34% | 60.947 | 47.406 | @@ -356,6 +555,17 @@ Pretrained models can be downloaded by clicking related model names. 
|[DPN107](https://paddle-imagenet-models-name.bj.bcebos.com/DPN107_pretrained.tar) | 80.89% | 95.32% | 41.071 | 18.885 | |[DPN131](https://paddle-imagenet-models-name.bj.bcebos.com/DPN131_pretrained.tar) | 80.70% | 95.14% | 41.179 | 18.246 | +### SENet Series +|Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | +|- |:-: |:-: |:-: |:-: | +|[SE_ResNet18_vd](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNet18_vd_pretrained.tar) | 73.33% | 91.38% | 4.715 | 3.061 | +|[SE_ResNet34_vd](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNet34_vd_pretrained.tar) | 76.51% | 93.20% | 7.475 | 4.299 | +|[SE_ResNet50_vd](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNet50_vd_pretrained.tar) | 79.52% | 94.75% | 10.345 | 7.631 | +|[SE_ResNeXt50_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNeXt50_32x4d_pretrained.tar) | 78.44% | 93.96% | 14.916 | 12.305 | +|[SE_ResNeXt50_vd_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNeXt50_vd_32x4d_pretrained.tar) | 80.24% | 94.89% | 15.155 | 12.687 | +|[SE_ResNeXt101_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/SE_ResNeXt101_32x4d_pretrained.tar) | 79.12% | 94.20% | 30.085 | 23.218 | +|[SENet154_vd](https://paddle-imagenet-models-name.bj.bcebos.com/SENet154_vd_pretrained.tar) | 81.40% | 95.48% | 71.892 | 53.131 | + ### SENet Series |Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | |- |:-: |:-: |:-: |:-: | @@ -393,18 +603,41 @@ Pretrained models can be downloaded by clicking related model names. |Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | |- |:-: |:-: |:-: |:-: | |[EfficientNetB0](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB0_pretrained.tar) | 77.38% | 93.31% | 10.303 | 4.334 | -|[EfficientNetB1](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB1_pretrained.tar)[1](#trans) | 79.15% | 94.41% | 15.626 | 6.502 | -|[EfficientNetB2](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB2_pretrained.tar)[1](#trans) | 79.85% | 94.74% | 17.847 | 7.558 | -|[EfficientNetB3](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB3_pretrained.tar)[1](#trans) | 81.15% | 95.41% | 25.993 | 10.937 | -|[EfficientNetB4](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB4_pretrained.tar)[1](#trans) | 82.85% | 96.23% | 47.734 | 18.536 | -|[EfficientNetB5](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB5_pretrained.tar)[1](#trans) | 83.62% | 96.72% | 88.578 | 32.102 | -|[EfficientNetB6](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB6_pretrained.tar)[1](#trans) | 84.00% | 96.88% | 138.670 | 51.059 | -|[EfficientNetB7](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB7_pretrained.tar)[1](#trans) | 84.30% | 96.89% | 234.364 | 82.107 | -|[EfficientNetB0_small](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB0_Small_pretrained.tar)[2](#trans) | 75.80% | 92.58% | 3.342 | 2.729 | +|[EfficientNetB1](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB1_pretrained.tar)[2](#trans2) | 79.15% | 94.41% | 15.626 | 6.502 | +|[EfficientNetB2](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB2_pretrained.tar)[2](#trans2) | 79.85% | 94.74% | 17.847 | 7.558 | +|[EfficientNetB3](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB3_pretrained.tar)[2](#trans2) | 81.15% | 95.41% | 25.993 | 10.937 | 
+|[EfficientNetB4](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB4_pretrained.tar)[2](#trans2) | 82.85% | 96.23% | 47.734 | 18.536 | +|[EfficientNetB5](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB5_pretrained.tar)[2](#trans2) | 83.62% | 96.72% | 88.578 | 32.102 | +|[EfficientNetB6](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB6_pretrained.tar)[2](#trans2) | 84.00% | 96.88% | 138.670 | 51.059 | +|[EfficientNetB7](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB7_pretrained.tar)[2](#trans2) | 84.30% | 96.89% | 234.364 | 82.107 | +|[EfficientNetB0_small](https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB0_Small_pretrained.tar)[3](#trans3) | 75.80% | 92.58% | 3.342 | 2.729 | + +[2] means the pretrained weight is converted form [original repository](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet). -[1] means the pretrained weight is converted form [original repository](https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet). +[3] means the pretrained weight is based on EfficientNetB0, removed Squeeze-and-Excitation module and use general convolution. This model speed is much faster. + +### HRNet Series +|Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | +|- |:-: |:-: |:-: |:-: | +|[HRNet_W18_C](https://paddle-imagenet-models-name.bj.bcebos.com/HRNet_W18_C_pretrained.tar) | 76.92% | 93.39% | 23.013 | 11.601 | +|[HRNet_W30_C](https://paddle-imagenet-models-name.bj.bcebos.com/HRNet_W30_C_pretrained.tar) | 78.04% | 94.02% | 25.793 | 14.367 | +|[HRNet_W32_C](https://paddle-imagenet-models-name.bj.bcebos.com/HRNet_W32_C_pretrained.tar) | 78.28% | 94.24% | 29.564 | 14.328 | +|[HRNet_W40_C](https://paddle-imagenet-models-name.bj.bcebos.com/HRNet_W40_C_pretrained.tar) | 78.77% | 94.47% | 33.880 | 17.616 | +|[HRNet_W44_C](https://paddle-imagenet-models-name.bj.bcebos.com/HRNet_W44_C_pretrained.tar) | 79.00% | 94.51% | 36.021 | 18.990 | +|[HRNet_W48_C](https://paddle-imagenet-models-name.bj.bcebos.com/HRNet_W48_C_pretrained.tar) | 78.95% | 94.42% | 30.064 | 19.963 | +|[HRNet_W64_C](https://paddle-imagenet-models-name.bj.bcebos.com/HRNet_W64_C_pretrained.tar) | 79.30% | 94.61% | 38.921 | 24.742 | + +### ResNet_ACNet Series +|Model | Top-1 | Top-5 | Paddle Fluid inference time(ms) | Paddle TensorRT inference time(ms) | +|- |:-: |:-: |:-: |:-: | +|[ResNet50_ACNet](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_ACNet_pretrained.tar)1 | 76.71% | 93.24% | 13.205 | 8.804 | +|[ResNet50_ACNet](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_ACNet_deploy_pretrained.tar)2 | 76.71% | 93.24% | 7.418 | 5.950 | + +* Note: + * `1`. deploy mode is set as False to eval. + * `2`. Use `sh ./utils/acnet/convert_model.sh` to convert to trained model, and set deploy mode as True to eval. + * `./utils/acnet/convert_model.sh` contains 4 parmeters, which are model name, input model directory, output model directory and class number. -[2] means the pretrained weight is based on EfficientNetB0, removed Squeeze-and-Excitation module and use general convolution. This model speed is much faster. ## FAQ @@ -438,9 +671,14 @@ Enforce failed. 
Expected x_dims[1] == labels_dims[1], but received x_dims[1]:100 - ResNeXt101_wsl: [Exploring the Limits of Weakly Supervised Pretraining](https://arxiv.org/abs/1805.00932), Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, Laurens van der Maaten - Fix_ResNeXt101_wsl: [Fixing the train-test resolution discrepancy](https://arxiv.org/abs/1906.06423), Hugo Touvron, Andrea Vedaldi, Matthijs Douze, Herve ́ Je ́gou - EfficientNet: [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946), Mingxing Tan, Quoc V. Le +- Res2Net: [Res2Net: A New Multi-scale Backbone Architecture](https://arxiv.org/abs/1904.01169), Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, Philip Torr +- HRNet: [Deep High-Resolution Representation Learning for Visual Recognition](https://arxiv.org/abs/1908.07919), Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, Bin Xiao +- DARTS: [DARTS: Differentiable Architecture Search](https://arxiv.org/pdf/1806.09055.pdf), Hanxiao Liu, Karen Simonyan, Yiming Yang +- ACNet: [ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks](https://arxiv.org/abs/1908.03930), Xiaohan Ding, Yuchen Guo, Guiguang Ding, Jungong Han -## Update + +## Update - 2018/12/03 **Stage1**: Update AlexNet, ResNet50, ResNet101, MobileNetV1 - 2018/12/23 **Stage2**: Update VGG Series, SeResNeXt50_32x4d, SeResNeXt101_32x4d, ResNet152 - 2019/01/31 Update MobileNetV2_x1_0 @@ -452,6 +690,9 @@ Enforce failed. Expected x_dims[1] == labels_dims[1], but received x_dims[1]:100 - 2019/08/01 **Stage7**: Update DarkNet53, DenseNet121. Densenet161, DenseNet169, DenseNet201, DenseNet264, SqueezeNet1_0, SqueezeNet1_1, ResNeXt50_vd_32x4d, ResNeXt152_64x4d, ResNeXt101_32x8d_wsl, ResNeXt101_32x16d_wsl, ResNeXt101_32x32d_wsl, ResNeXt101_32x48d_wsl, Fix_ResNeXt101_32x48d_wsl - 2019/09/11 **Stage8**: Update ResNet18_vd,ResNet34_vd,MobileNetV1_x0_25,MobileNetV1_x0_5,MobileNetV1_x0_75,MobileNetV2_x0_75,MobilenNetV3_small_x1_0,DPN68,DPN92,DPN98,DPN107,DPN131,ResNeXt101_vd_32x4d,ResNeXt152_vd_64x4d,Xception65,Xception71,Xception41_deeplab,Xception65_deeplab,SE_ResNet50_vd - 2019/09/20 Update EfficientNet +- 2019/11/28 **Stage9**: Update SE_ResNet18_vd,SE_ResNet34_vd,SE_ResNeXt50_vd_32x4d,ResNeXt152_vd_32x4d,Res2Net50_26w_4s,Res2Net50_14w_8s,Res2Net50_vd_26w_4s,HRNet_W18_C,HRNet_W30_C,HRNet_W32_C,HRNet_W40_C,HRNet_W44_C,HRNet_W48_C,HRNet_W64_C +- 2020/01/07 **Stage10**: Update AutoDL Series +- 2020/01/09 **Stage11**: Update Res2Net101_vd_26w_4s, Res2Net200_vd_26w_4s ## Contribute diff --git a/PaddleCV/image_classification/build_model.py b/PaddleCV/image_classification/build_model.py index 709f58d7e4d7539c43fc7a64ef5c05c4523595b0..a0dfd1310ad83c5bb16efceb4895c98f471a5c20 100644 --- a/PaddleCV/image_classification/build_model.py +++ b/PaddleCV/image_classification/build_model.py @@ -35,8 +35,11 @@ def _calc_label_smoothing_loss(softmax_out, label, class_dim, epsilon): def _basic_model(data, model, args, is_train): image = data[0] label = data[1] - - net_out = model.net(input=image, class_dim=args.class_dim) + if args.model == "ResNet50": + image_in = fluid.layers.transpose(image, [0, 2, 3, 1]) if args.data_format == 'NHWC' else image + net_out = model.net(input=image_in, class_dim=args.class_dim, data_format=args.data_format) + else: + net_out = model.net(input=image, class_dim=args.class_dim) 
softmax_out = fluid.layers.softmax(net_out, use_cudnn=False) if is_train and args.use_label_smoothing: @@ -48,7 +51,8 @@ def _basic_model(data, model, args, is_train): avg_cost = fluid.layers.mean(cost) acc_top1 = fluid.layers.accuracy(input=softmax_out, label=label, k=1) - acc_top5 = fluid.layers.accuracy(input=softmax_out, label=label, k=5) + acc_top5 = fluid.layers.accuracy( + input=softmax_out, label=label, k=min(5, args.class_dim)) return [avg_cost, acc_top1, acc_top5] @@ -73,7 +77,8 @@ def _googlenet_model(data, model, args, is_train): avg_cost = avg_cost0 + 0.3 * avg_cost1 + 0.3 * avg_cost2 acc_top1 = fluid.layers.accuracy(input=out0, label=label, k=1) - acc_top5 = fluid.layers.accuracy(input=out0, label=label, k=5) + acc_top5 = fluid.layers.accuracy( + input=out0, label=label, k=min(5, args.class_dim)) return [avg_cost, acc_top1, acc_top5] @@ -86,7 +91,11 @@ def _mixup_model(data, model, args, is_train): y_b = data[2] lam = data[3] - net_out = model.net(input=image, class_dim=args.class_dim) + if args.model == "ResNet50": + image_in = fluid.layers.transpose(image, [0, 2, 3, 1]) if args.data_format == 'NHWC' else image + net_out = model.net(input=image_in, class_dim=args.class_dim, data_format=args.data_format) + else: + net_out = model.net(input=image, class_dim=args.class_dim) softmax_out = fluid.layers.softmax(net_out, use_cudnn=False) if not args.use_label_smoothing: loss_a = fluid.layers.cross_entropy(input=softmax_out, label=y_a) diff --git a/PaddleCV/image_classification/dali.py b/PaddleCV/image_classification/dali.py new file mode 100644 index 0000000000000000000000000000000000000000..f7c88b10ad9471b5673cb11fa9c90aa690d6ec7f --- /dev/null +++ b/PaddleCV/image_classification/dali.py @@ -0,0 +1,265 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from __future__ import division + +import os + +from nvidia.dali.pipeline import Pipeline +import nvidia.dali.ops as ops +import nvidia.dali.types as types +from nvidia.dali.plugin.paddle import DALIGenericIterator + +import paddle +from paddle import fluid + + +class HybridTrainPipe(Pipeline): + def __init__(self, + file_root, + file_list, + batch_size, + resize_shorter, + crop, + min_area, + lower, + upper, + interp, + mean, + std, + device_id, + shard_id=0, + num_shards=1, + random_shuffle=True, + num_threads=4, + seed=42): + super(HybridTrainPipe, self).__init__( + batch_size, num_threads, device_id, seed=seed) + self.input = ops.FileReader( + file_root=file_root, + file_list=file_list, + shard_id=shard_id, + num_shards=num_shards, + random_shuffle=random_shuffle) + # set internal nvJPEG buffers size to handle full-sized ImageNet images + # without additional reallocations + device_memory_padding = 211025920 + host_memory_padding = 140544512 + self.decode = ops.ImageDecoderRandomCrop( + device='mixed', + output_type=types.RGB, + device_memory_padding=device_memory_padding, + host_memory_padding=host_memory_padding, + random_aspect_ratio=[lower, upper], + random_area=[min_area, 1.0], + num_attempts=100) + self.res = ops.Resize( + device='gpu', resize_x=crop, resize_y=crop, interp_type=interp) + self.cmnp = ops.CropMirrorNormalize( + device="gpu", + output_dtype=types.FLOAT, + output_layout=types.NCHW, + crop=(crop, crop), + image_type=types.RGB, + mean=mean, + std=std) + self.coin = ops.CoinFlip(probability=0.5) + self.to_int64 = ops.Cast(dtype=types.INT64, device="gpu") + + def define_graph(self): + rng = self.coin() + jpegs, labels = self.input(name="Reader") + images = self.decode(jpegs) + images = self.res(images) + output = self.cmnp(images.gpu(), mirror=rng) + return [output, self.to_int64(labels.gpu())] + + def __len__(self): + return self.epoch_size("Reader") + + +class HybridValPipe(Pipeline): + def __init__(self, + file_root, + file_list, + batch_size, + resize_shorter, + crop, + interp, + mean, + std, + device_id, + shard_id=0, + num_shards=1, + random_shuffle=False, + num_threads=4, + seed=42): + super(HybridValPipe, self).__init__( + batch_size, num_threads, device_id, seed=seed) + self.input = ops.FileReader( + file_root=file_root, + file_list=file_list, + shard_id=shard_id, + num_shards=num_shards, + random_shuffle=random_shuffle) + self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB) + self.res = ops.Resize( + device="gpu", resize_shorter=resize_shorter, interp_type=interp) + self.cmnp = ops.CropMirrorNormalize( + device="gpu", + output_dtype=types.FLOAT, + output_layout=types.NCHW, + crop=(crop, crop), + image_type=types.RGB, + mean=mean, + std=std) + self.to_int64 = ops.Cast(dtype=types.INT64, device="gpu") + + def define_graph(self): + jpegs, labels = self.input(name="Reader") + images = self.decode(jpegs) + images = self.res(images) + output = self.cmnp(images) + return [output, self.to_int64(labels.gpu())] + + def __len__(self): + return self.epoch_size("Reader") + + +def build(settings, mode='train'): + env = os.environ + assert settings.use_gpu, "gpu training is required for DALI" + assert not settings.use_mixup, "mixup is not supported by DALI reader" + assert not settings.use_aa, "auto augment is not supported by DALI reader" + assert float(env.get('FLAGS_fraction_of_gpu_memory_to_use', 0.92)) < 0.9, \ + "Please leave enough GPU memory for DALI workspace, e.g., by setting" \ + " `export FLAGS_fraction_of_gpu_memory_to_use=0.8`" + + file_root 
= settings.data_dir + bs = settings.batch_size + assert bs % paddle.fluid.core.get_cuda_device_count() == 0, \ + "batch size must be multiple of number of devices" + batch_size = bs // paddle.fluid.core.get_cuda_device_count() + + mean = [v * 255 for v in settings.image_mean] + std = [v * 255 for v in settings.image_std] + crop = settings.image_shape[1] + resize_shorter = settings.resize_short_size + min_area = settings.lower_scale + lower = settings.lower_ratio + upper = settings.upper_ratio + + interp = settings.interpolation or 1 # default to linear + interp_map = { + 0: types.INTERP_NN, # cv2.INTER_NEAREST + 1: types.INTERP_LINEAR, # cv2.INTER_LINEAR + 2: types.INTERP_CUBIC, # cv2.INTER_CUBIC + 4: types.INTERP_LANCZOS3, # XXX use LANCZOS3 for cv2.INTER_LANCZOS4 + } + assert interp in interp_map, "interpolation method not supported by DALI" + interp = interp_map[interp] + + if mode != 'train': + p = fluid.framework.cuda_places()[0] + place = fluid.core.Place() + place.set_place(p) + device_id = place.gpu_device_id() + file_list = os.path.join(file_root, 'val_list.txt') + if not os.path.exists(file_list): + file_list = None + file_root = os.path.join(file_root, 'val') + pipe = HybridValPipe( + file_root, + file_list, + batch_size, + resize_shorter, + crop, + interp, + mean, + std, + device_id=device_id) + pipe.build() + return DALIGenericIterator( + pipe, ['feed_image', 'feed_label'], + size=len(pipe), + dynamic_shape=True, + fill_last_batch=False, + last_batch_padded=True) + + file_list = os.path.join(file_root, 'train_list.txt') + if not os.path.exists(file_list): + file_list = None + file_root = os.path.join(file_root, 'train') + + if 'PADDLE_TRAINER_ID' in env and 'PADDLE_TRAINERS_NUM' in env: + shard_id = int(env['PADDLE_TRAINER_ID']) + num_shards = int(env['PADDLE_TRAINERS_NUM']) + device_id = int(env['FLAGS_selected_gpus']) + pipe = HybridTrainPipe( + file_root, + file_list, + batch_size, + resize_shorter, + crop, + min_area, + lower, + upper, + interp, + mean, + std, + device_id, + shard_id, + num_shards, + seed=42 + shard_id) + pipe.build() + pipelines = [pipe] + sample_per_shard = len(pipe) // num_shards + else: + pipelines = [] + places = fluid.framework.cuda_places() + num_shards = len(places) + for idx, p in enumerate(places): + place = fluid.core.Place() + place.set_place(p) + device_id = place.gpu_device_id() + pipe = HybridTrainPipe( + file_root, + file_list, + batch_size, + resize_shorter, + crop, + min_area, + lower, + upper, + interp, + mean, + std, + device_id, + idx, + num_shards, + seed=42 + idx) + pipe.build() + pipelines.append(pipe) + sample_per_shard = len(pipelines[0]) + + return DALIGenericIterator( + pipelines, ['feed_image', 'feed_label'], size=sample_per_shard) + + +def train(settings): + return build(settings, 'train') + + +def val(settings): + return build(settings, 'val') diff --git a/PaddleCV/image_classification/eval.py b/PaddleCV/image_classification/eval.py index 0b66f9521445c4cb1e61403d4065011f601fffb1..44ea47364fe7f4b930e1977e671b639a8271b3bc 100644 --- a/PaddleCV/image_classification/eval.py +++ b/PaddleCV/image_classification/eval.py @@ -23,6 +23,7 @@ import math import numpy as np import argparse import functools +import logging import paddle import paddle.fluid as fluid @@ -34,10 +35,9 @@ parser = argparse.ArgumentParser(description=__doc__) add_arg = functools.partial(add_arguments, argparser=parser) # yapf: disable add_arg('data_dir', str, "./data/ILSVRC2012/", "The ImageNet datset") -add_arg('batch_size', int, 256, "Minibatch size.") 
+add_arg('batch_size', int, 256, "batch size on all the devices.") add_arg('use_gpu', bool, True, "Whether to use GPU or not.") add_arg('class_dim', int, 1000, "Class number.") -add_arg('image_shape', str, "3,224,224", "Input image size") parser.add_argument("--pretrained_model", default=None, required=True, type=str, help="The path to load pretrained model") add_arg('model', str, "ResNet50", "Set the network to use.") add_arg('resize_short_size', int, 256, "Set resize short size") @@ -45,16 +45,21 @@ add_arg('reader_thread', int, 8, "The number of multi thr add_arg('reader_buf_size', int, 2048, "The buf size of multi thread reader") parser.add_argument('--image_mean', nargs='+', type=float, default=[0.485, 0.456, 0.406], help="The mean of input image data") parser.add_argument('--image_std', nargs='+', type=float, default=[0.229, 0.224, 0.225], help="The std of input image data") -add_arg('crop_size', int, 224, "The value of crop size") +parser.add_argument('--image_shape', nargs="+", type=int, default=[3,224,224], help=" The shape of image") add_arg('interpolation', int, None, "The interpolation mode") add_arg('padding_type', str, "SAME", "Padding type of convolution") add_arg('use_se', bool, True, "Whether to use Squeeze-and-Excitation module for EfficientNet.") +add_arg('save_json_path', str, None, "Whether to save output in json file.") +add_arg('same_feed', int, 0, "Whether to feed same images") +add_arg('print_step', int, 1, "the batch step to print info") +add_arg('deploy', bool, False, "deploy mode, currently used in ACNet") # yapf: enable +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) -def eval(args): - image_shape = [int(m) for m in args.image_shape.split(",")] +def eval(args): model_list = [m for m in dir(models) if "__" not in m] assert args.model in model_list, "{} is not in lists: {}".format(args.model, model_list) @@ -63,8 +68,16 @@ def eval(args): ), "{} doesn't exist, please load right pretrained model path for eval".format( args.pretrained_model) + assert args.image_shape[ + 1] <= args.resize_short_size, "Please check the args:image_shape and args:resize_short_size, The croped size(image_shape[1]) must smaller than or equal to the resized length(resize_short_size) " + + # check gpu: when using gpu, the number of visible cards should divide batch size + if args.use_gpu: + assert args.batch_size % fluid.core.get_cuda_device_count( + ) == 0, "please support correct batch_size({}), which can be divided by available cards({}), you can change the number of cards by indicating: export CUDA_VISIBLE_DEVICES= ".format( + args.batch_size, fluid.core.get_cuda_device_count()) image = fluid.data( - name='image', shape=[None] + image_shape, dtype='float32') + name='image', shape=[None] + args.image_shape, dtype='float32') label = fluid.data(name='label', shape=[None, 1], dtype='int64') # model definition @@ -72,6 +85,8 @@ def eval(args): model = models.__dict__[args.model](is_test=True, padding_type=args.padding_type, use_se=args.use_se) + elif "ACNet" in args.model: + model = models.__dict__[args.model](deploy=args.deploy) else: model = models.__dict__[args.model]() @@ -98,47 +113,82 @@ def eval(args): test_program = fluid.default_main_program().clone(for_test=True) - fetch_list = [avg_cost.name, acc_top1.name, acc_top5.name] + fetch_list = [avg_cost.name, acc_top1.name, acc_top5.name, pred.name] + gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0)) - place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace() + place = fluid.CUDAPlace(gpu_id) 
if args.use_gpu else fluid.CPUPlace() exe = fluid.Executor(place) + exe.run(fluid.default_startup_program()) + if args.use_gpu: + places = fluid.framework.cuda_places() + else: + places = fluid.framework.cpu_places() + compiled_program = fluid.compiler.CompiledProgram( + test_program).with_data_parallel(places=places) fluid.io.load_persistables(exe, args.pretrained_model) imagenet_reader = reader.ImageNetReader() val_reader = imagenet_reader.val(settings=args) - feeder = fluid.DataFeeder(place=place, feed_list=[image, label]) + # set places to run on the multi-card + feeder = fluid.DataFeeder(place=places, feed_list=[image, label]) test_info = [[], [], []] cnt = 0 + parallel_data = [] + parallel_id = [] + place_num = paddle.fluid.core.get_cuda_device_count( + ) if args.use_gpu else int(os.environ.get('CPU_NUM', 1)) + real_iter = 0 + info_dict = {} + for batch_id, data in enumerate(val_reader()): - t1 = time.time() - loss, acc1, acc5 = exe.run(test_program, - fetch_list=fetch_list, - feed=feeder.feed(data)) - t2 = time.time() - period = t2 - t1 - loss = np.mean(loss) - acc1 = np.mean(acc1) - acc5 = np.mean(acc5) - test_info[0].append(loss * len(data)) - test_info[1].append(acc1 * len(data)) - test_info[2].append(acc5 * len(data)) - cnt += len(data) - if batch_id % 10 == 0: - print("Testbatch {0},loss {1}, " - "acc1 {2},acc5 {3},time {4}".format(batch_id, \ + #image data and label + image_data = [items[0:2] for items in data] + image_id = [items[2] for items in data] + parallel_id.append(image_id) + parallel_data.append(image_data) + if place_num == len(parallel_data): + t1 = time.time() + loss_set, acc1_set, acc5_set, pred_set = exe.run( + compiled_program, + fetch_list=fetch_list, + feed=list(feeder.feed_parallel(parallel_data, place_num))) + t2 = time.time() + period = t2 - t1 + loss = np.mean(loss_set) + acc1 = np.mean(acc1_set) + acc5 = np.mean(acc5_set) + test_info[0].append(loss * len(data)) + test_info[1].append(acc1 * len(data)) + test_info[2].append(acc5 * len(data)) + cnt += len(data) + if batch_id % args.print_step == 0: + info = "Testbatch {0},loss {1}, acc1 {2},acc5 {3},time {4}".format(real_iter, \ "%.5f"%loss,"%.5f"%acc1, "%.5f"%acc5, \ - "%2.2f sec" % period)) - sys.stdout.flush() + "%2.2f sec" % period) + logger.info(info) + sys.stdout.flush() + + parallel_id = [] + parallel_data = [] + real_iter += 1 test_loss = np.sum(test_info[0]) / cnt test_acc1 = np.sum(test_info[1]) / cnt test_acc5 = np.sum(test_info[2]) / cnt - print("Test_loss {0}, test_acc1 {1}, test_acc5 {2}".format( - "%.5f" % test_loss, "%.5f" % test_acc1, "%.5f" % test_acc5)) + info = "Test_loss {0}, test_acc1 {1}, test_acc5 {2}".format( + "%.5f" % test_loss, "%.5f" % test_acc1, "%.5f" % test_acc5) + if args.save_json_path: + info_dict = { + "Test_loss": test_loss, + "test_acc1": test_acc1, + "test_acc5": test_acc5 + } + save_json(info_dict, args.save_json_path) + logger.info(info) sys.stdout.flush() diff --git a/PaddleCV/image_classification/infer.py b/PaddleCV/image_classification/infer.py index 0fb7de7791b7cf87605f32bfebcd3c7a8bd483e0..092ba63691f78d2d3b54f0d34d04cba9c7c2881d 100644 --- a/PaddleCV/image_classification/infer.py +++ b/PaddleCV/image_classification/infer.py @@ -23,46 +23,68 @@ import math import numpy as np import argparse import functools +import re +import logging import paddle import paddle.fluid as fluid import reader import models +import json from utils import * parser = argparse.ArgumentParser(description=__doc__) # yapf: disable add_arg = functools.partial(add_arguments, 
argparser=parser) -add_arg('data_dir', str, "./data/ILSVRC2012/", "The ImageNet data") +add_arg('data_dir', str, "./data/ILSVRC2012/val/", "The ImageNet data") add_arg('use_gpu', bool, True, "Whether to use GPU or not.") add_arg('class_dim', int, 1000, "Class number.") -add_arg('image_shape', str, "3,224,224", "Input image size") parser.add_argument("--pretrained_model", default=None, required=True, type=str, help="The path to load pretrained model") -add_arg('model', str, "ResNet50", "Set the network to use.") +add_arg('model', str, "ResNet50", "Set the network to use.") add_arg('save_inference', bool, False, "Whether to save inference model or not") add_arg('resize_short_size',int, 256, "Set resize short size") add_arg('reader_thread', int, 1, "The number of multi thread reader") add_arg('reader_buf_size', int, 2048, "The buf size of multi thread reader") parser.add_argument('--image_mean', nargs='+', type=float, default=[0.485, 0.456, 0.406], help="The mean of input image data") parser.add_argument('--image_std', nargs='+', type=float, default=[0.229, 0.224, 0.225], help="The std of input image data") -add_arg('crop_size', int, 224, "The value of crop size") +parser.add_argument('--image_shape', nargs='+', type=int, default=[3, 224, 224], help="the shape of image") add_arg('topk', int, 1, "topk") -add_arg('label_path', str, "./utils/tools/readable_label.txt", "readable label filepath") +add_arg('class_map_path', str, "./utils/tools/readable_label.txt", "readable label filepath") add_arg('interpolation', int, None, "The interpolation mode") add_arg('padding_type', str, "SAME", "Padding type of convolution") add_arg('use_se', bool, True, "Whether to use Squeeze-and-Excitation module for EfficientNet.") +add_arg('image_path', str, None, "single image path") +add_arg('batch_size', int, 8, "batch_size on all the devices") +add_arg('save_json_path', str, "test_res.json", "save output to a json file") # yapf: enable +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + def infer(args): - image_shape = [int(m) for m in args.image_shape.split(",")] model_list = [m for m in dir(models) if "__" not in m] assert args.model in model_list, "{} is not in lists: {}".format(args.model, model_list) assert os.path.isdir(args.pretrained_model ), "please load right pretrained model path for infer" + + assert args.image_shape[ + 1] <= args.resize_short_size, "Please check the args:image_shape and args:resize_short_size, The croped size(image_shape[1]) must smaller than or equal to the resized length(resize_short_size) " + + if args.image_path: + assert os.path.isfile( + args.image_path + ), "Please check the args:image_path, it should be a path to single image." 
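+        # Note (descriptive comment, added for clarity): when a single image path is
+        # given, execution is restricted to exactly one device below. Further down,
+        # inputs are accumulated into one micro-batch per place before
+        # feeder.feed_parallel(...) is run, so a single sample cannot fill one
+        # micro-batch per place when more than one GPU is visible (or CPU_NUM > 1).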
+ if args.use_gpu: + assert fluid.core.get_cuda_device_count( + ) == 1, "please set \"export CUDA_VISIBLE_DEVICES=\" available single card" + else: + assert int(os.environ.get('CPU_NUM', + 1)) == 1, "please set CPU_NUM as 1" + image = fluid.data( - name='image', shape=[None] + image_shape, dtype='float32') + name='image', shape=[None] + args.image_shape, dtype='float32') if args.model.startswith('EfficientNet'): model = models.__dict__[args.model](is_test=True, @@ -80,10 +102,16 @@ def infer(args): test_program = fluid.default_main_program().clone(for_test=True) fetch_list = [out.name] - - place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace() + gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0)) + place = fluid.CUDAPlace(gpu_id) if args.use_gpu else fluid.CPUPlace() exe = fluid.Executor(place) exe.run(fluid.default_startup_program()) + if args.use_gpu: + places = fluid.framework.cuda_places() + else: + places = fluid.framework.cpu_places() + compiled_program = fluid.compiler.CompiledProgram( + test_program).with_data_parallel(places=places) fluid.io.load_persistables(exe, args.pretrained_model) if args.save_inference: @@ -95,35 +123,77 @@ def infer(args): executor=exe, model_filename='model', params_filename='params') - print("model: ", args.model, " is already saved") + logger.info("model: {0} is already saved".format(args.model)) exit(0) - args.test_batch_size = 1 imagenet_reader = reader.ImageNetReader() test_reader = imagenet_reader.test(settings=args) - feeder = fluid.DataFeeder(place=place, feed_list=[image]) + feeder = fluid.DataFeeder(place=places, feed_list=[image]) TOPK = args.topk - assert os.path.exists(args.label_path), "Index file doesn't exist!" - f = open(args.label_path) - label_dict = {} - for item in f.readlines(): - key = item.split(" ")[0] - value = [l.replace("\n", "") for l in item.split(" ")[1:]] - label_dict[key] = value - - for batch_id, data in enumerate(test_reader()): - result = exe.run(test_program, - fetch_list=fetch_list, - feed=feeder.feed(data)) - result = result[0][0] - pred_label = np.argsort(result)[::-1][:TOPK] - readable_pred_label = [] - for label in pred_label: - readable_pred_label.append(label_dict[str(label)]) - print("Test-{0}-score: {1}, class{2} {3}".format(batch_id, result[ - pred_label], pred_label, readable_pred_label)) - sys.stdout.flush() + if os.path.exists(args.class_map_path): + logger.info( + "The map of readable label and numerical label has been found!") + with open(args.class_map_path) as f: + label_dict = {} + strinfo = re.compile(r"\d+ ") + for item in f.readlines(): + key = item.split(" ")[0] + value = [ + strinfo.sub("", l).replace("\n", "") + for l in item.split(", ") + ] + label_dict[key] = value + + info = {} + parallel_data = [] + parallel_id = [] + place_num = paddle.fluid.core.get_cuda_device_count( + ) if args.use_gpu else int(os.environ.get('CPU_NUM', 1)) + if os.path.exists(args.save_json_path): + logger.warning("path: {} Already exists! 
will recover it\n".format( + args.save_json_path)) + with open(args.save_json_path, "w") as fout: + for batch_id, data in enumerate(test_reader()): + image_data = [[items[0]] for items in data] + image_id = [items[1] for items in data] + + parallel_id.append(image_id) + parallel_data.append(image_data) + + if place_num == len(parallel_data): + result = exe.run( + compiled_program, + fetch_list=fetch_list, + feed=list(feeder.feed_parallel(parallel_data, place_num))) + for i, res in enumerate(result[0]): + pred_label = np.argsort(res)[::-1][:TOPK] + real_id = str(np.array(parallel_id).flatten()[i]) + _, real_id = os.path.split(real_id) + + if os.path.exists(args.class_map_path): + readable_pred_label = [] + for label in pred_label: + readable_pred_label.append(label_dict[str(label)]) + + info[real_id] = {} + info[real_id]['score'], info[real_id]['class'], info[ + real_id]['class_name'] = str(res[pred_label]), str( + pred_label), readable_pred_label + else: + info[real_id] = {} + info[real_id]['score'], info[real_id]['class'] = str( + res[pred_label]), str(pred_label) + + logger.info("{}, {}".format(real_id, info[real_id])) + sys.stdout.flush() + fout.write(real_id + "\t" + json.dumps(info[real_id]) + + "\n") + + parallel_data = [] + parallel_id = [] + + os.remove(".tmp.txt") def main(): diff --git a/PaddleCV/image_classification/models/__init__.py b/PaddleCV/image_classification/models/__init__.py index 9ebb3d56ce2000fd05ff768c0016c715c37c236a..63e6220688306b55d6dfe9369b4176c232314282 100644 --- a/PaddleCV/image_classification/models/__init__.py +++ b/PaddleCV/image_classification/models/__init__.py @@ -38,3 +38,8 @@ from .squeezenet import SqueezeNet1_0, SqueezeNet1_1 from .darknet import DarkNet53 from .resnext101_wsl import ResNeXt101_32x8d_wsl, ResNeXt101_32x16d_wsl, ResNeXt101_32x32d_wsl, ResNeXt101_32x48d_wsl, Fix_ResNeXt101_32x48d_wsl from .efficientnet import EfficientNet, EfficientNetB0, EfficientNetB1, EfficientNetB2, EfficientNetB3, EfficientNetB4, EfficientNetB5, EfficientNetB6, EfficientNetB7 +from .res2net import Res2Net50_48w_2s, Res2Net50_26w_4s, Res2Net50_14w_8s, Res2Net50_26w_6s, Res2Net50_26w_8s, Res2Net101_26w_4s, Res2Net152_26w_4s +from .res2net_vd import Res2Net50_vd_48w_2s, Res2Net50_vd_26w_4s, Res2Net50_vd_14w_8s, Res2Net50_vd_26w_6s, Res2Net50_vd_26w_8s, Res2Net101_vd_26w_4s, Res2Net152_vd_26w_4s, Res2Net200_vd_26w_4s +from .hrnet import HRNet_W18_C, HRNet_W30_C, HRNet_W32_C, HRNet_W40_C, HRNet_W44_C, HRNet_W48_C, HRNet_W60_C, HRNet_W64_C, SE_HRNet_W18_C, SE_HRNet_W30_C, SE_HRNet_W32_C, SE_HRNet_W40_C, SE_HRNet_W44_C, SE_HRNet_W48_C, SE_HRNet_W60_C, SE_HRNet_W64_C +from .autodl import DARTS_6M, DARTS_4M +from .resnet_acnet import ResNet18_ACNet, ResNet34_ACNet, ResNet50_ACNet, ResNet101_ACNet, ResNet152_ACNet diff --git a/PaddleCV/image_classification/models/autodl.py b/PaddleCV/image_classification/models/autodl.py new file mode 100644 index 0000000000000000000000000000000000000000..915bc1631cc841194e2217186ee7849e92735b30 --- /dev/null +++ b/PaddleCV/image_classification/models/autodl.py @@ -0,0 +1,562 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. 
+#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +# +# Based on: +# -------------------------------------------------------- +# DARTS +# Copyright (c) 2018, Hanxiao Liu. +# Licensed under the Apache License, Version 2.0; +# -------------------------------------------------------- + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import os +import sys +import numpy as np +import time +import functools +import paddle +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr +from paddle.fluid.initializer import Xavier +from paddle.fluid.initializer import Normal +from paddle.fluid.initializer import Constant + + +from collections import namedtuple +Genotype = namedtuple('Genotype', 'normal normal_concat reduce reduce_concat') + +arch_dict = { + 'DARTS_6M': Genotype(normal=[('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_5x5', 1), ('sep_conv_5x5', 0), ('sep_conv_3x3', 2), ('sep_conv_3x3', 1), ('skip_connect', 4), ('sep_conv_3x3', 3)], normal_concat=range(2, 6), reduce=[('sep_conv_5x5', 0), ('max_pool_3x3', 1), ('dil_conv_5x5', 2), ('sep_conv_5x5', 0), ('sep_conv_3x3', 1), ('dil_conv_5x5', 3), ('dil_conv_3x3', 1), ('sep_conv_3x3', 2)], reduce_concat=range(2, 6)), + 'DARTS_4M': Genotype(normal=[('sep_conv_3x3', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 1), ('skip_connect', 0), ('skip_connect', 0), ('dil_conv_3x3', 1)], normal_concat=range(2, 6), reduce=[('max_pool_3x3', 0), ('max_pool_3x3', 1), ('max_pool_3x3', 0), ('avg_pool_3x3', 1), ('skip_connect', 3), ('skip_connect', 2), ('sep_conv_3x3', 0), ('sep_conv_5x5', 2)], reduce_concat=range(2, 6)), +} + +__all__ = list(arch_dict.keys()) + +train_parameters = { + "input_size": [3, 224, 224], + "input_mean": [0.485, 0.456, 0.406], + "input_std": [0.229, 0.224, 0.225], + "learning_strategy": { + "name": "piecewise_decay", + "batch_size": 256, + "epochs": [30, 60, 90], + "steps": [0.1, 0.01, 0.001, 0.0001] + } +} + +OPS = { + 'none' : lambda input, C, stride, name, affine: Zero(input, stride, name), + 'avg_pool_3x3' : lambda input, C, stride, name, affine: fluid.layers.pool2d(input, 3, 'avg', pool_stride=stride, pool_padding=1, name=name), + 'max_pool_3x3' : lambda input, C, stride, name, affine: fluid.layers.pool2d(input, 3, 'max', pool_stride=stride, pool_padding=1, name=name), + 'skip_connect' : lambda input,C, stride, name, affine: Identity(input, name) if stride == 1 else FactorizedReduce(input, C, name=name, affine=affine), + 'sep_conv_3x3' : lambda input,C, stride, name, affine: SepConv(input, C, C, 3, stride, 1, name=name, affine=affine), + 'sep_conv_5x5' : lambda input,C, stride, name, affine: SepConv(input, C, C, 5, stride, 2, name=name, affine=affine), + 'sep_conv_7x7' : lambda input,C, stride, name, affine: SepConv(input, C, C, 7, stride, 3, name=name, affine=affine), + 'dil_conv_3x3' : lambda input,C, stride, name, affine: DilConv(input, C, C, 3, stride, 2, 2, name=name, affine=affine), + 'dil_conv_5x5' : lambda input,C, stride, name, affine: DilConv(input, C, C, 5, stride, 4, 2, name=name, affine=affine), + 'conv_7x1_1x7' : lambda input,C, stride, name, 
affine: SevenConv(input, C, name=name, affine=affine) +} + +def ReLUConvBN(input, C_out, kernel_size, stride, padding, name='', + affine=True): + relu_a = fluid.layers.relu(input) + conv2d_a = fluid.layers.conv2d( + relu_a, + C_out, + kernel_size, + stride, + padding, + bias_attr=False) + if affine: + reluconvbn_out = fluid.layers.batch_norm( + conv2d_a, + param_attr=ParamAttr( + initializer=Constant(1.), name=name + 'op.2.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), name=name + 'op.2.bias'), + moving_mean_name=name + 'op.2.running_mean', + moving_variance_name=name + 'op.2.running_var') + else: + reluconvbn_out = fluid.layers.batch_norm( + conv2d_a, + param_attr=ParamAttr( + initializer=Constant(1.), + learning_rate=0., + name=name + 'op.2.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), + learning_rate=0., + name=name + 'op.2.bias'), + moving_mean_name=name + 'op.2.running_mean', + moving_variance_name=name + 'op.2.running_var') + return reluconvbn_out + +def DilConv(input, + C_in, + C_out, + kernel_size, + stride, + padding, + dilation, + name='', + affine=True): + relu_a = fluid.layers.relu(input) + conv2d_a = fluid.layers.conv2d( + relu_a, + C_in, + kernel_size, + stride, + padding, + dilation, + groups=C_in, + bias_attr=False, + use_cudnn=False) + conv2d_b = fluid.layers.conv2d( + conv2d_a, + C_out, + 1, + bias_attr=False) + if affine: + dilconv_out = fluid.layers.batch_norm( + conv2d_b, + param_attr=ParamAttr( + initializer=Constant(1.), name=name + 'op.3.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), name=name + 'op.3.bias'), + moving_mean_name=name + 'op.3.running_mean', + moving_variance_name=name + 'op.3.running_var') + else: + dilconv_out = fluid.layers.batch_norm( + conv2d_b, + param_attr=ParamAttr( + initializer=Constant(1.), + learning_rate=0., + name=name + 'op.3.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), + learning_rate=0., + name=name + 'op.3.bias'), + moving_mean_name=name + 'op.3.running_mean', + moving_variance_name=name + 'op.3.running_var') + return dilconv_out + +def SepConv(input, + C_in, + C_out, + kernel_size, + stride, + padding, + name='', + affine=True): + relu_a = fluid.layers.relu(input) + conv2d_a = fluid.layers.conv2d( + relu_a, + C_in, + kernel_size, + stride, + padding, + groups=C_in, + bias_attr=False, + use_cudnn=False) + conv2d_b = fluid.layers.conv2d( + conv2d_a, + C_in, + 1, + bias_attr=False) + if affine: + bn_a = fluid.layers.batch_norm( + conv2d_b, + param_attr=ParamAttr( + initializer=Constant(1.), name=name + 'op.3.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), name=name + 'op.3.bias'), + moving_mean_name=name + 'op.3.running_mean', + moving_variance_name=name + 'op.3.running_var') + else: + bn_a = fluid.layers.batch_norm( + conv2d_b, + param_attr=ParamAttr( + initializer=Constant(1.), + learning_rate=0., + name=name + 'op.3.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), + learning_rate=0., + name=name + 'op.3.bias'), + moving_mean_name=name + 'op.3.running_mean', + moving_variance_name=name + 'op.3.running_var') + + relu_b = fluid.layers.relu(bn_a) + conv2d_d = fluid.layers.conv2d( + relu_b, + C_in, + kernel_size, + 1, + padding, + groups=C_in, + bias_attr=False, + use_cudnn=False) + conv2d_e = fluid.layers.conv2d( + conv2d_d, + C_out, + 1, + bias_attr=False) + if affine: + sepconv_out = fluid.layers.batch_norm( + conv2d_e, + param_attr=ParamAttr( + initializer=Constant(1.), name=name + 'op.7.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), 
name=name + 'op.7.bias'), + moving_mean_name=name + 'op.7.running_mean', + moving_variance_name=name + 'op.7.running_var') + else: + sepconv_out = fluid.layers.batch_norm( + conv2d_e, + param_attr=ParamAttr( + initializer=Constant(1.), + learning_rate=0., + name=name + 'op.7.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), + learning_rate=0., + name=name + 'op.7.bias'), + moving_mean_name=name + 'op.7.running_mean', + moving_variance_name=name + 'op.7.running_var') + return sepconv_out + +def SevenConv(input, C_out, stride, name='', affine=True): + relu_a = fluid.layers.relu(input) + conv2d_a = fluid.layers.conv2d( + relu_a, + C_out, (1, 7), (1, stride), (0, 3), + param_attr=ParamAttr( + initializer=Xavier( + uniform=False, fan_in=0), + name=name + 'op.1.weight'), + bias_attr=False) + conv2d_b = fluid.layers.conv2d( + conv2d_a, + C_out, (7, 1), (stride, 1), (3, 0), + param_attr=ParamAttr( + initializer=Xavier( + uniform=False, fan_in=0), + name=name + 'op.2.weight'), + bias_attr=False) + if affine: + out = fluid.layers.batch_norm( + conv2d_b, + param_attr=ParamAttr( + initializer=Constant(1.), name=name + 'op.3.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), name=name + 'op.3.bias'), + moving_mean_name=name + 'op.3.running_mean', + moving_variance_name=name + 'op.3.running_var') + else: + out = fluid.layers.batch_norm( + conv2d_b, + param_attr=ParamAttr( + initializer=Constant(1.), + learning_rate=0., + name=name + 'op.3.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), + learning_rate=0., + name=name + 'op.3.bias'), + moving_mean_name=name + 'op.3.running_mean', + moving_variance_name=name + 'op.3.running_var') + +def Identity(input, name=''): + return input + +def Zero(input, stride, name=''): + ones = np.ones(input.shape[-2:]) + ones[::stride, ::stride] = 0 + ones = fluid.layers.assign(ones) + return input * ones + +def FactorizedReduce(input, C_out, name='', affine=True): + relu_a = fluid.layers.relu(input) + conv2d_a = fluid.layers.conv2d( + relu_a, + C_out // 2, + 1, + 2, + param_attr=ParamAttr( + initializer=Xavier( + uniform=False, fan_in=0), + name=name + 'conv_1.weight'), + bias_attr=False) + h_end = relu_a.shape[2] + w_end = relu_a.shape[3] + slice_a = fluid.layers.slice(relu_a, [2, 3], [1, 1], [h_end, w_end]) + conv2d_b = fluid.layers.conv2d( + slice_a, + C_out // 2, + 1, + 2, + param_attr=ParamAttr( + initializer=Xavier( + uniform=False, fan_in=0), + name=name + 'conv_2.weight'), + bias_attr=False) + out = fluid.layers.concat([conv2d_a, conv2d_b], axis=1) + if affine: + out = fluid.layers.batch_norm( + out, + param_attr=ParamAttr( + initializer=Constant(1.), name=name + 'bn.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), name=name + 'bn.bias'), + moving_mean_name=name + 'bn.running_mean', + moving_variance_name=name + 'bn.running_var') + else: + out = fluid.layers.batch_norm( + out, + param_attr=ParamAttr( + initializer=Constant(1.), + learning_rate=0., + name=name + 'bn.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), + learning_rate=0., + name=name + 'bn.bias'), + moving_mean_name=name + 'bn.running_mean', + moving_variance_name=name + 'bn.running_var') + return out + +class Cell(): + def __init__(self, genotype, C_prev_prev, C_prev, C, reduction, + reduction_prev): + + if reduction_prev: + self.preprocess0 = functools.partial(FactorizedReduce, C_out=C) + else: + self.preprocess0 = functools.partial( + ReLUConvBN, C_out=C, kernel_size=1, stride=1, padding=0) + self.preprocess1 = functools.partial( + ReLUConvBN, 
C_out=C, kernel_size=1, stride=1, padding=0) + if reduction: + op_names, indices = zip(*genotype.reduce) + concat = genotype.reduce_concat + else: + op_names, indices = zip(*genotype.normal) + concat = genotype.normal_concat + print(op_names, indices, concat, reduction) + self._compile(C, op_names, indices, concat, reduction) + + def _compile(self, C, op_names, indices, concat, reduction): + assert len(op_names) == len(indices) + self._steps = len(op_names) // 2 + self._concat = concat + self.multiplier = len(concat) + + self._ops = [] + for name, index in zip(op_names, indices): + stride = 2 if reduction and index < 2 else 1 + op = functools.partial(OPS[name], C=C, stride=stride, affine=True) + self._ops += [op] + self._indices = indices + + def forward(self, s0, s1, drop_prob, is_train, name): + self.training = is_train + preprocess0_name = name + 'preprocess0.' + preprocess1_name = name + 'preprocess1.' + s0 = self.preprocess0(s0, name=preprocess0_name) + s1 = self.preprocess1(s1, name=preprocess1_name) + out = [s0, s1] + for i in range(self._steps): + h1 = out[self._indices[2 * i]] + h2 = out[self._indices[2 * i + 1]] + op1 = self._ops[2 * i] + op2 = self._ops[2 * i + 1] + h3 = op1(h1, name=name + '_ops.' + str(2 * i) + '.') + h4 = op2(h2, name=name + '_ops.' + str(2 * i + 1) + '.') + if self.training and drop_prob > 0.: + if h3 != h1: + h3 = fluid.layers.dropout( + h3, + drop_prob, + dropout_implementation='upscale_in_train') + if h4 != h2: + h4 = fluid.layers.dropout( + h4, + drop_prob, + dropout_implementation='upscale_in_train') + s = h3 + h4 + out += [s] + return fluid.layers.concat([out[i] for i in self._concat], axis=1) + +def AuxiliaryHeadImageNet(input, num_classes, aux_name='auxiliary_head'): + relu_a = fluid.layers.relu(input) + pool_a = fluid.layers.pool2d(relu_a, 5, 'avg', 2) + conv2d_a = fluid.layers.conv2d( + pool_a, + 128, + 1, + name=aux_name + '.features.2', + bias_attr=False) + bn_a_name = aux_name + '.features.3' + bn_a = fluid.layers.batch_norm( + conv2d_a, + act='relu', + name=bn_a_name, + param_attr=ParamAttr( + initializer=Constant(1.), name=bn_a_name + '.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), name=bn_a_name + '.bias'), + moving_mean_name=bn_a_name + '.running_mean', + moving_variance_name=bn_a_name + '.running_var') + conv2d_b = fluid.layers.conv2d( + bn_a, + 768, + 2, + name=aux_name + '.features.5', + bias_attr=False) + bn_b_name = aux_name + '.features.6' + bn_b = fluid.layers.batch_norm( + conv2d_b, + act='relu', + name=bn_b_name, + param_attr=ParamAttr( + initializer=Constant(1.), name=bn_b_name + '.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), name=bn_b_name + '.bias'), + moving_mean_name=bn_b_name + '.running_mean', + moving_variance_name=bn_b_name + '.running_var') + pool_b = fluid.layers.adaptive_pool2d(bn_b, (1, 1), "avg") + fc_name = aux_name + '.classifier' + fc = fluid.layers.fc(pool_b, + num_classes, + name=fc_name, + param_attr=ParamAttr( + initializer=Normal(scale=1e-3), + name=fc_name + '.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), name=fc_name + '.bias')) + return fc + + +def StemConv0(input, C_out): + conv_a = fluid.layers.conv2d( + input, + C_out // 2, + 3, + stride=2, + padding=1, + bias_attr=False) + bn_a = fluid.layers.batch_norm( + conv_a, + act='relu', + param_attr=ParamAttr( + initializer=Constant(1.), name='stem0.1.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), name='stem0.1.bias'), + moving_mean_name='stem0.1.running_mean', + 
moving_variance_name='stem0.1.running_var') + + conv_b = fluid.layers.conv2d( + bn_a, + C_out, + 3, + stride=2, + padding=1, + bias_attr=False) + bn_b = fluid.layers.batch_norm( + conv_b, + param_attr=ParamAttr( + initializer=Constant(1.), name='stem0.3.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), name='stem0.3.bias'), + moving_mean_name='stem0.3.running_mean', + moving_variance_name='stem0.3.running_var') + return bn_b + +def StemConv1(input, C_out): + relu_a = fluid.layers.relu(input) + conv_a = fluid.layers.conv2d( + relu_a, + C_out, + 3, + stride=2, + padding=1, + bias_attr=False) + bn_a = fluid.layers.batch_norm( + conv_a, + param_attr=ParamAttr( + initializer=Constant(1.), name='stem1.1.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), name='stem1.1.bias'), + moving_mean_name='stem1.1.running_mean', + moving_variance_name='stem1.1.running_var') + return bn_a + +class NetworkImageNet(object): + def __init__(self, arch='DARTS_6M'): + self.params = train_parameters + self.class_num = 1000 + self.init_channel = 48 + self._layers = 14 + self._auxiliary = False + self.drop_path_prob = 0 + genotype = arch_dict[arch] + + C = self.init_channel + layers = self._layers + C_prev_prev, C_prev, C_curr = C, C, C + self.cells = [] + reduction_prev = True + for i in range(layers): + if i in [layers // 3, 2 * layers // 3]: + C_curr *= 2 + reduction = True + else: + reduction = False + cell = Cell(genotype, C_prev_prev, C_prev, C_curr, reduction, + reduction_prev) + reduction_prev = reduction + self.cells += [cell] + C_prev_prev, C_prev = C_prev, cell.multiplier * C_curr + if i == 2 * layers // 3: + C_to_auxiliary = C_prev + + def net(self, input, class_dim=1000, is_train=True): + self.logits_aux = None + num_channel = self.init_channel + s0 = StemConv0(input, num_channel) + s1 = StemConv1(s0, num_channel) + for i, cell in enumerate(self.cells): + name = 'cells.' + str(i) + '.' + s0, s1 = s1, cell.forward(s0, s1, self.drop_path_prob, is_train, + name) + if i == int(2 * self._layers // 3): + if self._auxiliary and is_train: + self.logits_aux = AuxiliaryHeadImageNet(s1, self.class_num) + out = fluid.layers.adaptive_pool2d(s1, (1, 1), "avg") + self.logits = fluid.layers.fc(out, + size=self.class_num, + param_attr=ParamAttr( + initializer=Normal(scale=1e-4), + name='classifier.weight'), + bias_attr=ParamAttr( + initializer=Constant(0.), + name='classifier.bias')) + return self.logits + +def DARTS_6M(): + return NetworkImageNet(arch = 'DARTS_6M') +def DARTS_4M(): + return NetworkImageNet(arch = 'DARTS_4M') diff --git a/PaddleCV/image_classification/models/hrnet.py b/PaddleCV/image_classification/models/hrnet.py new file mode 100644 index 0000000000000000000000000000000000000000..ea6b98e74c6ddde976792d51268e35c48025eec0 --- /dev/null +++ b/PaddleCV/image_classification/models/hrnet.py @@ -0,0 +1,322 @@ +#copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
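+
+# HRNet for image classification (see the HRNet class below). The backbone keeps
+# several branches at different resolutions in parallel (branch widths come from
+# self.channels, selected by `width`) and repeatedly exchanges information between
+# them in fuse_layers: lower-resolution branches are upsampled with resize_nearest,
+# higher-resolution ones are downsampled with stride-2 convs. The classification
+# head passes each branch through a bottleneck block, merges them top-down with
+# stride-2 convs and element-wise addition, expands to 2048 channels with a 1x1
+# conv, then global-average-pools into the final fc layer. `has_se` adds a
+# squeeze-and-excitation module to every block.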
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import paddle +import paddle.fluid as fluid +from paddle.fluid.initializer import MSRA +import math +from paddle.fluid.param_attr import ParamAttr + +__all__ = ["HRNet", "HRNet_W18_C", "HRNet_W30_C", "HRNet_W32_C", "HRNet_W40_C", "HRNet_W44_C", "HRNet_W48_C", "HRNet_W60_C", + "HRNet_W64_C", "SE_HRNet_W18_C", "SE_HRNet_W30_C", "SE_HRNet_W32_C", "SE_HRNet_W40_C", "SE_HRNet_W44_C", + "SE_HRNet_W48_C", "SE_HRNet_W60_C", "SE_HRNet_W64_C"] + + +class HRNet(): + def __init__(self, width=18, has_se=False): + self.width = width + self.has_se = has_se + self.channels = { + 18: [[18, 36], [18, 36, 72], [18, 36, 72, 144]], + 30: [[30, 60], [30, 60, 120], [30, 60, 120, 240]], + 32: [[32, 64], [32, 64, 128], [32, 64, 128, 256]], + 40: [[40, 80], [40, 80, 160], [40, 80, 160, 320]], + 44: [[44, 88], [44, 88, 176], [44, 88, 176, 352]], + 48: [[48, 96], [48, 96, 192], [48, 96, 192, 384]], + 60: [[60, 120], [60, 120, 240], [60, 120, 240, 480]], + 64: [[64, 128], [64, 128, 256], [64, 128, 256, 512]] + } + + + def net(self, input, class_dim=1000): + width = self.width + channels_2, channels_3, channels_4 = self.channels[width] + num_modules_2, num_modules_3, num_modules_4 = 1, 4, 3 + + x = self.conv_bn_layer(input=input, filter_size=3, num_filters=64, stride=2, if_act=True, name='layer1_1') + x = self.conv_bn_layer(input=x, filter_size=3, num_filters=64, stride=2, if_act=True, name='layer1_2') + + la1 = self.layer1(x, name='layer2') + tr1 = self.transition_layer([la1], [256], channels_2, name='tr1') + st2 = self.stage(tr1, num_modules_2, channels_2, name='st2') + tr2 = self.transition_layer(st2, channels_2, channels_3, name='tr2') + st3 = self.stage(tr2, num_modules_3, channels_3, name='st3') + tr3 = self.transition_layer(st3, channels_3, channels_4, name='tr3') + st4 = self.stage(tr3, num_modules_4, channels_4, name='st4') + + #classification + last_cls = self.last_cls_out(x=st4, name='cls_head') + y = last_cls[0] + last_num_filters = [256, 512, 1024] + for i in range(3): + y = fluid.layers.elementwise_add(last_cls[i+1], + self.conv_bn_layer(input=y, filter_size=3, + num_filters=last_num_filters[i], stride=2, + name='cls_head_add'+str(i+1))) + + y = self.conv_bn_layer(input=y, filter_size=1, num_filters=2048, stride=1, name='cls_head_last_conv') + pool = fluid.layers.pool2d(input=y, pool_type='avg', global_pooling=True) + stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0) + out = fluid.layers.fc(input=pool, size=class_dim, + param_attr=ParamAttr(name='fc_weights', initializer=fluid.initializer.Uniform(-stdv, stdv)), + bias_attr=ParamAttr(name='fc_offset')) + return out + + + def layer1(self, input, name=None): + conv = input + for i in range(4): + conv = self.bottleneck_block(conv, num_filters=64, downsample=True if i == 0 else False, name=name+'_'+str(i+1)) + return conv + + + def transition_layer(self, x, in_channels, out_channels, name=None): + num_in = len(in_channels) + num_out = len(out_channels) + out = [] + for i in range(num_out): + if i < num_in: + if in_channels[i] != out_channels[i]: + residual = self.conv_bn_layer(x[i], filter_size=3, num_filters=out_channels[i], name=name+'_layer_'+str(i+1)) + out.append(residual) + else: + out.append(x[i]) + else: + residual = self.conv_bn_layer(x[-1], filter_size=3, num_filters=out_channels[i], stride=2, + name=name+'_layer_'+str(i+1)) + out.append(residual) + return out + + + def branches(self, x, block_num, channels, name=None): + out = [] + for i in 
range(len(channels)): + residual = x[i] + for j in range(block_num): + residual = self.basic_block(residual, channels[i], name=name+'_branch_layer_'+str(i+1)+'_'+str(j+1)) + out.append(residual) + return out + + + def fuse_layers(self, x, channels, multi_scale_output=True, name=None): + out = [] + for i in range(len(channels) if multi_scale_output else 1): + residual = x[i] + for j in range(len(channels)): + if j > i: + y = self.conv_bn_layer(x[j], filter_size=1, num_filters=channels[i], if_act=False, + name=name+'_layer_'+str(i+1)+'_'+str(j+1)) + y = fluid.layers.resize_nearest(input=y, scale=2 ** (j - i)) + residual = fluid.layers.elementwise_add(x=residual, y=y, act=None) + elif j < i: + y = x[j] + for k in range(i - j): + if k == i - j - 1: + y = self.conv_bn_layer(y, filter_size=3, num_filters=channels[i], stride=2, if_act=False, + name=name+'_layer_'+str(i+1)+'_'+str(j+1)+'_'+str(k+1)) + else: + y = self.conv_bn_layer(y, filter_size=3, num_filters=channels[j], stride=2, + name=name+'_layer_'+str(i+1)+'_'+str(j+1)+'_'+str(k+1)) + residual = fluid.layers.elementwise_add(x=residual, y=y, act=None) + + residual = fluid.layers.relu(residual) + out.append(residual) + return out + + + def high_resolution_module(self, x, channels, multi_scale_output=True, name=None): + residual = self.branches(x, 4, channels, name=name) + out = self.fuse_layers(residual, channels, multi_scale_output=multi_scale_output, name=name) + return out + + + def stage(self, x, num_modules, channels, multi_scale_output=True, name=None): + out = x + for i in range(num_modules): + if i == num_modules - 1 and multi_scale_output == False: + out = self.high_resolution_module(out, channels, multi_scale_output=False, name=name+'_'+str(i+1)) + else: + out = self.high_resolution_module(out, channels, name=name+'_'+str(i+1)) + + return out + + + def last_cls_out(self, x, name=None): + out = [] + num_filters_list = [32, 64, 128, 256] + for i in range(len(x)): + out.append(self.bottleneck_block(input=x[i], num_filters=num_filters_list[i], name=name+'conv_'+str(i+1), + downsample=True)) + + return out + + + def basic_block(self, input, num_filters, stride=1, downsample=False, name=None): + residual = input + conv = self.conv_bn_layer(input=input, filter_size=3, num_filters=num_filters, stride=stride, name=name+'_conv1') + conv = self.conv_bn_layer(input=conv, filter_size=3, num_filters=num_filters, if_act=False, name=name+'_conv2') + if downsample: + residual = self.conv_bn_layer(input=input, filter_size=1, num_filters=num_filters, if_act=False, + name=name+'_downsample') + if self.has_se: + conv = self.squeeze_excitation( + input=conv, + num_channels=num_filters, + reduction_ratio=16, + name=name+'_fc') + return fluid.layers.elementwise_add(x=residual, y=conv, act='relu') + + + def bottleneck_block(self, input, num_filters, stride=1, downsample=False, name=None): + residual = input + conv = self.conv_bn_layer(input=input, filter_size=1, num_filters=num_filters, name=name+'_conv1') + conv = self.conv_bn_layer(input=conv, filter_size=3, num_filters=num_filters, stride=stride, name=name+'_conv2') + conv = self.conv_bn_layer(input=conv, filter_size=1, num_filters=num_filters*4, if_act=False, name=name+'_conv3') + if downsample: + residual = self.conv_bn_layer(input=input, filter_size=1, num_filters=num_filters*4, if_act=False, + name=name+'_downsample') + if self.has_se: + conv = self.squeeze_excitation( + input=conv, + num_channels=num_filters * 4, + reduction_ratio=16, + name=name+'_fc') + return 
fluid.layers.elementwise_add(x=residual, y=conv, act='relu') + + + def squeeze_excitation(self, input, num_channels, reduction_ratio, name=None): + pool = fluid.layers.pool2d( + input=input, pool_size=0, pool_type='avg', global_pooling=True) + stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0) + squeeze = fluid.layers.fc(input=pool, + size=num_channels / reduction_ratio, + act='relu', + param_attr=fluid.param_attr.ParamAttr( + initializer=fluid.initializer.Uniform( + -stdv, stdv),name=name+'_sqz_weights'), + bias_attr=ParamAttr(name=name+'_sqz_offset')) + stdv = 1.0 / math.sqrt(squeeze.shape[1] * 1.0) + excitation = fluid.layers.fc(input=squeeze, + size=num_channels, + act='sigmoid', + param_attr=fluid.param_attr.ParamAttr( + initializer=fluid.initializer.Uniform( + -stdv, stdv),name=name+'_exc_weights'), + bias_attr=ParamAttr(name=name+'_exc_offset')) + scale = fluid.layers.elementwise_mul(x=input, y=excitation, axis=0) + return scale + + + def conv_bn_layer(self,input, filter_size, num_filters, stride=1, padding=1, num_groups=1, if_act=True, name=None): + conv = fluid.layers.conv2d( + input=input, + num_filters=num_filters, + filter_size=filter_size, + stride=stride, + padding=(filter_size-1)//2, + groups=num_groups, + act=None, + param_attr=ParamAttr(initializer=MSRA(), name=name+'_weights'), + bias_attr=False) + bn_name = name + '_bn' + bn = fluid.layers.batch_norm(input=conv, + param_attr = ParamAttr(name=bn_name+"_scale", initializer=fluid.initializer.Constant(1.0)), + bias_attr=ParamAttr(name=bn_name+"_offset", initializer=fluid.initializer.Constant(0.0)), + moving_mean_name=bn_name+'_mean', + moving_variance_name=bn_name+'_variance') + if if_act: + bn = fluid.layers.relu(bn) + return bn + + +def HRNet_W18_C(): + model = HRNet(width=18) + return model + + +def HRNet_W30_C(): + model = HRNet(width=30) + return model + + +def HRNet_W32_C(): + model = HRNet(width=32) + return model + + +def HRNet_W40_C(): + model = HRNet(width=40) + return model + + +def HRNet_W44_C(): + model = HRNet(width=44) + return model + + +def HRNet_W48_C(): + model = HRNet(width=48) + return model + +def HRNet_W60_C(): + model = HRNet(width=60) + return model + + +def HRNet_W64_C(): + model = HRNet(width=64) + return model + + +def SE_HRNet_W18_C(): + model = HRNet(width=18, has_se=True) + return model + + +def SE_HRNet_W30_C(): + model = HRNet(width=30, has_se=True) + return model + +def SE_HRNet_W32_C(): + model = HRNet(width=32, has_se=True) + return model + + +def SE_HRNet_W40_C(): + model = HRNet(width=40, has_se=True) + return model + + +def SE_HRNet_W44_C(): + model = HRNet(width=44, has_se=True) + return model + + +def SE_HRNet_W48_C(): + model = HRNet(width=48, has_se=True) + return model + + +def SE_HRNet_W60_C(): + model = HRNet(width=60, has_se=True) + return model + + +def SE_HRNet_W64_C(): + model = HRNet(width=64, has_se=True) + return model diff --git a/PaddleCV/image_classification/models/res2net.py b/PaddleCV/image_classification/models/res2net.py new file mode 100644 index 0000000000000000000000000000000000000000..6237244ec3c9d6ce5d1b093584dfafe7887ee1d2 --- /dev/null +++ b/PaddleCV/image_classification/models/res2net.py @@ -0,0 +1,200 @@ +#copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. 
+#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import paddle +import paddle.fluid as fluid +import math +from paddle.fluid.param_attr import ParamAttr + +__all__ = ["Res2Net", "Res2Net50_48w_2s", "Res2Net50_26w_4s", "Res2Net50_14w_8s", "Res2Net50_26w_6s", "Res2Net50_26w_8s", + "Res2Net101_26w_4s", "Res2Net152_26w_4s"] + + +class Res2Net(): + + def __init__(self, layers=50, scales=4, width=26): + self.layers = layers + self.scales = scales + self.width = width + + def net(self, input, class_dim=1000): + layers = self.layers + supported_layers = [50, 101, 152] + assert layers in supported_layers, \ + "supported layers are {} but input layer is {}".format(supported_layers, layers) + basic_width = self.width * self.scales + num_filters1 = [basic_width * t for t in [1, 2, 4, 8]] + num_filters2 = [256 * t for t in [1, 2, 4, 8]] + + if layers == 50: + depth = [3, 4, 6, 3] + elif layers == 101: + depth = [3, 4, 23, 3] + elif layers == 152: + depth = [3, 8, 36, 3] + conv = self.conv_bn_layer( + input=input, num_filters=64, filter_size=7, stride=2, act='relu', name="conv1") + + + conv = fluid.layers.pool2d( + input=conv, pool_size=3, pool_stride=2, pool_padding=1, pool_type='max') + + for block in range(len(depth)): + for i in range(depth[block]): + if layers in [101, 152] and block == 2: + if i == 0: + conv_name = "res" + str(block+2) + "a" + else: + conv_name = "res" + str(block+2) + "b" + str(i) + else: + conv_name = "res" + str(block+2) + chr(97+i) + conv = self.bottleneck_block( + input=conv, + num_filters1=num_filters1[block], + num_filters2=num_filters2[block], + stride=2 if i==0 and block !=0 else 1, name=conv_name) + pool = fluid.layers.pool2d( + input=conv, pool_size=7, pool_stride=1, pool_type='avg', global_pooling=True) + + stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0) + out = fluid.layers.fc( + input=pool, + size=class_dim, + param_attr=fluid.param_attr.ParamAttr( + initializer=fluid.initializer.Uniform(-stdv, stdv),name='fc_weights'), + bias_attr=fluid.param_attr.ParamAttr(name='fc_offset')) + return out + + + def conv_bn_layer(self, + input, + num_filters, + filter_size, + stride=1, + groups=1, + act=None, + name=None): + conv = fluid.layers.conv2d( + input=input, + num_filters=num_filters, + filter_size=filter_size, + stride=stride, + padding=(filter_size - 1)//2, + groups=groups, + act=None, + param_attr=ParamAttr(name=name + "_weights"), + bias_attr=False) + + if name == "conv1": + bn_name = "bn_" + name + else: + bn_name = "bn" + name[3:] + + return fluid.layers.batch_norm(input=conv, + act=act, + param_attr=ParamAttr(name=bn_name+'_scale'), + bias_attr=ParamAttr(bn_name+'_offset'), + moving_mean_name=bn_name+'_mean', + moving_variance_name=bn_name+'_variance') + + + def shortcut(self, input, ch_out, stride, name): + ch_in = input.shape[1] + if ch_in != ch_out or stride != 1: + return self.conv_bn_layer(input, ch_out, 1, stride, name=name) + else: + return input + + + def bottleneck_block(self, input, num_filters1, num_filters2, stride, name): + conv0 = self.conv_bn_layer( + input=input, + 
num_filters=num_filters1, + filter_size=1, + stride=1, + act='relu', + name=name+'_branch2a') + xs = fluid.layers.split(conv0, self.scales, 1) + ys = [] + for s in range(self.scales - 1): + if s == 0 or stride == 2: + ys.append(self.conv_bn_layer(input=xs[s], + num_filters=num_filters1//self.scales, + stride=stride, + filter_size=3, + act='relu', + name=name+'_branch2b_'+str(s+1))) + else: + ys.append(self.conv_bn_layer(input=xs[s]+ys[-1], + num_filters=num_filters1//self.scales, + stride=stride, + filter_size=3, + act='relu', + name=name+'_branch2b_'+str(s+1))) + if stride == 1: + ys.append(xs[-1]) + else: + ys.append(fluid.layers.pool2d(input=xs[-1], + pool_size=3, + pool_stride=stride, + pool_padding=1, + pool_type='avg')) + + conv1 = fluid.layers.concat(ys, axis=1) + conv2 = self.conv_bn_layer( + input=conv1, num_filters=num_filters2, filter_size=1, act=None, name=name+"_branch2c") + + short = self.shortcut(input, num_filters2, stride, name=name+"_branch1") + + return fluid.layers.elementwise_add(x=short, y=conv2, act='relu') + + + +def Res2Net50_48w_2s(): + model = Res2Net(layers=50, scales=2, width=48) + return model + + +def Res2Net50_26w_4s(): + model = Res2Net(layers=50, scales=4, width=26) + return model + + +def Res2Net50_14w_8s(): + model = Res2Net(layers=50, scales=8, width=14) + return model + + +def Res2Net50_26w_6s(): + model = Res2Net(layers=50, scales=6, width=26) + return model + + +def Res2Net50_26w_8s(): + model = Res2Net(layers=50, scales=8, width=26) + return model + + +def Res2Net101_26w_4s(): + model = Res2Net(layers=101, scales=4, width=26) + return model + + +def Res2Net152_26w_4s(): + model = Res2Net(layers=152, scales=4, width=26) + return model diff --git a/PaddleCV/image_classification/models/res2net_vd.py b/PaddleCV/image_classification/models/res2net_vd.py new file mode 100644 index 0000000000000000000000000000000000000000..596ff6063c6202f04b8079488cf3ea5fec512d9b --- /dev/null +++ b/PaddleCV/image_classification/models/res2net_vd.py @@ -0,0 +1,250 @@ +#copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
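+
+# Res2Net "vd" variant. Compared with res2net.py above: the 7x7 stem is replaced by
+# three 3x3 convs (conv1_1/conv1_2/conv1_3), and downsampling shortcuts use a 2x2
+# average pool followed by a stride-1 1x1 conv (conv_bn_layer_new) instead of a
+# strided 1x1 conv. Inside each bottleneck the 1x1-reduced features are split into
+# `scales` channel groups; when stride is 1, each later group adds the previous
+# group's 3x3 output before its own 3x3 conv, and the last group is passed through
+# unchanged (average-pooled when stride is 2).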
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import paddle +import paddle.fluid as fluid +import math +from paddle.fluid.param_attr import ParamAttr +__all__ = ["Res2Net_vd", "Res2Net50_vd_48w_2s", "Res2Net50_vd_26w_4s", "Res2Net50_vd_14w_8s", "Res2Net50_vd_26w_6s", + "Res2Net50_vd_26w_8s", "Res2Net101_vd_26w_4s", "Res2Net152_vd_26w_4s", "Res2Net200_vd_26w_4s"] + + +class Res2Net_vd(): + + def __init__(self, layers=50, scales=4, width=26): + self.layers = layers + self.scales = scales + self.width = width + + def net(self, input, class_dim=1000): + layers = self.layers + supported_layers = [50, 101, 152, 200] + assert layers in supported_layers, \ + "supported layers are {} but input layer is {}".format(supported_layers, layers) + basic_width = self.width * self.scales + num_filters1 = [basic_width * t for t in [1, 2, 4, 8]] + num_filters2 = [256 * t for t in [1, 2, 4, 8]] + if layers == 50: + depth = [3, 4, 6, 3] + elif layers == 101: + depth = [3, 4, 23, 3] + elif layers == 152: + depth = [3, 8, 36, 3] + elif layers == 200: + depth = [3, 12, 48, 3] + conv = self.conv_bn_layer( + input=input, num_filters=32, filter_size=3, stride=2, act='relu', name='conv1_1') + conv = self.conv_bn_layer( + input=conv, num_filters=32, filter_size=3, stride=1, act='relu', name='conv1_2') + conv = self.conv_bn_layer( + input=conv, num_filters=64, filter_size=3, stride=1, act='relu', name='conv1_3') + + conv = fluid.layers.pool2d( + input=conv, pool_size=3, pool_stride=2, pool_padding=1, pool_type='max') + for block in range(len(depth)): + for i in range(depth[block]): + if layers in [101, 152] and block == 2: + if i == 0: + conv_name = "res" + str(block+2 )+ "a" + else: + conv_name = "res" + str(block+2) + "b" + str(i) + else: + conv_name = "res" + str(block+2) + chr(97+i) + conv = self.bottleneck_block( + input=conv, + num_filters1=num_filters1[block], + num_filters2=num_filters2[block], + stride=2 if i==0 and block!=0 else 1, + if_first=block==i==0, + name=conv_name) + pool = fluid.layers.pool2d( + input=conv, pool_size=7, pool_stride=1, pool_type='avg', global_pooling=True) + + stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0) + out = fluid.layers.fc( + input=pool, + size=class_dim, + param_attr=fluid.param_attr.ParamAttr( + initializer=fluid.initializer.Uniform(-stdv, stdv), name='fc_weights'), + bias_attr=fluid.param_attr.ParamAttr(name='fc_offset')) + return out + + def conv_bn_layer(self, + input, + num_filters, + filter_size, + stride=1, + groups=1, + act=None, + name=None): + conv = fluid.layers.conv2d( + input=input, + num_filters=num_filters, + filter_size=filter_size, + stride=stride, + padding=(filter_size - 1) // 2, + groups=groups, + act=None, + param_attr=ParamAttr(name=name+"_weights"), + bias_attr=False) + if name == "conv1": + bn_name = "bn_" + name + else: + bn_name = "bn" + name[3:] + return fluid.layers.batch_norm(input=conv, + act=act, + param_attr=ParamAttr(name=bn_name+'_scale'), + bias_attr=ParamAttr(bn_name+'_offset'), + moving_mean_name=bn_name+'_mean', + moving_variance_name=bn_name+'_variance') + + def conv_bn_layer_new(self, + input, + num_filters, + filter_size, + stride=1, + groups=1, + act=None, + name=None): + pool = fluid.layers.pool2d(input=input, + pool_size=2, + pool_stride=2, + pool_padding=0, + pool_type='avg', + ceil_mode=True) + + conv = fluid.layers.conv2d( + input=pool, + num_filters=num_filters, + filter_size=filter_size, + stride=1, + padding=(filter_size - 1)//2, + groups=groups, + act=None, + 
param_attr=ParamAttr(name=name+"_weights"), + bias_attr=False) + if name == "conv1": + bn_name = "bn_" + name + else: + bn_name = "bn" + name[3:] + return fluid.layers.batch_norm(input=conv, + act=act, + param_attr=ParamAttr(name=bn_name+'_scale'), + bias_attr=ParamAttr(bn_name+'_offset'), + moving_mean_name=bn_name+'_mean', + moving_variance_name=bn_name+'_variance') + + + def shortcut(self, input, ch_out, stride, name, if_first=False): + ch_in = input.shape[1] + if ch_in != ch_out or stride != 1: + if if_first: + return self.conv_bn_layer(input, ch_out, 1, stride, name=name) + else: + return self.conv_bn_layer_new(input, ch_out, 1, stride, name=name) + elif if_first: + return self.conv_bn_layer(input, ch_out, 1, stride, name=name) + else: + return input + + + def bottleneck_block(self, input, num_filters1, num_filters2, stride, name, if_first): + conv0 = self.conv_bn_layer( + input=input, + num_filters=num_filters1, + filter_size=1, + stride=1, + act='relu', + name=name+'_branch2a') + + xs = fluid.layers.split(conv0, self.scales, 1) + ys = [] + for s in range(self.scales - 1): + if s == 0 or stride == 2: + ys.append(self.conv_bn_layer(input=xs[s], + num_filters=num_filters1//self.scales, + stride=stride, + filter_size=3, + act='relu', + name=name+'_branch2b_'+str(s+1))) + else: + ys.append(self.conv_bn_layer(input=xs[s]+ys[-1], + num_filters=num_filters1//self.scales, + stride=stride, + filter_size=3, + act='relu', + name=name+'_branch2b_'+str(s+1))) + + if stride == 1: + ys.append(xs[-1]) + else: + ys.append(fluid.layers.pool2d(input=xs[-1], + pool_size=3, + pool_stride=stride, + pool_padding=1, + pool_type='avg')) + + conv1 = fluid.layers.concat(ys, axis=1) + conv2 = self.conv_bn_layer( + input=conv1, num_filters=num_filters2, filter_size=1, act=None, name=name+"_branch2c") + + short = self.shortcut(input, num_filters2, stride, if_first=if_first, name=name+"_branch1") + + return fluid.layers.elementwise_add(x=short, y=conv2, act='relu') + + + + +def Res2Net50_vd_48w_2s(): + model = Res2Net_vd(layers=50, scales=2, width=48) + return model + + +def Res2Net50_vd_26w_4s(): + model = Res2Net_vd(layers=50, scales=4, width=26) + return model + + +def Res2Net50_vd_14w_8s(): + model = Res2Net_vd(layers=50, scales=8, width=14) + return model + + +def Res2Net50_vd_26w_6s(): + model = Res2Net_vd(layers=50, scales=6, width=26) + return model + + +def Res2Net50_vd_26w_8s(): + model = Res2Net_vd(layers=50, scales=8, width=26) + return model + + +def Res2Net101_vd_26w_4s(): + model = Res2Net_vd(layers=101, scales=4, width=26) + return model + + +def Res2Net152_vd_26w_4s(): + model = Res2Net_vd(layers=152, scales=4, width=26) + return model + + +def Res2Net200_vd_26w_4s(): + model = Res2Net_vd(layers=200, scales=4, width=26) + return model diff --git a/PaddleCV/image_classification/models/resnet.py b/PaddleCV/image_classification/models/resnet.py index bb68d018b58d6bd7197306a17042619215558eb9..fcf453588ff13e8c53d185940cfc2b060ec4e1ac 100644 --- a/PaddleCV/image_classification/models/resnet.py +++ b/PaddleCV/image_classification/models/resnet.py @@ -31,7 +31,7 @@ class ResNet(): def __init__(self, layers=50): self.layers = layers - def net(self, input, class_dim=1000): + def net(self, input, class_dim=1000, data_format="NCHW"): layers = self.layers supported_layers = [18, 34, 50, 101, 152] assert layers in supported_layers, \ @@ -53,13 +53,15 @@ class ResNet(): filter_size=7, stride=2, act='relu', - name="conv1") + name="conv1", + data_format=data_format) conv = fluid.layers.pool2d( input=conv, 
pool_size=3, pool_stride=2, pool_padding=1, - pool_type='max') + pool_type='max', + data_format=data_format) if layers >= 50: for block in range(len(depth)): for i in range(depth[block]): @@ -74,10 +76,11 @@ class ResNet(): input=conv, num_filters=num_filters[block], stride=2 if i == 0 and block != 0 else 1, - name=conv_name) + name=conv_name, + data_format=data_format) pool = fluid.layers.pool2d( - input=conv, pool_type='avg', global_pooling=True) + input=conv, pool_type='avg', global_pooling=True, data_format=data_format) stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0) out = fluid.layers.fc( input=pool, @@ -93,10 +96,11 @@ class ResNet(): num_filters=num_filters[block], stride=2 if i == 0 and block != 0 else 1, is_first=block == i == 0, - name=conv_name) + name=conv_name, + data_format=data_format) pool = fluid.layers.pool2d( - input=conv, pool_type='avg', global_pooling=True) + input=conv, pool_type='avg', global_pooling=True, data_format=data_format) stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0) out = fluid.layers.fc( input=pool, @@ -112,7 +116,8 @@ class ResNet(): stride=1, groups=1, act=None, - name=None): + name=None, + data_format='NCHW'): conv = fluid.layers.conv2d( input=input, num_filters=num_filters, @@ -123,7 +128,8 @@ class ResNet(): act=None, param_attr=ParamAttr(name=name + "_weights"), bias_attr=False, - name=name + '.conv2d.output.1') + name=name + '.conv2d.output.1', + data_format=data_format) if name == "conv1": bn_name = "bn_" + name @@ -136,62 +142,72 @@ class ResNet(): param_attr=ParamAttr(name=bn_name + '_scale'), bias_attr=ParamAttr(bn_name + '_offset'), moving_mean_name=bn_name + '_mean', - moving_variance_name=bn_name + '_variance', ) + moving_variance_name=bn_name + '_variance', + data_layout=data_format) - def shortcut(self, input, ch_out, stride, is_first, name): - ch_in = input.shape[1] + def shortcut(self, input, ch_out, stride, is_first, name, data_format): + if data_format == 'NCHW': + ch_in = input.shape[1] + else: + ch_in = input.shape[-1] if ch_in != ch_out or stride != 1 or is_first == True: - return self.conv_bn_layer(input, ch_out, 1, stride, name=name) + return self.conv_bn_layer(input, ch_out, 1, stride, name=name, data_format=data_format) else: return input - def bottleneck_block(self, input, num_filters, stride, name): + def bottleneck_block(self, input, num_filters, stride, name, data_format): conv0 = self.conv_bn_layer( input=input, num_filters=num_filters, filter_size=1, act='relu', - name=name + "_branch2a") + name=name + "_branch2a", + data_format=data_format) conv1 = self.conv_bn_layer( input=conv0, num_filters=num_filters, filter_size=3, stride=stride, act='relu', - name=name + "_branch2b") + name=name + "_branch2b", + data_format=data_format) conv2 = self.conv_bn_layer( input=conv1, num_filters=num_filters * 4, filter_size=1, act=None, - name=name + "_branch2c") + name=name + "_branch2c", + data_format=data_format) short = self.shortcut( input, num_filters * 4, stride, is_first=False, - name=name + "_branch1") + name=name + "_branch1", + data_format=data_format) return fluid.layers.elementwise_add( x=short, y=conv2, act='relu', name=name + ".add.output.5") - def basic_block(self, input, num_filters, stride, is_first, name): + def basic_block(self, input, num_filters, stride, is_first, name, data_format): conv0 = self.conv_bn_layer( input=input, num_filters=num_filters, filter_size=3, act='relu', stride=stride, - name=name + "_branch2a") + name=name + "_branch2a", + data_format=data_format) conv1 = self.conv_bn_layer( input=conv0, 
num_filters=num_filters, filter_size=3, act=None, - name=name + "_branch2b") + name=name + "_branch2b", + data_format=data_format) short = self.shortcut( - input, num_filters, stride, is_first, name=name + "_branch1") + input, num_filters, stride, is_first, name=name + "_branch1", data_format=data_format) return fluid.layers.elementwise_add(x=short, y=conv1, act='relu') diff --git a/PaddleCV/image_classification/models/resnet_acnet.py b/PaddleCV/image_classification/models/resnet_acnet.py new file mode 100644 index 0000000000000000000000000000000000000000..575603382a2f8676d43d51bbbbe70c499c442b46 --- /dev/null +++ b/PaddleCV/image_classification/models/resnet_acnet.py @@ -0,0 +1,332 @@ +#copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import math + +import paddle +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr + +__all__ = [ + "ResNet_ACNet", "ResNet18_ACNet", "ResNet34_ACNet", "ResNet50_ACNet", + "ResNet101_ACNet", "ResNet152_ACNet" +] + + +class ResNetACNet(object): + """ ACNet """ + + def __init__(self, layers=50, deploy=False): + """init""" + self.layers = layers + self.deploy = deploy + + def net(self, input, class_dim=1000): + """model""" + layers = self.layers + supported_layers = [18, 34, 50, 101, 152] + assert layers in supported_layers, \ + "supported layers are {} but input layer is {}".format(supported_layers, layers) + + if layers == 18: + depth = [2, 2, 2, 2] + elif layers == 34 or layers == 50: + depth = [3, 4, 6, 3] + elif layers == 101: + depth = [3, 4, 23, 3] + elif layers == 152: + depth = [3, 8, 36, 3] + num_filters = [64, 128, 256, 512] + + conv = self.conv_bn_layer( + input=input, + num_filters=64, + filter_size=7, + stride=2, + act='relu', + name="conv1") + conv = fluid.layers.pool2d( + input=conv, + pool_size=3, + pool_stride=2, + pool_padding=1, + pool_type='max') + if layers >= 50: + for block in range(len(depth)): + for i in range(depth[block]): + if layers in [101, 152] and block == 2: + if i == 0: + conv_name = "res" + str(block + 2) + "a" + else: + conv_name = "res" + str(block + 2) + "b" + str(i) + else: + conv_name = "res" + str(block + 2) + chr(97 + i) + conv = self.bottleneck_block( + input=conv, + num_filters=num_filters[block], + stride=2 if i == 0 and block != 0 else 1, + name=conv_name) + else: + for block in range(len(depth)): + for i in range(depth[block]): + conv_name = "res" + str(block + 2) + chr(97 + i) + conv = self.basic_block( + input=conv, + num_filters=num_filters[block], + stride=2 if i == 0 and block != 0 else 1, + is_first=block == i == 0, + name=conv_name) + + pool = fluid.layers.pool2d( + input=conv, pool_size=7, pool_type='avg', global_pooling=True) + + stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0) + out = fluid.layers.fc( + input=pool, + size=class_dim, + param_attr=fluid.param_attr.ParamAttr( + initializer=fluid.initializer.Uniform(-stdv, 
stdv))) + return out + + def conv_bn_layer(self, **kwargs): + """ + conv_bn_layer + """ + if kwargs['filter_size'] == 1: + return self.conv_bn_layer_ori(**kwargs) + else: + return self.conv_bn_layer_ac(**kwargs) + + # conv bn+relu + def conv_bn_layer_ori(self, + input, + num_filters, + filter_size, + stride=1, + groups=1, + act=None, + name=None): + """ + standard convbn + used for 1x1 convbn in acnet + """ + conv = fluid.layers.conv2d( + input=input, + num_filters=num_filters, + filter_size=filter_size, + stride=stride, + padding=(filter_size - 1) // 2, + groups=groups, + act=None, + param_attr=ParamAttr(name=name + "_weights"), + bias_attr=False, + name=name + '.conv2d.output.1') + + if name == "conv1": + bn_name = "bn_" + name + else: + bn_name = "bn" + name[3:] + return fluid.layers.batch_norm( + input=conv, + act=act, + name=bn_name + '.output.1', + param_attr=ParamAttr(name=bn_name + '_scale'), + bias_attr=ParamAttr(bn_name + '_offset'), + moving_mean_name=bn_name + '_mean', + moving_variance_name=bn_name + '_variance', ) + + # conv bn+relu + def conv_bn_layer_ac(self, + input, + num_filters, + filter_size, + stride=1, + groups=1, + act=None, + name=None): + """ ACNet conv bn """ + padding = (filter_size - 1) // 2 + + square_conv = fluid.layers.conv2d( + input=input, + num_filters=num_filters, + filter_size=filter_size, + stride=stride, + padding=padding, + groups=groups, + act=act if self.deploy else None, + param_attr=ParamAttr(name=name + "_acsquare_weights"), + bias_attr=ParamAttr(name=name + "_acsquare_bias") + if self.deploy else None, + name=name + '.acsquare.conv2d.output.1') + + if self.deploy: + return square_conv + else: + ver_conv = fluid.layers.conv2d( + input=input, + num_filters=num_filters, + filter_size=(filter_size, 1), + stride=stride, + padding=(padding, 0), + groups=groups, + act=None, + param_attr=ParamAttr(name=name + "_acver_weights"), + bias_attr=False, + name=name + '.acver.conv2d.output.1') + + hor_conv = fluid.layers.conv2d( + input=input, + num_filters=num_filters, + filter_size=(1, filter_size), + stride=stride, + padding=(0, padding), + groups=groups, + act=None, + param_attr=ParamAttr(name=name + "_achor_weights"), + bias_attr=False, + name=name + '.achor.conv2d.output.1') + + if name == "conv1": + bn_name = "bn_" + name + else: + bn_name = "bn" + name[3:] + + square_bn = fluid.layers.batch_norm( + input=square_conv, + act=None, + name=bn_name + '.acsquare.output.1', + param_attr=ParamAttr(name=bn_name + '_acsquare_scale'), + bias_attr=ParamAttr(bn_name + '_acsquare_offset'), + moving_mean_name=bn_name + '_acsquare_mean', + moving_variance_name=bn_name + '_acsquare_variance', ) + + ver_bn = fluid.layers.batch_norm( + input=ver_conv, + act=None, + name=bn_name + '.acver.output.1', + param_attr=ParamAttr(name=bn_name + '_acver_scale'), + bias_attr=ParamAttr(bn_name + '_acver_offset'), + moving_mean_name=bn_name + '_acver_mean', + moving_variance_name=bn_name + '_acver_variance', ) + + hor_bn = fluid.layers.batch_norm( + input=hor_conv, + act=None, + name=bn_name + '.achor.output.1', + param_attr=ParamAttr(name=bn_name + '_achor_scale'), + bias_attr=ParamAttr(bn_name + '_achor_offset'), + moving_mean_name=bn_name + '_achor_mean', + moving_variance_name=bn_name + '_achor_variance', ) + + return fluid.layers.elementwise_add( + x=square_bn, y=ver_bn + hor_bn, act=act) + + def shortcut(self, input, ch_out, stride, is_first, name): + """ shortcut """ + ch_in = input.shape[1] + if ch_in != ch_out or stride != 1 or is_first == True: + return self.conv_bn_layer( 
+ input=input, + num_filters=ch_out, + filter_size=1, + stride=stride, + name=name) + else: + return input + + def bottleneck_block(self, input, num_filters, stride, name): + """ bottleneck_block """ + conv0 = self.conv_bn_layer( + input=input, + num_filters=num_filters, + filter_size=1, + act='relu', + name=name + "_branch2a") + conv1 = self.conv_bn_layer( + input=conv0, + num_filters=num_filters, + filter_size=3, + stride=stride, + act='relu', + name=name + "_branch2b") + conv2 = self.conv_bn_layer( + input=conv1, + num_filters=num_filters * 4, + filter_size=1, + act=None, + name=name + "_branch2c") + + short = self.shortcut( + input, + num_filters * 4, + stride, + is_first=False, + name=name + "_branch1") + + return fluid.layers.elementwise_add( + x=short, y=conv2, act='relu', name=name + ".add.output.5") + + def basic_block(self, input, num_filters, stride, is_first, name): + """ basic_block """ + conv0 = self.conv_bn_layer( + input=input, + num_filters=num_filters, + filter_size=3, + act='relu', + stride=stride, + name=name + "_branch2a") + conv1 = self.conv_bn_layer( + input=conv0, + num_filters=num_filters, + filter_size=3, + act=None, + name=name + "_branch2b") + short = self.shortcut( + input, num_filters, stride, is_first, name=name + "_branch1") + return fluid.layers.elementwise_add(x=short, y=conv1, act='relu') + + +def ResNet18_ACNet(deploy=False): + """ResNet18 + ACNet""" + model = ResNetACNet(layers=18, deploy=deploy) + return model + + +def ResNet34_ACNet(deploy=False): + """ResNet34 + ACNet""" + model = ResNetACNet(layers=34, deploy=deploy) + return model + + +def ResNet50_ACNet(deploy=False): + """ResNet50 + ACNet""" + model = ResNetACNet(layers=50, deploy=deploy) + return model + + +def ResNet101_ACNet(deploy=False): + """ResNet101 + ACNet""" + model = ResNetACNet(layers=101, deploy=deploy) + return model + + +def ResNet152_ACNet(deploy=False): + """ResNet152 + ACNet""" + model = ResNetACNet(layers=152, deploy=deploy) + return model diff --git a/PaddleCV/image_classification/models/resnext_vd.py b/PaddleCV/image_classification/models/resnext_vd.py index 92e366feb517c4ea4ed92ef0e0c24b7b6efa7ef4..fd60da377697c0387f8b739facfae2d98bbbbd9b 100644 --- a/PaddleCV/image_classification/models/resnext_vd.py +++ b/PaddleCV/image_classification/models/resnext_vd.py @@ -130,7 +130,6 @@ class ResNeXt(): padding=(filter_size - 1) // 2, groups=groups, act=None, - use_cudnn=False, param_attr=ParamAttr(name=name + "_weights"), bias_attr=False) if name == "conv1": @@ -169,7 +168,6 @@ class ResNeXt(): padding=(filter_size - 1) // 2, groups=groups, act=None, - use_cudnn=False, param_attr=ParamAttr(name=name + "_weights"), bias_attr=False) if name == "conv1": diff --git a/PaddleCV/image_classification/predict.py b/PaddleCV/image_classification/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..cd1ec9bd9b49851ff300ab03cce19f945c64fc06 --- /dev/null +++ b/PaddleCV/image_classification/predict.py @@ -0,0 +1,165 @@ +#copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and +#limitations under the License. + +import argparse +import numpy as np +import cv2 +import os +import logging + +from paddle import fluid +from paddle.fluid.core import PaddleTensor +from paddle.fluid.core import AnalysisConfig +from paddle.fluid.core import create_paddle_predictor + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + + +def resize_short(img, target_size, interpolation=None): + """resize image + + Args: + img: image data + target_size: resize short target size + interpolation: interpolation mode + + Returns: + resized image data + """ + percent = float(target_size) / min(img.shape[0], img.shape[1]) + resized_width = int(round(img.shape[1] * percent)) + resized_height = int(round(img.shape[0] * percent)) + if interpolation: + resized = cv2.resize( + img, (resized_width, resized_height), interpolation=interpolation) + else: + resized = cv2.resize(img, (resized_width, resized_height)) + return resized + + +def crop_image(img, target_size, center): + """crop image + + Args: + img: images data + target_size: crop target size + center: crop mode + + Returns: + img: cropped image data + """ + height, width = img.shape[:2] + size = target_size + if center == True: + w_start = (width - size) // 2 + h_start = (height - size) // 2 + else: + w_start = np.random.randint(0, width - size + 1) + h_start = np.random.randint(0, height - size + 1) + w_end = w_start + size + h_end = h_start + size + img = img[h_start:h_end, w_start:w_end, :] + return img + + +def preprocess_image(img_path): + """ preprocess_image """ + + mean = [0.485, 0.456, 0.406] + std = [0.229, 0.224, 0.225] + crop_size = 224 + target_size = 256 + + img = cv2.imread(img_path) + img = resize_short(img, target_size, interpolation=None) + img = crop_image(img, target_size=crop_size, center=True) + img = img[:, :, ::-1] + + img = img.astype('float32').transpose((2, 0, 1)) / 255 + img_mean = np.array(mean).reshape((3, 1, 1)) + img_std = np.array(std).reshape((3, 1, 1)) + img -= img_mean + img /= img_std + img = np.expand_dims(img, axis=0).copy() + return img + + +def predict(args): + # config AnalysisConfig + config = AnalysisConfig(args.model_file, args.params_file) + if args.gpu_id < 0: + config.disable_gpu() + else: + config.enable_use_gpu(args.gpu_mem, args.gpu_id) + + # you can enable tensorrt engine if paddle is installed with tensorrt + # config.enable_tensorrt_engine() + + predictor = create_paddle_predictor(config) + + # input + inputs = preprocess_image(args.image_path) + inputs = PaddleTensor(inputs) + + # predict + outputs = predictor.run([inputs]) + + # get output + output = outputs[0] + output = output.as_ndarray().flatten() + + cls = np.argmax(output) + score = output[cls] + logger.info("class: {0}".format(cls)) + logger.info("score: {0}".format(score)) + return + + +def check_args(args): + assert os.path.exists(args.model_file), "model_file({}) not exist!".format( + args.model_file) + assert os.path.exists( + args.params_file), "params_file({}) not exist!".format(args.params_file) + assert os.path.exists(args.image_path), "image_path({}) not exist!".format( + args.image_path) + assert isinstance(args.gpu_id, int) + assert isinstance(args.gpu_mem, int) + + +def parse_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_file", type=str, default="", help="model filename") + parser.add_argument( + "--params_file", type=str, default="", help="parameter filename") + 
parser.add_argument("--image_path", type=str, default="", help="image path") + parser.add_argument( + "--gpu_id", + type=int, + default=0, + help="gpu id, if less than 0, gpu is disabled") + parser.add_argument( + "--gpu_mem", type=int, default=2000, help="gpu memory, unit: MB") + return parser.parse_args() + + +def main(): + args = parse_args() + check_args(args) + predict(args) + + +if __name__ == "__main__": + main() diff --git a/PaddleCV/image_classification/reader.py b/PaddleCV/image_classification/reader.py index f88cfd887c6d7f5921dd856035fecdf1cda575d9..3e49af8e909895357d3e841ccd884c3ec946d1d9 100644 --- a/PaddleCV/image_classification/reader.py +++ b/PaddleCV/image_classification/reader.py @@ -18,6 +18,8 @@ import random import functools import numpy as np import cv2 +import logging +import imghdr import paddle from paddle import fluid @@ -26,6 +28,9 @@ from PIL import Image policy = None +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + random.seed(0) np.random.seed(0) @@ -195,21 +200,27 @@ def create_mixup_reader(settings, rd): return mixup_reader + def process_image(sample, settings, mode, color_jitter, rotate): """ process_image """ mean = settings.image_mean std = settings.image_std - crop_size = settings.crop_size + crop_size = settings.image_shape[1] img_path = sample[0] img = cv2.imread(img_path) + if img is None: + logger.warning("img({0}) is None, pass it.".format(img_path)) + return None + if mode == 'train': if rotate: img = rotate_image(img) if crop_size > 0: - img = random_crop(img, crop_size, settings, interpolation=settings.interpolation) + img = random_crop( + img, crop_size, settings, interpolation=settings.interpolation) if color_jitter: img = distort_color(img) if np.random.randint(0, 2) == 1: @@ -217,7 +228,8 @@ def process_image(sample, settings, mode, color_jitter, rotate): else: if crop_size > 0: target_size = settings.resize_short_size - img = resize_short(img, target_size, interpolation=settings.interpolation) + img = resize_short( + img, target_size, interpolation=settings.interpolation) img = crop_image(img, target_size=crop_size, center=True) img = img[:, :, ::-1] @@ -233,22 +245,34 @@ def process_image(sample, settings, mode, color_jitter, rotate): img_std = np.array(std).reshape((3, 1, 1)) img -= img_mean img /= img_std - - if mode == 'train' or mode == 'val': + # doing training (train.py) + if mode == 'train' or (mode == 'val' and + not hasattr(settings, 'save_json_path')): return (img, sample[1]) + #doing testing (eval.py) + elif mode == 'val' and hasattr(settings, 'save_json_path'): + return (img, sample[1], sample[0]) + #doing predict (infer.py) elif mode == 'test': - return (img, ) + return (img, sample[0]) + else: + raise Exception("mode not implemented") + def process_batch_data(input_data, settings, mode, color_jitter, rotate): batch_data = [] for sample in input_data: if os.path.isfile(sample[0]): - batch_data.append( - process_image(sample, settings, mode, color_jitter, rotate)) + tmp_data = process_image(sample, settings, mode, color_jitter, + rotate) + if tmp_data is None: + continue + batch_data.append(tmp_data) else: - print("File not exist : %s" % sample[0]) + logger.info("File not exist : {0}".format(sample[0])) return batch_data + class ImageNetReader: def __init__(self, seed=None): self.shuffle_seed = seed @@ -257,7 +281,27 @@ class ImageNetReader: assert isinstance(seed, int), "shuffle seed must be int" self.shuffle_seed = seed - def _reader_creator(self, settings, + def _get_single_card_bs(self, 
settings, mode): + if settings.use_gpu: + if mode == "val" and hasattr(settings, "test_batch_size"): + single_card_bs = int( + settings.test_batch_size + ) // paddle.fluid.core.get_cuda_device_count() + else: + single_card_bs = int( + settings. + batch_size) // paddle.fluid.core.get_cuda_device_count() + else: + if mode == "val" and hasattr(settings, "test_batch_size"): + single_card_bs = int(settings.test_batch_size) // int( + os.environ.get('CPU_NUM', 1)) + else: + single_card_bs = int(settings.batch_size) // int( + os.environ.get('CPU_NUM', 1)) + return single_card_bs + + def _reader_creator(self, + settings, file_list, mode, shuffle=False, @@ -265,26 +309,35 @@ class ImageNetReader: rotate=False, data_dir=None): num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) - if mode == 'test': - batch_size = 1 - else: - batch_size = settings.batch_size / paddle.fluid.core.get_cuda_device_count() + + batch_size = self._get_single_card_bs(settings, mode) + def reader(): def read_file_list(): with open(file_list) as flist: full_lines = [line.strip() for line in flist] if mode != "test" and len(full_lines) < settings.batch_size: - print( - "Warning: The number of the whole data ({}) is smaller than the batch_size ({}), and drop_last is turnning on, so nothing will feed in program, Terminated now. Please reset batch_size to a smaller number or feed more data!" - .format(len(full_lines), settings.batch_size)) + logger.error( + "Error: The number of the whole data ({}) is smaller than the batch_size ({}), and drop_last is turnning on, so nothing will feed in program, Terminated now. Please reset batch_size to a smaller number or feed more data!". + format(len(full_lines), settings.batch_size)) os._exit(1) if num_trainers > 1 and mode == "train": assert self.shuffle_seed is not None, "multiprocess train, shuffle seed must be set!" - np.random.RandomState(self.shuffle_seed).shuffle(full_lines) + np.random.RandomState(self.shuffle_seed).shuffle( + full_lines) elif shuffle: - np.random.shuffle(full_lines) + if not settings.enable_ce or not settings.same_feed: + np.random.shuffle(full_lines) batch_data = [] + if (mode == "train" or mode == "val") and settings.same_feed: + temp_file = full_lines[0] + logger.info("Same images({},nums:{}) will feed in the net". + format(str(temp_file), settings.same_feed)) + full_lines = [] + for i in range(settings.same_feed): + full_lines.append(temp_file) + for line in full_lines: img_path, label = line.split() img_path = os.path.join(data_dir, img_path) @@ -292,18 +345,19 @@ class ImageNetReader: if len(batch_data) == batch_size: if mode == 'train' or mode == 'val' or mode == 'test': yield batch_data - batch_data = [] return read_file_list data_reader = reader() + if mode == 'train' and num_trainers > 1: assert self.shuffle_seed is not None, \ "If num_trainers > 1, the shuffle_seed must be set, because " \ "the order of batch data generated by reader " \ "must be the same in the respective processes." 
- data_reader = paddle.fluid.contrib.reader.distributed_batch_reader(data_reader) + data_reader = paddle.fluid.contrib.reader.distributed_batch_reader( + data_reader) mapper = functools.partial( process_batch_data, @@ -319,7 +373,6 @@ class ImageNetReader: settings.reader_buf_size, order=False) - def train(self, settings): """Create a reader for trainning @@ -351,11 +404,11 @@ class ImageNetReader: reader = create_mixup_reader(settings, reader) reader = fluid.io.batch( reader, - batch_size=int(settings.batch_size / paddle.fluid.core.get_cuda_device_count()), + batch_size=int(settings.batch_size / + paddle.fluid.core.get_cuda_device_count()), drop_last=True) return reader - def val(self, settings): """Create a reader for eval @@ -370,10 +423,12 @@ class ImageNetReader: assert os.path.isfile( file_list), "{} doesn't exist, please check data list path".format( file_list) - return self._reader_creator( - settings, file_list, 'val', shuffle=False, data_dir=settings.data_dir) - + settings, + file_list, + 'val', + shuffle=False, + data_dir=settings.data_dir) def test(self, settings): """Create a reader for testing @@ -384,9 +439,23 @@ class ImageNetReader: Returns: test reader """ - file_list = os.path.join(settings.data_dir, 'val_list.txt') - assert os.path.isfile( - file_list), "{} doesn't exist, please check data list path".format( - file_list) + file_list = ".tmp.txt" + imgType_list = {'jpg', 'bmp', 'png', 'jpeg', 'rgb', 'tif', 'tiff'} + with open(file_list, "w") as fout: + if settings.image_path: + fout.write(settings.image_path + " 0" + "\n") + settings.batch_size = 1 + settings.data_dir = "" + else: + tmp_file_list = os.listdir(settings.data_dir) + for file_name in tmp_file_list: + file_path = os.path.join(settings.data_dir, file_name) + if imghdr.what(file_path) not in imgType_list: + continue + fout.write(file_name + " 0" + "\n") return self._reader_creator( - settings, file_list, 'test', shuffle=False, data_dir=settings.data_dir) + settings, + file_list, + 'test', + shuffle=False, + data_dir=settings.data_dir) diff --git a/PaddleCV/image_classification/scripts/train/AlexNet.sh b/PaddleCV/image_classification/scripts/train/AlexNet.sh index 6919f2b969f8f67f7de666a8f133cc7b4b779dbe..f58048c5369276dbdcd90421a5b71387c04a94c6 100644 --- a/PaddleCV/image_classification/scripts/train/AlexNet.sh +++ b/PaddleCV/image_classification/scripts/train/AlexNet.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=AlexNet \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=piecewise_decay \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/DPN107.sh b/PaddleCV/image_classification/scripts/train/DPN107.sh index 5fdcf18e72825b69de3441ff10a18b45bebab4bf..41ac19a939f9e0ab8e263d6a16a8770f43303612 100644 --- a/PaddleCV/image_classification/scripts/train/DPN107.sh +++ b/PaddleCV/image_classification/scripts/train/DPN107.sh @@ -2,9 +2,6 @@ python train.py \ --model=DPN107 \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/DPN131.sh b/PaddleCV/image_classification/scripts/train/DPN131.sh index 9ad476715b2ad1536ac834260828037987fb1d63..eba68984bfa196de058cd3984d4f58ccb812804d 100644 --- a/PaddleCV/image_classification/scripts/train/DPN131.sh +++ b/PaddleCV/image_classification/scripts/train/DPN131.sh @@ 
-2,9 +2,6 @@ python train.py \ --model=DPN131 \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/DPN68.sh b/PaddleCV/image_classification/scripts/train/DPN68.sh index 2397a267d3a3c0bc98fdc1fba5029e4651043870..90de0d5f22ef6442403982812f2fe028984121a0 100644 --- a/PaddleCV/image_classification/scripts/train/DPN68.sh +++ b/PaddleCV/image_classification/scripts/train/DPN68.sh @@ -2,9 +2,6 @@ python train.py \ --model=DPN68 \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/DPN92.sh b/PaddleCV/image_classification/scripts/train/DPN92.sh index 27578a463eb58a3b10f2a8a1c2d0ef69573f00c5..67547fc3582ae73e623f53c9996d8f413f12f7b8 100644 --- a/PaddleCV/image_classification/scripts/train/DPN92.sh +++ b/PaddleCV/image_classification/scripts/train/DPN92.sh @@ -2,9 +2,6 @@ python train.py \ --model=DPN92 \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/DPN98.sh b/PaddleCV/image_classification/scripts/train/DPN98.sh index 150f461b03b521e3ba291a469fa3f07381fb7cce..6a999c203be09ad74241c1e6fad8047b1ae26c8c 100644 --- a/PaddleCV/image_classification/scripts/train/DPN98.sh +++ b/PaddleCV/image_classification/scripts/train/DPN98.sh @@ -2,9 +2,6 @@ python train.py \ --model=DPN98 \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/DarkNet53.sh b/PaddleCV/image_classification/scripts/train/DarkNet53.sh index a56f1ea8a033bc93bb152ef732b5e4f49b166d76..cf8a9b07f1aa81781565341c2a80eb8c3d739feb 100644 --- a/PaddleCV/image_classification/scripts/train/DarkNet53.sh +++ b/PaddleCV/image_classification/scripts/train/DarkNet53.sh @@ -3,9 +3,7 @@ python train.py \ --model=DarkNet53 \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,256,256 \ - --class_dim=1000 \ + --image_shape 3 256 256 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/DenseNet121.sh b/PaddleCV/image_classification/scripts/train/DenseNet121.sh index 44ecaca2fd3b38d667850c29df8152bf6f8b0d17..1746c2c5a728e9f28afc0d8dd3f2c98aa0a6b859 100644 --- a/PaddleCV/image_classification/scripts/train/DenseNet121.sh +++ b/PaddleCV/image_classification/scripts/train/DenseNet121.sh @@ -3,9 +3,6 @@ python train.py \ --model=DenseNet121 \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=piecewise_decay \ --lr=0.1 \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/DenseNet161.sh b/PaddleCV/image_classification/scripts/train/DenseNet161.sh index 4da3fb6fbbf3e3ab282a021358d578a1b2fdb39b..25c6be246cdc9a35f9956773af47d1ea6412c78b 100644 --- a/PaddleCV/image_classification/scripts/train/DenseNet161.sh +++ b/PaddleCV/image_classification/scripts/train/DenseNet161.sh @@ -3,9 +3,6 @@ python train.py \ --model=DenseNet161 \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=piecewise_decay \ --lr=0.1 \ --num_epochs=120 \ diff 
--git a/PaddleCV/image_classification/scripts/train/DenseNet169.sh b/PaddleCV/image_classification/scripts/train/DenseNet169.sh index 2e5120e4bf65969a9c6a641d512d064e9ea47263..2392aa2b21221e8abf09c181d3d9a870083e832d 100644 --- a/PaddleCV/image_classification/scripts/train/DenseNet169.sh +++ b/PaddleCV/image_classification/scripts/train/DenseNet169.sh @@ -3,9 +3,6 @@ python train.py \ --model=DenseNet169 \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=piecewise_decay \ --lr=0.1 \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/DenseNet201.sh b/PaddleCV/image_classification/scripts/train/DenseNet201.sh index 7535a86bc137171e79315df9f4784bdee170a07d..25f12a32d904b7d7e3035f5adf92bfd4a0d2a525 100644 --- a/PaddleCV/image_classification/scripts/train/DenseNet201.sh +++ b/PaddleCV/image_classification/scripts/train/DenseNet201.sh @@ -3,9 +3,6 @@ python train.py \ --model=DenseNet201 \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=piecewise_decay \ --lr=0.1 \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/DenseNet264.sh b/PaddleCV/image_classification/scripts/train/DenseNet264.sh index f10df370e46981a23a8d882373e99c6f63608782..423acbaa8f0cdf4f03bb6d59431f5f81bc404e09 100644 --- a/PaddleCV/image_classification/scripts/train/DenseNet264.sh +++ b/PaddleCV/image_classification/scripts/train/DenseNet264.sh @@ -3,9 +3,6 @@ python train.py \ --model=DenseNet264 \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=piecewise_decay \ --lr=0.1 \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/EfficientNetB0.sh b/PaddleCV/image_classification/scripts/train/EfficientNetB0.sh index 37a5d9a4524e06e4dfa77c1527d2dc5d7be6a222..409b4ef80c81bd3f5149ac55dcb8e9f0259d41ae 100644 --- a/PaddleCV/image_classification/scripts/train/EfficientNetB0.sh +++ b/PaddleCV/image_classification/scripts/train/EfficientNetB0.sh @@ -8,9 +8,6 @@ python -u train.py \ --model=EfficientNet \ --batch_size=512 \ --test_batch_size=128 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --resize_short_size=256 \ --model_save_dir=output/ \ --lr_strategy=exponential_decay_warmup \ diff --git a/PaddleCV/image_classification/scripts/train/GoogLeNet.sh b/PaddleCV/image_classification/scripts/train/GoogLeNet.sh index 63171b31691466bcf6c66a7153faf3dc2f80d203..309abd9089b783b282678498077dcd78faf9645d 100644 --- a/PaddleCV/image_classification/scripts/train/GoogLeNet.sh +++ b/PaddleCV/image_classification/scripts/train/GoogLeNet.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=GoogLeNet \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=cosine_decay \ --lr=0.01 \ diff --git a/PaddleCV/image_classification/scripts/train/HRNet_W18_C.sh b/PaddleCV/image_classification/scripts/train/HRNet_W18_C.sh new file mode 100644 index 0000000000000000000000000000000000000000..85fe27e5c496696d5bd3f5e1cb5c98f5e44be534 --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/HRNet_W18_C.sh @@ -0,0 +1,12 @@ +#Training details +#HRNet_W18_C +python train.py \ + --model=HRNet_W18_C \ + --batch_size=256 \ + --total_images=1281167 \ + --class_dim=1000 \ + --lr_strategy=piecewise_decay \ + --lr=0.1 \ + --num_epochs=120 \ + --model_save_dir=output/ \ + 
--l2_decay=1e-4 diff --git a/PaddleCV/image_classification/scripts/train/HRNet_W30_C.sh b/PaddleCV/image_classification/scripts/train/HRNet_W30_C.sh new file mode 100644 index 0000000000000000000000000000000000000000..bd2ff38297213019282027ccd51609f6cc191caa --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/HRNet_W30_C.sh @@ -0,0 +1,12 @@ +#Training details +#HRNet_W30_C +python train.py \ + --model=HRNet_W30_C \ + --batch_size=256 \ + --total_images=1281167 \ + --class_dim=1000 \ + --lr_strategy=piecewise_decay \ + --lr=0.1 \ + --num_epochs=120 \ + --model_save_dir=output/ \ + --l2_decay=1e-4 diff --git a/PaddleCV/image_classification/scripts/train/HRNet_W32_C.sh b/PaddleCV/image_classification/scripts/train/HRNet_W32_C.sh new file mode 100644 index 0000000000000000000000000000000000000000..e962d9e951ce761458db5504edbaf3f922f57e56 --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/HRNet_W32_C.sh @@ -0,0 +1,12 @@ +#Training details +#HRNet_W32_C +python train.py \ + --model=HRNet_W32_C \ + --batch_size=256 \ + --total_images=1281167 \ + --class_dim=1000 \ + --lr_strategy=piecewise_decay \ + --lr=0.1 \ + --num_epochs=120 \ + --model_save_dir=output/ \ + --l2_decay=1e-4 diff --git a/PaddleCV/image_classification/scripts/train/HRNet_W40_C.sh b/PaddleCV/image_classification/scripts/train/HRNet_W40_C.sh new file mode 100644 index 0000000000000000000000000000000000000000..189baaf8689d79e5f4ad2f50cc60565f5018e59d --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/HRNet_W40_C.sh @@ -0,0 +1,12 @@ +#Training details +#HRNet_W40_C +python train.py \ + --model=HRNet_W40_C \ + --batch_size=256 \ + --total_images=1281167 \ + --class_dim=1000 \ + --lr_strategy=piecewise_decay \ + --lr=0.1 \ + --num_epochs=120 \ + --model_save_dir=output/ \ + --l2_decay=1e-4 diff --git a/PaddleCV/image_classification/scripts/train/HRNet_W44_C.sh b/PaddleCV/image_classification/scripts/train/HRNet_W44_C.sh new file mode 100644 index 0000000000000000000000000000000000000000..e42342a841abf6734833c10cd2b43caba3caa6a3 --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/HRNet_W44_C.sh @@ -0,0 +1,12 @@ +#Training details +#HRNet_W44_C +python train.py \ + --model=HRNet_W44_C \ + --batch_size=256 \ + --total_images=1281167 \ + --class_dim=1000 \ + --lr_strategy=piecewise_decay \ + --lr=0.1 \ + --num_epochs=120 \ + --model_save_dir=output/ \ + --l2_decay=1e-4 diff --git a/PaddleCV/image_classification/scripts/train/HRNet_W48_C.sh b/PaddleCV/image_classification/scripts/train/HRNet_W48_C.sh new file mode 100644 index 0000000000000000000000000000000000000000..b040bf310527be66a956a917bae7c18516e169ed --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/HRNet_W48_C.sh @@ -0,0 +1,12 @@ +#Training details +#HRNet_W48_C +python train.py \ + --model=HRNet_W48_C \ + --batch_size=256 \ + --total_images=1281167 \ + --class_dim=1000 \ + --lr_strategy=piecewise_decay \ + --lr=0.1 \ + --num_epochs=120 \ + --model_save_dir=output/ \ + --l2_decay=1e-4 diff --git a/PaddleCV/image_classification/scripts/train/HRNet_W64_C.sh b/PaddleCV/image_classification/scripts/train/HRNet_W64_C.sh new file mode 100644 index 0000000000000000000000000000000000000000..d8276846ab809cf81ca5c9fae84763ce84c912a0 --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/HRNet_W64_C.sh @@ -0,0 +1,12 @@ +#Training details +#HRNet_W64_C +python train.py \ + --model=HRNet_W64_C \ + --batch_size=256 \ + --total_images=1281167 \ + --class_dim=1000 \ + --lr_strategy=piecewise_decay \ + --lr=0.1 \ + 
--num_epochs=120 \ + --model_save_dir=output/ \ + --l2_decay=1e-4 diff --git a/PaddleCV/image_classification/scripts/train/InceptionV4.sh b/PaddleCV/image_classification/scripts/train/InceptionV4.sh index ba3c4954c62cd494dba3822eb0a41429023ce33a..fde037255b5719d62e3d6b9c9dbf9cd76a2735f9 100644 --- a/PaddleCV/image_classification/scripts/train/InceptionV4.sh +++ b/PaddleCV/image_classification/scripts/train/InceptionV4.sh @@ -9,9 +9,7 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=InceptionV4 \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,299,299 \ - --class_dim=1000 \ + --image_shape 3 299 299 \ --lr_strategy=cosine_decay \ --lr=0.045 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/MobileNetV1.sh b/PaddleCV/image_classification/scripts/train/MobileNetV1.sh index 8d00ce7c8c073fdbbdc220d8c8bf4289ff4ba659..4a2d52f5a216b7767f6f90f503d06a95a9aaf586 100644 --- a/PaddleCV/image_classification/scripts/train/MobileNetV1.sh +++ b/PaddleCV/image_classification/scripts/train/MobileNetV1.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=MobileNetV1 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=piecewise_decay \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/MobileNetV1_x0_25.sh b/PaddleCV/image_classification/scripts/train/MobileNetV1_x0_25.sh index aa7f74ba8eb394f28a296fff7b26b13972e49afe..c35aaf9f3326e5302bbaecae9f48d458d583a9fb 100644 --- a/PaddleCV/image_classification/scripts/train/MobileNetV1_x0_25.sh +++ b/PaddleCV/image_classification/scripts/train/MobileNetV1_x0_25.sh @@ -7,9 +7,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=MobileNetV1_x0_25 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=piecewise_decay \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/MobileNetV1_x0_5.sh b/PaddleCV/image_classification/scripts/train/MobileNetV1_x0_5.sh index 85fdbfdc255b618ad000063931f62f42ae43c380..3ad2601364eee0cd74a6bbc89ea45f7807e421be 100644 --- a/PaddleCV/image_classification/scripts/train/MobileNetV1_x0_5.sh +++ b/PaddleCV/image_classification/scripts/train/MobileNetV1_x0_5.sh @@ -7,9 +7,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=MobileNetV1_x0_5 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=piecewise_decay \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/MobileNetV1_x0_75.sh b/PaddleCV/image_classification/scripts/train/MobileNetV1_x0_75.sh index ceeba7449b09a8b87ebb4d6d829907d641f698c0..e157deb48cb2482b073d1413e534600764cdaeab 100644 --- a/PaddleCV/image_classification/scripts/train/MobileNetV1_x0_75.sh +++ b/PaddleCV/image_classification/scripts/train/MobileNetV1_x0_75.sh @@ -7,9 +7,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=MobileNetV1_x0_75 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=piecewise_decay \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/MobileNetV2.sh b/PaddleCV/image_classification/scripts/train/MobileNetV2.sh index 
7a0ce41cadb54030f06534ea5f64cffc1b171bf0..8d1525891a4d1b650cc2edc958c6f34c10bb4403 100644 --- a/PaddleCV/image_classification/scripts/train/MobileNetV2.sh +++ b/PaddleCV/image_classification/scripts/train/MobileNetV2.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=MobileNetV2 \ --batch_size=500 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=cosine_decay \ --num_epochs=240 \ diff --git a/PaddleCV/image_classification/scripts/train/MobileNetV2_x0_25.sh b/PaddleCV/image_classification/scripts/train/MobileNetV2_x0_25.sh index 8bdb0de897209d91e233336810e47a9e9ca324af..41669d0f7ab90b3693d1a8539ecc0d452655c96f 100644 --- a/PaddleCV/image_classification/scripts/train/MobileNetV2_x0_25.sh +++ b/PaddleCV/image_classification/scripts/train/MobileNetV2_x0_25.sh @@ -2,9 +2,6 @@ python train.py \ --model=MobileNetV2_x0_25 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=cosine_decay \ --num_epochs=240 \ diff --git a/PaddleCV/image_classification/scripts/train/MobileNetV2_x0_5.sh b/PaddleCV/image_classification/scripts/train/MobileNetV2_x0_5.sh index f0ba07adc9e0a0b9d219190387dd961dec45c536..6b28d024e8d3290866ee03d8fe1e50576689c7ff 100644 --- a/PaddleCV/image_classification/scripts/train/MobileNetV2_x0_5.sh +++ b/PaddleCV/image_classification/scripts/train/MobileNetV2_x0_5.sh @@ -2,9 +2,6 @@ python train.py \ --model=MobileNetV2_x0_5 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=cosine_decay \ --num_epochs=240 \ diff --git a/PaddleCV/image_classification/scripts/train/MobileNetV2_x0_75.sh b/PaddleCV/image_classification/scripts/train/MobileNetV2_x0_75.sh index 511cfa71e5189e2dc0b34e7606e216e5fc97a1d3..a753c6ed6d1fd87e8cfddb1c81d1d7ffe00c014d 100644 --- a/PaddleCV/image_classification/scripts/train/MobileNetV2_x0_75.sh +++ b/PaddleCV/image_classification/scripts/train/MobileNetV2_x0_75.sh @@ -7,9 +7,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=MobileNetV2_x0_75 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=cosine_decay \ --num_epochs=240 \ diff --git a/PaddleCV/image_classification/scripts/train/MobileNetV2_x1_5.sh b/PaddleCV/image_classification/scripts/train/MobileNetV2_x1_5.sh index f0ed2a0baa71d58aeb00dee1bbcd91e320995508..130e7f39df335db713bc4676982987130bff0c2f 100644 --- a/PaddleCV/image_classification/scripts/train/MobileNetV2_x1_5.sh +++ b/PaddleCV/image_classification/scripts/train/MobileNetV2_x1_5.sh @@ -2,9 +2,6 @@ python train.py \ --model=MobileNetV2_x1_5 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=cosine_decay \ --num_epochs=240 \ diff --git a/PaddleCV/image_classification/scripts/train/MobileNetV2_x2_0.sh b/PaddleCV/image_classification/scripts/train/MobileNetV2_x2_0.sh index dcfe0b8559b57edd227a58721d1341adc62c236c..45a417c0522274cf60647831358fe6efaaaa192d 100644 --- a/PaddleCV/image_classification/scripts/train/MobileNetV2_x2_0.sh +++ b/PaddleCV/image_classification/scripts/train/MobileNetV2_x2_0.sh @@ -2,9 +2,6 @@ python train.py \ --model=MobileNetV2_x2_0 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ 
--lr_strategy=cosine_decay \ --num_epochs=240 \ diff --git a/PaddleCV/image_classification/scripts/train/Res2Net101_vd_26w_4s.sh b/PaddleCV/image_classification/scripts/train/Res2Net101_vd_26w_4s.sh new file mode 100644 index 0000000000000000000000000000000000000000..572b8ce1f837aa6681b9345143daf31b9eb2fc76 --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/Res2Net101_vd_26w_4s.sh @@ -0,0 +1,15 @@ +#Res2Net101_vd_26w_4s + +python train.py \ + --model=Res2Net101_vd_26w_4s \ + --batch_size=256 \ + --total_images=1281167 \ + --class_dim=1000 \ + --lr_strategy=cosine_decay \ + --lr=0.1 \ + --num_epochs=200 \ + --model_save_dir=output/ \ + --l2_decay=1e-4 \ + --use_mixup=True \ + --use_label_smoothing=True \ + --label_smoothing_epsilon=0.1 diff --git a/PaddleCV/image_classification/scripts/train/Res2Net200_vd_26w_4s.sh b/PaddleCV/image_classification/scripts/train/Res2Net200_vd_26w_4s.sh new file mode 100644 index 0000000000000000000000000000000000000000..8048e79b1c48620cfa0a31e42bcea77880ced585 --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/Res2Net200_vd_26w_4s.sh @@ -0,0 +1,15 @@ +#Res2Net200_vd_26w_4s + +python train.py \ + --model=Res2Net200_vd_26w_4s \ + --batch_size=256 \ + --total_images=1281167 \ + --class_dim=1000 \ + --lr_strategy=cosine_decay \ + --lr=0.1 \ + --num_epochs=200 \ + --model_save_dir=output/ \ + --l2_decay=1e-4 \ + --use_mixup=True \ + --use_label_smoothing=True \ + --label_smoothing_epsilon=0.1 diff --git a/PaddleCV/image_classification/scripts/train/Res2Net50_14w_8s.sh b/PaddleCV/image_classification/scripts/train/Res2Net50_14w_8s.sh new file mode 100644 index 0000000000000000000000000000000000000000..0e426eb6bf553a813440c0c9441c6eed6b3871d2 --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/Res2Net50_14w_8s.sh @@ -0,0 +1,15 @@ +#Res2Net50_14w_8s + +python train.py \ + --model=Res2Net50_14w_8s \ + --batch_size=256 \ + --total_images=1281167 \ + --class_dim=1000 \ + --lr_strategy=cosine_decay \ + --lr=0.1 \ + --num_epochs=200 \ + --model_save_dir=output/ \ + --l2_decay=1e-4 \ + --use_mixup=True \ + --use_label_smoothing=True \ + --label_smoothing_epsilon=0.1 diff --git a/PaddleCV/image_classification/scripts/train/Res2Net50_26w_4s.sh b/PaddleCV/image_classification/scripts/train/Res2Net50_26w_4s.sh new file mode 100644 index 0000000000000000000000000000000000000000..3accf9d190fedf674d2da4ad857dc486a7310d5b --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/Res2Net50_26w_4s.sh @@ -0,0 +1,15 @@ +#Res2Net50_26w_4s + +python train.py \ + --model=Res2Net50_26w_4s \ + --batch_size=256 \ + --total_images=1281167 \ + --class_dim=1000 \ + --lr_strategy=cosine_decay \ + --lr=0.1 \ + --num_epochs=200 \ + --model_save_dir=output/ \ + --l2_decay=1e-4 \ + --use_mixup=True \ + --use_label_smoothing=True \ + --label_smoothing_epsilon=0.1 diff --git a/PaddleCV/image_classification/scripts/train/Res2Net50_vd_26w_4s.sh b/PaddleCV/image_classification/scripts/train/Res2Net50_vd_26w_4s.sh new file mode 100644 index 0000000000000000000000000000000000000000..2a4ac318ed936ae60eb93d660e47f56f12c7d8dd --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/Res2Net50_vd_26w_4s.sh @@ -0,0 +1,15 @@ +#Res2Net50_vd_26w_4s + +python train.py \ + --model=Res2Net50_vd_26w_4s \ + --batch_size=256 \ + --total_images=1281167 \ + --class_dim=1000 \ + --lr_strategy=cosine_decay \ + --lr=0.1 \ + --num_epochs=200 \ + --model_save_dir=output/ \ + --l2_decay=1e-4 \ + --use_mixup=True \ + --use_label_smoothing=True \ + --label_smoothing_epsilon=0.1 
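As a usage sketch for the predict.py script added earlier in this patch: only the flag names below come from its parse_args(); the model_file, params_file, and image_path values are hypothetical placeholders for an exported inference model and a test image.

    python predict.py \
        --model_file=output/ResNet50/model \
        --params_file=output/ResNet50/params \
        --image_path=test.jpeg \
        --gpu_id=0 \
        --gpu_mem=2000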
diff --git a/PaddleCV/image_classification/scripts/train/ResNeXt101_32x4d.sh b/PaddleCV/image_classification/scripts/train/ResNeXt101_32x4d.sh index 91d8b5bbae3cb6a580d2bdc9c808c8386a4bb4a5..926e902a70ca5dbeb583f38fbde4815e2c4d3aab 100644 --- a/PaddleCV/image_classification/scripts/train/ResNeXt101_32x4d.sh +++ b/PaddleCV/image_classification/scripts/train/ResNeXt101_32x4d.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ResNeXt101_32x4d \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=piecewise_decay \ --lr=0.1 \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNeXt101_64x4d.sh b/PaddleCV/image_classification/scripts/train/ResNeXt101_64x4d.sh index f5aeb3a308719bad13f83f930a8d9c0e12d8daad..828948c4994f47eb6b9c460966bc3ca6dbffe2d6 100644 --- a/PaddleCV/image_classification/scripts/train/ResNeXt101_64x4d.sh +++ b/PaddleCV/image_classification/scripts/train/ResNeXt101_64x4d.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ResNeXt101_64x4d \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=piecewise_decay \ --lr=0.1 \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNeXt101_vd_32x4d.sh b/PaddleCV/image_classification/scripts/train/ResNeXt101_vd_32x4d.sh index 5e9344808d2e4e826805ea2c2ff7793f554c59db..91cd37a407d248bfb64ca5670b3cac440a8fae36 100644 --- a/PaddleCV/image_classification/scripts/train/ResNeXt101_vd_32x4d.sh +++ b/PaddleCV/image_classification/scripts/train/ResNeXt101_vd_32x4d.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ResNeXt101_vd_32x4d \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNeXt101_vd_64x4d.sh b/PaddleCV/image_classification/scripts/train/ResNeXt101_vd_64x4d.sh index f3d117798edf33acd9f4d9b8e29bdd987a9ece9c..f57287f3b6560b3dfdb633528f4f58f0ba4d404b 100644 --- a/PaddleCV/image_classification/scripts/train/ResNeXt101_vd_64x4d.sh +++ b/PaddleCV/image_classification/scripts/train/ResNeXt101_vd_64x4d.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ResNeXt101_vd_64x4d \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNeXt152_32x4d.sh b/PaddleCV/image_classification/scripts/train/ResNeXt152_32x4d.sh index 1b81968b830c39e918ab3d571172f42fa8c874b2..db13278356b101e05f0eaa0407b358c023328b34 100644 --- a/PaddleCV/image_classification/scripts/train/ResNeXt152_32x4d.sh +++ b/PaddleCV/image_classification/scripts/train/ResNeXt152_32x4d.sh @@ -2,9 +2,6 @@ python train.py \ --model=ResNeXt152_32x4d \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=piecewise_decay \ --lr=0.1 \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNeXt152_64x4d.sh b/PaddleCV/image_classification/scripts/train/ResNeXt152_64x4d.sh index 0a1bd5180de93f702b29b80072c2e00425bb90fc..10e178e38499b37d62c75223a82421a9c18a1e81 100644 --- a/PaddleCV/image_classification/scripts/train/ResNeXt152_64x4d.sh +++ 
b/PaddleCV/image_classification/scripts/train/ResNeXt152_64x4d.sh @@ -8,9 +8,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ResNeXt152_64x4d \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=piecewise_decay \ --lr=0.1 \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNeXt152_vd_32x4d.sh b/PaddleCV/image_classification/scripts/train/ResNeXt152_vd_32x4d.sh new file mode 100644 index 0000000000000000000000000000000000000000..b08eff7989064236afa16338c3fc35a7d835c867 --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/ResNeXt152_vd_32x4d.sh @@ -0,0 +1,14 @@ +#ResNeXt152_vd_32x4d +python train.py \ + --model=ResNeXt152_vd_32x4d \ + --batch_size=256 \ + --total_images=1281167 \ + --class_dim=1000 \ + --lr_strategy=cosine_decay \ + --lr=0.1 \ + --num_epochs=200 \ + --model_save_dir=output/ \ + --l2_decay=1e-4 \ + --use_mixup=True \ + --use_label_smoothing=True \ + --label_smoothing_epsilon=0.1 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNeXt152_vd_64x4d.sh b/PaddleCV/image_classification/scripts/train/ResNeXt152_vd_64x4d.sh index b9663f27bc3418004dc3296883a3089df198e1ec..5ba4478ddc01394e5d4b1715aee0f5147b86a6d6 100644 --- a/PaddleCV/image_classification/scripts/train/ResNeXt152_vd_64x4d.sh +++ b/PaddleCV/image_classification/scripts/train/ResNeXt152_vd_64x4d.sh @@ -2,9 +2,6 @@ python train.py \ --model=ResNeXt152_vd_64x4d \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNeXt50_32x4d.sh b/PaddleCV/image_classification/scripts/train/ResNeXt50_32x4d.sh index 91e64708a1605db0a5c6d3f8053dcb6451e114fb..a4454eb07f4dd5ec29de3d7144898b5bc1cb46be 100644 --- a/PaddleCV/image_classification/scripts/train/ResNeXt50_32x4d.sh +++ b/PaddleCV/image_classification/scripts/train/ResNeXt50_32x4d.sh @@ -1,9 +1,6 @@ python train.py \ --model=ResNeXt50_32x4d \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=piecewise_decay \ --lr=0.1 \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNeXt50_vd_64x4d.sh b/PaddleCV/image_classification/scripts/train/ResNeXt50_vd_64x4d.sh index 9728535b14b528e2f6d5460f729d539c14726de2..a2d89f462e0479e854e0b3bb64825db0db371192 100644 --- a/PaddleCV/image_classification/scripts/train/ResNeXt50_vd_64x4d.sh +++ b/PaddleCV/image_classification/scripts/train/ResNeXt50_vd_64x4d.sh @@ -2,9 +2,6 @@ python train.py \ --model=ResNeXt50_vd_64x4d \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNet101.sh b/PaddleCV/image_classification/scripts/train/ResNet101.sh index a2af43854b4464da1b2c7077d89e8b155d207e9f..80e929d4a6fe098a77c4be4f8f9ab97283b0d1b4 100644 --- a/PaddleCV/image_classification/scripts/train/ResNet101.sh +++ b/PaddleCV/image_classification/scripts/train/ResNet101.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ResNet101 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=piecewise_decay \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNet101_vd.sh 
b/PaddleCV/image_classification/scripts/train/ResNet101_vd.sh index b9bdf778481b6fe1b6bace474bab31053f87c6c2..5df739442b2a949f963b3fded821a46adf1de47b 100644 --- a/PaddleCV/image_classification/scripts/train/ResNet101_vd.sh +++ b/PaddleCV/image_classification/scripts/train/ResNet101_vd.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ResNet101_vd \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNet152.sh b/PaddleCV/image_classification/scripts/train/ResNet152.sh index 44275753a3eab22f78bfc02305c98dff59d55508..e427bf3e9ab813ba9c09f8eb22794018dcf50f8b 100644 --- a/PaddleCV/image_classification/scripts/train/ResNet152.sh +++ b/PaddleCV/image_classification/scripts/train/ResNet152.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ResNet152 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=piecewise_decay \ --lr=0.1 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNet152_vd.sh b/PaddleCV/image_classification/scripts/train/ResNet152_vd.sh index b4cb84ad191ee57749fdfad08428fa5e328bbdf0..4e3641290bd5d490ab68c47878a42b1dd7a58d35 100644 --- a/PaddleCV/image_classification/scripts/train/ResNet152_vd.sh +++ b/PaddleCV/image_classification/scripts/train/ResNet152_vd.sh @@ -8,9 +8,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ResNet152_vd \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNet18.sh b/PaddleCV/image_classification/scripts/train/ResNet18.sh index b3d1018ce0daa0f4f3871e0109bdc293f5d6f81d..f27e94a923045f8385342412d4570f92c30468f0 100644 --- a/PaddleCV/image_classification/scripts/train/ResNet18.sh +++ b/PaddleCV/image_classification/scripts/train/ResNet18.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ResNet18 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=cosine_decay \ --lr=0.1 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNet18_vd.sh b/PaddleCV/image_classification/scripts/train/ResNet18_vd.sh index c95b9325ead9ff30b8f61d9f247f5abc56c6bd3e..627cc610e0118f377f22b3dd5490d6214cff53ba 100644 --- a/PaddleCV/image_classification/scripts/train/ResNet18_vd.sh +++ b/PaddleCV/image_classification/scripts/train/ResNet18_vd.sh @@ -6,9 +6,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ResNet18_vd \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNet200_vd.sh b/PaddleCV/image_classification/scripts/train/ResNet200_vd.sh index 464db8ac77cc80d6fd441d638d5ee74e2ed3cdf0..a3483d5ada71f4e30aee350cef7b41b0720ac6ce 100644 --- a/PaddleCV/image_classification/scripts/train/ResNet200_vd.sh +++ b/PaddleCV/image_classification/scripts/train/ResNet200_vd.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ResNet200_vd \ --batch_size=256 \ - --total_images=1281167 \ - 
--image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNet34.sh b/PaddleCV/image_classification/scripts/train/ResNet34.sh index 5ce4689be4c2c67a9d20c3187ba93064fb536b26..c472e9c2927d8f3ad4e919f2c250791a5569152d 100644 --- a/PaddleCV/image_classification/scripts/train/ResNet34.sh +++ b/PaddleCV/image_classification/scripts/train/ResNet34.sh @@ -8,9 +8,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ResNet34 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=cosine_decay \ --lr=0.1 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNet34_vd.sh b/PaddleCV/image_classification/scripts/train/ResNet34_vd.sh index 56a31b699d19f39987f6f4be79a5e0b8e9c4e07d..c78b6c1f5d298e131a9ecf7452da9d764a4d0230 100644 --- a/PaddleCV/image_classification/scripts/train/ResNet34_vd.sh +++ b/PaddleCV/image_classification/scripts/train/ResNet34_vd.sh @@ -6,9 +6,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ResNet34_vd \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNet50.sh b/PaddleCV/image_classification/scripts/train/ResNet50.sh index 470630757aa4446fecd42bd45e11a77d89c453fa..258fd89015b624a8a78a648214b1e3b8200d9bea 100644 --- a/PaddleCV/image_classification/scripts/train/ResNet50.sh +++ b/PaddleCV/image_classification/scripts/train/ResNet50.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ResNet50 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=piecewise_decay \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNet50_ACNet.sh b/PaddleCV/image_classification/scripts/train/ResNet50_ACNet.sh new file mode 100644 index 0000000000000000000000000000000000000000..4c83be5f773ea817d94339bcb04960458a94e727 --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/ResNet50_ACNet.sh @@ -0,0 +1,15 @@ +##Training details +export CUDA_VISIBLE_DEVICES=0,1,2,3 +export FLAGS_fast_eager_deletion_mode=1 +export FLAGS_eager_delete_tensor_gb=0.0 +export FLAGS_fraction_of_gpu_memory_to_use=0.98 + +#ResNet50: +python train.py \ + --model=ResNet50_ACNet \ + --batch_size=256 \ + --model_save_dir=output/ \ + --lr_strategy=piecewise_decay \ + --num_epochs=120 \ + --lr=0.1 \ + --l2_decay=1e-4 diff --git a/PaddleCV/image_classification/scripts/train/ResNet50_dist.sh b/PaddleCV/image_classification/scripts/train/ResNet50_dist.sh index fb74449d6ffd5ba9253d26f3ef4d262e74b56d85..0026dcb4b561415d4023a10a5cc2a29e97a5d334 100644 --- a/PaddleCV/image_classification/scripts/train/ResNet50_dist.sh +++ b/PaddleCV/image_classification/scripts/train/ResNet50_dist.sh @@ -8,9 +8,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python -m paddle.distributed.launch train.py \ --model=ResNet50 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=piecewise_decay \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNet50_fp16.sh b/PaddleCV/image_classification/scripts/train/ResNet50_fp16.sh new file mode 100755 index
0000000000000000000000000000000000000000..6ebd5c01ff3de389ec5cf9f8aaeb5d0f3f690715 --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/ResNet50_fp16.sh @@ -0,0 +1,40 @@ +#!/bin/bash -ex + +export FLAGS_conv_workspace_size_limit=4000 #MB +export FLAGS_cudnn_exhaustive_search=1 +export FLAGS_cudnn_batchnorm_spatial_persistent=1 + +DATA_DIR="Your image dataset path, e.g. /work/datasets/ILSVRC2012/" + +DATA_FORMAT="NHWC" +USE_FP16=true #whether to use float16 +USE_DALI=true + +if ${USE_DALI}; then + export FLAGS_fraction_of_gpu_memory_to_use=0.8 +fi + +python train.py \ + --model=ResNet50 \ + --data_dir=${DATA_DIR} \ + --batch_size=256 \ + --total_images=1281167 \ + --image_shape 3 224 224 \ + --class_dim=1000 \ + --print_step=10 \ + --model_save_dir=output/ \ + --lr_strategy=piecewise_decay \ + --use_fp16=${USE_FP16} \ + --scale_loss=128.0 \ + --use_dynamic_loss_scaling=true \ + --data_format=${DATA_FORMAT} \ + --fuse_elewise_add_act_ops=true \ + --fuse_bn_act_ops=true \ + --validate=true \ + --is_profiler=false \ + --profiler_path=profile/ \ + --reader_thread=10 \ + --reader_buf_size=4000 \ + --use_dali=${USE_DALI} \ + --lr=0.1 + diff --git a/PaddleCV/image_classification/scripts/train/ResNet50_vc.sh b/PaddleCV/image_classification/scripts/train/ResNet50_vc.sh index d5d0cc5e5df1e5eb130bc73d9358c343b140abed..7e7797276d5728c53856dbaf2972fb0cdde7b0c5 100644 --- a/PaddleCV/image_classification/scripts/train/ResNet50_vc.sh +++ b/PaddleCV/image_classification/scripts/train/ResNet50_vc.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ResNet50_vc \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/ResNet50_vd.sh b/PaddleCV/image_classification/scripts/train/ResNet50_vd.sh index 968e3dd02e96c5c5ade62c4df59cd0194787ea89..0841c3c278826a388f2e22b33181e08bb5e56a33 100644 --- a/PaddleCV/image_classification/scripts/train/ResNet50_vd.sh +++ b/PaddleCV/image_classification/scripts/train/ResNet50_vd.sh @@ -6,16 +6,17 @@ export FLAGS_eager_delete_tensor_gb=0.0 export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ - --model=ResNet50_vd \ - --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ - --lr_strategy=cosine_decay \ - --lr=0.1 \ - --num_epochs=200 \ - --model_save_dir=output/ \ - --l2_decay=7e-5 \ - --use_mixup=True \ - --use_label_smoothing=True \ - --label_smoothing_epsilon=0.1 + --data_dir=./data/ILSVRC2012/ \ + --total_images=1281167 \ + --class_dim=1000 \ + --validate=1 \ + --model=ResNet50_vd \ + --batch_size=256 \ + --lr_strategy=cosine_decay \ + --lr=0.1 \ + --num_epochs=200 \ + --model_save_dir=output/ \ + --l2_decay=7e-5 \ + --use_mixup=True \ + --use_label_smoothing=True \ + --label_smoothing_epsilon=0.1 diff --git a/PaddleCV/image_classification/scripts/train/SENet154_vd.sh b/PaddleCV/image_classification/scripts/train/SENet154_vd.sh index a363a108e57b940f0c1dea2ec0a69fabf608110a..54587ff7382ab6c32aa11d8b6c19481cc76da46c 100644 --- a/PaddleCV/image_classification/scripts/train/SENet154_vd.sh +++ b/PaddleCV/image_classification/scripts/train/SENet154_vd.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=SENet154_vd \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git 
a/PaddleCV/image_classification/scripts/train/SE_ResNeXt101_32x4d.sh b/PaddleCV/image_classification/scripts/train/SE_ResNeXt101_32x4d.sh index a385814a55a527d4e1f268a2b974df9fa8cc8f2b..81978aec82dd47e4d8c629f23c4417828ba723a5 100644 --- a/PaddleCV/image_classification/scripts/train/SE_ResNeXt101_32x4d.sh +++ b/PaddleCV/image_classification/scripts/train/SE_ResNeXt101_32x4d.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=SE_ResNeXt101_32x4d \ --batch_size=400 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --lr_strategy=cosine_decay \ --model_save_dir=output/ \ --lr=0.1 \ diff --git a/PaddleCV/image_classification/scripts/train/SE_ResNeXt50_32x4d.sh b/PaddleCV/image_classification/scripts/train/SE_ResNeXt50_32x4d.sh index acfadb80d983f9571cf93e825958cbbd7864b711..1adfb70d8395c2b8cc91b1aef5af6a774dfa4825 100644 --- a/PaddleCV/image_classification/scripts/train/SE_ResNeXt50_32x4d.sh +++ b/PaddleCV/image_classification/scripts/train/SE_ResNeXt50_32x4d.sh @@ -10,9 +10,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=SE_ResNeXt50_32x4d \ --batch_size=400 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --lr_strategy=cosine_decay \ --model_save_dir=output/ \ --lr=0.1 \ diff --git a/PaddleCV/image_classification/scripts/train/SE_ResNeXt50_vd_32x4d.sh b/PaddleCV/image_classification/scripts/train/SE_ResNeXt50_vd_32x4d.sh new file mode 100644 index 0000000000000000000000000000000000000000..0e288b1411f080830408b581187ac42f347af5a9 --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/SE_ResNeXt50_vd_32x4d.sh @@ -0,0 +1,14 @@ +#SE_ResNeXt50_vd_32x4d +python train.py \ + --model=SE_ResNeXt50_vd_32x4d \ + --batch_size=256 \ + --total_images=1281167 \ + --class_dim=1000 \ + --lr_strategy=cosine_decay \ + --lr=0.1 \ + --num_epochs=200 \ + --model_save_dir=output/ \ + --l2_decay=1e-4 \ + --use_mixup=True \ + --use_label_smoothing=True \ + --label_smoothing_epsilon=0.1 \ diff --git a/PaddleCV/image_classification/scripts/train/SE_ResNet18_vd.sh b/PaddleCV/image_classification/scripts/train/SE_ResNet18_vd.sh new file mode 100644 index 0000000000000000000000000000000000000000..a5a2efc96722de700e0e568d90a9896f7bef0e85 --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/SE_ResNet18_vd.sh @@ -0,0 +1,14 @@ +#SE_ResNet18_vd +python train.py \ + --model=SE_ResNet18_vd \ + --batch_size=256 \ + --total_images=1281167 \ + --class_dim=1000 \ + --lr_strategy=cosine_decay \ + --lr=0.1 \ + --num_epochs=200 \ + --model_save_dir=output/ \ + --l2_decay=7e-5 \ + --use_mixup=True \ + --use_label_smoothing=True \ + --label_smoothing_epsilon=0.1 \ diff --git a/PaddleCV/image_classification/scripts/train/SE_ResNet34_vd.sh b/PaddleCV/image_classification/scripts/train/SE_ResNet34_vd.sh new file mode 100644 index 0000000000000000000000000000000000000000..656d0112df8ac3737507e35ca05541dee2ef3897 --- /dev/null +++ b/PaddleCV/image_classification/scripts/train/SE_ResNet34_vd.sh @@ -0,0 +1,14 @@ +#SE_ResNet34_vd +python train.py \ + --model=SE_ResNet34_vd \ + --batch_size=256 \ + --total_images=1281167 \ + --class_dim=1000 \ + --lr_strategy=cosine_decay \ + --lr=0.1 \ + --num_epochs=200 \ + --model_save_dir=output/ \ + --l2_decay=7e-5 \ + --use_mixup=True \ + --use_label_smoothing=True \ + --label_smoothing_epsilon=0.1 \ diff --git a/PaddleCV/image_classification/scripts/train/SE_ResNet50_vd.sh b/PaddleCV/image_classification/scripts/train/SE_ResNet50_vd.sh index 
9bddaf7567359e81c2edc5574fca48a8c51b351e..ba4f50f5c074d7f65e920dd8c9c2b3a06295341e 100644 --- a/PaddleCV/image_classification/scripts/train/SE_ResNet50_vd.sh +++ b/PaddleCV/image_classification/scripts/train/SE_ResNet50_vd.sh @@ -2,9 +2,6 @@ python train.py \ --model=SE_ResNet50_vd \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/ShuffleNetV2.sh b/PaddleCV/image_classification/scripts/train/ShuffleNetV2.sh index 369e58791e410a82431dfb2413a8b911f7e1e48b..0fedfcc7cf679ef635585ceabb887a5bea2a7c04 100644 --- a/PaddleCV/image_classification/scripts/train/ShuffleNetV2.sh +++ b/PaddleCV/image_classification/scripts/train/ShuffleNetV2.sh @@ -8,9 +8,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ShuffleNetV2 \ --batch_size=1024 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=cosine_decay_warmup \ --lr=0.5 \ diff --git a/PaddleCV/image_classification/scripts/train/ShuffleNetV2_swish.sh b/PaddleCV/image_classification/scripts/train/ShuffleNetV2_swish.sh index b3e29dd31bcdabda41ec8a18e8f6316813d37c1b..3b0b71c3106ca02e1372e3adf555c647c4261fe4 100644 --- a/PaddleCV/image_classification/scripts/train/ShuffleNetV2_swish.sh +++ b/PaddleCV/image_classification/scripts/train/ShuffleNetV2_swish.sh @@ -8,9 +8,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=ShuffleNetV2_swish \ --batch_size=1024 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=cosine_decay_warmup \ --lr=0.5 \ diff --git a/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x0_25.sh b/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x0_25.sh index 449119d642d1f763bb77e564313e1fee01adc8ad..29f366ca273cbffb6c243ddce6c6214bf4691563 100644 --- a/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x0_25.sh +++ b/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x0_25.sh @@ -2,9 +2,6 @@ python train.py \ --model=ShuffleNetV2_x0_25 \ --batch_size=1024 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=cosine_decay_warmup \ --num_epochs=240 \ diff --git a/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x0_33.sh b/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x0_33.sh index f38655b8bc4dcbcf3f4ddea104432e4aaa23b9b6..42205a535c38e168f56e6cf0e33d3564e42f0b83 100644 --- a/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x0_33.sh +++ b/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x0_33.sh @@ -2,9 +2,6 @@ python train.py \ --model=ShuffleNetV2_x0_33 \ --batch_size=1024 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=cosine_decay_warmup \ --num_epochs=240 \ diff --git a/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x0_5.sh b/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x0_5.sh index 3cb89a4beba7afef2b54b7ea9ecbeea02e5e050d..cbf63d050b23ba5b02226d977aba94aabbada952 100644 --- a/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x0_5.sh +++ b/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x0_5.sh @@ -2,9 +2,6 @@ python train.py \ --model=ShuffleNetV2_x0_5 \ --batch_size=1024 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ 
--model_save_dir=output/ \ --lr_strategy=cosine_decay_warmup \ --num_epochs=240 \ diff --git a/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x1_5.sh b/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x1_5.sh index 459bcbc4a256b285b8059d413e4f12f57200ee11..5081ae235ff73e0d3d490ba079024647751d73e3 100644 --- a/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x1_5.sh +++ b/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x1_5.sh @@ -1,9 +1,6 @@ python train.py \ --model=ShuffleNetV2_x1_5 \ --batch_size=512 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=cosine_decay_warmup \ --num_epochs=240 \ diff --git a/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x2_0.sh b/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x2_0.sh index 087e02542665d1060063ed3a0f2785981b1250a4..bf571823b8fd9d54e7f9a9174c6a751bdd261d56 100644 --- a/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x2_0.sh +++ b/PaddleCV/image_classification/scripts/train/ShuffleNetV2_x2_0.sh @@ -2,9 +2,6 @@ python train.py \ --model=ShuffleNetV2_x2_0 \ --batch_size=512 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr_strategy=cosine_decay_warmup \ --num_epochs=240 \ diff --git a/PaddleCV/image_classification/scripts/train/SqueezeNet1_0.sh b/PaddleCV/image_classification/scripts/train/SqueezeNet1_0.sh index ee722bfd1de730b05d4699e38fcff307db7ee55a..05e2c4dd905b7742f7ca86015773c2c979ca806c 100644 --- a/PaddleCV/image_classification/scripts/train/SqueezeNet1_0.sh +++ b/PaddleCV/image_classification/scripts/train/SqueezeNet1_0.sh @@ -2,10 +2,7 @@ python train.py \ --model=SqueezeNet1_0 \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ --lr_strategy=cosine_decay \ - --class_dim=1000 \ --model_save_dir=output/ \ --lr=0.02 \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/SqueezeNet1_1.sh b/PaddleCV/image_classification/scripts/train/SqueezeNet1_1.sh index 70bd773d8483ab86976542f14ef00fbf80cbdba6..993291d5dd43a8ef85a1d2540945d4b36c86f5e0 100644 --- a/PaddleCV/image_classification/scripts/train/SqueezeNet1_1.sh +++ b/PaddleCV/image_classification/scripts/train/SqueezeNet1_1.sh @@ -2,10 +2,7 @@ python train.py \ --model=SqueezeNet1_1 \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ --lr_strategy=cosine_decay \ - --class_dim=1000 \ --model_save_dir=output/ \ --lr=0.02 \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/VGG11.sh b/PaddleCV/image_classification/scripts/train/VGG11.sh index ad8934e4b0b6687f0839bb325537ad815dc263db..d1a4ef3e6e62e483fa0d4fd6e486a77efc681b94 100644 --- a/PaddleCV/image_classification/scripts/train/VGG11.sh +++ b/PaddleCV/image_classification/scripts/train/VGG11.sh @@ -9,10 +9,7 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=VGG11 \ --batch_size=512 \ - --total_images=1281167 \ - --image_shape=3,224,224 \ --lr_strategy=cosine_decay \ - --class_dim=1000 \ --model_save_dir=output/ \ --lr=0.1 \ --num_epochs=90 \ diff --git a/PaddleCV/image_classification/scripts/train/VGG13.sh b/PaddleCV/image_classification/scripts/train/VGG13.sh index 24960f888d46d3761dbb2712a740f1f3be71581c..16752961eb7fe787f5f239f3f836f343db5afc57 100644 --- a/PaddleCV/image_classification/scripts/train/VGG13.sh +++ b/PaddleCV/image_classification/scripts/train/VGG13.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 
python train.py \ --model=VGG13 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --lr_strategy=cosine_decay \ --lr=0.01 \ --num_epochs=90 \ diff --git a/PaddleCV/image_classification/scripts/train/VGG16.sh b/PaddleCV/image_classification/scripts/train/VGG16.sh index ebf5a35627049f10c4afc1a24d7e9f2a9f6f425a..9be1d81f8019601e2e2cf0b3f694943e69bf1bda 100644 --- a/PaddleCV/image_classification/scripts/train/VGG16.sh +++ b/PaddleCV/image_classification/scripts/train/VGG16.sh @@ -9,10 +9,7 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=VGG16 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ --lr_strategy=cosine_decay \ - --image_shape=3,224,224 \ --model_save_dir=output/ \ --lr=0.01 \ --num_epochs=90 \ diff --git a/PaddleCV/image_classification/scripts/train/VGG19.sh b/PaddleCV/image_classification/scripts/train/VGG19.sh index bca6a002f1eb3acc196f024834117155deeb6191..2d05bc1027a39e61d743ca4884a8634bc5362a82 100644 --- a/PaddleCV/image_classification/scripts/train/VGG19.sh +++ b/PaddleCV/image_classification/scripts/train/VGG19.sh @@ -9,9 +9,6 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.98 python train.py \ --model=VGG19 \ --batch_size=256 \ - --total_images=1281167 \ - --class_dim=1000 \ - --image_shape=3,224,224 \ --lr_strategy=cosine_decay \ --lr=0.01 \ --num_epochs=150 \ diff --git a/PaddleCV/image_classification/scripts/train/Xception41.sh b/PaddleCV/image_classification/scripts/train/Xception41.sh index 1be8e5bb7f2e04a4cb775cfb8e3ed2b916e730d2..e9f19efc891f90276abceb6c04466985529939f2 100644 --- a/PaddleCV/image_classification/scripts/train/Xception41.sh +++ b/PaddleCV/image_classification/scripts/train/Xception41.sh @@ -1,9 +1,7 @@ python train.py \ --model=Xception41 \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,299,299 \ - --class_dim=1000 \ + --image_shape 3 299 299 \ --lr_strategy=cosine_decay \ --lr=0.045 \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/Xception41_deeplab.sh b/PaddleCV/image_classification/scripts/train/Xception41_deeplab.sh index 0ba5fcc20a1fbb7edbd7e1386187c103a4e3b2aa..8c529184d6b618a71fcb554158b694a5359f388e 100644 --- a/PaddleCV/image_classification/scripts/train/Xception41_deeplab.sh +++ b/PaddleCV/image_classification/scripts/train/Xception41_deeplab.sh @@ -2,9 +2,7 @@ python train.py \ --model=Xception41_deeplab \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,299,299 \ - --class_dim=1000 \ + --image_shape 3 299 299 \ --lr_strategy=cosine_decay \ --lr=0.045 \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/Xception65.sh b/PaddleCV/image_classification/scripts/train/Xception65.sh index a465194f4dea5f9e6397c70d0027cc9b82946869..d0c4802b1ea554bd2758274b82860a92de96cd69 100644 --- a/PaddleCV/image_classification/scripts/train/Xception65.sh +++ b/PaddleCV/image_classification/scripts/train/Xception65.sh @@ -2,9 +2,7 @@ python train.py \ --model=Xception65 \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,299,299 \ - --class_dim=1000 \ + --image_shape 3 299 299 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/scripts/train/Xception65_deeplab.sh b/PaddleCV/image_classification/scripts/train/Xception65_deeplab.sh index 6cc49d852d7dd7108ffe97e705d17933313a755b..edd86cf895092fc08ae6e31609440610704a97d5 100644 --- a/PaddleCV/image_classification/scripts/train/Xception65_deeplab.sh +++ 
b/PaddleCV/image_classification/scripts/train/Xception65_deeplab.sh @@ -2,9 +2,7 @@ python train.py \ --model=Xception65_deeplab \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,299,299 \ - --class_dim=1000 \ + --image_shape 3 299 299 \ --lr_strategy=cosine_decay \ --lr=0.045 \ --num_epochs=120 \ diff --git a/PaddleCV/image_classification/scripts/train/Xception71.sh b/PaddleCV/image_classification/scripts/train/Xception71.sh index 8e40eebca9aed081a68117e3d1eba21acc2ca955..4d61f745fbf48c4df399b3bd4882c51d1a8d838d 100644 --- a/PaddleCV/image_classification/scripts/train/Xception71.sh +++ b/PaddleCV/image_classification/scripts/train/Xception71.sh @@ -2,9 +2,7 @@ python train.py \ --model=Xception71 \ --batch_size=256 \ - --total_images=1281167 \ - --image_shape=3,299,299 \ - --class_dim=1000 \ + --image_shape 3 299 299 \ --lr_strategy=cosine_decay \ --lr=0.1 \ --num_epochs=200 \ diff --git a/PaddleCV/image_classification/train.py b/PaddleCV/image_classification/train.py index 4a4c4899a703710804d11d6c45ed6636487b22f0..c4d5f26aa673ec656f98951b65e3871aa3e1af9e 100755 --- a/PaddleCV/image_classification/train.py +++ b/PaddleCV/image_classification/train.py @@ -17,38 +17,28 @@ from __future__ import division from __future__ import print_function import os -import numpy as np import time import sys +import logging - -def set_paddle_flags(flags): - for key, value in flags.items(): - if os.environ.get(key, None) is None: - os.environ[key] = str(value) - - -# NOTE(paddle-dev): All of these flags should be -# set before `import paddle`. Otherwise, it would -# not take any effect. -set_paddle_flags({ - 'FLAGS_eager_delete_tensor_gb': 0, # enable gc - 'FLAGS_fraction_of_gpu_memory_to_use': 0.98 -}) - +import numpy as np import paddle import paddle.fluid as fluid +from paddle.fluid import profiler import reader from utils import * import models from build_model import create_model +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + def build_program(is_train, main_prog, startup_prog, args): - """build program, and add grad op in program accroding to different mode + """build program, and add backward op in program accroding to different mode - Args: - is_train: mode: train or test + Parameters: + is_train: indicate train mode or test mode main_prog: main program startup_prog: strartup program args: arguments @@ -58,18 +48,17 @@ def build_program(is_train, main_prog, startup_prog, args): test mode: [Loss, data_loader] """ if args.model.startswith('EfficientNet'): - is_test = False if is_train else True override_params = {"drop_connect_rate": args.drop_connect_rate} padding_type = args.padding_type use_se = args.use_se - model = models.__dict__[args.model](is_test=is_test, + model = models.__dict__[args.model](is_test=not is_train, override_params=override_params, padding_type=padding_type, use_se=use_se) else: model = models.__dict__[args.model]() with fluid.program_guard(main_prog, startup_prog): - if args.random_seed: + if args.random_seed or args.enable_ce: main_prog.random_seed = args.random_seed startup_prog.random_seed = args.random_seed with fluid.unique_name.guard(): @@ -78,11 +67,18 @@ def build_program(is_train, main_prog, startup_prog, args): if is_train: optimizer = create_optimizer(args) avg_cost = loss_out[0] - optimizer.minimize(avg_cost) #XXX: fetch learning rate now, better implement is required here. 
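The train.py hunk that continues below moves `optimizer.minimize(avg_cost)` after an optional `fluid.contrib.mixed_precision.decorate(...)` call, so loss scaling is woven into the backward pass whenever `--use_fp16` is set. A minimal sketch of that ordering; the `Momentum` optimizer here is an illustrative stand-in for the repo's `create_optimizer(args)` and is an assumption, not part of the diff:

```
# Sketch only: decorate the optimizer for mixed precision *before* minimize(),
# mirroring the ordering introduced in train.py. Flag names follow the diff;
# Momentum stands in for create_optimizer(args).
import paddle.fluid as fluid

def build_optimizer(avg_cost, args):
    optimizer = fluid.optimizer.Momentum(learning_rate=args.lr, momentum=0.9)
    if args.use_fp16:
        optimizer = fluid.contrib.mixed_precision.decorate(
            optimizer,
            init_loss_scaling=args.scale_loss,
            use_dynamic_loss_scaling=args.use_dynamic_loss_scaling)
    optimizer.minimize(avg_cost)  # backward + update, with loss scaling if decorated
    return optimizer
```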
global_lr = optimizer._global_learning_rate() global_lr.persistable = True loss_out.append(global_lr) + + if args.use_fp16: + optimizer = fluid.contrib.mixed_precision.decorate( + optimizer, + init_loss_scaling=args.scale_loss, + use_dynamic_loss_scaling=args.use_dynamic_loss_scaling) + + optimizer.minimize(avg_cost) if args.use_ema: global_steps = fluid.layers.learning_rate_scheduler._decay_step_counter( ) @@ -94,33 +90,45 @@ def build_program(is_train, main_prog, startup_prog, args): return loss_out -def validate(args, test_data_loader, exe, test_prog, test_fetch_list, pass_id, - train_batch_metrics_record): +def validate(args, + test_iter, + exe, + test_prog, + test_fetch_list, + pass_id, + train_batch_metrics_record, + train_batch_time_record=None, + train_prog=None): test_batch_time_record = [] test_batch_metrics_record = [] test_batch_id = 0 - test_data_loader.start() - try: - while True: - t1 = time.time() - test_batch_metrics = exe.run(program=test_prog, - fetch_list=test_fetch_list) - t2 = time.time() - test_batch_elapse = t2 - t1 - test_batch_time_record.append(test_batch_elapse) - - test_batch_metrics_avg = np.mean( - np.array(test_batch_metrics), axis=1) - test_batch_metrics_record.append(test_batch_metrics_avg) - - print_info(pass_id, test_batch_id, args.print_step, - test_batch_metrics_avg, test_batch_elapse, "batch") - sys.stdout.flush() - test_batch_id += 1 + if int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) > 1: + compiled_program = test_prog + else: + compiled_program = best_strategy_compiled( + args, + test_prog, + test_fetch_list[0], + exe, + mode="val", + share_prog=train_prog) + for batch in test_iter: + t1 = time.time() + test_batch_metrics = exe.run(program=compiled_program, + feed=batch, + fetch_list=test_fetch_list) + t2 = time.time() + test_batch_elapse = t2 - t1 + test_batch_time_record.append(test_batch_elapse) + + test_batch_metrics_avg = np.mean(np.array(test_batch_metrics), axis=1) + test_batch_metrics_record.append(test_batch_metrics_avg) + + print_info("batch", test_batch_metrics_avg, test_batch_elapse, pass_id, + test_batch_id, args.print_step, args.class_dim) + sys.stdout.flush() + test_batch_id += 1 - except fluid.core.EOFException: - test_data_loader.reset() - #train_epoch_time_avg = np.mean(np.array(train_batch_time_record)) train_epoch_metrics_avg = np.mean( np.array(train_batch_metrics_record), axis=0) @@ -128,9 +136,19 @@ def validate(args, test_data_loader, exe, test_prog, test_fetch_list, pass_id, test_epoch_metrics_avg = np.mean( np.array(test_batch_metrics_record), axis=0) - print_info(pass_id, 0, 0, - list(train_epoch_metrics_avg) + list(test_epoch_metrics_avg), - test_epoch_time_avg, "epoch") + print_info( + "epoch", + list(train_epoch_metrics_avg) + list(test_epoch_metrics_avg), + test_epoch_time_avg, + pass_id=pass_id, + class_dim=args.class_dim) + if args.enable_ce: + device_num = fluid.core.get_cuda_device_count() if args.use_gpu else 1 + print_info( + "ce", + list(train_epoch_metrics_avg) + list(test_epoch_metrics_avg), + train_batch_time_record, + device_num=device_num) def train(args): @@ -141,8 +159,6 @@ def train(args): """ startup_prog = fluid.Program() train_prog = fluid.Program() - test_prog = fluid.Program() - train_out = build_program( is_train=True, main_prog=train_prog, @@ -157,82 +173,123 @@ def train(args): train_fetch_list = [var.name for var in train_fetch_vars] - test_out = build_program( - is_train=False, - main_prog=test_prog, - startup_prog=startup_prog, - args=args) - test_data_loader = test_out[-1] - test_fetch_vars = 
test_out[:-1] + if args.validate: + test_prog = fluid.Program() + test_out = build_program( + is_train=False, + main_prog=test_prog, + startup_prog=startup_prog, + args=args) + test_data_loader = test_out[-1] + test_fetch_vars = test_out[:-1] - test_fetch_list = [var.name for var in test_fetch_vars] + test_fetch_list = [var.name for var in test_fetch_vars] - #Create test_prog and set layers' is_test params to True - test_prog = test_prog.clone(for_test=True) + #Create test_prog and set layers' is_test params to True + test_prog = test_prog.clone(for_test=True) gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0)) place = fluid.CUDAPlace(gpu_id) if args.use_gpu else fluid.CPUPlace() exe = fluid.Executor(place) exe.run(startup_prog) + trainer_id = int(os.getenv("PADDLE_TRAINER_ID", 0)) + #init model by checkpoint or pretrianed model. init_model(exe, args, train_prog) num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) - imagenet_reader = reader.ImageNetReader(0 if num_trainers > 1 else None) - train_reader = imagenet_reader.train(settings=args) - test_reader = imagenet_reader.val(settings=args) - - train_data_loader.set_sample_list_generator(train_reader, place) - test_data_loader.set_sample_list_generator(test_reader, place) + if args.use_dali: + import dali + train_iter = dali.train(settings=args) + if trainer_id == 0: + test_iter = dali.val(settings=args) + else: + imagenet_reader = reader.ImageNetReader(0 if num_trainers > 1 else None) + train_reader = imagenet_reader.train(settings=args) + if args.use_gpu: + if num_trainers <= 1: + places = fluid.framework.cuda_places() + else: + places = place + else: + if num_trainers <= 1: + places = fluid.framework.cpu_places() + else: + places = place + + train_data_loader.set_sample_list_generator(train_reader, places) + + if args.validate: + test_reader = imagenet_reader.val(settings=args) + test_data_loader.set_sample_list_generator(test_reader, places) compiled_train_prog = best_strategy_compiled(args, train_prog, train_fetch_vars[0], exe) - trainer_id = int(os.getenv("PADDLE_TRAINER_ID", 0)) + #NOTE: this for benchmark + total_batch_num = 0 for pass_id in range(args.num_epochs): - if num_trainers > 1: + if num_trainers > 1 and not args.use_dali: imagenet_reader.set_shuffle_seed(pass_id + ( args.random_seed if args.random_seed else 0)) train_batch_id = 0 train_batch_time_record = [] train_batch_metrics_record = [] - train_data_loader.start() - - try: - while True: - t1 = time.time() - train_batch_metrics = exe.run(compiled_train_prog, - fetch_list=train_fetch_list) - t2 = time.time() - train_batch_elapse = t2 - t1 - train_batch_time_record.append(train_batch_elapse) - train_batch_metrics_avg = np.mean( - np.array(train_batch_metrics), axis=1) - train_batch_metrics_record.append(train_batch_metrics_avg) - if trainer_id == 0: - print_info(pass_id, train_batch_id, args.print_step, - train_batch_metrics_avg, train_batch_elapse, - "batch") - sys.stdout.flush() - train_batch_id += 1 - - except fluid.core.EOFException: - train_data_loader.reset() - - if trainer_id == 0: + if not args.use_dali: + train_iter = train_data_loader() + if args.validate: + test_iter = test_data_loader() + + t1 = time.time() + for batch in train_iter: + #NOTE: this is for benchmark + if args.max_iter and total_batch_num == args.max_iter: + return + train_batch_metrics = exe.run(compiled_train_prog, + feed=batch, + fetch_list=train_fetch_list) + t2 = time.time() + train_batch_elapse = t2 - t1 + train_batch_time_record.append(train_batch_elapse) + + 
train_batch_metrics_avg = np.mean( + np.array(train_batch_metrics), axis=1) + train_batch_metrics_record.append(train_batch_metrics_avg) + if trainer_id == 0: + print_info("batch", train_batch_metrics_avg, train_batch_elapse, + pass_id, train_batch_id, args.print_step) + sys.stdout.flush() + train_batch_id += 1 + t1 = time.time() + #NOTE: this for benchmark profiler + total_batch_num = total_batch_num + 1 + if args.is_profiler and pass_id == 0 and train_batch_id == args.print_step: + profiler.start_profiler("All") + elif args.is_profiler and pass_id == 0 and train_batch_id == args.print_step + 5: + profiler.stop_profiler("total", args.profiler_path) + return + + if args.use_dali: + train_iter.reset() + + if trainer_id == 0 and args.validate: if args.use_ema: - print('ExponentialMovingAverage validate start...') + logger.info('ExponentialMovingAverage validate start...') with ema.apply(exe): - validate(args, test_data_loader, exe, test_prog, - test_fetch_list, pass_id, - train_batch_metrics_record) - print('ExponentialMovingAverage validate over!') - - validate(args, test_data_loader, exe, test_prog, test_fetch_list, - pass_id, train_batch_metrics_record) - #For now, save model per epoch. - if pass_id % args.save_step == 0: - save_model(args, exe, train_prog, pass_id) + validate(args, test_iter, exe, test_prog, test_fetch_list, + pass_id, train_batch_metrics_record, + compiled_train_prog) + logger.info('ExponentialMovingAverage validate over!') + + validate(args, test_iter, exe, test_prog, test_fetch_list, pass_id, + train_batch_metrics_record, train_batch_time_record, + compiled_train_prog) + + if args.use_dali: + test_iter.reset() + + if pass_id % args.save_step == 0: + save_model(args, exe, train_prog, pass_id) def main(): diff --git a/PaddleCV/image_classification/utils/__init__.py b/PaddleCV/image_classification/utils/__init__.py index 4677e4535712c9f261bf18ba08ba6446d2db76d8..0cba6dd0453164a15d7e1ddf941d7e24cadf8408 100644 --- a/PaddleCV/image_classification/utils/__init__.py +++ b/PaddleCV/image_classification/utils/__init__.py @@ -12,4 +12,4 @@ #See the License for the specific language governing permissions and #limitations under the License. 
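The rewritten loop above consumes iterable DataLoaders, `for batch in loader():` with the batch passed through `feed=`, instead of the old `start()` / `EOFException` handling. A self-contained toy sketch of that pattern, using random data and assumed shapes rather than the real ImageNet reader:

```
# Toy sketch (random data, assumed shapes) of the iterable DataLoader pattern
# the new loop relies on: iterable=True, a generator bound to a place, and each
# yielded batch passed to exe.run.
import numpy as np
import paddle.fluid as fluid

image = fluid.data(name='feed_image', shape=[None, 3, 224, 224], dtype='float32')
label = fluid.data(name='feed_label', shape=[None, 1], dtype='int64')
loss = fluid.layers.mean(fluid.layers.fc(input=image, size=10))

loader = fluid.io.DataLoader.from_generator(
    feed_list=[image, label], capacity=64, use_double_buffer=True, iterable=True)

def random_batches():
    for _ in range(4):  # four single-sample batches
        yield [(np.random.rand(3, 224, 224).astype('float32'),
                np.array([0], dtype='int64'))]

place = fluid.CPUPlace()
loader.set_sample_list_generator(random_batches, place)

exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
for batch in loader():  # no start()/reset() and no EOFException handling
    # with a single place the loader yields a one-element list of feed dicts
    loss_val, = exe.run(fluid.default_main_program(), feed=batch[0], fetch_list=[loss])
```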
from .optimizer import cosine_decay, lr_warmup, cosine_decay_with_warmup, exponential_decay_with_warmup, Optimizer, create_optimizer -from .utility import add_arguments, print_arguments, parse_args, check_gpu, check_args, check_version, init_model, save_model, create_data_loader, print_info, best_strategy_compiled, init_model, save_model, ExponentialMovingAverage +from .utility import add_arguments, print_arguments, parse_args, check_gpu, check_args, check_version, init_model, save_model, create_data_loader, print_info, best_strategy_compiled, init_model, save_model, ExponentialMovingAverage, save_json diff --git a/PaddleCV/image_classification/utils/acnet/convert_model.sh b/PaddleCV/image_classification/utils/acnet/convert_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..20f43b16d5a8a0d3bd16b07c6ba26499f275ca30 --- /dev/null +++ b/PaddleCV/image_classification/utils/acnet/convert_model.sh @@ -0,0 +1,5 @@ +python utils/acnet/weights_aggregator.py \ + ResNet50ACNet \ + ./ResNet50ACNet_pretrained \ + ./ResNet50ACNet_pretrained_after_fuse \ + 1000 diff --git a/PaddleCV/image_classification/utils/acnet/weights_aggregator.py b/PaddleCV/image_classification/utils/acnet/weights_aggregator.py new file mode 100644 index 0000000000000000000000000000000000000000..cb2db23f3337b745ddd4d0e18987782a2294ed0c --- /dev/null +++ b/PaddleCV/image_classification/utils/acnet/weights_aggregator.py @@ -0,0 +1,198 @@ +#copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
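The ACNet aggregator added next (see `kernel_fusion` below) folds each branch's BatchNorm into its convolution kernel and then adds the vertical and horizontal branch kernels into the centre column and row of the square kernel. A numpy-only sanity check of the per-branch conv+BN folding rule, with illustrative shapes that are not taken from the diff:

```
# Numpy-only check (illustrative shapes) of the folding rule used by
# kernel_fusion: scale = gamma / sqrt(var + eps) multiplies the kernel per
# output channel, and the fused bias is beta - mean * scale.
import numpy as np

def fold_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    scale = gamma / np.sqrt(var + eps)
    return scale.reshape(-1, 1, 1, 1) * kernel, beta - mean * scale

out_c, in_c = 4, 3
kernel = np.random.rand(out_c, in_c, 3, 3).astype('float32')
gamma, beta, mean, var = (np.random.rand(out_c).astype('float32') for _ in range(4))

fused_k, fused_b = fold_bn(kernel, gamma, beta, mean, var)

patch = np.random.rand(in_c, 3, 3).astype('float32')        # one 3x3 input patch
conv = (kernel * patch).sum(axis=(1, 2, 3))                  # conv output per channel
bn = gamma / np.sqrt(var + 1e-5) * (conv - mean) + beta      # BN applied after conv
fused = (fused_k * patch).sum(axis=(1, 2, 3)) + fused_b      # folded conv alone
assert np.allclose(bn, fused, atol=1e-4)
```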
+import sys +import os +import shutil +import logging + +import numpy as np +import paddle +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr + +import models + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + + +def get_ac_tensor(name): + gamma = fluid.global_scope().find_var(name + '_scale').get_tensor() + beta = fluid.global_scope().find_var(name + '_offset').get_tensor() + mean = fluid.global_scope().find_var(name + '_mean').get_tensor() + var = fluid.global_scope().find_var(name + '_variance').get_tensor() + return gamma, beta, mean, var + + +def get_kernel_bn_tensors(name): + if "conv1" in name: + bn_name = "bn_" + name + else: + bn_name = "bn" + name[3:] + + ac_square = fluid.global_scope().find_var(name + + "_acsquare_weights").get_tensor() + ac_ver = fluid.global_scope().find_var(name + "_acver_weights").get_tensor() + ac_hor = fluid.global_scope().find_var(name + "_achor_weights").get_tensor() + + ac_square_bn_gamma, ac_square_bn_beta, ac_square_bn_mean, ac_square_bn_var = \ + get_ac_tensor(bn_name + '_acsquare') + ac_ver_bn_gamma, ac_ver_bn_beta, ac_ver_bn_mean, ac_ver_bn_var = \ + get_ac_tensor(bn_name + '_acver') + ac_hor_bn_gamma, ac_hor_bn_beta, ac_hor_bn_mean, ac_hor_bn_var = \ + get_ac_tensor(bn_name + '_achor') + + kernels = [np.array(ac_square), np.array(ac_ver), np.array(ac_hor)] + gammas = [ + np.array(ac_square_bn_gamma), np.array(ac_ver_bn_gamma), + np.array(ac_hor_bn_gamma) + ] + betas = [ + np.array(ac_square_bn_beta), np.array(ac_ver_bn_beta), + np.array(ac_hor_bn_beta) + ] + means = [ + np.array(ac_square_bn_mean), np.array(ac_ver_bn_mean), + np.array(ac_hor_bn_mean) + ] + var = [ + np.array(ac_square_bn_var), np.array(ac_ver_bn_var), + np.array(ac_hor_bn_var) + ] + + return {"kernels": kernels, "bn": (gammas, betas, means, var)} + + +def kernel_fusion(kernels, gammas, betas, means, var): + """fuse conv + BN""" + kernel_size_h, kernel_size_w = kernels[0].shape[2:] + + square = (gammas[0] / (var[0] + 1e-5) + **0.5).reshape(-1, 1, 1, 1) * kernels[0] + ver = (gammas[1] / (var[1] + 1e-5)**0.5).reshape(-1, 1, 1, 1) * kernels[1] + hor = (gammas[2] / (var[2] + 1e-5)**0.5).reshape(-1, 1, 1, 1) * kernels[2] + + b = 0 + for i in range(3): + b += -((means[i] * gammas[i]) / (var[i] + 1e-5)**0.5) + betas[i] # eq.7 + + square[:, :, :, kernel_size_w // 2:kernel_size_w // 2 + 1] += ver + square[:, :, kernel_size_h // 2:kernel_size_h // 2 + 1, :] += hor + + return square, b + + +def convert_main(model_name, input_path, output_path, class_num=1000): + model = models.__dict__[model_name]() + + main_prog = fluid.Program() + acnet_prog = fluid.Program() + startup_prog = fluid.Program() + + with fluid.program_guard(acnet_prog, startup_prog): + with fluid.unique_name.guard(): + image = fluid.data( + name="image", + shape=[-1, 3, 224, 224], + dtype="float32", + lod_level=0) + model_train = models.__dict__[model_name](deploy=False) + model_train.net(image, class_dim=1000) + + with fluid.program_guard(main_prog, startup_prog): + with fluid.unique_name.guard(): + image = fluid.data( + name="image", + shape=[-1, 3, 224, 224], + dtype="float32", + lod_level=0) + model_infer = models.__dict__[model_name](deploy=True) + model_infer.net(image, class_dim=1000) + + acnet_prog = acnet_prog.clone(for_test=True) + main_prog = main_prog.clone(for_test=True) + + place = fluid.CUDAPlace(0) + exe = fluid.Executor(place) + exe.run(startup_prog) + + assert os.path.exists( + input_path), "Pretrained model path {} not exist!".format(input_path) + 
fluid.io.load_vars(exe, input_path, + main_program=acnet_prog, + predicate=lambda var: os.path.exists(os.path.join(input_path, var.name))) + + mapping = {} + + for param in main_prog.blocks[0].all_parameters(): + if "acsquare" in param.name: + name_root = "_".join(param.name.split("_")[:-2]) + if name_root in mapping.keys(): + mapping[name_root].append(param.name) + else: + mapping[name_root] = [param.name] + else: + assert param.name not in mapping.keys() + mapping[param.name] = [param.name] + + for name_root, names in mapping.items(): + if len(names) == 1: + pass + else: + if "bias" in names[0]: + bias_id = 0 + kernel_id = 1 + else: + bias_id = 1 + kernel_id = 0 + + tensor_bias = fluid.global_scope().find_var(names[ + bias_id]).get_tensor() + tensor_kernel = fluid.global_scope().find_var(names[ + kernel_id]).get_tensor() + + ret = get_kernel_bn_tensors(name_root) + kernels = ret['kernels'] + gammas, betas, means, var = ret['bn'] + + kernel, bias = kernel_fusion(kernels, gammas, betas, means, var) + + logger.info("Before {}: {}".format(names[ + kernel_id], np.array(tensor_kernel).ravel()[:5])) + + tensor_bias.set(bias, place) + tensor_kernel.set(kernel, place) + + logger.info("After {}: {}\n".format(names[ + kernel_id], np.array(tensor_kernel).ravel()[:5])) + + if os.path.isdir(output_path): + shutil.rmtree(output_path) + os.makedirs(output_path) + fluid.io.save_persistables(exe, output_path, main_program=main_prog) + + +if __name__ == "__main__": + assert len( + sys.argv + ) == 5, "input format: python weights_aggregator.py $model_name $input_path $output_path $class_num" + model_name = sys.argv[1] + input_path = sys.argv[2] + output_path = sys.argv[3] + class_num = int(sys.argv[4]) + logger.info("model_name: {}".format(model_name)) + logger.info("input_path: {}".format(input_path)) + logger.info("output_path: {}".format(output_path)) + logger.info("class_num: {}".format(class_num)) + convert_main(model_name, input_path, output_path, class_num) diff --git a/PaddleCV/image_classification/utils/dist_utils.py b/PaddleCV/image_classification/utils/dist_utils.py index 29df3d3b110357653bd46723298de1d98d296659..681c260e6e07493b0e8035dfcf7b046d8e2f3ba0 100755 --- a/PaddleCV/image_classification/utils/dist_utils.py +++ b/PaddleCV/image_classification/utils/dist_utils.py @@ -17,6 +17,10 @@ from __future__ import division from __future__ import print_function import os import paddle.fluid as fluid +import logging + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) def nccl2_prepare(args, startup_prog, main_prog): @@ -81,8 +85,8 @@ def prepare_for_multi_process(exe, build_strategy, train_prog): trainer_id = int(os.environ.get('PADDLE_TRAINER_ID', 0)) num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) if num_trainers < 2: return - print("PADDLE_TRAINERS_NUM", num_trainers) - print("PADDLE_TRAINER_ID", trainer_id) + logger.info("PADDLE_TRAINERS_NUM %s" % num_trainers) + logger.info("PADDLE_TRAINER_ID %s" % trainer_id) build_strategy.num_trainers = num_trainers build_strategy.trainer_id = trainer_id # NOTE(zcd): use multi processes to train the model, diff --git a/PaddleCV/image_classification/utils/optimizer.py b/PaddleCV/image_classification/utils/optimizer.py index 16b96267d274434c6e496e586cef47a13ae9e074..176ba0af0be50aaaba9decffa79c0917f137dfe2 100644 --- a/PaddleCV/image_classification/utils/optimizer.py +++ b/PaddleCV/image_classification/utils/optimizer.py @@ -20,7 +20,6 @@ import math import paddle.fluid as fluid import paddle.fluid.layers.ops as ops 
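In the learning-rate hunks just below, warmup ramps the rate linearly over `warm_up_epoch` epochs and the cosine phase now spans `epochs - warmup_epoch` epochs instead of `epochs`, so the schedule reaches its minimum at the final step. A plain-Python reference of the intended curve, with illustrative step counts and computed outside the Fluid graph:

```
# Reference curve for the warmup + cosine schedule built below with fluid.layers
# ops; step_each_epoch and epochs here are illustrative values only.
import math

def warmup_cosine_lr(base_lr, global_step, step_each_epoch, epochs, warm_up_epoch=5.0):
    epoch = global_step // step_each_epoch
    if epoch < warm_up_epoch:
        return base_lr * global_step / (step_each_epoch * warm_up_epoch)
    progress = (global_step - warm_up_epoch * step_each_epoch) / \
               ((epochs - warm_up_epoch) * step_each_epoch)
    return base_lr * (math.cos(progress * math.pi) + 1) / 2

lrs = [warmup_cosine_lr(0.1, s, step_each_epoch=100, epochs=120) for s in range(12000)]
assert lrs[0] == 0.0 and abs(max(lrs) - 0.1) < 1e-6 and lrs[-1] < 1e-4
```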
-from paddle.fluid.initializer import init_on_cpu from paddle.fluid.layers.learning_rate_scheduler import _decay_step_counter @@ -30,14 +29,16 @@ def cosine_decay(learning_rate, step_each_epoch, epochs=120): """ global_step = _decay_step_counter() - with init_on_cpu(): - epoch = ops.floor(global_step / step_each_epoch) - decayed_lr = learning_rate * \ - (ops.cos(epoch * (math.pi / epochs)) + 1)/2 + epoch = ops.floor(global_step / step_each_epoch) + decayed_lr = learning_rate * \ + (ops.cos(epoch * (math.pi / epochs)) + 1)/2 return decayed_lr -def cosine_decay_with_warmup(learning_rate, step_each_epoch, epochs=120): +def cosine_decay_with_warmup(learning_rate, + step_each_epoch, + epochs=120, + warm_up_epoch=5.0): """Applies cosine decay to the learning rate. lr = 0.05 * (math.cos(epoch * (math.pi / 120)) + 1) decrease lr for every mini-batch and start with warmup. @@ -51,49 +52,55 @@ def cosine_decay_with_warmup(learning_rate, step_each_epoch, epochs=120): name="learning_rate") warmup_epoch = fluid.layers.fill_constant( - shape=[1], dtype='float32', value=float(5), force_cpu=True) + shape=[1], dtype='float32', value=float(warm_up_epoch), force_cpu=True) - with init_on_cpu(): - epoch = ops.floor(global_step / step_each_epoch) - with fluid.layers.control_flow.Switch() as switch: - with switch.case(epoch < warmup_epoch): - decayed_lr = learning_rate * (global_step / - (step_each_epoch * warmup_epoch)) - fluid.layers.tensor.assign(input=decayed_lr, output=lr) - with switch.default(): - decayed_lr = learning_rate * \ - (ops.cos((global_step - warmup_epoch * step_each_epoch) * (math.pi / (epochs * step_each_epoch))) + 1)/2 - fluid.layers.tensor.assign(input=decayed_lr, output=lr) + epoch = ops.floor(global_step / step_each_epoch) + with fluid.layers.control_flow.Switch() as switch: + with switch.case(epoch < warmup_epoch): + decayed_lr = learning_rate * (global_step / + (step_each_epoch * warmup_epoch)) + fluid.layers.tensor.assign(input=decayed_lr, output=lr) + with switch.default(): + decayed_lr = learning_rate * \ + (ops.cos((global_step - warmup_epoch * step_each_epoch) * (math.pi / ((epochs-warmup_epoch) * step_each_epoch))) + 1)/2 + fluid.layers.tensor.assign(input=decayed_lr, output=lr) return lr -def exponential_decay_with_warmup(learning_rate, step_each_epoch, decay_epochs, decay_rate=0.97, warm_up_epoch=5.0): + +def exponential_decay_with_warmup(learning_rate, + step_each_epoch, + decay_epochs, + decay_rate=0.97, + warm_up_epoch=5.0): """Applies exponential decay to the learning rate.
""" global_step = _decay_step_counter() lr = fluid.layers.tensor.create_global_var( - shape=[1], - value=0.0, - dtype='float32', - persistable=True, - name="learning_rate") + shape=[1], + value=0.0, + dtype='float32', + persistable=True, + name="learning_rate") warmup_epoch = fluid.layers.fill_constant( shape=[1], dtype='float32', value=float(warm_up_epoch), force_cpu=True) - with init_on_cpu(): - epoch = ops.floor(global_step / step_each_epoch) - with fluid.layers.control_flow.Switch() as switch: - with switch.case(epoch < warmup_epoch): - decayed_lr = learning_rate * (global_step / (step_each_epoch * warmup_epoch)) - fluid.layers.assign(input=decayed_lr, output=lr) - with switch.default(): - div_res = (global_step - warmup_epoch * step_each_epoch) / decay_epochs - div_res = ops.floor(div_res) - decayed_lr = learning_rate * (decay_rate ** div_res) - fluid.layers.assign(input=decayed_lr, output=lr) + epoch = ops.floor(global_step / step_each_epoch) + with fluid.layers.control_flow.Switch() as switch: + with switch.case(epoch < warmup_epoch): + decayed_lr = learning_rate * (global_step / + (step_each_epoch * warmup_epoch)) + fluid.layers.assign(input=decayed_lr, output=lr) + with switch.default(): + div_res = ( + global_step - warmup_epoch * step_each_epoch) / decay_epochs + div_res = ops.floor(div_res) + decayed_lr = learning_rate * (decay_rate**div_res) + fluid.layers.assign(input=decayed_lr, output=lr) return lr + def lr_warmup(learning_rate, warmup_steps, start_lr, end_lr): """ Applies linear learning rate warmup for distributed training Argument learning_rate can be float or a Variable @@ -197,7 +204,8 @@ class Optimizer(object): learning_rate = cosine_decay_with_warmup( learning_rate=self.lr, step_each_epoch=self.step, - epochs=self.num_epochs) + epochs=self.num_epochs, + warm_up_epoch=self.warm_up_epochs) optimizer = fluid.optimizer.Momentum( learning_rate=learning_rate, momentum=self.momentum_rate, @@ -222,8 +230,7 @@ class Optimizer(object): regularization=fluid.regularizer.L2Decay(self.l2_decay), momentum=self.momentum_rate, rho=0.9, - epsilon=0.001 - ) + epsilon=0.001) return optimizer def linear_decay(self): diff --git a/PaddleCV/image_classification/utils/utility.py b/PaddleCV/image_classification/utils/utility.py index 092f3c05524e303845ac7247169b6c73812a6be9..46de720468eab594842da4aea4e1e357dc108ad1 100644 --- a/PaddleCV/image_classification/utils/utility.py +++ b/PaddleCV/image_classification/utils/utility.py @@ -16,24 +16,31 @@ from __future__ import absolute_import from __future__ import division from __future__ import print_function -import distutils.util -import numpy as np import six import argparse import functools -import logging import sys import os +import logging import warnings import signal +import json +import numpy as np import paddle import paddle.fluid as fluid from paddle.fluid.wrapped_decorator import signature_safe_contextmanager from paddle.fluid.framework import Program, program_guard, name_scope, default_main_program from paddle.fluid import unique_name, layers + +import distutils.util from utils import dist_utils +from utils.optimizer import Optimizer + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + def print_arguments(args): """Print argparse's arguments. @@ -50,14 +57,15 @@ def print_arguments(args): :param args: Input argparse.Namespace for printing. 
:type args: argparse.Namespace """ - print("------------- Configuration Arguments -------------") + + logger.info("------------- Configuration Arguments -------------") for arg, value in sorted(six.iteritems(vars(args))): - print("%25s : %s" % (arg, value)) - print("----------------------------------------------------") + logger.info("%25s : %s" % (arg, value)) + logger.info("----------------------------------------------------") def add_arguments(argname, type, default, help, argparser, **kwargs): - """Add argparse's argument. + """Add argparse's argument. Usage: @@ -79,7 +87,7 @@ def add_arguments(argname, type, default, help, argparser, **kwargs): def parse_args(): """Add arguments - Returns: + Returns: all training args """ parser = argparse.ArgumentParser(description=__doc__) @@ -91,6 +99,7 @@ def parse_args(): add_arg('model_save_dir', str, "./output", "The directory path to save model.") add_arg('data_dir', str, "./data/ILSVRC2012/", "The ImageNet dataset root directory.") add_arg('pretrained_model', str, None, "Whether to load pretrained model.") + add_arg('finetune_exclude_pretrained_params', str, None, "Ignore params when doing finetune") add_arg('checkpoint', str, None, "Whether to resume checkpoint.") add_arg('print_step', int, 10, "The steps interval to print logs") add_arg('save_step', int, 1, "The steps interval to save checkpoints") @@ -98,11 +107,11 @@ def parse_args(): # SOLVER AND HYPERPARAMETERS add_arg('model', str, "ResNet50", "The name of network.") add_arg('total_images', int, 1281167, "The number of total training images.") + parser.add_argument('--image_shape', nargs='+', type=int, default=[3, 224, 224], help="The shape of image") add_arg('num_epochs', int, 120, "The number of total epochs.") add_arg('class_dim', int, 1000, "The number of total classes.") - add_arg('image_shape', str, "3,224,224", "The size of Input image, order: [channels, height, weidth] ") - add_arg('batch_size', int, 8, "Minibatch size on a device.") - add_arg('test_batch_size', int, 16, "Test batch size on a deveice.") + add_arg('batch_size', int, 8, "Minibatch size on all the devices.") + add_arg('test_batch_size', int, 8, "Test batch size on all the devices.") add_arg('lr', float, 0.1, "The learning rate.") add_arg('lr_strategy', str, "piecewise_decay", "The learning rate decay strategy.") add_arg('l2_decay', float, 1e-4, "The l2_decay parameter.") @@ -114,11 +123,11 @@ def parse_args(): parser.add_argument('--step_epochs', nargs='+', type=int, default=[30, 60, 90], help="piecewise decay step") # READER AND PREPROCESS + add_arg('use_dali', bool, False, "Whether to use nvidia DALI for preprocessing") add_arg('lower_scale', float, 0.08, "The value of lower_scale in ramdom_crop") add_arg('lower_ratio', float, 3./4., "The value of lower_ratio in ramdom_crop") add_arg('upper_ratio', float, 4./3., "The value of upper_ratio in ramdom_crop") add_arg('resize_short_size', int, 256, "The value of resize_short_size") - add_arg('crop_size', int, 224, "The value of crop size") add_arg('use_mixup', bool, False, "Whether to use mixup") add_arg('mixup_alpha', float, 0.2, "The value of mixup_alpha") add_arg('reader_thread', int, 8, "The number of multi thread reader") @@ -129,40 +138,51 @@ def parse_args(): parser.add_argument('--image_std', nargs='+', type=float, default=[0.229, 0.224, 0.225], help="The std of input image data") # SWITCH - #NOTE: (2019/08/08) FP16 is moving to PaddlePaddle/Fleet now - #add_arg('use_fp16', bool, False, "Whether to enable half precision training with fp16." 
) - #add_arg('scale_loss', float, 1.0, "The value of scale_loss for fp16." ) + add_arg('validate', bool, True, "whether to validate when training.") + add_arg('use_fp16', bool, False, "Whether to enable half precision training with fp16." ) + add_arg('scale_loss', float, 1.0, "The value of scale_loss for fp16." ) + add_arg('use_dynamic_loss_scaling', bool, True, "Whether to use dynamic loss scaling.") + add_arg('data_format', str, "NCHW", "Tensor data format when training.") + add_arg('fuse_elewise_add_act_ops', bool, False, "Whether to use elementwise_act fusion.") + add_arg('fuse_bn_act_ops', bool, False, "Whether to use batch_norm and act fusion.") + add_arg('use_label_smoothing', bool, False, "Whether to use label_smoothing") add_arg('label_smoothing_epsilon', float, 0.1, "The value of label_smoothing_epsilon parameter") #NOTE: (2019/08/08) temporary disable use_distill #add_arg('use_distill', bool, False, "Whether to use distill") - add_arg('random_seed', int, None, "random seed") add_arg('use_ema', bool, False, "Whether to use ExponentialMovingAverage.") add_arg('ema_decay', float, 0.9999, "The value of ema decay rate") add_arg('padding_type', str, "SAME", "Padding type of convolution") add_arg('use_se', bool, True, "Whether to use Squeeze-and-Excitation module for EfficientNet.") - # yapf: enable + #NOTE: args for profiler + add_arg("enable_ce", bool, False, "Whether to enable ce") + add_arg('random_seed', int, None, "random seed") + add_arg('is_profiler', bool, False, "Whether to start the profiler") + add_arg('profiler_path', str, './profilier_files', "the profiler output file path") + add_arg('max_iter', int, 0, "the max train batch num") + add_arg('same_feed', int, 0, "whether to feed same images") + + + # yapf: enable args = parser.parse_args() return args def check_gpu(): - """ + """ Log error and exit when set use_gpu=true in paddlepaddle cpu ver sion. """ - logger = logging.getLogger(__name__) err = "Config use_gpu cannot be set as true while you are " \ "using paddlepaddle cpu version ! \nPlease try: \n" \ "\t1. Install paddlepaddle-gpu to run model on GPU \n" \ "\t2. Set use_gpu as false in config file to run " \ "model on CPU" - try: if args.use_gpu and not fluid.is_compiled_with_cuda(): - print(err) + logger.error(err) sys.exit(1) except Exception as e: pass @@ -180,7 +200,7 @@ def check_version(): try: fluid.require_version('1.6.0') except Exception as e: - print(err) + logger.error(err) sys.exit(1) @@ -199,29 +219,30 @@ def check_args(args): args.model, model_list) # check learning rate strategy - lr_strategy_list = [ - "piecewise_decay", "cosine_decay", "linear_decay", - "cosine_decay_warmup", "exponential_decay_warmup" - ] + lr_strategy_list = [l for l in dir(Optimizer) if not l.startswith('__')] if args.lr_strategy not in lr_strategy_list: - warnings.warn( - "\n{} is not in lists: {}, \nUse default learning strategy now.". + logger.warning( + "\n{} is not in lists: {}, \nUse default learning strategy now!". format(args.lr_strategy, lr_strategy_list)) args.lr_strategy = "default_decay" + # check confict of GoogLeNet and mixup if args.model == "GoogLeNet": assert args.use_mixup == False, "Cannot use mixup processing in GoogLeNet, please set use_mixup = False." 
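These parser changes are why the training scripts earlier in the diff can drop `--total_images`, `--class_dim` and the comma-separated `--image_shape`: the defaults cover ImageNet, and the image shape is now a list of integers taken with `nargs='+'`. A standalone argparse sketch of the new behaviour, reduced to the three flags involved:

```
# Standalone sketch of the new image_shape handling: a list of ints with an
# ImageNet default, overridden as `--image_shape 3 299 299` in the Xception scripts.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--total_images', type=int, default=1281167)
parser.add_argument('--class_dim', type=int, default=1000)
parser.add_argument('--image_shape', nargs='+', type=int, default=[3, 224, 224])

print(parser.parse_args([]).image_shape)                                      # [3, 224, 224]
print(parser.parse_args(['--image_shape', '3', '299', '299']).image_shape)    # [3, 299, 299]
```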
+ # check interpolation of reader settings if args.interpolation: assert args.interpolation in [ 0, 1, 2, 3, 4 ], "Wrong interpolation, please set:\n0: cv2.INTER_NEAREST\n1: cv2.INTER_LINEAR\n2: cv2.INTER_CUBIC\n3: cv2.INTER_AREA\n4: cv2.INTER_LANCZOS4" + # check padding type if args.padding_type: assert args.padding_type in [ "SAME", "VALID", "DYNAMIC" ], "Wrong padding_type, please set:\nSAME\nVALID\nDYNAMIC" + # check checkpint and pretrained_model assert args.checkpoint is None or args.pretrained_model is None, "Do not init model by checkpoint and pretrained_model both." # check pretrained_model path for loading @@ -238,14 +259,6 @@ def check_args(args): args.checkpoint ), "please support available checkpoint path for initing model." - # check params for loading - """ - if args.save_params: - assert isinstance(args.save_params, str) - assert os.path.isdir( - args.save_params), "please support available save_params path." - """ - # check gpu: when using gpu, the number of visible cards should divide batch size if args.use_gpu: assert args.batch_size % fluid.core.get_cuda_device_count( @@ -257,35 +270,95 @@ def check_args(args): args.data_dir ), "Data doesn't exist in {}, please load right path".format(args.data_dir) - #check gpu + # check CE + if args.enable_ce: + args.random_seed = 0 + logger.warning("CE is running now! already set random seed to 0") + # check class_dim + assert args.class_dim > 1, "class_dim must greater than 1" + + # check dali preprocess + if args.use_dali: + logger.warning( + "DALI preprocessing is activated!!!\nWarning: 1. Please make sure paddlepaddle is compiled by GCC5.4 or later version!\n\t 2. Please make sure nightly builds DALI is installed correctly.\n----------------------------------------------------" + ) + + #check gpu check_gpu() check_version() def init_model(exe, args, program): + """load model from checkpoint or pretrained model + """ + if args.checkpoint: fluid.io.load_persistables(exe, args.checkpoint, main_program=program) - print("Finish initing model from %s" % (args.checkpoint)) + logger.info("Finish initing model from %s" % (args.checkpoint)) if args.pretrained_model: - - def if_exist(var): - return os.path.exists(os.path.join(args.pretrained_model, var.name)) - + """ + # yapf: disable + # This is a dict of fc layers in all the classification models. + final_fc_name = [ + "fc8_weights","fc8_offset", #alexnet + "fc_weights","fc_offset", #darknet, densenet, dpn, hrnet, mobilenet_v3, res2net, res2net_vd, resnext, resnext_vd, xception + #efficient + "out","out_offset", "out1","out1_offset", "out2","out2_offset", #googlenet + "final_fc_weights", "final_fc_offset", #inception_v4 + "fc7_weights", "fc7_offset", #mobilenetv1 + "fc10_weights", "fc10_offset", #mobilenetv2 + "fc_0", #resnet, resnet_vc, resnet_vd + "fc.weight", "fc.bias", #resnext101_wsl + "fc6_weights", "fc6_offset", #se_resnet_vd, se_resnext, se_resnext_vd, shufflenet_v2, shufflenet_v2_swish, + #squeezenet + "fc8_weights", "fc8_offset", #vgg + "fc_bias" #"fc_weights", xception_deeplab + ] + # yapf: enable + """ + final_fc_name = [] + if args.finetune_exclude_pretrained_params: + final_fc_name = [ + str(s) + for s in args.finetune_exclude_pretrained_params.split(",") + ] + + def is_parameter(var): + fc_exclude_flag = False + for item in final_fc_name: + if item in var.name: + fc_exclude_flag = True + + return isinstance( + var, fluid.framework. 
+ Parameter) and not fc_exclude_flag and os.path.exists( + os.path.join(args.pretrained_model, var.name)) + + logger.info("Load pretrain weights from {}, exclude params {}.".format( + args.pretrained_model, final_fc_name)) + vars = filter(is_parameter, program.list_vars()) fluid.io.load_vars( - exe, - args.pretrained_model, - main_program=program, - predicate=if_exist) + exe, args.pretrained_model, vars=vars, main_program=program) def save_model(args, exe, train_prog, info): + """save model in model_path + """ + model_path = os.path.join(args.model_save_dir, args.model, str(info)) if not os.path.isdir(model_path): os.makedirs(model_path) fluid.io.save_persistables(exe, model_path, main_program=train_prog) - print("Already save model in %s" % (model_path)) + logger.info("Already save model in %s" % (model_path)) + + +def save_json(info, path): + """ save eval result or infer result to file as json format. + """ + with open(path, 'w') as f: + json.dump(info, f) def create_data_loader(is_train, args): @@ -294,15 +367,14 @@ def create_data_loader(is_train, args): Usage: Using mixup process in training, it will return 5 results, include data_loader, image, y_a(label), y_b(label) and lamda, or it will return 3 results, include data_loader, image, and label. - Args: + Args: is_train: mode args: arguments Returns: - data_loader and the input data of net, + data_loader and the input data of net, """ - image_shape = [int(m) for m in args.image_shape.split(",")] - + image_shape = args.image_shape feed_image = fluid.data( name="feed_image", shape=[None] + image_shape, @@ -314,6 +386,8 @@ def create_data_loader(is_train, args): feed_y_a = fluid.data( name="feed_y_a", shape=[None, 1], dtype="int64", lod_level=0) + capacity = 64 if int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) <= 1 else 8 + if is_train and args.use_mixup: feed_y_b = fluid.data( name="feed_y_b", shape=[None, 1], dtype="int64", lod_level=0) @@ -322,21 +396,31 @@ def create_data_loader(is_train, args): data_loader = fluid.io.DataLoader.from_generator( feed_list=[feed_image, feed_y_a, feed_y_b, feed_lam], - capacity=64, + capacity=capacity, use_double_buffer=True, - iterable=False) + iterable=True) return data_loader, [feed_image, feed_y_a, feed_y_b, feed_lam] else: + if args.use_dali: + return None, [feed_image, feed_label] + data_loader = fluid.io.DataLoader.from_generator( feed_list=[feed_image, feed_label], - capacity=64, + capacity=capacity, use_double_buffer=True, - iterable=False) + iterable=True) return data_loader, [feed_image, feed_label] -def print_info(pass_id, batch_id, print_step, metrics, time_info, info_mode): +def print_info(info_mode, + metrics, + time_info, + pass_id=0, + batch_id=0, + print_step=1, + device_num=1, + class_dim=5): """print function Args: @@ -347,30 +431,33 @@ def print_info(pass_id, batch_id, print_step, metrics, time_info, info_mode): time_info: time infomation info_mode: mode """ + #XXX: Use specific name to choose pattern, not the length of metrics. if info_mode == "batch": if batch_id % print_step == 0: #if isinstance(metrics,np.ndarray): # train and mixup output if len(metrics) == 2: loss, lr = metrics - print( + logger.info( "[Pass {0}, train batch {1}] \tloss {2}, lr {3}, elapse {4}". format(pass_id, batch_id, "%.5f" % loss, "%.5f" % lr, "%2.4f sec" % time_info)) # train and no mixup output elif len(metrics) == 4: loss, acc1, acc5, lr = metrics - print( - "[Pass {0}, train batch {1}] \tloss {2}, acc1 {3}, acc5 {4}, lr {5}, elapse {6}". 
+ logger.info( + "[Pass {0}, train batch {1}] \tloss {2}, acc1 {3}, acc{7} {4}, lr {5}, elapse {6}". format(pass_id, batch_id, "%.5f" % loss, "%.5f" % acc1, - "%.5f" % acc5, "%.5f" % lr, "%2.4f sec" % time_info)) + "%.5f" % acc5, "%.5f" % lr, "%2.4f sec" % time_info, + min(class_dim, 5))) # test output elif len(metrics) == 3: loss, acc1, acc5 = metrics - print( - "[Pass {0}, test batch {1}] \tloss {2}, acc1 {3}, acc5 {4}, elapse {5}". + logger.info( + "[Pass {0}, test batch {1}] \tloss {2}, acc1 {3}, acc{6} {4}, elapse {5}". format(pass_id, batch_id, "%.5f" % loss, "%.5f" % acc1, - "%.5f" % acc5, "%2.4f sec" % time_info)) + "%.5f" % acc5, "%2.4f sec" % time_info, + min(class_dim, 5))) else: raise Exception( "length of metrics {} is not implemented, It maybe caused by wrong format of build_program_output". @@ -379,28 +466,54 @@ def print_info(pass_id, batch_id, print_step, metrics, time_info, info_mode): elif info_mode == "epoch": ## TODO add time elapse - #if isinstance(metrics,np.ndarray): if len(metrics) == 5: train_loss, _, test_loss, test_acc1, test_acc5 = metrics - print( - "[End pass {0}]\ttrain_loss {1}, test_loss {2}, test_acc1 {3}, test_acc5 {4}". + logger.info( + "[End pass {0}]\ttrain_loss {1}, test_loss {2}, test_acc1 {3}, test_acc{5} {4}". format(pass_id, "%.5f" % train_loss, "%.5f" % test_loss, "%.5f" - % test_acc1, "%.5f" % test_acc5)) + % test_acc1, "%.5f" % test_acc5, min(class_dim, 5))) elif len(metrics) == 7: train_loss, train_acc1, train_acc5, _, test_loss, test_acc1, test_acc5 = metrics - print( - "[End pass {0}]\ttrain_loss {1}, train_acc1 {2}, train_acc5 {3},test_loss {4}, test_acc1 {5}, test_acc5 {6}". + logger.info( + "[End pass {0}]\ttrain_loss {1}, train_acc1 {2}, train_acc{7} {3},test_loss {4}, test_acc1 {5}, test_acc{7} {6}". format(pass_id, "%.5f" % train_loss, "%.5f" % train_acc1, "%.5f" % train_acc5, "%.5f" % test_loss, "%.5f" % test_acc1, - "%.5f" % test_acc5)) + "%.5f" % test_acc5, min(class_dim, 5))) sys.stdout.flush() elif info_mode == "ce": - raise Warning("CE code is not ready") + assert len( + metrics + ) == 7, "Enable CE: The Metrics should contain train_loss, train_acc1, train_acc5, test_loss, test_acc1, test_acc5, and train_speed" + assert len( + time_info + ) > 10, "0~9th batch statistics will drop when doing benchmark or ce, because it might be mixed with startup time, so please make sure training at least 10 batches." + print_ce(device_num, metrics, time_info) else: raise Exception("Illegal info_mode") -def best_strategy_compiled(args, program, loss, exe): +def print_ce(device_num, metrics, time_info): + """ Print log for CE(for internal test). 
+ """ + train_loss, train_acc1, train_acc5, _, test_loss, test_acc1, test_acc5 = metrics + + train_speed = np.mean(np.array(time_info[10:])) + + logger.info("kpis\ttrain_cost_card{}\t{}".format(device_num, train_loss)) + logger.info("kpis\ttrain_acc1_card{}\t{}".format(device_num, train_acc1)) + logger.info("kpis\ttrain_acc5_card{}\t{}".format(device_num, train_acc5)) + logger.info("kpis\ttest_cost_card{}\t{}".format(device_num, test_loss)) + logger.info("kpis\ttest_acc1_card{}\t{}".format(device_num, test_acc1)) + logger.info("kpis\ttest_acc5_card{}\t{}".format(device_num, test_acc5)) + logger.info("kpis\ttrain_speed_card{}\t{}".format(device_num, train_speed)) + + +def best_strategy_compiled(args, + program, + loss, + exe, + mode="train", + share_prog=None): """make a program which wrapped by a compiled program """ @@ -408,11 +521,19 @@ def best_strategy_compiled(args, program, loss, exe): return program else: build_strategy = fluid.compiler.BuildStrategy() - #Feature will be supported in Fluid v1.6 - #build_strategy.enable_inplace = True + try: + fluid.require_version(min_version='1.7.0') + build_strategy.fuse_bn_act_ops = args.fuse_bn_act_ops + except Exception as e: + logger.info("PaddlePaddle version 1.7.0 or higher is " + "required when you want to fuse batch_norm and activation_op.") + build_strategy.fuse_elewise_add_act_ops = args.fuse_elewise_add_act_ops exec_strategy = fluid.ExecutionStrategy() - exec_strategy.num_threads = fluid.core.get_cuda_device_count() + + if args.use_gpu: + exec_strategy.num_threads = fluid.core.get_cuda_device_count() + exec_strategy.num_iteration_per_drop_scope = 10 num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) @@ -423,7 +544,8 @@ def best_strategy_compiled(args, program, loss, exe): exec_strategy.num_threads = 1 compiled_program = fluid.CompiledProgram(program).with_data_parallel( - loss_name=loss.name, + loss_name=loss.name if mode == "train" else None, + share_vars_from=share_prog if mode == "val" else None, build_strategy=build_strategy, exec_strategy=exec_strategy) diff --git a/PaddleCV/metric_learning/train_elem.py b/PaddleCV/metric_learning/train_elem.py index 5db572c2036ba1380f5415e36a6a4fc2726efb4c..c49003c5b095e7c290f7d2fb17dcab85b5c7f72a 100644 --- a/PaddleCV/metric_learning/train_elem.py +++ b/PaddleCV/metric_learning/train_elem.py @@ -274,8 +274,8 @@ def train_async(args): # This is for continuous evaluation only if args.enable_ce: # Use the mean cost/acc for training - print("kpis train_cost %s" % (avg_loss)) - print("kpis test_recall %s" % (recall)) + print("kpis\ttrain_cost\t{}".format(avg_loss)) + print("kpis\ttest_recall\t{}".format(recall)) def initlogging(): diff --git a/PaddleCV/ocr_recognition/README.md b/PaddleCV/ocr_recognition/README.md index f6ad4c905bebcf2628ac751e2b9d7e5db902d8ed..70c5fae52c2a823b4e977bccdaf382cc75fcf0f5 100644 --- a/PaddleCV/ocr_recognition/README.md +++ b/PaddleCV/ocr_recognition/README.md @@ -1,4 +1,3 @@ ->注意:在paddle1.5版本上训练attention model有收敛问题,建议您暂时使用paddle1.4版本,后续我们会修复该问题。 ## 代码结构 ``` diff --git a/PaddleCV/ocr_recognition/attention_model.py b/PaddleCV/ocr_recognition/attention_model.py index 963d2168fd6ec4f53724573895344825c18558a3..4a2dad271ed5ff91f251138e4df1504af4a8a5f6 100755 --- a/PaddleCV/ocr_recognition/attention_model.py +++ b/PaddleCV/ocr_recognition/attention_model.py @@ -188,7 +188,7 @@ def attention_train_net(args, data_shape, num_classes): prediction = gru_decoder_with_attention(trg_embedding, encoded_vector, encoded_proj, decoder_boot, decoder_size, num_classes) - 
fluid.clip.set_gradient_clip(fluid.clip.GradientClipByValue(args.gradient_clip)) + fluid.clip.set_gradient_clip(fluid.clip.GradientClipByGlobalNorm(args.gradient_clip)) label_out = fluid.layers.cast(x=label_out, dtype='int64') _, maxid = fluid.layers.topk(input=prediction, k=1) diff --git a/PaddleCV/ocr_recognition/crnn_ctc_model.py b/PaddleCV/ocr_recognition/crnn_ctc_model.py index 178f329299104eed2c5c3d8b1c56a66f487f694a..7650478a2fe75ee6dd5aa5092ecc057bca4f1eda 100755 --- a/PaddleCV/ocr_recognition/crnn_ctc_model.py +++ b/PaddleCV/ocr_recognition/crnn_ctc_model.py @@ -16,7 +16,6 @@ from __future__ import division from __future__ import print_function import paddle.fluid as fluid from paddle.fluid.layers.learning_rate_scheduler import _decay_step_counter -from paddle.fluid.initializer import init_on_cpu import math import six diff --git a/PaddleCV/ocr_recognition/data_reader.py b/PaddleCV/ocr_recognition/data_reader.py index f1b529391d9fb2ba8f2c43ce2257b29ac971374b..6e42e2af4fedd87ac65cb56a9461aba9b98041d4 100644 --- a/PaddleCV/ocr_recognition/data_reader.py +++ b/PaddleCV/ocr_recognition/data_reader.py @@ -212,8 +212,7 @@ class DataGenerator(object): img = img.resize((img.size[0], DATA_SHAPE[1])) # resize height img = np.array(img) - 127.5 img = img[np.newaxis, ...] - label = [int(c) for c in line.split(' ')[3].split(',')] - yield img, label + yield img, [[0]] if img_label_list is not None: lines = [] diff --git a/PaddleCV/ocr_recognition/run_attention.sh b/PaddleCV/ocr_recognition/run_attention.sh index 50ddba7119d3b9ad09150fe8c677a22ea7732ab2..beae85cf719d9ec01f599812e7412dadb9e5b681 100644 --- a/PaddleCV/ocr_recognition/run_attention.sh +++ b/PaddleCV/ocr_recognition/run_attention.sh @@ -1,7 +1,7 @@ export CUDA_VISIBLE_DEVICES=0 nohup python train.py \ --lr=1.0 \ ---gradient_clip=10 \ +--gradient_clip=5.0 \ --model="attention" \ --log_period=10 \ > attention.log 2>&1 & diff --git a/PaddleCV/rcnn/train.py b/PaddleCV/rcnn/train.py index 705ad33a0eeaa1e645d4e943ad2584de7c9dcd38..e858bd95eb1a572df2af1560bbc6f378fbf4e7f7 100644 --- a/PaddleCV/rcnn/train.py +++ b/PaddleCV/rcnn/train.py @@ -40,6 +40,7 @@ import collections import paddle import paddle.fluid as fluid +from paddle.fluid import profiler import reader import models.model_builder as model_builder import models.resnet as resnet @@ -197,6 +198,14 @@ def train(): sys.stdout.flush() if (iter_id + 1) % cfg.TRAIN.snapshot_iter == 0: save_model("model_iter{}".format(iter_id)) + + #profiler tools, used for benchmark + if args.is_profiler and iter_id == 10: + profiler.start_profiler("All") + elif args.is_profiler and iter_id == 15: + profiler.stop_profiler("total", args.profiler_path) + return + end_time = time.time() total_time = end_time - start_time last_loss = np.array(outs[0]).mean() @@ -232,6 +241,12 @@ def train(): save_model("model_iter{}".format(iter_id)) if (iter_id + 1) == cfg.max_iter: break + #profiler tools, used for benchmark + if args.is_profiler and iter_id == 10: + profiler.start_profiler("All") + elif args.is_profiler and iter_id == 15: + profiler.stop_profiler("total", args.profiler_path) + return end_time = time.time() total_time = end_time - start_time last_loss = np.array(outs[0]).mean() diff --git a/PaddleCV/rcnn/utility.py b/PaddleCV/rcnn/utility.py index 9df7f78b844e2a83ed3c91b93566d8fa78c5ce4e..c464d4efcc5fd5162269506a2863c873d7e71dbe 100644 --- a/PaddleCV/rcnn/utility.py +++ b/PaddleCV/rcnn/utility.py @@ -149,6 +149,11 @@ def parse_args(): add_arg('variance', float, [1.,1.,1.,1.], "The variance of anchors.") 
add_arg('rpn_stride', float, [16.,16.], "Stride of the feature map that RPN is attached.") add_arg('rpn_nms_thresh', float, 0.7, "NMS threshold used on RPN proposals") + + #NOTE: args for profiler, used for benchmark + add_arg('is_profiler', int, 0, "the profiler switch.(used for benchmark)") + add_arg('profiler_path', str, './', "the profiler output file path.(used for benchmark)") + # TRAIN VAL INFER add_arg('MASK_ON', bool, False, "Option for different models. If False, choose faster_rcnn. If True, choose mask_rcnn") add_arg('im_per_batch', int, 1, "Minibatch size.") diff --git a/PaddleCV/rrpn/README.md b/PaddleCV/rrpn/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d9e6fc34e3c4f4cc0189b388ac9f73891afce100 --- /dev/null +++ b/PaddleCV/rrpn/README.md @@ -0,0 +1,168 @@ +# RRPN 旋转物体检测 + +--- +## 内容 + +- [安装](#安装) +- [简介](#简介) +- [数据准备](#数据准备) +- [模型训练](#模型训练) +- [模型评估](#模型评估) +- [模型推断及可视化](#模型推断及可视化) + +## 安装 + +在当前目录下运行样例代码需要PadddlePaddle Fluid的develop或以上的版本。如果你的运行环境中的PaddlePaddle低于此版本,请根据[安装文档](http://www.paddlepaddle.org/)中的说明来更新PaddlePaddle。 + + +## 简介 +RRPN是在Faster RCNN基础上拓展出的两阶段目标检测器,可用于文字检测和旋转物体检测。通过对图像生成候选区域,提取特征,判别特征类别并修正候选框位置。 + +[RRPN](https://arxiv.org/abs/1703.01086) 整体网络可以分为4个主要内容: + +1. 基础卷积层。作为一种卷积神经网络目标检测方法,RRPN首先使用一组基础的卷积网络提取图像的特征图。特征图被后续RPN层和全连接层共享。本示例采用[ResNet-50](https://arxiv.org/abs/1512.03385)作为基础卷积层。 +2. 区域生成网络(RPN)。RPN网络用于生成候选区域(proposals)。该层通过一组固定的尺寸、比例和角度得到一组带方向锚点(anchors), 通过softmax判断旋转的锚点属于前景或者背景,再利用区域回归修正锚点从而获得精确的候选区域。 +3. Rotated RoI Align。该层收集输入的特征图和带方向的候选区域,将带方向的候选区域映射到特征图中进行并池化为统一大小的区域特征图,送入全连接层判定目标类别。 +4. 检测层。利用区域特征图计算候选区域的类别,同时再次通过区域回归获得检测框最终的精确位置。 + +### 编译自定义OP + +自定义OP编译方式如下: + + 进入 `ext_op/src` 目录,执行编译脚本 + ``` + cd ext_op/src + sh make.sh ${cuda_path} ${cudnn_path} ${nccl_path} + ''' + 其中${cuda_path}、$cudnn_path}和{nccl_path}分别为cuda、cudnn、nccl的安装路径,需通过命令行进行指定 + 成功编译后,`ext_op/src` 目录下将会生成 `rrpn_lib.so` + +## 数据准备 +### 公开数据集 +在[ICDAR2015数据集](https://rrc.cvc.uab.es/?ch=4&com=downloads)上进行训练,数据集需进入官网进行注册后方可下载。 + +数据目录结构如下: + +``` +dataset/icdar2015/ +├── ch4_training_images +│ ├── img_143.jpg +│ ├── img_144.jpg +| ... +├── ch4_training_localization_transcription_gt +│ ├── gt_img_143.txt +│ ├── gt_img_144.txt +| ... +├── ch4_test_images +│ ├── img_111.jpg +│ ├── img_112.jpg +| ... +├── ch4_test_localization_transcription_gt +│ ├── img_111.jpg +│ ├── img_112.jpg +| ... 
+``` +### 自定义数据 +原始的RRPN只提供了二分类,若要使用自己数据进行训练多分类,需在utility.py中将dataset改为icdar2017,然后将class_num改为需求类别数,其中0为背景类。 + +训练自定义数据时,数据目录结构和ICDAR2015一致,标注数据格式如下: +``` +x1, y1, x2, y2, x3, y3, x4, y4, class_name +x1, y1, x2, y2, x3, y3, x4, y4, class_name +``` + +## 模型训练 + +**下载预训练模型:** 本示例提供Resnet-50预训练模型,采用如下命令下载预训练模型: + + sh ./pretrained/download.sh + + +通过初始化`pretrained_model` 加载预训练模型。同时在参数微调时也采用该设置加载已训练模型。 +请在训练前确认预训练模型下载与加载正确,否则训练过程中损失可能会出现NAN。 + + +- RRPN + + ``` + python train.py \ + --model_save_dir=output/ \ + --pretrained_model=${path_to_pretrain_model} \ + --data_dir=${path_to_data} \ + ``` + + + + - 通过设置export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7指定8卡GPU训练。 + + - 可选参数见: + + python train.py --help + +**数据读取器说明:** 数据读取器定义在reader.py中。所有图像将短边等比例缩放至`scales`,若长边大于`max_size`, 则再次将长边等比例缩放至`max_size`。在训练阶段,对图像采用随机旋转。 + +**模型设置:** + +* 使用RotatedRoIAlign方法。 +* 训练过程pre\_nms=12000, post\_nms=2000,测试过程pre\_nms=6000, post\_nms=1000。nms阈值为0.7。 +* RPN网络得到labels的过程中,fg\_fraction=0.25,fg\_thresh=0.5,bg\_thresh_hi=0.5,bg\_thresh\_lo=0.0 +* RPN选择anchor时,rpn\_fg\_fraction=0.5,rpn\_positive\_overlap=0.7,rpn\_negative\_overlap=0.3 + + +**训练策略:** +* 默认配置采用8卡,每卡batch size=1 +* 采用momentum优化算法训练,momentum=0.9。 +* 权重衰减系数为0.02,前500轮学习率从0.00333线性增加至0.01。在6250,12500轮时使用0.1,0.01乘子进行学习率衰减,最大训练17500轮。训练最大轮数和学习率策略可以在config.py中对max_iter和lr_steps进行设置。 +* 非基础卷积层卷积bias学习率为整体学习率2倍。 +* 基础卷积层中,affine_layers参数不更新,res2层参数不更新。 + +## 模型评估 + +模型评估是指对训练完毕的模型评估各类性能指标。本示例采用[ICDAR2015官方评估](https://rrc.cvc.uab.es/?com=contestant) + +`eval.py`是评估模块的主要执行程序,调用示例如下: + +- RRPN + + ``` + python eval.py \ + --dataset=icdar2015 \ + --pretrained_model=${path_to_trained_model} + ``` + + - 通过设置`--pretrained_model=${path_to_trained_model}`指定训练好的模型,注意不是初始化的模型。 + - 通过设置`export CUDA\_VISIBLE\_DEVICES=0`指定单卡GPU评估。 + + +下表为模型评估结果: + +RRPN + +| 模型 | 批量大小 | 迭代次数 | F1 | +| :--------------- | :------------: | :------------------: |------: | +| [RRPN](https://paddleseg.bj.bcebos.com/deploy/temp/model_final.tar) |8 | 17500 | 0.8048 | + + + + + + +## 模型推断及可视化 + +模型推断可以获取图像中的物体及其对应的类别,`infer.py`是主要执行程序,调用示例如下: + +``` +python infer.py \ + --pretrained_model=${path_to_trained_model} \ + --image_path=dataset/icdar2015 \ + --draw_threshold=0.6 +``` + +注意,请正确设置模型路径`${path_to_trained_model}`和预测图片路径。默认使用GPU设备,也可通过设置`--use_gpu=False`使用CPU设备。可通过设置`draw_threshold`调节得分阈值控制检测框的个数。 + +下图为模型可视化预测结果: +

+RRPN 预测可视化
+

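The warmup and step-decay schedule described in the training strategy above (linear warmup from 0.00333 to 0.01 over the first 500 iterations, then 0.1x decay at iterations 6250 and 12500) can be sketched with fluid's learning-rate helpers. This is a minimal illustration rather than the repository's training code; the function name and default arguments below are placeholders for the values configured in config.py:

```
import paddle.fluid as fluid

def warmup_piecewise_lr(base_lr=0.01, warmup_iter=500, start_factor=1. / 3,
                        lr_steps=(6250, 12500), lr_gamma=0.1):
    # step decay: base_lr -> base_lr * 0.1 -> base_lr * 0.01
    boundaries = list(lr_steps)
    values = [base_lr * (lr_gamma ** i) for i in range(len(boundaries) + 1)]
    lr = fluid.layers.piecewise_decay(boundaries=boundaries, values=values)
    # linear warmup from base_lr * start_factor (~0.00333) to base_lr
    return fluid.layers.linear_lr_warmup(
        learning_rate=lr,
        warmup_steps=warmup_iter,
        start_lr=base_lr * start_factor,
        end_lr=base_lr)
```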
diff --git a/PaddleCV/rrpn/__init__.py b/PaddleCV/rrpn/__init__.py new file mode 100755 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/PaddleCV/rrpn/checkpoint.py b/PaddleCV/rrpn/checkpoint.py new file mode 100644 index 0000000000000000000000000000000000000000..7062199e1b3a0fb0bd5619b503559a335e54b2e9 --- /dev/null +++ b/PaddleCV/rrpn/checkpoint.py @@ -0,0 +1,186 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals + +import errno +import os +import shutil +import time +import numpy as np +import re +import paddle.fluid as fluid +import logging +logger = logging.getLogger(__name__) + + +def load_params(exe, prog, path): + """ + Load model from the given path. + Args: + exe (fluid.Executor): The fluid.Executor object. + prog (fluid.Program): load weight to which Program object. + path (string): URL string or loca model path. + """ + + if not os.path.exists(path): + raise ValueError("Model pretrain path {} does not " + "exists.".format(path)) + + logger.info('Loading parameters from {}...'.format(path)) + + def _if_exist(var): + param_exist = os.path.exists(os.path.join(path, var.name)) + do_load = param_exist + if do_load: + logger.debug('load weight {}'.format(var.name)) + return do_load + + fluid.io.load_vars(exe, path, prog, predicate=_if_exist) + + +def save(exe, prog, path): + """ + Load model from the given path. + Args: + exe (fluid.Executor): The fluid.Executor object. + prog (fluid.Program): save weight from which Program object. + path (string): the path to save model. + """ + if os.path.isdir(path): + shutil.rmtree(path) + logger.info('Save model to {}.'.format(path)) + fluid.io.save_persistables(exe, path, prog) + + +def load_and_fusebn(exe, prog, path): + """ + Fuse params of batch norm to scale and bias. + Args: + exe (fluid.Executor): The fluid.Executor object. + prog (fluid.Program): save weight from which Program object. + path (string): the path to save model. + """ + logger.info('Load model and fuse batch norm if have from {}...'.format( + path)) + + if not os.path.exists(path): + raise ValueError("Model path {} does not exists.".format(path)) + + def _if_exist(var): + b = os.path.exists(os.path.join(path, var.name)) + + if b: + logger.debug('load weight {}'.format(var.name)) + return b + + all_vars = list(filter(_if_exist, prog.list_vars())) + + # Since the program uses affine-channel, there is no running mean and var + # in the program, here append running mean and var. 
+ # NOTE, the params of batch norm should be like: + # x_scale + # x_offset + # x_mean + # x_variance + # x is any prefix + mean_variances = set() + bn_vars = [] + + bn_in_path = True + + inner_prog = fluid.Program() + inner_start_prog = fluid.Program() + inner_block = inner_prog.global_block() + with fluid.program_guard(inner_prog, inner_start_prog): + for block in prog.blocks: + ops = list(block.ops) + if not bn_in_path: + break + for op in ops: + if op.type == 'affine_channel': + # remove 'scale' as prefix + scale_name = op.input('Scale')[0] # _scale + bias_name = op.input('Bias')[0] # _offset + prefix = scale_name[:-5] + mean_name = prefix + 'mean' + variance_name = prefix + 'variance' + + if not os.path.exists(os.path.join(path, mean_name)): + bn_in_path = False + break + if not os.path.exists(os.path.join(path, variance_name)): + bn_in_path = False + break + + bias = block.var(bias_name) + + mean_vb = inner_block.create_var( + name=mean_name, + type=bias.type, + shape=bias.shape, + dtype=bias.dtype, + persistable=True) + variance_vb = inner_block.create_var( + name=variance_name, + type=bias.type, + shape=bias.shape, + dtype=bias.dtype, + persistable=True) + + mean_variances.add(mean_vb) + mean_variances.add(variance_vb) + + bn_vars.append( + [scale_name, bias_name, mean_name, variance_name]) + + if not bn_in_path: + fluid.io.load_vars(exe, path, prog, vars=all_vars) + logger.warning( + "There is no paramters of batch norm in model {}. " + "Skip to fuse batch norm. And load paramters done.".format(path)) + return + + # load running mean and running variance on cpu place into global scope. + place = fluid.CPUPlace() + exe_cpu = fluid.Executor(place) + fluid.io.load_vars(exe_cpu, path, vars=[v for v in mean_variances]) + + # load params on real place into global scope. + fluid.io.load_vars(exe, path, prog, vars=all_vars) + + eps = 1e-5 + for names in bn_vars: + scale_name, bias_name, mean_name, var_name = names + + scale = fluid.global_scope().find_var(scale_name).get_tensor() + bias = fluid.global_scope().find_var(bias_name).get_tensor() + mean = fluid.global_scope().find_var(mean_name).get_tensor() + var = fluid.global_scope().find_var(var_name).get_tensor() + + scale_arr = np.array(scale) + bias_arr = np.array(bias) + mean_arr = np.array(mean) + var_arr = np.array(var) + + bn_std = np.sqrt(np.add(var_arr, eps)) + new_scale = np.float32(np.divide(scale_arr, bn_std)) + new_bias = bias_arr - mean_arr * new_scale + + # fuse to scale and bias in affine_channel + scale.set(new_scale, exe.place) + bias.set(new_bias, exe.place) diff --git a/PaddleCV/rrpn/config.py b/PaddleCV/rrpn/config.py new file mode 100755 index 0000000000000000000000000000000000000000..7cfe7cd5b3a19fe23bfd5a15392d509abf3c6da2 --- /dev/null +++ b/PaddleCV/rrpn/config.py @@ -0,0 +1,226 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals +from edict import AttrDict +import six +import numpy as np + +_C = AttrDict() +cfg = _C + +# +# Training options +# +_C.TRAIN = AttrDict() + +# scales an image's shortest side +_C.TRAIN.scales = [800] + +# max size of longest side +_C.TRAIN.max_size = 1333 + +# images per GPU in minibatch +_C.TRAIN.im_per_batch = 1 + +# roi minibatch size per image +_C.TRAIN.batch_size_per_im = 256 + +# target fraction of foreground roi minibatch +_C.TRAIN.fg_fractrion = 0.25 + +# overlap threshold for a foreground roi +_C.TRAIN.fg_thresh = 0.5 + +# overlap threshold for a background roi +_C.TRAIN.bg_thresh_hi = 0.5 +_C.TRAIN.bg_thresh_lo = 0.0 + +# If False, only resize image and not pad, image shape is different between +# GPUs in one mini-batch. If True, image shape is the same in one mini-batch. +_C.TRAIN.padding_minibatch = False + +# Snapshot period +_C.TRAIN.snapshot_iter = 1000 + +# number of RPN proposals to keep before NMS +_C.TRAIN.rpn_pre_nms_top_n = 12000 + +# number of RPN proposals to keep after NMS +_C.TRAIN.rpn_post_nms_top_n = 2000 + +# NMS threshold used on RPN proposals +_C.TRAIN.rpn_nms_thresh = 0.7 + +# min size in RPN proposals +_C.TRAIN.rpn_min_size = 0.0 + +# eta for adaptive NMS in RPN +_C.TRAIN.rpn_eta = 1.0 + +# number of RPN examples per image +_C.TRAIN.rpn_batch_size_per_im = 256 + +# remove anchors out of the image +_C.TRAIN.rpn_straddle_thresh = 0. + +# target fraction of foreground examples pre RPN minibatch +_C.TRAIN.rpn_fg_fraction = 0.5 + +# min overlap between anchor and gt box to be a positive examples +_C.TRAIN.rpn_positive_overlap = 0.7 + +# max overlap between anchor and gt box to be a negative examples +_C.TRAIN.rpn_negative_overlap = 0.3 + +# stopgrad at a specified stage +_C.TRAIN.freeze_at = 2 + +# min area of ground truth box +_C.TRAIN.gt_min_area = -1 + +# +# Inference options +# +_C.TEST = AttrDict() + +# scales an image's shortest side +_C.TEST.scales = [800] + +# max size of longest side +_C.TEST.max_size = 1333 + +# eta for adaptive NMS in RPN +_C.TEST.rpn_eta = 1.0 + +# min score threshold to infer +_C.TEST.score_thresh = 0.01 + +# overlap threshold used for NMS +_C.TEST.nms_thresh = 0.3 + +# number of RPN proposals to keep before NMS +_C.TEST.rpn_pre_nms_top_n = 6000 + +# number of RPN proposals to keep after NMS +_C.TEST.rpn_post_nms_top_n = 1000 + +# min size in RPN proposals +_C.TEST.rpn_min_size = 0.0 + +# max number of detections +_C.TEST.detections_per_im = 300 + +# NMS threshold used on RPN proposals +_C.TEST.rpn_nms_thresh = 0.7 + +# +# Model options +# + +# Whether use mask rcnn head +_C.MASK_ON = True + +# weight for bbox regression targets +_C.bbox_reg_weights = [10.0, 10.0, 5.0, 5.0, 1.0] + +# RPN anchor sizes +_C.anchor_sizes = [128, 256, 512] + +# RPN anchor ratio +_C.aspect_ratio = [0.2, 0.5, 1.0] + +# RPN anchor angle +_C.anchor_angle = [-30.0, 0.0, 30.0, 60.0, 90.0, 120.0] + +# variance of anchors +_C.variances = [1., 1., 1., 1., 1.] + +# stride of feature map +_C.rpn_stride = [16.0, 16.0] + +# pooled width and pooled height +_C.roi_resolution = 14 + +# spatial scale +_C.spatial_scale = 1. / 16. + +# resolution to represent rotated roi align +_C.resolution = 14 + +# +# SOLVER options +# + +# derived learning rate the to get the final learning rate. 
+_C.learning_rate = 0.01 + +# maximum number of iterations +_C.max_iter = 140000 + +# warm up to learning rate +_C.warm_up_iter = 500 +_C.start_factor = 1. / 3 + +# lr steps_with_decay +_C.lr_steps = [6250, 12500] +_C.lr_gamma = 0.1 + +# L2 regularization hyperparameter +_C.weight_decay = 0.0001 + +# momentum with SGD +_C.momentum = 0.9 + +# +# ENV options +# + +# support both CPU and GPU +_C.use_gpu = True + +# Whether use parallel +_C.parallel = True + +# Class number +_C.class_num = 81 + +# support pyreader +_C.use_pyreader = True +_C.TRAIN.min_size = 800 +_C.TRAIN.max_size = 1333 +_C.TEST.min_size = 1000 +# pixel mean values +_C.pixel_means = [0.485, 0.456, 0.406] +_C.pixel_std = [0.229, 0.224, 0.225] +# clip box to prevent overflowing +_C.bbox_clip = np.log(1000. / 16.) + + +def merge_cfg_from_args(args, mode): + """Merge config keys, values in args into the global config.""" + if mode == 'train': + sub_d = _C.TRAIN + else: + sub_d = _C.TEST + for k, v in sorted(six.iteritems(vars(args))): + d = _C + try: + value = eval(v) + except: + value = v + if k in sub_d: + sub_d[k] = value + else: + d[k] = value diff --git a/PaddleCV/rrpn/data_utils.py b/PaddleCV/rrpn/data_utils.py new file mode 100755 index 0000000000000000000000000000000000000000..339f7cb3b0d35a122ec96a1ee80085d84b168d6b --- /dev/null +++ b/PaddleCV/rrpn/data_utils.py @@ -0,0 +1,249 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Based on: +# -------------------------------------------------------- +# Detectron +# Copyright (c) 2017-present, Facebook, Inc. +# Licensed under the Apache License, Version 2.0; +# Written by Ross Girshick +# -------------------------------------------------------- + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals + +import cv2 +import numpy as np +from config import cfg +import os +from PIL import Image + + +class DatasetPath(object): + def __init__(self, mode, dataset_name): + self.mode = mode + self.data_dir = dataset_name + + def get_data_dir(self): + if self.mode == 'train': + return os.path.join(self.data_dir, 'ch4_training_images') + elif self.mode == 'val': + return os.path.join(self.data_dir, 'ch4_test_images') + + def get_file_list(self): + if self.mode == 'train': + return os.path.join(self.data_dir, + 'ch4_training_localization_transcription_gt') + elif self.mode == 'val': + return os.path.join(self.data_dir, + 'ch4_test_localization_transcription_gt') + + +def get_image_blob(roidb, mode): + """Builds an input blob from the images in the roidb at the specified + scales. 
+ """ + if mode == 'train' or mode == 'val': + with open(roidb['image'], 'rb') as f: + data = f.read() + data = np.frombuffer(data, dtype='uint8') + img = cv2.imdecode(data, 1) + img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) + gt_boxes = roidb['boxes'] + gt_label = roidb['gt_classes'] + # resize + if mode == 'train': + img, im_scale = _resize(img, target_size=800, max_size=1333) + need_gt_boxes = gt_boxes.copy() + need_gt_boxes[:, :4] *= im_scale + img, need_gt_boxes, need_gt_label = _rotation( + img, need_gt_boxes, gt_label, prob=1.0, gt_margin=1.4) + else: + img, im_scale = _resize(img, target_size=1000, max_size=1778) + need_gt_boxes = gt_boxes + need_gt_label = gt_label + img = img.astype(np.float32, copy=False) + img = img / 255.0 + mean = np.array(cfg.pixel_means)[np.newaxis, np.newaxis, :] + std = np.array(cfg.pixel_std)[np.newaxis, np.newaxis, :] + img -= mean + img /= std + img = img.transpose((2, 0, 1)) + return img, im_scale, need_gt_boxes, need_gt_label + + +def _get_size_scale(w, h, min_size, max_size=None): + size = min_size + scale = 1.0 + if max_size is not None: + min_original_size = float(min((w, h))) + max_original_size = float(max((w, h))) + if max_original_size / min_original_size * size > max_size: + size = int(round(max_size * min_original_size / max_original_size)) + if (w <= h and w == size) or (h <= w and h == size): + return (h, w), scale + if w < h: + ow = size + oh = int(size * h / w) + scale = size / w + else: + oh = size + ow = int(size * w / h) + scale = size / h + scale = ow / w + return (oh, ow), scale + + +def _resize(im, target_size=800, max_size=1333): + if not isinstance(im, np.ndarray): + raise TypeError("{}: image type is not numpy.") + if len(im.shape) != 3: + raise ImageError('{}: image is not 3-dimensional.') + im_shape = im.shape + im_size_min = np.min(im_shape[0:2]) + im_size_max = np.max(im_shape[0:2]) + selected_size = target_size + if float(im_size_min) == 0: + raise ZeroDivisionError('min size of image is 0') + if max_size != 0: + im_scale = float(selected_size) / float(im_size_min) + # Prevent the biggest axis from being more than max_size + if np.round(im_scale * im_size_max) > max_size: + im_scale = float(max_size) / float(im_size_max) + im_scale_x = im_scale + im_scale_y = im_scale + + resize_w = np.round(im_scale_x * float(im_shape[1])) + resize_h = np.round(im_scale_y * float(im_shape[0])) + im_info = [resize_h, resize_w, im_scale] + else: + im_scale_x = float(selected_size) / float(im_shape[1]) + im_scale_y = float(selected_size) / float(im_shape[0]) + + resize_w = selected_size + resize_h = selected_size + + im = Image.fromarray(im) + im = im.resize((int(resize_w), int(resize_h)), 2) + im = np.array(im) + return im, im_scale_x + + +def _rotation(image, + gt_boxes, + gt_label, + prob, + fixed_angle=-1, + r_range=(360, 0), + gt_margin=1.4): + rotate_range = r_range[0] + shift = r_range[1] + angle = np.array([np.max([0, fixed_angle])]) + if np.random.rand() <= prob: + angle = np.array( + np.random.rand(1) * rotate_range - shift, dtype=np.int16) + ''' + rotate image + ''' + image = np.array(image) + (h, w) = image.shape[:2] + scale = 1.0 + # set the rotation center + center = (w / 2, h / 2) + # anti-clockwise angle in the function + M = cv2.getRotationMatrix2D(center, angle, scale) + image = cv2.warpAffine(image, M, (w, h)) + # back to PIL image + im_width, im_height = w, h + ''' + rotate boxes + ''' + need_gt_boxes = gt_boxes.copy() + origin_gt_boxes = need_gt_boxes + rotated_gt_boxes = np.empty((len(need_gt_boxes), 5), dtype=np.float32) + 
# anti-clockwise to clockwise arc + cos_cita = np.cos(np.pi / 180 * angle) + sin_cita = np.sin(np.pi / 180 * angle) + # clockwise matrix + rotation_matrix = np.array([[cos_cita, sin_cita], [-sin_cita, cos_cita]]) + pts_ctr = origin_gt_boxes[:, 0:2] + pts_ctr = pts_ctr - np.tile((im_width / 2, im_height / 2), + (gt_boxes.shape[0], 1)) + pts_ctr = np.array(np.dot(pts_ctr, rotation_matrix), dtype=np.int16) + pts_ctr = np.squeeze( + pts_ctr, axis=-1) + np.tile((im_width / 2, im_height / 2), + (gt_boxes.shape[0], 1)) + origin_gt_boxes[:, 0:2] = pts_ctr + len_of_gt = len(origin_gt_boxes) + # rectificate the angle in the range of [-45, 45] + for idx in range(len_of_gt): + ori_angle = origin_gt_boxes[idx, 4] + height = origin_gt_boxes[idx, 3] + width = origin_gt_boxes[idx, 2] + # step 1: normalize gt (-45,135) + if width < height: + ori_angle += 90 + width, height = height, width + # step 2: rotate (-45,495) + rotated_angle = ori_angle + angle + # step 3: normalize rotated_angle (-45,135) + while rotated_angle > 135: + rotated_angle = rotated_angle - 180 + rotated_gt_boxes[idx, 0] = origin_gt_boxes[idx, 0] + rotated_gt_boxes[idx, 1] = origin_gt_boxes[idx, 1] + rotated_gt_boxes[idx, 3] = height * gt_margin + rotated_gt_boxes[idx, 2] = width * gt_margin + rotated_gt_boxes[idx, 4] = rotated_angle + x_inbound = np.logical_and(rotated_gt_boxes[:, 0] >= 0, + rotated_gt_boxes[:, 0] < im_width) + y_inbound = np.logical_and(rotated_gt_boxes[:, 1] >= 0, + rotated_gt_boxes[:, 1] < im_height) + inbound = np.logical_and(x_inbound, y_inbound) + need_gt_boxes = rotated_gt_boxes[inbound] + need_gt_label = gt_label.copy() + need_gt_label = need_gt_label[inbound] + return image, need_gt_boxes, need_gt_label + + +def prep_im_for_blob(im, pixel_means, target_size, max_size): + """Prepare an image for use as a network input blob. Specially: + - Subtract per-channel pixel mean + - Convert to float32 + - Rescale to each of the specified target size (capped at max_size) + Returns a list of transformed images, one for each target size. Also returns + the scale factors that were used to compute each returned image. + """ + im = im.astype(np.float32, copy=False) + im -= pixel_means + + im_shape = im.shape + im_size_min = np.min(im_shape[0:2]) + im_size_max = np.max(im_shape[0:2]) + im_scale = float(target_size) / float(im_size_min) + # Prevent the biggest axis from being more than max_size + if np.round(im_scale * im_size_max) > max_size: + im_scale = float(max_size) / float(im_size_max) + im = cv2.resize( + im, + None, + None, + fx=im_scale, + fy=im_scale, + interpolation=cv2.INTER_LINEAR) + im_height, im_width, channel = im.shape + channel_swap = (2, 0, 1) #(batch, channel, height, width) + im = im.transpose(channel_swap) + return im, im_scale diff --git a/PaddleCV/rrpn/edict.py b/PaddleCV/rrpn/edict.py new file mode 100755 index 0000000000000000000000000000000000000000..552ede8e4006b5d4e90dd85d566749fd624c26d1 --- /dev/null +++ b/PaddleCV/rrpn/edict.py @@ -0,0 +1,37 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+#See the License for the specific language governing permissions and +#limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals + + +class AttrDict(dict): + def __init__(self, *args, **kwargs): + super(AttrDict, self).__init__(*args, **kwargs) + + def __getattr__(self, name): + if name in self.__dict__: + return self.__dict__[name] + elif name in self: + return self[name] + else: + raise AttributeError(name) + + def __setattr__(self, name, value): + if name in self.__dict__: + self.__dict__[name] = value + else: + self[name] = value diff --git a/PaddleCV/rrpn/eval.py b/PaddleCV/rrpn/eval.py new file mode 100755 index 0000000000000000000000000000000000000000..bf7732071967cab8766b9512c6007efb8e23db8a --- /dev/null +++ b/PaddleCV/rrpn/eval.py @@ -0,0 +1,91 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import cv2 +import time +import numpy as np +import pickle +import paddle +import paddle.fluid as fluid +import reader +import models.model_builder as model_builder +import models.resnet as resnet +import checkpoint as checkpoint +from config import cfg +from utility import print_arguments, parse_args, check_gpu +from data_utils import DatasetPath +from eval_helper import * +import logging +FORMAT = '%(asctime)s-%(levelname)s: %(message)s' +logging.basicConfig(level=logging.INFO, format=FORMAT) +logger = logging.getLogger(__name__) + + +def eval(): + + place = fluid.CUDAPlace(0) if cfg.use_gpu else fluid.CPUPlace() + exe = fluid.Executor(place) + image_shape = [3, cfg.TEST.max_size, cfg.TEST.max_size] + class_nums = cfg.class_num + model = model_builder.RRPN( + add_conv_body_func=resnet.ResNet(), + add_roi_box_head_func=resnet.ResNetC5(), + use_pyreader=False, + mode='val') + + startup_prog = fluid.Program() + infer_prog = fluid.Program() + with fluid.program_guard(infer_prog, startup_prog): + with fluid.unique_name.guard(): + model.build_model(image_shape) + pred_boxes = model.eval_bbox_out() + infer_prog = infer_prog.clone(True) + exe.run(startup_prog) + + # yapf: disable + def if_exist(var): + return os.path.exists(os.path.join(cfg.pretrained_model, var.name)) + if cfg.pretrained_model: + checkpoint.load_params(exe, infer_prog, cfg.pretrained_model) + # yapf: enable + test_reader = reader.test(1) + feeder = fluid.DataFeeder(place=place, feed_list=model.feeds()) + + fetch_list = [pred_boxes] + res_list = [] + keys = [ + 'bbox', 'gt_box', 'gt_class', 'is_crowed', 'im_info', 'im_id', + 'is_difficult' + ] + for i, data in enumerate(test_reader()): + im_info = [data[0][1]] + result = exe.run(infer_prog, + fetch_list=[v.name for v in fetch_list], + feed=feeder.feed(data), + return_numpy=False) + pred_boxes_v = result[0] + nmsed_out = pred_boxes_v + outs = np.array(nmsed_out) + res = get_key_dict(outs, data[0], keys) + res_list.append(res) + if i % 50 == 0: + logger.info('test_iter 
{}'.format(i)) + icdar_eval(res_list) + + +if __name__ == '__main__': + args = parse_args() + print_arguments(args) + check_gpu(args.use_gpu) + eval() diff --git a/PaddleCV/rrpn/eval_helper.py b/PaddleCV/rrpn/eval_helper.py new file mode 100755 index 0000000000000000000000000000000000000000..c9e66e67cbb740785cc8b1509006a750d7b0158f --- /dev/null +++ b/PaddleCV/rrpn/eval_helper.py @@ -0,0 +1,379 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +import os +import numpy as np +import paddle.fluid as fluid +import math +from config import cfg +import six +import numpy as np +import cv2 +import Polygon as plg +from PIL import Image +from PIL import ImageDraw +from PIL import ImageFont +from config import cfg +import logging +logger = logging.getLogger(__name__) + + +def get_key_dict(out, data, key): + res = {} + for i in range(len(key)): + if i == 0: + res[key[i]] = out + else: + res[key[i]] = data[i] + return res + + +def get_labels_maps(): + default_labels_maps = {1: 'text'} + if cfg.dataset == 'icdar2015': + return default_labels_maps + + labels_map = {} + with open(os.path.join(cfg.data_dir, 'label_list')) as f: + lines = f.readlines() + for idx, line in enumerate(lines): + labels_map[idx + 1] = line.strip() + return labels_map + + +def draw_bounding_box_on_image(image_path, + image_name, + nms_out, + im_scale, + draw_threshold=0.8): + #if image is None: + image = Image.open(os.path.join(image_path, image_name)) + draw = ImageDraw.Draw(image) + im_width, im_height = image.size + + labels_map = get_labels_maps() + for dt in np.array(nms_out): + num_id, score = dt.tolist()[:2] + x1, y1, x2, y2, x3, y3, x4, y4 = dt.tolist()[2:] / im_scale + if score < draw_threshold: + continue + draw.line( + [(x1, y1), (x2, y2), (x3, y3), (x4, y4), (x1, y1)], + width=2, + fill='red') + if image.mode == 'RGB': + draw.text((x1, y1), labels_map[num_id], (255, 255, 0)) + print("image with bbox drawed saved as {}".format(image_name)) + image.save(image_name) + + +def polygon_from_points(points): + """ + Returns a Polygon object to use with the Polygon2 class from a list of 8 points: x1,y1,x2,y2,x3,y3,x4,y4 + """ + res_boxes = np.empty([1, 8], dtype='int32') + res_boxes[0, 0] = int(points[0]) + res_boxes[0, 4] = int(points[1]) + res_boxes[0, 1] = int(points[2]) + res_boxes[0, 5] = int(points[3]) + res_boxes[0, 2] = int(points[4]) + res_boxes[0, 6] = int(points[5]) + res_boxes[0, 3] = int(points[6]) + res_boxes[0, 7] = int(points[7]) + point_mat = res_boxes[0].reshape([2, 4]).T + return plg.Polygon(point_mat) + + +def clip_box(bbox, im_info): + + h = im_info[0] + w = im_info[1] + res = [] + for b in bbox: + pts = b.reshape(4, 2) + pts[np.where(pts < 0)] = 1 + pts[np.where(pts[:, 0] > w), 0] = w - 1 + pts[np.where(pts[:, 1] > h), 1] = h - 1 + pts = pts.reshape(-1) + pts /= im_info[2] + res.append(pts) + + return np.array(res) + + +def get_union(det, gt): + area_det = det.area() + area_gt = gt.area() + return area_det + area_gt - get_intersection(det, gt) + + 
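The box metrics in eval_helper.py rely on the Polygon package (imported as plg above) for polygon areas and intersections. A small self-contained check of the intersection-over-union convention used by these helpers, with two made-up axis-aligned unit squares whose expected IoU is 1/3:

```
import numpy as np
import Polygon as plg

# two unit squares, the second shifted by 0.5 along x; IoU = 0.5 / 1.5
box_a = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
box_b = box_a + np.array([0.5, 0.])

p_a, p_b = plg.Polygon(box_a), plg.Polygon(box_b)
inter_poly = p_a & p_b
inter = inter_poly.area() if len(inter_poly) > 0 else 0.
union = p_a.area() + p_b.area() - inter
print(inter / union)  # ~0.3333
```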
+def get_intersection_over_union(det, gt): + try: + return get_intersection(det, gt) / get_union(det, gt) + except: + return 0 + + +def get_intersection(det, gt): + inter = det & gt + if len(inter) == 0: + return 0 + return inter.area() + + +def parse_gt(result, im_id): + for res in result: + if res['im_id'] == im_id: + gt_boxes = list(res['gt_box']) + gt_class = res['gt_class'] + is_difficult = res['is_difficult'].reshape(-1) + objects = [] + for i in range(len(gt_boxes)): + object_struct = {} + object_struct['bbox'] = gt_boxes[i] + object_struct['class'] = gt_class[i] + if is_difficult[i] == 1: + object_struct['difficult'] = 1 + else: + object_struct['difficult'] = 0 + object_struct['im_id'] = im_id + objects.append(object_struct) + return objects + + +def calculate_ap(rec, prec): + # 11 point metric + ap = 0. + for t in np.arange(0., 1.1, 0.1): + if np.sum(rec >= t) == 0: + p = 0 + else: + p = np.max(prec[rec >= t]) + ap = ap + p / 11. + return ap + + +def icdar_map(result, class_name, ovthresh): + im_ids = [] + for res in result: + im_ids.append(res['im_id']) + recs = {} + + for i, im_id in enumerate(im_ids): + recs[str(im_id)] = parse_gt(result, im_id) + class_recs = {} + npos = 0 + for k in im_ids: + res = [obj for obj in recs[str(k)] if obj['class'] == class_name] + bbox = np.array([x['bbox'] for x in res]) + difficult = np.array([x['difficult'] for x in res]).astype(np.bool) + det = [False] * len(res) + npos = npos + sum(~difficult) + class_recs[k] = {'bbox': bbox, 'difficult': difficult, 'det': det} + image_ids = [] + confidence = [] + bbox = [] + for res in result: + im_info = res['im_info'] + pred_boxes = res['bbox'] + for box in pred_boxes: + if box[0] == class_name: + image_ids.append(res['im_id']) + confidence.append(box[1]) + clipd_box = clip_box(box[2:].reshape(-1, 8), im_info) + bbox.append(clipd_box[0]) + confidence = np.array(confidence) + sorted_ind = np.argsort(-confidence) + sorted_scores = np.sort(-confidence) + bbox = np.array(bbox) + bbox = bbox[sorted_ind, :] + image_ids = [image_ids[x] for x in sorted_ind] + nd = len(image_ids) + tp = np.zeros(nd) + fp = np.zeros(nd) + for d in range(nd): + res = class_recs[image_ids[d]] + bb = bbox[d, :].astype(float) + ovmax = -np.inf + gt_bbox = res['bbox'].astype(float) + if gt_bbox.size > 0: + # compute overlaps + gt_bbox_xmin = np.min(gt_bbox[:, 0::2], axis=1) + gt_bbox_ymin = np.min(gt_bbox[:, 1::2], axis=1) + gt_bbox_xmax = np.max(gt_bbox[:, 0::2], axis=1) + gt_bbox_ymax = np.max(gt_bbox[:, 1::2], axis=1) + bb_xmin = np.min(bb[0::2]) + bb_ymin = np.min(bb[1::2]) + bb_xmax = np.max(bb[0::2]) + bb_ymax = np.max(bb[1::2]) + + ixmin = np.maximum(gt_bbox_xmin, bb_xmin) + iymin = np.maximum(gt_bbox_ymin, bb_ymin) + ixmax = np.minimum(gt_bbox_xmax, bb_xmax) + iymax = np.minimum(gt_bbox_ymax, bb_ymax) + iw = np.maximum(ixmax - ixmin + 1., 0.) + ih = np.maximum(iymax - iymin + 1., 0.) + inters = iw * ih + + # union + uni = ((bb_xmax - bb_xmin + 1.) * (bb_ymax - bb_ymin + 1.) + + (gt_bbox_xmax - gt_bbox_xmin + 1.) * + (gt_bbox_ymax - gt_bbox_ymin + 1.) 
- inters) + + overlaps = inters / uni + gt_bbox_keep_mask = overlaps > 0 + gt_bbox_keep = gt_bbox[gt_bbox_keep_mask, :] + gt_bbox_keep_index = np.where(overlaps > 0)[0] + + def calcoverlaps(gt_bbox_keep, bb): + overlaps = [] + for index, _ in enumerate(gt_bbox_keep): + p_g = polygon_from_points(gt_bbox_keep[index]) + p_d = polygon_from_points(bb) + overlap = get_intersection_over_union(p_d, p_g) + overlaps.append(overlap) + return overlaps + + if len(gt_bbox_keep) > 0: + overlaps = calcoverlaps(gt_bbox_keep, bb) + + ovmax = np.max(overlaps) + jmax = np.argmax(overlaps) + jmax = gt_bbox_keep_index[jmax] + + if ovmax > ovthresh: + if not res['difficult'][jmax]: + if not res['det'][jmax]: + tp[d] = 1. + res['det'][jmax] = 1 + else: + fp[d] = 1. + else: + fp[d] = 1. + # compute precision recall + fp = np.cumsum(fp) + tp = np.cumsum(tp) + + rec = tp / float(npos) + prec = tp / np.maximum(tp + fp, np.finfo(np.float64).eps) + ap = calculate_ap(rec, prec) + return rec, prec, ap + + +def icdar_map_eval(result, num_class): + map = 0 + for i in range(num_class - 1): + rec, prec, ap = icdar_map(result, i + 1, ovthresh=0.5) + map = map + ap + map = map / (num_class - 1) + logger.info('mAP {}'.format(map)) + + +def icdar_box_eval(result, thresh): + + matched_sum = 0 + num_global_care_gt = 0 + num_global_care_det = 0 + for res in result: + im_info = res['im_info'] + h = im_info[1] + w = im_info[2] + gt_boxes = res['gt_box'] + pred_boxes = res['bbox'] + pred_boxes = pred_boxes[np.where(pred_boxes[:, 1] > thresh)] + pred_boxes = pred_boxes[:, 2:] + pred_boxes = clip_box(pred_boxes, im_info) + + is_difficult = res['is_difficult'] + det_matched = 0 + + iou_mat = np.empty([1, 1]) + + gt_pols = [] + det_pols = [] + + gt_pol_points = [] + det_pol_points = [] + + gt_dont_care_pols_num = [] + det_dont_care_pols_num = [] + + det_matched_nums = [] + + points_list = list(gt_boxes) + + dony_care = is_difficult.reshape(-1) + for i, points in enumerate(points_list): + gt_pol = polygon_from_points(list(points)) + gt_pols.append(gt_pol) + gt_pol_points.append(list(points)) + if dony_care[i] == 1: + gt_dont_care_pols_num.append(len(gt_pols) - 1) + for i, points in enumerate(pred_boxes): + points = list(points.reshape(8).astype(np.int32)) + det_pol = polygon_from_points(points) + det_pols.append(det_pol) + det_pol_points.append(points) + if len(gt_dont_care_pols_num) > 0: + for dont_care_pol in gt_dont_care_pols_num: + dont_care_pol = gt_pols[dont_care_pol] + intersected_area = get_intersection(dont_care_pol, det_pol) + pd_dimensions = det_pol.area() + precision = 0 if pd_dimensions == 0 else intersected_area / pd_dimensions + if (precision > 0.5): + det_dont_care_pols_num.append(len(det_pols) - 1) + break + if len(gt_pols) > 0 and len(det_pols) > 0: + # Calculate IoU and precision matrixs + output_shape = [len(gt_pols), len(det_pols)] + iou_mat = np.empty(output_shape) + gt_rect_mat = np.zeros(len(gt_pols), np.int8) + det_rect_mat = np.zeros(len(det_pols), np.int8) + for gt_num in range(len(gt_pols)): + for det_num in range(len(det_pols)): + p_d = gt_pols[gt_num] + p_g = det_pols[det_num] + iou_mat[gt_num, det_num] = get_intersection_over_union(p_d, + p_g) + + for gt_num in range(len(gt_pols)): + for det_num in range(len(det_pols)): + if gt_rect_mat[gt_num] == 0 and det_rect_mat[ + det_num] == 0 and gt_num not in gt_dont_care_pols_num and det_num not in det_dont_care_pols_num: + if iou_mat[gt_num, det_num] > 0.5: + gt_rect_mat[gt_num] = 1 + det_rect_mat[det_num] = 1 + det_matched += 1 + det_matched_nums.append(det_num) + 
num_gt_care = (len(gt_pols) - len(gt_dont_care_pols_num)) + num_det_care = (len(det_pols) - len(det_dont_care_pols_num)) + matched_sum += det_matched + num_global_care_gt += num_gt_care + num_global_care_det += num_det_care + method_recall = 0 if num_global_care_gt == 0 else float( + matched_sum) / num_global_care_gt + method_precision = 0 if num_global_care_det == 0 else float( + matched_sum) / num_global_care_det + method_hmean = 0 if method_recall + method_precision == 0 else 2 * method_recall * method_precision / ( + method_recall + method_precision) + logger.info('Recall {}'.format(method_recall)) + logger.info('Precision {}'.format(method_precision)) + logger.info('F1 {}'.format(method_hmean)) + + +def icdar_eval(result): + if cfg.dataset == 'icdar2015': + icdar_box_eval(result, 0.8) + else: + icdar_map_eval(result, cfg.class_num) diff --git a/PaddleCV/rrpn/image/img_119.jpg b/PaddleCV/rrpn/image/img_119.jpg new file mode 100644 index 0000000000000000000000000000000000000000..01feb6de6bb67ecc39db6dfc041c01306e46a50a Binary files /dev/null and b/PaddleCV/rrpn/image/img_119.jpg differ diff --git a/PaddleCV/rrpn/image/img_120.jpg b/PaddleCV/rrpn/image/img_120.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8a318613b7275415599da6b79e33314b174e2172 Binary files /dev/null and b/PaddleCV/rrpn/image/img_120.jpg differ diff --git a/PaddleCV/rrpn/infer.py b/PaddleCV/rrpn/infer.py new file mode 100755 index 0000000000000000000000000000000000000000..3af9d21c2e2da456a5f719225c327633d97f6eb1 --- /dev/null +++ b/PaddleCV/rrpn/infer.py @@ -0,0 +1,81 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
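For reference, the recall, precision and F1 (harmonic mean) reported by icdar_box_eval above are plain ratios of the matched counts; with made-up totals the arithmetic works out as follows:

```
# made-up counts, not from a real evaluation run
matched_sum = 800
num_global_care_gt = 1000
num_global_care_det = 950

recall = float(matched_sum) / num_global_care_gt        # 0.800
precision = float(matched_sum) / num_global_care_det    # ~0.842
f1 = 2 * recall * precision / (recall + precision)      # ~0.821
```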
+ +import os +import cv2 +import time +import numpy as np +import pickle +import paddle +import paddle.fluid as fluid +import reader +import models.model_builder as model_builder +import models.resnet as resnet +import checkpoint as checkpoint +from config import cfg +from data_utils import DatasetPath +from eval_helper import * +from utility import print_arguments, parse_args, check_gpu + + +def infer(): + place = fluid.CUDAPlace(0) if cfg.use_gpu else fluid.CPUPlace() + exe = fluid.Executor(place) + image_shape = [3, cfg.TEST.max_size, cfg.TEST.max_size] + class_nums = cfg.class_num + model = model_builder.RRPN( + add_conv_body_func=resnet.ResNet(), + add_roi_box_head_func=resnet.ResNetC5(), + use_pyreader=False, + mode='infer') + startup_prog = fluid.Program() + infer_prog = fluid.Program() + with fluid.program_guard(infer_prog, startup_prog): + with fluid.unique_name.guard(): + model.build_model(image_shape) + pred_boxes = model.eval_bbox_out() + infer_prog = infer_prog.clone(True) + exe.run(startup_prog) + + # yapf: disable + def if_exist(var): + return os.path.exists(os.path.join(cfg.pretrained_model, var.name)) + if cfg.pretrained_model: + checkpoint.load_params(exe, infer_prog, cfg.pretrained_model) + # yapf: enable + infer_reader = reader.infer(cfg.image_path) + feeder = fluid.DataFeeder(place=place, feed_list=model.feeds()) + + fetch_list = [pred_boxes] + imgs = os.listdir(cfg.image_path) + imgs.sort() + + for i, data in enumerate(infer_reader()): + result = exe.run(infer_prog, + fetch_list=[v.name for v in fetch_list], + feed=feeder.feed(data), + return_numpy=False) + nmsed_out = result[0] + im_info = data[0][1] + im_scale = im_info[2] + outs = np.array(nmsed_out) + draw_bounding_box_on_image(cfg.image_path, imgs[i], outs, im_scale, + cfg.draw_threshold) + + +if __name__ == '__main__': + args = parse_args() + print_arguments(args) + check_gpu(args.use_gpu) + infer() diff --git a/PaddleCV/rrpn/models/__init__.py b/PaddleCV/rrpn/models/__init__.py new file mode 100755 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/PaddleCV/rrpn/models/ext_op/rrpn_lib.py b/PaddleCV/rrpn/models/ext_op/rrpn_lib.py new file mode 100644 index 0000000000000000000000000000000000000000..04c11486a5a4487a1ee8891c4141f29bd3aafdee --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/rrpn_lib.py @@ -0,0 +1,549 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
+ +import paddle.fluid as fluid +from paddle.fluid.layer_helper import LayerHelper +from paddle.fluid.framework import Variable +fluid.load_op_library('models/ext_op/src/rrpn_lib.so') + + +def rrpn_target_assign(bbox_pred, + cls_logits, + anchor_box, + gt_boxes, + im_info, + rpn_batch_size_per_im=256, + rpn_straddle_thresh=0.0, + rpn_fg_fraction=0.5, + rpn_positive_overlap=0.7, + rpn_negative_overlap=0.3, + use_random=True): + """ + **Target Assign Layer for rotated region proposal network (RRPN).** + This layer can be, for given the Intersection-over-Union (IoU) overlap + between anchors and ground truth boxes, to assign classification and + regression targets to each each anchor, these target labels are used for + train RPN. The classification targets is a binary class label (of being + an object or not). Following the paper of RRPN, the positive labels + are two kinds of anchors: (i) the anchor/anchors with the highest IoU + overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap + higher than rpn_positive_overlap(0.7) with any ground-truth box. Note + that a single ground-truth box may assign positive labels to multiple + anchors. A non-positive anchor is when its IoU ratio is lower than + rpn_negative_overlap (0.3) for all ground-truth boxes. Anchors that are + neither positive nor negative do not contribute to the training objective. + The regression targets are the encoded ground-truth boxes associated with + the positive anchors. + Args: + bbox_pred(Variable): A 3-D Tensor with shape [N, M, 5] represents the + predicted locations of M bounding bboxes. N is the batch size, + and each bounding box has five coordinate values and the layout + is [x, y, w, h, angle]. The data type can be float32 or float64. + cls_logits(Variable): A 3-D Tensor with shape [N, M, 1] represents the + predicted confidence predictions. N is the batch size, 1 is the + frontground and background sigmoid, M is number of bounding boxes. + The data type can be float32 or float64. + anchor_box(Variable): A 2-D Tensor with shape [M, 5] holds M boxes, + each box is represented as [x, y, w, h, angle], + [x, y] is the left top coordinate of the anchor box, + if the input is image feature map, they are close to the origin + of the coordinate system. [w, h] is the right bottom + coordinate of the anchor box, angle is the rotation angle of box. + The data type can be float32 or float64. + gt_boxes (Variable): The ground-truth bounding boxes (bboxes) are a 2D + LoDTensor with shape [Ng, 5], Ng is the total number of ground-truth + bboxes of mini-batch input. The data type can be float32 or float64. + im_info (Variable): A 2-D LoDTensor with shape [N, 3]. N is the batch size, + 3 is the height, width and scale. + rpn_batch_size_per_im(int): Total number of RPN examples per image. + The data type must be int32. + rpn_straddle_thresh(float): Remove RPN anchors that go outside the image + by straddle_thresh pixels. The data type must be float32. + rpn_fg_fraction(float): Target fraction of RoI minibatch that is labeled + foreground (i.e. class > 0), 0-th class is background. The data type must be float32. + rpn_positive_overlap(float): Minimum overlap required between an anchor + and ground-truth box for the (anchor, gt box) pair to be a positive + example. The data type must be float32. + rpn_negative_overlap(float): Maximum overlap allowed between an anchor + and ground-truth box for the (anchor, gt box) pair to be a negative + examples. The data type must be float32. 
+ use_random(bool): Whether to sample randomly when sampling. + Returns: + tuple: + A tuple(predicted_scores, predicted_location, target_label, + target_bbox) is returned. The predicted_scores + and predicted_location is the predicted result of the RPN. + The target_label and target_bbox is the ground truth, + respectively. The predicted_location is a 2D Tensor with shape + [F, 5], and the shape of target_bbox is same as the shape of + the predicted_location, F is the number of the foreground + anchors. The predicted_scores is a 2D Tensor with shape + [F + B, 1], and the shape of target_label is same as the shape + of the predicted_scores, B is the number of the background + anchors, the F and B is depends on the input of this operator. + Bbox_inside_weight represents whether the predicted loc is fake_fg + or not and the shape is [F, 5]. + Examples: + .. code-block:: python + import paddle.fluid as fluid + bbox_pred = fluid.data(name='bbox_pred', shape=[None, 5], dtype='float32') + cls_logits = fluid.data(name='cls_logits', shape=[None, 1], dtype='float32') + anchor_box = fluid.data(name='anchor_box', shape=[None, 5], dtype='float32') + gt_boxes = fluid.data(name='gt_boxes', shape=[None, 5], dtype='float32') + im_info = fluid.data(name='im_infoss', shape=[None, 3], dtype='float32') + loc, score, loc_target, score_target = rrpn_target_assign( + bbox_pred, cls_logits, anchor_box, gt_boxes, im_info) + """ + + helper = LayerHelper('rrpn_target_assign', **locals()) + # Assign target label to anchors + loc_index = helper.create_variable_for_type_inference(dtype='int32') + score_index = helper.create_variable_for_type_inference(dtype='int32') + target_label = helper.create_variable_for_type_inference(dtype='int32') + target_bbox = helper.create_variable_for_type_inference( + dtype=anchor_box.dtype) + helper.append_op( + type="rrpn_target_assign", + inputs={'Anchor': anchor_box, + 'GtBoxes': gt_boxes, + 'ImInfo': im_info}, + outputs={ + 'LocationIndex': loc_index, + 'ScoreIndex': score_index, + 'TargetLabel': target_label, + 'TargetBBox': target_bbox + }, + attrs={ + 'rpn_batch_size_per_im': rpn_batch_size_per_im, + 'rpn_straddle_thresh': rpn_straddle_thresh, + 'rpn_positive_overlap': rpn_positive_overlap, + 'rpn_negative_overlap': rpn_negative_overlap, + 'rpn_fg_fraction': rpn_fg_fraction, + 'use_random': use_random + }) + + loc_index.stop_gradient = True + score_index.stop_gradient = True + target_label.stop_gradient = True + target_bbox.stop_gradient = True + + cls_logits = fluid.layers.reshape(x=cls_logits, shape=(-1, 1)) + bbox_pred = fluid.layers.reshape(x=bbox_pred, shape=(-1, 5)) + predicted_cls_logits = fluid.layers.gather(cls_logits, score_index) + predicted_bbox_pred = fluid.layers.gather(bbox_pred, loc_index) + + return predicted_cls_logits, predicted_bbox_pred, target_label, target_bbox + + +def rotated_anchor_generator(input, + anchor_sizes=None, + aspect_ratios=None, + angles=None, + variance=[1.0, 1.0, 1.0, 1.0, 1.0], + stride=None, + offset=0.5, + name=None): + """ + **Rotated Anchor generator operator** + Generate anchors for RRPN algorithm. + Each position of the input produce N anchors, N = + size(anchor_sizes) * size(aspect_ratios) * size(angles). + The order of generated anchors is firstly aspect_ratios + loop then anchor_sizes loop. + Args: + input(Variable): 4-D Tensor with shape [N,C,H,W]. The input feature map. + anchor_sizes(float32|list|tuple): The anchor sizes of generated + anchors, given in absolute pixels e.g. [64., 128., 256., 512.]. 
+ For instance, the anchor size of 64 means the area of this anchor + equals to 64**2. None by default. + aspect_ratios(float32|list|tuple): The height / width ratios + of generated anchors, e.g. [0.5, 1.0, 2.0]. None by default. + angle(list|tuple): Rotated angle of prior boxes. The data type is float32. + variance(list|tuple): The variances to be used in box + regression deltas. The data type is float32, [1.0, 1.0, 1.0, 1.0, 1.0] by + default. + stride(list|tuple): The anchors stride across width and height. + The data type is float32. e.g. [16.0, 16.0]. None by default. + offset(float32): Prior boxes center offset. 0.5 by default. + name(str): Name of this layer. None by default. + Returns: + Anchors(Variable): The output anchors with a layout of [H, W, num_anchors, 5]. + H is the height of input, W is the width of input, + num_anchors is the box count of each position. Each anchor is + in (x, y, w, h, angle) format. + Variances(Variable): The expanded variances of anchors with a layout of + [H, W, num_priors, 5]. H is the height of input, + W is the width of input num_anchors is the box count + of each position. Each variance is in (x, y, w, h, angle) format. + Examples: + .. code-block:: python + import paddle.fluid as fluid + conv1 = fluid.data(name='conv1', shape=[None, 48, 16, 16], dtype='float32') + anchor, var = rotated_anchor_generator( + input=conv1, + anchor_sizes=[128, 256, 512], + aspect_ratios=[0.2, 0.5, 1.0], + variance=[1.0, 1.0, 1.0, 1.0, 1.0], + stride=[16.0, 16.0], + offset=0.5) + """ + helper = LayerHelper("rotated_anchor_generator", **locals()) + dtype = helper.input_dtype() + + def _is_list_or_tuple_(data): + return (isinstance(data, list) or isinstance(data, tuple)) + + if not _is_list_or_tuple_(anchor_sizes): + anchor_sizes = [anchor_sizes] + if not _is_list_or_tuple_(aspect_ratios): + aspect_ratios = [aspect_ratios] + if not _is_list_or_tuple_(angles): + angles = [angles] + if not (_is_list_or_tuple_(stride) and len(stride) == 2): + raise ValueError('stride should be a list or tuple ', + 'with length 2, (stride_width, stride_height).') + + anchor_sizes = list(map(float, anchor_sizes)) + aspect_ratios = list(map(float, aspect_ratios)) + angles = list(map(float, angles)) + stride = list(map(float, stride)) + + attrs = { + 'anchor_sizes': anchor_sizes, + 'aspect_ratios': aspect_ratios, + 'angles': angles, + 'variances': variance, + 'stride': stride, + 'offset': offset + } + + anchor = helper.create_variable_for_type_inference(dtype) + var = helper.create_variable_for_type_inference(dtype) + helper.append_op( + type="rotated_anchor_generator", + inputs={"Input": input}, + outputs={"Anchors": anchor, + "Variances": var}, + attrs=attrs, ) + anchor.stop_gradient = True + var.stop_gradient = True + return anchor, var + + +def rrpn_box_coder(prior_box, prior_box_var, target_box, name=None): + """ + Args: + prior_box(Variable): Box list prior_box is a 2-D Tensor with shape + [M, 5] holds M boxes and data type is float32 or float64. Each box + is represented as [x, y, w, h, angle], [x, y] is the + center coordinate of the anchor box, [w, h] is the width and height + of the anchor box, angle is rotated angle of prior_box. + prior_box_var(List|Variable|None): "prior_box_var is a 2-D Tensor with + shape [M, 5] holds M group of variance." + target_box(Variable): This input can be a 2-D LoDTensor with shape + [M, 5]. Each box is represented as [x, y, w, h, angle]. The data + type is float32 or float64. + name(str): Name of this layer. None by default. 
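+    Note:
+        The exact transform is implemented by the ``rrpn_box_coder`` C++ op; as
+        a rough orientation (a sketch, not the op's verbatim formula), the
+        rotated delta parameterization used elsewhere in this module (see
+        ``BoxToDelta2`` in ``ext_op/src/bbox_util.h``) relates a prior box
+        [px, py, pw, ph, pa] and a target box [gx, gy, gw, gh, ga] as
+        dx = (gx - px) / pw, dy = (gy - py) / ph, dw = log(gw / pw),
+        dh = log(gh / ph) and da = ga - pa (wrapped by +/-180 degrees when the
+        pair straddles the angle range, then converted to radians), with the
+        variances acting as per-coordinate weights.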
+ Returns: + Variable: + output_box(Variable): The output tensor of rrpn_box_coder_op with shape [N, 5] representing the + result of N target boxes encoded with N Prior boxes and variances. + N represents the number of box and 5 represents [x, y, w, h ,angle]. + Examples: + + .. code-block:: python + + import paddle.fluid as fluid + prior_box_decode = fluid.data(name='prior_box_decode', + shape=[512, 5], + dtype='float32') + target_box_decode = fluid.data(name='target_box_decode', + shape=[512, 5], + dtype='float32') + output_decode = rrpn_box_coder(prior_box=prior_box_decode, + prior_box_var=[10, 10, 5, 5, 1], + target_box=target_box_decode) + """ + + helper = LayerHelper("rrpn_box_coder", **locals()) + + if name is None: + output_box = helper.create_variable_for_type_inference( + dtype=prior_box.dtype) + else: + output_box = helper.create_variable( + name=name, dtype=prior_box.dtype, persistable=False) + + inputs = {"PriorBox": prior_box, "TargetBox": target_box} + attrs = {} + if isinstance(prior_box_var, Variable): + inputs['PriorBoxVar'] = prior_box_var + elif isinstance(prior_box_var, list): + attrs['variance'] = prior_box_var + else: + raise TypeError( + "Input variance of rrpn_box_coder must be Variable or list") + helper.append_op( + type="rrpn_box_coder", + inputs=inputs, + attrs=attrs, + outputs={"OutputBox": output_box}) + return output_box + + +def rotated_roi_align(input, + rois, + pooled_height=1, + pooled_width=1, + spatial_scale=1.0, + name=None): + """ + **RotatedRoIAlign Operator** + + Rotated Region of interest align (also known as Rotated RoI align) is to perform + bilinear interpolation on inputs of nonuniform sizes to obtain + fixed-size feature maps (e.g. 7*7) + + Dividing each region proposal into equal-sized sections with + the pooled_width and pooled_height. Location remains the origin + result. + + Each ROI bin are transformed to become horizontal by perspective transformation and + values in each ROI bin are computed directly through bilinear interpolation. The output is + the mean of all values. + Thus avoid the misaligned problem. + """ + helper = LayerHelper('rrpn_rotated_roi_align', **locals()) + dtype = helper.input_dtype() + align_out = helper.create_variable_for_type_inference(dtype) + cx = helper.create_variable_for_type_inference('float32') + cy = helper.create_variable_for_type_inference('float32') + helper.append_op( + type="rrpn_rotated_roi_align", + inputs={"X": input, + "ROIs": rois}, + outputs={"Out": align_out, + "ConIdX": cx, + "ConIdY": cy}, + attrs={ + "pooled_height": pooled_height, + "pooled_width": pooled_width, + "spatial_scale": spatial_scale, + }) + return align_out + + +def rotated_generate_proposal_labels(rpn_rois, + gt_classes, + is_crowd, + gt_boxes, + im_info, + batch_size_per_im=256, + fg_fraction=0.25, + fg_thresh=0.25, + bg_thresh_hi=0.5, + bg_thresh_lo=0.0, + bbox_reg_weights=[0.1, 0.1, 0.2, 0.2], + class_nums=None, + use_random=True, + is_cls_agnostic=False): + """ + **Rotated Generate Proposal Labels** + This operator can be, for given the RotatedGenerateProposalOp output bounding boxes and groundtruth, + to sample foreground boxes and background boxes, and compute loss target. + RpnRois is the output boxes of RPN and was processed by rotated_generate_proposal_op, these boxes + were combined with groundtruth boxes and sampled according to batch_size_per_im and fg_fraction, + If an instance with a groundtruth overlap greater than fg_thresh, then it was considered as a foreground sample. 
+ If an instance with a groundtruth overlap greater than bg_thresh_lo and lower than bg_thresh_hi, + then it was considered as a background sample. + After all foreground and background boxes are chosen (so called Rois), + then we apply random sampling to make sure + the number of foreground boxes is no more than batch_size_per_im * fg_fraction. + For each box in Rois, we assign the classification (class label) and regression targets (box label) to it. + Finally BboxInsideWeights and BboxOutsideWeights are used to specify whether it would contribute to training loss. + Args: + rpn_rois(Variable): A 2-D LoDTensor with shape [N, 5]. N is the number of the RotatedGenerateProposalOp's output, each element is a bounding box with [x, y, w, h, angle] format. The data type can be float32 or float64. + gt_classes(Variable): A 2-D LoDTensor with shape [M, 1]. M is the number of groundtruth, each element is a class label of groundtruth. The data type must be int32. + is_crowd(Variable): A 2-D LoDTensor with shape [M, 1]. M is the number of groundtruth, each element is a flag indicates whether a groundtruth is crowd. The data type must be int32. + gt_boxes(Variable): A 2-D LoDTensor with shape [M, 5]. M is the number of groundtruth, each element is a bounding box with [x, y, w, h, angle] format. + im_info(Variable): A 2-D LoDTensor with shape [B, 3]. B is the number of input images, each element consists of im_height, im_width, im_scale. + batch_size_per_im(int): Batch size of rois per images. The data type must be int32. + fg_fraction(float): Foreground fraction in total batch_size_per_im. The data type must be float32. + fg_thresh(float): Overlap threshold which is used to chose foreground sample. The data type must be float32. + bg_thresh_hi(float): Overlap threshold upper bound which is used to chose background sample. The data type must be float32. + bg_thresh_lo(float): Overlap threshold lower bound which is used to chose background sample. The data type must be float32. + bbox_reg_weights(list|tuple): Box regression weights. The data type must be float32. + class_nums(int): Class number. The data type must be int32. + use_random(bool): Use random sampling to choose foreground and background boxes. + is_cls_agnostic(bool): bbox regression use class agnostic simply which only represent fg and bg boxes. + Returns: + tuple: + A tuple with format``(rois, labels_int32, bbox_targets, bbox_inside_weights, bbox_outside_weights)``. + - **rois**: 2-D LoDTensor with shape ``[batch_size_per_im * batch_size, 5]``. The data type is the same as ``rpn_rois``. + - **labels_int32**: 2-D LoDTensor with shape ``[batch_size_per_im * batch_size, 1]``. The data type must be int32. + - **bbox_targets**: 2-D LoDTensor with shape ``[batch_size_per_im * batch_size, 5 * class_num]``. The regression targets of all RoIs. The data type is the same as ``rpn_rois``. + - **bbox_inside_weights**: 2-D LoDTensor with shape ``[batch_size_per_im * batch_size, 5 * class_num]``. The weights of foreground boxes' regression loss. The data type is the same as ``rpn_rois``. + - **bbox_outside_weights**: 2-D LoDTensor with shape ``[batch_size_per_im * batch_size, 5 * class_num]``. The weights of regression loss. The data type is the same as ``rpn_rois``. + Examples: + .. 
code-block:: python + import paddle.fluid as fluid + rpn_rois = fluid.data(name='rpn_rois', shape=[None, 5], dtype='float32') + gt_classes = fluid.data(name='gt_classes', shape=[None, 1], dtype='float32') + is_crowd = fluid.data(name='is_crowd', shape=[None, 1], dtype='float32') + gt_boxes = fluid.data(name='gt_boxes', shape=[None, 5], dtype='float32') + im_info = fluid.data(name='im_info', shape=[None, 3], dtype='float32') + rois, labels, bbox, inside_weights, outside_weights = rotated_generate_proposal_labels( + rpn_rois, gt_classes, is_crowd, gt_boxes, im_info, + class_nums=10) + """ + helper = LayerHelper('rrpn_generate_proposal_labels', **locals()) + rois = helper.create_variable_for_type_inference(dtype=rpn_rois.dtype) + labels_int32 = helper.create_variable_for_type_inference( + dtype=gt_classes.dtype) + bbox_targets = helper.create_variable_for_type_inference( + dtype=rpn_rois.dtype) + bbox_inside_weights = helper.create_variable_for_type_inference( + dtype=rpn_rois.dtype) + bbox_outside_weights = helper.create_variable_for_type_inference( + dtype=rpn_rois.dtype) + + helper.append_op( + type="rrpn_generate_proposal_labels", + inputs={ + 'RpnRois': rpn_rois, + 'GtClasses': gt_classes, + 'IsCrowd': is_crowd, + 'GtBoxes': gt_boxes, + 'ImInfo': im_info + }, + outputs={ + 'Rois': rois, + 'LabelsInt32': labels_int32, + 'BboxTargets': bbox_targets, + 'BboxInsideWeights': bbox_inside_weights, + 'BboxOutsideWeights': bbox_outside_weights + }, + attrs={ + 'batch_size_per_im': batch_size_per_im, + 'fg_fraction': fg_fraction, + 'fg_thresh': fg_thresh, + 'bg_thresh_hi': bg_thresh_hi, + 'bg_thresh_lo': bg_thresh_lo, + 'bbox_reg_weights': bbox_reg_weights, + 'class_nums': class_nums, + 'use_random': use_random, + 'is_cls_agnostic': is_cls_agnostic + }) + + rois.stop_gradient = True + labels_int32.stop_gradient = True + bbox_targets.stop_gradient = True + bbox_inside_weights.stop_gradient = True + bbox_outside_weights.stop_gradient = True + + return rois, labels_int32, bbox_targets, bbox_inside_weights, bbox_outside_weights + + +def rotated_generate_proposals(scores, + bbox_deltas, + im_info, + anchors, + variances, + pre_nms_top_n=6000, + post_nms_top_n=1000, + nms_thresh=0.5, + min_size=0.1, + name=None): + """ + **Rotated Generate proposal** + This operation proposes Rotated RoIs according to each box with their + probability to be a foreground object and the box can be calculated by anchors. + bbox_deltas and scores are the output of RPN. Final proposals could be used to + train detection net. For generating proposals, this operation performs following steps: + 1. Transposes and resizes scores and bbox_deltas in size of + (H*W*A, 1) and (H*W*A, 5) + 2. Calculate box locations as proposals candidates. + 3. Remove predicted boxes with small area. + 4. Apply NMS to get final proposals as output. + Args: + scores(Variable): A 4-D Tensor with shape [N, A, H, W] represents + the probability for each box to be an object. + N is batch size, A is number of anchors, H and W are height and + width of the feature map. The data type must be float32. + bbox_deltas(Variable): A 4-D Tensor with shape [N, 5*A, H, W] + represents the differece between predicted box locatoin and + anchor location. The data type must be float32. + im_info(Variable): A 2-D Tensor with shape [N, 3] represents origin + image information for N batch. Info contains height, width and scale + between origin image size and the size of feature map. + The data type must be int32. 
+ anchors(Variable): A 4-D Tensor represents the anchors with a layout + of [H, W, A, 5]. H and W are height and width of the feature map, + num_anchors is the box count of each position. Each anchor is + in (x, y, w, h, angle) format. The data type must be float32. + variances(Variable): A 4-D Tensor. The expanded variances of anchors with a layout of + [H, W, num_priors, 5]. Each variance is in + (xcenter, ycenter, w, h) format. The data type must be float32. + pre_nms_top_n(float): Number of total bboxes to be kept per + image before NMS. The data type must be float32. `6000` by default. + post_nms_top_n(float): Number of total bboxes to be kept per + image after NMS. The data type must be float32. `1000` by default. + nms_thresh(float): Threshold in NMS. The data type must be float32. `0.5` by default. + min_size(float): Remove predicted boxes with either height or + width < min_size. The data type must be float32. `0.1` by default. + Returns: + tuple: + A tuple with format ``(rrpn_rois, rrpn_roi_probs)``. + - **rpn_rois**: The generated RoIs. 2-D Tensor with shape ``[N, 5]`` while ``N`` is the number of RoIs. The data type is the same as ``scores``. + - **rpn_roi_probs**: The scores of generated RoIs. 2-D Tensor with shape ``[N, 1]`` while ``N`` is the number of RoIs. The data type is the same as ``scores``. + Examples: + .. code-block:: python + + import paddle.fluid as fluid + scores = fluid.data(name='scores', shape=[None, 4, 5, 5], dtype='float32') + bbox_deltas = fluid.data(name='bbox_deltas', shape=[None, 20, 5, 5], dtype='float32') + im_info = fluid.data(name='im_info', shape=[None, 3], dtype='float32') + anchors = fluid.data(name='anchors', shape=[None, 5, 4, 5], dtype='float32') + variances = fluid.data(name='variances', shape=[None, 5, 10, 5], dtype='float32') + rrois, rroi_probs = fluid.layers.rotated_generate_proposals(scores, bbox_deltas, + im_info, anchors, variances) + """ + + helper = LayerHelper('rrpn_generate_proposals', **locals()) + + rpn_rois = helper.create_variable_for_type_inference( + dtype=bbox_deltas.dtype) + rpn_roi_probs = helper.create_variable_for_type_inference( + dtype=scores.dtype) + helper.append_op( + type="rrpn_generate_proposals", + inputs={ + 'Scores': scores, + 'BboxDeltas': bbox_deltas, + 'ImInfo': im_info, + 'Anchors': anchors, + 'Variances': variances + }, + attrs={ + 'pre_nms_topN': pre_nms_top_n, + 'post_nms_topN': post_nms_top_n, + 'nms_thresh': nms_thresh, + 'min_size': min_size + }, + outputs={'RpnRois': rpn_rois, + 'RpnRoiProbs': rpn_roi_probs}) + rpn_rois.stop_gradient = True + rpn_roi_probs.stop_gradient = True + + return rpn_rois, rpn_roi_probs diff --git a/PaddleCV/rrpn/models/ext_op/src/README.md b/PaddleCV/rrpn/models/ext_op/src/README.md new file mode 100644 index 0000000000000000000000000000000000000000..cbec185403d934909bb65f51b0c07ebbb99e9c03 --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/README.md @@ -0,0 +1,68 @@ +# 自定义OP的编译过程 + +## 代码结构 + + - src: 扩展OP C++/CUDA 源码 + - rrpn_lib.py: Python封装 + +## 安装PaddlePaddle + +请通过如下方式安装PaddlePaddle: + +- 通过[Paddle develop分支](https://github.com/PaddlePaddle/Paddle/tree/develop)源码编译安装,编译方法如下: + + 1. [Ubuntu](https://www.paddlepaddle.org.cn/install/doc/source/ubuntu) + 1. [CentOS](https://www.paddlepaddle.org.cn/install/doc/source/centos) + 1. [MasOS](https://www.paddlepaddle.org.cn/install/doc/source/macos) + 1. 
[Windows](https://www.paddlepaddle.org.cn/install/doc/source/windows) + + **说明:** 推荐使用docker编译 + +- 安装Paddle develop[每日版本whl包](https://www.paddlepaddle.org.cn/install/doc/tables#多版本whl包列表-dev-11) + + **注意:** 编译自定义OP使用的gcc版本须与Paddle编译使用gcc版本一致,Paddle develop每日版本目前采用**gcc 4.8.2**版本编译,若使用每日版本,请使用**gcc 4.8.2**版本编译自定义OP,否则可能出现兼容性问题。 + +## 编译自定义OP + +自定义op需要将实现的C++、CUDA代码编译成动态库,mask.sh中通过g++/nvcc编译,当然您也可以写Makefile或者CMake。 + +编译需要include PaddlePaddle的相关头文件,链接PaddlePaddle的lib库。 头文件和lib库可通过下面命令获取到: + +``` +# python +>>> import paddle +>>> print(paddle.sysconfig.get_include()) +/paddle/pyenv/local/lib/python2.7/site-packages/paddle/include +>>> print(paddle.sysconfig.get_lib()) +/paddle/pyenv/local/lib/python2.7/site-packages/paddle/libs +``` + +我们提供动态库编译脚本如下: + +``` +cd src +sh make.sh +``` + +最终编译会产出`rrpn_lib.so` + +**说明:** 若使用源码编译安装PaddlePaddle的方式,编译过程中`cmake`未设置`WITH_MKLDNN`的方式, +编译自定义OP时会报错找不到`mkldnn.h`等文件,可在`make.sh`中删除编译命令中的`-DPADDLE_WITH_MKLDNN`选项。 + +## 设置环境变量 + +需要将Paddle的核心库设置到`LD_LIBRARY_PATH`里, 先运行下面程序获取路径: + +``` +import paddle +print(paddle.sysconfig.get_lib()) +``` + +可通过如下方式添加动态库路径: + +``` +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'` +``` + + +更多关于如何在框架外部自定义 C++ OP,可阅读[官网说明文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_usage/index_cn.html) diff --git a/PaddleCV/rrpn/models/ext_op/src/bbox_util.h b/PaddleCV/rrpn/models/ext_op/src/bbox_util.h new file mode 100644 index 0000000000000000000000000000000000000000..dc978e2c1780312c22f9367bd6e75d32ce6a4867 --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/bbox_util.h @@ -0,0 +1,360 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
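+
+This header collects the rotated-box helpers shared by the RRPN custom ops:
+converting (x, y, w, h, angle) boxes to their four corner points
+(convert_region), polygon-intersection area (inter_pts / get_area),
+rotated IoU (devRotateIoU / BboxOverlaps2) and rotated box-delta encoding
+(BoxToDelta2).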
+ +Based on +-------------------------------------------------------- +@misc{ma2019rrpn, + author = {Jianqi Ma}, + title = {{RRPN in pytorch}}, + year = {2019}, + howpublished = {\url{https://github.com/mjq11302010044/RRPN_pytorch}}, +} +@article{Jianqi17RRPN, + Author = {Jianqi Ma and Weiyuan Shao and Hao Ye and Li Wang and Hong Wang +and Yingbin Zheng and Xiangyang Xue}, + Title = {Arbitrary-Oriented Scene Text Detection via Rotation Proposals}, + journal = {IEEE Transactions on Multimedia}, + volume={20}, + number={11}, + pages={3111-3122}, + year={2018} +} +-------------------------------------------------------- +*/ + +#pragma once +#include +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/framework/tensor.h" + +namespace paddle { +namespace operators { + +#define PI 3.141592654 + +struct RangeInitFunctor { + int start; + int delta; + int* out; + HOSTDEVICE void operator()(size_t i) { out[i] = start + i * delta; } +}; + + +// get trangle area after decompose intersecting polygons into triangles +template +inline T trangle_area(T* a, T* b, T* c) { + return ((a[0] - c[0]) * (b[1] - c[1]) - (a[1] - c[1]) * (b[0] - c[0])) / 2.0; +} + +// get area of intersecting +template +inline T get_area(T* int_pts, int num_of_inter) { + T area = 0.0; + for (int i = 0; i < num_of_inter - 2; i++) { + area += fabs( + trangle_area(int_pts, int_pts + 2 * i + 2, int_pts + 2 * i + 4)); + } + return area; +} + +// sort points to decompose intersecting polygons into triangles +template +inline void reorder_pts(T* int_pts, int num_of_inter) { + if (num_of_inter > 0) { + T center[2] = {0.0, 0.0}; + + for (int i = 0; i < num_of_inter; i++) { + center[0] += int_pts[2 * i]; + center[1] += int_pts[2 * i + 1]; + } + center[0] /= num_of_inter; + center[1] /= num_of_inter; + + T vs[16]; + T v[2]; + T d; + for (int i = 0; i < num_of_inter; i++) { + v[0] = int_pts[2 * i] - center[0]; + v[1] = int_pts[2 * i + 1] - center[1]; + d = sqrt(v[0] * v[0] + v[1] * v[1]); + v[0] = v[0] / d; + v[1] = v[1] / d; + if (v[1] < 0) { + v[0] = -2 - v[0]; + } + vs[i] = v[0]; + } + + float temp, tx, ty; + int j; + for (int i = 1; i < num_of_inter; ++i) { + if (vs[i - 1] > vs[i]) { + temp = vs[i]; + tx = int_pts[2 * i]; + ty = int_pts[2 * i + 1]; + j = i; + while (j > 0 && vs[j - 1] > temp) { + vs[j] = vs[j - 1]; + int_pts[j * 2] = int_pts[j * 2 - 2]; + int_pts[j * 2 + 1] = int_pts[j * 2 - 1]; + j--; + } + vs[j] = temp; + int_pts[j * 2] = tx; + int_pts[j * 2 + 1] = ty; + } + } + } +} + +// determine if points intersect +template +inline bool inter2line(T* pts1, T* pts2, int i, int j, T* temp_pts) { + T a[2] = {pts1[2 * i], pts1[2 * i + 1]}; + T b[2] = {pts1[2 * ((i + 1) % 4)], pts1[2 * ((i + 1) % 4) + 1]}; + T c[2] = {pts2[2 * j], pts2[2 * j + 1]}; + T d[2] = {pts2[2 * ((j + 1) % 4)], pts2[2 * ((j + 1) % 4) + 1]}; + + T area_abc, area_abd, area_cda, area_cdb; + + area_abc = trangle_area(a, b, c); + area_abd = trangle_area(a, b, d); + + if (area_abc * area_abd >= -1e-5) { + return false; + } + + area_cda = trangle_area(c, d, a); + area_cdb = area_cda + area_abc - area_abd; + + if (area_cda * area_cdb >= -1e-5) { + return false; + } + T t = area_cda / (area_abd - area_abc); + + T dx = t * (b[0] - a[0]); + T dy = t * (b[1] - a[1]); + temp_pts[0] = a[0] + dx; + temp_pts[1] = a[1] + dy; + + return true; +} + +template +inline bool inrect(T pt_x, T pt_y, T* pts) { + T ab[2] = {pts[2] - pts[0], pts[3] - pts[1]}; + T ad[2] = {pts[6] - pts[0], pts[7] - pts[1]}; + T ap[2] = {pt_x - 
pts[0], pt_y - pts[1]}; + + T abab = ab[0] * ab[0] + ab[1] * ab[1]; + T abap = ab[0] * ap[0] + ab[1] * ap[1]; + T adad = ad[0] * ad[0] + ad[1] * ad[1]; + T adap = ad[0] * ap[0] + ad[1] * ap[1]; + bool result = (abab - abap >= -1) and (abap >= -1) and (adad - adap >= -1) and + (adap >= -1); + return result; +} + +// calculate the number of intersection points +template +inline int inter_pts(T* pts1, T* pts2, T* int_pts) { + int num_of_inter = 0; + + for (int i = 0; i < 4; i++) { + if (inrect(pts1[2 * i], pts1[2 * i + 1], pts2)) { + int_pts[num_of_inter * 2] = pts1[2 * i]; + int_pts[num_of_inter * 2 + 1] = pts1[2 * i + 1]; + num_of_inter++; + } + if (inrect(pts2[2 * i], pts2[2 * i + 1], pts1)) { + int_pts[num_of_inter * 2] = pts2[2 * i]; + int_pts[num_of_inter * 2 + 1] = pts2[2 * i + 1]; + num_of_inter++; + } + } + + T out_pts[2]; + + for (int i = 0; i < 4; i++) { + for (int j = 0; j < 4; j++) { + bool has_pts = inter2line(pts1, pts2, i, j, out_pts); + if (has_pts) { + int_pts[num_of_inter * 2] = out_pts[0]; + int_pts[num_of_inter * 2 + 1] = out_pts[1]; + num_of_inter++; + } + } + } + + + return num_of_inter; +} + +// convert x,y,w,h,angle to x1,y1,x2,y2,x3,y3,x4,y4 +template +inline void convert_region(T* pts, + const framework::Tensor& _region, + int index) { + auto region = framework::EigenTensor::From(_region); + T angle = region(index, 4); + T a_cos = cos(angle / 180.0 * PI); + T a_sin = -sin(angle / 180.0 * PI); // anti clock-wise + + T ctr_x = region(index, 0); + T ctr_y = region(index, 1); + T h = region(index, 3); + T w = region(index, 2); + + + T pts_x[4] = {-w / 2, -w / 2, w / 2, w / 2}; + T pts_y[4] = {-h / 2, h / 2, h / 2, -h / 2}; + + for (int i = 0; i < 4; i++) { + pts[2 * i] = a_cos * pts_x[i] - a_sin * pts_y[i] + ctr_x; + pts[2 * i + 1] = a_sin * pts_x[i] + a_cos * pts_y[i] + ctr_y; + } +} + + +// Calculate the area of intersection +template +inline float inter(const framework::Tensor& _region1, + const framework::Tensor& _region2, + const int& r, + const int& c) { + T pts1[8]; + T pts2[8]; + T int_pts[16]; + int num_of_inter; + + + convert_region(pts1, _region1, r); + convert_region(pts2, _region2, c); + + num_of_inter = inter_pts(pts1, pts2, int_pts); + + reorder_pts(int_pts, num_of_inter); + + return get_area(int_pts, num_of_inter); +} + +template +inline float devRotateIoU(const framework::Tensor& _region1, + const framework::Tensor& _region2, + const int r, + const int c) { + auto __region1 = framework::EigenTensor::From(_region1); + auto __region2 = framework::EigenTensor::From(_region2); + + if ((fabs(__region1(r, 0) - __region2(c, 0)) < 1e-5) && + (fabs(__region1(r, 1) - __region2(c, 1)) < 1e-5) && + (fabs(__region1(r, 2) - __region2(c, 2)) < 1e-5) && + (fabs(__region1(r, 3) - __region2(c, 3)) < 1e-5) && + (fabs(__region1(r, 4) - __region2(c, 4)) < 1e-5)) { + return 1.0; + } + T area1, area2, area_inter; + area1 = __region1(r, 2) * __region1(r, 3); + area2 = __region2(c, 2) * __region2(c, 3); + area_inter = inter(_region1, _region2, r, c); + auto result = area_inter / (area1 + area2 - area_inter); + + if (result < 0) { + result = 0.0; + } + // may have bugs which cause overlap > 1 + if (result > 1.00000001) { + result = 0.0; + } + return result; +} + + +template +inline void BoxToDelta2(const int box_num, + const framework::Tensor& ex_boxes, + const framework::Tensor& gt_boxes, + const float* weights, + framework::Tensor* box_delta) { + auto ex_boxes_et = framework::EigenTensor::From(ex_boxes); + auto gt_boxes_et = framework::EigenTensor::From(gt_boxes); + auto trg = 
framework::EigenTensor::From(*box_delta); + T ex_w, ex_h, ex_ctr_x, ex_ctr_y, ex_angle, gt_w, gt_h, gt_ctr_x, gt_ctr_y, + gt_angle; + for (int64_t i = 0; i < box_num; ++i) { + ex_w = ex_boxes_et(i, 2); + ex_h = ex_boxes_et(i, 3); + ex_ctr_x = ex_boxes_et(i, 0); + ex_ctr_y = ex_boxes_et(i, 1); + ex_angle = ex_boxes_et(i, 4); + + gt_w = gt_boxes_et(i, 2); + gt_h = gt_boxes_et(i, 3); + gt_ctr_x = gt_boxes_et(i, 0); + gt_ctr_y = gt_boxes_et(i, 1); + gt_angle = gt_boxes_et(i, 4); + + trg(i, 0) = (gt_ctr_x - ex_ctr_x) / ex_w; + trg(i, 1) = (gt_ctr_y - ex_ctr_y) / ex_h; + trg(i, 2) = std::log(gt_w / ex_w); + trg(i, 3) = std::log(gt_h / ex_h); + trg(i, 4) = gt_angle - ex_angle; + + if (weights) { + trg(i, 0) = trg(i, 0) * weights[0]; + trg(i, 1) = trg(i, 1) * weights[1]; + trg(i, 2) = trg(i, 2) * weights[2]; + trg(i, 3) = trg(i, 3) * weights[3]; + trg(i, 4) = trg(i, 4) * weights[4]; + } + + if (gt_angle <= -30 && ex_angle >= 120) { + trg(i, 4) = trg(i, 4) + 180.0; + } + if (gt_angle >= 120 && ex_angle <= -30) { + trg(i, 4) = trg(i, 4) - 180.0; + } + trg(i, 4) = (PI / 180) * trg(i, 4); + } +} + + +template +void Gather( + const T* in, const int in_stride, const int* index, const int num, T* out) { + const int stride_bytes = in_stride * sizeof(T); + for (int i = 0; i < num; ++i) { + int id = index[i]; + memcpy(out + i * in_stride, in + id * in_stride, stride_bytes); + } +} + +template +void BboxOverlaps2(const framework::Tensor& r_boxes, + const framework::Tensor& c_boxes, + framework::Tensor* overlaps) { + auto overlaps_et = framework::EigenTensor::From(*overlaps); + int r_num = r_boxes.dims()[0]; + int c_num = c_boxes.dims()[0]; + for (int i = 0; i < r_num; ++i) { + for (int j = 0; j < c_num; ++j) { + overlaps_et(i, j) = devRotateIoU(r_boxes, c_boxes, i, j); + } + } +} + + +} // namespace operators +} // namespace paddle diff --git a/PaddleCV/rrpn/models/ext_op/src/blas.h b/PaddleCV/rrpn/models/ext_op/src/blas.h new file mode 100644 index 0000000000000000000000000000000000000000..5229882c36bb173109e3b44cb575afea6ecc9a9a --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/blas.h @@ -0,0 +1,487 @@ +// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +#pragma once + +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/framework/tensor.h" + +#ifdef PADDLE_WITH_MKLML +#include "paddle/fluid/platform/dynload/mklml.h" +#endif + +#ifdef PADDLE_WITH_LIBXSMM +#include +#endif + +#ifdef PADDLE_USE_OPENBLAS +#include +#endif + +namespace paddle { +namespace operators { +namespace math { + +/** + * Matrix Descriptor of a memory buffer. + * + * It is used for Blas::MatMul. MatMul operator can be batched. + * if Mat A is [BatchSize, H, W], Mat B is [BatchSize, H, W]. It will be a + * `batch_size` times of GEMM. The batched GEMM could be faster base on the + * implementation of the blas library. The batch size could be zero. If any + * matrix of `matmul` has a batch size, the will be a batched GEMM, too. 
e.g., + * Mat A is [BatchSize, H1, W2], and Mat B [H2, W2], The result matrix wil be + * [BatchSize, H1, W2] + * + * The boolean flag, `trans`, describe the memory is the transpose of matrix or + * not. If the trans is true, the last two dims of matrix are transposed. The + * memory layout of the matrix is [Width, Height] or [BatchSize, Width, Height]. + * + * The MatDescriptor is not only the dimension or shape of a matrix, it also + * contains the layout, stride of matrix. It is clearer to have a structure than + * reuse `DDim`. + */ +struct MatDescriptor { + int64_t height_; + int64_t width_; + int64_t stride_{0}; + int64_t batch_size_{0}; + bool trans_; +}; + +/** + * Create Matrix Descriptor from a tensor dim, num_flatten_cols, and transpose + * flag + * + * @param tensor_dim: The dimension of the tensor. The rank of this dimension + * must larger than 1. + * + * @param num_flatten_cols: Reshape a tensor to a matrix. The matrix's first + * dimension(column length) will be the product of tensor's first `num_col_dims` + * dimensions. If num_flatten_cols is zero, the first N-2 dimension will be the + * batch_size of descriptor. + * + * @param trans: True if the matrix is transposed. + */ +extern MatDescriptor CreateMatrixDescriptor(const framework::DDim& tensor_dim, + int num_flatten_cols, + bool trans); + +template +class Blas { +public: + explicit Blas(const DeviceContext& context) : context_(context) {} + + template + void GEMM(CBLAS_TRANSPOSE transA, + CBLAS_TRANSPOSE transB, + int M, + int N, + int K, + T alpha, + const T* A, + const T* B, + T beta, + T* C) const; + + template + void GEMM(bool transA, + bool transB, + int M, + int N, + int K, + T alpha, + const T* A, + int lda, + const T* B, + int ldb, + T beta, + T* C, + int ldc) const; + + template + void GEMM(CBLAS_TRANSPOSE transA, + CBLAS_TRANSPOSE transB, + int M, + int N, + int K, + T alpha, + const T* A, + int lda, + const T* B, + int ldb, + T beta, + T* C, + int ldc) const; + +#ifdef PADDLE_WITH_MKLML + template + T* GEMM_ALLOC(const CBLAS_IDENTIFIER id, + const int M, + const int N, + const int K) const; + + template + void GEMM_PACK(const CBLAS_IDENTIFIER id, + const CBLAS_TRANSPOSE trans, + int M, + int N, + int K, + const T alpha, + const T* src, + const int ld, + T* dst) const; + + template + void GEMM_COMPUTE(int transA, + int transB, + int M, + int N, + int K, + const T* A, + const int lda, + const T* B, + const int ldb, + T beta, + T* C, + const int ldc) const; + + template + void GEMM_FREE(T* data) const; + + template + void CSRMM(const char* transa, + const int* m, + const int* n, + const int* k, + const T* alpha, + const char* matdescra, + const T* val, + const int* indx, + const int* pntrb, + const int* pntre, + const T* b, + const int* ldb, + const T* beta, + T* c, + const int* ldc) const; + +#if !defined(PADDLE_WITH_CUDA) + template + void MatMulWithHead(const framework::Tensor& mat_a, + const MatDescriptor& dim_a, + const framework::Tensor& mat_b, + const MatDescriptor& dim_b, + T alpha, + int head_number, + framework::Tensor* mat_out, + T beta, + bool mat_y_split_vertical) const; +#endif +#endif + + template + void MatMul(const int M, + const int N, + const int K, + const T* A, + const T* B, + T* C) const; + + template + void MatMul(const framework::Tensor& mat_a, + bool trans_a, + const framework::Tensor& mat_b, + bool trans_b, + T alpha, + framework::Tensor* mat_out, + T beta) const; + + template + void MatMul(const framework::Tensor& mat_a, + bool trans_a, + const framework::Tensor& mat_b, + bool trans_b, 
+ framework::Tensor* mat_out) const { + MatMul(mat_a, + trans_a, + mat_b, + trans_b, + static_cast(1.0), + mat_out, + static_cast(0.0)); + } + + template + void MatMul(const framework::Tensor& mat_a, + const framework::Tensor& mat_b, + framework::Tensor* mat_out) const { + this->template MatMul(mat_a, false, mat_b, false, mat_out); + } + + template + void AXPY(int n, T alpha, const T* x, T* y) const; + + template + void VADD(int n, const T* x, const T* y, T* z) const; + + template + void VSUB(int n, const T* x, const T* y, T* z) const; + + template + void VMUL(int n, const T* x, const T* y, T* z) const; + + template + void VDIV(int n, const T* x, const T* y, T* z) const; + + template + void VCOPY(int n, const T* x, T* y) const; + + template + void VEXP(int n, const T* x, T* y) const; + + template + void VSQUARE(int n, const T* x, T* y) const; + + template + void VPOW(int n, const T* x, T alpha, T* y) const; + + template + void GEMV(bool trans_a, + int M, + int N, + T alpha, + const T* A, + const T* B, + T beta, + T* C) const; + + template + T DOT(int n, const T* x, const T* y) const; + + template + void SCAL(int n, const T a, T* x) const; + + template + T ASUM(int n, T* x, int inc) const; + + template + void BatchedGEMM(CBLAS_TRANSPOSE transA, + CBLAS_TRANSPOSE transB, + int M, + int N, + int K, + T alpha, + const T* A, + const T* B, + T beta, + T* C, + int batchCount, + int64_t strideA, + int64_t strideB) const; + +#if defined(PADDLE_WITH_MKLML) && !defined(PADDLE_WITH_CUDA) + template + void BatchedGEMMWithHead(CBLAS_TRANSPOSE transA, + CBLAS_TRANSPOSE transB, + int W1, + int H1, + int W2, + int H2, + T alpha, + const T* A, + const T* B, + T beta, + T* C, + int batchCount, + int64_t strideA, + int64_t strideB, + int64_t head_number, + bool split_b_vertical) const; +#endif + + template + void MatMul(const framework::Tensor& mat_a, + const MatDescriptor& dim_a, + const framework::Tensor& mat_b, + const MatDescriptor& dim_b, + T alpha, + framework::Tensor* mat_out, + T beta) const; + + template + void VINV(int n, const T* a, T* y) const; + + template + void VMERF(int n, const T* a, T* y, int64_t mode) const; + +private: + const DeviceContext& context_; +}; + +template +class BlasT : private Blas { +public: + using Blas::Blas; + + template + void GEMM(ARGS... args) const { + Base()->template GEMM(args...); + } + +#ifdef PADDLE_WITH_MKLML + template + T* GEMM_ALLOC(ARGS... args) const { + return Base()->template GEMM_ALLOC(args...); + } + + template + void GEMM_PACK(ARGS... args) const { + Base()->template GEMM_PACK(args...); + } + + template + void GEMM_COMPUTE(ARGS... args) const { + Base()->template GEMM_COMPUTE(args...); + } + + template + void GEMM_FREE(ARGS... args) const { + Base()->template GEMM_FREE(args...); + } + + template + void CSRMM(ARGS... args) const { + Base()->template CSRMM(args...); + } + +#if !defined(PADDLE_WITH_CUDA) + template + void MatMulWithHead(ARGS... args) const { + Base()->template MatMulWithHead(args...); + } +#endif +#endif + + template + void MatMul(ARGS... args) const { + Base()->template MatMul(args...); + } + + template + void AXPY(ARGS... args) const { + Base()->template AXPY(args...); + } + + template + void VADD(ARGS... args) const { + Base()->template VADD(args...); + } + + template + void VSUB(ARGS... args) const { + Base()->template VSUB(args...); + } + + template + void VMUL(ARGS... args) const { + Base()->template VMUL(args...); + } + + template + void VDIV(ARGS... 
args) const { + Base()->template VDIV(args...); + } + + template + void VCOPY(ARGS... args) const { + Base()->template VCOPY(args...); + } + + template + void VEXP(ARGS... args) const { + Base()->template VEXP(args...); + } + + template + void VSQUARE(ARGS... args) const { + Base()->template VSQUARE(args...); + } + + template + void VPOW(ARGS... args) const { + Base()->template VPOW(args...); + } + + template + void GEMV(ARGS... args) const { + Base()->template GEMV(args...); + } + + template + T DOT(ARGS... args) const { + return Base()->template DOT(args...); + } + + template + void SCAL(ARGS... args) const { + Base()->template SCAL(args...); + } + + template + T ASUM(ARGS... args) const { + return Base()->template ASUM(args...); + } + + template + void BatchedGEMM(ARGS... args) const { + Base()->template BatchedGEMM(args...); + } + + template + void VINV(ARGS... args) const { + Base()->template VINV(args...); + } + + template + void VMERF(ARGS... args) const { + Base()->template VMERF(args...); + } + +private: + const Blas* Base() const { + return static_cast*>(this); + } +}; + +template +inline BlasT GetBlas( + const framework::ExecutionContext& exe_ctx) { + return BlasT( + exe_ctx.template device_context()); +} + +template +inline BlasT GetBlas(const DeviceContext& dev_ctx) { + return BlasT(dev_ctx); +} + +} // namespace math +} // namespace operators +} // namespace paddle + +#include "paddle/fluid/operators/math/blas_impl.h" +#ifdef PADDLE_WITH_CUDA +#include "paddle/fluid/operators/math/blas_impl.cu.h" +#endif diff --git a/PaddleCV/rrpn/models/ext_op/src/concat_and_split.cc b/PaddleCV/rrpn/models/ext_op/src/concat_and_split.cc new file mode 100644 index 0000000000000000000000000000000000000000..20bf99637e22dd3a833d5f922d39f8b9d9dd4a87 --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/concat_and_split.cc @@ -0,0 +1,76 @@ +/* Copyright (c) 2019 paddlepaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "concat_and_split.h" +#include + +namespace paddle { +namespace operators { +namespace math { + +/* + * All tensors' dimension should be the same and the values of + * each dimension must be the same, except the axis dimension. 
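+ *
+ * Implementation note for the CPU kernel below: each input tensor is viewed
+ * as a 2-D block of shape (rows, cols), where rows is the product of the
+ * dimensions before `axis` and cols = numel / rows; the blocks are copied
+ * row by row with memcpy, so the concatenation reduces to a column-wise
+ * merge of these 2-D views.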
+ */ +template +class ConcatFunctor { +public: + void operator()(const platform::CPUDeviceContext& context, + const std::vector& input, + int axis, + framework::Tensor* output) { + // TODO(zcd): Add input data validity checking + int num = input.size(); + + int rows = 1; + auto dim_0 = input[0].dims(); + for (int i = 0; i < axis; ++i) { + rows *= dim_0[i]; + } + int out_rows = rows, out_cols = 0; + + std::vector input_cols(input.size()); + for (int i = 0; i < num; ++i) { + int t_cols = input[i].numel() / rows; + out_cols += t_cols; + input_cols[i] = t_cols; + } + auto cpu_place = boost::get(context.GetPlace()); + + // computation + auto output_data = output->data(); + int col_idx = 0; + for (int j = 0; j < num; ++j) { + int col_len = input_cols[j]; + auto input_data = input[j].data(); + for (int k = 0; k < out_rows; ++k) { + memory::Copy(cpu_place, + output_data + k * out_cols + col_idx, + cpu_place, + input_data + k * col_len, + sizeof(T) * col_len); + } + col_idx += col_len; + } + } +}; + +#define DEFINE_FUNCTOR(type) \ + template class ConcatFunctor; + +FOR_ALL_TYPES(DEFINE_FUNCTOR); + +} // namespace math +} // namespace operators +} // namespace paddle diff --git a/PaddleCV/rrpn/models/ext_op/src/concat_and_split.h b/PaddleCV/rrpn/models/ext_op/src/concat_and_split.h new file mode 100644 index 0000000000000000000000000000000000000000..d5947597f6bfcee10973aa41deea45053955e3b1 --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/concat_and_split.h @@ -0,0 +1,59 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include "paddle/fluid/framework/data_type.h" +#include "paddle/fluid/framework/lod_tensor.h" + +namespace paddle { +namespace operators { +namespace math { + +/* + * \brief Concatenate the input tensors along the dimension axis. + * TODO(zcd): maybe it needs to be more detailed. + * Examples: + * Input[0] = [[1,2],[3,4]] + * Input[1] = [[5,6]] + * axis = 0 + * + * Output = [[1,2], + * [3,4], + * [5,6]] + */ +template +class ConcatFunctor { +public: + void operator()(const DeviceContext& context, + const std::vector& input, + int axis, + framework::Tensor* output); +}; + + +} // namespace math +} // namespace operators +} // namespace paddle + +#define FOR_ALL_TYPES(macro) \ + macro(int); \ + macro(float); \ + macro(double); \ + macro(bool); \ + macro(int64_t); \ + macro(int16_t); \ + macro(uint8_t); \ + macro(int8_t); \ + macro(::paddle::platform::float16) diff --git a/PaddleCV/rrpn/models/ext_op/src/gather.cu.h b/PaddleCV/rrpn/models/ext_op/src/gather.cu.h new file mode 100644 index 0000000000000000000000000000000000000000..9e6b76b37c4ccae3e3001589a20cf9db4c76fe4a --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/gather.cu.h @@ -0,0 +1,125 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include "paddle/fluid/framework/dim.h" +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/framework/tensor.h" +#include "paddle/fluid/memory/malloc.h" +#include "paddle/fluid/platform/cuda_primitives.h" +#include "paddle/fluid/platform/place.h" + +namespace paddle { +namespace operators { + +using framework::Tensor; +using platform::DeviceContext; + +#define CUDA_1D_KERNEL_LOOP(i, n) \ + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (n); \ + i += blockDim.x * gridDim.x) + +template +__global__ void GatherCUDAKernel(const T* params, + const IndexT* indices, + T* output, + size_t index_size, + size_t slice_size) { + CUDA_1D_KERNEL_LOOP(i, index_size * slice_size) { + int indices_i = i / slice_size; + int slice_i = i - indices_i * slice_size; // offset inside the slice + IndexT gather_i = indices[indices_i]; + IndexT params_i = gather_i * slice_size + slice_i; + *(output + i) = *(params + params_i); + } +} + +template +__global__ void GatherNdCUDAKernel(const T* input, + const int* input_dims, + const IndexT* indices, + T* output, + size_t remain_size, + size_t slice_size, + size_t end_size) { + CUDA_1D_KERNEL_LOOP(i, remain_size * slice_size) { + int indices_i = i / slice_size; + int slice_i = i - indices_i * slice_size; // offset inside the slice + IndexT gather_i = 0; + int64_t temp = slice_size; + for (int64_t j = end_size - 1; j >= 0; --j) { + auto index_value = indices[indices_i * end_size + j]; + assert(index_value >= 0 && index_value < input_dims[j]); + gather_i += (index_value * temp); + temp *= input_dims[j]; + } + IndexT input_i = gather_i + slice_i; + *(output + i) = *(input + input_i); + } +} + +/** + * A thin wrapper on gpu tensor + * Return a new tensor from source tensor, gathered according to index + * input[src]: type-T source Tensor + * input[index]: type-IndexT index Tensor (1-D) + * return: output tensor + */ +template +void GPUGather(const platform::DeviceContext& ctx, + const Tensor& src, + const Tensor& index, + Tensor* output) { + // check index of shape 1-D + if (index.dims().size() == 1) { + PADDLE_ENFORCE_GT(index.dims()[0], + 0, + "The index of gather_op should not be empty when the " + "index's rank is 1."); + } else if (index.dims().size() == 2) { + PADDLE_ENFORCE_EQ(index.dims()[1], + 1, + " If the index's rank of gather_op is 2, the second " + "dimension should be 1."); + } + + int index_size = index.dims()[0]; + + auto src_dims = src.dims(); + framework::DDim output_dims(src_dims); + output_dims[0] = index_size; + + // slice size + int slice_size = 1; + for (int i = 1; i < src_dims.size(); ++i) slice_size *= src_dims[i]; + + const T* p_src = src.data(); + const IndexT* p_index = index.data(); + T* p_output = output->data(); + + int block = 512; + int n = slice_size * index_size; + int grid = (n + block - 1) / block; + + GatherCUDAKernel<<< + grid, + block, + 0, + reinterpret_cast(ctx).stream()>>>( + p_src, p_index, p_output, index_size, slice_size); +} + +} // namespace operators +} // namespace paddle diff --git a/PaddleCV/rrpn/models/ext_op/src/gather.h b/PaddleCV/rrpn/models/ext_op/src/gather.h new file 
mode 100644 index 0000000000000000000000000000000000000000..a2ee07427724a7850b009402a8a6b1255d2e8a79 --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/gather.h @@ -0,0 +1,74 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +#include "paddle/fluid/framework/ddim.h" +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/tensor.h" +#include "paddle/fluid/platform/place.h" + +namespace paddle { +namespace operators { + +using framework::Tensor; + +/** + * A thin wrapper for gathering on cpu tensor + * Return a new tensor from source tensor, gathered according to index + * input[src]: type-T source Tensor + * input[index]: type-IndexT index Tensor (1-D) + * return: output tensor + */ +template +void CPUGather(const platform::DeviceContext& ctx, + const Tensor& src, + const Tensor& index, + Tensor* output) { + PADDLE_ENFORCE_EQ(platform::is_cpu_place(ctx.GetPlace()), true); + // check index of shape 1-D + if (index.dims().size() == 2) { + PADDLE_ENFORCE_EQ(index.dims()[1], + 1, + "index.dims()[1] should be 1 when index.dims().size() == " + "2 in gather_op."); + } else { + PADDLE_ENFORCE_EQ(index.dims().size(), + 1, + "index.dims().size() should be 1 or 2 in gather_op."); + } + int64_t index_size = index.dims()[0]; + + auto src_dims = src.dims(); + + const T* p_src = src.data(); + const IndexT* p_index = index.data(); + T* p_output = output->data(); + + // slice size + int slice_size = 1; + for (int i = 1; i < src_dims.size(); ++i) slice_size *= src_dims[i]; + + const size_t slice_bytes = slice_size * sizeof(T); + + for (int64_t i = 0; i < index_size; ++i) { + IndexT index_ = p_index[i]; + memcpy(p_output + i * slice_size, p_src + index_ * slice_size, slice_bytes); + } +} + +} // namespace operators +} // namespace paddle diff --git a/PaddleCV/rrpn/models/ext_op/src/make.sh b/PaddleCV/rrpn/models/ext_op/src/make.sh new file mode 100644 index 0000000000000000000000000000000000000000..96810820da26e00c7aa23a6a103f6dc3401ce655 --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/make.sh @@ -0,0 +1,73 @@ +include_dir=$( python -c 'import paddle; print(paddle.sysconfig.get_include())' ) +lib_dir=$( python -c 'import paddle; print(paddle.sysconfig.get_lib())' ) + +echo $include_dir +echo $lib_dir + +CUDA=$1 +CUDNN=$2 +NCCL=$3 + +if [ ! -d "$CUDA" ]; then +echo "Usage: sh make.sh \$CUDA_PATH \$CUDNN_PATH \$NCCL_PATH" +exit +fi + +if [ ! -d "$CUDNN" ]; then +echo "Usage: sh make.sh \${CUDA_PATH} \${CUDNN_PATH} \${NCCL_PATH}" +exit +fi + +if [ ! 
-d "$NCCL" ]; then +echo "Usage: sh make.sh \${CUDA_PATH} \${CUDNN_PATH} \${NCCL_PATH}" +exit +fi + +git clone https://github.com/NVlabs/cub.git + +nvcc rrpn_generate_proposals_op.cu -c -o rrpn_generate_proposals_op.cu.o -ccbin cc -DPADDLE_WITH_MKLDNN -DPADDLE_WITH_CUDA -DEIGEN_USE_GPU -DPADDLE_USE_DSO -Xcompiler -fPIC -std=c++11 -Xcompiler -fPIC -w --expt-relaxed-constexpr -O3 -DNVCC \ + -I ${include_dir} \ + -I ${include_dir}/third_party \ + -I ${CUDA}/include \ + -I ${CUDNN}/include \ + -I ${NCCL}/include \ + -L ${lib_dir} -lpaddle_framework \ + -L ${CUDA}/lib64 -lcudart + + +nvcc rotated_anchor_generator_op.cu -c -o rotated_anchor_generator_op.cu.o -ccbin cc -DPADDLE_WITH_MKLDNN -DPADDLE_WITH_CUDA -DEIGEN_USE_GPU -DPADDLE_USE_DSO -Xcompiler -fPIC -std=c++11 -Xcompiler -fPIC -w --expt-relaxed-constexpr -O3 -DNVCC \ + -I ${include_dir} \ + -I ${include_dir}/third_party \ + -I ${CUDA}/include \ + -I ${CUDNN}/include \ + -I ${NCCL}/include \ + -L ${lib_dir} -lpaddle_framework \ + -L ${CUDA}/lib64 -lcudart + +nvcc rrpn_box_coder_op.cu -c -o rrpn_box_coder_op.cu.o -ccbin cc -DPADDLE_WITH_MKLDNN -DPADDLE_WITH_CUDA -DEIGEN_USE_GPU -DPADDLE_USE_DSO -Xcompiler -fPIC -std=c++11 -Xcompiler -fPIC -w --expt-relaxed-constexpr -O3 -DNVCC \ + -I ${include_dir} \ + -I ${include_dir}/third_party \ + -I ${CUDA}/include \ + -I ${CUDNN}/include \ + -I ${NCCL}/include \ + -L ${lib_dir} -lpaddle_framework \ + -L ${CUDA}/lib64 -lcudart + +nvcc rrpn_rotated_roi_align_op.cu -c -o rrpn_rotated_roi_align_op.cu.o -ccbin cc -DPADDLE_WITH_MKLDNN -DPADDLE_WITH_CUDA -DEIGEN_USE_GPU -DPADDLE_USE_DSO -Xcompiler -fPIC -std=c++11 -Xcompiler -fPIC -w --expt-relaxed-constexpr -O3 -DNVCC \ + -I ${include_dir} \ + -I ${include_dir}/third_party \ + -I ${CUDA}/include \ + -I ${CUDNN}/include \ + -I ${NCCL}/include \ + -L ${lib_dir} -lpaddle_framework \ + -L ${CUDA}/lib64 -lcudart + + +g++ rotated_anchor_generator_op.cc concat_and_split.cc rrpn_generate_proposal_labels_op.cc rrpn_generate_proposals_op.cc rrpn_target_assign_op.cc rrpn_box_coder_op.cc rrpn_rotated_roi_align_op.cc rrpn_rotated_roi_align_op.cu.o rrpn_box_coder_op.cu.o rotated_anchor_generator_op.cu.o rrpn_generate_proposals_op.cu.o -o rrpn_lib.so -shared -fPIC -std=c++11 -O3 -DPADDLE_WITH_MKLDNN -DPADDLE_WITH_CUDA -DEIGEN_USE_GPU -DPADDLE_USE_DSO \ + -I ${include_dir} \ + -I ${include_dir}/third_party \ + -I ${CUDA}/include \ + -I ${CUDNN}/include \ + -I ${NCCL}/include \ + -L ${lib_dir} -lpaddle_framework \ + -L ${CUDA}/lib64 -lcudart diff --git a/PaddleCV/rrpn/models/ext_op/src/math_function.cc b/PaddleCV/rrpn/models/ext_op/src/math_function.cc new file mode 100644 index 0000000000000000000000000000000000000000..24d5909eafa2293b1474f4c4e22f1ab177d610ac --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/math_function.cc @@ -0,0 +1,73 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "math_function.h" + +#ifdef PADDLE_WITH_MKLML +#include "paddle/fluid/platform/dynload/mklml.h" +#endif + +#ifdef PADDLE_USE_OPENBLAS +#include +#endif + +#include +#include "math_function_impl.h" +#include "paddle/fluid/framework/data_type.h" +#include "paddle/fluid/platform/float16.h" + +namespace paddle { +namespace operators { +namespace math { + +#define DEFINE_CPU_TRANS(RANK) \ + template struct Transpose; \ + template struct Transpose; \ + template struct Transpose; \ + template struct Transpose; \ + template struct Transpose; \ + template struct Transpose; \ + template struct Transpose; \ + template struct Transpose; \ + template struct Transpose; + +DEFINE_CPU_TRANS(1); +DEFINE_CPU_TRANS(2); +DEFINE_CPU_TRANS(3); +DEFINE_CPU_TRANS(4); +DEFINE_CPU_TRANS(5); +DEFINE_CPU_TRANS(6); + +template +void Transpose::operator()( + const DeviceContext& context, + const framework::Tensor& in, + framework::Tensor* out, + const std::vector& axis) { + Eigen::array permute; + for (int i = 0; i < Rank; i++) { + permute[i] = axis[i]; + } + auto eigen_in = framework::EigenTensor::From(in); + auto eigen_out = framework::EigenTensor::From(*out); + auto* dev = context.eigen_device(); + eigen_out.device(*dev) = eigen_in.shuffle(permute); +} + + +} // namespace math +} // namespace operators +} // namespace paddle diff --git a/PaddleCV/rrpn/models/ext_op/src/math_function.h b/PaddleCV/rrpn/models/ext_op/src/math_function.h new file mode 100644 index 0000000000000000000000000000000000000000..b8043943ed4c3f0b26ca92a2d1ab8e091213673b --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/math_function.h @@ -0,0 +1,43 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include + +#include "paddle/fluid/framework/eigen.h" +#include "paddle/fluid/framework/operator.h" +#include "paddle/fluid/framework/tensor.h" +#include "paddle/fluid/framework/tensor_util.h" +#include "paddle/fluid/platform/device_context.h" +#include "paddle/fluid/platform/enforce.h" + +namespace paddle { +namespace operators { +namespace math { +template +struct Transpose { + void operator()(const DeviceContext& context, + const framework::Tensor& in, + framework::Tensor* out, + const std::vector& axis); +}; + +void set_constant(const platform::DeviceContext& context, + framework::Tensor* tensor, + float value); + +} // namespace math +} // namespace operators +} // namespace paddle diff --git a/PaddleCV/rrpn/models/ext_op/src/rotated_anchor_generator_op.cc b/PaddleCV/rrpn/models/ext_op/src/rotated_anchor_generator_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..854245aaa2ce80024d608fb76fda001b48a505ac --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/rotated_anchor_generator_op.cc @@ -0,0 +1,172 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include "rotated_anchor_generator_op.h" + +namespace paddle { +namespace operators { + +class RotatedAnchorGeneratorOp : public framework::OperatorWithKernel { +public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE( + ctx->HasInput("Input"), + "Input(Input) of RotatedAnchorGeneratorOp should not be null."); + PADDLE_ENFORCE( + ctx->HasOutput("Anchors"), + "Output(Anchors) of RotatedAnchorGeneratorOp should not be null."); + PADDLE_ENFORCE( + ctx->HasOutput("Variances"), + "Output(Variances) of RotatedAnchorGeneratorOp should not be null."); + + auto input_dims = ctx->GetInputDim("Input"); + PADDLE_ENFORCE(input_dims.size() == 4, "The layout of input is NCHW."); + + auto anchor_sizes = ctx->Attrs().Get>("anchor_sizes"); + auto aspect_ratios = ctx->Attrs().Get>("aspect_ratios"); + auto angles = ctx->Attrs().Get>("angles"); + auto stride = ctx->Attrs().Get>("stride"); + auto variances = ctx->Attrs().Get>("variances"); + + size_t num_anchors = + aspect_ratios.size() * anchor_sizes.size() * angles.size(); + + std::vector dim_vec(4); + dim_vec[0] = input_dims[2]; + dim_vec[1] = input_dims[3]; + dim_vec[2] = num_anchors; + dim_vec[3] = 5; + ctx->SetOutputDim("Anchors", framework::make_ddim(dim_vec)); + ctx->SetOutputDim("Variances", framework::make_ddim(dim_vec)); + } + +protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType( + ctx.Input("Input")->type(), ctx.device_context()); + } +}; + +class RotatedAnchorGeneratorOpMaker : public framework::OpProtoAndCheckerMaker { +public: + void Make() override { + AddInput("Input", + "(Tensor, default Tensor), " + "the input feature is a tensor with a rank of 4. " + "The layout is NCHW."); + AddOutput("Anchors", + "(Tensor, default Tensor), the output is a " + "tensor with a rank of 4. The layout is [H, W, num_anchors, 5]. " + "H is the height of input, W is the width of input, num_anchors " + "is the box count of each position. " + "Each anchor is in (xctr, yctr, w, h, thelta) format"); + AddOutput("Variances", + "(Tensor, default Tensor), the expanded variances for " + "normalizing bbox regression targets. The layout is [H, W, " + "num_anchors, 5]. " + "H is the height of input, W is the width of input, num_anchors " + "is the box count of each position. " + "Each variance is in (xctr, yctr, w, h, thelta) format"); + + AddAttr>( + "anchor_sizes", + "(vector) List of Rotated Region Proposal Network(RRPN) anchor " + "sizes " + " given in absolute pixels e.g. (64, 128, 256, 512)." 
+ " For instance, the anchor size of 64 means the area of this anchor " + "equals to 64**2.") + .AddCustomChecker([](const std::vector& anchor_sizes) { + PADDLE_ENFORCE_GT(anchor_sizes.size(), + 0UL, + "Size of anchor_sizes must be at least 1."); + for (size_t i = 0; i < anchor_sizes.size(); ++i) { + PADDLE_ENFORCE_GT( + anchor_sizes[i], 0.0, "anchor_sizes[%d] must be positive.", i); + } + }); + AddAttr>( + "aspect_ratios", + "(vector) List of Rotated Region Proposal Network(RRPN) anchor " + "aspect " + "ratios, e.g. (0.5, 1, 2)." + "For instacne, the aspect ratio of 0.5 means the height / width of " + "this anchor equals 0.5."); + AddAttr>( + "angles", + "(vector) List of Rotated Region Proposal Network(RRPN) anchor " + "angles, " + "e.g. (-30.0, 0.0, 30.0, 60.0, 90.0, 120.0)." + "For instacne, the aspect ratio of 0.5 means the height / width of " + "this anchor equals 0.5."); + + AddAttr>("variances", + "(vector) List of variances to be used " + "in box regression deltas") + .AddCustomChecker([](const std::vector& variances) { + PADDLE_ENFORCE_EQ( + variances.size(), 5UL, "Must and only provide 5 variance."); + for (size_t i = 0; i < variances.size(); ++i) { + PADDLE_ENFORCE_GT( + variances[i], 0.0, "variance[%d] must be greater than 0.", i); + } + }); + + AddAttr>("stride", + "Anchors stride across width and height, " + "with a default of (16, 16)") + .SetDefault(std::vector(2, 16.0)) + .AddCustomChecker([](const std::vector& stride) { + PADDLE_ENFORCE_EQ( + stride.size(), + 2UL, + "Must and only provide 2 stride for width and height."); + for (size_t i = 0; i < stride.size(); ++i) { + PADDLE_ENFORCE_GT( + stride[i], 0.0, "stride[%d] should be larger than 0.", i); + } + }); + + AddAttr("offset", + "(float) " + "Anchor center offset, with a default of 0.5") + .SetDefault(0.5); + AddComment(R"DOC( +RotatedAnchorGenerator operator +Generates anchors for RRPN. algorithm. +Each position of the input produce N anchors, N = + size(anchor_sizes) * size(aspect_ratios) * size(angles). + +Please get more information from the following papers: +https://arxiv.org/abs/1703.01086. +)DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR( + rotated_anchor_generator, + ops::RotatedAnchorGeneratorOp, + ops::RotatedAnchorGeneratorOpMaker, + paddle::framework::EmptyGradOpMaker, + paddle::framework::EmptyGradOpMaker); + +REGISTER_OP_CPU_KERNEL(rotated_anchor_generator, + ops::RotatedAnchorGeneratorOpKernel, + ops::RotatedAnchorGeneratorOpKernel); diff --git a/PaddleCV/rrpn/models/ext_op/src/rotated_anchor_generator_op.cu b/PaddleCV/rrpn/models/ext_op/src/rotated_anchor_generator_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..9c5250103326baa08588c3acacc7c1767da18332 --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/rotated_anchor_generator_op.cu @@ -0,0 +1,153 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "rotated_anchor_generator_op.h" + +namespace paddle { +namespace operators { + +template +__global__ void GenRAnchors(T* out, + const T* aspect_ratios, + const int ar_num, + const T* anchor_sizes, + const int as_num, + const T* angles, + const int aa_num, + const T* stride, + const int sd_num, + const int height, + const int width, + const T offset) { + int num_anchors = as_num * ar_num * aa_num; + int box_num = height * width * num_anchors; + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < box_num; + i += blockDim.x * gridDim.x) { + int h_idx = i / (num_anchors * width); + int w_idx = (i / num_anchors) % width; + T stride_width = stride[0]; + T stride_height = stride[1]; + T x_ctr = (w_idx * stride_width) + offset * stride_width - 1; + T y_ctr = (h_idx * stride_height) + offset * stride_height - 1; + T area, area_ratios; + T base_w, base_h; + T scale_w, scale_h; + T anchor_width, anchor_height; + int anch_idx = i % num_anchors; + int ar_idx = anch_idx / (as_num * aa_num); + int as_idx = anch_idx / aa_num % as_num; + int aa_idx = anch_idx % aa_num; + T aspect_ratio = aspect_ratios[ar_idx]; + T anchor_size = anchor_sizes[as_idx]; + T angle = angles[aa_idx]; + area = stride_width * stride_height; + area_ratios = area / aspect_ratio; + base_w = round(sqrt(area_ratios)); + base_h = round(base_w * aspect_ratio); + scale_w = anchor_size / stride_width; + scale_h = anchor_size / stride_height; + anchor_width = scale_w * base_w; + anchor_height = scale_h * base_h; + out[i * 5] = x_ctr; + out[i * 5 + 1] = y_ctr; + out[i * 5 + 2] = anchor_width; + out[i * 5 + 3] = anchor_height; + out[i * 5 + 4] = angle; + } +} + +template +__global__ void SetVariance(T* out, + const T* var, + const int vnum, + const int num) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num; + i += blockDim.x * gridDim.x) { + out[i] = var[i % vnum]; + } +} + +template +class RotatedAnchorGeneratorOpCUDAKernel : public framework::OpKernel { +public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* input = ctx.Input("Input"); + auto* anchors = ctx.Output("Anchors"); + auto* vars = ctx.Output("Variances"); + + auto anchor_sizes = ctx.Attr>("anchor_sizes"); + auto aspect_ratios = ctx.Attr>("aspect_ratios"); + auto angles = ctx.Attr>("angles"); + auto stride = ctx.Attr>("stride"); + auto variances = ctx.Attr>("variances"); + + T offset = static_cast(ctx.Attr("offset")); + + auto width = input->dims()[3]; + auto height = input->dims()[2]; + + int num_anchors = + aspect_ratios.size() * anchor_sizes.size() * angles.size(); + + int box_num = width * height * num_anchors; + + int block = 512; + int grid = (box_num + block - 1) / block; + + auto stream = + ctx.template device_context().stream(); + + anchors->mutable_data(ctx.GetPlace()); + vars->mutable_data(ctx.GetPlace()); + + framework::Tensor ar; + framework::TensorFromVector(aspect_ratios, ctx.device_context(), &ar); + + framework::Tensor as; + framework::TensorFromVector(anchor_sizes, ctx.device_context(), &as); + + framework::Tensor aa; + framework::TensorFromVector(angles, ctx.device_context(), &aa); + + framework::Tensor sd; + framework::TensorFromVector(stride, ctx.device_context(), &sd); + + GenRAnchors<<>>(anchors->data(), + ar.data(), + aspect_ratios.size(), + as.data(), + anchor_sizes.size(), + aa.data(), + angles.size(), + sd.data(), + stride.size(), + height, + width, + offset); + + framework::Tensor v; + framework::TensorFromVector(variances, ctx.device_context(), &v); + grid = (box_num * 5 + block - 1) / 
block; + SetVariance<<>>( + vars->data(), v.data(), variances.size(), box_num * 5); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL(rotated_anchor_generator, + ops::RotatedAnchorGeneratorOpCUDAKernel, + ops::RotatedAnchorGeneratorOpCUDAKernel); diff --git a/PaddleCV/rrpn/models/ext_op/src/rotated_anchor_generator_op.h b/PaddleCV/rrpn/models/ext_op/src/rotated_anchor_generator_op.h new file mode 100644 index 0000000000000000000000000000000000000000..81239d1f97303ca946efecb5234c4c09d31ae2c0 --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/rotated_anchor_generator_op.h @@ -0,0 +1,111 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#pragma once +#include +#include +#include "paddle/fluid/framework/op_registry.h" +//#include "paddle/fluid/operators/math/math_function.h" +#include "paddle/fluid/platform/transform.h" + +namespace paddle { +namespace operators { + +template +class RotatedAnchorGeneratorOpKernel : public framework::OpKernel { +public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* input = ctx.Input("Input"); + auto* anchors = ctx.Output("Anchors"); + auto* vars = ctx.Output("Variances"); + + auto anchor_sizes = ctx.Attr>("anchor_sizes"); + auto aspect_ratios = ctx.Attr>("aspect_ratios"); + auto angles = ctx.Attr>("angles"); + auto stride = ctx.Attr>("stride"); + auto variances = ctx.Attr>("variances"); + + T offset = static_cast(ctx.Attr("offset")); + + auto feature_width = input->dims()[3]; + auto feature_height = input->dims()[2]; + + T stride_width, stride_height; + stride_width = stride[0]; + stride_height = stride[1]; + + int num_anchors = + aspect_ratios.size() * anchor_sizes.size() * angles.size(); + + anchors->mutable_data(ctx.GetPlace()); + vars->mutable_data(ctx.GetPlace()); + + auto e_anchors = framework::EigenTensor::From(*anchors); + for (int h_idx = 0; h_idx < feature_height; ++h_idx) { + for (int w_idx = 0; w_idx < feature_width; ++w_idx) { + T x_ctr = (w_idx * stride_width) + offset * stride_width - 1; + T y_ctr = (h_idx * stride_height) + offset * stride_height - 1; + T area, area_ratios; + T base_w, base_h; + T scale_w, scale_h; + T anchor_width, anchor_height; + int idx = 0; + for (size_t r = 0; r < aspect_ratios.size(); ++r) { + auto ar = aspect_ratios[r]; + for (size_t s = 0; s < anchor_sizes.size(); ++s) { + auto anchor_size = anchor_sizes[s]; + area = stride_width * stride_height; + area_ratios = area / ar; + base_w = round(sqrt(area_ratios)); + base_h = round(base_w * ar); + scale_w = anchor_size / stride_width; + scale_h = anchor_size / stride_height; + anchor_width = scale_w * base_w; + anchor_height = scale_h * base_h; + for (size_t a = 0; a < angles.size(); ++a) { + auto angle = angles[a]; + e_anchors(h_idx, w_idx, idx, 0) = x_ctr; + e_anchors(h_idx, w_idx, idx, 1) = y_ctr; + e_anchors(h_idx, w_idx, idx, 2) = anchor_width; + e_anchors(h_idx, w_idx, idx, 3) = anchor_height; + e_anchors(h_idx, 
w_idx, idx, 4) = angle; + idx++; + } + } + } + } + } + + framework::Tensor var_t; + var_t.mutable_data( + framework::make_ddim({1, static_cast(variances.size())}), + ctx.GetPlace()); + auto var_et = framework::EigenTensor::From(var_t); + for (size_t i = 0; i < variances.size(); ++i) { + var_et(0, i) = variances[i]; + } + + int anchor_num = feature_height * feature_width * num_anchors; + auto var_dim = vars->dims(); + vars->Resize({anchor_num, static_cast(variances.size())}); + + auto e_vars = framework::EigenMatrix::From(*vars); + e_vars = var_et.broadcast(Eigen::DSizes(anchor_num, 1)); + + vars->Resize(var_dim); + } +}; + +} // namespace operators +} // namespace paddle diff --git a/PaddleCV/rrpn/models/ext_op/src/rrpn_box_coder_op.cc b/PaddleCV/rrpn/models/ext_op/src/rrpn_box_coder_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..63c6c6e97e1bf88b0a95f5e4d965004be1321517 --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/rrpn_box_coder_op.cc @@ -0,0 +1,128 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +//#include "rrpn_box_coder_op.h" +#include +#include +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +class RRPNBoxCoderOp : public framework::OperatorWithKernel { +public: + using framework::OperatorWithKernel::OperatorWithKernel; + +protected: + void InferShape(framework::InferShapeContext *ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("PriorBox"), + "Input(PriorBox) of BoxCoderOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("TargetBox"), + "Input(TargetBox) of BoxCoderOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("OutputBox"), + "Output(OutputBox) of BoxCoderOp should not be null."); + + auto prior_box_dims = ctx->GetInputDim("PriorBox"); + // auto target_box_dims = ctx->GetInputDim("TargetBox"); + + if (ctx->IsRuntime()) { + PADDLE_ENFORCE_EQ( + prior_box_dims.size(), 2, "The rank of Input PriorBox must be 2"); + PADDLE_ENFORCE_EQ( + prior_box_dims[1], 5, "The shape of PriorBox is [N, 5]"); + if (ctx->HasInput("PriorBoxVar")) { + auto prior_box_var_dims = ctx->GetInputDim("PriorBoxVar"); + PADDLE_ENFORCE(prior_box_var_dims.size() == 2, + "Input(PriorBoxVar) of BoxCoderOp should be 2."); + PADDLE_ENFORCE_EQ( + prior_box_dims, + prior_box_var_dims, + "The dimension of Input(PriorBoxVar) should be equal to" + "the dimension of Input(PriorBox) when the rank is 2."); + } + } + } +}; + +class RRPNBoxCoderOpMaker : public framework::OpProtoAndCheckerMaker { +public: + void Make() override { + AddInput( + "PriorBox", + "(Tensor, default Tensor) " + "Box list PriorBox is a 2-D Tensor with shape [M, 5] holds M boxes, " + "each box is represented as [x, y, w, h, angle], " + "[x, y] is the center coordinate of the anchor box, " + "if the input is image feature map, they are close to the origin " + "of the coordinate system. 
[w, h] is the width and height " + "of the anchor box, angle is angle of rotation."); + AddInput("PriorBoxVar", + "(Tensor, default Tensor, optional) " + "PriorBoxVar is a 2-D Tensor with shape [M, 5] holds M group " + "of variance. PriorBoxVar will set all elements to 1 by " + "default.") + .AsDispensable(); + AddInput( + "TargetBox", + "(LoDTensor or Tensor) This input can be a 2-D LoDTensor with shape " + "[N, 5], each box is represented as [x, y, w, h, angle]," + "[x, y] is the center coordinate of the box, [w, h] is width and " + "height of the box," + "angle is angle of rotation around the center of box."); + AddAttr>( + "variance", + "(vector, default {})," + "variance of prior box with shape [5]. PriorBoxVar and variance can" + "not be provided at the same time.") + .SetDefault(std::vector{}); + AddOutput("OutputBox", + "(Tensor) " + "2-D Tensor with shape [M, 5] which M represents the number of " + "deocded boxes" + "and 5 represents [x, y, w, h, angle]"); + + AddComment(R"DOC( + +Rotatedi Bounding Box Coder. + +Decode the target bounding box with the priorbox information. + +The Decoding schema described below: + + ox = pw * tx / pxv + cx + + oy = ph * ty / pyv + cy + + ow = exp(tw / pwv) * pw + + oh = exp(th / phv) * ph + + oa = ta / pav * 1.0 / 3.141592653 * 180 + pa + +where `tx`, `ty`, `tw`, `th`, `ta` denote the target box's center coordinates, width +,height and angle respectively. Similarly, `px`, `py`, `pw`, `ph`, `pa` denote the +priorbox's (anchor) center coordinates, width, height and angle. `pxv`, `pyv`, `pwv`, +`phv`, `pav` denote the variance of the priorbox and `ox`, `oy`, `ow`, `oh`, `oa` +denote the encoded/decoded coordinates, width and height. +)DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR( + rrpn_box_coder, + ops::RRPNBoxCoderOp, + ops::RRPNBoxCoderOpMaker, + paddle::framework::EmptyGradOpMaker, + paddle::framework::EmptyGradOpMaker); diff --git a/PaddleCV/rrpn/models/ext_op/src/rrpn_box_coder_op.cu b/PaddleCV/rrpn/models/ext_op/src/rrpn_box_coder_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..9640f0fff0019e2a3abe9be6efb09ad5b0569555 --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/rrpn_box_coder_op.cu @@ -0,0 +1,198 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include +#include +#include "paddle/fluid/memory/memory.h" +//#include "rrpn_box_coder_op.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/platform/cuda_primitives.h" + +namespace paddle { +namespace operators { + +#define PI 3.141592654 + +template +__global__ void DecodeCenterSizeKernel(const T* prior_box_data, + const T* prior_box_var_data, + const T* target_box_data, + const int row, + const int len, + const T prior_box_var_size, + const float* variance, + const int var_size, + T* output) { + const int idx = threadIdx.x + blockIdx.x * blockDim.x; + int prior_box_offset = 0; + if (idx < row) { + const int row_idx = idx; + prior_box_offset = row_idx * len; + T prior_box_width = prior_box_data[prior_box_offset + 2]; + + T prior_box_height = prior_box_data[prior_box_offset + 3]; + + T prior_box_center_x = prior_box_data[prior_box_offset]; + T prior_box_center_y = prior_box_data[prior_box_offset + 1]; + T prior_box_angle = prior_box_data[prior_box_offset + 4]; + + T target_box_width, target_box_height, target_box_angle; + T target_box_center_x, target_box_center_y; + T box_var_x = T(1), box_var_y = T(1); + T box_var_w = T(1), box_var_h = T(1), box_var_angle = T(1); + if (prior_box_var_data) { + int prior_var_offset = row_idx * len; + box_var_x = prior_box_var_data[prior_var_offset]; + box_var_y = prior_box_var_data[prior_var_offset + 1]; + box_var_w = prior_box_var_data[prior_var_offset + 2]; + box_var_h = prior_box_var_data[prior_var_offset + 3]; + box_var_angle = prior_box_var_data[prior_var_offset + 4]; + } else if (var_size == 5) { + box_var_x = static_cast(variance[0]); + box_var_y = static_cast(variance[1]); + box_var_w = static_cast(variance[2]); + box_var_h = static_cast(variance[3]); + box_var_angle = static_cast(variance[4]); + } + target_box_width = + exp(target_box_data[idx * len + 2] / box_var_w) * prior_box_width / 1.4; + target_box_height = exp(target_box_data[idx * len + 3] / box_var_h) * + prior_box_height / 1.4; + target_box_center_x = + target_box_data[idx * len] / box_var_x * prior_box_width + + prior_box_center_x; + target_box_center_y = + target_box_data[idx * len + 1] / box_var_y * prior_box_height + + prior_box_center_y; + + target_box_angle = + (target_box_data[idx * len + 4] / box_var_angle) * 1.0 / PI * 180 + + prior_box_angle; + + T a_cos = cos(PI / 180 * target_box_angle); + T a_sin = -sin(PI / 180 * target_box_angle); + + T rotation_matrix[3][3]; + + rotation_matrix[0][0] = a_cos; + rotation_matrix[0][1] = a_sin; + rotation_matrix[0][2] = 0; + rotation_matrix[1][0] = -a_sin; + rotation_matrix[1][1] = a_cos; + rotation_matrix[1][2] = 0; + rotation_matrix[2][0] = -target_box_center_x * a_cos + + target_box_center_y * a_sin + target_box_center_x; + rotation_matrix[2][1] = -target_box_center_x * a_sin - + target_box_center_y * a_cos + target_box_center_y; + rotation_matrix[2][2] = 1; + + T pt_x0 = target_box_center_x - target_box_width / 2; + T pt_x1 = target_box_center_x + target_box_width / 2; + T pt_x2 = target_box_center_x + target_box_width / 2; + T pt_x3 = target_box_center_x - target_box_width / 2; + + T pt_y0 = target_box_center_y - target_box_height / 2; + T pt_y1 = target_box_center_y - target_box_height / 2; + T pt_y2 = target_box_center_y + target_box_height / 2; + T pt_y3 = target_box_center_y + target_box_height / 2; + + + output[idx * 8] = pt_x0 * rotation_matrix[0][0] + + pt_y0 * rotation_matrix[1][0] + rotation_matrix[2][0]; + output[idx * 8 + 1] = pt_x0 * rotation_matrix[0][1] + + pt_y0 * 
rotation_matrix[1][1] + rotation_matrix[2][1]; + output[idx * 8 + 2] = pt_x1 * rotation_matrix[0][0] + + pt_y1 * rotation_matrix[1][0] + rotation_matrix[2][0]; + output[idx * 8 + 3] = pt_x1 * rotation_matrix[0][1] + + pt_y1 * rotation_matrix[1][1] + rotation_matrix[2][1]; + output[idx * 8 + 4] = pt_x2 * rotation_matrix[0][0] + + pt_y2 * rotation_matrix[1][0] + rotation_matrix[2][0]; + output[idx * 8 + 5] = pt_x2 * rotation_matrix[0][1] + + pt_y2 * rotation_matrix[1][1] + rotation_matrix[2][1]; + output[idx * 8 + 6] = pt_x3 * rotation_matrix[0][0] + + pt_y3 * rotation_matrix[1][0] + rotation_matrix[2][0]; + output[idx * 8 + 7] = pt_x3 * rotation_matrix[0][1] + + pt_y3 * rotation_matrix[1][1] + rotation_matrix[2][1]; + } +} + +template +class RRPNBoxCoderCUDAKernel : public framework::OpKernel { +public: + void Compute(const framework::ExecutionContext& context) const override { + PADDLE_ENFORCE(platform::is_gpu_place(context.GetPlace()), + "This kernel only runs on GPU device."); + auto* prior_box = context.Input("PriorBox"); + auto* prior_box_var = context.Input("PriorBoxVar"); + auto* target_box = context.Input("TargetBox"); + auto* output_box = context.Output("OutputBox"); + std::vector variance = context.Attr>("variance"); + const T* prior_box_data = prior_box->data(); + const T* target_box_data = target_box->data(); + const T* prior_box_var_data = nullptr; + auto prior_box_var_size = 0; + if (prior_box_var) { + PADDLE_ENFORCE(variance.empty(), + "Input 'PriorBoxVar' and attribute 'variance' should not" + "be used at the same time."); + prior_box_var_data = prior_box_var->data(); + prior_box_var_size = prior_box_var->dims().size(); + } + if (!(variance.empty())) { + PADDLE_ENFORCE(static_cast(variance.size()) == 5, + "Size of attribute 'variance' should be 4"); + } + + if (target_box->lod().size()) { + PADDLE_ENFORCE_EQ( + target_box->lod().size(), 1, "Only support 1 level of LoD."); + } + const int var_size = static_cast(variance.size()); + auto row = target_box->dims()[0]; + auto len = 5; + int block = 512; + int grid = (row + block - 1) / block; + auto& device_ctx = context.cuda_device_context(); + + int bytes = var_size * sizeof(float); + auto dev_var = memory::Alloc(device_ctx, bytes); + float* dev_var_data = reinterpret_cast(dev_var->ptr()); + auto cplace = platform::CPUPlace(); + const auto gplace = boost::get(context.GetPlace()); + memory::Copy( + gplace, dev_var_data, cplace, &variance[0], bytes, device_ctx.stream()); + + output_box->mutable_data({row, 8}, context.GetPlace()); + T* output = output_box->data(); + + DecodeCenterSizeKernel<<>>( + prior_box_data, + prior_box_var_data, + target_box_data, + row, + len, + prior_box_var_size, + dev_var_data, + var_size, + output); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL( + rrpn_box_coder, + ops::RRPNBoxCoderCUDAKernel, + ops::RRPNBoxCoderCUDAKernel); diff --git a/PaddleCV/rrpn/models/ext_op/src/rrpn_generate_proposal_labels_op.cc b/PaddleCV/rrpn/models/ext_op/src/rrpn_generate_proposal_labels_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..3174df86e9276af31bdbe184a7e1a51f6a49aba3 --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/rrpn_generate_proposal_labels_op.cc @@ -0,0 +1,638 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include +#include +#include +#include "bbox_util.h" +#include "concat_and_split.h" +#include "gather.h" +#include "math_function.h" +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +using LoDTensor = framework::LoDTensor; +const int kBoxDim = 5; + +template +void AppendRois(LoDTensor* out, int64_t offset, Tensor* to_add) { + auto* out_data = out->data(); + auto* to_add_data = to_add->data(); + memcpy(out_data + offset, to_add_data, to_add->numel() * sizeof(T)); +} + + +class RRPNGenerateProposalLabelsOp : public framework::OperatorWithKernel { +public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("RpnRois"), + "Input(RpnRois) shouldn't be null."); + PADDLE_ENFORCE(ctx->HasInput("GtClasses"), + "Input(GtClasses) shouldn't be null."); + PADDLE_ENFORCE(ctx->HasInput("IsCrowd"), + "Input(IsCrowd) shouldn't be null."); + PADDLE_ENFORCE(ctx->HasInput("GtBoxes"), + "Input(GtBoxes) shouldn't be null."); + PADDLE_ENFORCE(ctx->HasInput("ImInfo"), "Input(ImInfo) shouldn't be null."); + + PADDLE_ENFORCE( + ctx->HasOutput("Rois"), + "Output(Rois) of RRPNGenerateProposalLabelsOp should not be null"); + PADDLE_ENFORCE(ctx->HasOutput("LabelsInt32"), + "Output(LabelsInt32) of RRPNGenerateProposalLabelsOp should " + "not be null"); + PADDLE_ENFORCE(ctx->HasOutput("BboxTargets"), + "Output(BboxTargets) of RRPNGenerateProposalLabelsOp should " + "not be null"); + PADDLE_ENFORCE(ctx->HasOutput("BboxInsideWeights"), + "Output(BboxInsideWeights) of RRPNGenerateProposalLabelsOp " + "should not be null"); + PADDLE_ENFORCE(ctx->HasOutput("BboxOutsideWeights"), + "Output(BboxOutsideWeights) of RRPNGenerateProposalLabelsOp " + "should not be null"); + + auto rpn_rois_dims = ctx->GetInputDim("RpnRois"); + auto gt_boxes_dims = ctx->GetInputDim("GtBoxes"); + auto im_info_dims = ctx->GetInputDim("ImInfo"); + + PADDLE_ENFORCE_EQ( + rpn_rois_dims.size(), 2, "The rank of Input(RpnRois) must be 2."); + PADDLE_ENFORCE_EQ( + gt_boxes_dims.size(), 2, "The rank of Input(GtBoxes) must be 2."); + PADDLE_ENFORCE_EQ( + im_info_dims.size(), 2, "The rank of Input(ImInfo) must be 2."); + + int class_nums = ctx->Attrs().Get("class_nums"); + + ctx->SetOutputDim("Rois", {-1, 5}); + ctx->SetOutputDim("LabelsInt32", {-1, 1}); + ctx->SetOutputDim("BboxTargets", {-1, 5 * class_nums}); + ctx->SetOutputDim("BboxInsideWeights", {-1, 5 * class_nums}); + ctx->SetOutputDim("BboxOutsideWeights", {-1, 5 * class_nums}); + } + +protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType( + ctx.Input("RpnRois")->type(), + platform::CPUPlace()); + } +}; + +template +void Concat(const platform::CPUDeviceContext& context, + const Tensor& in_tensor_a, + const Tensor& in_tensor_b, + Tensor* out_tensor) { + int axis = 0; + std::vector inputs; + inputs.emplace_back(in_tensor_a); + inputs.emplace_back(in_tensor_b); + math::ConcatFunctor concat_functor; + 
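+  // Concatenate the two inputs along axis 0 (rows). This helper is used both
+  // to stack the ground-truth boxes on top of the RPN proposals before
+  // computing overlaps, and to join the sampled foreground/background boxes
+  // and labels into single tensors.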
concat_functor(context, inputs, axis, out_tensor); +} + +template +std::vector> SampleFgBgGt( + const platform::CPUDeviceContext& context, + Tensor* iou, + const Tensor& is_crowd, + const int batch_size_per_im, + const float fg_fraction, + const float fg_thresh, + const float bg_thresh_hi, + const float bg_thresh_lo, + std::minstd_rand engine, + const bool use_random, + const Tensor& rpn_rois) { + std::vector fg_inds; + std::vector bg_inds; + std::vector mapped_gt_inds; + int64_t gt_num = is_crowd.numel(); + const int* crowd_data = is_crowd.data(); + T* proposal_to_gt_overlaps = iou->data(); + int64_t row = iou->dims()[0]; + int64_t col = iou->dims()[1]; + float epsilon = 0.00001; + const T* rpn_rois_dt = rpn_rois.data(); + // Follow the Faster RCNN's implementation + for (int64_t i = 0; i < row; ++i) { + const T* v = proposal_to_gt_overlaps + i * col; + T max_overlap = *std::max_element(v, v + col); + if ((i < gt_num) && (crowd_data[i])) { + max_overlap = -1.0; + } + if (max_overlap >= fg_thresh) { + // fg mapped gt label index + for (int64_t j = 0; j < col; ++j) { + T val = proposal_to_gt_overlaps[i * col + j]; + auto diff = std::abs(max_overlap - val); + if (diff < epsilon) { + fg_inds.emplace_back(i); + mapped_gt_inds.emplace_back(j); + break; + } + } + } else if ((max_overlap >= bg_thresh_lo) && (max_overlap < bg_thresh_hi)) { + bg_inds.emplace_back(i); + } else { + continue; + } + } + + std::vector> res; + // sampling fg + std::uniform_real_distribution uniform(0, 1); + int fg_rois_per_im = std::floor(batch_size_per_im * fg_fraction); + int fg_rois_this_image = fg_inds.size(); + int fg_rois_per_this_image = std::min(fg_rois_per_im, fg_rois_this_image); + if (use_random) { + const int64_t fg_size = static_cast(fg_inds.size()); + if (fg_size > fg_rois_per_this_image) { + for (int64_t i = fg_rois_per_this_image; i < fg_size; ++i) { + int rng_ind = std::floor(uniform(engine) * i); + if (rng_ind < fg_rois_per_this_image) { + std::iter_swap(fg_inds.begin() + rng_ind, fg_inds.begin() + i); + std::iter_swap(mapped_gt_inds.begin() + rng_ind, + mapped_gt_inds.begin() + i); + } + } + } + } + std::vector new_fg_inds(fg_inds.begin(), + fg_inds.begin() + fg_rois_per_this_image); + std::vector new_gt_inds(mapped_gt_inds.begin(), + mapped_gt_inds.begin() + fg_rois_per_this_image); + // sampling bg + int bg_rois_per_image = batch_size_per_im - fg_rois_per_this_image; + int bg_rois_this_image = bg_inds.size(); + int bg_rois_per_this_image = std::min(bg_rois_per_image, bg_rois_this_image); + if (use_random) { + const int64_t bg_size = static_cast(bg_inds.size()); + if (bg_size > bg_rois_per_this_image) { + for (int64_t i = bg_rois_per_this_image; i < bg_size; ++i) { + int rng_ind = std::floor(uniform(engine) * i); + if (rng_ind < fg_rois_per_this_image) + std::iter_swap(bg_inds.begin() + rng_ind, bg_inds.begin() + i); + } + } + } + std::vector new_bg_inds(bg_inds.begin(), + bg_inds.begin() + bg_rois_per_this_image); + res.emplace_back(new_fg_inds); + res.emplace_back(new_bg_inds); + res.emplace_back(new_gt_inds); + + return res; +} + +template +void GatherBoxesLabels(const platform::CPUDeviceContext& context, + const Tensor& boxes, + const Tensor& gt_boxes, + const Tensor& gt_classes, + const std::vector& fg_inds, + const std::vector& bg_inds, + const std::vector& gt_inds, + Tensor* sampled_boxes, + Tensor* sampled_labels, + Tensor* sampled_gts) { + int fg_num = fg_inds.size(); + int bg_num = bg_inds.size(); + Tensor fg_inds_t, bg_inds_t, gt_box_inds_t, gt_label_inds_t; + int* fg_inds_data = 
fg_inds_t.mutable_data({fg_num}, context.GetPlace()); + int* bg_inds_data = bg_inds_t.mutable_data({bg_num}, context.GetPlace()); + int* gt_box_inds_data = + gt_box_inds_t.mutable_data({fg_num}, context.GetPlace()); + int* gt_label_inds_data = + gt_label_inds_t.mutable_data({fg_num}, context.GetPlace()); + std::copy(fg_inds.begin(), fg_inds.end(), fg_inds_data); + std::copy(bg_inds.begin(), bg_inds.end(), bg_inds_data); + std::copy(gt_inds.begin(), gt_inds.end(), gt_box_inds_data); + std::copy(gt_inds.begin(), gt_inds.end(), gt_label_inds_data); + + Tensor fg_boxes, bg_boxes, fg_labels, bg_labels; + fg_boxes.mutable_data({fg_num, kBoxDim}, context.GetPlace()); + CPUGather(context, boxes, fg_inds_t, &fg_boxes); + bg_boxes.mutable_data({bg_num, kBoxDim}, context.GetPlace()); + CPUGather(context, boxes, bg_inds_t, &bg_boxes); + Concat(context, fg_boxes, bg_boxes, sampled_boxes); + CPUGather(context, gt_boxes, gt_box_inds_t, sampled_gts); + fg_labels.mutable_data({fg_num}, context.GetPlace()); + CPUGather(context, gt_classes, gt_label_inds_t, &fg_labels); + bg_labels.mutable_data({bg_num}, context.GetPlace()); + math::set_constant(context, &bg_labels, 0); + Concat(context, fg_labels, bg_labels, sampled_labels); +} + +template +std::vector SampleRoisForOneImage( + const platform::CPUDeviceContext& context, + const Tensor& rpn_rois_in, + const Tensor& gt_classes, + const Tensor& is_crowd, + const Tensor& gt_boxes, + const Tensor& im_info, + const int batch_size_per_im, + const float fg_fraction, + const float fg_thresh, + const float bg_thresh_hi, + const float bg_thresh_lo, + const std::vector& bbox_reg_weights, + const int class_nums, + std::minstd_rand engine, + bool use_random, + bool is_cls_agnostic) { + // 1.1 map to original image + auto im_scale = im_info.data()[2]; + Tensor rpn_rois_slice; + Tensor rpn_rois; + + rpn_rois.mutable_data(rpn_rois_in.dims(), context.GetPlace()); + const T* rpn_rois_in_dt = rpn_rois_in.data(); + T* rpn_rois_dt = rpn_rois.data(); + for (int i = 0; i < rpn_rois.numel(); ++i) { + rpn_rois_dt[i] = rpn_rois_in_dt[i]; + } + + // 1.2 compute overlaps + int proposals_num = gt_boxes.dims()[0] + rpn_rois.dims()[0]; + Tensor boxes; + boxes.mutable_data({proposals_num, kBoxDim}, context.GetPlace()); + Concat(context, gt_boxes, rpn_rois, &boxes); + Tensor proposal_to_gt_overlaps; + proposal_to_gt_overlaps.mutable_data({proposals_num, gt_boxes.dims()[0]}, + context.GetPlace()); + BboxOverlaps2(boxes, gt_boxes, &proposal_to_gt_overlaps); + std::vector> fg_bg_gt = + SampleFgBgGt(context, + &proposal_to_gt_overlaps, + is_crowd, + batch_size_per_im, + fg_fraction, + fg_thresh, + bg_thresh_hi, + bg_thresh_lo, + engine, + use_random, + boxes); + std::vector fg_inds = fg_bg_gt[0]; + std::vector bg_inds = fg_bg_gt[1]; + std::vector mapped_gt_inds = fg_bg_gt[2]; // mapped_gt_labels + + + Tensor sampled_boxes, sampled_labels, sampled_gts; + int fg_num = fg_inds.size(); + int bg_num = bg_inds.size(); + int boxes_num = fg_num + bg_num; + framework::DDim bbox_dim({boxes_num, kBoxDim}); + + sampled_boxes.mutable_data(bbox_dim, context.GetPlace()); + + sampled_labels.mutable_data({boxes_num}, context.GetPlace()); + + sampled_gts.mutable_data({fg_num, kBoxDim}, context.GetPlace()); + + GatherBoxesLabels(context, + boxes, + gt_boxes, + gt_classes, + fg_inds, + bg_inds, + mapped_gt_inds, + &sampled_boxes, + &sampled_labels, + &sampled_gts); + + // Compute targets + Tensor bbox_targets_single; + bbox_targets_single.mutable_data(bbox_dim, context.GetPlace()); + BoxToDelta2(fg_num, + 
sampled_boxes, + sampled_gts, + bbox_reg_weights.data(), + &bbox_targets_single); + + // Scale rois + Tensor sampled_rois; + sampled_rois.mutable_data(sampled_boxes.dims(), context.GetPlace()); + auto sampled_rois_et = framework::EigenTensor::From(sampled_rois); + auto sampled_boxes_et = framework::EigenTensor::From(sampled_boxes); + + sampled_rois_et = sampled_boxes_et; + // Expand box targets + Tensor bbox_targets, bbox_inside_weights, bbox_outside_weights; + framework::DDim bbox_expand_dim({boxes_num, kBoxDim * class_nums}); + bbox_targets.mutable_data(bbox_expand_dim, context.GetPlace()); + bbox_inside_weights.mutable_data(bbox_expand_dim, context.GetPlace()); + bbox_outside_weights.mutable_data(bbox_expand_dim, context.GetPlace()); + math::set_constant(context, &bbox_targets, 0.0); + math::set_constant(context, &bbox_inside_weights, 0.0); + math::set_constant(context, &bbox_outside_weights, 0.0); + + auto* bbox_targets_single_data = bbox_targets_single.data(); + auto* sampled_labels_data = sampled_labels.data(); + auto* bbox_targets_data = bbox_targets.data(); + auto* bbox_inside_weights_data = bbox_inside_weights.data(); + auto* bbox_outside_weights_data = bbox_outside_weights.data(); + int width = kBoxDim * class_nums; + + for (int64_t i = 0; i < boxes_num; ++i) { + int label = sampled_labels_data[i]; + + if (label > 0) { + if (is_cls_agnostic) { + label = 1; + } + + int dst_idx = i * width + kBoxDim * label; + int src_idx = kBoxDim * i; + bbox_targets_data[dst_idx] = bbox_targets_single_data[src_idx]; + bbox_targets_data[dst_idx + 1] = bbox_targets_single_data[src_idx + 1]; + bbox_targets_data[dst_idx + 2] = bbox_targets_single_data[src_idx + 2]; + bbox_targets_data[dst_idx + 3] = bbox_targets_single_data[src_idx + 3]; + bbox_targets_data[dst_idx + 4] = bbox_targets_single_data[src_idx + 4]; + + bbox_inside_weights_data[dst_idx] = 1; + bbox_inside_weights_data[dst_idx + 1] = 1; + bbox_inside_weights_data[dst_idx + 2] = 1; + bbox_inside_weights_data[dst_idx + 3] = 1; + bbox_inside_weights_data[dst_idx + 4] = 1; + + + bbox_outside_weights_data[dst_idx] = 1; + bbox_outside_weights_data[dst_idx + 1] = 1; + bbox_outside_weights_data[dst_idx + 2] = 1; + bbox_outside_weights_data[dst_idx + 3] = 1; + bbox_outside_weights_data[dst_idx + 4] = 1; + } + } + + + std::vector res; + res.emplace_back(sampled_rois); + res.emplace_back(sampled_labels); + res.emplace_back(bbox_targets); + res.emplace_back(bbox_inside_weights); + res.emplace_back(bbox_outside_weights); + return res; +} + +template +class RRPNGenerateProposalLabelsKernel : public framework::OpKernel { +public: + void Compute(const framework::ExecutionContext& context) const override { + auto* rpn_rois = context.Input("RpnRois"); + auto* gt_classes = context.Input("GtClasses"); + auto* is_crowd = context.Input("IsCrowd"); + auto* gt_boxes = context.Input("GtBoxes"); + auto* im_info = context.Input("ImInfo"); + + auto* rois = context.Output("Rois"); + auto* labels_int32 = context.Output("LabelsInt32"); + auto* bbox_targets = context.Output("BboxTargets"); + auto* bbox_inside_weights = context.Output("BboxInsideWeights"); + auto* bbox_outside_weights = + context.Output("BboxOutsideWeights"); + + int batch_size_per_im = context.Attr("batch_size_per_im"); + float fg_fraction = context.Attr("fg_fraction"); + float fg_thresh = context.Attr("fg_thresh"); + float bg_thresh_hi = context.Attr("bg_thresh_hi"); + float bg_thresh_lo = context.Attr("bg_thresh_lo"); + std::vector bbox_reg_weights = + context.Attr>("bbox_reg_weights"); + int class_nums 
= context.Attr("class_nums"); + bool use_random = context.Attr("use_random"); + bool is_cls_agnostic = context.Attr("is_cls_agnostic"); + PADDLE_ENFORCE_EQ( + rpn_rois->lod().size(), + 1UL, + "RRPNGenerateProposalLabelsOp rpn_rois needs 1 level of LoD"); + PADDLE_ENFORCE_EQ( + gt_classes->lod().size(), + 1UL, + "RRPNGenerateProposalLabelsOp gt_classes needs 1 level of LoD"); + PADDLE_ENFORCE_EQ( + is_crowd->lod().size(), + 1UL, + "RRPNGenerateProposalLabelsOp is_crowd needs 1 level of LoD"); + PADDLE_ENFORCE_EQ( + gt_boxes->lod().size(), + 1UL, + "RRPNGenerateProposalLabelsOp gt_boxes needs 1 level of LoD"); + int64_t n = static_cast(rpn_rois->lod().back().size() - 1); + + rois->mutable_data({n * batch_size_per_im, kBoxDim}, context.GetPlace()); + labels_int32->mutable_data({n * batch_size_per_im, 1}, + context.GetPlace()); + bbox_targets->mutable_data({n * batch_size_per_im, kBoxDim * class_nums}, + context.GetPlace()); + bbox_inside_weights->mutable_data( + {n * batch_size_per_im, kBoxDim * class_nums}, context.GetPlace()); + bbox_outside_weights->mutable_data( + {n * batch_size_per_im, kBoxDim * class_nums}, context.GetPlace()); + + std::random_device rnd; + std::minstd_rand engine; + int seed = rnd(); + engine.seed(seed); + + framework::LoD lod; + std::vector lod0(1, 0); + + int64_t num_rois = 0; + auto& dev_ctx = context.device_context(); + + auto rpn_rois_lod = rpn_rois->lod().back(); + auto gt_classes_lod = gt_classes->lod().back(); + auto is_crowd_lod = is_crowd->lod().back(); + auto gt_boxes_lod = gt_boxes->lod().back(); + + for (int i = 0; i < n; ++i) { + if (rpn_rois_lod[i] == rpn_rois_lod[i + 1]) { + lod0.emplace_back(num_rois); + continue; + } + Tensor rpn_rois_slice = + rpn_rois->Slice(rpn_rois_lod[i], rpn_rois_lod[i + 1]); + Tensor gt_classes_slice = + gt_classes->Slice(gt_classes_lod[i], gt_classes_lod[i + 1]); + Tensor is_crowd_slice = + is_crowd->Slice(is_crowd_lod[i], is_crowd_lod[i + 1]); + Tensor gt_boxes_slice = + gt_boxes->Slice(gt_boxes_lod[i], gt_boxes_lod[i + 1]); + Tensor im_info_slice = im_info->Slice(i, i + 1); + + std::vector tensor_output = + SampleRoisForOneImage(dev_ctx, + rpn_rois_slice, + gt_classes_slice, + is_crowd_slice, + gt_boxes_slice, + im_info_slice, + batch_size_per_im, + fg_fraction, + fg_thresh, + bg_thresh_hi, + bg_thresh_lo, + bbox_reg_weights, + class_nums, + engine, + use_random, + is_cls_agnostic); + Tensor sampled_rois = tensor_output[0]; + Tensor sampled_labels_int32 = tensor_output[1]; + Tensor sampled_bbox_targets = tensor_output[2]; + Tensor sampled_bbox_inside_weights = tensor_output[3]; + Tensor sampled_bbox_outside_weights = tensor_output[4]; + + AppendRois(rois, kBoxDim * num_rois, &sampled_rois); + AppendRois(labels_int32, num_rois, &sampled_labels_int32); + AppendRois( + bbox_targets, kBoxDim * num_rois * class_nums, &sampled_bbox_targets); + AppendRois(bbox_inside_weights, + kBoxDim * num_rois * class_nums, + &sampled_bbox_inside_weights); + AppendRois(bbox_outside_weights, + kBoxDim * num_rois * class_nums, + &sampled_bbox_outside_weights); + + num_rois += sampled_rois.dims()[0]; + lod0.emplace_back(num_rois); + } + + lod.emplace_back(lod0); + rois->set_lod(lod); + labels_int32->set_lod(lod); + bbox_targets->set_lod(lod); + bbox_inside_weights->set_lod(lod); + bbox_outside_weights->set_lod(lod); + rois->Resize({num_rois, kBoxDim}); + labels_int32->Resize({num_rois, 1}); + bbox_targets->Resize({num_rois, kBoxDim * class_nums}); + bbox_inside_weights->Resize({num_rois, kBoxDim * class_nums}); + 
bbox_outside_weights->Resize({num_rois, kBoxDim * class_nums}); + } +}; + +class RRPNGenerateProposalLabelsOpMaker + : public framework::OpProtoAndCheckerMaker { +public: + void Make() override { + AddInput("RpnRois", + "(LoDTensor), This input is a 2D LoDTensor with shape [N, 5]. " + "N is the number of the GenerateProposalOp's output, " + "each element is a bounding box with [x, y, w, h, angle] format."); + AddInput("GtClasses", + "(LoDTensor), This input is a 2D LoDTensor with shape [M, 1]. " + "M is the number of groundtruth, " + "each element is a class label of groundtruth."); + AddInput( + "IsCrowd", + "(LoDTensor), This input is a 2D LoDTensor with shape [M, 1]. " + "M is the number of groundtruth, " + "each element is a flag indicates whether a groundtruth is crowd."); + AddInput("GtBoxes", + "(LoDTensor), This input is a 2D LoDTensor with shape [M, 5. " + "M is the number of groundtruth, " + "each element is a bounding box with [x, y, w, h, angle] format."); + AddInput("ImInfo", + "(Tensor), This input is a 2D Tensor with shape [B, 3]. " + "B is the number of input images, " + "each element consists of im_height, im_width, im_scale."); + + AddOutput( + "Rois", + "(LoDTensor), This output is a 2D LoDTensor with shape [P, 5]. " + "P usuall equal to batch_size_per_im * batch_size, " + "each element is a bounding box with [x, y, w, h ,angle] format."); + AddOutput("LabelsInt32", + "(LoDTensor), This output is a 2D LoDTensor with shape [P, 1], " + "each element repersents a class label of a roi"); + AddOutput("BboxTargets", + "(LoDTensor), This output is a 2D LoDTensor with shape [P, 5 * " + "class_nums], " + "each element repersents a box label of a roi"); + AddOutput( + "BboxInsideWeights", + "(LoDTensor), This output is a 2D LoDTensor with shape [P, 5 * " + "class_nums], " + "each element indicates whether a box should contribute to loss."); + AddOutput( + "BboxOutsideWeights", + "(LoDTensor), This output is a 2D LoDTensor with shape [P, 5 * " + "class_nums], " + "each element indicates whether a box should contribute to loss."); + + AddAttr("batch_size_per_im", "Batch size of rois per images."); + AddAttr("fg_fraction", + "Foreground fraction in total batch_size_per_im."); + AddAttr( + "fg_thresh", + "Overlap threshold which is used to chose foreground sample."); + AddAttr("bg_thresh_hi", + "Overlap threshold upper bound which is used to chose " + "background sample."); + AddAttr("bg_thresh_lo", + "Overlap threshold lower bound which is used to chose " + "background sample."); + AddAttr>("bbox_reg_weights", "Box regression weights."); + AddAttr("class_nums", "Class number."); + AddAttr( + "use_random", + "Use random sampling to choose foreground and background boxes.") + .SetDefault(true); + AddAttr( + "is_cls_agnostic", + "the box regress will only include fg and bg locations if set true ") + .SetDefault(false); + + AddComment(R"DOC( +This operator can be, for given the RotatedGenerateProposalOp output rotated bounding boxes and groundtruth, +to sample foreground boxes and background boxes, and compute loss target. + +RpnRois is the output boxes of RPN and was processed by rotated_generate_proposal_op, these boxes +were combined with groundtruth boxes and sampled according to batch_size_per_im and fg_fraction, +If an instance with a groundtruth overlap greater than fg_thresh, then it was considered as a foreground sample. +If an instance with a groundtruth overlap greater than bg_thresh_lo and lower than bg_thresh_hi, +then it was considered as a background sample. 
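+For example, with fg_thresh = 0.5, bg_thresh_hi = 0.5 and bg_thresh_lo = 0.0
+(illustrative values, all three are attributes of this operator), a RoI whose
+best groundtruth overlap is 0.7 is sampled as foreground, while one whose best
+overlap is 0.3 is sampled as background.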
+After all foreground and background boxes are chosen (so called Rois), +then we apply random sampling to make sure +the number of foreground boxes is no more than batch_size_per_im * fg_fraction. + +For each box in Rois, we assign the classification (class label) and regression targets (box label) to it. +Finally BboxInsideWeights and BboxOutsideWeights are used to specify whether it would contribute to training loss. + )DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR( + rrpn_generate_proposal_labels, + ops::RRPNGenerateProposalLabelsOp, + ops::RRPNGenerateProposalLabelsOpMaker, + paddle::framework::EmptyGradOpMaker, + paddle::framework::EmptyGradOpMaker); +REGISTER_OP_CPU_KERNEL(rrpn_generate_proposal_labels, + ops::RRPNGenerateProposalLabelsKernel, + ops::RRPNGenerateProposalLabelsKernel); diff --git a/PaddleCV/rrpn/models/ext_op/src/rrpn_generate_proposals_op.cc b/PaddleCV/rrpn/models/ext_op/src/rrpn_generate_proposals_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..5e344f0df5c359efbe95eb5edd80d4456f651c93 --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/rrpn_generate_proposals_op.cc @@ -0,0 +1,694 @@ +/*opyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include +#include +#include +#include +#include "gather.h" +#include "math_function.h" +#include "paddle/fluid/framework/op_registry.h" +#include "safe_ref.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +using LoDTensor = framework::LoDTensor; + +static const double kBBoxClipDefault = std::log(1000.0 / 16.0); +#define PI 3.141592654 + +static void RRPNAppendProposals(Tensor *dst, + int64_t offset, + const Tensor &src) { + auto *out_data = dst->data(); + auto *to_add_data = src.data(); + size_t size_of_t = framework::SizeOfType(src.type()); + offset *= size_of_t; + std::memcpy( + reinterpret_cast(reinterpret_cast(out_data) + offset), + to_add_data, + src.numel() * size_of_t); +} + +template +inline T axr(T x, T r) { + return 0.5 * PI * r * r - x * sqrt(r * r - x * x) - r * r * std::asin(x / r); +} + +class RRPNGenerateProposalsOp : public framework::OperatorWithKernel { +public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext *ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("Scores"), "Input(Scores) shouldn't be null."); + PADDLE_ENFORCE(ctx->HasInput("BboxDeltas"), + "Input(BboxDeltas) shouldn't be null."); + PADDLE_ENFORCE(ctx->HasInput("ImInfo"), "Input(ImInfo) shouldn't be null."); + PADDLE_ENFORCE(ctx->HasInput("Anchors"), + "Input(Anchors) shouldn't be null."); + PADDLE_ENFORCE(ctx->HasInput("Variances"), + "Input(Variances) shouldn't be null."); + + ctx->SetOutputDim("RpnRois", {-1, 5}); + ctx->SetOutputDim("RpnRoiProbs", {-1, 1}); + } + +protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext &ctx) const override { + return framework::OpKernelType(ctx.Input("Anchors")->type(), + ctx.device_context()); + } +}; + +template +static inline void RBoxCoder(const platform::DeviceContext &ctx, + Tensor *all_anchors, + Tensor *bbox_deltas, + Tensor *variances, + Tensor *proposals) { + T *proposals_data = proposals->mutable_data(ctx.GetPlace()); + + int64_t row = all_anchors->dims()[0]; + int64_t len = all_anchors->dims()[1]; + + auto *bbox_deltas_data = bbox_deltas->data(); + auto *anchor_data = all_anchors->data(); + const T *variances_data = nullptr; + if (variances) { + variances_data = variances->data(); + } + + for (int64_t i = 0; i < row; ++i) { + T anchor_width = anchor_data[i * len + 2]; + T anchor_height = anchor_data[i * len + 3]; + T anchor_angle = anchor_data[i * len + 4]; + + T anchor_center_x = anchor_data[i * len]; + T anchor_center_y = anchor_data[i * len + 1]; + + T bbox_center_x = 0, bbox_center_y = 0; + T bbox_width = 0, bbox_height = 0, bbox_angle = 0; + + if (variances) { + bbox_center_x = + bbox_deltas_data[i * len] / variances_data[i * len] * anchor_width + + anchor_center_x; + bbox_center_y = bbox_deltas_data[i * len + 1] / + variances_data[i * len + 1] * anchor_height + + anchor_center_y; + bbox_width = std::exp(std::min(bbox_deltas_data[i * len + 2] / + variances_data[i * len + 2], + kBBoxClipDefault)) * + anchor_width; + bbox_height = std::exp(std::min(bbox_deltas_data[i * len + 3] / + variances_data[i * len + 3], + kBBoxClipDefault)) * + anchor_height; + bbox_angle = + (bbox_deltas_data[i * len + 4] / variances_data[i * len + 4]) * 1.0 / + PI * 180 + + anchor_angle; + + } else { + bbox_center_x = + bbox_deltas_data[i * len] * anchor_width + anchor_center_x; + bbox_center_y = + bbox_deltas_data[i * len + 1] * anchor_height + anchor_center_y; + bbox_width = std::exp(std::min(bbox_deltas_data[i * len 
+ 2], + kBBoxClipDefault)) * + anchor_width; + bbox_height = std::exp(std::min(bbox_deltas_data[i * len + 3], + kBBoxClipDefault)) * + anchor_height; + bbox_angle = + bbox_deltas_data[i * len + 4] * 1.0 / PI * 180 + anchor_angle; + } + + proposals_data[i * len] = bbox_center_x; + proposals_data[i * len + 1] = bbox_center_y; + proposals_data[i * len + 2] = bbox_width; + proposals_data[i * len + 3] = bbox_height; + proposals_data[i * len + 4] = bbox_angle; + } + // return proposals; +} + + +template +static inline void RFilterBoxes(const platform::DeviceContext &ctx, + Tensor *boxes, + float min_size, + const Tensor &im_info, + Tensor *keep) { + T *boxes_data = boxes->mutable_data(ctx.GetPlace()); + keep->Resize({boxes->dims()[0]}); + min_size = std::max(min_size, 0.0f); + int *keep_data = keep->mutable_data(ctx.GetPlace()); + + int keep_len = 0; + for (int i = 0; i < boxes->dims()[0]; ++i) { + T ws = boxes_data[5 * i + 2]; + T hs = boxes_data[5 * i + 3]; + if (ws >= min_size && hs >= min_size) { + keep_data[keep_len++] = i; + } + } + keep->Resize({keep_len}); +} + +template +static inline std::vector> GetSortedScoreIndex( + const std::vector &scores) { + std::vector> sorted_indices; + sorted_indices.reserve(scores.size()); + for (size_t i = 0; i < scores.size(); ++i) { + sorted_indices.emplace_back(scores[i], i); + } + // Sort the score pair according to the scores in descending order + std::stable_sort(sorted_indices.begin(), + sorted_indices.end(), + [](const std::pair &a, const std::pair &b) { + return a.first < b.first; + }); + return sorted_indices; +} + + +template +static inline Tensor VectorToTensor(const std::vector &selected_indices, + int selected_num) { + Tensor keep_nms; + keep_nms.Resize({selected_num}); + auto *keep_data = keep_nms.mutable_data(platform::CPUPlace()); + for (int i = 0; i < selected_num; ++i) { + keep_data[i] = selected_indices[i]; + } + return keep_nms; +} + +template +inline T trangle_area(T *a, T *b, T *c) { + return ((a[0] - c[0]) * (b[1] - c[1]) - (a[1] - c[1]) * (b[0] - c[0])) / 2.0; +} + +template +inline T area(T *int_pts, int num_of_inter) { + float area = 0.0; + for (int i = 0; i < num_of_inter - 2; i++) { + area += + fabs(trangle_area(int_pts, int_pts + 2 * i + 2, int_pts + 2 * i + 4)); + } + return area; +} + +template +inline void reorder_pts(T *int_pts, int num_of_inter) { + if (num_of_inter > 0) { + float center[2]; + + center[0] = 0.0; + center[1] = 0.0; + + for (int i = 0; i < num_of_inter; i++) { + center[0] += int_pts[2 * i]; + center[1] += int_pts[2 * i + 1]; + } + center[0] /= num_of_inter; + center[1] /= num_of_inter; + + float vs[16]; + float v[2]; + float d; + for (int i = 0; i < num_of_inter; i++) { + v[0] = int_pts[2 * i] - center[0]; + v[1] = int_pts[2 * i + 1] - center[1]; + d = sqrt(v[0] * v[0] + v[1] * v[1]); + v[0] = v[0] / d; + v[1] = v[1] / d; + if (v[1] < 0) { + v[0] = -2 - v[0]; + } + vs[i] = v[0]; + } + + float temp, tx, ty; + int j; + for (int i = 1; i < num_of_inter; ++i) { + if (vs[i - 1] > vs[i]) { + temp = vs[i]; + tx = int_pts[2 * i]; + ty = int_pts[2 * i + 1]; + j = i; + while (j > 0 && vs[j - 1] > temp) { + vs[j] = vs[j - 1]; + int_pts[j * 2] = int_pts[j * 2 - 2]; + int_pts[j * 2 + 1] = int_pts[j * 2 - 1]; + j--; + } + vs[j] = temp; + int_pts[j * 2] = tx; + int_pts[j * 2 + 1] = ty; + } + } + } +} + +template +inline bool inter2line(T *pts1, T *pts2, int i, int j, T *temp_pts) { + T a[2]; + T b[2]; + T c[2]; + T d[2]; + + T area_abc, area_abd, area_cda, area_cdb; + + a[0] = pts1[2 * i]; + a[1] = pts1[2 * i + 1]; + + 
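+  // a/b are the endpoints of edge i of the first rectangle and c/d (filled in
+  // below) are the endpoints of edge j of the second one. The signed triangle
+  // areas computed afterwards tell whether the two segments straddle each
+  // other, i.e. whether the edges intersect, and where.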
b[0] = pts1[2 * ((i + 1) % 4)]; + b[1] = pts1[2 * ((i + 1) % 4) + 1]; + + c[0] = pts2[2 * j]; + c[1] = pts2[2 * j + 1]; + + d[0] = pts2[2 * ((j + 1) % 4)]; + d[1] = pts2[2 * ((j + 1) % 4) + 1]; + + area_abc = trangle_area(a, b, c); + area_abd = trangle_area(a, b, d); + + if (area_abc * area_abd >= 0) { + return false; + } + + area_cda = trangle_area(c, d, a); + area_cdb = area_cda + area_abc - area_abd; + + if (area_cda * area_cdb >= 0) { + return false; + } + float t = area_cda / (area_abd - area_abc); + + float dx = t * (b[0] - a[0]); + float dy = t * (b[1] - a[1]); + temp_pts[0] = a[0] + dx; + temp_pts[1] = a[1] + dy; + + return true; +} + +template +inline bool in_rect(T pt_x, T pt_y, T *pts) { + float ab[2]; + float ad[2]; + float ap[2]; + + float abab; + float abap; + float adad; + float adap; + + ab[0] = pts[2] - pts[0]; + ab[1] = pts[3] - pts[1]; + + ad[0] = pts[6] - pts[0]; + ad[1] = pts[7] - pts[1]; + + ap[0] = pt_x - pts[0]; + ap[1] = pt_y - pts[1]; + + abab = ab[0] * ab[0] + ab[1] * ab[1]; + abap = ab[0] * ap[0] + ab[1] * ap[1]; + adad = ad[0] * ad[0] + ad[1] * ad[1]; + adap = ad[0] * ap[0] + ad[1] * ap[1]; + + return abab >= abap and abap >= 0 and adad >= adap and adap >= 0; +} + +template +inline int inter_pts(T *pts1, T *pts2, T *int_pts) { + int num_of_inter = 0; + + for (int i = 0; i < 4; i++) { + if (in_rect(pts1[2 * i], pts1[2 * i + 1], pts2)) { + int_pts[num_of_inter * 2] = pts1[2 * i]; + int_pts[num_of_inter * 2 + 1] = pts1[2 * i + 1]; + num_of_inter++; + } + if (in_rect(pts2[2 * i], pts2[2 * i + 1], pts1)) { + int_pts[num_of_inter * 2] = pts2[2 * i]; + int_pts[num_of_inter * 2 + 1] = pts2[2 * i + 1]; + num_of_inter++; + } + } + + T temp_pts[2]; + + for (int i = 0; i < 4; i++) { + for (int j = 0; j < 4; j++) { + bool has_pts = inter2line(pts1, pts2, i, j, temp_pts); + if (has_pts) { + int_pts[num_of_inter * 2] = temp_pts[0]; + int_pts[num_of_inter * 2 + 1] = temp_pts[1]; + num_of_inter++; + } + } + } + + return num_of_inter; +} + +template +inline void convert_region(T *pts, const T *region) { + float angle = region[4]; + float a_cos = cos(angle / 180.0 * PI); + float a_sin = -sin(angle / 180.0 * PI); // anti clock-wise + + float ctr_x = region[0]; + float ctr_y = region[1]; + float h = region[3]; + float w = region[2]; + + float pts_x[4]; + float pts_y[4]; + + pts_x[0] = -w / 2; + pts_x[1] = -w / 2; + pts_x[2] = w / 2; + pts_x[3] = w / 2; + + pts_y[0] = -h / 2; + pts_y[1] = h / 2; + pts_y[2] = h / 2; + pts_y[3] = -h / 2; + + for (int i = 0; i < 4; i++) { + pts[2 * i] = a_cos * pts_x[i] - a_sin * pts_y[i] + ctr_x; + pts[2 * i + 1] = a_sin * pts_x[i] + a_cos * pts_y[i] + ctr_y; + } +} + +template +inline float inter(const T *region1, const T *region2) { + T pts1[8]; + T pts2[8]; + T int_pts[16]; + int num_of_inter; + + convert_region(pts1, region1); + convert_region(pts2, region2); + + num_of_inter = inter_pts(pts1, pts2, int_pts); + + reorder_pts(int_pts, num_of_inter); + + return area(int_pts, num_of_inter); +} + +template +inline float DevRotateIoU(const T *region1, const T *region2) { + T area1 = region1[2] * region1[3]; + T area2 = region2[2] * region2[3]; + T area_inter = inter(region1, region2); + + return area_inter / (area1 + area2 - area_inter); +} + +template +static inline Tensor RNMS(const platform::DeviceContext &ctx, + Tensor *bbox, + Tensor *scores, + T nms_threshold) { + PADDLE_ENFORCE_NOT_NULL(bbox); + int64_t num_boxes = bbox->dims()[0]; + // 4: [xmin ymin xmax ymax] + int64_t box_size = bbox->dims()[1]; + + std::vector scores_data(num_boxes); + 
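+  // Greedy NMS over rotated boxes: scores are copied out and index-sorted,
+  // then boxes are visited from highest score to lowest and a candidate is
+  // kept only if its rotated IoU with every previously kept box stays at or
+  // below nms_threshold (e.g. with a threshold of 0.7, an overlap of 0.8 with
+  // a kept box suppresses the candidate, an overlap of 0.5 does not).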
std::copy_n(scores->data(), num_boxes, scores_data.begin()); + std::vector> sorted_indices = + GetSortedScoreIndex(scores_data); + + std::vector selected_indices; + int selected_num = 0; + T adaptive_threshold = nms_threshold; + const T *bbox_data = bbox->data(); + while (sorted_indices.size() != 0) { + int idx = sorted_indices.back().second; + bool flag = true; + for (int kept_idx : selected_indices) { + if (flag) { + T overlap = DevRotateIoU(bbox_data + idx * box_size, + bbox_data + kept_idx * box_size); + flag = (overlap <= adaptive_threshold); + } else { + break; + } + } + if (flag) { + selected_indices.push_back(idx); + ++selected_num; + } + sorted_indices.erase(sorted_indices.end() - 1); + } + return VectorToTensor(selected_indices, selected_num); +} + +template +class RRPNGenerateProposalsKernel : public framework::OpKernel { +public: + void Compute(const framework::ExecutionContext &context) const override { + auto *scores = context.Input("Scores"); + auto *bbox_deltas = context.Input("BboxDeltas"); + auto *im_info = context.Input("ImInfo"); + auto anchors = detail::Ref(context.Input("Anchors"), + "Cannot find input Anchors(%s) in scope", + context.InputNames("Anchors")[0]); + auto variances = detail::Ref(context.Input("Variances"), + "Cannot find input Variances(%s) in scope", + context.InputNames("Variances")[0]); + + auto *rpn_rois = context.Output("RpnRois"); + auto *rpn_roi_probs = context.Output("RpnRoiProbs"); + + int pre_nms_top_n = context.Attr("pre_nms_topN"); + int post_nms_top_n = context.Attr("post_nms_topN"); + float nms_thresh = context.Attr("nms_thresh"); + float min_size = context.Attr("min_size"); + + auto &dev_ctx = + context.template device_context(); + + auto &scores_dim = scores->dims(); + int64_t num = scores_dim[0]; + int64_t c_score = scores_dim[1]; + int64_t h_score = scores_dim[2]; + int64_t w_score = scores_dim[3]; + + auto &bbox_dim = bbox_deltas->dims(); + int64_t c_bbox = bbox_dim[1]; + int64_t h_bbox = bbox_dim[2]; + int64_t w_bbox = bbox_dim[3]; + + rpn_rois->mutable_data({bbox_deltas->numel() / 5, 5}, + context.GetPlace()); + rpn_roi_probs->mutable_data({scores->numel(), 1}, context.GetPlace()); + + Tensor bbox_deltas_swap, scores_swap; + bbox_deltas_swap.mutable_data({num, h_bbox, w_bbox, c_bbox}, + dev_ctx.GetPlace()); + scores_swap.mutable_data({num, h_score, w_score, c_score}, + dev_ctx.GetPlace()); + + math::Transpose trans; + std::vector axis = {0, 2, 3, 1}; + trans(dev_ctx, *bbox_deltas, &bbox_deltas_swap, axis); + trans(dev_ctx, *scores, &scores_swap, axis); + + framework::LoD lod; + lod.resize(1); + auto &lod0 = lod[0]; + lod0.push_back(0); + anchors.Resize({anchors.numel() / 5, 5}); + variances.Resize({variances.numel() / 5, 5}); + + int64_t num_proposals = 0; + for (int64_t i = 0; i < num; ++i) { + Tensor im_info_slice = im_info->Slice(i, i + 1); + Tensor bbox_deltas_slice = bbox_deltas_swap.Slice(i, i + 1); + Tensor scores_slice = scores_swap.Slice(i, i + 1); + + bbox_deltas_slice.Resize({h_bbox * w_bbox * c_bbox / 5, 5}); + scores_slice.Resize({h_score * w_score * c_score, 1}); + + std::pair tensor_pair = + ProposalForOneImage(dev_ctx, + im_info_slice, + anchors, + variances, + bbox_deltas_slice, + scores_slice, + pre_nms_top_n, + post_nms_top_n, + nms_thresh, + min_size); + Tensor &proposals = tensor_pair.first; + Tensor &scores = tensor_pair.second; + + RRPNAppendProposals(rpn_rois, 5 * num_proposals, proposals); + RRPNAppendProposals(rpn_roi_probs, num_proposals, scores); + num_proposals += proposals.dims()[0]; + 
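+      // Record the running proposal count as the LoD offset of this image so
+      // that RpnRois/RpnRoiProbs can later be split back into per-image parts.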
lod0.push_back(num_proposals); + } + rpn_rois->set_lod(lod); + rpn_roi_probs->set_lod(lod); + rpn_rois->Resize({num_proposals, 5}); + rpn_roi_probs->Resize({num_proposals, 1}); + } + + std::pair ProposalForOneImage( + const platform::CPUDeviceContext &ctx, + const Tensor &im_info_slice, + const Tensor &anchors, + const Tensor &variances, + const Tensor &bbox_deltas_slice, // [M, 5] + const Tensor &scores_slice, // [N, 1] + int pre_nms_top_n, + int post_nms_top_n, + float nms_thresh, + float min_size) const { + auto *scores_data = scores_slice.data(); + // Sort index + Tensor index_t; + index_t.Resize({scores_slice.numel()}); + int *index = index_t.mutable_data(ctx.GetPlace()); + for (int i = 0; i < scores_slice.numel(); ++i) { + index[i] = i; + } + auto compare = [scores_data](const int64_t &i, const int64_t &j) { + return scores_data[i] > scores_data[j]; + }; + + if (pre_nms_top_n <= 0 || pre_nms_top_n >= scores_slice.numel()) { + std::sort(index, index + scores_slice.numel(), compare); + } else { + std::nth_element( + index, index + pre_nms_top_n, index + scores_slice.numel(), compare); + index_t.Resize({pre_nms_top_n}); + } + + Tensor scores_sel, bbox_sel, anchor_sel, var_sel; + scores_sel.mutable_data({index_t.numel(), 1}, ctx.GetPlace()); + bbox_sel.mutable_data({index_t.numel(), 5}, ctx.GetPlace()); + anchor_sel.mutable_data({index_t.numel(), 5}, ctx.GetPlace()); + var_sel.mutable_data({index_t.numel(), 5}, ctx.GetPlace()); + + CPUGather(ctx, scores_slice, index_t, &scores_sel); + CPUGather(ctx, bbox_deltas_slice, index_t, &bbox_sel); + CPUGather(ctx, anchors, index_t, &anchor_sel); + CPUGather(ctx, variances, index_t, &var_sel); + + auto *scores_ = scores_sel.data(); + + Tensor proposals; + proposals.mutable_data({index_t.numel(), 5}, ctx.GetPlace()); + RBoxCoder(ctx, &anchor_sel, &bbox_sel, &var_sel, &proposals); + + Tensor keep; + RFilterBoxes(ctx, &proposals, min_size, im_info_slice, &keep); + Tensor scores_filter; + bbox_sel.mutable_data({keep.numel(), 5}, ctx.GetPlace()); + scores_filter.mutable_data({keep.numel(), 1}, ctx.GetPlace()); + CPUGather(ctx, proposals, keep, &bbox_sel); + CPUGather(ctx, scores_sel, keep, &scores_filter); + if (nms_thresh <= 0) { + return std::make_pair(bbox_sel, scores_filter); + } + Tensor keep_nms = RNMS(ctx, &bbox_sel, &scores_filter, nms_thresh); + + if (post_nms_top_n > 0 && post_nms_top_n < keep_nms.numel()) { + keep_nms.Resize({post_nms_top_n}); + } + proposals.mutable_data({keep_nms.numel(), 5}, ctx.GetPlace()); + scores_sel.mutable_data({keep_nms.numel(), 1}, ctx.GetPlace()); + CPUGather(ctx, bbox_sel, keep_nms, &proposals); + CPUGather(ctx, scores_filter, keep_nms, &scores_sel); + + return std::make_pair(proposals, scores_sel); + } +}; + +class RRPNGenerateProposalsOpMaker : public framework::OpProtoAndCheckerMaker { +public: + void Make() override { + AddInput("Scores", + "(Tensor) The scores from conv is in shape (N, A, H, W), " + "N is batch size, A is number of anchors, " + "H and W are height and width of the feature map"); + AddInput("BboxDeltas", + "(Tensor) Bounding box deltas from conv is in " + "shape (N, 5*A, H, W)."); + AddInput("ImInfo", + "(Tensor) Information for image reshape is in shape (N, 3), " + "in format (height, width, scale)"); + AddInput("Anchors", + "(Tensor) Bounding box anchors from anchor_generator_op " + "is in shape (A, H, W, 5)."); + AddInput("Variances", + "(Tensor) Bounding box variances with same shape as `Anchors`."); + + AddOutput("RpnRois", + "(LoDTensor), Output proposals with shape (rois_num, 5)."); + 
AddOutput("RpnRoiProbs", + "(LoDTensor) Scores of proposals with shape (rois_num, 1)."); + AddAttr("pre_nms_topN", + "Number of top scoring RPN proposals to keep before " + "applying NMS."); + AddAttr("post_nms_topN", + "Number of top scoring RPN proposals to keep after " + "applying NMS"); + AddAttr("nms_thresh", "NMS threshold used on RPN proposals."); + AddAttr("min_size", + "Proposal height and width both need to be greater " + "than this min_size."); + AddComment(R"DOC( +This operator Generate bounding box proposals for Faster RCNN. +The propoasls are generated for a list of images based on image +score 'Scores', bounding box regression result 'BboxDeltas' as +well as predefined bounding box shapes 'anchors'. Greedy +non-maximum suppression is applied to generate the final bounding +boxes. + +)DOC"); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR( + rrpn_generate_proposals, + ops::RRPNGenerateProposalsOp, + ops::RRPNGenerateProposalsOpMaker, + paddle::framework::EmptyGradOpMaker, + paddle::framework::EmptyGradOpMaker); +REGISTER_OP_CPU_KERNEL(rrpn_generate_proposals, + ops::RRPNGenerateProposalsKernel, + ops::RRPNGenerateProposalsKernel); diff --git a/PaddleCV/rrpn/models/ext_op/src/rrpn_generate_proposals_op.cu b/PaddleCV/rrpn/models/ext_op/src/rrpn_generate_proposals_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..a074e79e00475ed604b191d1ac7e12ad8e898abe --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/rrpn_generate_proposals_op.cu @@ -0,0 +1,747 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
+ +Based on +-------------------------------------------------------- +@misc{ma2019rrpn, + author = {Jianqi Ma}, + title = {{RRPN in pytorch}}, + year = {2019}, + howpublished = {\url{https://github.com/mjq11302010044/RRPN_pytorch}}, +} +@article{Jianqi17RRPN, + Author = {Jianqi Ma and Weiyuan Shao and Hao Ye and Li Wang and Hong Wang +and Yingbin Zheng and Xiangyang Xue}, + Title = {Arbitrary-Oriented Scene Text Detection via Rotation Proposals}, + journal = {IEEE Transactions on Multimedia}, + volume={20}, + number={11}, + pages={3111-3122}, + year={2018} +} +-------------------------------------------------------- +*/ + +#include +#include +#include +#include +#include "cub/cub/cub.cuh" +#include "gather.cu.h" +#include "math_function.h" +#include "paddle/fluid/framework/mixed_vector.h" +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/memory/memory.h" +#include "paddle/fluid/platform/for_range.h" +#include "safe_ref.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +using LoDTensor = framework::LoDTensor; +#define PI 3.141592654 + +namespace { + +#define DIVUP(m, n) ((m) / (n) + ((m) % (n) > 0)) +#define CUDA_1D_KERNEL_LOOP(i, n) \ + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (n); \ + i += blockDim.x * gridDim.x) + +int const kThreadsPerBlock = sizeof(uint64_t) * 8; + +static const double kBBoxClipDefault = std::log(1000.0 / 16.0); + +struct RangeInitFunctor { + int start_; + int delta_; + int *out_; + __device__ void operator()(size_t i) { out_[i] = start_ + i * delta_; } +}; + +template +static void RSortDescending(const platform::CUDADeviceContext &ctx, + const Tensor &value, + Tensor *value_out, + Tensor *index_out) { + int num = static_cast(value.numel()); + Tensor index_in_t; + int *idx_in = index_in_t.mutable_data({num}, ctx.GetPlace()); + platform::ForRange for_range(ctx, num); + for_range(RangeInitFunctor{0, 1, idx_in}); + + int *idx_out = index_out->mutable_data({num}, ctx.GetPlace()); + + const T *keys_in = value.data(); + T *keys_out = value_out->mutable_data({num}, ctx.GetPlace()); + + // Determine temporary device storage requirements + size_t temp_storage_bytes = 0; + cub::DeviceRadixSort::SortPairsDescending( + nullptr, temp_storage_bytes, keys_in, keys_out, idx_in, idx_out, num); + // Allocate temporary storage + auto place = boost::get(ctx.GetPlace()); + auto d_temp_storage = memory::Alloc(place, temp_storage_bytes); + + // Run sorting operation + cub::DeviceRadixSort::SortPairsDescending(d_temp_storage->ptr(), + temp_storage_bytes, + keys_in, + keys_out, + idx_in, + idx_out, + num); +} + +template +struct RBoxDecodeAndClipFunctor { + const T *anchor; + const T *deltas; + const T *var; + const int *index; + const T *im_info; + + T *proposals; + + RBoxDecodeAndClipFunctor(const T *anchor, + const T *deltas, + const T *var, + const int *index, + const T *im_info, + T *proposals) + : anchor(anchor), + deltas(deltas), + var(var), + index(index), + im_info(im_info), + proposals(proposals) {} + + T bbox_clip_default{static_cast(kBBoxClipDefault)}; + + __device__ void operator()(size_t i) { + int k = index[i] * 5; + + T w = anchor[k + 2]; + T h = anchor[k + 3]; + T cx = anchor[k]; + T cy = anchor[k + 1]; + T angle = anchor[k + 4]; + + T de_cx = deltas[k]; + T de_cy = deltas[k + 1]; + T de_w = deltas[k + 2]; + T de_h = deltas[k + 3]; + T de_g = deltas[k + 4]; + + T d_cx, d_cy, d_w, d_h, d_g; + if (var) { + d_cx = cx + de_cx * w / var[k]; + d_cy = cy + de_cy * h / var[k + 1]; + d_w = exp(Min(de_w / var[k + 
2], bbox_clip_default)) * w; + d_h = exp(Min(de_h / var[k + 3], bbox_clip_default)) * h; + d_g = de_g / var[k + 4] * 1.0 / PI * 180 + angle; + } else { + d_cx = cx + de_cx * w; + d_cy = cy + de_cy * h; + d_w = exp(Min(de_w, bbox_clip_default)) * w; + d_h = exp(Min(de_h, bbox_clip_default)) * h; + d_g = de_g * 1.0 / PI * 180 + angle; + } + + proposals[i * 5] = d_cx; + proposals[i * 5 + 1] = d_cy; + proposals[i * 5 + 2] = d_w; + proposals[i * 5 + 3] = d_h; + proposals[i * 5 + 4] = d_g; + } + + __device__ __forceinline__ T Min(T a, T b) const { return a > b ? b : a; } + + __device__ __forceinline__ T Max(T a, T b) const { return a > b ? a : b; } +}; + +template +static __global__ void RFilterBBoxes(const T *bboxes, + const T *im_info, + const T min_size, + const int num, + int *keep_num, + int *keep) { + T im_h = im_info[0]; + T im_w = im_info[1]; + T im_scale = im_info[2]; + + int cnt = 0; + __shared__ int keep_index[BlockSize]; + + CUDA_1D_KERNEL_LOOP(i, num) { + keep_index[threadIdx.x] = -1; + __syncthreads(); + + int k = i * 5; + + T cx = bboxes[k]; + T cy = bboxes[k + 1]; + T w_s = bboxes[k + 2]; + T h_s = bboxes[k + 3]; + + if (w_s >= min_size && h_s >= min_size) { + keep_index[threadIdx.x] = i; + } + __syncthreads(); + if (threadIdx.x == 0) { + int size = (num - i) < BlockSize ? num - i : BlockSize; + for (int j = 0; j < size; ++j) { + if (keep_index[j] > -1) { + keep[cnt++] = keep_index[j]; + } + } + } + __syncthreads(); + } + if (threadIdx.x == 0) { + keep_num[0] = cnt; + } +} + + +__device__ inline float trangle_area(float *a, float *b, float *c) { + return ((a[0] - c[0]) * (b[1] - c[1]) - (a[1] - c[1]) * (b[0] - c[0])) / 2.0; +} + + +__device__ inline float area(float *int_pts, int num_of_inter) { + float area = 0.0; + for (int i = 0; i < num_of_inter - 2; i++) { + area += + fabs(trangle_area(int_pts, int_pts + 2 * i + 2, int_pts + 2 * i + 4)); + } + return area; +} + + +__device__ inline void reorder_pts(float *int_pts, int num_of_inter) { + if (num_of_inter > 0) { + float center[2] = {0.0, 0.0}; + + // center[0] = 0.0; + // center[1] = 0.0; + + for (int i = 0; i < num_of_inter; i++) { + center[0] += int_pts[2 * i]; + center[1] += int_pts[2 * i + 1]; + } + center[0] /= num_of_inter; + center[1] /= num_of_inter; + + float vs[16]; + float v[2]; + float d; + for (int i = 0; i < num_of_inter; i++) { + v[0] = int_pts[2 * i] - center[0]; + v[1] = int_pts[2 * i + 1] - center[1]; + d = sqrt(v[0] * v[0] + v[1] * v[1]); + v[0] = v[0] / d; + v[1] = v[1] / d; + if (v[1] < 0) { + v[0] = -2 - v[0]; + } + vs[i] = v[0]; + } + + float temp, tx, ty; + int j; + for (int i = 1; i < num_of_inter; ++i) { + if (vs[i - 1] > vs[i]) { + temp = vs[i]; + tx = int_pts[2 * i]; + ty = int_pts[2 * i + 1]; + j = i; + while (j > 0 && vs[j - 1] > temp) { + vs[j] = vs[j - 1]; + int_pts[j * 2] = int_pts[j * 2 - 2]; + int_pts[j * 2 + 1] = int_pts[j * 2 - 1]; + j--; + } + vs[j] = temp; + int_pts[j * 2] = tx; + int_pts[j * 2 + 1] = ty; + } + } + } +} + + +__device__ inline bool inter2line( + float *pts1, float *pts2, int i, int j, float *temp_pts) { + float a[2] = {pts1[2 * i], pts1[2 * i + 1]}; + float b[2] = {pts1[2 * ((i + 1) % 4)], pts1[2 * ((i + 1) % 4) + 1]}; + float c[2] = {pts2[2 * j], pts2[2 * j + 1]}; + float d[2] = {pts2[2 * ((j + 1) % 4)], pts2[2 * ((j + 1) % 4) + 1]}; + + // T area_abc, area_abd, area_cda, area_cdb; + + // a[0] = pts1[2 * i]; + // a[1] = pts1[2 * i + 1]; + + // b[0] = pts1[2 * ((i + 1) % 4)]; + // b[1] = pts1[2 * ((i + 1) % 4) + 1]; + + // c[0] = pts2[2 * j]; + // c[1] = pts2[2 * j + 1]; + 
+ // d[0] = pts2[2 * ((j + 1) % 4)]; + // d[1] = pts2[2 * ((j + 1) % 4) + 1]; + + float area_abc = trangle_area(a, b, c); + float area_abd = trangle_area(a, b, d); + + if (area_abc * area_abd >= 0) { + return false; + } + + float area_cda = trangle_area(c, d, a); + float area_cdb = area_cda + area_abc - area_abd; + + if (area_cda * area_cdb >= 0) { + return false; + } + float t = area_cda / (area_abd - area_abc); + + float dx = t * (b[0] - a[0]); + float dy = t * (b[1] - a[1]); + temp_pts[0] = a[0] + dx; + temp_pts[1] = a[1] + dy; + + return true; +} + + +__device__ inline bool in_rect(float pt_x, float pt_y, float *pts) { + float ab[2] = {pts[2] - pts[0], pts[3] - pts[1]}; + float ad[2] = {pts[6] - pts[0], pts[7] - pts[1]}; + float ap[2] = {pt_x - pts[0], pt_y - pts[1]}; + + // float abab; + // float abap; + // float adad; + // float adap; + + // ab[0] = pts[2] - pts[0]; + // ab[1] = pts[3] - pts[1]; + // + // ad[0] = pts[6] - pts[0]; + // ad[1] = pts[7] - pts[1]; + // + // ap[0] = pt_x - pts[0]; + // ap[1] = pt_y - pts[1]; + + float abab = ab[0] * ab[0] + ab[1] * ab[1]; + float abap = ab[0] * ap[0] + ab[1] * ap[1]; + float adad = ad[0] * ad[0] + ad[1] * ad[1]; + float adap = ad[0] * ap[0] + ad[1] * ap[1]; + + return abab >= abap and abap >= 0 and adad >= adap and adap >= 0; +} + + +__device__ inline int inter_pts(float *pts1, float *pts2, float *int_pts) { + int num_of_inter = 0; + + for (int i = 0; i < 4; i++) { + if (in_rect(pts1[2 * i], pts1[2 * i + 1], pts2)) { + int_pts[num_of_inter * 2] = pts1[2 * i]; + int_pts[num_of_inter * 2 + 1] = pts1[2 * i + 1]; + num_of_inter++; + } + if (in_rect(pts2[2 * i], pts2[2 * i + 1], pts1)) { + int_pts[num_of_inter * 2] = pts2[2 * i]; + int_pts[num_of_inter * 2 + 1] = pts2[2 * i + 1]; + num_of_inter++; + } + } + + float temp_pts[2]; + + for (int i = 0; i < 4; i++) { + for (int j = 0; j < 4; j++) { + bool has_pts = inter2line(pts1, pts2, i, j, temp_pts); + if (has_pts) { + int_pts[num_of_inter * 2] = temp_pts[0]; + int_pts[num_of_inter * 2 + 1] = temp_pts[1]; + num_of_inter++; + } + } + } + + return num_of_inter; +} + + +__device__ inline void convert_region(float *pts, const float *region) { + float angle = region[4]; + float a_cos = cos(angle / 180.0 * PI); + float a_sin = -sin(angle / 180.0 * PI); // anti clock-wise + + float ctr_x = region[0]; + float ctr_y = region[1]; + float h = region[3]; + float w = region[2]; + + float pts_x[4] = {-w / 2, -w / 2, w / 2, w / 2}; + float pts_y[4] = {-h / 2, h / 2, h / 2, -h / 2}; + + // pts_x[0] = -w / 2; + // pts_x[1] = -w / 2; + // pts_x[2] = w / 2; + // pts_x[3] = w / 2; + // + // pts_y[0] = -h / 2; + // pts_y[1] = h / 2; + // pts_y[2] = h / 2; + // pts_y[3] = -h / 2; + + for (int i = 0; i < 4; i++) { + pts[2 * i] = a_cos * pts_x[i] - a_sin * pts_y[i] + ctr_x; + pts[2 * i + 1] = a_sin * pts_x[i] + a_cos * pts_y[i] + ctr_y; + } +} + +__device__ inline float inter(const float *region1, const float *region2) { + float pts1[8]; + float pts2[8]; + float int_pts[16]; + int num_of_inter; + + convert_region(pts1, region1); + convert_region(pts2, region2); + + num_of_inter = inter_pts(pts1, pts2, int_pts); + + reorder_pts(int_pts, num_of_inter); + + return area(int_pts, num_of_inter); +} + + +__device__ inline float IoU(const float *region1, const float *region2) { + float area1 = region1[2] * region1[3]; + float area2 = region2[2] * region2[3]; + float area_inter = inter(region1, region2); + + return area_inter / (area1 + area2 - area_inter); +} + +static __global__ void RNMSKernel(const int n_boxes, + const float 
nms_overlap_thresh, + const float *dev_boxes, + uint64_t *dev_mask) { + const int row_start = blockIdx.y; + const int col_start = blockIdx.x; + + const int row_size = + min(n_boxes - row_start * kThreadsPerBlock, kThreadsPerBlock); + const int col_size = + min(n_boxes - col_start * kThreadsPerBlock, kThreadsPerBlock); + + __shared__ float block_boxes[kThreadsPerBlock * 5]; + if (threadIdx.x < col_size) { + block_boxes[threadIdx.x * 5 + 0] = + dev_boxes[(kThreadsPerBlock * col_start + threadIdx.x) * 5 + 0]; + block_boxes[threadIdx.x * 5 + 1] = + dev_boxes[(kThreadsPerBlock * col_start + threadIdx.x) * 5 + 1]; + block_boxes[threadIdx.x * 5 + 2] = + dev_boxes[(kThreadsPerBlock * col_start + threadIdx.x) * 5 + 2]; + block_boxes[threadIdx.x * 5 + 3] = + dev_boxes[(kThreadsPerBlock * col_start + threadIdx.x) * 5 + 3]; + block_boxes[threadIdx.x * 5 + 4] = + dev_boxes[(kThreadsPerBlock * col_start + threadIdx.x) * 5 + 4]; + } + __syncthreads(); + + if (threadIdx.x < row_size) { + const int cur_box_idx = kThreadsPerBlock * row_start + threadIdx.x; + const float *cur_box = dev_boxes + cur_box_idx * 5; + int i = 0; + uint64_t t = 0; + int start = 0; + if (row_start == col_start) { + start = threadIdx.x + 1; + } + for (i = start; i < col_size; i++) { + if (IoU(cur_box, block_boxes + i * 5) > nms_overlap_thresh) { + t |= 1ULL << i; + } + } + const int col_blocks = DIVUP(n_boxes, kThreadsPerBlock); + dev_mask[cur_box_idx * col_blocks + col_start] = t; + } +} + +template +static void RNMS(const platform::CUDADeviceContext &ctx, + const Tensor &proposals, + const Tensor &sorted_indices, + const T nms_threshold, + Tensor *keep_out) { + int boxes_num = proposals.dims()[0]; + PADDLE_ENFORCE_EQ(boxes_num, sorted_indices.dims()[0]); + + const int col_blocks = DIVUP(boxes_num, kThreadsPerBlock); + dim3 blocks(DIVUP(boxes_num, kThreadsPerBlock), + DIVUP(boxes_num, kThreadsPerBlock)); + dim3 threads(kThreadsPerBlock); + + const T *boxes = proposals.data(); + auto place = boost::get(ctx.GetPlace()); + framework::Vector mask(boxes_num * col_blocks); + RNMSKernel<<>>( + boxes_num, + nms_threshold, + boxes, + mask.CUDAMutableData(boost::get(ctx.GetPlace()))); + + std::vector remv(col_blocks); + memset(&remv[0], 0, sizeof(uint64_t) * col_blocks); + + std::vector keep_vec; + int num_to_keep = 0; + for (int i = 0; i < boxes_num; i++) { + int nblock = i / kThreadsPerBlock; + int inblock = i % kThreadsPerBlock; + + if (!(remv[nblock] & (1ULL << inblock))) { + ++num_to_keep; + keep_vec.push_back(i); + uint64_t *p = &mask[0] + i * col_blocks; + for (int j = nblock; j < col_blocks; j++) { + remv[j] |= p[j]; + } + } + } + int *keep = keep_out->mutable_data({num_to_keep}, ctx.GetPlace()); + memory::Copy(place, + keep, + platform::CPUPlace(), + keep_vec.data(), + sizeof(int) * num_to_keep, + ctx.stream()); + ctx.Wait(); +} + +template +static std::pair RRPNProposalForOneImage( + const platform::CUDADeviceContext &ctx, + const Tensor &im_info, + const Tensor &anchors, + const Tensor &variances, + const Tensor &bbox_deltas, // [M, 5] + const Tensor &scores, // [N, 1] + int pre_nms_top_n, + int post_nms_top_n, + float nms_thresh, + float min_size) { + // 1. pre nms + Tensor scores_sort, index_sort; + RSortDescending(ctx, scores, &scores_sort, &index_sort); + int num = scores.numel(); + int pre_nms_num = (pre_nms_top_n <= 0 || pre_nms_top_n > num) ? scores.numel() + : pre_nms_top_n; + scores_sort.Resize({pre_nms_num, 1}); + index_sort.Resize({pre_nms_num, 1}); + + // 2. 
box decode and clipping + Tensor proposals; + proposals.mutable_data({pre_nms_num, 5}, ctx.GetPlace()); + + { + platform::ForRange for_range(ctx, pre_nms_num); + for_range(RBoxDecodeAndClipFunctor{anchors.data(), + bbox_deltas.data(), + variances.data(), + index_sort.data(), + im_info.data(), + proposals.data()}); + } + + // 3. filter + Tensor keep_index, keep_num_t; + keep_index.mutable_data({pre_nms_num}, ctx.GetPlace()); + keep_num_t.mutable_data({1}, ctx.GetPlace()); + min_size = std::max(min_size, 0.0f); + auto stream = ctx.stream(); + RFilterBBoxes<<<1, 256, 0, stream>>>(proposals.data(), + im_info.data(), + min_size, + pre_nms_num, + keep_num_t.data(), + keep_index.data()); + int keep_num; + const auto gpu_place = boost::get(ctx.GetPlace()); + memory::Copy(platform::CPUPlace(), + &keep_num, + gpu_place, + keep_num_t.data(), + sizeof(int), + ctx.stream()); + ctx.Wait(); + keep_index.Resize({keep_num}); + + Tensor scores_filter, proposals_filter; + proposals_filter.mutable_data({keep_num, 5}, ctx.GetPlace()); + scores_filter.mutable_data({keep_num, 1}, ctx.GetPlace()); + GPUGather(ctx, proposals, keep_index, &proposals_filter); + GPUGather(ctx, scores_sort, keep_index, &scores_filter); + + if (nms_thresh <= 0) { + return std::make_pair(proposals_filter, scores_filter); + } + + // 4. nms + Tensor keep_nms; + RNMS(ctx, proposals_filter, keep_index, nms_thresh, &keep_nms); + if (post_nms_top_n > 0 && post_nms_top_n < keep_nms.numel()) { + keep_nms.Resize({post_nms_top_n}); + } + + Tensor scores_nms, proposals_nms; + proposals_nms.mutable_data({keep_nms.numel(), 5}, ctx.GetPlace()); + scores_nms.mutable_data({keep_nms.numel(), 1}, ctx.GetPlace()); + GPUGather(ctx, proposals_filter, keep_nms, &proposals_nms); + GPUGather(ctx, scores_filter, keep_nms, &scores_nms); + + return std::make_pair(proposals_nms, scores_nms); +} +} // namespace + +template +class CUDARRPNGenerateProposalsKernel : public framework::OpKernel { +public: + void Compute(const framework::ExecutionContext &context) const override { + auto *scores = context.Input("Scores"); + auto *bbox_deltas = context.Input("BboxDeltas"); + auto *im_info = context.Input("ImInfo"); + auto anchors = detail::Ref(context.Input("Anchors"), + "Cannot find input Anchors(%s) in scope", + context.InputNames("Anchors")[0]); + auto variances = detail::Ref(context.Input("Variances"), + "Cannot find input Variances(%s) in scope", + context.InputNames("Variances")[0]); + + auto *rpn_rois = context.Output("RpnRois"); + auto *rpn_roi_probs = context.Output("RpnRoiProbs"); + + int pre_nms_top_n = context.Attr("pre_nms_topN"); + int post_nms_top_n = context.Attr("post_nms_topN"); + float nms_thresh = context.Attr("nms_thresh"); + float min_size = context.Attr("min_size"); + + auto &dev_ctx = context.template device_context(); + + auto scores_dim = scores->dims(); + int64_t num = scores_dim[0]; + int64_t c_score = scores_dim[1]; + int64_t h_score = scores_dim[2]; + int64_t w_score = scores_dim[3]; + + auto bbox_dim = bbox_deltas->dims(); + int64_t c_bbox = bbox_dim[1]; + int64_t h_bbox = bbox_dim[2]; + int64_t w_bbox = bbox_dim[3]; + + Tensor bbox_deltas_swap, scores_swap; + bbox_deltas_swap.mutable_data({num, h_bbox, w_bbox, c_bbox}, + dev_ctx.GetPlace()); + scores_swap.mutable_data({num, h_score, w_score, c_score}, + dev_ctx.GetPlace()); + + math::Transpose trans; + std::vector axis = {0, 2, 3, 1}; + trans(dev_ctx, *bbox_deltas, &bbox_deltas_swap, axis); + trans(dev_ctx, *scores, &scores_swap, axis); + + anchors.Resize({anchors.numel() / 5, 5}); + 
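+    // Variances follow the same flattened layout as the anchors above: one
+    // row of five values per rotated anchor, matching [cx, cy, w, h, angle].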
variances.Resize({variances.numel() / 5, 5}); + + rpn_rois->mutable_data({bbox_deltas->numel() / 5, 5}, + context.GetPlace()); + rpn_roi_probs->mutable_data({scores->numel(), 1}, context.GetPlace()); + + T *rpn_rois_data = rpn_rois->data(); + T *rpn_roi_probs_data = rpn_roi_probs->data(); + + auto place = boost::get(dev_ctx.GetPlace()); + + int64_t num_proposals = 0; + std::vector offset(1, 0); + for (int64_t i = 0; i < num; ++i) { + Tensor im_info_slice = im_info->Slice(i, i + 1); + Tensor bbox_deltas_slice = bbox_deltas_swap.Slice(i, i + 1); + Tensor scores_slice = scores_swap.Slice(i, i + 1); + + bbox_deltas_slice.Resize({h_bbox * w_bbox * c_bbox / 5, 5}); + scores_slice.Resize({h_score * w_score * c_score, 1}); + // auto* scores_data = scores_slice.data(); + // for(int k=0; k < 256; k++) { + // std::cout << scores_data[k] << std::endl; + // } + std::pair box_score_pair = + RRPNProposalForOneImage(dev_ctx, + im_info_slice, + anchors, + variances, + bbox_deltas_slice, + scores_slice, + pre_nms_top_n, + post_nms_top_n, + nms_thresh, + min_size); + + Tensor &proposals = box_score_pair.first; + Tensor &scores = box_score_pair.second; + + memory::Copy(place, + rpn_rois_data + num_proposals * 5, + place, + proposals.data(), + sizeof(T) * proposals.numel(), + dev_ctx.stream()); + memory::Copy(place, + rpn_roi_probs_data + num_proposals, + place, + scores.data(), + sizeof(T) * scores.numel(), + dev_ctx.stream()); + dev_ctx.Wait(); + num_proposals += proposals.dims()[0]; + offset.emplace_back(num_proposals); + } + framework::LoD lod; + lod.emplace_back(offset); + rpn_rois->set_lod(lod); + rpn_roi_probs->set_lod(lod); + rpn_rois->Resize({num_proposals, 5}); + rpn_roi_probs->Resize({num_proposals, 1}); + } +}; + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL( + rrpn_generate_proposals, + ops::CUDARRPNGenerateProposalsKernel); diff --git a/PaddleCV/rrpn/models/ext_op/src/rrpn_rotated_roi_align_op.cc b/PaddleCV/rrpn/models/ext_op/src/rrpn_rotated_roi_align_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..2d3ff4553a631b7f65edc13489cfbc863359b91d --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/rrpn_rotated_roi_align_op.cc @@ -0,0 +1,197 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include +#include "math_function.h" +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +using LoDTensor = framework::LoDTensor; + +class RRPNRotatedROIAlignOp : public framework::OperatorWithKernel { +public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("X"), + "Input(X) of Rotated ROIAlignOp should not be null."); + PADDLE_ENFORCE(ctx->HasInput("ROIs"), + "Input(ROIs) of Rotated ROIAlignOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutput("Out"), + "Output(Out) of Rotated ROIAlignOp should not be null."); + auto input_dims = ctx->GetInputDim("X"); + auto rois_dims = ctx->GetInputDim("ROIs"); + + PADDLE_ENFORCE(input_dims.size() == 4, + "The format of input tensor is NCHW."); + PADDLE_ENFORCE(rois_dims.size() == 2, + "ROIs should be a 2-D LoDTensor of shape (num_rois, 5)" + "given as [[x1, y1, x2, y2, theta], ...]."); + if (ctx->IsRuntime()) { + PADDLE_ENFORCE(rois_dims[1] == 5, + "ROIs should be a 2-D LoDTensor of shape (num_rois, 5)" + "given as [[x1, y1, x2, y2, theta], ...]."); + } + int pooled_height = ctx->Attrs().Get("pooled_height"); + int pooled_width = ctx->Attrs().Get("pooled_width"); + float spatial_scale = ctx->Attrs().Get("spatial_scale"); + + PADDLE_ENFORCE_GT( + pooled_height, 0, "The pooled output height must greater than 0"); + PADDLE_ENFORCE_GT( + pooled_width, 0, "The pooled output width must greater than 0"); + PADDLE_ENFORCE_GT( + spatial_scale, 0.0f, "The spatial scale must greater than 0"); + + auto out_dims = input_dims; + out_dims[0] = rois_dims[0]; + out_dims[1] = input_dims[1]; + out_dims[2] = pooled_height; + out_dims[3] = pooled_width; + + ctx->SetOutputDim("Out", out_dims); + ctx->SetOutputDim("ConIdX", out_dims); + ctx->SetOutputDim("ConIdY", out_dims); + } + +protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType(ctx.Input("X")->type(), + ctx.device_context()); + } +}; + +class RRPNRotatedROIAlignGradOp : public framework::OperatorWithKernel { +public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")), + "The GRAD@Out of RotatedROIAlignGradOp should not be null."); + PADDLE_ENFORCE(ctx->HasOutputs(framework::GradVarName("X")), + "The GRAD@X of RotatedROIAlignGradOp should not be null."); + ctx->SetOutputsDim(framework::GradVarName("X"), ctx->GetInputsDim("X")); + } + +protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType(ctx.Input("ROIs")->type(), + ctx.device_context()); + } +}; + +class RRPNRotatedROIAlignOpMaker : public framework::OpProtoAndCheckerMaker { +public: + void Make() override { + AddInput("X", + "(Tensor), " + "The input of RRPNRotatedROIAlignOp. The data type is float32 or " + "float64." + "The format of input tensor is NCHW. Where N is batch size, " + "C is the number of input channels, " + "H is the height of the feature, and " + "W is the width of the feature."); + AddInput("ROIs", + "(LoDTensor), " + "ROIs (Regions of Interest) to pool over. " + "should be a 2-D LoDTensor of shape (num_rois, 5)" + "given as [[x, y, w, h, theta], ...]. 
" + "(x, y) is the center coordinates, and " + "(w, h) is the bottom right coordinates, theta is rotation angle" + "of ROI."); + AddOutput("Out", + "(Tensor), " + "The output of ROIAlignOp is a 4-D tensor with shape " + "(num_rois, channels, pooled_h, pooled_w). The data type is " + "float32 or float64."); + AddOutput("ConIdX", + "(Tensor), " + "index x of affine transform"); + AddOutput("ConIdY", + "(Tensor), " + "index y of affine transform"); + + AddAttr("spatial_scale", + "(float, default 1.0), " + "Multiplicative spatial scale factor " + "to translate ROI coords from their input scale " + "to the scale used when pooling.") + .SetDefault(1.0); + AddAttr("pooled_height", + "(int, default 1), " + "The pooled output height.") + .SetDefault(1); + AddAttr("pooled_width", + "(int, default 1), " + "The pooled output width.") + .SetDefault(1); + AddComment(R"DOC( +**RotatedRoIAlign Operator** + +Rotated Region of interest align (also known as Rotated RoI align) is to perform +bilinear interpolation on inputs of nonuniform sizes to obtain +fixed-size feature maps (e.g. 7*7) + +Dividing each region proposal into equal-sized sections with +the pooled_width and pooled_height. Location remains the origin +result. + +In each ROI bin, the value of the four regularly sampled locations +are computed directly through bilinear interpolation. The output is +the mean of four locations. +Thus avoid the misaligned problem. + )DOC"); + } +}; + +template +class RRPNRotatedROIAlignGradMaker : public framework::SingleGradOpMaker { +public: + using framework::SingleGradOpMaker::SingleGradOpMaker; + +protected: + std::unique_ptr Apply() const override { + std::unique_ptr op(new T); + op->SetType("rrpn_rotated_roi_align_grad"); + op->SetInput("X", this->Input("X")); + op->SetInput("ROIs", this->Input("ROIs")); + op->SetInput("ConIdX", this->Output("ConIdX")); + op->SetInput("ConIdY", this->Output("ConIdY")); + op->SetInput(framework::GradVarName("Out"), this->OutputGrad("Out")); + op->SetOutput(framework::GradVarName("X"), this->InputGrad("X")); + op->SetAttrMap(this->Attrs()); + return op; + } +}; + +DECLARE_NO_NEED_BUFFER_VARS_INFERENCE( + RRPNRotatedRoiAlignGradNoNeedBufVarsInferer, "X"); + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OPERATOR( + rrpn_rotated_roi_align, + ops::RRPNRotatedROIAlignOp, + ops::RRPNRotatedROIAlignOpMaker, + ops::RRPNRotatedROIAlignGradMaker, + ops::RRPNRotatedROIAlignGradMaker); +REGISTER_OPERATOR(rrpn_rotated_roi_align_grad, + ops::RRPNRotatedROIAlignGradOp, + ops::RRPNRotatedRoiAlignGradNoNeedBufVarsInferer); diff --git a/PaddleCV/rrpn/models/ext_op/src/rrpn_rotated_roi_align_op.cu b/PaddleCV/rrpn/models/ext_op/src/rrpn_rotated_roi_align_op.cu new file mode 100644 index 0000000000000000000000000000000000000000..c68209e22c9f7cd902929c2c79031c7f13ebb1af --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/rrpn_rotated_roi_align_op.cu @@ -0,0 +1,442 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
+ +Based on +@misc{ma2019rrpn, + author = {Jianqi Ma}, + title = {{RRPN in pytorch}}, + year = {2019}, + howpublished = {\url{https://github.com/mjq11302010044/RRPN_pytorch}}, +} +@article{Jianqi17RRPN, + Author = {Jianqi Ma and Weiyuan Shao and Hao Ye and Li Wang and Hong Wang +and Yingbin Zheng and Xiangyang Xue}, + Title = {Arbitrary-Oriented Scene Text Detection via Rotation Proposals}, + journal = {IEEE Transactions on Multimedia}, + volume={20}, + number={11}, + pages={3111-3122}, + year={2018} +}*/ + +#include +#include +#include "paddle/fluid/framework/op_registry.h" +#include "paddle/fluid/memory/memory.h" +#include "paddle/fluid/platform/cuda_primitives.h" + +#define CUDA_1D_KERNEL_LOOP(i, n) \ + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; \ + i += blockDim.x * gridDim.x) + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +using LoDTensor = framework::LoDTensor; + +static constexpr int kNumCUDAThreads = 512; +static constexpr int kNumMaxinumNumBlocks = 4096; +#define PI 3.141592654 + +static inline int NumBlocks(const int N) { + return std::min((N + kNumCUDAThreads - 1) / kNumCUDAThreads, + kNumMaxinumNumBlocks); +} + + +template +__global__ void Zero(T* x, int num) { + for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num; + i += blockDim.x * gridDim.x) { + x[i] = static_cast(0); + } +} + +template +__global__ void RROIAlignForward(const int nthreads, + const T* bottom_data, + const T spatial_scale, + int height, + int width, + int channels, + const int pooled_height, + const int pooled_width, + const T* bottom_rois, + int* roi_batch_id_data, + T* top_data, + T* con_idx_x, + T* con_idx_y) { + CUDA_1D_KERNEL_LOOP(index, nthreads) { + int imageWidth = width; + int imageHeight = height; + + // (n, c, ph, pw) is an element in the pooled output + int n = index; + int pw = n % pooled_width; + n /= pooled_width; + int ph = n % pooled_height; + n /= pooled_height; + int c = n % channels; + n /= channels; + + const T* offset_bottom_rois = bottom_rois + n * 5; + + int roi_batch_ind = roi_batch_id_data[n]; + T cx = offset_bottom_rois[0]; + T cy = offset_bottom_rois[1]; + T h = offset_bottom_rois[3]; + T w = offset_bottom_rois[2]; + T angle = offset_bottom_rois[4] / 180.0 * PI; + + // TransformPrepare + T dx = -pooled_width / 2.0; + T dy = -pooled_height / 2.0; + T Sx = w * spatial_scale / pooled_width; + T Sy = h * spatial_scale / pooled_height; + T Alpha = cos(angle); + T Beta = sin(angle); + T Dx = cx * spatial_scale; + T Dy = cy * spatial_scale; + + T M[2][3]; + M[0][0] = Alpha * Sx; + M[0][1] = Beta * Sy; + M[0][2] = Alpha * Sx * dx + Beta * Sy * dy + Dx; + M[1][0] = -Beta * Sx; + M[1][1] = Alpha * Sy; + M[1][2] = -Beta * Sx * dx + Alpha * Sy * dy + Dy; + + T P[8]; + P[0] = M[0][0] * pw + M[0][1] * ph + M[0][2]; + P[1] = M[1][0] * pw + M[1][1] * ph + M[1][2]; + P[2] = M[0][0] * pw + M[0][1] * (ph + 1) + M[0][2]; + P[3] = M[1][0] * pw + M[1][1] * (ph + 1) + M[1][2]; + P[4] = M[0][0] * (pw + 1) + M[0][1] * ph + M[0][2]; + P[5] = M[1][0] * (pw + 1) + M[1][1] * ph + M[1][2]; + P[6] = M[0][0] * (pw + 1) + M[0][1] * (ph + 1) + M[0][2]; + P[7] = M[1][0] * (pw + 1) + M[1][1] * (ph + 1) + M[1][2]; + + T leftMost = (max(round(min(min(P[0], P[2]), min(P[4], P[6]))), 0.0)); + T rightMost = + (min(round(max(max(P[0], P[2]), max(P[4], P[6]))), imageWidth - 1.0)); + T topMost = (max(round(min(min(P[1], P[3]), min(P[5], P[7]))), 0.0)); + T bottomMost = + (min(round(max(max(P[1], P[3]), max(P[5], P[7]))), imageHeight - 1.0)); + + const T* 
offset_bottom_data = + bottom_data + (roi_batch_ind * channels + c) * height * width; + + + float bin_cx = (leftMost + rightMost) / 2.0; // shift + float bin_cy = (topMost + bottomMost) / 2.0; + + int bin_l = (int)floor(bin_cx); + int bin_r = (int)ceil(bin_cx); + int bin_t = (int)floor(bin_cy); + int bin_b = (int)ceil(bin_cy); + + T lt_value = 0.0; + if (bin_t > 0 && bin_l > 0 && bin_t < height && bin_l < width) + lt_value = offset_bottom_data[bin_t * width + bin_l]; + T rt_value = 0.0; + if (bin_t > 0 && bin_r > 0 && bin_t < height && bin_r < width) + rt_value = offset_bottom_data[bin_t * width + bin_r]; + T lb_value = 0.0; + if (bin_b > 0 && bin_l > 0 && bin_b < height && bin_l < width) + lb_value = offset_bottom_data[bin_b * width + bin_l]; + T rb_value = 0.0; + if (bin_b > 0 && bin_r > 0 && bin_b < height && bin_r < width) + rb_value = offset_bottom_data[bin_b * width + bin_r]; + + T rx = bin_cx - floor(bin_cx); + T ry = bin_cy - floor(bin_cy); + + T wlt = (1.0 - rx) * (1.0 - ry); + T wrt = rx * (1.0 - ry); + T wrb = rx * ry; + T wlb = (1.0 - rx) * ry; + + T inter_val = 0.0; + + inter_val += lt_value * wlt; + inter_val += rt_value * wrt; + inter_val += rb_value * wrb; + inter_val += lb_value * wlb; + + platform::CudaAtomicAdd(top_data + index, static_cast(inter_val)); + platform::CudaAtomicAdd(con_idx_x + index, static_cast(bin_cx)); + platform::CudaAtomicAdd(con_idx_y + index, static_cast(bin_cy)); + } +} + +template +__global__ void RROIAlignBackward(const int nthreads, + const T* top_diff, + const float* con_idx_x, + const float* con_idx_y, + const int num_rois, + const float spatial_scale, + const int height, + const int width, + const int channels, + const int pooled_height, + const int pooled_width, + T* bottom_diff, + const T* bottom_rois, + int* roi_batch_id_data) { + CUDA_1D_KERNEL_LOOP(index, nthreads) { + // (n, c, ph, pw) is an element in the pooled output + int n = index; + n /= pooled_width; + n /= pooled_height; + int c = n % channels; + n /= channels; + + const T* offset_bottom_rois = bottom_rois + n * 5; + int roi_batch_ind = roi_batch_id_data[n]; + T* offset_bottom_diff = + bottom_diff + (roi_batch_ind * channels + c) * height * width; + + + float bw = con_idx_x[index]; + float bh = con_idx_y[index]; + + int bin_xs = int(floor(bw)); + int bin_ys = int(floor(bh)); + + float rx = bw - float(bin_xs); + float ry = bh - float(bin_ys); + + T wlt = (1.0 - rx) * (1.0 - ry); + T wrt = rx * (1.0 - ry); + T wrb = rx * ry; + T wlb = (1.0 - rx) * ry; + + + int min_x = (int)floor(bw); + int max_x = (int)ceil(bw); + int min_y = (int)floor(bh); + int max_y = (int)ceil(bh); + + T top_diff_of_bin = top_diff[index]; + + T v1 = wlt * top_diff_of_bin; + T v2 = wrt * top_diff_of_bin; + T v3 = wrb * top_diff_of_bin; + T v4 = wlb * top_diff_of_bin; + + + if (min_y > 0 && min_x > 0 && min_y < height - 1 && min_x < width - 1) + platform::CudaAtomicAdd(offset_bottom_diff + min_y * width + min_x, + static_cast(v1)); + if (min_y > 0 && max_x < width - 1 && min_y < height - 1 && max_x > 0) + platform::CudaAtomicAdd(offset_bottom_diff + min_y * width + max_x, + static_cast(v2)); + if (max_y < height - 1 && max_x < width - 1 && max_y > 0 && max_x > 0) + platform::CudaAtomicAdd(offset_bottom_diff + max_y * width + max_x, + static_cast(v3)); + if (max_y < height - 1 && min_x > 0 && max_y > 0 && min_x < width - 1) + platform::CudaAtomicAdd(offset_bottom_diff + max_y * width + min_x, + static_cast(v4)); + } +} + +template +class RRPNROIAlignRotatedCUDAKernel : public framework::OpKernel { +public: + 
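+  // Forward pass: build the per-ROI batch-index table on the CPU from the
+  // ROIs' LoD, copy it to the GPU, zero-initialize the outputs, then launch
+  // RROIAlignForward to fill each rotated bin by bilinear interpolation.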
void Compute(const framework::ExecutionContext& ctx) const override { + auto* input = ctx.Input("X"); + auto* rois = ctx.Input("ROIs"); + auto* out = ctx.Output("Out"); + auto* con_idx_x = ctx.Output("ConIdX"); + auto* con_idx_y = ctx.Output("ConIdY"); + + auto pooled_height = ctx.Attr("pooled_height"); + auto pooled_width = ctx.Attr("pooled_width"); + auto spatial_scale = ctx.Attr("spatial_scale"); + + auto in_dims = input->dims(); + int batch_size = in_dims[0]; + int channels = in_dims[1]; + int height = in_dims[2]; + int width = in_dims[3]; + + int rois_num = rois->dims()[0]; + + if (rois_num == 0) return; + + int output_size = out->numel(); + int blocks = NumBlocks(output_size); + int threads = kNumCUDAThreads; + + Tensor roi_batch_id_list; + roi_batch_id_list.Resize({rois_num}); + auto cplace = platform::CPUPlace(); + int* roi_batch_id_data = roi_batch_id_list.mutable_data(cplace); + auto lod = rois->lod(); + PADDLE_ENFORCE_EQ( + lod.empty(), + false, + "Input(ROIs) Tensor of ROIAlignOp does not contain LoD information."); + auto rois_lod = lod.back(); + int rois_batch_size = rois_lod.size() - 1; + PADDLE_ENFORCE_EQ( + rois_batch_size, + batch_size, + "The rois_batch_size and imgs batch_size must be the same."); + int rois_num_with_lod = rois_lod[rois_batch_size]; + PADDLE_ENFORCE_EQ(rois_num, + rois_num_with_lod, + "The rois_num from input and lod must be the same."); + for (int n = 0; n < rois_batch_size; ++n) { + for (size_t i = rois_lod[n]; i < rois_lod[n + 1]; ++i) { + roi_batch_id_data[i] = n; + } + } + auto& dev_ctx = ctx.cuda_device_context(); + int bytes = roi_batch_id_list.numel() * sizeof(int); + auto roi_ptr = memory::Alloc(dev_ctx, bytes); + int* roi_id_data = reinterpret_cast(roi_ptr->ptr()); + const auto gplace = boost::get(ctx.GetPlace()); + memory::Copy(gplace, + roi_id_data, + cplace, + roi_batch_id_data, + bytes, + dev_ctx.stream()); + + T* out_ = out->mutable_data(ctx.GetPlace()); + T* con_idx_x_ = con_idx_x->mutable_data(ctx.GetPlace()); + T* con_idx_y_ = con_idx_y->mutable_data(ctx.GetPlace()); + + int idx_x_num = con_idx_x->numel(); + int idx_y_num = con_idx_y->numel(); + int out_num = out->numel(); + Zero<<<(idx_x_num + 512 - 1) / 512, 512, 0, dev_ctx.stream()>>>(con_idx_x_, + idx_x_num); + Zero<<<(idx_y_num + 512 - 1) / 512, 512, 0, dev_ctx.stream()>>>(con_idx_y_, + idx_y_num); + Zero<<<(out_num + 512 - 1) / 512, 512, 0, dev_ctx.stream()>>>(out_, + out_num); + + RROIAlignForward<<>>( + output_size, + input->data(), + spatial_scale, + height, + width, + channels, + pooled_height, + pooled_width, + rois->data(), + roi_id_data, + out_, + con_idx_x_, + con_idx_y_); + } +}; + +template +class RRPNROIAlignRotatedGradCUDAKernel : public framework::OpKernel { +public: + void Compute(const framework::ExecutionContext& ctx) const override { + auto* input = ctx.Input("X"); + auto* rois = ctx.Input("ROIs"); + + auto* out_grad = ctx.Input(framework::GradVarName("Out")); + auto* in_grad = ctx.Output(framework::GradVarName("X")); + auto* con_idx_x = ctx.Input("ConIdX"); + auto* con_idx_y = ctx.Input("ConIdY"); + auto pooled_height = ctx.Attr("pooled_height"); + auto pooled_width = ctx.Attr("pooled_width"); + auto spatial_scale = ctx.Attr("spatial_scale"); + + int rois_num = rois->dims()[0]; + int channels = input->dims()[1]; + int height = input->dims()[2]; + int width = input->dims()[3]; + + if (!in_grad) { + return; + } + Tensor roi_batch_id_list; + roi_batch_id_list.Resize({rois_num}); + auto cplace = platform::CPUPlace(); + int* roi_batch_id_data = 
roi_batch_id_list.mutable_data(cplace); + auto rois_lod = rois->lod().back(); + int rois_batch_size = rois_lod.size() - 1; + for (int n = 0; n < rois_batch_size; ++n) { + for (size_t i = rois_lod[n]; i < rois_lod[n + 1]; ++i) { + roi_batch_id_data[i] = n; + } + } + auto& dev_ctx = ctx.cuda_device_context(); + auto roi_ptr = + memory::Alloc(dev_ctx, roi_batch_id_list.numel() * sizeof(int)); + int* roi_id_data = reinterpret_cast(roi_ptr->ptr()); + int bytes = roi_batch_id_list.numel() * sizeof(int); + const auto gplace = boost::get(ctx.GetPlace()); + memory::Copy(gplace, + roi_id_data, + cplace, + roi_batch_id_data, + bytes, + dev_ctx.stream()); + T* in_grad_ = in_grad->mutable_data(ctx.GetPlace()); + int in_grad_num = in_grad->numel(); + Zero<<<(in_grad_num + 512 - 1) / 512, 512, 0, dev_ctx.stream()>>>( + in_grad_, in_grad_num); + int output_grad_size = out_grad->numel(); + int blocks = NumBlocks(output_grad_size); + int threads = kNumCUDAThreads; + con_idx_x->data(); + con_idx_y->data(); + out_grad->data(); + rois->data(); + if (output_grad_size > 0) { + RROIAlignBackward<<>>( + output_grad_size, + out_grad->data(), + con_idx_x->data(), + con_idx_y->data(), + rois_num, + spatial_scale, + height, + width, + channels, + pooled_height, + pooled_width, + in_grad_, + // in_grad->mutable_data(ctx.GetPlace()), + rois->data(), + roi_id_data); + } + } +}; + + +} // namespace operators +} // namespace paddle + +namespace ops = paddle::operators; +REGISTER_OP_CUDA_KERNEL( + rrpn_rotated_roi_align, + ops::RRPNROIAlignRotatedCUDAKernel, + ops::RRPNROIAlignRotatedCUDAKernel); +REGISTER_OP_CUDA_KERNEL( + rrpn_rotated_roi_align_grad, + ops::RRPNROIAlignRotatedGradCUDAKernel, + ops::RRPNROIAlignRotatedGradCUDAKernel); diff --git a/PaddleCV/rrpn/models/ext_op/src/rrpn_target_assign_op.cc b/PaddleCV/rrpn/models/ext_op/src/rrpn_target_assign_op.cc new file mode 100644 index 0000000000000000000000000000000000000000..74e2fe0bdbeda27c45699e76d323b7da39815c89 --- /dev/null +++ b/PaddleCV/rrpn/models/ext_op/src/rrpn_target_assign_op.cc @@ -0,0 +1,544 @@ +/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include +#include +#include +#include "bbox_util.h" +#include "paddle/fluid/framework/op_registry.h" + +namespace paddle { +namespace operators { + +using Tensor = framework::Tensor; +using LoDTensor = framework::LoDTensor; +template +using EigenMatrix = framework::EigenMatrix; + +class RRpnTargetAssignOp : public framework::OperatorWithKernel { +public: + using framework::OperatorWithKernel::OperatorWithKernel; + + void InferShape(framework::InferShapeContext* ctx) const override { + PADDLE_ENFORCE(ctx->HasInput("Anchor"), + "Input(Anchor) of RRpnTargetAssignOp should not be null"); + PADDLE_ENFORCE(ctx->HasInput("GtBoxes"), + "Input(GtBoxes) of RRpnTargetAssignOp should not be null"); + PADDLE_ENFORCE(ctx->HasInput("ImInfo"), + "Input(ImInfo) of RRpnTargetAssignOp should not be null"); + PADDLE_ENFORCE( + ctx->HasOutput("LocationIndex"), + "Output(LocationIndex) of RRpnTargetAssignOp should not be null"); + PADDLE_ENFORCE( + ctx->HasOutput("ScoreIndex"), + "Output(ScoreIndex) of RRpnTargetAssignOp should not be null"); + PADDLE_ENFORCE( + ctx->HasOutput("TargetLabel"), + "Output(TargetLabel) of RRpnTargetAssignOp should not be null"); + PADDLE_ENFORCE( + ctx->HasOutput("TargetBBox"), + "Output(TargetBBox) of RRpnTargetAssignOp should not be null"); + + auto anchor_dims = ctx->GetInputDim("Anchor"); + auto gt_boxes_dims = ctx->GetInputDim("GtBoxes"); + auto im_info_dims = ctx->GetInputDim("ImInfo"); + PADDLE_ENFORCE_EQ( + anchor_dims.size(), 2, "The rank of Input(Anchor) must be 2."); + PADDLE_ENFORCE_EQ( + gt_boxes_dims.size(), 2, "The rank of Input(GtBoxes) must be 2."); + PADDLE_ENFORCE_EQ( + im_info_dims.size(), 2, "The rank of Input(ImInfo) must be 2."); + + ctx->SetOutputDim("LocationIndex", {-1}); + ctx->SetOutputDim("ScoreIndex", {-1}); + ctx->SetOutputDim("TargetLabel", {-1, 1}); + ctx->SetOutputDim("TargetBBox", {-1, 5}); + } + +protected: + framework::OpKernelType GetExpectedKernelType( + const framework::ExecutionContext& ctx) const override { + return framework::OpKernelType( + ctx.Input("Anchor")->type(), + platform::CPUPlace()); + } +}; + + +template +void AppendRpns(LoDTensor* out, int64_t offset, Tensor* to_add) { + auto* out_data = out->data(); + auto* to_add_data = to_add->data(); + memcpy(out_data + offset, to_add_data, to_add->numel() * sizeof(T)); +} + + +template +std::vector FilterStraddleAnchor( + const platform::CPUDeviceContext& context, + const Tensor* anchor, + const float rpn_straddle_thresh, + T im_height, + T im_width, + int64_t offset) { + std::vector inds_inside; + int anchor_num = anchor->dims()[0]; + auto* anchor_data = anchor->data(); + if (rpn_straddle_thresh >= 0) { + int index; + for (int i = 0; i < anchor_num; ++i) { + index = i * offset; + if ((anchor_data[index + 0] >= -rpn_straddle_thresh) && + (anchor_data[index + 1] >= -rpn_straddle_thresh) && + (anchor_data[index + 2] < im_width + rpn_straddle_thresh) && + (anchor_data[index + 3] < im_height + rpn_straddle_thresh)) { + inds_inside.emplace_back(i); + } + } + } else { + for (int i = 0; i < anchor_num; ++i) { + inds_inside.emplace_back(i); + } + } + int inside_num = inds_inside.size(); + Tensor inds_inside_t; + int* inds_inside_data = + inds_inside_t.mutable_data({inside_num}, context.GetPlace()); + std::copy(inds_inside.begin(), inds_inside.end(), inds_inside_data); + Tensor inside_anchor_t; + T* inside_anchor_data = + inside_anchor_t.mutable_data({inside_num, offset}, context.GetPlace()); + Gather(anchor->data(), + offset, + inds_inside_data, + inside_num, + inside_anchor_data); + 
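+  // Return both the indices of the anchors kept inside the image (all anchors
+  // when rpn_straddle_thresh < 0) and the gathered anchor boxes themselves.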
std::vector res; + res.emplace_back(inds_inside_t); + res.emplace_back(inside_anchor_t); + return res; +} + + +void ReservoirSampling(const int num, + std::vector* inds, + std::minstd_rand engine, + bool use_random) { + std::uniform_real_distribution uniform(0, 1); + size_t len = inds->size(); + if (len > static_cast(num)) { + if (use_random) { + for (size_t i = num; i < len; ++i) { + int rng_ind = std::floor(uniform(engine) * i); + if (rng_ind < num) + std::iter_swap(inds->begin() + rng_ind, inds->begin() + i); + } + } + inds->resize(num); + } +} + + +template +void RRpnScoreAssign(const T* anchor_by_gt_overlap_data, + const Tensor& anchor_to_gt_max, + const Tensor& gt_to_anchor_max, + const int rpn_batch_size_per_im, + const float rpn_fg_fraction, + const float rpn_positive_overlap, + const float rpn_negative_overlap, + std::vector* fg_inds, + std::vector* bg_inds, + std::vector* tgt_lbl, + std::minstd_rand engine, + bool use_random) { + float epsilon = 0.00000001; + int anchor_num = anchor_to_gt_max.dims()[0]; + int gt_num = gt_to_anchor_max.dims()[0]; + std::vector target_label(anchor_num, -1); + const T* anchor_to_gt_max_data = anchor_to_gt_max.data(); + const T* gt_to_anchor_max_data = gt_to_anchor_max.data(); + for (int64_t i = 0; i < anchor_num; ++i) { + bool is_anchors_with_max_overlap = false; + int64_t j = 0; + for (; j < gt_num; ++j) { + T value = anchor_by_gt_overlap_data[i * gt_num + j]; + T diff = std::abs(value - gt_to_anchor_max_data[j]); + if (diff < epsilon) { + is_anchors_with_max_overlap = true; + break; + } + } + bool is_anchor_great_than_thresh = + (anchor_to_gt_max_data[i] >= rpn_positive_overlap); + if (is_anchors_with_max_overlap || is_anchor_great_than_thresh) { + fg_inds->emplace_back(i); + target_label[i] = 1; + } + } + + // Reservoir Sampling + int fg_num = 0; + if (rpn_fg_fraction > 0 && rpn_batch_size_per_im > 0) { + fg_num = static_cast(rpn_fg_fraction * rpn_batch_size_per_im); + ReservoirSampling(fg_num, fg_inds, engine, use_random); + } + fg_num = static_cast(fg_inds->size()); + + for (int64_t i = 0; i < anchor_num; ++i) { + if (anchor_to_gt_max_data[i] < rpn_negative_overlap && + target_label[i] != 1) { + bg_inds->emplace_back(i); + target_label[i] = 0; + } + } + + + int bg_num = 0; + if (rpn_fg_fraction > 0 && rpn_batch_size_per_im > 0) { + bg_num = rpn_batch_size_per_im - fg_num; + ReservoirSampling(bg_num, bg_inds, engine, use_random); + } + bg_num = static_cast(bg_inds->size()); + tgt_lbl->resize(fg_num + bg_num, 0); + std::vector fg_lbl(fg_num, 1); + std::vector bg_lbl(bg_num, 0); + std::copy(fg_lbl.begin(), fg_lbl.end(), tgt_lbl->data()); + std::copy(bg_lbl.begin(), bg_lbl.end(), tgt_lbl->data() + fg_num); +} + +template +std::vector SampleRRpnFgBgGt(const platform::CPUDeviceContext& ctx, + const Tensor& anchor_by_gt_overlap, + const int rpn_batch_size_per_im, + const float rpn_positive_overlap, + const float rpn_negative_overlap, + const float rpn_fg_fraction, + std::minstd_rand engine, + bool use_random) { + auto* anchor_by_gt_overlap_data = anchor_by_gt_overlap.data(); + int anchor_num = anchor_by_gt_overlap.dims()[0]; + int gt_num = anchor_by_gt_overlap.dims()[1]; + + std::vector fg_inds; + std::vector bg_inds; + std::vector gt_inds; + std::vector tgt_lbl; + // Calculate the max IoU between anchors and gt boxes + // Map from anchor to gt box that has highest overlap + auto place = ctx.GetPlace(); + Tensor anchor_to_gt_max, anchor_to_gt_argmax, gt_to_anchor_max; + anchor_to_gt_max.mutable_data({anchor_num}, place); + int* argmax = 
anchor_to_gt_argmax.mutable_data({anchor_num}, place); + gt_to_anchor_max.mutable_data({gt_num}, place); + + auto anchor_by_gt_overlap_et = + framework::EigenMatrix::From(anchor_by_gt_overlap); + auto anchor_to_gt_max_et = + framework::EigenVector::Flatten(anchor_to_gt_max); + auto gt_to_anchor_max_et = + framework::EigenVector::Flatten(gt_to_anchor_max); + auto anchor_to_gt_argmax_et = + framework::EigenVector::Flatten(anchor_to_gt_argmax); + anchor_to_gt_max_et = + anchor_by_gt_overlap_et.maximum(Eigen::DSizes(1)); + anchor_to_gt_argmax_et = + anchor_by_gt_overlap_et.argmax(1).template cast(); + gt_to_anchor_max_et = + anchor_by_gt_overlap_et.maximum(Eigen::DSizes(0)); + + // Follow the Faster RCNN's implementation + RRpnScoreAssign(anchor_by_gt_overlap_data, + anchor_to_gt_max, + gt_to_anchor_max, + rpn_batch_size_per_im, + rpn_fg_fraction, + rpn_positive_overlap, + rpn_negative_overlap, + &fg_inds, + &bg_inds, + &tgt_lbl, + engine, + use_random); + + int fg_num = fg_inds.size(); + int bg_num = bg_inds.size(); + gt_inds.reserve(fg_num); + for (int i = 0; i < fg_num; ++i) { + gt_inds.emplace_back(argmax[fg_inds[i]]); + } + Tensor loc_index_t, score_index_t, tgt_lbl_t, gt_inds_t; + int* loc_index_data = loc_index_t.mutable_data({fg_num}, place); + int* score_index_data = + score_index_t.mutable_data({fg_num + bg_num}, place); + int* tgt_lbl_data = tgt_lbl_t.mutable_data({fg_num + bg_num}, place); + int* gt_inds_data = gt_inds_t.mutable_data({fg_num}, place); + std::copy(fg_inds.begin(), fg_inds.end(), loc_index_data); + std::copy(fg_inds.begin(), fg_inds.end(), score_index_data); + std::copy(bg_inds.begin(), bg_inds.end(), score_index_data + fg_num); + std::copy(tgt_lbl.begin(), tgt_lbl.end(), tgt_lbl_data); + std::copy(gt_inds.begin(), gt_inds.end(), gt_inds_data); + std::vector loc_score_tgtlbl_gt; + loc_score_tgtlbl_gt.emplace_back(loc_index_t); + loc_score_tgtlbl_gt.emplace_back(score_index_t); + loc_score_tgtlbl_gt.emplace_back(tgt_lbl_t); + loc_score_tgtlbl_gt.emplace_back(gt_inds_t); + + return loc_score_tgtlbl_gt; +} + +template +class RRpnTargetAssignKernel : public framework::OpKernel { +public: + void Compute(const framework::ExecutionContext& context) const override { + auto* anchor = context.Input("Anchor"); // (H*W*A) * 5 + auto* gt_boxes = context.Input("GtBoxes"); + auto* im_info = context.Input("ImInfo"); + + auto* loc_index = context.Output("LocationIndex"); + auto* score_index = context.Output("ScoreIndex"); + auto* tgt_bbox = context.Output("TargetBBox"); + auto* tgt_lbl = context.Output("TargetLabel"); + + PADDLE_ENFORCE_EQ(gt_boxes->lod().size(), + 1UL, + "RRpnTargetAssignOp gt_boxes needs 1 level of LoD"); + int64_t anchor_num = static_cast(anchor->dims()[0]); + int64_t batch_num = static_cast(gt_boxes->lod().back().size() - 1); + + int rpn_batch_size_per_im = context.Attr("rpn_batch_size_per_im"); + float rpn_straddle_thresh = context.Attr("rpn_straddle_thresh"); + float rpn_positive_overlap = context.Attr("rpn_positive_overlap"); + float rpn_negative_overlap = context.Attr("rpn_negative_overlap"); + float rpn_fg_fraction = context.Attr("rpn_fg_fraction"); + bool use_random = context.Attr("use_random"); + int64_t max_num = batch_num * rpn_batch_size_per_im; + auto place = context.GetPlace(); + + loc_index->mutable_data({max_num}, place); + score_index->mutable_data({max_num}, place); + tgt_bbox->mutable_data({max_num, 5}, place); + tgt_lbl->mutable_data({max_num, 1}, place); + auto& dev_ctx = context.device_context(); + + std::random_device rnd; + std::minstd_rand 
engine; + int seed = rnd(); + engine.seed(seed); + + framework::LoD lod_loc, loc_score; + std::vector lod0_loc(1, 0); + std::vector lod0_score(1, 0); + + int total_loc_num = 0; + int total_score_num = 0; + auto gt_boxes_lod = gt_boxes->lod().back(); + for (int i = 0; i < batch_num; ++i) { + Tensor gt_boxes_slice = + gt_boxes->Slice(gt_boxes_lod[i], gt_boxes_lod[i + 1]); + Tensor im_info_slice = im_info->Slice(i, i + 1); + auto* im_info_data = im_info_slice.data(); + auto im_height = im_info_data[0]; + auto im_width = im_info_data[1]; + // auto im_scale = im_info_data[2]; + // Filter straddle anchor + std::vector filter_output = FilterStraddleAnchor( + dev_ctx, anchor, rpn_straddle_thresh, im_height, im_width, 5); + Tensor inds_inside = filter_output[0]; + Tensor inside_anchor = filter_output[1]; + + Tensor anchor_by_gt_overlap; + anchor_by_gt_overlap.mutable_data( + {inside_anchor.dims()[0], gt_boxes_slice.dims()[0]}, place); + BboxOverlaps2(inside_anchor, gt_boxes_slice, &anchor_by_gt_overlap); + auto loc_score_tgtlbl_gt = SampleRRpnFgBgGt(dev_ctx, + anchor_by_gt_overlap, + rpn_batch_size_per_im, + rpn_positive_overlap, + rpn_negative_overlap, + rpn_fg_fraction, + engine, + use_random); + + Tensor sampled_loc_index = loc_score_tgtlbl_gt[0]; + Tensor sampled_score_index = loc_score_tgtlbl_gt[1]; + Tensor sampled_tgtlbl = loc_score_tgtlbl_gt[2]; + Tensor sampled_gt_index = loc_score_tgtlbl_gt[3]; + + int loc_num = sampled_loc_index.dims()[0]; + int score_num = sampled_score_index.dims()[0]; + // unmap to all anchor + Tensor sampled_loc_index_unmap, sampled_score_index_unmap; + sampled_loc_index_unmap.mutable_data({loc_num}, place); + sampled_score_index_unmap.mutable_data({score_num}, place); + Gather(inds_inside.data(), + 1, + sampled_loc_index.data(), + loc_num, + sampled_loc_index_unmap.data()); + Gather(inds_inside.data(), + 1, + sampled_score_index.data(), + score_num, + sampled_score_index_unmap.data()); + + // get target bbox deltas + Tensor sampled_anchor, sampled_gt, sampled_tgt_bbox; + auto* sampled_anchor_data = + sampled_anchor.mutable_data({loc_num, 5}, place); + auto* sampled_gt_data = sampled_gt.mutable_data({loc_num, 5}, place); + Gather(anchor->data(), + 5, + sampled_loc_index_unmap.data(), + loc_num, + sampled_anchor_data); + Gather(gt_boxes_slice.data(), + 5, + sampled_gt_index.data(), + loc_num, + sampled_gt_data); + sampled_tgt_bbox.mutable_data({loc_num, 5}, place); + BoxToDelta2( + loc_num, sampled_anchor, sampled_gt, nullptr, &sampled_tgt_bbox); + std::ofstream file_anchor; + // Add anchor offset + int anchor_offset = i * anchor_num; + auto sampled_loc_index_unmap_et = + framework::EigenTensor::From(sampled_loc_index_unmap); + sampled_loc_index_unmap_et = sampled_loc_index_unmap_et + anchor_offset; + auto sampled_score_index_unmap_et = + framework::EigenTensor::From(sampled_score_index_unmap); + sampled_score_index_unmap_et = + sampled_score_index_unmap_et + anchor_offset; + AppendRpns(loc_index, total_loc_num, &sampled_loc_index_unmap); + AppendRpns(score_index, total_score_num, &sampled_score_index_unmap); + AppendRpns(tgt_bbox, total_loc_num * 5, &sampled_tgt_bbox); + AppendRpns(tgt_lbl, total_score_num, &sampled_tgtlbl); + total_loc_num += loc_num; + total_score_num += score_num; + lod0_loc.emplace_back(total_loc_num); + lod0_score.emplace_back(total_score_num); + } + + PADDLE_ENFORCE_LE(total_loc_num, max_num); + PADDLE_ENFORCE_LE(total_score_num, max_num); + + lod_loc.emplace_back(lod0_loc); + loc_score.emplace_back(lod0_score); + loc_index->set_lod(lod_loc); + 
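+    // The LoD attached to the outputs records cumulative per-image offsets
+    // into the concatenated index tensors, so callers can recover which
+    // sampled anchors belong to which image of the batch.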
    score_index->set_lod(loc_score);
+    tgt_bbox->set_lod(lod_loc);
+    tgt_lbl->set_lod(loc_score);
+    loc_index->Resize({total_loc_num});
+    score_index->Resize({total_score_num});
+    tgt_bbox->Resize({total_loc_num, 5});
+    tgt_lbl->Resize({total_score_num, 1});
+  }
+};
+
+class RRpnTargetAssignOpMaker : public framework::OpProtoAndCheckerMaker {
+public:
+  void Make() override {
+    AddInput("Anchor",
+             "(Tensor) input anchor is a 2-D Tensor with shape [H*W*A, 5].");
+    AddInput("GtBoxes",
+             "(LoDTensor) input ground-truth bbox with shape [K, 5].");
+    AddInput("ImInfo",
+             "(LoDTensor) input image information with shape [N, 3]. "
+             "N is the batch size, each image information includes height, "
+             "width and scale.");
+    AddAttr<int>("rpn_batch_size_per_im",
+                 "Total number of RPN examples per image.")
+        .SetDefault(256);
+    AddAttr<float>(
+        "rpn_straddle_thresh",
+        "Remove RPN anchors that go outside the image by straddle_thresh "
+        "pixels. "
+        "Set to -1 or a large value, e.g. 100000, to disable pruning anchors.");
+    AddAttr<float>(
+        "rpn_positive_overlap",
+        "Minimum overlap required between an anchor and ground-truth "
+        "box for the (anchor, gt box) pair to be a positive example.")
+        .SetDefault(0.7);
+    AddAttr<float>(
+        "rpn_negative_overlap",
+        "Maximum overlap allowed between an anchor and ground-truth "
+        "box for the (anchor, gt box) pair to be a negative example.")
+        .SetDefault(0.3);
+    AddAttr<float>(
+        "rpn_fg_fraction",
+        "Target fraction of the RPN minibatch that "
+        "is labeled foreground (i.e. class > 0); class 0 is background.")
+        .SetDefault(0.25);
+    AddAttr<bool>("use_random",
+                  "A flag indicating whether to use reservoir sampling. "
+                  "NOTE: DO NOT set this flag to false in training. "
+                  "Setting this flag to false is only useful in unittests.")
+        .SetDefault(true);
+    AddOutput(
+        "LocationIndex",
+        "(Tensor), The indexes of foreground anchors in all RPN anchors; the "
+        "shape of LocationIndex is [F], where F depends on the values of the "
+        "input tensors and attributes.");
+    AddOutput(
+        "ScoreIndex",
+        "(Tensor), The indexes of foreground and background anchors in all "
+        "RPN anchors (the rest are ignored). The shape of ScoreIndex is "
+        "[F + B], where F and B are the sampled foreground and background "
+        "numbers.");
+    AddOutput("TargetBBox",
+              "(Tensor), The target bbox deltas with shape "
+              "[F, 5], F is the sampled foreground number.");
+    AddOutput(
+        "TargetLabel",
+        "(Tensor), The target labels of each anchor with shape "
+        "[F + B, 1], F and B are the sampled foreground and background numbers.");
+    AddComment(R"DOC(
+Given a set of ground-truth bboxes and anchors, this operator assigns
+classification and regression targets to each prediction.
+The ScoreIndex and LocationIndex are generated according to the anchor-to-ground-truth IoU.
+The remaining anchors do not contribute to the RPN training loss.
+
+ScoreIndex is composed of foreground anchor indexes (positive labels) and
+background anchor indexes (negative labels). LocationIndex is exactly the same
+as the foreground anchor indexes, since we cannot assign a regression target to
+the background anchors.
+
+The classification target (TargetLabel) is a binary class label (of being
+an object or not). Following the Faster R-CNN paper, positive labels are
+assigned to two kinds of anchors: (i) the anchor/anchors with the highest IoU
+overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap
+higher than rpn_positive_overlap (0.7) with any ground-truth box. Note that
+a single ground-truth box may assign positive labels to multiple anchors.
+An anchor is assigned a negative label when its IoU ratio is lower than
+rpn_negative_overlap (0.3) for all ground-truth boxes. Anchors that are
+neither positive nor negative do not contribute to the training objective.
+
+)DOC");
+  }
+};
+
+}  // namespace operators
+}  // namespace paddle
+
+namespace ops = paddle::operators;
+
+REGISTER_OPERATOR(
+    rrpn_target_assign,
+    ops::RRpnTargetAssignOp,
+    ops::RRpnTargetAssignOpMaker,
+    paddle::framework::EmptyGradOpMaker<paddle::framework::OpDesc>,
+    paddle::framework::EmptyGradOpMaker<paddle::imperative::OpBase>);
+REGISTER_OP_CPU_KERNEL(rrpn_target_assign,
+                       ops::RRpnTargetAssignKernel<float>,
+                       ops::RRpnTargetAssignKernel<double>);
diff --git a/PaddleCV/rrpn/models/ext_op/src/safe_ref.h b/PaddleCV/rrpn/models/ext_op/src/safe_ref.h
new file mode 100755
index 0000000000000000000000000000000000000000..6a67b1a7835958cfea242383cb6df992a288c722
--- /dev/null
+++ b/PaddleCV/rrpn/models/ext_op/src/safe_ref.h
@@ -0,0 +1,35 @@
+/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License. */
+
+#pragma once
+#include <sstream>
+#include "paddle/fluid/platform/enforce.h"
+
+namespace paddle {
+namespace operators {
+namespace detail {
+/**
+ * Get a reference from a pointer, with a null check. The error message is a
+ * printf-style format string whose arguments are passed via `args`.
+ */
+template <typename T, typename... ARGS>
+inline T& Ref(T* ptr, ARGS&&... args) {
+  PADDLE_ENFORCE_NOT_NULL(ptr, ::paddle::string::Sprintf(args...));
+  return *ptr;
+}
+
+
+} // namespace detail
+} // namespace operators
+} // namespace paddle
diff --git a/PaddleCV/rrpn/models/model_builder.py b/PaddleCV/rrpn/models/model_builder.py
new file mode 100755
index 0000000000000000000000000000000000000000..1f976faca76732e0f3e31b485bd482602fbb8c4c
--- /dev/null
+++ b/PaddleCV/rrpn/models/model_builder.py
@@ -0,0 +1,379 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
+#
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
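+
+# Reference-only sketch (not used by the model): the heads below regress
+# 5-parameter rotated boxes (x_ctr, y_ctr, w, h, angle). Assuming the
+# parameterization of the original RRPN paper -- the exact normalization
+# applied by the custom BoxToDelta2 / rrpn_box_coder ops (e.g. through
+# cfg.bbox_reg_weights) may differ -- encoding a ground-truth box against an
+# anchor looks roughly like the helper below.
+import math
+
+
+def _rbox_delta_sketch(gt, anchor):
+    """Illustrative rotated-box delta encoding; for reading only."""
+    x, y, w, h, a = gt
+    xa, ya, wa, ha, aa = anchor
+    return [(x - xa) / wa, (y - ya) / ha,
+            math.log(w / wa), math.log(h / ha),
+            a - aa]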
+ +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr +from paddle.fluid.initializer import Constant +from paddle.fluid.initializer import Normal +from paddle.fluid.initializer import MSRA +from paddle.fluid.regularizer import L2Decay +from config import cfg +from models.ext_op.rrpn_lib import * + + +class RRPN(object): + def __init__(self, + add_conv_body_func=None, + add_roi_box_head_func=None, + mode='train', + use_pyreader=True, + use_random=True): + self.add_conv_body_func = add_conv_body_func + self.add_roi_box_head_func = add_roi_box_head_func + self.mode = mode + self.use_pyreader = use_pyreader + self.use_random = use_random + + def build_model(self, image_shape): + self.build_input(image_shape) + body_conv = self.add_conv_body_func(self.image) + # RPN + self.rpn_heads(body_conv) + # Fast RCNN + self.fast_rcnn_heads(body_conv) + if self.mode != 'train': + self.eval_bbox() + + def loss(self): + losses = [] + # Fast RCNN loss + loss_cls, loss_bbox = self.fast_rcnn_loss() + # RPN loss + rpn_cls_loss, rpn_reg_loss = self.rpn_loss() + losses = [loss_cls, loss_bbox, rpn_cls_loss, rpn_reg_loss] + rkeys = ['loss', 'loss_cls', 'loss_bbox', \ + 'loss_rpn_cls', 'loss_rpn_bbox',] + loss = fluid.layers.sum(losses) + rloss = [loss] + losses + return rloss, rkeys, self.rpn_rois + + def eval_bbox_out(self): + return self.pred_result + + def build_input(self, image_shape): + if self.use_pyreader: + in_shapes = [[-1] + image_shape, [-1, 5], [-1, 1], [-1, 1], + [-1, 3], [-1, 1]] + lod_levels = [0, 1, 1, 1, 0, 0] + dtypes = [ + 'float32', 'float32', 'int32', 'int32', 'float32', 'int64' + ] + self.py_reader = fluid.layers.py_reader( + capacity=64, + shapes=in_shapes, + lod_levels=lod_levels, + dtypes=dtypes, + use_double_buffer=True) + ins = fluid.layers.read_file(self.py_reader) + self.image = ins[0] + self.gt_box = ins[1] + self.gt_label = ins[2] + self.is_crowd = ins[3] + self.im_info = ins[4] + self.im_id = ins[5] + else: + self.image = fluid.layers.data( + name='image', shape=image_shape, dtype='float32') + self.gt_box = fluid.layers.data( + name='gt_box', shape=[4], dtype='float32', lod_level=1) + self.gt_label = fluid.layers.data( + name='gt_label', shape=[1], dtype='int32', lod_level=1) + self.is_crowd = fluid.layers.data( + name='is_crowd', shape=[1], dtype='int32', lod_level=1) + self.im_info = fluid.layers.data( + name='im_info', shape=[3], dtype='float32') + self.im_id = fluid.layers.data( + name='im_id', shape=[1], dtype='int64') + + self.difficult = fluid.layers.data( + name='difficult', shape=[1], dtype='float32', lod_level=1) + + def feeds(self): + if self.mode == 'infer': + return [self.image, self.im_info] + if self.mode == 'val': + return [ + self.image, self.gt_box, self.gt_label, self.is_crowd, + self.im_info, self.im_id, self.difficult + ] + return [ + self.image, self.gt_box, self.gt_label, self.is_crowd, self.im_info, + self.im_id + ] + + def eval_bbox(self): + self.im_scale = fluid.layers.slice( + self.im_info, [1], starts=[2], ends=[3]) + im_scale_lod = fluid.layers.sequence_expand(self.im_scale, + self.rpn_rois) + results = [] + boxes = self.rpn_rois + cls_prob = fluid.layers.softmax(self.cls_score, use_cudnn=False) + bbox_pred = fluid.layers.reshape(self.bbox_pred, (-1, cfg.class_num, 5)) + for i in range(cfg.class_num - 1): + bbox_pred_slice = fluid.layers.slice( + bbox_pred, axes=[1], starts=[i + 1], ends=[i + 2]) + bbox_pred_reshape = fluid.layers.reshape(bbox_pred_slice, (-1, 5)) + decoded_box = rrpn_box_coder(prior_box=boxes, \ + 
target_box=bbox_pred_reshape, \ + prior_box_var=cfg.bbox_reg_weights) + score_slice = fluid.layers.slice( + cls_prob, axes=[1], starts=[i + 1], ends=[i + 2]) + score_slice = fluid.layers.reshape(score_slice, shape=[-1, 1]) + box_positive = fluid.layers.reshape(decoded_box, shape=[-1, 8]) + box_reshape = fluid.layers.reshape(x=box_positive, shape=[1, -1, 8]) + score_reshape = fluid.layers.reshape( + x=score_slice, shape=[1, 1, -1]) + pred_result = fluid.layers.multiclass_nms( + bboxes=box_reshape, + scores=score_reshape, + score_threshold=cfg.TEST.score_thresh, + nms_top_k=-1, + nms_threshold=cfg.TEST.nms_thresh, + keep_top_k=cfg.TEST.detections_per_im, + normalized=False, + background_label=-1) + result_shape = fluid.layers.shape(pred_result) + res_dimension = fluid.layers.slice( + result_shape, axes=[0], starts=[1], ends=[2]) + res_dimension = fluid.layers.reshape(res_dimension, shape=[1, 1]) + dimension = fluid.layers.fill_constant( + shape=[1, 1], value=2, dtype='int32') + cond = fluid.layers.less_than(dimension, res_dimension) + res = fluid.layers.create_global_var( + shape=[1, 10], value=0.0, dtype='float32', persistable=False) + with fluid.layers.control_flow.Switch() as switch: + with switch.case(cond): + coordinate = fluid.layers.fill_constant( + shape=[9], value=0.0, dtype='float32') + pred_class = fluid.layers.fill_constant( + shape=[1], value=i + 1, dtype='float32') + add_class = fluid.layers.concat( + [pred_class, coordinate], axis=0) + normal_result = fluid.layers.elementwise_add(pred_result, + add_class) + fluid.layers.assign(normal_result, res) + with switch.default(): + normal_result = fluid.layers.fill_constant( + shape=[1, 10], value=-1.0, dtype='float32') + fluid.layers.assign(normal_result, res) + results.append(res) + if len(results) == 1: + self.pred_result = results[0] + return + outs = [] + out = fluid.layers.concat(results) + zero = fluid.layers.fill_constant( + shape=[1, 1], value=0.0, dtype='float32') + out_split, _ = fluid.layers.split(out, dim=1, num_or_sections=[1, 9]) + out_bool = fluid.layers.greater_than(out_split, zero) + idx = fluid.layers.where(out_bool) + idx_split, _ = fluid.layers.split(idx, dim=1, num_or_sections=[1, 1]) + idx = fluid.layers.reshape(idx_split, [-1, 1]) + self.pred_result = fluid.layers.gather(input=out, index=idx) + + def rpn_heads(self, rpn_input): + # RPN hidden representation + dim_out = rpn_input.shape[1] + rpn_conv = fluid.layers.conv2d( + input=rpn_input, + num_filters=dim_out, + filter_size=3, + stride=1, + padding=1, + act='relu', + name='conv_rpn', + param_attr=ParamAttr( + name="conv_rpn_w", initializer=Normal( + loc=0., scale=0.01)), + bias_attr=ParamAttr( + name="conv_rpn_b", learning_rate=2., regularizer=L2Decay(0.))) + self.anchor, self.var = rotated_anchor_generator( + input=rpn_conv, + anchor_sizes=cfg.anchor_sizes, + aspect_ratios=cfg.aspect_ratios, + angles=cfg.anchor_angle, + variance=cfg.variance, + stride=cfg.rpn_stride, + offset=0.5) + num_anchor = self.anchor.shape[2] + # Proposal classification scores + self.rpn_cls_score = fluid.layers.conv2d( + rpn_conv, + num_filters=num_anchor, + filter_size=1, + stride=1, + padding=0, + act=None, + name='rpn_cls_score', + param_attr=ParamAttr( + name="rpn_cls_logits_w", initializer=Normal( + loc=0., scale=0.01)), + bias_attr=ParamAttr( + name="rpn_cls_logits_b", + learning_rate=2., + regularizer=L2Decay(0.))) + # Proposal bbox regression deltas + self.rpn_bbox_pred = fluid.layers.conv2d( + rpn_conv, + num_filters=5 * num_anchor, + filter_size=1, + stride=1, + padding=0, + 
act=None, + name='rpn_bbox_pred', + param_attr=ParamAttr( + name="rpn_bbox_pred_w", initializer=Normal( + loc=0., scale=0.01)), + bias_attr=ParamAttr( + name="rpn_bbox_pred_b", + learning_rate=2., + regularizer=L2Decay(0.))) + rpn_cls_score_prob = fluid.layers.sigmoid( + self.rpn_cls_score, name='rpn_cls_score_prob') + + param_obj = cfg.TRAIN if self.mode == 'train' else cfg.TEST + pre_nms_top_n = param_obj.rpn_pre_nms_top_n + post_nms_top_n = param_obj.rpn_post_nms_top_n + nms_thresh = param_obj.rpn_nms_thresh + min_size = param_obj.rpn_min_size + self.rpn_rois, self.rpn_roi_probs = rotated_generate_proposals( + scores=rpn_cls_score_prob, + bbox_deltas=self.rpn_bbox_pred, + im_info=self.im_info, + anchors=self.anchor, + variances=self.var, + pre_nms_top_n=pre_nms_top_n, + post_nms_top_n=post_nms_top_n, + nms_thresh=param_obj.rpn_nms_thresh, + min_size=param_obj.rpn_min_size) + if self.mode == 'train': + outs = rotated_generate_proposal_labels( + rpn_rois=self.rpn_rois, + gt_classes=self.gt_label, + is_crowd=self.is_crowd, + gt_boxes=self.gt_box, + im_info=self.im_info, + batch_size_per_im=cfg.TRAIN.batch_size_per_im, + fg_fraction=cfg.TRAIN.fg_fractrion, + fg_thresh=cfg.TRAIN.fg_thresh, + bg_thresh_hi=cfg.TRAIN.bg_thresh_hi, + bg_thresh_lo=cfg.TRAIN.bg_thresh_lo, + bbox_reg_weights=cfg.bbox_reg_weights, + class_nums=cfg.class_num, + use_random=self.use_random) + + self.rois = outs[0] + self.labels_int32 = outs[1] + self.bbox_targets = outs[2] + self.bbox_inside_weights = outs[3] + self.bbox_outside_weights = outs[4] + + def fast_rcnn_heads(self, roi_input): + if self.mode == 'train': + pool_rois = self.rois + else: + pool_rois = self.rpn_rois + pool = rotated_roi_align( + input=roi_input, + rois=pool_rois, + pooled_height=cfg.roi_resolution, + pooled_width=cfg.roi_resolution, + spatial_scale=cfg.spatial_scale) + self.res5_2_sum = self.add_roi_box_head_func(pool) + rcnn_out = fluid.layers.pool2d( + self.res5_2_sum, pool_type='avg', pool_size=7, name='res5_pool') + self.cls_score = fluid.layers.fc(input=rcnn_out, + size=cfg.class_num, + act=None, + name='cls_score', + param_attr=ParamAttr( + name='cls_score_w', + initializer=Normal( + loc=0.0, scale=0.001)), + bias_attr=ParamAttr( + name='cls_score_b', + learning_rate=2., + regularizer=L2Decay(0.))) + self.bbox_pred = fluid.layers.fc(input=rcnn_out, + size=5 * cfg.class_num, + act=None, + name='bbox_pred', + param_attr=ParamAttr( + name='bbox_pred_w', + initializer=Normal( + loc=0.0, scale=0.01)), + bias_attr=ParamAttr( + name='bbox_pred_b', + learning_rate=2., + regularizer=L2Decay(0.))) + + def fast_rcnn_loss(self): + labels_int64 = fluid.layers.cast(x=self.labels_int32, dtype='int64') + labels_int64.stop_gradient = True + loss_cls = fluid.layers.softmax_with_cross_entropy( + logits=self.cls_score, + label=labels_int64, + numeric_stable_mode=True, ) + loss_cls = fluid.layers.reduce_mean(loss_cls) + loss_bbox = fluid.layers.smooth_l1( + x=self.bbox_pred, + y=self.bbox_targets, + inside_weight=self.bbox_inside_weights, + outside_weight=self.bbox_outside_weights, + sigma=1.0) + loss_bbox = fluid.layers.reduce_mean(loss_bbox) + return loss_cls, loss_bbox + + def rpn_loss(self): + rpn_cls_score_reshape = fluid.layers.transpose( + self.rpn_cls_score, perm=[0, 2, 3, 1]) + rpn_bbox_pred_reshape = fluid.layers.transpose( + self.rpn_bbox_pred, perm=[0, 2, 3, 1]) + + anchor_reshape = fluid.layers.reshape(self.anchor, shape=(-1, 5)) + var_reshape = fluid.layers.reshape(self.var, shape=(-1, 5)) + + rpn_cls_score_reshape = fluid.layers.reshape( + 
x=rpn_cls_score_reshape, shape=(0, -1, 1)) + rpn_bbox_pred_reshape = fluid.layers.reshape( + x=rpn_bbox_pred_reshape, shape=(0, -1, 5)) + score_pred, loc_pred, score_tgt, loc_tgt = \ + rrpn_target_assign( + bbox_pred=rpn_bbox_pred_reshape, + cls_logits=rpn_cls_score_reshape, + anchor_box=anchor_reshape, + gt_boxes=self.gt_box, + im_info=self.im_info, + rpn_batch_size_per_im=cfg.TRAIN.rpn_batch_size_per_im, + rpn_straddle_thresh=-1, + rpn_fg_fraction=cfg.TRAIN.rpn_fg_fraction, + rpn_positive_overlap=cfg.TRAIN.rpn_positive_overlap, + rpn_negative_overlap=cfg.TRAIN.rpn_negative_overlap, + use_random=self.use_random) + score_tgt = fluid.layers.cast(x=score_tgt, dtype='float32') + rpn_cls_loss = fluid.layers.sigmoid_cross_entropy_with_logits( + x=score_pred, label=score_tgt) + rpn_cls_loss = fluid.layers.reduce_mean( + rpn_cls_loss, name='loss_rpn_cls') + + rpn_reg_loss = fluid.layers.smooth_l1(x=loc_pred, y=loc_tgt, sigma=3.0) + rpn_reg_loss = fluid.layers.reduce_sum( + rpn_reg_loss, name='loss_rpn_bbox') + score_shape = fluid.layers.shape(score_tgt) + score_shape = fluid.layers.cast(x=score_shape, dtype='float32') + norm = fluid.layers.reduce_prod(score_shape) + norm.stop_gradient = True + rpn_reg_loss = rpn_reg_loss / norm + return rpn_cls_loss, rpn_reg_loss diff --git a/PaddleCV/rrpn/models/name_adapter.py b/PaddleCV/rrpn/models/name_adapter.py new file mode 100644 index 0000000000000000000000000000000000000000..aab88b55a2cf7d54545d731a800646ccaef2fe73 --- /dev/null +++ b/PaddleCV/rrpn/models/name_adapter.py @@ -0,0 +1,71 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
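+# Example of the renaming this adapter performs (derived from the methods
+# below): for a plain ResNet, fix_conv_norm_name("conv1") returns "bn_conv1"
+# and fix_conv_norm_name("res2a_branch2a") returns "bn2a_branch2a", matching
+# the parameter names of the ImageNet-pretrained checkpoints this repo loads.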
+ + +class NameAdapter(object): + """Fix the backbones variable names for pretrained weight""" + + def __init__(self, model): + super(NameAdapter, self).__init__() + self.model = model + + @property + def model_type(self): + return getattr(self.model, '_model_type', '') + + @property + def variant(self): + return getattr(self.model, 'variant', '') + + def fix_conv_norm_name(self, name): + if name == "conv1": + bn_name = "bn_" + name + else: + bn_name = "bn" + name[3:] + # the naming rule is same as pretrained weight + if self.model_type == 'SEResNeXt': + bn_name = name + "_bn" + return bn_name + + def fix_shortcut_name(self, name): + if self.model_type == 'SEResNeXt': + name = 'conv' + name + '_prj' + return name + + def fix_bottleneck_name(self, name): + if self.model_type == 'SEResNeXt': + conv_name1 = 'conv' + name + '_x1' + conv_name2 = 'conv' + name + '_x2' + conv_name3 = 'conv' + name + '_x3' + shortcut_name = name + else: + conv_name1 = name + "_branch2a" + conv_name2 = name + "_branch2b" + conv_name3 = name + "_branch2c" + shortcut_name = name + "_branch1" + return conv_name1, conv_name2, conv_name3, shortcut_name + + def fix_layer_warp_name(self, stage_num, count, i): + name = 'res' + str(stage_num) + if count > 10 and stage_num == 4: + if i == 0: + conv_name = name + "a" + else: + conv_name = name + "b" + str(i) + else: + conv_name = name + chr(ord("a") + i) + return conv_name + + def fix_c1_stage_name(self): + return "conv1" diff --git a/PaddleCV/rrpn/models/resnet.py b/PaddleCV/rrpn/models/resnet.py new file mode 100644 index 0000000000000000000000000000000000000000..d1505fbea3c4c149bf794254c4c546f2598c3ea8 --- /dev/null +++ b/PaddleCV/rrpn/models/resnet.py @@ -0,0 +1,358 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from collections import OrderedDict + +from paddle import fluid +from paddle.fluid.param_attr import ParamAttr +from paddle.fluid.framework import Variable +from paddle.fluid.regularizer import L2Decay +from paddle.fluid.initializer import Constant + +from numbers import Integral +from .name_adapter import NameAdapter + + +class ResNet(object): + """ + Residual Network, see https://arxiv.org/abs/1512.03385 + Args: + depth (int): ResNet depth, should be 18, 34, 50, 101, 152. 
+ freeze_at (int): freeze the backbone at which stage + norm_type (str): normalization type, 'bn'/'sync_bn'/'affine_channel' + freeze_norm (bool): freeze normalization layers + norm_decay (float): weight decay for normalization layer weights + variant (str): ResNet variant, supports 'a', 'b', 'c', 'd' currently + feature_maps (list): index of stages whose feature maps are returned + """ + __shared__ = ['norm_type', 'freeze_norm', 'weight_prefix_name'] + + def __init__(self, + depth=50, + freeze_at=2, + norm_type='affine_channel', + freeze_norm=True, + norm_decay=0., + variant='b', + feature_maps=4, + weight_prefix_name=''): + super(ResNet, self).__init__() + + if isinstance(feature_maps, Integral): + feature_maps = [feature_maps] + + assert depth in [18, 34, 50, 101, 152], \ + "depth {} not in [18, 34, 50, 101, 152]" + assert variant in ['a', 'b', 'c', 'd'], "invalid ResNet variant" + assert 0 <= freeze_at <= 4, "freeze_at should be 0, 1, 2, 3 or 4" + assert len(feature_maps) > 0, "need one or more feature maps" + assert norm_type in ['bn', 'sync_bn', 'affine_channel'] + + self.depth = depth + self.freeze_at = freeze_at + self.norm_type = norm_type + self.norm_decay = norm_decay + self.freeze_norm = freeze_norm + self.variant = variant + self._model_type = 'ResNet' + self.feature_maps = feature_maps + self.depth_cfg = { + 18: ([2, 2, 2, 2], self.basicblock), + 34: ([3, 4, 6, 3], self.basicblock), + 50: ([3, 4, 6, 3], self.bottleneck), + 101: ([3, 4, 23, 3], self.bottleneck), + 152: ([3, 8, 36, 3], self.bottleneck) + } + self.stage_filters = [64, 128, 256, 512] + self._c1_out_chan_num = 64 + self.na = NameAdapter(self) + self.prefix_name = weight_prefix_name + + def _conv_offset(self, + input, + filter_size, + stride, + padding, + act=None, + name=None): + out_channel = filter_size * filter_size * 3 + out = fluid.layers.conv2d( + input, + num_filters=out_channel, + filter_size=filter_size, + stride=stride, + padding=padding, + param_attr=ParamAttr( + initializer=Constant(0.0), name=name + ".w_0"), + bias_attr=ParamAttr( + initializer=Constant(0.0), name=name + ".b_0"), + act=act, + name=name) + return out + + def _conv_norm(self, + input, + num_filters, + filter_size, + stride=1, + groups=1, + act=None, + name=None): + _name = self.prefix_name + name if self.prefix_name != '' else name + conv = fluid.layers.conv2d( + input=input, + num_filters=num_filters, + filter_size=filter_size, + stride=stride, + padding=(filter_size - 1) // 2, + groups=groups, + act=None, + param_attr=ParamAttr(name=_name + "_weights"), + bias_attr=False, + name=_name + '.conv2d.output.1') + + bn_name = self.na.fix_conv_norm_name(name) + bn_name = self.prefix_name + bn_name if self.prefix_name != '' else bn_name + + norm_lr = 0. if self.freeze_norm else 1. 
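+        # freeze_norm gives the normalization scale/offset a zero learning
+        # rate, so they stay fixed during fine-tuning. With
+        # norm_type='affine_channel' the batch-norm statistics are assumed to
+        # have been folded into these per-channel parameters beforehand
+        # (scale = gamma / sqrt(var + eps), bias = beta - mean * scale); see
+        # checkpoint.load_and_fusebn used in train.py.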
+ norm_decay = self.norm_decay + pattr = ParamAttr( + name=bn_name + '_scale', + learning_rate=norm_lr, + regularizer=L2Decay(norm_decay)) + battr = ParamAttr( + name=bn_name + '_offset', + learning_rate=norm_lr, + regularizer=L2Decay(norm_decay)) + + if self.norm_type in ['bn', 'sync_bn']: + global_stats = True if self.freeze_norm else False + out = fluid.layers.batch_norm( + input=conv, + act=act, + name=bn_name + '.output.1', + param_attr=pattr, + bias_attr=battr, + moving_mean_name=bn_name + '_mean', + moving_variance_name=bn_name + '_variance', + use_global_stats=global_stats) + scale = fluid.framework._get_var(pattr.name) + bias = fluid.framework._get_var(battr.name) + elif self.norm_type == 'affine_channel': + scale = fluid.layers.create_parameter( + shape=[conv.shape[1]], + dtype=conv.dtype, + attr=pattr, + default_initializer=fluid.initializer.Constant(1.)) + bias = fluid.layers.create_parameter( + shape=[conv.shape[1]], + dtype=conv.dtype, + attr=battr, + default_initializer=fluid.initializer.Constant(0.)) + out = fluid.layers.affine_channel( + x=conv, scale=scale, bias=bias, act=act) + if self.freeze_norm: + scale.stop_gradient = True + bias.stop_gradient = True + return out + + def _shortcut(self, input, ch_out, stride, is_first, name): + max_pooling_in_short_cut = self.variant == 'd' + ch_in = input.shape[1] + # the naming rule is same as pretrained weight + name = self.na.fix_shortcut_name(name) + std_senet = getattr(self, 'std_senet', False) + if ch_in != ch_out or stride != 1 or (self.depth < 50 and is_first): + if std_senet: + if is_first: + return self._conv_norm(input, ch_out, 1, stride, name=name) + else: + return self._conv_norm(input, ch_out, 3, stride, name=name) + if max_pooling_in_short_cut and not is_first: + input = fluid.layers.pool2d( + input=input, + pool_size=2, + pool_stride=2, + pool_padding=0, + ceil_mode=True, + pool_type='avg') + return self._conv_norm(input, ch_out, 1, 1, name=name) + return self._conv_norm(input, ch_out, 1, stride, name=name) + else: + return input + + def bottleneck(self, input, num_filters, stride, is_first, name): + if self.variant == 'a': + stride1, stride2 = stride, 1 + else: + stride1, stride2 = 1, stride + + # ResNeXt + groups = getattr(self, 'groups', 1) + group_width = getattr(self, 'group_width', -1) + if groups == 1: + expand = 4 + elif (groups * group_width) == 256: + expand = 1 + else: # FIXME hard code for now, handles 32x4d, 64x4d and 32x8d + num_filters = num_filters // 2 + expand = 2 + + conv_name1, conv_name2, conv_name3, \ + shortcut_name = self.na.fix_bottleneck_name(name) + std_senet = getattr(self, 'std_senet', False) + conv_def = [[num_filters, 1, stride1, 'relu', 1, conv_name1], + [num_filters, 3, stride2, 'relu', groups, conv_name2], + [num_filters * expand, 1, 1, None, 1, conv_name3]] + + residual = input + for i, (c, k, s, act, g, _name) in enumerate(conv_def): + residual = self._conv_norm( + input=residual, + num_filters=c, + filter_size=k, + stride=s, + act=act, + groups=g, + name=_name) + short = self._shortcut( + input, + num_filters * expand, + stride, + is_first=is_first, + name=shortcut_name) + return fluid.layers.elementwise_add( + x=short, y=residual, act='relu', name=name + ".add.output.5") + + def basicblock(self, input, num_filters, stride, is_first, name): #, + conv0 = self._conv_norm( + input=input, + num_filters=num_filters, + filter_size=3, + act='relu', + stride=stride, + name=name + "_branch2a") + conv1 = self._conv_norm( + input=conv0, + num_filters=num_filters, + filter_size=3, + act=None, 
+ name=name + "_branch2b") + short = self._shortcut( + input, num_filters, stride, is_first, name=name + "_branch1") + return fluid.layers.elementwise_add(x=short, y=conv1, act='relu') + + def layer_warp(self, input, stage_num): + """ + Args: + input (Variable): input variable. + stage_num (int): the stage number, should be 2, 3, 4, 5 + Returns: + The last variable in endpoint-th stage. + """ + assert stage_num in [2, 3, 4, 5] + + stages, block_func = self.depth_cfg[self.depth] + count = stages[stage_num - 2] + + ch_out = self.stage_filters[stage_num - 2] + is_first = False if stage_num != 2 else True + # Make the layer name and parameter name consistent + # with ImageNet pre-trained model + conv = input + for i in range(count): + conv_name = self.na.fix_layer_warp_name(stage_num, count, i) + if self.depth < 50: + is_first = True if i == 0 and stage_num == 2 else False + conv = block_func( + input=conv, + num_filters=ch_out, + stride=2 if i == 0 and stage_num != 2 else 1, + is_first=is_first, + name=conv_name) + return conv + + def c1_stage(self, input): + out_chan = self._c1_out_chan_num + + conv1_name = self.na.fix_c1_stage_name() + + if self.variant in ['c', 'd']: + conv_def = [ + [out_chan // 2, 3, 2, "conv1_1"], + [out_chan // 2, 3, 1, "conv1_2"], + [out_chan, 3, 1, "conv1_3"], + ] + else: + conv_def = [[out_chan, 7, 2, conv1_name]] + + for (c, k, s, _name) in conv_def: + input = self._conv_norm( + input=input, + num_filters=c, + filter_size=k, + stride=s, + act='relu', + name=_name) + + output = fluid.layers.pool2d( + input=input, + pool_size=3, + pool_stride=2, + pool_padding=1, + pool_type='max') + return output + + def __call__(self, input): + assert isinstance(input, Variable) + assert not (set(self.feature_maps) - set([2, 3, 4, 5])), \ + "feature maps {} not in [2, 3, 4, 5]".format(self.feature_maps) + + res_endpoints = [] + + res = input + feature_maps = self.feature_maps + severed_head = getattr(self, 'severed_head', False) + if not severed_head: + res = self.c1_stage(res) + feature_maps = range(2, max(self.feature_maps) + 1) + + for i in feature_maps: + res = self.layer_warp(res, i) + if i in self.feature_maps: + res_endpoints.append(res) + if self.freeze_at >= i: + res.stop_gradient = True + return res + + +class ResNetC5(ResNet): + __doc__ = ResNet.__doc__ + + def __init__(self, + depth=50, + freeze_at=2, + norm_type='affine_channel', + freeze_norm=True, + norm_decay=0., + variant='b', + feature_maps=[5], + weight_prefix_name=''): + super(ResNetC5, self).__init__(depth, freeze_at, norm_type, freeze_norm, + norm_decay, variant, feature_maps) + self.severed_head = True diff --git a/PaddleCV/rrpn/pretrained/download.sh b/PaddleCV/rrpn/pretrained/download.sh new file mode 100755 index 0000000000000000000000000000000000000000..7999b199f0cd0b939ece51a5f7c097fa4c33e1fa --- /dev/null +++ b/PaddleCV/rrpn/pretrained/download.sh @@ -0,0 +1,5 @@ +# Download the data. +echo "Downloading..." +wget https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_cos_pretrained.tar --no-check-certificate +echo "Extracting..." +tar -xf ResNet50_cos_pretrained.tar diff --git a/PaddleCV/rrpn/reader.py b/PaddleCV/rrpn/reader.py new file mode 100755 index 0000000000000000000000000000000000000000..b54561b8a0fc0e79edb1c36c6ad46ce5f5d10276 --- /dev/null +++ b/PaddleCV/rrpn/reader.py @@ -0,0 +1,180 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import random +import numpy as np +import xml.etree.ElementTree +import os +import time +import copy +import six +import cv2 +import math +import paddle +from collections import deque + +import data_utils +from roidbs import ICDAR2015Dataset, ICDAR2017Dataset +from config import cfg +from PIL import Image +from data_utils import _resize +num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) +np.random.seed(10) + + +def roidb_reader(roidb, mode): + im, im_scales, gt_boxes, gt_classes = data_utils.get_image_blob(roidb, mode) + im_id = roidb['im_id'] + is_crowd = roidb['is_crowd'] + im_height = np.round(roidb['height'] * im_scales) + im_width = np.round(roidb['width'] * im_scales) + is_difficult = roidb['is_difficult'] + im_info = np.array([im_height, im_width, im_scales], dtype=np.float32) + if mode == 'val': + return im, gt_boxes, gt_classes, is_crowd, im_info, im_id, is_difficult + + outs = (im, gt_boxes, gt_classes, is_crowd, im_info, im_id) + + return outs + + +def RRPNData(mode, + batch_size=None, + total_batch_size=None, + padding_total=False, + shuffle=False, + shuffle_seed=None): #, + #roidbs=None): + total_batch_size = total_batch_size if total_batch_size else batch_size + assert total_batch_size % batch_size == 0 + if cfg.dataset == "icdar2015": + icdar2015_dataset = ICDAR2015Dataset(mode) + roidbs = icdar2015_dataset.get_roidb() + else: + icdar2017_dataset = ICDAR2017Dataset(mode) + roidbs = icdar2017_dataset.get_roidb() + + print("{} on {} with {} roidbs".format(mode, cfg.dataset, len(roidbs))) + + def reader(): + if mode == "train": + if shuffle: + if shuffle_seed is not None: + np.random.seed(shuffle_seed) + roidb_perm = deque(np.random.permutation(roidbs)) + else: + roidb_perm = deque(roidbs) + roidb_cur = 0 + count = 0 + batch_out = [] + device_num = total_batch_size / batch_size + while True: + start = time.time() + roidb = roidb_perm[0] + roidb_cur += 1 + roidb_perm.rotate(-1) + if roidb_cur >= len(roidbs): + if shuffle: + roidb_perm = deque(np.random.permutation(roidbs)) + else: + roidb_perm = deque(roidbs) + roidb_cur = 0 + # im, gt_boxes, gt_classes, is_crowd, im_info, im_id, gt_masks + datas = roidb_reader(roidb, mode) + if datas[1].shape[0] == 0: + continue + batch_out.append(datas) + end = time.time() + #print('reader time:', end - start) + if len(batch_out) == batch_size: + yield batch_out + count += 1 + batch_out = [] + iter_id = count // device_num + if iter_id >= cfg.max_iter * num_trainers: + return + elif mode == "val": + batch_out = [] + for roidb in roidbs: + im, gt_boxes, gt_classes, is_crowd, im_info, im_id, is_difficult = roidb_reader( + roidb, mode) + batch_out.append((im, gt_boxes, gt_classes, is_crowd, im_info, + im_id, is_difficult)) + if len(batch_out) == batch_size: + yield batch_out + batch_out = [] + if len(batch_out) != 0: + yield batch_out + + return reader + + +def train(batch_size, + total_batch_size=None, + padding_total=False, + num_workers=20, + shuffle=True, + shuffle_seed=None): + return RRPNData( + 'train', + batch_size, + total_batch_size, + padding_total, + shuffle=shuffle, + shuffle_seed=shuffle_seed) + + +def 
test(batch_size, total_batch_size=None, padding_total=False): + return RRPNData('val', batch_size, total_batch_size, shuffle=False) + + +def infer(file_path): + def reader(): + imgs = os.listdir(file_path) + imgs.sort() + for image in imgs: + if not os.path.exists(file_path): + raise ValueError("Image path [%s] does not exist." % + (file_path)) + with open(os.path.join(file_path, image), 'rb') as f: + data = f.read() + data = np.frombuffer(data, dtype='uint8') + img = cv2.imdecode(data, 1) + img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) + img, im_scale = _resize(img, target_size=1000, max_size=1778) + img = img.astype(np.float32, copy=False) + img = img / 255.0 + mean = np.array(cfg.pixel_means)[np.newaxis, np.newaxis, :] + std = np.array(cfg.pixel_std)[np.newaxis, np.newaxis, :] + img -= mean + img /= std + img = img.transpose((2, 0, 1)) + h = img.shape[1] + w = img.shape[2] + im_info = np.array([h, w, im_scale], dtype=np.float32) + yield [(img, im_info)] + + return reader + + +if __name__ == '__main__': + from utility import parse_args + args = parse_args() + train_reader = train(1, shuffle=True) + import time + time0 = time.time() + for iter_id, data in enumerate(train_reader()): + print('iter:', iter_id) + print('cost:', time.time() - time0) + time0 = time.time() diff --git a/PaddleCV/rrpn/roidbs.py b/PaddleCV/rrpn/roidbs.py new file mode 100755 index 0000000000000000000000000000000000000000..705244acac3d2c4868424cac27f2c1c4d5820a77 --- /dev/null +++ b/PaddleCV/rrpn/roidbs.py @@ -0,0 +1,364 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Based on: +# -------------------------------------------------------- +# Detectron +# Copyright (c) 2017-present, Facebook, Inc. +# Licensed under the Apache License, Version 2.0; +# Written by Ross Girshick +# -------------------------------------------------------- + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals + +import copy +import logging +import numpy as np +import os +import scipy.sparse +import random +import time +import matplotlib +import cv2 +#import segm_utils +from config import cfg +from data_utils import DatasetPath +logger = logging.getLogger(__name__) + + +class ICDAR2015Dataset(object): + """A class representing a ICDAR2015 dataset.""" + + def __init__(self, mode): + print('Creating: {}'.format(cfg.dataset)) + self.name = cfg.data_dir + self.mode = mode + data_path = DatasetPath(mode, self.name) + data_dir = data_path.get_data_dir() + file_list = data_path.get_file_list() + self.image_dir = data_dir + self.gt_dir = file_list + + def get_roidb(self): + """Return an roidb corresponding to the txt dataset. 
Optionally: + - include ground truth boxes in the roidb + """ + image_list = os.listdir(self.image_dir) + image_list.sort() + im_infos = [] + count = 0 + for image in image_list: + prefix = image[:-4] + if image.split('.')[-1] != 'jpg': + continue + img_name = os.path.join(self.image_dir, image) + gt_name = os.path.join(self.gt_dir, 'gt_' + prefix + '.txt') + easy_boxes = [] + hard_boxes = [] + boxes = [] + gt_obj = open(gt_name, 'r', encoding='UTF-8-sig') + gt_txt = gt_obj.read() + gt_split = gt_txt.split('\n') + img = cv2.imread(img_name) + f = False + for gt_line in gt_split: + gt_ind = gt_line.split(',') + + # can get the text information + if len(gt_ind) > 3 and '###' not in gt_ind[8]: + pt1 = (int(gt_ind[0]), int(gt_ind[1])) + pt2 = (int(gt_ind[2]), int(gt_ind[3])) + pt3 = (int(gt_ind[4]), int(gt_ind[5])) + pt4 = (int(gt_ind[6]), int(gt_ind[7])) + edge1 = np.sqrt((pt1[0] - pt2[0]) * (pt1[0] - pt2[0]) + ( + pt1[1] - pt2[1]) * (pt1[1] - pt2[1])) + edge2 = np.sqrt((pt2[0] - pt3[0]) * (pt2[0] - pt3[0]) + ( + pt2[1] - pt3[1]) * (pt2[1] - pt3[1])) + angle = 0 + if edge1 > edge2: + width = edge1 + height = edge2 + if pt1[0] - pt2[0] != 0: + angle = -np.arctan( + float(pt1[1] - pt2[1]) / + float(pt1[0] - pt2[0])) / np.pi * 180 + else: + angle = 90.0 + elif edge2 >= edge1: + width = edge2 + height = edge1 + # print pt2[0], pt3[0] + if pt2[0] - pt3[0] != 0: + angle = -np.arctan( + float(pt2[1] - pt3[1]) / + float(pt2[0] - pt3[0])) / np.pi * 180 + else: + angle = 90.0 + if angle < -45.0: + angle = angle + 180 + x_ctr = float(pt1[0] + pt3[ + 0]) / 2 # pt1[0] + np.abs(float(pt1[0] - pt3[0])) / 2 + y_ctr = float(pt1[1] + pt3[ + 1]) / 2 # pt1[1] + np.abs(float(pt1[1] - pt3[1])) / 2 + if self.mode == 'val': + easy_boxes.append( + list(np.array([pt1, pt2, pt3, pt4]).reshape(8))) + else: + easy_boxes.append([x_ctr, y_ctr, width, height, angle]) + # can‘t get the text information + if len(gt_ind) > 3 and '###' in gt_ind[8]: + pt1 = (int(gt_ind[0]), int(gt_ind[1])) + pt2 = (int(gt_ind[2]), int(gt_ind[3])) + pt3 = (int(gt_ind[4]), int(gt_ind[5])) + pt4 = (int(gt_ind[6]), int(gt_ind[7])) + edge1 = np.sqrt((pt1[0] - pt2[0]) * (pt1[0] - pt2[0]) + ( + pt1[1] - pt2[1]) * (pt1[1] - pt2[1])) + edge2 = np.sqrt((pt2[0] - pt3[0]) * (pt2[0] - pt3[0]) + ( + pt2[1] - pt3[1]) * (pt2[1] - pt3[1])) + angle = 0 + if edge1 > edge2: + width = edge1 + height = edge2 + if pt1[0] - pt2[0] != 0: + angle = -np.arctan( + float(pt1[1] - pt2[1]) / + float(pt1[0] - pt2[0])) / np.pi * 180 + else: + angle = 90.0 + elif edge2 >= edge1: + width = edge2 + height = edge1 + if pt2[0] - pt3[0] != 0: + angle = -np.arctan( + float(pt2[1] - pt3[1]) / + float(pt2[0] - pt3[0])) / np.pi * 180 + else: + angle = 90.0 + if angle < -45.0: + angle = angle + 180 + x_ctr = float(pt1[0] + pt3[ + 0]) / 2 # pt1[0] + np.abs(float(pt1[0] - pt3[0])) / 2 + y_ctr = float(pt1[1] + pt3[ + 1]) / 2 # pt1[1] + np.abs(float(pt1[1] - pt3[1])) / 2 + if self.mode == 'val': + hard_boxes.append( + list(np.array([pt1, pt2, pt3, pt4]).reshape(8))) + else: + hard_boxes.append([x_ctr, y_ctr, width, height, angle]) + + #print(easy_boxes) + if self.mode == 'train': + boxes.extend(easy_boxes) + # hard box only get 1/3 for train + boxes.extend(hard_boxes[0:int(len(hard_boxes) / 3)]) + is_difficult = [0] * len(easy_boxes) + is_difficult.extend([1] * int(len(hard_boxes) / 3)) + else: + boxes.extend(easy_boxes) + boxes.extend(hard_boxes) + is_difficult = [0] * len(easy_boxes) + is_difficult.extend([1] * int(len(hard_boxes))) + len_of_bboxes = len(boxes) + #is_difficult = [0] * 
len(easy_boxes) + #is_difficult.extend([1] * int(len(hard_boxes))) + is_difficult = np.array(is_difficult).reshape( + 1, len_of_bboxes).astype(np.int32) + if self.mode == 'train': + gt_boxes = np.zeros((len_of_bboxes, 5), dtype=np.int32) + else: + gt_boxes = np.zeros((len_of_bboxes, 8), dtype=np.int32) + gt_classes = np.zeros((len_of_bboxes), dtype=np.int32) + is_crowd = np.zeros((len_of_bboxes), dtype=np.int32) + for idx in range(len(boxes)): + if self.mode == 'train': + gt_boxes[idx, :] = [ + boxes[idx][0], boxes[idx][1], boxes[idx][2], + boxes[idx][3], boxes[idx][4] + ] + else: + gt_boxes[idx, :] = [ + boxes[idx][0], boxes[idx][1], boxes[idx][2], + boxes[idx][3], boxes[idx][4], boxes[idx][5], + boxes[idx][6], boxes[idx][7] + ] + gt_classes[idx] = 1 + if gt_boxes.shape[0] <= 0: + continue + gt_boxes = gt_boxes.astype(np.float64) + im_info = { + 'im_id': count, + 'gt_classes': gt_classes, + 'image': img_name, + 'boxes': gt_boxes, + 'height': img.shape[0], + 'width': img.shape[1], + 'is_crowd': is_crowd, + 'is_difficult': is_difficult + } + im_infos.append(im_info) + count += 1 + + return im_infos + + +class ICDAR2017Dataset(object): + """A class representing a ICDAR2017 dataset.""" + + def __init__(self, mode): + print('Creating: {}'.format(cfg.dataset)) + self.name = cfg.data_dir + #print('**************', self.name) + self.mode = mode + data_path = DatasetPath(mode, self.name) + data_dir = data_path.get_data_dir() + #print("&**************", data_dir) + file_list = data_path.get_file_list() + self.image_dir = data_dir + self.gt_dir = file_list + + def get_roidb(self): + """Return an roidb corresponding to the json dataset. Optionally: + - include ground truth boxes in the roidb + """ + image_list = os.listdir(self.image_dir) + image_list.sort() + im_infos = [] + count = 0 + class_idx = 1 + class_name = {} + post_fix = ['jpg', 'bmp', 'png'] + if self.mode == 'val': + labels_map = get_labels_maps() + for image in image_list: + prefix = image[:-4] + #print(image) + + if image.split('.')[-1] not in post_fix: + continue + img_name = os.path.join(self.image_dir, image) + gt_name = os.path.join(self.gt_dir, 'gt_' + prefix + '.txt') + gt_classes = [] + #boxes = [] + #hard_boxes = [] + boxes = [] + gt_obj = open(gt_name, 'r', encoding='UTF-8-sig') + gt_txt = gt_obj.read() + gt_split = gt_txt.split('\n') + img = cv2.imread(img_name) + f = False + for gt_line in gt_split: + gt_ind = gt_line.split(',') + # can get the text information + if len(gt_ind) > 3: + if self.mode == 'val': + gt_classes.append(labels_map[gt_ind[-1]]) + else: + if gt_ind[-1] not in class_name: + class_name[gt_ind[-1]] = class_idx + #gt_classes.append(class_idx) + class_idx += 1 + gt_classes.append(class_name[gt_ind[-1]]) + pt1 = (int(gt_ind[0]), int(gt_ind[1])) + pt2 = (int(gt_ind[2]), int(gt_ind[3])) + pt3 = (int(gt_ind[4]), int(gt_ind[5])) + pt4 = (int(gt_ind[6]), int(gt_ind[7])) + edge1 = np.sqrt((pt1[0] - pt2[0]) * (pt1[0] - pt2[0]) + ( + pt1[1] - pt2[1]) * (pt1[1] - pt2[1])) + edge2 = np.sqrt((pt2[0] - pt3[0]) * (pt2[0] - pt3[0]) + ( + pt2[1] - pt3[1]) * (pt2[1] - pt3[1])) + angle = 0 + if edge1 > edge2: + width = edge1 + height = edge2 + if pt1[0] - pt2[0] != 0: + angle = -np.arctan( + float(pt1[1] - pt2[1]) / + float(pt1[0] - pt2[0])) / np.pi * 180 + else: + angle = 90.0 + elif edge2 >= edge1: + width = edge2 + height = edge1 + # print pt2[0], pt3[0] + if pt2[0] - pt3[0] != 0: + angle = -np.arctan( + float(pt2[1] - pt3[1]) / + float(pt2[0] - pt3[0])) / np.pi * 180 + else: + angle = 90.0 + if angle < -45.0: + angle = 
angle + 180 + x_ctr = float(pt1[0] + pt3[ + 0]) / 2 # pt1[0] + np.abs(float(pt1[0] - pt3[0])) / 2 + y_ctr = float(pt1[1] + pt3[ + 1]) / 2 # pt1[1] + np.abs(float(pt1[1] - pt3[1])) / 2 + if self.mode == 'val': + boxes.append( + list(np.array([pt1, pt2, pt3, pt4]).reshape(8))) + else: + boxes.append([x_ctr, y_ctr, width, height, angle]) + len_of_bboxes = len(boxes) + #print(len_of_bboxes) + is_difficult = np.zeros((len_of_bboxes, 1), dtype=np.int32) + if self.mode == 'train': + gt_boxes = np.zeros((len_of_bboxes, 5), dtype=np.int32) + else: + gt_boxes = np.zeros((len_of_bboxes, 8), dtype=np.int32) + gt_classes = np.array(gt_classes).reshape(len_of_bboxes, 1) + is_crowd = np.zeros((len_of_bboxes), dtype=np.int32) + for idx in range(len(boxes)): + if self.mode == 'train': + gt_boxes[idx, :] = [ + boxes[idx][0], boxes[idx][1], boxes[idx][2], + boxes[idx][3], boxes[idx][4] + ] + else: + gt_boxes[idx, :] = [ + boxes[idx][0], boxes[idx][1], boxes[idx][2], + boxes[idx][3], boxes[idx][4], boxes[idx][5], + boxes[idx][6], boxes[idx][7] + ] + #gt_classes[idx] = 1 + if gt_boxes.shape[0] <= 0: + continue + gt_boxes = gt_boxes.astype(np.float64) + im_info = { + 'im_id': count, + 'gt_classes': gt_classes, + 'image': img_name, + 'boxes': gt_boxes, + 'height': img.shape[0], + 'width': img.shape[1], + 'is_crowd': is_crowd, + 'is_difficult': is_difficult + } + im_infos.append(im_info) + count += 1 + if self.mode == 'train': + with open(os.path.join(cfg.data_dir, 'label_list'), 'w') as g: + for k in class_name: + g.write(k + "\n") + return im_infos + + +def get_labels_maps(): + labels_map = {} + with open(os.path.join(cfg.data_dir, 'label_list')) as f: + lines = f.readlines() + for idx, line in enumerate(lines): + labels_map[line.strip()] = idx + 1 + return labels_map diff --git a/PaddleCV/rrpn/train.py b/PaddleCV/rrpn/train.py new file mode 100755 index 0000000000000000000000000000000000000000..11dafa990d074ab8035feaaf3d966a5a883d6ac5 --- /dev/null +++ b/PaddleCV/rrpn/train.py @@ -0,0 +1,224 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
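+# Note on the learning-rate schedule assembled in train() below: a piecewise
+# constant decay (learning_rate * lr_gamma**i between the cfg.lr_steps
+# boundaries) is wrapped in a linear warmup that ramps the rate from
+# learning_rate * start_factor up to learning_rate over the first
+# cfg.warm_up_iter iterations. With purely illustrative values
+# (learning_rate=0.01, lr_steps=[120000, 160000], lr_gamma=0.1,
+# start_factor=1./3, warm_up_iter=500) the rate climbs from ~0.0033 to 0.01
+# over the first 500 iterations, holds at 0.01 until iteration 120000, drops
+# to 0.001, then to 0.0001 after iteration 160000.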
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import os + + +def set_paddle_flags(flags): + for key, value in flags.items(): + if os.environ.get(key, None) is None: + os.environ[key] = str(value) + + +set_paddle_flags({ + 'FLAGS_conv_workspace_size_limit': 500, + 'FLAGS_eager_delete_tensor_gb': 0, # enable gc + 'FLAGS_memory_fraction_of_eager_deletion': 1, + 'FLAGS_fraction_of_gpu_memory_to_use': 0.98 +}) + +import sys +import numpy as np +import time +import shutil +import collections +import paddle +import paddle.fluid as fluid +import reader +import models.model_builder as model_builder +import models.resnet as resnet +import checkpoint as checkpoint +from config import cfg +from utility import parse_args, print_arguments, SmoothedValue, TrainingStats, now_time, check_gpu +num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) + + +def get_device_num(): + # NOTE(zcd): for multi-processe training, each process use one GPU card. + if num_trainers > 1: + return 1 + return fluid.core.get_cuda_device_count() + + +def train(): + learning_rate = cfg.learning_rate + image_shape = [3, cfg.TRAIN.max_size, cfg.TRAIN.max_size] + + devices_num = get_device_num() + total_batch_size = devices_num * cfg.TRAIN.im_per_batch + + use_random = True + startup_prog = fluid.Program() + train_prog = fluid.Program() + with fluid.program_guard(train_prog, startup_prog): + with fluid.unique_name.guard(): + model = model_builder.RRPN( + add_conv_body_func=resnet.ResNet(), + add_roi_box_head_func=resnet.ResNetC5(), + use_pyreader=cfg.use_pyreader, + use_random=use_random) + model.build_model(image_shape) + losses, keys, rpn_rois = model.loss() + loss = losses[0] + fetch_list = losses + + boundaries = cfg.lr_steps + gamma = cfg.lr_gamma + step_num = len(cfg.lr_steps) + values = [learning_rate * (gamma**i) for i in range(step_num + 1)] + start_lr = learning_rate * cfg.start_factor + lr = fluid.layers.piecewise_decay(boundaries, values) + lr = fluid.layers.linear_lr_warmup(lr, cfg.warm_up_iter, start_lr, + learning_rate) + optimizer = fluid.optimizer.Momentum( + learning_rate=lr, + regularization=fluid.regularizer.L2Decay(cfg.weight_decay), + momentum=cfg.momentum) + optimizer.minimize(loss) + fetch_list = fetch_list + [lr] + + for var in fetch_list: + var.persistable = True + gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0)) + place = fluid.CUDAPlace(gpu_id) if cfg.use_gpu else fluid.CPUPlace() + exe = fluid.Executor(place) + + build_strategy = fluid.BuildStrategy() + build_strategy.fuse_all_optimizer_ops = False + build_strategy.fuse_elewise_add_act_ops = True + exec_strategy = fluid.ExecutionStrategy() + exec_strategy.num_iteration_per_drop_scope = 1 + exe.run(startup_prog) + + if cfg.pretrained_model: + checkpoint.load_and_fusebn(exe, train_prog, cfg.pretrained_model) + compiled_train_prog = fluid.CompiledProgram(train_prog).with_data_parallel( + loss_name=loss.name, + build_strategy=build_strategy, + exec_strategy=exec_strategy) + + shuffle = True + shuffle_seed = None + if num_trainers > 1: + shuffle_seed = 1 + if cfg.use_pyreader: + train_reader = reader.train( + batch_size=cfg.TRAIN.im_per_batch, + total_batch_size=total_batch_size, + padding_total=cfg.TRAIN.padding_minibatch, + shuffle=shuffle, + shuffle_seed=shuffle_seed) + if num_trainers > 1: + assert shuffle_seed is not None, \ + "If num_trainers > 1, the shuffle_seed must be set, because " \ + "the order of batch data generated by reader " \ + "must be the same in the 
respective processes." + # NOTE: the order of batch data generated by batch_reader + # must be the same in the respective processes. + if num_trainers > 1: + train_reader = fluid.contrib.reader.distributed_batch_reader( + train_reader) + py_reader = model.py_reader + py_reader.decorate_paddle_reader(train_reader) + else: + if num_trainers > 1: shuffle = False + train_reader = reader.train( + batch_size=total_batch_size, shuffle=shuffle) + feeder = fluid.DataFeeder(place=place, feed_list=model.feeds()) + + def train_loop_pyreader(): + py_reader.start() + train_stats = TrainingStats(cfg.log_window, keys) + try: + start_time = time.time() + prev_start_time = start_time + for iter_id in range(cfg.max_iter): + prev_start_time = start_time + start_time = time.time() + outs = exe.run(compiled_train_prog, + fetch_list=[v.name for v in fetch_list]) + stats = {k: np.array(v).mean() for k, v in zip(keys, outs[:-1])} + train_stats.update(stats) + logs = train_stats.log() + if iter_id % 10 == 0: + strs = '{}, iter: {}, lr: {:.5f}, {}, time: {:.3f}'.format( + now_time(), iter_id, + np.mean(outs[-1]), logs, start_time - prev_start_time) + print(strs) + sys.stdout.flush() + if (iter_id) % cfg.TRAIN.snapshot_iter == 0 and iter_id != 0: + save_name = "{}".format(iter_id) + checkpoint.save(exe, train_prog, + os.path.join(cfg.model_save_dir, save_name)) + if (iter_id) == cfg.max_iter: + checkpoint.save( + exe, train_prog, + os.path.join(cfg.model_save_dir, "model_final")) + break + end_time = time.time() + total_time = end_time - start_time + last_loss = np.array(outs[0]).mean() + except (StopIteration, fluid.core.EOFException): + py_reader.reset() + + def train_loop(): + start_time = time.time() + prev_start_time = start_time + start = start_time + train_stats = TrainingStats(cfg.log_window, keys) + for iter_id, data in enumerate(train_reader()): + prev_start_time = start_time + start_time = time.time() + if data[0][1].shape[0] == 0: + continue + + outs = exe.run(compiled_train_prog, + fetch_list=[v.name for v in fetch_list], + feed=feeder.feed(data)) + stats = {k: np.array(v).mean() for k, v in zip(keys, outs[:-1])} + train_stats.update(stats) + logs = train_stats.log() + if iter_id % 10 == 0: + strs = '{}, iter: {}, lr: {:.5f}, {}, time: {:.3f}'.format( + now_time(), iter_id, + np.mean(outs[-1]), logs, start_time - prev_start_time) + print(strs) + sys.stdout.flush() + if (iter_id + 1) % cfg.TRAIN.snapshot_iter == 0 and iter_id != 0: + save_name = "{}".format(iter_id + 1) + checkpoint.save(exe, train_prog, + os.path.join(cfg.model_save_dir, save_name)) + if (iter_id + 1) == cfg.max_iter: + checkpoint.save(exe, train_prog, + os.path.join(cfg.model_save_dir, "model_final")) + break + + end_time = time.time() + total_time = end_time - start_time + last_loss = np.array(outs[0]).mean() + + if cfg.use_pyreader: + train_loop_pyreader() + else: + train_loop() + + +if __name__ == '__main__': + args = parse_args() + print_arguments(args) + check_gpu(args.use_gpu) + train() diff --git a/PaddleCV/rrpn/utility.py b/PaddleCV/rrpn/utility.py new file mode 100755 index 0000000000000000000000000000000000000000..d737d3e78146f7ca2cacaa2edc443b4ae654b3fd --- /dev/null +++ b/PaddleCV/rrpn/utility.py @@ -0,0 +1,188 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. 
+#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +""" +Contains common utility functions. +""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import sys +import paddle.fluid as fluid +import distutils.util +import numpy as np +import six +import argparse +import functools +import collections +import datetime +from collections import deque +from paddle.fluid import core +from collections import deque +from config import * + + +def print_arguments(args): + """Print argparse's arguments. + + Usage: + + .. code-block:: python + + parser = argparse.ArgumentParser() + parser.add_argument("name", default="Jonh", type=str, help="User name.") + args = parser.parse_args() + print_arguments(args) + + :param args: Input argparse.Namespace for printing. + :type args: argparse.Namespace + """ + print("----------- Configuration Arguments -----------") + for arg, value in sorted(six.iteritems(vars(args))): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def add_arguments(argname, type, default, help, argparser, **kwargs): + """Add argparse's argument. + + Usage: + + .. code-block:: python + + parser = argparse.ArgumentParser() + add_argument("name", str, "Jonh", "User name.", parser) + args = parser.parse_args() + """ + type = distutils.util.strtobool if type == bool else type + argparser.add_argument( + "--" + argname, + default=default, + type=type, + help=help + ' Default: %(default)s.', + **kwargs) + + +class SmoothedValue(object): + """Track a series of values and provide access to smoothed values over a + window or the global series average. 
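+
+    Usage sketch (editor note): construct with a window size, push values with
+    add_value(), and read the windowed median back with get_median_value().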
+ """ + + def __init__(self, window_size): + self.deque = deque(maxlen=window_size) + + def add_value(self, value): + self.deque.append(value) + + def get_median_value(self): + return np.median(self.deque) + + +def now_time(): + return datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f') + + +class TrainingStats(object): + def __init__(self, window_size, stats_keys): + self.smoothed_losses_and_metrics = { + key: SmoothedValue(window_size) + for key in stats_keys + } + + def update(self, stats): + for k, v in self.smoothed_losses_and_metrics.items(): + v.add_value(stats[k]) + + def get(self, extras=None): + stats = collections.OrderedDict() + if extras: + for k, v in extras.items(): + stats[k] = v + for k, v in self.smoothed_losses_and_metrics.items(): + stats[k] = round(v.get_median_value(), 3) + + return stats + + def log(self, extras=None): + d = self.get(extras) + strs = ', '.join(str(dict({x: y})).strip('{}') for x, y in d.items()) + return strs + + +def parse_args(): + """return all args + """ + parser = argparse.ArgumentParser(description=__doc__) + add_arg = functools.partial(add_arguments, argparser=parser) + # yapf: disable + # ENV + add_arg('use_gpu', bool, True, "Whether use GPU.") + add_arg('model_save_dir', str, 'output', "The path to save model.") + add_arg('pretrained_model', str, 'ResNet50_cos_pretrained', "The init model path.") + add_arg('dataset', str, 'icdar2015', "icdar2015, icdar2017.") + add_arg('class_num', int, 2, "Class number.") + add_arg('data_dir', str, 'dataset/icdar2015', "The data root path.") + add_arg('use_pyreader', bool, False, "Use pyreader.") + add_arg('use_profile', bool, False, "Whether use profiler.") + add_arg('padding_minibatch',bool, False, + "If False, only resize image and not pad, image shape is different between" + " GPUs in one mini-batch. If True, image shape is the same in one mini-batch.") + #SOLVER + add_arg('learning_rate', float, 0.02, "Learning rate.") + add_arg('max_iter', int, 17500, "Iter number.") + add_arg('log_window', int, 20, "Log smooth window, set 1 for debug, set 20 for train.") + # RCNN + # RPN + add_arg('anchor_sizes', int, [128, 256, 512], "The size of anchors.") + add_arg('aspect_ratios', float, [0.2, 0.5,1.0], "The ratio of anchors.") + add_arg('anchor_angle', float, [-30.0, 0.0, 30.0, 60.0, 90.0, 120.0], "The angles of anchors.") + add_arg('variance', float, [1.0, 1.0, 1.0, 1.0, 1.0], "The variance of anchors.") + add_arg('rpn_stride', float, [16.,16.], "Stride of the feature map that RPN is attached.") + add_arg('rpn_nms_thresh', float, 0.7, "NMS threshold used on RPN proposals") + # TRAIN VAL INFER + add_arg('im_per_batch', int, 1, "Minibatch size.") + add_arg('pixel_means', float, [0.485, 0.456, 0.406], "pixel mean") + add_arg('nms_thresh', float, 0.3, "NMS threshold.") + add_arg('score_thresh', float, 0.01, "score threshold for NMS.") + add_arg('snapshot_stride', int, 1000, "save model every snapshot stride.") + # SINGLE EVAL AND DRAW + add_arg('draw_threshold', float, 0.8, "Confidence threshold to draw bbox.") + add_arg('image_path', str, 'ICDAR2015/tmp/', "The image path used to inference and visualize.") + # yapf: enable + args = parser.parse_args() + file_name = sys.argv[0] + if 'train' in file_name or 'profile' in file_name: + merge_cfg_from_args(args, 'train') + else: + merge_cfg_from_args(args, 'val') + return args + + +def check_gpu(use_gpu): + """ + Log error and exit when set use_gpu=true in paddlepaddle + cpu version. 
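+    Called as check_gpu(args.use_gpu) from the training entry point (see
+    train.py); it only exits when use_gpu is true on a CPU-only build.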
+ """ + err = "Config use_gpu cannot be set as true while you are " \ + "using paddlepaddle cpu version ! \nPlease try: \n" \ + "\t1. Install paddlepaddle-gpu to run model on GPU \n" \ + "\t2. Set use_gpu as false in config file to run " \ + "model on CPU" + + try: + if use_gpu and not fluid.is_compiled_with_cuda(): + logger.error(err) + sys.exit(1) + except Exception as e: + pass diff --git a/PaddleCV/yolov3/train.py b/PaddleCV/yolov3/train.py index 5f2284cf0c264e261c1cd6cab1a675c59b1981a7..6dab2c80f5021f646b63ec55b242ab255670608e 100644 --- a/PaddleCV/yolov3/train.py +++ b/PaddleCV/yolov3/train.py @@ -41,6 +41,7 @@ from utility import (parse_args, print_arguments, import paddle import paddle.fluid as fluid +from paddle.fluid import profiler import reader from models.yolov3 import YOLOv3 from learning_rate import exponential_with_warmup_decay @@ -76,8 +77,8 @@ def train(): loss = model.loss() loss.persistable = True - devices_num = get_device_num() - print("Found {} CUDA devices.".format(devices_num)) + devices_num = get_device_num() if cfg.use_gpu else 1 + print("Found {} CUDA/CPU devices.".format(devices_num)) learning_rate = cfg.learning_rate boundaries = cfg.lr_steps @@ -186,6 +187,13 @@ def train(): iter_id, lr[0], smoothed_loss.get_mean_value(), start_time - prev_start_time)) sys.stdout.flush() + #add profiler tools + if args.is_profiler and iter_id == 5: + profiler.start_profiler("All") + elif args.is_profiler and iter_id == 10: + profiler.stop_profiler("total", args.profiler_path) + return + if (iter_id + 1) % cfg.snapshot_iter == 0: save_model("model_iter{}".format(iter_id)) print("Snapshot {} saved, average loss: {}, \ diff --git a/PaddleCV/yolov3/utility.py b/PaddleCV/yolov3/utility.py index 3e0f308289397143abdef77ed6c42f483d546bf3..9d442f4ee79b29691b84ec95b78ccfe76c4c55a0 100644 --- a/PaddleCV/yolov3/utility.py +++ b/PaddleCV/yolov3/utility.py @@ -146,6 +146,9 @@ def parse_args(): add_arg('draw_thresh', float, 0.5, "Confidence score threshold to draw prediction box in image in debug mode") add_arg('enable_ce', bool, False, "If set True, enable continuous evaluation job.") + # args for profiler tools + add_arg('is_profiler', int, 0, "the switch of profiler") + add_arg('profiler_path', str, './', "the path to save profiler output files") # yapf: enable args = parser.parse_args() file_name = sys.argv[0] diff --git a/PaddleNLP/PaddleDialogue/auto_dialogue_evaluation/ade/reader.py b/PaddleNLP/PaddleDialogue/auto_dialogue_evaluation/ade/reader.py index d3e2f952e3f28edb0375e29e750f7d25dcceda84..94850a356772bc47f90f72d762a29abdf1a203d3 100755 --- a/PaddleNLP/PaddleDialogue/auto_dialogue_evaluation/ade/reader.py +++ b/PaddleNLP/PaddleDialogue/auto_dialogue_evaluation/ade/reader.py @@ -19,45 +19,42 @@ import sys import time import random import numpy as np +import os import paddle import paddle.fluid as fluid -class DataProcessor(object): - def __init__(self, data_path, max_seq_length, batch_size): +class DataProcessor(object): + def __init__(self, data_path, max_seq_length, batch_size): """init""" self.data_file = data_path self.max_seq_len = max_seq_length self.batch_size = batch_size self.num_examples = {'train': -1, 'dev': -1, 'test': -1} - def get_examples(self): + def get_examples(self): """load examples""" examples = [] index = 0 fr = io.open(self.data_file, 'r', encoding="utf8") - for line in fr: - if index !=0 and index % 100 == 0: + for line in fr: + if index != 0 and index % 100 == 0: print("processing data: %d" % index) index += 1 examples.append(line.strip()) return 
examples - def get_num_examples(self, phase): + def get_num_examples(self, phase): """Get number of examples for train, dev or test.""" - if phase not in ['train', 'dev', 'test']: + if phase not in ['train', 'dev', 'test']: raise ValueError( "Unknown phase, which should be in ['train', 'dev', 'test'].") count = len(io.open(self.data_file, 'r', encoding="utf8").readlines()) self.num_examples[phase] = count return self.num_examples[phase] - def data_generator(self, - place, - phase="train", - shuffle=True, - sample_pro=1): + def data_generator(self, place, phase="train", shuffle=True, sample_pro=1): """ Generate data for train, dev or test. @@ -67,25 +64,34 @@ class DataProcessor(object): sample_pro: sample data ratio """ examples = self.get_examples() - if shuffle: + + # used for ce + if 'ce_mode' in os.environ: + np.random.seed(0) + random.seed(0) + shuffle = False + + if shuffle: np.random.shuffle(examples) - - def batch_reader(): + + def batch_reader(): """read batch data""" batch = [] - for example in examples: + for example in examples: if sample_pro < 1: if random.random() > sample_pro: continue tokens = example.strip().split('\t') - - if len(tokens) != 3: + + if len(tokens) != 3: print("data format error: %s" % example.strip()) print("please input data: context \t response \t label") continue - context = [int(x) for x in tokens[0].split()[: self.max_seq_len]] - response = [int(x) for x in tokens[1].split()[: self.max_seq_len]] + context = [int(x) for x in tokens[0].split()[:self.max_seq_len]] + response = [ + int(x) for x in tokens[1].split()[:self.max_seq_len] + ] label = [int(tokens[2])] instance = (context, response, label) @@ -96,15 +102,15 @@ class DataProcessor(object): yield batch batch = [instance] - if len(batch) > 0: + if len(batch) > 0: yield batch - def create_lodtensor(data_ids, place): + def create_lodtensor(data_ids, place): """create LodTensor for input ids""" cur_len = 0 lod = [cur_len] seq_lens = [len(ids) for ids in data_ids] - for l in seq_lens: + for l in seq_lens: cur_len += l lod.append(cur_len) flattened_data = np.concatenate(data_ids, axis=0).astype("int64") @@ -114,9 +120,9 @@ class DataProcessor(object): res.set_lod([lod]) return res - def wrapper(): - """yield batch data to network""" - for batch_data in batch_reader(): + def wrapper(): + """yield batch data to network""" + for batch_data in batch_reader(): context_ids = [batch[0] for batch in batch_data] response_ids = [batch[1] for batch in batch_data] label_ids = [batch[2] for batch in batch_data] @@ -125,6 +131,5 @@ class DataProcessor(object): label_ids = np.array(label_ids).astype("int64").reshape([-1, 1]) input_batch = [context_res, response_res, label_ids] yield input_batch - - return wrapper + return wrapper diff --git a/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/evaluation.py b/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/evaluation.py index 43d3fe636eadea3b1b8da6c7c2e082ea7e1e246b..d8c06944e20004547ae5516b9ce70eb338ecc68b 100644 --- a/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/evaluation.py +++ b/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/evaluation.py @@ -22,26 +22,27 @@ class EvalDA(object): """ evaluate da testset, swda|mrda """ - def __init__(self, task_name, pred, refer): + + def __init__(self, task_name, pred, refer): """ predict file """ self.pred_file = pred self.refer_file = refer - def load_data(self): + def load_data(self): """ load reference label and predict label """ pred_label = [] refer_label = [] fr = 
io.open(self.refer_file, 'r', encoding="utf8") - for line in fr: + for line in fr: label = line.rstrip('\n').split('\t')[1] refer_label.append(int(label)) idx = 0 fr = io.open(self.pred_file, 'r', encoding="utf8") - for line in fr: + for line in fr: elems = line.rstrip('\n').split('\t') if len(elems) != 2 or not elems[0].isdigit(): continue @@ -49,15 +50,15 @@ class EvalDA(object): pred_label.append(tag_id) return pred_label, refer_label - def evaluate(self): + def evaluate(self): """ calculate acc metrics """ pred_label, refer_label = self.load_data() common_num = 0 total_num = len(pred_label) - for i in range(total_num): - if pred_label[i] == refer_label[i]: + for i in range(total_num): + if pred_label[i] == refer_label[i]: common_num += 1 acc = float(common_num) / total_num return acc @@ -67,26 +68,27 @@ class EvalATISIntent(object): """ evaluate da testset, swda|mrda """ - def __init__(self, pred, refer): + + def __init__(self, pred, refer): """ predict file """ self.pred_file = pred self.refer_file = refer - def load_data(self): + def load_data(self): """ load reference label and predict label """ pred_label = [] refer_label = [] fr = io.open(self.refer_file, 'r', encoding="utf8") - for line in fr: + for line in fr: label = line.rstrip('\n').split('\t')[0] refer_label.append(int(label)) idx = 0 fr = io.open(self.pred_file, 'r', encoding="utf8") - for line in fr: + for line in fr: elems = line.rstrip('\n').split('\t') if len(elems) != 2 or not elems[0].isdigit(): continue @@ -94,45 +96,46 @@ class EvalATISIntent(object): pred_label.append(tag_id) return pred_label, refer_label - def evaluate(self): + def evaluate(self): """ calculate acc metrics """ pred_label, refer_label = self.load_data() common_num = 0 total_num = len(pred_label) - for i in range(total_num): - if pred_label[i] == refer_label[i]: + for i in range(total_num): + if pred_label[i] == refer_label[i]: common_num += 1 acc = float(common_num) / total_num return acc -class EvalATISSlot(object): +class EvalATISSlot(object): """ evaluate atis slot """ - def __init__(self, pred, refer): + + def __init__(self, pred, refer): """ pred file """ self.pred_file = pred self.refer_file = refer - def load_data(self): + def load_data(self): """ load reference label and predict label """ pred_label = [] refer_label = [] fr = io.open(self.refer_file, 'r', encoding="utf8") - for line in fr: + for line in fr: labels = line.rstrip('\n').split('\t')[1].split() labels = [int(l) for l in labels] refer_label.append(labels) fr = io.open(self.pred_file, 'r', encoding="utf8") - for line in fr: - if len(line.split('\t')) != 2 or not line[0].isdigit(): + for line in fr: + if len(line.split('\t')) != 2 or not line[0].isdigit(): continue labels = line.rstrip('\n').split('\t')[1].split()[1:] labels = [int(l) for l in labels] @@ -140,15 +143,15 @@ class EvalATISSlot(object): pred_label_equal = [] refer_label_equal = [] assert len(refer_label) == len(pred_label) - for i in range(len(refer_label)): + for i in range(len(refer_label)): num = len(refer_label[i]) refer_label_equal.extend(refer_label[i]) - pred_label[i] = pred_label[i][: num] + pred_label[i] = pred_label[i][:num] pred_label_equal.extend(pred_label[i]) return pred_label_equal, refer_label_equal - def evaluate(self): + def evaluate(self): """ evaluate f1_micro score """ @@ -156,13 +159,13 @@ class EvalATISSlot(object): tp = dict() fn = dict() fp = dict() - for i in range(len(refer_label)): + for i in range(len(refer_label)): if refer_label[i] == pred_label[i]: - if refer_label[i] not in tp: + if 
refer_label[i] not in tp: tp[refer_label[i]] = 0 tp[refer_label[i]] += 1 - else: - if pred_label[i] not in fp: + else: + if pred_label[i] not in fp: fp[pred_label[i]] = 0 fp[pred_label[i]] += 1 if refer_label[i] not in fn: @@ -170,17 +173,17 @@ class EvalATISSlot(object): fn[refer_label[i]] += 1 results = ["label precision recall"] - for i in range(0, 130): - if i not in tp: + for i in range(0, 130): + if i not in tp: results.append(" %s: 0.0 0.0" % i) continue - if i in fp: + if i in fp: precision = float(tp[i]) / (tp[i] + fp[i]) - else: + else: precision = 1.0 - if i in fn: + if i in fn: recall = float(tp[i]) / (tp[i] + fn[i]) - else: + else: recall = 1.0 results.append(" %s: %.4f %.4f" % (i, precision, recall)) tp_total = sum(tp.values()) @@ -193,32 +196,33 @@ class EvalATISSlot(object): return "\n".join(results) -class EvalUDC(object): +class EvalUDC(object): """ evaluate udc """ - def __init__(self, pred, refer): + + def __init__(self, pred, refer): """ predict file """ self.pred_file = pred self.refer_file = refer - def load_data(self): + def load_data(self): """ load reference label and predict label """ - data = [] + data = [] refer_label = [] fr = io.open(self.refer_file, 'r', encoding="utf8") - for line in fr: + for line in fr: label = line.rstrip('\n').split('\t')[0] refer_label.append(label) idx = 0 fr = io.open(self.pred_file, 'r', encoding="utf8") - for line in fr: + for line in fr: elems = line.rstrip('\n').split('\t') - if len(elems) != 2 or not elems[0].isdigit(): + if len(elems) != 2 or not elems[0].isdigit(): continue match_prob = elems[1] data.append((float(match_prob), int(refer_label[idx]))) @@ -230,8 +234,8 @@ class EvalUDC(object): calculate precision in recall n """ pos_score = data[ind][0] - curr = data[ind: ind + m] - curr = sorted(curr, key = lambda x: x[0], reverse = True) + curr = data[ind:ind + m] + curr = sorted(curr, key=lambda x: x[0], reverse=True) if curr[n - 1][0] <= pos_score: return 1 @@ -241,20 +245,20 @@ class EvalUDC(object): """ calculate udc data """ - data = self.load_data() + data = self.load_data() assert len(data) % 10 == 0 - + p_at_1_in_2 = 0.0 p_at_1_in_10 = 0.0 p_at_2_in_10 = 0.0 p_at_5_in_10 = 0.0 - length = len(data)/10 + length = int(len(data) / 10) for i in range(0, length): ind = i * 10 assert data[ind][1] == 1 - + p_at_1_in_2 += self.get_p_at_n_in_m(data, 1, 2, ind) p_at_1_in_10 += self.get_p_at_n_in_m(data, 1, 10, ind) p_at_2_in_10 += self.get_p_at_n_in_m(data, 2, 10, ind) @@ -262,13 +266,14 @@ class EvalUDC(object): metrics_out = [p_at_1_in_2 / length, p_at_1_in_10 / length, \ p_at_2_in_10 / length, p_at_5_in_10 / length] - return metrics_out + return metrics_out -class EvalDSTC2(object): +class EvalDSTC2(object): """ evaluate dst testset, dstc2 """ + def __init__(self, task_name, pred, refer): """ predict file @@ -277,39 +282,39 @@ class EvalDSTC2(object): self.pred_file = pred self.refer_file = refer - def load_data(self): + def load_data(self): """ load reference label and predict label """ pred_label = [] refer_label = [] fr = io.open(self.refer_file, 'r', encoding="utf8") - for line in fr: + for line in fr: line = line.strip('\n') labels = [int(l) for l in line.split('\t')[-1].split()] labels = sorted(list(set(labels))) refer_label.append(" ".join([str(l) for l in labels])) all_pred = [] fr = io.open(self.pred_file, 'r', encoding="utf8") - for line in fr: + for line in fr: line = line.strip('\n') all_pred.append(line) all_pred = all_pred[len(all_pred) - len(refer_label):] - for line in all_pred: + for line in all_pred: labels 
= [int(l) for l in line.split('\t')[-1].split()] labels = sorted(list(set(labels))) pred_label.append(" ".join([str(l) for l in labels])) return pred_label, refer_label - def evaluate(self): + def evaluate(self): """ calculate joint acc && overall acc """ overall_all = 0.0 correct_joint = 0 pred_label, refer_label = self.load_data() - for i in range(len(refer_label)): - if refer_label[i] != pred_label[i]: + for i in range(len(refer_label)): + if refer_label[i] != pred_label[i]: continue correct_joint += 1 joint_all = float(correct_joint) / len(refer_label) @@ -317,9 +322,9 @@ class EvalDSTC2(object): return metrics_out -def evaluate(task_name, pred_file, refer_file): +def evaluate(task_name, pred_file, refer_file): """evaluate task metrics""" - if task_name.lower() == 'udc': + if task_name.lower() == 'udc': eval_inst = EvalUDC(pred_file, refer_file) eval_metrics = eval_inst.evaluate() print("MATCHING TASK: %s metrics in testset: " % task_name) @@ -328,45 +333,46 @@ def evaluate(task_name, pred_file, refer_file): print("R2@10: %s" % eval_metrics[2]) print("R5@10: %s" % eval_metrics[3]) - elif task_name.lower() in ['swda', 'mrda']: + elif task_name.lower() in ['swda', 'mrda']: eval_inst = EvalDA(task_name.lower(), pred_file, refer_file) eval_metrics = eval_inst.evaluate() print("DA TASK: %s metrics in testset: " % task_name) print("ACC: %s" % eval_metrics) - elif task_name.lower() == 'atis_intent': + elif task_name.lower() == 'atis_intent': eval_inst = EvalATISIntent(pred_file, refer_file) eval_metrics = eval_inst.evaluate() print("INTENTION TASK: %s metrics in testset: " % task_name) print("ACC: %s" % eval_metrics) - elif task_name.lower() == 'atis_slot': + elif task_name.lower() == 'atis_slot': eval_inst = EvalATISSlot(pred_file, refer_file) eval_metrics = eval_inst.evaluate() print("SLOT FILLING TASK: %s metrics in testset: " % task_name) print(eval_metrics) - elif task_name.lower() in ['dstc2', 'dstc2_asr']: + elif task_name.lower() in ['dstc2', 'dstc2_asr']: eval_inst = EvalDSTC2(task_name.lower(), pred_file, refer_file) eval_metrics = eval_inst.evaluate() print("DST TASK: %s metrics in testset: " % task_name) print("JOINT ACC: %s" % eval_metrics[0]) - elif task_name.lower() == "multi-woz": + elif task_name.lower() == "multi-woz": eval_inst = EvalMultiWoz(pred_file, refer_file) eval_metrics = eval_inst.evaluate() print("DST TASK: %s metrics in testset: " % task_name) print("JOINT ACC: %s" % eval_metrics[0]) print("OVERALL ACC: %s" % eval_metrics[1]) - else: - print("task name not in [udc|swda|mrda|atis_intent|atis_slot|dstc2|dstc2_asr|multi-woz]") + else: + print( + "task name not in [udc|swda|mrda|atis_intent|atis_slot|dstc2|dstc2_asr|multi-woz]" + ) -if __name__ == "__main__": - if len(sys.argv[1:]) < 3: +if __name__ == "__main__": + if len(sys.argv[1:]) < 3: print("please input task_name predict_file reference_file") task_name = sys.argv[1] pred_file = sys.argv[2] refer_file = sys.argv[3] - evaluate(task_name, pred_file, refer_file) diff --git a/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/reader.py b/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/reader.py index b825a889cd7cce4f00a16957bc1e6acc44e4a804..05f39ec07e04d9bb81663d79799f7fe6cf8b42b1 100644 --- a/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/reader.py +++ b/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/reader.py @@ -23,17 +23,21 @@ import numpy as np from dgu import tokenization from dgu.batching import prepare_batch_data +if sys.version[0] == '2': + reload(sys) + 
sys.setdefaultencoding('utf-8') + class DataProcessor(object): """Base class for data converters for sequence classification data sets.""" - def __init__(self, - data_dir, - vocab_path, - max_seq_len, - do_lower_case, + def __init__(self, + data_dir, + vocab_path, + max_seq_len, + do_lower_case, in_tokens, - task_name, + task_name, random_seed=None): self.data_dir = data_dir self.max_seq_len = max_seq_len @@ -90,7 +94,7 @@ class DataProcessor(object): mask_id=-1, return_input_mask=True, return_max_len=False, - return_num_token=False): + return_num_token=False): """generate batch data""" return prepare_batch_data( self.task_name, @@ -112,7 +116,7 @@ class DataProcessor(object): f = io.open(input_file, "r", encoding="utf8") reader = csv.reader(f, delimiter="\t", quotechar=quotechar) lines = [] - for line in reader: + for line in reader: lines.append(line) return lines @@ -145,21 +149,21 @@ class DataProcessor(object): raise ValueError( "Unknown phase, which should be in ['train', 'dev', 'test'].") - def instance_reader(): + def instance_reader(): """generate instance data""" if shuffle: np.random.shuffle(examples) - for (index, example) in enumerate(examples): - feature = self.convert_example( - index, example, - self.get_labels(), self.max_seq_len, self.tokenizer) + for (index, example) in enumerate(examples): + feature = self.convert_example(index, example, + self.get_labels(), + self.max_seq_len, self.tokenizer) instance = self.generate_instance(feature) yield instance - def batch_reader(reader, batch_size, in_tokens): + def batch_reader(reader, batch_size, in_tokens): """read batch data""" batch, total_token_num, max_len = [], 0, 0 - for instance in reader(): + for instance in reader(): token_ids, sent_ids, pos_ids, label = instance[:4] max_len = max(max_len, len(token_ids)) if in_tokens: @@ -177,13 +181,13 @@ class DataProcessor(object): if len(batch) > 0: yield batch, total_token_num - def wrapper(): + def wrapper(): """yield batch data to network""" for batch_data, total_token_num in batch_reader( - instance_reader, batch_size, self.in_tokens): - if self.in_tokens: + instance_reader, batch_size, self.in_tokens): + if self.in_tokens: max_seq = -1 - else: + else: max_seq = self.max_seq_len batch_data = self.generate_batch_data( batch_data, @@ -197,7 +201,7 @@ class DataProcessor(object): yield batch_data return wrapper - + class InputExample(object): """A single training/test example for simple sequence classification.""" @@ -248,19 +252,24 @@ class InputFeatures(object): self.label_id = label_id -class UDCProcessor(DataProcessor): +class UDCProcessor(DataProcessor): """Processor for the UDC data set.""" - def _create_examples(self, lines, set_type): + + def _create_examples(self, lines, set_type): """Creates examples for the training and dev sets.""" examples = [] - print("UDC dataset is too big, loading data spent a long time, please wait patiently..................") - for (i, line) in enumerate(lines): - if len(line) < 3: + print( + "UDC dataset is too big, loading data spent a long time, please wait patiently.................." 
+ ) + for (i, line) in enumerate(lines): + if len(line) < 3: print("data format error: %s" % "\t".join(line)) - print("data row contains at least three parts: label\tconv1\t.....\tresponse") + print( + "data row contains at least three parts: label\tconv1\t.....\tresponse" + ) continue guid = "%s-%d" % (set_type, i) - text_a = "\t".join(line[1: -1]) + text_a = "\t".join(line[1:-1]) text_a = tokenization.convert_to_unicode(text_a) text_a = text_a.split('\t') text_b = line[-1] @@ -271,21 +280,21 @@ class UDCProcessor(DataProcessor): guid=guid, text_a=text_a, text_b=text_b, label=label)) return examples - def get_train_examples(self, data_dir): + def get_train_examples(self, data_dir): """See base class.""" examples = [] lines = self._read_tsv(os.path.join(data_dir, "train.txt")) examples = self._create_examples(lines, "train") return examples - def get_dev_examples(self, data_dir): + def get_dev_examples(self, data_dir): """See base class.""" examples = [] lines = self._read_tsv(os.path.join(data_dir, "dev.txt")) examples = self._create_examples(lines, "dev") return examples - def get_test_examples(self, data_dir): + def get_test_examples(self, data_dir): """See base class.""" examples = [] lines = self._read_tsv(os.path.join(data_dir, "test.txt")) @@ -293,19 +302,20 @@ class UDCProcessor(DataProcessor): return examples @staticmethod - def get_labels(): + def get_labels(): """See base class.""" return ["0", "1"] -class SWDAProcessor(DataProcessor): +class SWDAProcessor(DataProcessor): """Processor for the SWDA data set.""" - def _create_examples(self, lines, set_type): + + def _create_examples(self, lines, set_type): """Creates examples for the training and dev sets.""" examples = create_multi_turn_examples(lines, set_type) return examples - - def get_train_examples(self, data_dir): + + def get_train_examples(self, data_dir): """See base class.""" examples = [] lines = self._read_tsv(os.path.join(data_dir, "train.txt")) @@ -327,21 +337,22 @@ class SWDAProcessor(DataProcessor): return examples @staticmethod - def get_labels(): + def get_labels(): """See base class.""" labels = range(42) labels = [str(label) for label in labels] return labels -class MRDAProcessor(DataProcessor): +class MRDAProcessor(DataProcessor): """Processor for the MRDA data set.""" - def _create_examples(self, lines, set_type): + + def _create_examples(self, lines, set_type): """Creates examples for the training and dev sets.""" examples = create_multi_turn_examples(lines, set_type) return examples - - def get_train_examples(self, data_dir): + + def get_train_examples(self, data_dir): """See base class.""" examples = [] lines = self._read_tsv(os.path.join(data_dir, "train.txt")) @@ -363,22 +374,25 @@ class MRDAProcessor(DataProcessor): return examples @staticmethod - def get_labels(): + def get_labels(): """See base class.""" labels = range(42) labels = [str(label) for label in labels] return labels -class ATISSlotProcessor(DataProcessor): +class ATISSlotProcessor(DataProcessor): """Processor for the ATIS Slot data set.""" - def _create_examples(self, lines, set_type): + + def _create_examples(self, lines, set_type): """Creates examples for the training and dev sets.""" examples = [] - for (i, line) in enumerate(lines): - if len(line) != 2: + for (i, line) in enumerate(lines): + if len(line) != 2: print("data format error: %s" % "\t".join(line)) - print("data row contains two parts: conversation_content \t label1 label2 label3") + print( + "data row contains two parts: conversation_content \t label1 label2 label3" + ) 
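+            # malformed row: report it and move on to the next line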
continue guid = "%s-%d" % (set_type, i) text_a = line[0] @@ -390,7 +404,7 @@ class ATISSlotProcessor(DataProcessor): guid=guid, text_a=text_a, label=label_list)) return examples - def get_train_examples(self, data_dir): + def get_train_examples(self, data_dir): """See base class.""" examples = [] lines = self._read_tsv(os.path.join(data_dir, "train.txt")) @@ -412,30 +426,30 @@ class ATISSlotProcessor(DataProcessor): return examples @staticmethod - def get_labels(): + def get_labels(): """See base class.""" labels = range(130) labels = [str(label) for label in labels] return labels -class ATISIntentProcessor(DataProcessor): +class ATISIntentProcessor(DataProcessor): """Processor for the ATIS intent data set.""" - def _create_examples(self, lines, set_type): + + def _create_examples(self, lines, set_type): """Creates examples for the training and dev sets.""" examples = [] - for (i, line) in enumerate(lines): - if len(line) != 2: + for (i, line) in enumerate(lines): + if len(line) != 2: print("data format error: %s" % "\t".join(line)) - print("data row contains two parts: label \t conversation_content") + print( + "data row contains two parts: label \t conversation_content") continue guid = "%s-%d" % (set_type, i) text_a = line[1] text_a = tokenization.convert_to_unicode(text_a) label = tokenization.convert_to_unicode(line[0]) - examples.append( - InputExample( - guid=guid, text_a=text_a, label=label)) + examples.append(InputExample(guid=guid, text_a=text_a, label=label)) return examples def get_train_examples(self, data_dir): @@ -467,53 +481,60 @@ class ATISIntentProcessor(DataProcessor): return labels -class DSTC2Processor(DataProcessor): +class DSTC2Processor(DataProcessor): """Processor for the DSTC2 data set.""" - def _create_turns(self, conv_example): + + def _create_turns(self, conv_example): """create multi turn dataset""" samples = [] max_turns = 20 - for i in range(len(conv_example)): - conv_turns = conv_example[max(i - max_turns, 0): i + 1] + for i in range(len(conv_example)): + conv_turns = conv_example[max(i - max_turns, 0):i + 1] conv_info = "\1".join([sample[0] for sample in conv_turns]) samples.append((conv_info.split('\1'), conv_example[i][1])) return samples - def _create_examples(self, lines, set_type): + def _create_examples(self, lines, set_type): """Creates examples for multi-turn dialogue sets.""" examples = [] conv_id = -1 index = 0 conv_example = [] - for (i, line) in enumerate(lines): - if len(line) != 3: + for (i, line) in enumerate(lines): + if len(line) != 3: print("data format error: %s" % "\t".join(line)) - print("data row contains three parts: conversation_content \t question \1 answer \t state1 state2 state3......") + print( + "data row contains three parts: conversation_content \t question \1 answer \t state1 state2 state3......" 
+ ) continue conv_no = line[0] text_a = line[1] label_list = line[2].split() - if conv_no != conv_id and i != 0: + if conv_no != conv_id and i != 0: samples = self._create_turns(conv_example) - for sample in samples: + for sample in samples: guid = "%s-%s" % (set_type, index) index += 1 history = sample[0] dst_label = sample[1] - examples.append(InputExample(guid=guid, text_a=history, label=dst_label)) + examples.append( + InputExample( + guid=guid, text_a=history, label=dst_label)) conv_example = [] conv_id = conv_no if i == 0: conv_id = conv_no conv_example.append((text_a, label_list)) - if conv_example: + if conv_example: samples = self._create_turns(conv_example) for sample in samples: guid = "%s-%s" % (set_type, index) index += 1 history = sample[0] dst_label = sample[1] - examples.append(InputExample(guid=guid, text_a=history, label=dst_label)) + examples.append( + InputExample( + guid=guid, text_a=history, label=dst_label)) return examples def get_train_examples(self, data_dir): @@ -545,20 +566,22 @@ class DSTC2Processor(DataProcessor): return labels -class MULTIWOZProcessor(DataProcessor): +class MULTIWOZProcessor(DataProcessor): """Processor for the MULTIWOZ data set.""" - def _create_turns(self, conv_example): + + def _create_turns(self, conv_example): """create multi turn dataset""" samples = [] max_turns = 2 for i in range(len(conv_example)): - prefix_turns = conv_example[max(i - max_turns, 0): i] + prefix_turns = conv_example[max(i - max_turns, 0):i] conv_info = "\1".join([turn[0] for turn in prefix_turns]) current_turns = conv_example[i][0] - samples.append((conv_info.split('\1'), current_turns.split('\1'), conv_example[i][1])) + samples.append((conv_info.split('\1'), current_turns.split('\1'), + conv_example[i][1])) return samples - def _create_examples(self, lines, set_type): + def _create_examples(self, lines, set_type): """Creates examples for multi-turn dialogue sets.""" examples = [] conv_id = -1 @@ -568,7 +591,7 @@ class MULTIWOZProcessor(DataProcessor): conv_no = line[0] text_a = line[2] label_list = line[1].split() - if conv_no != conv_id and i != 0: + if conv_no != conv_id and i != 0: samples = self._create_turns(conv_example) for sample in samples: guid = "%s-%s" % (set_type, index) @@ -576,13 +599,18 @@ class MULTIWOZProcessor(DataProcessor): history = sample[0] current = sample[1] dst_label = sample[2] - examples.append(InputExample(guid=guid, text_a=history, text_b=current, label=dst_label)) + examples.append( + InputExample( + guid=guid, + text_a=history, + text_b=current, + label=dst_label)) conv_example = [] conv_id = conv_no - if i == 0: + if i == 0: conv_id = conv_no conv_example.append((text_a, label_list)) - if conv_example: + if conv_example: samples = self._create_turns(conv_example) for sample in samples: guid = "%s-%s" % (set_type, index) @@ -590,10 +618,15 @@ class MULTIWOZProcessor(DataProcessor): history = sample[0] current = sample[1] dst_label = sample[2] - examples.append(InputExample(guid=guid, text_a=history, text_b=current, label=dst_label)) + examples.append( + InputExample( + guid=guid, + text_a=history, + text_b=current, + label=dst_label)) return examples - def get_train_examples(self, data_dir): + def get_train_examples(self, data_dir): """See base class.""" examples = [] lines = self._read_tsv(os.path.join(data_dir, "train.txt")) @@ -622,34 +655,38 @@ class MULTIWOZProcessor(DataProcessor): return labels -def create_dialogue_examples(conv): +def create_dialogue_examples(conv): """Creates dialogue sample""" samples = [] - for i in 
range(len(conv)): + for i in range(len(conv)): cur_txt = "%s : %s" % (conv[i][2], conv[i][3]) - pre_txt = ["%s : %s" % (c[2], c[3]) for c in conv[max(0, i - 5): i]] - suf_txt = ["%s : %s" % (c[2], c[3]) for c in conv[i + 1: min(len(conv), i + 3)]] + pre_txt = ["%s : %s" % (c[2], c[3]) for c in conv[max(0, i - 5):i]] + suf_txt = [ + "%s : %s" % (c[2], c[3]) for c in conv[i + 1:min(len(conv), i + 3)] + ] sample = [conv[i][1], pre_txt, cur_txt, suf_txt] samples.append(sample) return samples -def create_multi_turn_examples(lines, set_type): +def create_multi_turn_examples(lines, set_type): """Creates examples for multi-turn dialogue sets.""" conv_id = -1 examples = [] conv_example = [] index = 0 - for (i, line) in enumerate(lines): - if len(line) != 4: + for (i, line) in enumerate(lines): + if len(line) != 4: print("data format error: %s" % "\t".join(line)) - print("data row contains four parts: conversation_id \t label \t caller \t conversation_content") + print( + "data row contains four parts: conversation_id \t label \t caller \t conversation_content" + ) continue tokens = line conv_no = tokens[0] - if conv_no != conv_id and i != 0: + if conv_no != conv_id and i != 0: samples = create_dialogue_examples(conv_example) - for sample in samples: + for sample in samples: guid = "%s-%s" % (set_type, index) index += 1 label = sample[0] @@ -657,15 +694,20 @@ def create_multi_turn_examples(lines, set_type): text_b = sample[2] text_c = sample[3] examples.append( - InputExample(guid=guid, text_a=text_a, text_b=text_b, text_c=text_c, label=label)) + InputExample( + guid=guid, + text_a=text_a, + text_b=text_b, + text_c=text_c, + label=label)) conv_example = [] conv_id = conv_no - if i == 0: + if i == 0: conv_id = conv_no conv_example.append(tokens) - if conv_example: + if conv_example: samples = create_dialogue_examples(conv_example) - for sample in samples: + for sample in samples: guid = "%s-%s" % (set_type, index) index += 1 label = sample[0] @@ -673,62 +715,67 @@ def create_multi_turn_examples(lines, set_type): text_b = sample[2] text_c = sample[3] examples.append( - InputExample(guid=guid, text_a=text_a, text_b=text_b, text_c=text_c, label=label)) + InputExample( + guid=guid, + text_a=text_a, + text_b=text_b, + text_c=text_c, + label=label)) return examples -def convert_tokens(tokens, sep_id, tokenizer): +def convert_tokens(tokens, sep_id, tokenizer): """Converts tokens to ids""" tokens_ids = [] - if not tokens: + if not tokens: return tokens_ids - if isinstance(tokens, list): - for text in tokens: + if isinstance(tokens, list): + for text in tokens: tok_text = tokenizer.tokenize(text) ids = tokenizer.convert_tokens_to_ids(tok_text) tokens_ids.extend(ids) tokens_ids.append(sep_id) - tokens_ids = tokens_ids[: -1] - else: + tokens_ids = tokens_ids[:-1] + else: tok_text = tokenizer.tokenize(tokens) tokens_ids = tokenizer.convert_tokens_to_ids(tok_text) return tokens_ids -def convert_single_example(ex_index, example, label_list, max_seq_length, +def convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, task_name): """Converts a single DA `InputExample` into a single `InputFeatures`.""" label_map = {} - SEP = 102 + SEP = 102 CLS = 101 - if task_name == 'udc': + if task_name == 'udc': INNER_SEP = 1 limit_length = 60 - elif task_name == 'swda': + elif task_name == 'swda': INNER_SEP = 1 limit_length = 50 - elif task_name == 'mrda': + elif task_name == 'mrda': INNER_SEP = 1 limit_length = 50 - elif task_name == 'atis_intent': + elif task_name == 'atis_intent': INNER_SEP = -1 
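+        # NOTE (editor): INNER_SEP is the separator id convert_tokens() inserts
+        # between sub-segments and limit_length caps segment length in the
+        # truncation logic below; -1 is assumed to mean "unused" for this
+        # single-segment task.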
limit_length = -1 - elif task_name == 'atis_slot': + elif task_name == 'atis_slot': INNER_SEP = -1 limit_length = -1 - elif task_name == 'dstc2': + elif task_name == 'dstc2': INNER_SEP = 1 limit_length = -1 - elif task_name == 'dstc2_asr': + elif task_name == 'dstc2_asr': INNER_SEP = 1 limit_length = -1 - elif task_name == 'multi-woz': + elif task_name == 'multi-woz': INNER_SEP = 1 limit_length = 200 - for (i, label) in enumerate(label_list): + for (i, label) in enumerate(label_list): label_map[label] = i - + tokens_a = example.text_a tokens_b = example.text_b tokens_c = example.text_c @@ -737,30 +784,36 @@ def convert_single_example(ex_index, example, label_list, max_seq_length, tokens_b_ids = convert_tokens(tokens_b, INNER_SEP, tokenizer) tokens_c_ids = convert_tokens(tokens_c, INNER_SEP, tokenizer) - if tokens_b_ids: + if tokens_b_ids: tokens_b_ids = tokens_b_ids[:min(limit_length, len(tokens_b_ids))] - else: + else: if len(tokens_a_ids) > max_seq_length - 2: tokens_a_ids = tokens_a_ids[len(tokens_a_ids) - max_seq_length + 2:] - if not tokens_c_ids: - if len(tokens_a_ids) > max_seq_length - len(tokens_b_ids) - 3: - tokens_a_ids = tokens_a_ids[len(tokens_a_ids) - max_seq_length + len(tokens_b_ids) + 3:] - else: - if len(tokens_a_ids) + len(tokens_b_ids) + len(tokens_c_ids) > max_seq_length - 4: + if not tokens_c_ids: + if len(tokens_a_ids) > max_seq_length - len(tokens_b_ids) - 3: + tokens_a_ids = tokens_a_ids[len(tokens_a_ids) - max_seq_length + + len(tokens_b_ids) + 3:] + else: + if len(tokens_a_ids) + len(tokens_b_ids) + len( + tokens_c_ids) > max_seq_length - 4: left_num = max_seq_length - len(tokens_b_ids) - 4 - if len(tokens_a_ids) > len(tokens_c_ids): + if len(tokens_a_ids) > len(tokens_c_ids): suffix_num = int(left_num / 2) - tokens_c_ids = tokens_c_ids[: min(len(tokens_c_ids), suffix_num)] + tokens_c_ids = tokens_c_ids[:min(len(tokens_c_ids), suffix_num)] prefix_num = left_num - len(tokens_c_ids) - tokens_a_ids = tokens_a_ids[max(0, len(tokens_a_ids) - prefix_num):] - else: - if not tokens_a_ids: - tokens_c_ids = tokens_c_ids[max(0, len(tokens_c_ids) - left_num):] - else: + tokens_a_ids = tokens_a_ids[max( + 0, len(tokens_a_ids) - prefix_num):] + else: + if not tokens_a_ids: + tokens_c_ids = tokens_c_ids[max( + 0, len(tokens_c_ids) - left_num):] + else: prefix_num = int(left_num / 2) - tokens_a_ids = tokens_a_ids[max(0, len(tokens_a_ids) - prefix_num):] + tokens_a_ids = tokens_a_ids[max( + 0, len(tokens_a_ids) - prefix_num):] suffix_num = left_num - len(tokens_a_ids) - tokens_c_ids = tokens_c_ids[: min(len(tokens_c_ids), suffix_num)] + tokens_c_ids = tokens_c_ids[:min( + len(tokens_c_ids), suffix_num)] input_ids = [] segment_ids = [] @@ -770,31 +823,31 @@ def convert_single_example(ex_index, example, label_list, max_seq_length, segment_ids.extend([0] * len(tokens_a_ids)) input_ids.append(SEP) segment_ids.append(0) - if tokens_b_ids: + if tokens_b_ids: input_ids.extend(tokens_b_ids) segment_ids.extend([1] * len(tokens_b_ids)) input_ids.append(SEP) segment_ids.append(1) - if tokens_c_ids: + if tokens_c_ids: input_ids.extend(tokens_c_ids) segment_ids.extend([0] * len(tokens_c_ids)) input_ids.append(SEP) segment_ids.append(0) input_mask = [1] * len(input_ids) - if task_name == 'atis_slot': + if task_name == 'atis_slot': label_id = [0] + [label_map[l] for l in example.label] + [0] - elif task_name in ['dstc2', 'dstc2_asr', 'multi-woz']: + elif task_name in ['dstc2', 'dstc2_asr', 'multi-woz']: label_id_enty = [label_map[l] for l in example.label] label_id = [] - for i in 
range(len(label_map)): - if i in label_id_enty: + for i in range(len(label_map)): + if i in label_id_enty: label_id.append(1) - else: + else: label_id.append(0) - else: + else: label_id = label_map[example.label] - + if ex_index < 5: print("*** Example ***") print("guid: %s" % (example.guid)) @@ -807,7 +860,5 @@ def convert_single_example(ex_index, example, label_list, max_seq_length, input_mask=input_mask, segment_ids=segment_ids, label_id=label_id) - - return feature - + return feature diff --git a/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/scripts/build_atis_dataset.py b/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/scripts/build_atis_dataset.py index 2ea18357ca847f834f3892d09db07e2b63c19c84..09f3746039d3d55f9b824e76bf8434e95bffa670 100755 --- a/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/scripts/build_atis_dataset.py +++ b/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/scripts/build_atis_dataset.py @@ -73,11 +73,11 @@ class ATIS(object): if example[1] not in self.intent_dict: self.intent_dict[example[1]] = self.intent_id self.intent_id += 1 - fw.write("%s\t%s\n" % (self.intent_dict[example[1]], example[0].lower())) + fw.write(u"%s\t%s\n" % (self.intent_dict[example[1]], example[0].lower())) fw = io.open(self.map_tag_intent, 'w', encoding="utf8") for tag in self.intent_dict: - fw.write("%s\t%s\n" % (tag, self.intent_dict[tag])) + fw.write(u"%s\t%s\n" % (tag, self.intent_dict[tag])) def _parser_slot_data(self, examples, data_type): """ @@ -119,11 +119,11 @@ class ATIS(object): if entities[-1]['end'] < len(text): suffix_num = len(text[entities[-1]['end']:].strip().split()) tags.extend([str(self.slot_dict['O'])] * suffix_num) - fw.write("%s\t%s\n" % (text.encode('utf8'), " ".join(tags).encode('utf8'))) + fw.write(u"%s\t%s\n" % (text.encode('utf8'), " ".join(tags).encode('utf8'))) fw = io.open(self.map_tag_slot, 'w', encoding="utf8") for slot in self.slot_dict: - fw.write("%s\t%s\n" % (slot, self.slot_dict[slot])) + fw.write(u"%s\t%s\n" % (slot, self.slot_dict[slot])) def get_train_dataset(self): """ diff --git a/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/scripts/build_dstc2_dataset.py b/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/scripts/build_dstc2_dataset.py index f2c83e0b7b417622bc959d975c6e2a6f1fd1109b..9655ce7268028ac5a30b843105de09c1f13a7b68 100755 --- a/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/scripts/build_dstc2_dataset.py +++ b/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/scripts/build_dstc2_dataset.py @@ -106,8 +106,8 @@ class DSTC2(object): out = "%s\t%s\1%s\t%s" % (session_id, mach, user, labels_ids) user_asr = log_turn['input']['live']['asr-hyps'][0]['asr-hyp'].strip() out_asr = "%s\t%s\1%s\t%s" % (session_id, mach, user_asr, labels_ids) - fw.write("%s\n" % out.encode('utf8')) - fw_asr.write("%s\n" % out_asr.encode('utf8')) + fw.write(u"%s\n" % out.encode('utf8')) + fw_asr.write(u"%s\n" % out_asr.encode('utf8')) def get_train_dataset(self): """ @@ -133,7 +133,7 @@ class DSTC2(object): """ fw = io.open(self.map_tag, 'w', encoding="utf8") for elem in self.map_tag_dict: - fw.write("%s\t%s\n" % (elem, self.map_tag_dict[elem])) + fw.write(u"%s\t%s\n" % (elem, self.map_tag_dict[elem])) def main(self): """ diff --git a/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/scripts/build_mrda_dataset.py b/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/scripts/build_mrda_dataset.py index 
7de02adc2b4552526777752ee67a0ca506801f42..e5c0406fce45637364e3dc8ed7cc2ed7739c15a1 100755 --- a/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/scripts/build_mrda_dataset.py +++ b/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/scripts/build_mrda_dataset.py @@ -121,7 +121,7 @@ class MRDA(object): caller = elem.split('_')[0].split('-')[-1] conv_no = elem.split('_')[0].split('-')[0] out = "%s\t%s\t%s\t%s" % (conv_no, self.map_tag_dict[tag], caller, v_trans[0]) - fw.write("%s\n" % out) + fw.write(u"%s\n" % out) def get_train_dataset(self): """ @@ -147,7 +147,7 @@ class MRDA(object): """ fw = io.open(self.map_tag, 'w', encoding="utf8") for elem in self.map_tag_dict: - fw.write("%s\t%s\n" % (elem, self.map_tag_dict[elem])) + fw.write(u"%s\t%s\n" % (elem, self.map_tag_dict[elem])) def main(self): """ diff --git a/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/scripts/build_swda_dataset.py b/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/scripts/build_swda_dataset.py index c821e7fe52a620d6931456c5216276354fa56257..441d2852c760e9cef31147e666855f89dba406bb 100755 --- a/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/scripts/build_swda_dataset.py +++ b/PaddleNLP/PaddleDialogue/dialogue_general_understanding/dgu/scripts/build_swda_dataset.py @@ -69,7 +69,7 @@ class SWDA(object): idx += 1 continue out = self._parser_utterence(r) - fw.write("%s\n" % out) + fw.write(u"%s\n" % out) def _clean_text(self, text): """ @@ -213,7 +213,7 @@ class SWDA(object): """ fw = io.open(self.map_tag, 'w', encoding='utf8') for elem in self.map_tag_dict: - fw.write("%s\t%s\n" % (elem, self.map_tag_dict[elem])) + fw.write(u"%s\t%s\n" % (elem, self.map_tag_dict[elem])) def main(self): """ diff --git a/PaddleNLP/PaddleLARK/BERT/README.md b/PaddleNLP/PaddleLARK/BERT/README.md index 7ed1d28bca0f5454995b6e67951894b7430c20bc..47ceb84eef23fc0d621da5a839e80ebb10ccf4a0 100644 --- a/PaddleNLP/PaddleLARK/BERT/README.md +++ b/PaddleNLP/PaddleLARK/BERT/README.md @@ -416,6 +416,4 @@ for (size_t i = 0; i < output.front().data.length() / sizeof(float); i += 3) { } ``` -## Contributors -本项目由百度自然语言处理部语义计算团队([@nbcc](https://github.com/nbcc) [@tianxin1860](https://github.com/tianxin1860))和深度学习技术平台部 PaddlePaddle 团队([@kuke](https://github.com/kuke) [@gongweibao](https://github.com/gongweibao) [@fc500110](https://github.com/fc500110) [@iclementine](https://github.com/iclementine))合作完成。 diff --git a/PaddleNLP/PaddleLARK/BERT/model/classifier.py b/PaddleNLP/PaddleLARK/BERT/model/classifier.py index 03bfb7aa1504253399b5bdd4d1f4fe9c41cad27b..079d5683edf843b6a7f40947e4b10f2c1b799e15 100644 --- a/PaddleNLP/PaddleLARK/BERT/model/classifier.py +++ b/PaddleNLP/PaddleLARK/BERT/model/classifier.py @@ -25,8 +25,8 @@ from model.bert import BertModel def create_model(args, bert_config, num_labels, is_prediction=False): input_fields = { 'names': ['src_ids', 'pos_ids', 'sent_ids', 'input_mask', 'labels'], - 'shapes': [[None, None], [None, None], [None, None], - [None, args.max_seq_len, 1], [None, 1]], + 'shapes': [[None, None], [None, None], [None, None], [None, None, 1], + [None, 1]], 'dtypes': ['int64', 'int64', 'int64', 'float32', 'int64'], 'lod_levels': [0, 0, 0, 0, 0], } diff --git a/PaddleNLP/PaddleLARK/BERT/run_classifier.py b/PaddleNLP/PaddleLARK/BERT/run_classifier.py index 1e769d3446da2c92f27ae0f13f31db3b2f30735a..221a14f768097afa8da3f96cd7e8c0f8690d1a99 100644 --- a/PaddleNLP/PaddleLARK/BERT/run_classifier.py +++ b/PaddleNLP/PaddleLARK/BERT/run_classifier.py @@ -32,6 +32,7 @@ import 
multiprocessing import paddle import paddle.fluid as fluid +from paddle.fluid import profiler import reader.cls as reader from model.bert import BertConfig @@ -93,6 +94,12 @@ data_g.add_arg("do_lower_case", bool, True, data_g.add_arg("random_seed", int, 0, "Random seed.") run_type_g = ArgumentGroup(parser, "run_type", "running type options.") + +# NOTE:profiler args, used for benchmark +run_type_g.add_arg("profiler_path", str, './', "the profiler output file path. (used for benchmark)") +run_type_g.add_arg("is_profiler", int, 0, "the profiler switch. (used for benchmark)") +run_type_g.add_arg("max_iter", int, 0, "the max batch nums to train. (used for benchmark)") + run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.") run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).") run_type_g.add_arg("shuffle", bool, True, "") @@ -317,9 +324,17 @@ def main(args): time_begin = time.time() throughput = [] ce_info = [] + + total_batch_num=0 # used for benchmark + while True: try: steps += 1 + + total_batch_num += 1 # used for benchmark + if args.max_iter and total_batch_num == args.max_iter: # used for benchmark + return + if steps % args.skip_steps == 0: if args.use_fp16: fetch_list = [loss.name, accuracy.name, scheduled_lr.name, num_seqs.name, loss_scaling.name] @@ -353,6 +368,13 @@ def main(args): time_end = time.time() used_time = time_end - time_begin + # profiler tools + if args.is_profiler and current_epoch == 0 and steps == args.skip_steps: + profiler.start_profiler("All") + elif args.is_profiler and current_epoch == 0 and steps == args.skip_steps * 2: + profiler.stop_profiler("total", args.profiler_path) + return + log_record = "epoch: {}, progress: {}/{}, step: {}, ave loss: {}, ave acc: {}".format( current_epoch, current_example, num_train_examples, steps, np.sum(total_cost) / np.sum(total_num_seqs), diff --git a/PaddleNLP/PaddleLARK/BERT/run_squad.py b/PaddleNLP/PaddleLARK/BERT/run_squad.py index fc3659b61eeb7d06c252a61f83c99fbe3adfea97..e005b2439113d6a0d20a6cc145f3d0110f474af2 100644 --- a/PaddleNLP/PaddleLARK/BERT/run_squad.py +++ b/PaddleNLP/PaddleLARK/BERT/run_squad.py @@ -111,7 +111,7 @@ def create_model(bert_config, is_training=False): input_fields = { 'names': ['src_ids', 'pos_ids', 'sent_ids', 'input_mask', 'start_positions', 'end_positions'], 'shapes': [[None, None], [None, None], [None, None], - [None, args.max_seq_len, 1], [None, 1], [None, 1]], + [None, None, 1], [None, 1], [None, 1]], 'dtypes': [ 'int64', 'int64', 'int64', 'float32', 'int64', 'int64'], 'lod_levels': [0, 0, 0, 0, 0, 0], @@ -120,7 +120,7 @@ def create_model(bert_config, is_training=False): input_fields = { 'names': ['src_ids', 'pos_ids', 'sent_ids', 'input_mask', 'unique_id'], 'shapes': [[None, None], [None, None], [None, None], - [None, args.max_seq_len, 1], [None, 1]], + [None, None, 1], [None, 1]], 'dtypes': [ 'int64', 'int64', 'int64', 'float32', 'int64'], 'lod_levels': [0, 0, 0, 0, 0], diff --git a/PaddleNLP/PaddleLARK/BERT/utils/init.py b/PaddleNLP/PaddleLARK/BERT/utils/init.py index 3844d01298ecbb70aed37b467aebca62caadd391..df2406b5c52e04215634ba0b6f6e4c554eadf0d6 100644 --- a/PaddleNLP/PaddleLARK/BERT/utils/init.py +++ b/PaddleNLP/PaddleLARK/BERT/utils/init.py @@ -34,7 +34,7 @@ def cast_fp32_to_fp16(exe, main_program): master_param_var = fluid.global_scope().find_var(param.name + ".master") if master_param_var is not None: - master_param_var.get_tensor().set(data, exe.place) + master_param_var.get_tensor().set(np.float32(data), 
exe.place) def init_checkpoint(exe, init_checkpoint_path, main_program, use_fp16=False): @@ -44,7 +44,9 @@ def init_checkpoint(exe, init_checkpoint_path, main_program, use_fp16=False): def existed_persitables(var): if not fluid.io.is_persistable(var): return False - return os.path.exists(os.path.join(init_checkpoint_path, var.name)) + if os.path.exists(os.path.join(init_checkpoint_path, var.name)): + print("INIT {}".format(var.name)) + return True fluid.io.load_vars( exe, @@ -67,7 +69,12 @@ def init_pretraining_params(exe, def existed_params(var): if not isinstance(var, fluid.framework.Parameter): return False - return os.path.exists(os.path.join(pretraining_params_path, var.name)) + if os.path.exists(os.path.join(pretraining_params_path, var.name)): + print("INIT {}".format(var.name)) + return True + else: + print("SKIP {}".format(var.name)) + return False fluid.io.load_vars( exe, diff --git a/PaddleNLP/PaddleLARK/ELMo/README.md b/PaddleNLP/PaddleLARK/ELMo/README.md index 999354cc96dd838ca62071c954185a819232e518..edf79e2b509e9eb5c3faf85c281a5ab3680adec7 100755 --- a/PaddleNLP/PaddleLARK/ELMo/README.md +++ b/PaddleNLP/PaddleLARK/ELMo/README.md @@ -92,5 +92,3 @@ word_embedding=fluid.layers.concat(input=[elmo_embedding, word_embedding], axis= [Deep contextualized word representations](https://arxiv.org/abs/1802.05365) -### Contributors -本项目由百度深度学习技术平台部 PaddlePaddle 团队([@xuezhong](https://github.com/xuezhong) [@JesseyXujin](https://github.com/JesseyXujin))和百度自然语言处理部语义计算团队([@nbcc](https://github.com/nbcc) [@tianxin1860](https://github.com/tianxin1860))合作完成。 diff --git a/PaddleNLP/PaddleLARK/XLNet/.run_ce.sh b/PaddleNLP/PaddleLARK/XLNet/.run_ce.sh new file mode 100644 index 0000000000000000000000000000000000000000..7fa5fc829608ed6fdab8097922517b70ac3d2f89 --- /dev/null +++ b/PaddleNLP/PaddleLARK/XLNet/.run_ce.sh @@ -0,0 +1,31 @@ + +train(){ +python run_classifier.py \ + --data_dir data/STS-B \ + --verbose True \ + --shuffle false \ + --init_checkpoint xlnet_cased_L-12_H-768_A-12/params \ + --predict_dir exp/sts-b \ + --model_config_path xlnet_cased_L-12_H-768_A-12/xlnet_config.json \ + --uncased False \ + --save_steps 50 \ + --train_steps 50 \ + --epoch 1 \ + --skip_steps 10 \ + --validation_steps 30 \ + --task_name sts-b \ + --warmup_steps 5 \ + --random_seed 100 \ + --spiece_model_file xlnet_cased_L-12_H-768_A-12/spiece.model \ + --checkpoints checkpoints_sts-b \ + --is_regression True \ + --use_cuda True \ + --eval_batch_size 4 \ + --enable_ce +} + +export CUDA_VISIBLE_DEVICES=0 +train | python _ce.py + +export CUDA_VISIBLE_DEVICES=0,1,2,3 +train | python _ce.py diff --git a/PaddleNLP/PaddleLARK/XLNet/_ce.py b/PaddleNLP/PaddleLARK/XLNet/_ce.py new file mode 100644 index 0000000000000000000000000000000000000000..094434273c106f0b0862a8bc0e63f54f4f208289 --- /dev/null +++ b/PaddleNLP/PaddleLARK/XLNet/_ce.py @@ -0,0 +1,67 @@ +####this file is only used for continuous evaluation test! + +import os +import sys +sys.path.insert(0, os.environ['ceroot']) +from kpi import CostKpi, DurationKpi, AccKpi + +#### NOTE kpi.py should shared in models in some way!!!! 
+ + +train_duration_sts_b_card1 = DurationKpi( + 'train_duration_sts_b_card1', 0.01, 0, actived=True) +train_cost_sts_b_card1 = CostKpi( + 'train_cost_sts_b_card1', 0.02, 0, actived=True) +train_duration_sts_b_card4 = DurationKpi( + 'train_duration_sts_b_card4', 0.04, 0, actived=True) +train_cost_sts_b_card4 = CostKpi( + 'train_cost_sts_b_card4', 0.08, 0, actived=False) + +tracking_kpis = [ + train_duration_sts_b_card1, + train_cost_sts_b_card1, + train_duration_sts_b_card4, + train_cost_sts_b_card4, +] + + +def parse_log(log): + ''' + This method should be implemented by model developers. + The suggestion: + each line in the log should be key, value, for example: + " + train_cost\t1.0 + test_cost\t1.0 + train_cost\t1.0 + train_cost\t1.0 + train_acc\t1.2 + " + ''' + for line in log.split('\n'): + fs = line.strip().split('\t') + print(fs) + if len(fs) == 3 and fs[0] == 'kpis': + print("-----%s" % fs) + kpi_name = fs[1] + kpi_value = float(fs[2]) + yield kpi_name, kpi_value + + +def log_to_ce(log): + kpi_tracker = {} + for kpi in tracking_kpis: + kpi_tracker[kpi.name] = kpi + + for (kpi_name, kpi_value) in parse_log(log): + print(kpi_name, kpi_value) + kpi_tracker[kpi_name].add_record(kpi_value) + kpi_tracker[kpi_name].persist() + + +if __name__ == '__main__': + log = sys.stdin.read() + print("*****") + print(log) + print("****") + log_to_ce(log) diff --git a/PaddleNLP/PaddleLARK/XLNet/run_classifier.py b/PaddleNLP/PaddleLARK/XLNet/run_classifier.py index 8b5a0c80f90c0d957f4f543bf13538af37c6a0e1..795eae548cdb0137856b46ab8e43a4361098c9c8 100644 --- a/PaddleNLP/PaddleLARK/XLNet/run_classifier.py +++ b/PaddleNLP/PaddleLARK/XLNet/run_classifier.py @@ -41,6 +41,7 @@ from model.classifier import create_model from optimization import optimization from utils.args import ArgumentGroup, print_arguments, check_cuda from utils.init import init_pretraining_params, init_checkpoint +from utils.cards import get_cards num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) @@ -432,20 +433,16 @@ def main(args): if args.enable_ce: card_num = get_cards() ce_cost = 0 - ce_acc = 0 ce_time = 0 try: ce_cost = ce_info[-2][0] - ce_acc = ce_info[-2][1] - ce_time = ce_info[-2][2] + ce_time = ce_info[-2][1] except: print("ce info error") print("kpis\ttrain_duration_%s_card%s\t%s" % - (args.task_name, card_num, ce_time)) + (args.task_name.replace("-", "_"), card_num, ce_time)) print("kpis\ttrain_cost_%s_card%s\t%f" % - (args.task_name, card_num, ce_cost)) - print("kpis\ttrain_acc_%s_card%s\t%f" % - (args.task_name, card_num, ce_acc)) + (args.task_name.replace("-", "_"), card_num, ce_cost)) # final eval on dev set diff --git a/PaddleNLP/PaddleLARK/XLNet/utils/cards.py b/PaddleNLP/PaddleLARK/XLNet/utils/cards.py new file mode 100644 index 0000000000000000000000000000000000000000..70c58ee30da7f68f00d12af0b5dc1025dad42630 --- /dev/null +++ b/PaddleNLP/PaddleLARK/XLNet/utils/cards.py @@ -0,0 +1,26 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os + + +def get_cards(): + """ + get gpu cards number + """ + num = 0 + cards = os.environ.get('CUDA_VISIBLE_DEVICES', '') + if cards != '': + num = len(cards.split(",")) + return num diff --git a/PaddleNLP/PaddleMT/transformer/desc.py b/PaddleNLP/PaddleMT/transformer/desc.py index d6c34191cd5f182b17eaabbce29c811985e97703..f6fa768adc42ebcebe36eb60b52f7f6e366b3887 100644 --- a/PaddleNLP/PaddleMT/transformer/desc.py +++ b/PaddleNLP/PaddleMT/transformer/desc.py @@ -12,65 +12,73 @@ # See the License for the specific language governing permissions and # limitations under the License. -# The placeholder for batch_size in compile time. Must be -1 currently to be -# consistent with some ops' infer-shape output in compile time, such as the -# sequence_expand op used in beamsearch decoder. -batch_size = None -# The placeholder for squence length in compile time. -seq_len = None -# The placeholder for head number in compile time. -n_head = 8 -# The placeholder for model dim in compile time. -d_model = 512 -# Here list the data shapes and data types of all inputs. -# The shapes here act as placeholder and are set to pass the infer-shape in -# compile time. -input_descs = { - # The actual data shape of src_word is: - # [batch_size, max_src_len_in_batch] - "src_word": [(batch_size, seq_len), "int64", 2], - # The actual data shape of src_pos is: - # [batch_size, max_src_len_in_batch, 1] - "src_pos": [(batch_size, seq_len), "int64"], - # This input is used to remove attention weights on paddings in the - # encoder. - # The actual data shape of src_slf_attn_bias is: - # [batch_size, n_head, max_src_len_in_batch, max_src_len_in_batch] - "src_slf_attn_bias": [(batch_size, n_head, seq_len, seq_len), "float32"], - # The actual data shape of trg_word is: - # [batch_size, max_trg_len_in_batch, 1] - "trg_word": [(batch_size, seq_len), "int64", - 2], # lod_level is only used in fast decoder. - # The actual data shape of trg_pos is: - # [batch_size, max_trg_len_in_batch, 1] - "trg_pos": [(batch_size, seq_len), "int64"], - # This input is used to remove attention weights on paddings and - # subsequent words in the decoder. - # The actual data shape of trg_slf_attn_bias is: - # [batch_size, n_head, max_trg_len_in_batch, max_trg_len_in_batch] - "trg_slf_attn_bias": [(batch_size, n_head, seq_len, seq_len), "float32"], - # This input is used to remove attention weights on paddings of the source - # input in the encoder-decoder attention. - # The actual data shape of trg_src_attn_bias is: - # [batch_size, n_head, max_trg_len_in_batch, max_src_len_in_batch] - "trg_src_attn_bias": [(batch_size, n_head, seq_len, seq_len), "float32"], - # This input is used in independent decoder program for inference. - # The actual data shape of enc_output is: - # [batch_size, max_src_len_in_batch, d_model] - "enc_output": [(batch_size, seq_len, d_model), "float32"], - # The actual data shape of label_word is: - # [batch_size * max_trg_len_in_batch, 1] - "lbl_word": [(None, 1), "int64"], - # This input is used to mask out the loss of paddding tokens. - # The actual data shape of label_weight is: - # [batch_size * max_trg_len_in_batch, 1] - "lbl_weight": [(None, 1), "float32"], - # This input is used in beam-search decoder. - "init_score": [(batch_size, 1), "float32", 2], - # This input is used in beam-search decoder for the first gather - # (cell states updation) - "init_idx": [(batch_size, ), "int32"], -} +def get_input_descs(args): + """ + Generate a dict mapping data fields to the corresponding data shapes and + data types. 
+ """ + # The placeholder for batch_size in compile time. Must be -1 currently to be + # consistent with some ops' infer-shape output in compile time, such as the + # sequence_expand op used in beamsearch decoder. + batch_size = None + # The placeholder for squence length in compile time. + seq_len = None + # The head number. + n_head = getattr(args, "n_head", 8) + # The model dim. + d_model = getattr(args, "d_model", 512) + + # Here list the data shapes and data types of all inputs. + # The shapes here act as placeholder and are set to pass the infer-shape in + # compile time. + input_descs = { + # The actual data shape of src_word is: + # [batch_size, max_src_len_in_batch] + "src_word": [(batch_size, seq_len), "int64", 2], + # The actual data shape of src_pos is: + # [batch_size, max_src_len_in_batch, 1] + "src_pos": [(batch_size, seq_len), "int64"], + # This input is used to remove attention weights on paddings in the + # encoder. + # The actual data shape of src_slf_attn_bias is: + # [batch_size, n_head, max_src_len_in_batch, max_src_len_in_batch] + "src_slf_attn_bias": [(batch_size, n_head, seq_len, seq_len), "float32"], + # The actual data shape of trg_word is: + # [batch_size, max_trg_len_in_batch, 1] + "trg_word": [(batch_size, seq_len), "int64", + 2], # lod_level is only used in fast decoder. + # The actual data shape of trg_pos is: + # [batch_size, max_trg_len_in_batch, 1] + "trg_pos": [(batch_size, seq_len), "int64"], + # This input is used to remove attention weights on paddings and + # subsequent words in the decoder. + # The actual data shape of trg_slf_attn_bias is: + # [batch_size, n_head, max_trg_len_in_batch, max_trg_len_in_batch] + "trg_slf_attn_bias": [(batch_size, n_head, seq_len, seq_len), "float32"], + # This input is used to remove attention weights on paddings of the source + # input in the encoder-decoder attention. + # The actual data shape of trg_src_attn_bias is: + # [batch_size, n_head, max_trg_len_in_batch, max_src_len_in_batch] + "trg_src_attn_bias": [(batch_size, n_head, seq_len, seq_len), "float32"], + # This input is used in independent decoder program for inference. + # The actual data shape of enc_output is: + # [batch_size, max_src_len_in_batch, d_model] + "enc_output": [(batch_size, seq_len, d_model), "float32"], + # The actual data shape of label_word is: + # [batch_size * max_trg_len_in_batch, 1] + "lbl_word": [(None, 1), "int64"], + # This input is used to mask out the loss of paddding tokens. + # The actual data shape of label_weight is: + # [batch_size * max_trg_len_in_batch, 1] + "lbl_weight": [(None, 1), "float32"], + # This input is used in beam-search decoder. + "init_score": [(batch_size, 1), "float32", 2], + # This input is used in beam-search decoder for the first gather + # (cell states updation) + "init_idx": [(batch_size, ), "int32"], + } + + return input_descs # Names of word embedding table which might be reused for weight sharing. 
word_emb_param_names = ( diff --git a/PaddleNLP/PaddleMT/transformer/inference_model.py b/PaddleNLP/PaddleMT/transformer/inference_model.py index 40fc7edeb229d3eb1cfbf4f6c4911b3716291efa..5de0a107cd941c7d2eba42d1ec9922095bde5d8d 100644 --- a/PaddleNLP/PaddleMT/transformer/inference_model.py +++ b/PaddleNLP/PaddleMT/transformer/inference_model.py @@ -93,10 +93,11 @@ def do_save_inference_model(args): # define input and reader input_field_names = desc.encoder_data_input_fields + desc.fast_decoder_data_input_fields + input_descs = desc.get_input_descs(args.args) input_slots = [{ "name": name, - "shape": desc.input_descs[name][0], - "dtype": desc.input_descs[name][1] + "shape": input_descs[name][0], + "dtype": input_descs[name][1] } for name in input_field_names] input_field = InputField(input_slots) diff --git a/PaddleNLP/PaddleMT/transformer/predict.py b/PaddleNLP/PaddleMT/transformer/predict.py index 7ad847fd313ae688e04cea4373912280e220358a..2ad93e5838d6a87c1aa9deb8e35da7f071aec51d 100644 --- a/PaddleNLP/PaddleMT/transformer/predict.py +++ b/PaddleNLP/PaddleMT/transformer/predict.py @@ -134,10 +134,11 @@ def do_predict(args): # define input and reader input_field_names = desc.encoder_data_input_fields + desc.fast_decoder_data_input_fields + input_descs = desc.get_input_descs(args.args) input_slots = [{ "name": name, - "shape": desc.input_descs[name][0], - "dtype": desc.input_descs[name][1] + "shape": input_descs[name][0], + "dtype": input_descs[name][1] } for name in input_field_names] input_field = InputField(input_slots) diff --git a/PaddleNLP/PaddleMT/transformer/train.py b/PaddleNLP/PaddleMT/transformer/train.py index 48b4847f68e849b109133d2d413b03e456e9825c..c9fb5d7220c325477d6a0e5984f11e4e9b85f79a 100644 --- a/PaddleNLP/PaddleMT/transformer/train.py +++ b/PaddleNLP/PaddleMT/transformer/train.py @@ -21,6 +21,7 @@ import time import numpy as np import paddle import paddle.fluid as fluid +from paddle.fluid import profiler import utils.dist_utils as dist_utils from utils.input_field import InputField @@ -174,10 +175,11 @@ def do_train(args): input_field_names = desc.encoder_data_input_fields + \ desc.decoder_data_input_fields[:-1] + desc.label_data_input_fields + input_descs = desc.get_input_descs(args.args) input_slots = [{ "name": name, - "shape": desc.input_descs[name][0], - "dtype": desc.input_descs[name][1] + "shape": input_descs[name][0], + "dtype": input_descs[name][1] } for name in input_field_names] input_field = InputField(input_slots) @@ -250,12 +252,15 @@ def do_train(args): # start training step_idx = 0 + total_batch_num = 0 # this is for benchmark for pass_id in range(args.epoch): pass_start_time = time.time() input_field.loader.start() batch_id = 0 while True: + if args.max_iter and total_batch_num == args.max_iter: # this for benchmark + return try: outs = exe.run(compiled_train_prog, fetch_list=[sum_cost.name, token_num.name] @@ -299,6 +304,14 @@ def do_train(args): batch_id += 1 step_idx += 1 + total_batch_num = total_batch_num + 1 # this is for benchmark + + # profiler tools for benchmark + if args.is_profiler and pass_id == 0 and batch_id == args.print_step: + profiler.start_profiler("All") + elif args.is_profiler and pass_id == 0 and batch_id == args.print_step + 5: + profiler.stop_profiler("total", args.profiler_path) + return except fluid.core.EOFException: input_field.loader.reset() diff --git a/PaddleNLP/PaddleMT/transformer/utils/configure.py b/PaddleNLP/PaddleMT/transformer/utils/configure.py index 
2ea9fd96817f461889d24cbbd0c5d9ae76585a0a..67e601282fee572518435eaed38a4ed8e26fc5f9 100644 --- a/PaddleNLP/PaddleMT/transformer/utils/configure.py +++ b/PaddleNLP/PaddleMT/transformer/utils/configure.py @@ -198,6 +198,11 @@ class PDConfig(object): self.default_g.add_arg("do_save_inference_model", bool, False, "Whether to perform model saving for inference.") + # NOTE: args for profiler + self.default_g.add_arg("is_profiler", int, 0, "the switch of profiler tools. (used for benchmark)") + self.default_g.add_arg("profiler_path", str, './', "the profiler output file path. (used for benchmark)") + self.default_g.add_arg("max_iter", int, 0, "the max train batch num.(used for benchmark)") + self.parser = parser if json_file != "": diff --git a/PaddleNLP/PaddleTextGEN/seq2seq/args.py b/PaddleNLP/PaddleTextGEN/seq2seq/args.py index ee056e33597651f9e166e4d6399c89bfc36598f7..99f21b0800d9a2696e245fc807b393308a98e09a 100644 --- a/PaddleNLP/PaddleTextGEN/seq2seq/args.py +++ b/PaddleNLP/PaddleTextGEN/seq2seq/args.py @@ -122,6 +122,11 @@ def parse_args(): parser.add_argument( "--profile", action='store_true', help="Whether enable the profile.") - + # NOTE: profiler args, used for benchmark + parser.add_argument( + "--profiler_path", + type=str, + default='./seq2seq.profile', + help="the profiler output file path. (used for benchmark)") args = parser.parse_args() return args diff --git a/PaddleNLP/PaddleTextGEN/seq2seq/train.py b/PaddleNLP/PaddleTextGEN/seq2seq/train.py index 51d4d29eac141f35676fe92ef2713c63f86d4aae..e44d9a47692d4a527afc09486a02119d4037ea65 100644 --- a/PaddleNLP/PaddleTextGEN/seq2seq/train.py +++ b/PaddleNLP/PaddleTextGEN/seq2seq/train.py @@ -27,6 +27,7 @@ import contextlib import paddle import paddle.fluid as fluid +from paddle.fluid import profiler import paddle.fluid.framework as framework import paddle.fluid.profiler as profiler from paddle.fluid.executor import Executor @@ -46,9 +47,9 @@ import pickle @contextlib.contextmanager -def profile_context(profile=True): +def profile_context(profile=True, profiler_path='./seq2seq.profile'): if profile: - with profiler.profiler('All', 'total', 'seq2seq.profile'): + with profiler.profiler('All', 'total', profiler_path): yield else: yield @@ -213,6 +214,12 @@ def main(): ce_ppl.append(np.exp(total_loss / word_count)) total_loss = 0.0 word_count = 0.0 + + # profiler tools + if args.profile and epoch_id == 0 and batch_id == 100: + profiler.reset_profiler() + elif args.profile and epoch_id == 0 and batch_id == 105: + return end_time = time.time() epoch_time = end_time - start_time @@ -244,7 +251,7 @@ def main(): print("kpis\ttrain_duration_card%s\t%s" % (card_num, _time)) print("kpis\ttrain_ppl_card%s\t%f" % (card_num, _ppl)) - with profile_context(args.profile): + with profile_context(args.profile, args.profiler_path): train() diff --git a/PaddleNLP/PaddleTextGEN/variational_seq2seq/train.py b/PaddleNLP/PaddleTextGEN/variational_seq2seq/train.py index 08b9e1ae15b848e80d8a671893be6e8babc355f7..98515a8329fba8508b01accbbed940ef2df65842 100644 --- a/PaddleNLP/PaddleTextGEN/variational_seq2seq/train.py +++ b/PaddleNLP/PaddleTextGEN/variational_seq2seq/train.py @@ -203,8 +203,8 @@ def main(): word_count = 0.0 batch_count = 0.0 batch_times = [] - batch_start_time = time.time() for batch_id, batch in enumerate(train_data_iter): + batch_start_time = time.time() kl_w = min(1.0, kl_w + anneal_r) kl_weight = kl_w input_data_feed, src_word_num, dec_word_sum = prepare_input( @@ -280,6 +280,18 @@ def main(): print('\nbest testing nll: %.4f, best testing ppl 
%.4f\n' % (best_nll, best_ppl)) + if args.enable_ce: + card_num = get_cards() + _ppl = 0 + _time = 0 + try: + _time = ce_time[-1] + _ppl = ce_ppl[-1] + except: + print("ce info error") + print("kpis\ttrain_duration_card%s\t%s" % (card_num, _time)) + print("kpis\ttrain_ppl_card%s\t%f" % (card_num, _ppl)) + with profile_context(args.profile): train() diff --git a/PaddleNLP/README.md b/PaddleNLP/README.md index c8a9e4113f56634e5baf39632e994797b7e157c6..aa26b2ecfc27444bdb63f588250a2fb93e2081cf 100644 --- a/PaddleNLP/README.md +++ b/PaddleNLP/README.md @@ -10,7 +10,7 @@ - **丰富而全面的NLP任务支持:** - - PaddleNLP为您提供了多粒度,多场景的应用支持。涵盖了从[分词](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis),[词性标注](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis),[命名实体识别](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis)等NLP基础技术,到[文本分类](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/sentiment_classification),[文本相似度计算](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/similarity_net),[语义表示](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/language_representations_kit),[文本生成](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleTextGEN)等NLP核心技术。同时,PaddleNLP还提供了针对常见NLP大型应用系统(如[阅读理解](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleMRC),[对话系统](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleDialogue),[机器翻译系统](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleMT)等)的特定核心技术和工具组件,模型和预训练参数等,让您在NLP领域畅通无阻。 + - PaddleNLP为您提供了多粒度,多场景的应用支持。涵盖了从[分词](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis),[词性标注](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis),[命名实体识别](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis)等NLP基础技术,到[文本分类](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/sentiment_classification),[文本相似度计算](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/similarity_net),[语义表示](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleLARK),[文本生成](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleTextGEN)等NLP核心技术。同时,PaddleNLP还提供了针对常见NLP大型应用系统(如[阅读理解](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleMRC),[对话系统](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleDialogue),[机器翻译系统](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleMT)等)的特定核心技术和工具组件,模型和预训练参数等,让您在NLP领域畅通无阻。 - **稳定可靠的NLP模型和强大的预训练参数:** diff --git a/PaddleNLP/Research/Dialogue-PLATO/.gitignore b/PaddleNLP/Research/Dialogue-PLATO/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..38ce4278b6f53ca9c84b17a805fab69689273e86 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/.gitignore @@ -0,0 +1,123 @@ +# Byte-compiled / optimized / DLL files +__pycache__/ +*.py[cod] +*$py.class + +# C extensions +*.so + +# Distribution / packaging +.Python +build/ +develop-eggs/ +dist/ +downloads/ +eggs/ +.eggs/ +lib/ +lib64/ +parts/ +sdist/ +var/ +wheels/ +pip-wheel-metadata/ +share/python-wheels/ +*.egg-info/ +.installed.cfg +*.egg +MANIFEST + +# PyInstaller +# Usually these files are written by a python script from a template +# before PyInstaller builds the exe, so as to inject date/other infos into it. 
+*.manifest +*.spec + +# Installer logs +pip-log.txt +pip-delete-this-directory.txt + +# Unit test / coverage reports +htmlcov/ +.tox/ +.nox/ +.coverage +.coverage.* +.cache +nosetests.xml +coverage.xml +*.cover +.hypothesis/ +.pytest_cache/ + +# Translations +*.mo +*.pot + +# Django stuff: +*.log +local_settings.py +db.sqlite3 + +# Flask stuff: +instance/ +.webassets-cache + +# Scrapy stuff: +.scrapy + +# Sphinx documentation +docs/_build/ + +# PyBuilder +target/ + +# Jupyter Notebook +.ipynb_checkpoints + +# IPython +profile_default/ +ipython_config.py + +# pyenv +.python-version + +# pipenv +# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. +# However, in case of collaboration, if having platform-specific dependencies or dependencies +# having no cross-platform support, pipenv may install dependencies that don’t work, or not +# install all needed dependencies. +#Pipfile.lock + +# celery beat schedule file +celerybeat-schedule + +# SageMath parsed files +*.sage.py + +# Environments +.env +.venv +env/ +venv/ +ENV/ +env.bak/ +venv.bak/ + +# Spyder project settings +.spyderproject +.spyproject + +# Rope project settings +.ropeproject + +# mkdocs documentation +/site + +# mypy +.mypy_cache/ +.dmypy.json +dmypy.json + +# Pyre type checker +.pyre/ diff --git a/PaddleNLP/Research/Dialogue-PLATO/README.md b/PaddleNLP/Research/Dialogue-PLATO/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ed9b9a85c682253c0aae468d1f1bbbc90a5e449d --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/README.md @@ -0,0 +1,147 @@ +# PLATO +**PLATO: Pre-trained Dialogue Generation Model with Discrete Latent Variable** +[paper link](http://arxiv.org/abs/1910.07931) + +**\*\*\*\*\* Update \*\*\*\*\*** + +Nov. 14: Support new APIs in paddlepaddle 1.6.0 (model files in the link have been updated accordingly), multi-GPU training and decoding strategy of top-k sampling. Release our baseline model `PLATO w/o latent`. + +## Requirements +``` +- python >= 3.6 +- paddlepaddle >= 1.6.0 +- numpy +- nltk +- tqdm +- visualdl >= 1.3.0 (optional) +- regex +``` + +## Pre-trained dialogue generation model +A novel pre-training model for dialogue generation is introduced in this work, incorporated with latent discrete variables for one-to-many relationship modeling. Our model is flexible enough to support various kinds of conversations, including chit-chat, knowledge grounded dialogues, and conversational question answering. The pre-training is carried out with Reddit and Twitter corpora. You can download the uncased pre-trained model from: +* PLATO, uncased [model](https://baidu-nlp.bj.bcebos.com/PLATO/model.tar.gz): 12-layers, 768-hidden, 12-heads, 132M parameters +* PLATO w/o latent, uncased [model](https://baidu-nlp.bj.bcebos.com/PLATO/model-baseline.tar.gz): 12-layers 768-hidden, 12-heads, 109M parameters + +```bash +mv /path/to/model.tar.gz . +tar xzf model.tar.gz +``` + +## Fine-tuning +We also provide instructions to fine-tune PLATO on different conversation datasets (chit-chat, knowledge grounded dialogues and conversational question answering). + +### Data preparation +Download data from the [link](https://baidu-nlp.bj.bcebos.com/PLATO/data.tar.gz). +The tar file contains three processed datasets: `DailyDialog`, `PersonaChat` and `DSTC7_AVSD`. +```bash +mv /path/to/data.tar.gz . +tar xzf data.tar.gz +``` + +### Data format +Our model supports two kinds of data formats for dialogue context: `multi` and `multi_knowledge`. 
+* `multi`: multi-turn dialogue context. +```txt +u_1 __eou__ u_2 __eou__ ... u_n \t r +``` +* `multi_knowledge`: multi-turn dialogue context with background knowledges. +```txt +k_1 __eou__ k_2 __eou__ ... k_m \t u_1 __eou__ u_2 __eou__ ... u_n \t r +``` + +If you want to use this model on other datasets, you can process your data accordingly. + +### Train +Fine-tuning the pre-trained model on different `${DATASET}`. +```bash +# DailyDialog / PersonaChat / DSTC7_AVSD +DATASET=DailyDialog +sh scripts/${DATASET}/train.sh +``` +After training, you can find the output folder `outputs/${DATASET}` (by default). It contatins `best.model` (best results on validation dataset), `hparams.json` (hyper-parameters of training script) and `trainer.log` (training log). + + +Fine-tuning the pre-trained model on multiple GPUs. + +Note: You need to install NCCL library and set up the environment variable `LD_LIBRARY` properly. +```bash +sh scripts/DailyDialog/multi_gpu_train.sh +``` + +You can fine-tune PLATO w/o latent on different `${DATASET}`. We provide an example script on DailyDialog dataset. +```bash +sh scripts/DailyDialog/baseline_train.sh +``` + +#### Recommended settings + +For the fine-tuning of our pre-trained model, it usually requires about 10 epochs to reach convergence with learning rate = 1e-5 and about 2-3 epochs to reach convergence with learning rate = 5e-5. + +GPU Memory | batch size | max len +------|------|------ +16G | 6 | 256 +32G | 12 | 256 + +### Infer +Running inference on test dataset. +```bash +# DailyDialog / PersonaChat / DSTC7_AVSD +DATASET=DailyDialog +sh scripts/${DATASET}/infer.sh + +# Running inference of PLATO w/o latent +sh scripts/DailyDialog/baseline_infer.sh +``` +After inference, you can find the output foler `outputs/${DATASET}.infer` (by default). It contains `infer_0.result.json` (the inference result), `hparams.json` (hyper-parameters of inference scipt) and `trainer.log` (inference log). 
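If you want to adapt the model to another corpus, the raw text only needs to follow the `multi` (or `multi_knowledge`) layout described in the Data format section above: context utterances joined by ` __eou__ `, a tab, then the response, one example per line. Below is a minimal, illustrative Python sketch of producing such a file; the file name and sentences are made up, but the layout matches what the provided reader (`build_examples_multi_turn` in `plato/data/field.py`) expects.

```python
# Illustrative sketch only: write a toy corpus in the `multi` format.
# Each line is: u_1 __eou__ u_2 __eou__ ... u_n \t r
examples = [
    (["hi , how are you ?", "i am fine . and you ?"], "pretty good , thanks ."),
    (["do you like movies ?"], "yes , i watch one almost every week ."),
]

# The output file name is illustrative; point your training script at it as needed.
with open("my_dataset.train.txt", "w", encoding="utf-8") as fp:
    for context, response in examples:
        fp.write(" __eou__ ".join(context) + "\t" + response + "\n")
```

For the `multi_knowledge` layout, an extra knowledge field (knowledge pieces joined by ` __eou__ `) would be prepended before the context, separated by another tab, as shown in the Data format section.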
+ +If you want to use top-k sampling (beam search by default), you can follow the example script: +```bash +sh scripts/DailyDialog/topk_infer.sh +``` + +## Result + +### DailyDialog +Model | BLEU-1/2 | Distinct-1/2 | Fluency | Coherence | Informativeness | Overall +------|------|------|------|------|------|------- +Seq2Seq | 0.336/0.268 | 0.030/0.128 | 1.85 | 0.37 | 0.44 | 0.33 +iVAE_MI | 0.309/0.249 | 0.029/0.250 | 1.53 | 0.34 | 0.59 | 0.30 +Our w/o Latent | **0.405/0.322** | 0.046/0.246 | 1.91 | **1.58** | 1.03 | 1.44 +Our Method | 0.397/0.311 | **0.053/0.291** | **1.97** | 1.57 | **1.23** | **1.48** + +### PersonaChat +Model | BLEU-1/2 | Distinct-1/2 | Knowledge R/P/F1 | Fluency | Coherence | Informativeness | Overall +------|------|------|------|------|------|-------|------- +Seq2Seq | 0.448/0.353 | 0.004/0.016 | 0.004/0.016/0.006 | 1.82 | 0.37 | 0.85 | 0.34 +LIC | 0.405/0.320 | 0.019/0.113 | 0.042/0.154/0.064 | 1.95 | 1.34 | 1.09 | 1.29 +Our w/o Latent | **0.458/0.357** | 0.012/0.064 | 0.085/0.263/0.125 | 1.98 | 1.36 | 1.04 | 1.30 +Our Method | 0.406/0.315 | **0.021/0.121** | **0.142/0.461/0.211** | **1.99** | **1.51** | **1.70** | **1.50** + +### DSTC7_AVSD +Model | BELU-1 | BELU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGH-L | CIDEr +------|------|------|------|------|------|-------|------- +Baseline | 0.629 | 0.485 | 0.383 | 0.309 | 0.215 | 0.487 | 0.746 +CMU | 0.718 | 0.584 | 0.478 | 0.394 | 0.267 | 0.563 | 1.094 +Our Method | **0.784** | **0.637** | **0.525** | **0.435** | **0.286** | **0.596** | **1.209** +Our Method Upper Bound | 0.925 | 0.843 | 0.767 | 0.689 | 0.361 | 0.731 | 1.716 + +Note: In the experiments on `DSTC7_AVSD`, the response selection of our method is strengthened with an extra ranking step, which ranks the candidates according to the automatic scores and selects the top one as the final answer. + +## Citation +If you find PLATO useful in your work, please cite the following Arxiv paper: +``` +@article{bao2019plato, + title={PLATO: Pre-trained Dialogue Generation Model with Discrete Latent Variable}, + author={Bao, Siqi and He, Huang and Wang, Fan and Wu, Hua and Wang, Haifeng}, + journal={arXiv preprint arXiv:1910.07931}, + year={2019} +} +``` + +## Disclaimer +This project aims to facilitate further research progress in dialogue generation. Baidu is not responsible for the 3rd party's generation with the pre-trained system. + +## Contact information +For help or issues using PLATO, please submit a GitHub issue. + +For personal communication related to PLATO, please contact Siqi Bao (`baosiqi@baidu.com`), or Huang He (`hehuang@baidu.com`). diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/args.py b/PaddleNLP/Research/Dialogue-PLATO/plato/args.py new file mode 100644 index 0000000000000000000000000000000000000000..f5fc9e5813fd2311d38cf8ad243d2c4d29fb2f33 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/args.py @@ -0,0 +1,79 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Parse argument. 
+""" + +import argparse +import json + + +def str2bool(v): + if v.lower() in ('yes', 'true', 't', 'y', '1'): + return True + elif v.lower() in ('no', 'false', 'f', 'n', '0'): + return False + else: + raise argparse.ArgumentTypeError('Unsupported value encountered.') + + +class HParams(dict): + """ Hyper-parameters class + + Store hyper-parameters in training / infer / ... scripts. + """ + + def __getattr__(self, name): + if name in self.keys(): + return self[name] + for v in self.values(): + if isinstance(v, HParams): + if name in v: + return v[name] + raise AttributeError(f"'HParams' object has no attribute '{name}'") + + def __setattr__(self, name, value): + self[name] = value + + def save(self, filename): + with open(filename, "w", encoding="utf-8") as fp: + json.dump(self, fp, ensure_ascii=False, + indent=4, sort_keys=False) + + def load(self, filename): + with open(filename, "r", encoding="utf-8") as fp: + params_dict = json.load(fp) + for k, v in params_dict.items(): + if isinstance(v, dict): + self[k].update(HParams(v)) + else: + self[k] = v + + +def parse_args(parser): + """ Parse hyper-parameters from cmdline. """ + parsed = parser.parse_args() + args = HParams() + optional_args = parser._action_groups[1] + for action in optional_args._group_actions[1:]: + arg_name = action.dest + args[arg_name] = getattr(parsed, arg_name) + for group in parser._action_groups[2:]: + group_args = HParams() + for action in group._group_actions: + arg_name = action.dest + group_args[arg_name] = getattr(parsed, arg_name) + if len(group_args) > 0: + args[group.title] = group_args + return args diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/data/data_loader.py b/PaddleNLP/Research/Dialogue-PLATO/plato/data/data_loader.py new file mode 100644 index 0000000000000000000000000000000000000000..8cd9e20a7d00155cd201061201954ce1f9221cad --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/data/data_loader.py @@ -0,0 +1,72 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +DataLoader class +""" + +import math + +import paddle.fluid as fluid +import paddle.batch + +from plato.args import str2bool +from plato.data.sampler import RandomSampler +from plato.data.sampler import SequentialSampler +from plato.data.sampler import SortedSampler +import plato.modules.parallel as parallel + + +class DataLoader(object): + """ Implement of DataLoader. 
""" + + @classmethod + def add_cmdline_argument(cls, group): + group.add_argument("--shuffle", type=str2bool, default=True) + group.add_argument("--sort_pool_size", type=int, default=0) + return group + + def __init__(self, dataset, hparams, collate_fn=None, sampler=None, is_test=False, is_train=False): + self.dataset = dataset + self.collate_fn = collate_fn + self.sort_pool_size = hparams.sort_pool_size + + if sampler is None: + if hparams.shuffle and not is_test: + sampler = RandomSampler(dataset) + else: + sampler = SequentialSampler(dataset) + + if self.sort_pool_size > 0 and not is_test: + sampler = SortedSampler(sampler, self.sort_pool_size) + + def reader(): + for idx in sampler: + yield idx + + self.reader = paddle.batch(reader, batch_size=hparams.batch_size, drop_last=False) + self.num_batches = math.ceil(len(dataset) / hparams.batch_size) + + if hparams.use_data_distributed and parallel.Env().nranks > 1 and is_train: + self.reader = fluid.contrib.reader.distributed_batch_reader(self.reader) + self.num_batches = self.num_batches // fluid.dygraph.parallel.Env().nranks + + return + + def __len__(self): + return self.num_batches + + def __iter__(self): + for batch_indices in self.reader(): + samples = [self.dataset[idx] for idx in batch_indices] + yield self.collate_fn(samples) diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/data/dataset.py b/PaddleNLP/Research/Dialogue-PLATO/plato/data/dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..1952ef1f1f01300305acedc5ccb0eef72b5ecbf8 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/data/dataset.py @@ -0,0 +1,77 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Dataset class +""" + +import json + + +class Dataset(object): + """ Basic Dataset interface class. """ + + @classmethod + def add_cmdline_argument(cls, parser): + group = parser.add_argument_group("Dataset") + group.add_argument("--data_dir", type=str, required=True, + help="The dataset dir.") + group.add_argument("--data_type", type=str, required=True, + choices=["multi", "multi_knowledge"], + help="The type of dataset.") + return group + + def __init__(self, data): + self.data = data + + def __len__(self): + return len(self.data) + + def __getitem__(self, idx): + return self.data[idx] + + +class LazyDataset(Dataset): + """ + Lazy load dataset from disk. + + Each line of data file is a preprocessed example. + """ + + def __init__(self, data_file, transform=lambda s: json.loads(s)): + """ + Initialize lazy dataset. + + By default, loading .jsonl format. 
+ + :param data_file + :type str + + :param transform + :type callable + """ + self.data_file = data_file + self.transform = transform + self.offsets = [0] + with open(data_file, "r", encoding="utf-8") as fp: + while fp.readline() != "": + self.offsets.append(fp.tell()) + self.offsets.pop() + self.fp = open(data_file, "r", encoding="utf-8") + + def __len__(self): + return len(self.offsets) + + def __getitem__(self, idx): + self.fp.seek(self.offsets[idx], 0) + return self.transform(self.fp.readline().strip()) diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/data/field.py b/PaddleNLP/Research/Dialogue-PLATO/plato/data/field.py new file mode 100644 index 0000000000000000000000000000000000000000..a5ca7312329d86a7ef45bd9a6ead0c154a49fba3 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/data/field.py @@ -0,0 +1,397 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Field class +""" + +from itertools import chain +import json +import numpy as np +import pickle +import time +from tqdm import tqdm + +from plato.args import str2bool +from plato.data.tokenizer import Tokenizer + + +def max_lens(X): + lens = [len(X)] + while isinstance(X[0], list): + lens.append(max(map(len, X))) + X = [x for xs in X for x in xs] + return lens + + +def list2np(X, padding=0, dtype="int64"): + shape = max_lens(X) + ret = np.full(shape, padding, dtype=np.int32) + + if len(shape) == 1: + ret = np.array(X) + elif len(shape) == 2: + for i, x in enumerate(X): + ret[i, :len(x)] = np.array(x) + elif len(shape) == 3: + for i, xs in enumerate(X): + for j, x in enumerate(xs): + ret[i, j, :len(x)] = np.array(x) + return ret.astype(dtype) + +class BPETextField(object): + + pad_token = "[PAD]" + bos_token = "[BOS]" + eos_token = "[EOS]" + unk_token = "[UNK]" + + @classmethod + def add_cmdline_argument(cls, parser): + group = parser.add_argument_group("BPETextField") + group.add_argument("--vocab_path", type=str, required=True, + help="The vocabulary file path.") + group.add_argument("--filtered", type=str2bool, default=False, + help="Whether to filter the data with too long utterance/context. 
" + "If the data is unfiltered, it will be truncated.") + group.add_argument("--max_len", type=int, default=256, + help="The maximum length of context or knowledges.") + group.add_argument("--min_utt_len", type=int, default=1, + help="The minimum length of utterance.") + group.add_argument("--max_utt_len", type=int, default=50, + help="The maximum length of utterance.") + group.add_argument("--min_ctx_turn", type=int, default=1, + help="The minimum turn of context.") + group.add_argument("--max_ctx_turn", type=int, default=16, + help="The maximum turn of context.") + group.add_argument("--max_knowledge_num", type=int, default=16, + help="The maximum number of knowledges.") + group.add_argument("--max_knowledge_len", type=int, default=16, + help="The maximum length of each knowledges.") + group.add_argument("--tokenizer_type", type=str, default="Bert", + choices=["Bert", "GPT2"], + help="The type of tokenizer.") + return group + + def __init__(self, hparams): + special_tokens = [self.pad_token, self.bos_token, self.eos_token, self.unk_token] + self.tokenizer = Tokenizer(vocab_path=hparams.vocab_path, + special_tokens=special_tokens, + tokenizer_type=hparams.tokenizer_type) + + self.filtered = hparams.filtered + self.max_len = hparams.max_len + self.min_utt_len = hparams.min_utt_len + self.max_utt_len = hparams.max_utt_len + self.min_ctx_turn = hparams.min_ctx_turn + self.max_ctx_turn = hparams.max_ctx_turn - 1 # subtract reply turn + self.max_knowledge_num = hparams.max_knowledge_num + self.max_knowledge_len = hparams.max_knowledge_len + return + + @property + def vocab_size(self): + return self.tokenizer.vocab_size + + @property + def num_specials(self): + return len(self.special_tokens) + + @property + def pad_id(self): + return self.tokenizer.convert_tokens_to_ids([self.pad_token])[0] + + @property + def bos_id(self): + return self.tokenizer.convert_tokens_to_ids([self.bos_token])[0] + + @property + def eos_id(self): + return self.tokenizer.convert_tokens_to_ids([self.eos_token])[0] + + @property + def unk_id(self): + return self.tokenizer.convert_tokens_to_ids([self.unk_token])[0] + + @property + def bot_id(self): + return 0 + + @property + def user_id(self): + return 1 + + @property + def knowledge_id(self): + return 2 + + def numericalize(self, tokens): + assert isinstance(tokens, list) + if len(tokens) == 0: + return [] + element = tokens[0] + if isinstance(element, list): + return [self.numericalize(s) for s in tokens] + else: + return self.tokenizer.convert_tokens_to_ids(tokens) + + def denumericalize(self, numbers): + assert isinstance(numbers, list) + if len(numbers) == 0: + return [] + element = numbers[0] + if isinstance(element, list): + return [self.denumericalize(x) for x in numbers] + else: + return self.tokenizer.decode( + numbers, ignore_tokens=[self.bos_token, self.eos_token, self.pad_token]) + + def save_examples(self, examples, filename): + print(f"Saving examples to '{filename}' ...") + start = time.time() + if filename.endswith("pkl"): + with open(filename, "wb") as fp: + pickle.dump(examples, fp) + elif filename.endswith("jsonl"): + with open(filename, "w", encoding="utf-8") as fp: + for ex in examples: + fp.write(json.dumps(ex) + "\n") + else: + raise ValueError(f"Unsport file format: {filename}") + elapsed = time.time() - start + print(f"Saved {len(examples)} examples (elapsed {elapsed:.2f}s)") + + def load_examples(self, filename): + print(f"Loading examples from '{filename}' ...") + start = time.time() + if filename.endswith("pkl"): + with open(filename, "rb") 
as fp: + examples = pickle.load(fp) + else: + with open(filename, "r", encoding="utf-8") as fp: + examples = list(map(lambda s: json.loads(s.strip()), fp)) + elapsed = time.time() - start + print(f"Loaded {len(examples)} examples (elapsed {elapsed:.2f}s)") + return examples + + def utt_filter_pred(self, utt): + return self.min_utt_len <= len(utt) \ + and (not self.filtered or len(utt) <= self.max_utt_len) + + def utts_filter_pred(self, utts): + return self.min_ctx_turn <= len(utts) \ + and (not self.filtered or len(utts) <= self.max_ctx_turn) + + def build_example_multi_turn(self, req): + examples = [] + src = [self.tokenizer.tokenize(s) for s in req["context"]] + src = [s[-self.max_utt_len:] for s in src[-self.max_ctx_turn:]] + src = [self.numericalize(s) + [self.eos_id] for s in src] + ex = {"src": src} + examples.append(ex) + return examples + + def build_example_multi_turn_with_knowledge(self, req): + examples = [] + src = [self.tokenizer.tokenize(s) for s in req["context"]] + src = [s[-self.max_utt_len:] for s in src[-self.max_ctx_turn:]] + src = [self.numericalize(s) + [self.eos_id] for s in src] + knowledge = [self.tokenizer.tokenize(k) for k in req["knowledge"]] + knowledge = [k[:self.max_knowledge_len] for k in knowledge] + knowledge = [self.numericalize(k) + [self.eos_id] for k in knowledge] + ex = {"src": src, "knowledge": knowledge} + examples.append(ex) + return examples + + def build_examples_multi_turn(self, data_file, data_type="train"): + print(f"Reading examples from '{data_file}' ...") + examples = [] + ignored = 0 + + with open(data_file, "r", encoding="utf-8") as f: + for line in tqdm(f, total=None): + src, tgt = line.strip("\n").split("\t") + tgt = self.tokenizer.tokenize(tgt) + src = [self.tokenizer.tokenize(s) for s in src.split(" __eou__ ")] + + if (self.utts_filter_pred(src) and all(map(self.utt_filter_pred, src)) + and self.utt_filter_pred(tgt)) or data_type == "test": + src = [s[-self.max_utt_len:] for s in src[-self.max_ctx_turn:]] + src = [self.numericalize(s) + [self.eos_id] for s in src] + tgt = [self.bos_id] + self.numericalize(tgt) + [self.eos_id] + if data_type != "test": + tgt = tgt[:self.max_utt_len + 2] + ex = {"src": src, "tgt": tgt} + examples.append(ex) + else: + ignored += 1 + print(f"Built {len(examples)} {data_type.upper()} examples ({ignored} filtered)") + return examples + + def build_examples_multi_turn_with_knowledge(self, data_file, data_type="train"): + print(f"Reading examples from '{data_file}' ...") + examples = [] + ignored = 0 + + with open(data_file, "r", encoding="utf-8") as f: + for line in tqdm(f, total=None): + knowledge, src, tgt = line.strip("\n").split("\t") + tgt = self.tokenizer.tokenize(tgt) + knowledge = [self.tokenizer.tokenize(k) for k in knowledge.split(" __eou__ ")] + knowledge = [k[:self.max_knowledge_len] + for k in knowledge[-self.max_knowledge_num:]] + src = [self.tokenizer.tokenize(s) for s in src.split(" __eou__ ")] + + if (self.utts_filter_pred(src) and all(map(self.utt_filter_pred, src)) + and self.utt_filter_pred(tgt)) or data_type == "test": + src = [s[-self.max_utt_len:] for s in src[-self.max_ctx_turn:]] + src = [self.numericalize(s) + [self.eos_id] for s in src] + knowledge = [self.numericalize(k) + [self.eos_id] for k in knowledge] + tgt = [self.bos_id] + self.numericalize(tgt) + [self.eos_id] + if data_type != "test": + tgt = tgt[:self.max_utt_len + 2] + ex = {"src": src, "knowledge": knowledge, "tgt": tgt} + examples.append(ex) + else: + ignored += 1 + print(f"Built {len(examples)} {data_type.upper()} 
examples ({ignored} filtered)") + return examples + + def collate_fn_multi_turn(self, samples): + batch_size = len(samples) + + src = [sp["src"] for sp in samples] + + src_token, src_pos, src_turn, src_role = [], [], [], [] + for utts in src: + utt_lens = [len(utt) for utt in utts] + + # Token ids + src_token.append(list(chain(*utts))[-self.max_len:]) + + # Position ids + pos = [list(range(l)) for l in utt_lens] + src_pos.append(list(chain(*pos))[-self.max_len:]) + + # Turn ids + turn = [[len(utts) - i] * l for i, l in enumerate(utt_lens)] + src_turn.append(list(chain(*turn))[-self.max_len:]) + + # Role ids + role = [[self.bot_id if (len(utts) - i) % 2 == 0 else self.user_id] * l + for i, l in enumerate(utt_lens)] + src_role.append(list(chain(*role))[-self.max_len:]) + + src_token = list2np(src_token, padding=self.pad_id) + src_pos = list2np(src_pos, padding=self.pad_id) + src_turn = list2np(src_turn, padding=self.pad_id) + src_role = list2np(src_role, padding=self.pad_id) + + batch = {} + batch["src_token"] = src_token + batch["src_mask"] = (src_token != self.pad_id).astype("int64") + batch["src_pos"] = src_pos + batch["src_type"] = src_role + batch["src_turn"] = src_turn + + if "tgt" in samples[0]: + tgt = [sp["tgt"] for sp in samples] + + # Token ids & Label ids + tgt_token = list2np(tgt, padding=self.pad_id) + + # Position ids + tgt_pos = np.zeros_like(tgt_token) + tgt_pos[:] = np.arange(tgt_token.shape[1], dtype=tgt_token.dtype) + + # Turn ids + tgt_turn = np.zeros_like(tgt_token) + + # Role ids + tgt_role = np.full_like(tgt_token, self.bot_id) + + batch["tgt_token"] = tgt_token + batch["tgt_mask"] = (tgt_token != self.pad_id).astype("int64") + batch["tgt_pos"] = tgt_pos + batch["tgt_type"] = tgt_role + batch["tgt_turn"] = tgt_turn + + return batch, batch_size + + def collate_fn_multi_turn_with_knowledge(self, samples): + batch_size = len(samples) + + src = [sp["src"] for sp in samples] + knowledge = [sp["knowledge"] for sp in samples] + + src_token, src_pos, src_turn, src_role = [], [], [], [] + for utts, ks in zip(src, knowledge): + utt_lens = [len(utt) for utt in utts] + k_lens = [len(k) for k in ks] + + # Token ids + token = list(chain(*utts))[-self.max_len:] + token.extend(list(chain(*ks))[-self.max_len:]) + src_token.append(token) + + # Position ids + pos = list(chain(*[list(range(l)) for l in utt_lens]))[-self.max_len:] + pos.extend(list(chain(*[list(range(l)) for l in k_lens]))[-self.max_len:]) + src_pos.append(pos) + + # Turn ids + turn = list(chain(*[[len(utts) - i] * l for i, l in enumerate(utt_lens)]))[-self.max_len:] + turn.extend(list(chain(*[[i] * l for i, l in enumerate(k_lens)]))[-self.max_len:]) + src_turn.append(turn) + + # Role ids + role = list(chain(*[[self.bot_id if (len(utts)-i) % 2 == 0 else self.user_id] * l + for i, l in enumerate(utt_lens)]))[-self.max_len:] + role.extend(list(chain(*[[self.knowledge_id] * l for l in k_lens]))[-self.max_len:]) + src_role.append(role) + + src_token = list2np(src_token, padding=self.pad_id) + src_pos = list2np(src_pos, padding=self.pad_id) + src_turn = list2np(src_turn, padding=self.pad_id) + src_role = list2np(src_role, padding=self.pad_id) + + batch = {} + batch["src_token"] = src_token + batch["src_mask"] = (src_token != self.pad_id).astype("int64") + batch["src_pos"] = src_pos + batch["src_type"] = src_role + batch["src_turn"] = src_turn + + if "tgt" in samples[0]: + tgt = [sp["tgt"] for sp in samples] + + # Token ids & Label ids + tgt_token = list2np(tgt, padding=self.pad_id) + + # Position ids + tgt_pos = 
np.zeros_like(tgt_token) + tgt_pos[:] = np.arange(tgt_token.shape[1], dtype=tgt_token.dtype) + + # Turn ids + tgt_turn = np.zeros_like(tgt_token) + + # Role ids + tgt_role = np.full_like(tgt_token, self.bot_id) + + batch["tgt_token"] = tgt_token + batch["tgt_mask"] = (tgt_token != self.pad_id).astype("int64") + batch["tgt_pos"] = tgt_pos + batch["tgt_type"] = tgt_role + batch["tgt_turn"] = tgt_turn + + return batch, batch_size diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/data/sampler.py b/PaddleNLP/Research/Dialogue-PLATO/plato/data/sampler.py new file mode 100644 index 0000000000000000000000000000000000000000..f807ed107f1223b850e9255e534f0e8e94d9e350 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/data/sampler.py @@ -0,0 +1,89 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Sampler class. +""" + +import numpy as np + + +class Sampler(object): + + def __init__(self): + return + + def __len__(self): + raise NotImplementedError + + def __iter__(self): + raise NotImplementedError + + +class SequentialSampler(Sampler): + + def __init__(self, dataset): + self.dataset = dataset + return + + def __len__(self): + return len(self.dataset) + + def __iter__(self): + return iter(range(len(self))) + + +class RandomSampler(Sampler): + + def __init__(self, dataset): + self.dataset = dataset + self.epoch = 0 + return + + def __len__(self): + return len(self.dataset) + + def __iter__(self): + np.random.seed(self.epoch) + self.epoch += 1 + return iter(np.random.permutation(len(self))) + + +class SortedSampler(Sampler): + """ Sorted Sampler. + + Sort each block of examples by key. + """ + + def __init__(self, sampler, sort_pool_size, key="src"): + self.sampler = sampler + self.sort_pool_size = sort_pool_size + self.key = lambda idx: len(self.sampler.dataset[idx][key]) + return + + def __len__(self): + return len(self.sampler) + + def __iter__(self): + pool = [] + for idx in self.sampler: + pool.append(idx) + if len(pool) == self.sort_pool_size: + pool = sorted(pool, key=self.key) + for i in pool: + yield i + pool = [] + if len(pool) > 0: + pool = sorted(pool, key=self.key) + for i in pool: + yield i diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/data/tokenizer.py b/PaddleNLP/Research/Dialogue-PLATO/plato/data/tokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..7c523eb12feb144fa89ffb309e83e429796a0fa1 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/data/tokenizer.py @@ -0,0 +1,628 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Tokenizer class. +""" + +from __future__ import absolute_import, division, print_function, unicode_literals + +import collections +import json +import logging +import os +import regex as re +import sys +import unicodedata + + +def clean_string(string): + replace_mp = { + " - ": "-", + " ' ": "'", + " n't": "n't", + " 'm": "'m", + " do not": " don't", + " 's": "'s", + " 've": "'ve", + " 're": "'re" + } + for k, v in replace_mp.items(): + string = string.replace(k, v) + return string + + +class Tokenizer(object): + + def __init__(self, vocab_path, special_tokens=[], tokenizer_type="Bert"): + self.tokenizer_type = tokenizer_type + if tokenizer_type == "Bert": + self.spec_convert_dict = {"[BOS]": "[unused0]", "[EOS]": "[unused1]"} + self.spec_revert_dict = {v: k for k, + v in self.spec_convert_dict.items()} + special_tokens = [self.spec_convert_dict.get(tok, tok) + for tok in special_tokens] + self.special_tokens = ("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]") + self.special_tokens += tuple(x for x in special_tokens if x not in self.special_tokens) + + self._tokenizer = BertTokenizer(vocab_path, never_split=self.special_tokens) + for tok in self.special_tokens: + assert tok in self._tokenizer.vocab, f"special token '{tok}' is not in the vocabulary" + self.vocab_size = len(self._tokenizer.vocab) + elif tokenizer_type == "GPT2": + self.spec_convert_dict = {"[UNK]": ""} + self.spec_revert_dict = {v: k for k, + v in self.spec_convert_dict.items()} + special_tokens = [tok for tok in special_tokens + if tok not in self.spec_convert_dict] + vocab_file = os.path.join(vocab_path, "vocab.json") + merges_file = os.path.join(vocab_path, "merges.txt") + self._tokenizer = GPT2Tokenizer(vocab_file, merges_file, special_tokens=special_tokens) + self.num_specials = len(special_tokens) + self.vocab_size = len(self._tokenizer) + else: + raise ValueError + + def tokenize(self, text): + return self._tokenizer.tokenize(text) + + def convert_tokens_to_ids(self, tokens): + if self.tokenizer_type == "Bert": + tokens = [self.spec_convert_dict.get(tok, tok) for tok in tokens] + ids = self._tokenizer.convert_tokens_to_ids(tokens) + return ids + else: + tokens = [self.spec_convert_dict.get(tok, tok) for tok in tokens] + ids = self._tokenizer.convert_tokens_to_ids(tokens) + ids = [(i + self.num_specials) % self.vocab_size for i in ids] + return ids + + def convert_ids_to_tokens(self, ids): + if self.tokenizer_type == "Bert": + tokens = self._tokenizer.convert_ids_to_tokens(ids) + tokens = [self.spec_revert_dict.get(tok, tok) for tok in tokens] + return tokens + else: + ids = [(i - self.num_specials) % self.vocab_size for i in ids] + tokens = self._tokenizer.convert_ids_to_tokens(ids) + tokens = [self.spec_revert_dict.get(tok, tok) for tok in tokens] + return tokens + + def decode(self, ids, ignore_tokens=[]): + tokens = self.convert_ids_to_tokens(ids) + if len(ignore_tokens) > 0: + ignore_tokens = set(ignore_tokens) + tokens = [tok for tok in tokens if tok not in ignore_tokens] + if self.tokenizer_type == "Bert": + string = " ".join(tokens).replace(" ##", "") + else: + string = "".join(tokens) + string = 
bytearray([self._tokenizer.byte_decoder[c] + for c in string]).decode("utf-8") + string = clean_string(string) + return string + +# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Tokenization classes.""" + + +logger = logging.getLogger(__name__) + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = collections.OrderedDict() + index = 0 + with open(vocab_file, "r", encoding="utf-8") as reader: + while True: + token = reader.readline() + if not token: + break + token = token.strip() + vocab[token] = index + index += 1 + return vocab + + +def whitespace_tokenize(text): + """Runs basic whitespace cleaning and splitting on a piece of text.""" + text = text.strip() + if not text: + return [] + tokens = text.split() + return tokens + + +class BertTokenizer(object): + """Runs end-to-end tokenization: punctuation splitting + wordpiece""" + + def __init__(self, vocab_file, do_lower_case=True, max_len=None, do_basic_tokenize=True, + never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")): + """Constructs a BertTokenizer. + + Args: + vocab_file: Path to a one-wordpiece-per-line vocabulary file + do_lower_case: Whether to lower case the input + Only has an effect when do_wordpiece_only=False + do_basic_tokenize: Whether to do basic tokenization before wordpiece. + max_len: An artificial maximum length to truncate tokenized sequences to; + Effective maximum length is always the minimum of this + value (if specified) and the underlying BERT model's + sequence length. + never_split: List of tokens which will never be split during tokenization. + Only has an effect when do_wordpiece_only=False + """ + if not os.path.isfile(vocab_file): + raise ValueError( + "Can't find a vocabulary file at path '{}'. 
To load the vocabulary from a Google pretrained " + "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file)) + self.vocab = load_vocab(vocab_file) + self.ids_to_tokens = collections.OrderedDict( + [(ids, tok) for tok, ids in self.vocab.items()]) + self.do_basic_tokenize = do_basic_tokenize + if do_basic_tokenize: + self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case, + never_split=never_split) + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) + self.max_len = max_len if max_len is not None else int(1e12) + + def tokenize(self, text): + split_tokens = [] + if self.do_basic_tokenize: + for token in self.basic_tokenizer.tokenize(text): + for sub_token in self.wordpiece_tokenizer.tokenize(token): + split_tokens.append(sub_token) + else: + split_tokens = self.wordpiece_tokenizer.tokenize(text) + return split_tokens + + def convert_tokens_to_ids(self, tokens): + """Converts a sequence of tokens into ids using the vocab.""" + ids = [] + for token in tokens: + ids.append(self.vocab[token]) + if len(ids) > self.max_len: + logger.warning( + "Token indices sequence length is longer than the specified maximum " + " sequence length for this BERT model ({} > {}). Running this" + " sequence through BERT will result in indexing errors".format(len(ids), self.max_len) + ) + return ids + + def convert_ids_to_tokens(self, ids): + """Converts a sequence of ids in wordpiece tokens using the vocab.""" + tokens = [] + for i in ids: + tokens.append(self.ids_to_tokens[i]) + return tokens + + +class BasicTokenizer(object): + """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" + + def __init__(self, + do_lower_case=True, + never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")): + """Constructs a BasicTokenizer. + + Args: + do_lower_case: Whether to lower case the input. + """ + self.do_lower_case = do_lower_case + self.never_split = never_split + + def tokenize(self, text): + """Tokenizes a piece of text.""" + text = self._clean_text(text) + # This was added on November 1st, 2018 for the multilingual and Chinese + # models. This is also applied to the English models now, but it doesn't + # matter since the English models were not trained on any Chinese data + # and generally don't have any Chinese data in them (there are Chinese + # characters in the vocabulary because Wikipedia does have some Chinese + # words in the English Wikipedia.). 
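+        # For example, "深度learning" becomes " 深  度 learning" after the call
+        # below, so each CJK character ends up as its own token once the text is
+        # whitespace-tokenized again.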
+ text = self._tokenize_chinese_chars(text) + orig_tokens = whitespace_tokenize(text) + split_tokens = [] + for token in orig_tokens: + if self.do_lower_case and token not in self.never_split: + token = token.lower() + token = self._run_strip_accents(token) + split_tokens.extend(self._run_split_on_punc(token)) + + output_tokens = whitespace_tokenize(" ".join(split_tokens)) + return output_tokens + + def _run_strip_accents(self, text): + """Strips accents from a piece of text.""" + text = unicodedata.normalize("NFD", text) + output = [] + for char in text: + cat = unicodedata.category(char) + if cat == "Mn": + continue + output.append(char) + return "".join(output) + + def _run_split_on_punc(self, text): + """Splits punctuation on a piece of text.""" + if text in self.never_split: + return [text] + chars = list(text) + i = 0 + start_new_word = True + output = [] + while i < len(chars): + char = chars[i] + if _is_punctuation(char): + output.append([char]) + start_new_word = True + else: + if start_new_word: + output.append([]) + start_new_word = False + output[-1].append(char) + i += 1 + + return ["".join(x) for x in output] + + def _tokenize_chinese_chars(self, text): + """Adds whitespace around any CJK character.""" + output = [] + for char in text: + cp = ord(char) + if self._is_chinese_char(cp): + output.append(" ") + output.append(char) + output.append(" ") + else: + output.append(char) + return "".join(output) + + def _is_chinese_char(self, cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. + if ((cp >= 0x4E00 and cp <= 0x9FFF) or # + (cp >= 0x3400 and cp <= 0x4DBF) or # + (cp >= 0x20000 and cp <= 0x2A6DF) or # + (cp >= 0x2A700 and cp <= 0x2B73F) or # + (cp >= 0x2B740 and cp <= 0x2B81F) or # + (cp >= 0x2B820 and cp <= 0x2CEAF) or + (cp >= 0xF900 and cp <= 0xFAFF) or # + (cp >= 0x2F800 and cp <= 0x2FA1F)): # + return True + + return False + + def _clean_text(self, text): + """Performs invalid character removal and whitespace cleanup on text.""" + output = [] + for char in text: + cp = ord(char) + if cp == 0 or cp == 0xfffd or _is_control(char): + continue + if _is_whitespace(char): + output.append(" ") + else: + output.append(char) + return "".join(output) + + +class WordpieceTokenizer(object): + """Runs WordPiece tokenization.""" + + def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100): + self.vocab = vocab + self.unk_token = unk_token + self.max_input_chars_per_word = max_input_chars_per_word + + def tokenize(self, text): + """Tokenizes a piece of text into its word pieces. + + This uses a greedy longest-match-first algorithm to perform tokenization + using the given vocabulary. + + For example: + input = "unaffable" + output = ["un", "##aff", "##able"] + + Args: + text: A single token or whitespace separated tokens. This should have + already been passed through `BasicTokenizer`. + + Returns: + A list of wordpiece tokens. 
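+
+        If the token cannot be fully segmented into in-vocabulary pieces, the
+        whole token is replaced with unk_token.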
+ """ + + output_tokens = [] + for token in whitespace_tokenize(text): + chars = list(token) + if len(chars) > self.max_input_chars_per_word: + output_tokens.append(self.unk_token) + continue + + is_bad = False + start = 0 + sub_tokens = [] + while start < len(chars): + end = len(chars) + cur_substr = None + while start < end: + substr = "".join(chars[start:end]) + if start > 0: + substr = "##" + substr + if substr in self.vocab: + cur_substr = substr + break + end -= 1 + if cur_substr is None: + is_bad = True + break + sub_tokens.append(cur_substr) + start = end + + if is_bad: + output_tokens.append(self.unk_token) + else: + output_tokens.extend(sub_tokens) + return output_tokens + + +def _is_whitespace(char): + """Checks whether `chars` is a whitespace character.""" + # \t, \n, and \r are technically contorl characters but we treat them + # as whitespace since they are generally considered as such. + if char == " " or char == "\t" or char == "\n" or char == "\r": + return True + cat = unicodedata.category(char) + if cat == "Zs": + return True + return False + + +def _is_control(char): + """Checks whether `chars` is a control character.""" + # These are technically control characters but we count them as whitespace + # characters. + if char == "\t" or char == "\n" or char == "\r": + return False + cat = unicodedata.category(char) + if cat.startswith("C"): + return True + return False + + +def _is_punctuation(char): + """Checks whether `chars` is a punctuation character.""" + cp = ord(char) + # We treat all non-letter/number ASCII as punctuation. + # Characters such as "^", "$", and "`" are not in the Unicode + # Punctuation class but we treat them as punctuation anyways, for + # consistency. + if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or + (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): + return True + cat = unicodedata.category(char) + if cat.startswith("P"): + return True + return False + +# Copyright 2018 The Open AI Team Authors and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Tokenization classes for OpenAI GPT.""" + + +try: + from functools import lru_cache +except ImportError: + # Just a dummy decorator to get the checks to run on python2 + # because honestly I don't want to support a byte-level unicode BPE tokenizer on python 2 right now. + def lru_cache(): + return lambda func: func + + +@lru_cache() +def bytes_to_unicode(): + """ + Returns list of utf-8 byte and a corresponding list of unicode strings. + The reversible bpe codes work on unicode strings. + This means you need a large # of unicode characters in your vocab if you want to avoid UNKs. + When you're at something like a 10B token dataset you end up needing around 5K for decent coverage. + This is a signficant percentage of your normal, say, 32K bpe vocab. + To avoid that, we want lookup tables between utf-8 bytes and unicode strings. + And avoids mapping to whitespace/control characters the bpe code barfs on. 
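+    For example, the space byte 0x20 is one of the remapped bytes and becomes
+    "Ġ" (U+0120), which is why GPT-2 BPE tokens that start a new word begin
+    with "Ġ".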
+ """ + _chr = unichr if sys.version_info[0] == 2 else chr + bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1)) + cs = bs[:] + n = 0 + for b in range(2**8): + if b not in bs: + bs.append(b) + cs.append(2**8+n) + n += 1 + cs = [_chr(n) for n in cs] + return dict(zip(bs, cs)) + +def get_pairs(word): + """Return set of symbol pairs in a word. + + Word is represented as tuple of symbols (symbols being variable-length strings). + """ + pairs = set() + prev_char = word[0] + for char in word[1:]: + pairs.add((prev_char, char)) + prev_char = char + return pairs + +class GPT2Tokenizer(object): + """ + GPT-2 BPE tokenizer. Peculiarities: + - Byte-level BPE + """ + + def __init__(self, vocab_file, merges_file, errors='replace', special_tokens=None, max_len=None): + self.max_len = max_len if max_len is not None else int(1e12) + self.encoder = json.load(open(vocab_file)) + self.decoder = {v:k for k,v in self.encoder.items()} + self.errors = errors # how to handle errors in decoding + self.byte_encoder = bytes_to_unicode() + self.byte_decoder = {v:k for k, v in self.byte_encoder.items()} + bpe_data = open(merges_file, encoding='utf-8').read().split('\n')[1:-1] + bpe_merges = [tuple(merge.split()) for merge in bpe_data] + self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges)))) + self.cache = {} + + # Should haved added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions + self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""") + + self.special_tokens = {} + self.special_tokens_decoder = {} + self.set_special_tokens(special_tokens) + + def __len__(self): + return len(self.encoder) + len(self.special_tokens) + + def set_special_tokens(self, special_tokens): + """ Add a list of additional tokens to the encoder. + The additional tokens are indexed starting from the last index of the + current vocabulary in the order of the `special_tokens` list. + """ + if not special_tokens: + self.special_tokens = {} + self.special_tokens_decoder = {} + return + self.special_tokens = dict((tok, len(self.encoder) + i) for i, tok in enumerate(special_tokens)) + self.special_tokens_decoder = {v:k for k, v in self.special_tokens.items()} + logger.info("Special tokens {}".format(self.special_tokens)) + + def bpe(self, token): + if token in self.cache: + return self.cache[token] + word = tuple(token) + pairs = get_pairs(word) + + if not pairs: + return token + + while True: + bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf'))) + if bigram not in self.bpe_ranks: + break + first, second = bigram + new_word = [] + i = 0 + while i < len(word): + try: + j = word.index(first, i) + new_word.extend(word[i:j]) + i = j + except: + new_word.extend(word[i:]) + break + + if word[i] == first and i < len(word)-1 and word[i+1] == second: + new_word.append(first+second) + i += 2 + else: + new_word.append(word[i]) + i += 1 + new_word = tuple(new_word) + word = new_word + if len(word) == 1: + break + else: + pairs = get_pairs(word) + word = ' '.join(word) + self.cache[token] = word + return word + + def tokenize(self, text): + """ Tokenize a string. 
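+        The text is split with the GPT-2 regex pattern, each piece is mapped to
+        unicode symbols through the byte encoder, and BPE merges are then applied
+        to every piece.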
""" + bpe_tokens = [] + for token in re.findall(self.pat, text): + token = ''.join(self.byte_encoder[ord(b)] for b in token if ord(b) in self.byte_encoder) + if token == '': + continue + bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(' ')) + return bpe_tokens + + def convert_tokens_to_ids(self, tokens): + """ Converts a sequence of tokens into ids using the vocab. """ + ids = [] + if isinstance(tokens, str) or (sys.version_info[0] == 2 and isinstance(tokens, unicode)): + if tokens in self.special_tokens: + return self.special_tokens[tokens] + else: + return self.encoder.get(tokens, 0) + for token in tokens: + if token in self.special_tokens: + ids.append(self.special_tokens[token]) + else: + ids.append(self.encoder.get(token, 0)) + if len(ids) > self.max_len: + logger.warning( + "Token indices sequence length is longer than the specified maximum " + " sequence length for this OpenAI GPT model ({} > {}). Running this" + " sequence through the model will result in indexing errors".format(len(ids), self.max_len) + ) + return ids + + def convert_ids_to_tokens(self, ids, skip_special_tokens=False): + """Converts a sequence of ids in BPE tokens using the vocab.""" + tokens = [] + for i in ids: + if i in self.special_tokens_decoder: + if not skip_special_tokens: + tokens.append(self.special_tokens_decoder[i]) + else: + tokens.append(self.decoder[i]) + return tokens + + def encode(self, text): + return self.convert_tokens_to_ids(self.tokenize(text)) + + def decode(self, tokens): + text = ''.join([self.decoder[token] for token in tokens]) + text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors) + return text diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/metrics/metrics.py b/PaddleNLP/Research/Dialogue-PLATO/plato/metrics/metrics.py new file mode 100644 index 0000000000000000000000000000000000000000..2c6c545dee1a0f0410e099e7d2fabb1cc43dfe21 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/metrics/metrics.py @@ -0,0 +1,69 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Metrics class. +""" + +from collections import Counter + +from nltk.translate import bleu_score +from nltk.translate.bleu_score import SmoothingFunction +import numpy as np + + +def distinct(seqs): + """ Calculate intra/inter distinct 1/2. 
""" + batch_size = len(seqs) + intra_dist1, intra_dist2 = [], [] + unigrams_all, bigrams_all = Counter(), Counter() + for seq in seqs: + unigrams = Counter(seq) + bigrams = Counter(zip(seq, seq[1:])) + intra_dist1.append((len(unigrams)+1e-12) / (len(seq)+1e-5)) + intra_dist2.append((len(bigrams)+1e-12) / (max(0, len(seq)-1)+1e-5)) + + unigrams_all.update(unigrams) + bigrams_all.update(bigrams) + + inter_dist1 = (len(unigrams_all)+1e-12) / (sum(unigrams_all.values())+1e-5) + inter_dist2 = (len(bigrams_all)+1e-12) / (sum(bigrams_all.values())+1e-5) + intra_dist1 = np.average(intra_dist1) + intra_dist2 = np.average(intra_dist2) + return intra_dist1, intra_dist2, inter_dist1, inter_dist2 + + +def bleu(hyps, refs): + """ Calculate bleu 1/2. """ + bleu_1 = [] + bleu_2 = [] + for hyp, ref in zip(hyps, refs): + try: + score = bleu_score.sentence_bleu( + [ref], hyp, + smoothing_function=SmoothingFunction().method7, + weights=[1, 0, 0, 0]) + except: + score = 0 + bleu_1.append(score) + try: + score = bleu_score.sentence_bleu( + [ref], hyp, + smoothing_function=SmoothingFunction().method7, + weights=[0.5, 0.5, 0, 0]) + except: + score = 0 + bleu_2.append(score) + bleu_1 = np.average(bleu_1) + bleu_2 = np.average(bleu_2) + return bleu_1, bleu_2 diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/metrics/metrics_tracker.py b/PaddleNLP/Research/Dialogue-PLATO/plato/metrics/metrics_tracker.py new file mode 100644 index 0000000000000000000000000000000000000000..eb621a462031cb2c8df3e8f6006a25ea198ea0d1 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/metrics/metrics_tracker.py @@ -0,0 +1,85 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +MetricsTracker class +""" + +from collections import defaultdict +import math + + +class MetricsTracker(object): + """ Tracking metrics. 
""" + + def __init__(self): + self.metrics_val = defaultdict(float) + self.metrics_avg = defaultdict(float) + self.num_samples = 0 + + def update(self, metrics, num_samples): + for key, val in metrics.items(): + if val is not None: + val = float(val) + self.metrics_val[key] = val + avg_val = (self.metrics_avg.get(key, 0) * self.num_samples + + val * num_samples) / (self.num_samples + num_samples) + self.metrics_avg[key] = avg_val + self.num_samples += num_samples + + def clear(self): + self.metrics_val = defaultdict(float) + self.metrics_avg = defaultdict(float) + self.num_samples = 0 + + def items(self): + return self.metrics_avg.items() + + def get(self, name): + if self.num_samples == 0: + raise ValueError("There is no data in Metrics.") + return self.metrics_avg.get(name) + + def state_dict(self): + return { + "metrics_val": self.metrics_val, + "metrics_avg": self.metrics_avg, + "num_samples": self.num_samples, + } + + def load_state_dict(self, state_dict): + self.metrics_val = state_dict["metrics_val"] + self.metrics_avg = state_dict["metrics_avg"] + self.num_samples = state_dict["num_samples"] + + def value(self): + metric_strs = [] + for key, val in self.metrics_val.items(): + metric_str = f"{key.upper()}-{val:.3f}" + metric_strs.append(metric_str) + if "token_nll" in self.metrics_val: + metric_str = f"TOKEN_PPL-{math.exp(self.metrics_val['token_nll']):.3f}" + metric_strs.append(metric_str) + metric_strs = " ".join(metric_strs) + return metric_strs + + def summary(self): + metric_strs = [] + for key, val in self.metrics_avg.items(): + metric_str = f"{key.upper()}-{val:.3f}" + metric_strs.append(metric_str) + if "token_nll" in self.metrics_avg: + metric_str = f"TOKEN_PPL-{math.exp(self.metrics_avg['token_nll']):.3f}" + metric_strs.append(metric_str) + metric_strs = " ".join(metric_strs) + return metric_strs diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/models/__init__.py b/PaddleNLP/Research/Dialogue-PLATO/plato/models/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..a3e7df80234415f4563834c6f9187b0d7f2a51f1 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/models/__init__.py @@ -0,0 +1,18 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Loading models. +""" + +import plato.models.unified_transformer diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/models/generator.py b/PaddleNLP/Research/Dialogue-PLATO/plato/models/generator.py new file mode 100644 index 0000000000000000000000000000000000000000..f28df4240138d05b50b9fb2e99b1a48f7cc65342 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/models/generator.py @@ -0,0 +1,445 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Generator class. +""" + +import bisect +import math +import sys + +import numpy as np +import paddle.fluid as fluid +import paddle.fluid.layers as layers +from paddle.fluid.framework import Variable + +from plato.args import str2bool +import plato.modules.functions as F + + +def repeat(var, times): + if isinstance(var, list): + return [repeat(x, times) for x in var] + elif isinstance(var, dict): + return {k: repeat(v, times) for k, v in var.items()} + elif isinstance(var, Variable): + var = F.unsqueeze(var, [1]) + expand_times = [1] * len(var.shape) + expand_times[1] = times + dtype = var.dtype + var = layers.cast(var, "float32") + var = layers.expand(var, expand_times) + shape = [var.shape[0] * var.shape[1]] + var.shape[2:] + var = layers.reshape(var, shape) + var = layers.cast(var, dtype) + return var + else: + return var + + +def gather(var, idx): + if isinstance(var, list): + return [gather(x, idx) for x in var] + elif isinstance(var, dict): + return {k: gather(v, idx) for k, v in var.items()} + elif isinstance(var, Variable): + out = layers.gather(var, idx) + return out + else: + return var + + +class Generator(object): + """ Genrator class. """ + + _registry = dict() + + @classmethod + def register(cls, name): + Generator._registry[name] = cls + return + + @staticmethod + def by_name(name): + return Generator._registry[name] + + @staticmethod + def create(hparams, *args, **kwargs): + """ Create generator. """ + generator_cls = Generator.by_name(hparams.generator) + return generator_cls(hparams, *args, **kwargs) + + @classmethod + def add_cmdline_argument(cls, parser): + group = parser.add_argument_group("Generator") + group.add_argument("--generator", type=str, default="BeamSearch", + choices=["TopKSampling", "TopPSampling", "GreedySampling", + "BeamSearch"]) + group.add_argument("--min_gen_len", type=int, default=1, + help="The minimum length of generated response.") + group.add_argument("--max_gen_len", type=int, default=30, + help="The maximum length of generated response.") + args, _ = parser.parse_known_args() + generator_cls = cls.by_name(args.generator) + generator_cls.add_cmdline_argument(group) + return group + + def __init__(self, hparams, bpe): + self.vocab_size = bpe.vocab_size + self.bos_id = bpe.bos_id + self.eos_id = bpe.eos_id + self.unk_id = bpe.unk_id + self.pad_id = bpe.pad_id + self.min_gen_len = hparams.min_gen_len + self.max_gen_len = hparams.max_gen_len + assert 1 <= self.min_gen_len <= self.max_gen_len + return + + def __call__(self, step_fn, state): + """ + Running generation. + + @param : step_fn : decoding one step + @type : function + + @param : state : initial state + @type : dict + """ + raise NotImplementedError + + +class Sampling(Generator): + """ Sampling Generator. 
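+    Decodes one token per step until max_gen_len is reached; subclasses only
+    implement _sampling, which picks the next token from the temperature-scaled
+    scores. Once [EOS] has been produced, only [PAD] can follow.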
""" + + @classmethod + def add_cmdline_argument(cls, group): + group.add_argument("--ignore_unk", type=str2bool, default=True, + help="Whether to ignore unkown token in generation.") + group.add_argument("--sampling_temperature", type=float, default=1.0) + return group + + def __init__(self, hparams, bpe): + super().__init__(hparams, bpe) + self.ignore_unk = hparams.ignore_unk + self.temperature = hparams.sampling_temperature + return + + def _sampling(self, scores): + """ Sampling function. """ + raise NotImplementedError + + def __call__(self, step_fn, state): + """ + Running generation. + + @param : step_fn : decoding one step + @type : function + + @param : state : initial state + @type : dict + """ + batch_size = state["batch_size"] + vocab_size = self.vocab_size + + pos_index = layers.range(0, batch_size, 1, dtype="int64") + pos_index = layers.scale(pos_index, vocab_size) + + # shape: [batch_size, beam_size, 1] + predictions = layers.fill_constant(shape=[batch_size, 1], + dtype="int64", + value=self.bos_id) + sequence_scores = layers.fill_constant(shape=[batch_size], + dtype="float32", + value=0.0) + + unk_penalty = np.zeros(vocab_size, dtype="float32") + unk_penalty[self.unk_id] = -1e10 + unk_penalty = layers.assign(unk_penalty) + + eos_penalty = np.zeros(vocab_size, dtype="float32") + eos_penalty[self.eos_id] = -1e10 + eos_penalty = layers.assign(eos_penalty) + + scores_after_end = np.full(vocab_size, -1e10, dtype="float32") + scores_after_end[self.pad_id] = 0 + scores_after_end = layers.assign(scores_after_end) + + # initial input + for step in range(1, self.max_gen_len + 1): + pre_ids = predictions[:, -1:] + state["pred_token"] = F.unsqueeze(pre_ids, [2]) + if step > 1: + state["pred_mask"] = 1 - F.equal(state["pred_token"], self.pad_id) + state["pred_pos"] = state["pred_pos"] + 1 + scores, state = step_fn(state) + + # Generate next + # scores shape: [batch_size, vocab_size] + if self.ignore_unk: + scores = scores + unk_penalty + + if step <= self.min_gen_len: + scores = scores + eos_penalty + + # previous token is [PAD] or [EOS] + # shape: [batch_size, 1] + pre_eos_mask = F.equal(pre_ids, self.eos_id) + F.equal(pre_ids, self.pad_id) + scores = scores * (1 - pre_eos_mask) + \ + layers.expand(pre_eos_mask, [1, vocab_size]) * scores_after_end + + scores = scores / self.temperature + preds = self._sampling(scores) + + predictions = layers.concat([predictions, F.unsqueeze(preds, [1])], axis=1) + + scores = layers.reshape(scores, [batch_size * vocab_size]) + preds = preds + pos_index + scores = gather(scores, preds) + sequence_scores = sequence_scores + scores + + results = { + "preds": predictions, + "scores": sequence_scores + } + return results + + +class GreedySampling(Sampling): + """ Greedy sampling. """ + + @classmethod + def add_cmdline_argument(cls, group): + return Sampling.add_cmdline_argument(group) + + def _sampling(self, logits): + """ Implement greedy sampling. """ + preds = layers.argmax(logits, axis=1) + return preds + + +class TopKSampling(Sampling): + """ Top-k sampling. 
""" + + @classmethod + def add_cmdline_argument(cls, group): + Sampling.add_cmdline_argument(group) + group.add_argument("--top_k_ratio", type=float, default=None) + group.add_argument("--top_k_num", type=int, default=None) + return group + + def __init__(self, hparams, bpe): + super().__init__(hparams, bpe) + assert hparams.top_k_ratio is not None or hparams.top_k_num is not None + if hparams.top_k_num is not None: + self.top_k_num = hparams.top_k_num + else: + self.top_k_num = math.floor(hparams.top_k_ratio * self.vocab_size) + assert self.top_k_num >= 1 + return + + def _sampling(self, logits): + """ Implement top-k sampling. """ + probs = layers.softmax(logits, axis=1) + probs, indices = layers.topk(probs, self.top_k_num) + probs = probs / layers.reduce_sum(probs, dim=1, keep_dim=True) + preds = [] + for p, ids in zip(probs.numpy(), indices.numpy()): + o = np.random.choice(ids, p=p) + preds.append(o) + preds = np.array(preds, dtype="int64") + return fluid.dygraph.to_variable(preds) + + +class TopPSampling(Sampling): + """ Top-p sampling. """ + + @classmethod + def add_cmdline_argument(cls, group): + Sampling.add_cmdline_argument(group) + group.add_argument("--top_p_ratio", type=float, default=1.0) + return group + + def __init__(self, hparams, bpe): + super().__init__(hparams, bpe) + self.top_p_ratio = hparams.top_p_ratio + return + + def _sampling(self, logits): + """ Implement top-k sampling. """ + probs = layers.softmax(logits, axis=1) + preds = [] + for p in probs.numpy(): + ids = np.argsort(-p) + p = p[ids] + c_p = np.cumsum(p) + i = bisect.bisect_right(c_p, self.top_p_ratio) + 1 + o = np.random.choice(ids[:i], p=p[:i]/np.sum(p[:i])) + preds.append(o) + preds = np.array(preds, dtype="int64") + return fluid.dygraph.to_variable(preds) + + +class BeamSearch(Generator): + """ BeamSearch generator. """ + + @classmethod + def add_cmdline_argument(cls, group): + group.add_argument("--beam_size", type=int, default=5, + help="The beam size in beam search.") + group.add_argument("--length_average", type=str2bool, default=False, + help="Whether to use length average.") + group.add_argument("--length_penalty", type=float, default=-1.0, + help="The parameter(alpha) of length penalty.") + group.add_argument("--ignore_unk", type=str2bool, default=True, + help="Whether to ignore unkown token in generation.") + return group + + def __init__(self, hparams, bpe): + super().__init__(hparams, bpe) + self.beam_size = hparams.beam_size + self.length_average = hparams.length_average + self.length_penalty = hparams.length_penalty + self.ignore_unk = hparams.ignore_unk + return + + def __call__(self, step_fn, state): + """ + Running beam search. 
+ + @param : step_fn : decoding one step + @type : function + + @param : state : initial state + @type : dict + """ + batch_size = state["batch_size"] + beam_size = self.beam_size + + # shape: [batch_size, 1] + pos_index = layers.range(0, batch_size, 1, dtype="int64") + pos_index = layers.scale(pos_index, beam_size) + pos_index = F.unsqueeze(pos_index, [1]) + + # shape: [batch_size, beam_size, 1] + predictions = layers.fill_constant(shape=[batch_size, beam_size, 1], + dtype="int64", + value=self.bos_id) + + # initial input + state["pred_token"] = predictions[:, :1] + # shape: [batch_size, vocab_size] + scores, state = step_fn(state) + + unk_penalty = np.zeros(self.vocab_size, dtype="float32") + unk_penalty[self.unk_id] = -1e10 + unk_penalty = layers.assign(unk_penalty) + + eos_penalty = np.zeros(self.vocab_size, dtype="float32") + eos_penalty[self.eos_id] = -1e10 + eos_penalty = layers.assign(eos_penalty) + + scores_after_end = np.full(self.vocab_size, -1e10, dtype="float32") + scores_after_end[self.pad_id] = 0 + scores_after_end = layers.assign(scores_after_end) + + if self.ignore_unk: + scores = scores + unk_penalty + scores = scores + eos_penalty + + # shape: [batch_size, beam_size] + sequence_scores, preds = layers.topk(scores, self.beam_size) + + predictions = layers.concat([predictions, F.unsqueeze(preds, [2])], axis=2) + state = repeat(state, beam_size) + + parent_idx_list = [] + pred_list = [] + + for step in range(2, self.max_gen_len + 1): + pre_ids = predictions[:, :, -1:] + state["pred_token"] = layers.reshape(pre_ids, shape=[batch_size * beam_size, 1, 1]) + state["pred_mask"] = 1 - F.equal(state["pred_token"], self.pad_id) + state["pred_pos"] = state["pred_pos"] + 1 + scores, state = step_fn(state) + + # Generate next + # scores shape: [batch_size, beam_size, vocab_size] + if self.ignore_unk: + scores = scores + unk_penalty + + if step <= self.min_gen_len: + scores = scores + eos_penalty + + scores = layers.reshape(scores, shape=[batch_size, beam_size, self.vocab_size]) + + # previous token is [PAD] or [EOS] + pre_eos_mask = F.equal(pre_ids, self.eos_id) + F.equal(pre_ids, self.pad_id) + + scores = scores * (1 - pre_eos_mask) + \ + layers.expand(pre_eos_mask, [1, 1, self.vocab_size]) * scores_after_end + if self.length_average: + scaled_value = pre_eos_mask + (1 - pre_eos_mask) * (1 - 1 / step) + sequence_scores = F.unsqueeze(sequence_scores, [2]) * scaled_value + scaled_value = pre_eos_mask + (1 - pre_eos_mask) * (1 / step) + scores = scores * scaled_value + elif self.length_penalty >= 0.0: + scaled_value = pre_eos_mask + (1 - pre_eos_mask) * \ + (math.pow((4 + step) / (5 + step), self.length_penalty)) + sequence_scores = layers.elementwise_mul(scaled_value, sequence_scores, axis=0) + scaled_value = pre_eos_mask + (1 - pre_eos_mask) * \ + (math.pow(1 / (5 + step), self.length_penalty)) + scores = scores * scaled_value + scores = layers.elementwise_add(scores, sequence_scores, axis=0) + scores = layers.reshape(scores, shape=[batch_size, beam_size * self.vocab_size]) + + topk_scores, topk_indices = layers.topk(scores, beam_size) + vocab_size = layers.fill_constant(shape=[1], dtype="int64", value=self.vocab_size) + parent_idx = layers.elementwise_floordiv(topk_indices, vocab_size) + preds = layers.elementwise_mod(topk_indices, vocab_size) + + # Gather state / sequence_scores + parent_idx = layers.elementwise_add(parent_idx, pos_index, axis=0) + parent_idx = layers.reshape(parent_idx, [batch_size * beam_size]) + state = gather(state, parent_idx) + sequence_scores = topk_scores + + 
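+            # Reorder the kept token histories by parent beam index, then append
+            # the newly selected tokens.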
predictions = layers.reshape(predictions, shape=[batch_size * beam_size, step]) + predictions = gather(predictions, parent_idx) + predictions = layers.reshape(predictions, shape=[batch_size, beam_size, step]) + predictions = layers.concat([predictions, F.unsqueeze(preds, [2])], axis=2) + + pre_ids = predictions[:, :, -1] + pre_eos_mask = F.equal(pre_ids, self.eos_id) + F.equal(pre_ids, self.pad_id) + sequence_scores = sequence_scores * pre_eos_mask + layers.scale(1 - pre_eos_mask, -1e10) + + _, indices = layers.argsort(sequence_scores, axis=1) + indices = indices + pos_index + indices = layers.reshape(indices, [-1]) + sequence_scores = layers.reshape(sequence_scores, [batch_size * beam_size]) + predictions = layers.reshape(predictions, [batch_size * beam_size, -1]) + sequence_scores = gather(sequence_scores, indices) + predictions = layers.gather(predictions, indices) + sequence_scores = layers.reshape(sequence_scores, [batch_size, beam_size]) + predictions = layers.reshape(predictions, [batch_size, beam_size, -1]) + + results = { + "preds": predictions[:, -1], + "scores": sequence_scores[:, -1] + } + return results + +BeamSearch.register("BeamSearch") +GreedySampling.register("GreedySampling") +TopKSampling.register("TopKSampling") +TopPSampling.register("TopPSampling") diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/models/model_base.py b/PaddleNLP/Research/Dialogue-PLATO/plato/models/model_base.py new file mode 100644 index 0000000000000000000000000000000000000000..9d801e9275786a3b92894020db95a4da21b9952c --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/models/model_base.py @@ -0,0 +1,145 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Model base +""" + +import paddle.fluid as fluid +from paddle.fluid.dygraph import parallel_helper + + +class ModelBase(fluid.dygraph.Layer): + """ + Basic model wrapper for static graph and dygrpah. + """ + _registry = dict() + + @classmethod + def register(cls, name): + ModelBase._registry[name] = cls + return + + @staticmethod + def by_name(name): + return ModelBase._registry[name] + + @staticmethod + def create(name_scope, hparams, *args, **kwargs): + model_cls = ModelBase.by_name(hparams.model) + return model_cls(name_scope, hparams, *args, **kwargs) + + @classmethod + def add_cmdline_argument(cls, parser): + """ Add cmdline argument. """ + group = parser.add_argument_group("Model") + group.add_argument("--init_checkpoint", type=str, default=None) + group.add_argument("--model", type=str, default="UnifiedTransformer", + choices=["UnifiedTransformer"]) + args, _ = parser.parse_known_args() + model_cls = ModelBase.by_name(args.model) + model_cls.add_cmdline_argument(group) + return group + + def __init__(self, name_scope, hparams): + super().__init__(name_scope) + self.init_checkpoint = hparams.init_checkpoint + return + + def __call__(self, *args, **kwargs): + """ Re-implement __call__ function in dygraph mode. 
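+        On the first call the model is built once: parameters are created,
+        broadcast in data-parallel mode, and the init checkpoint (if any) is
+        loaded before forward runs.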
""" + if not self._built: + self._build_once(*args, **kwargs) + self._built = True + + outputs = self.forward(*args, **kwargs) + return outputs + + def _build_once(self, inputs, *args, **kwargs): + """ + Build only once. + + 1. Initialize models's parameters. + 2. Boardcast parameters if in data parallel mode. + 3. Load saved parameters + """ + # Initial parameters. + self._create_parameters() + + if parallel_helper._is_data_parallel_mode(): + parallel_helper._broadcast_parameters(self._parameters.values()) + + # Load persitables + self._load_params() + return + + def _create_parameters(self): + """ Create model's paramters. """ + raise NotImplementedError + + def _load_params(self): + """ Load saved paramters. """ + raise NotImplementedError + + def _forward(self, inputs, is_training): + """ Real forward process of model in different mode(train/test). """ + raise NotImplementedError + + def _collect_metrics(self, inputs, outputs): + """ Calculate loss function by using inputs and outputs. """ + raise NotImplementedError + + def _optimize(self, loss): + """ Optimize loss function and update model. """ + raise NotImplementedError + + def _infer(self, inputs): + """ Real inference process of model. """ + raise NotImplementedError + + def forward(self, inputs, is_training=False): + """ + Forward process, include real forward, collect metrices and optimize(optional) + + @params : inputs : input data + @type : dict of numpy.ndarray/int/float/... + """ + if is_training: + self.train() + else: + self.eval() + + outputs = self._forward(inputs, is_training) + metrics = self._collect_metrics(inputs, outputs) + loss = metrics["loss"] + if is_training: + self._optimize(loss) + + metrics = {k: v.numpy() for k, v in metrics.items()} + return metrics + + def infer(self, inputs): + """ + Inference process. + + @params : inputs : input data + @type : dict of numpy.ndarray/int/float/... + """ + if not self._built: + self._build_once(inputs) + self._built = True + + self.eval() + results = self._infer(inputs) + results = {name: results[name].numpy() for name in results} + return results diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/models/unified_transformer.py b/PaddleNLP/Research/Dialogue-PLATO/plato/models/unified_transformer.py new file mode 100644 index 0000000000000000000000000000000000000000..8ccc45c70a343533f29e8711c4f396bd353b65d9 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/models/unified_transformer.py @@ -0,0 +1,747 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" +UnifiedTransformer +""" + +import numpy as np +import paddle +import paddle.fluid as fluid +from paddle.fluid.dygraph import FC +import paddle.fluid.layers as layers + +from plato.args import str2bool +from plato.modules.embedder import Embedder +import plato.modules.functions as F +from plato.modules.layer_norm import LayerNorm +from plato.modules.transformer_block import TransformerBlock +from plato.models.model_base import ModelBase + + +class UnifiedTransformer(ModelBase): + """ + Implement unified transformer. + """ + + @classmethod + def add_cmdline_argument(cls, group): + """ Add cmdline argument. """ + group.add_argument("--num_token_embeddings", type=int, default=-1, + help="The number of tokens in vocabulary. " + "It will be automatically calculated after loading vocabulary.") + group.add_argument("--num_pos_embeddings", type=int, default=512, + help="The maximum number of position.") + group.add_argument("--num_type_embeddings", type=int, default=2, + help="The number of different type of tokens.") + group.add_argument("--num_turn_embeddings", type=int, default=16, + help="The maximum number of turn.") + group.add_argument("--num_latent", type=int, default=20, + help="The number of latent.") + group.add_argument("--tau", type=float, default=0.67, + help="The parameter of gumbel softmax.") + group.add_argument("--with_bow", type=str2bool, default=True, + help="Whether to use BoW loss.") + group.add_argument("--hidden_dim", type=int, default=768, + help="The size of hidden vector in transformer.") + group.add_argument("--num_heads", type=int, default=12, + help="The number of heads in multi head attention.") + group.add_argument("--num_layers", type=int, default=12, + help="The number of layers in transformer.") + group.add_argument("--padding_idx", type=int, default=0, + help="The padding index.") + group.add_argument("--dropout", type=float, default=0.1, + help="The dropout ratio after multi head attention and feed forward network.") + group.add_argument("--embed_dropout", type=float, default=0.0, + help="The dropout ratio of embedding layers.") + group.add_argument("--attn_dropout", type=float, default=0.1, + help="The dropout ratio of multi head attention.") + group.add_argument("--ff_dropout", type=float, default=0.1, + help="The dropout ratio of feed forward network.") + group.add_argument("--use_discriminator", type=str2bool, default=False, + help="Whether to use discriminator loss.") + group.add_argument("--dis_ratio", type=float, default=1.0, + help="The ratio of discriminator loss.") + group.add_argument("--weight_sharing", type=str2bool, default=True, + help="Whether to share weight between token embedding and " + "predictor FC layer.") + group.add_argument("--pos_trainable", type=str2bool, default=True, + help="Whether to train position embeddings.") + group.add_argument("--two_layer_predictor", type=str2bool, default=False, + help="Use two layer predictor. 
" + "Traditional BERT use two FC layers to predict masked token.") + group.add_argument("--bidirectional_context", type=str2bool, default=True, + help="Whether to use bidirectional self-attention in context tokens.") + group.add_argument("--label_smooth", type=float, default=0.0, + help="Use soft label to calculate NLL loss and BoW loss.") + group.add_argument("--initializer_range", type=float, default=0.02, + help="Use to initialize parameters.") + + group.add_argument("--lr", type=float, default=5e-5, + help="The inital learning rate for Adam.") + group.add_argument("--weight_decay", type=float, default=0.0, + help="The weight decay for Adam.") + group.add_argument("--max_grad_norm", type=float, default=None, + help="The maximum norm of gradient.") + return group + + def __init__(self, name_scope, hparams, generator, dtype="float32"): + super().__init__(name_scope, hparams) + self.generator = generator + self.num_token_embeddings = hparams.num_token_embeddings + self.num_pos_embeddings = hparams.num_pos_embeddings + self.num_type_embeddings = hparams.num_type_embeddings + self.num_turn_embeddings = hparams.num_turn_embeddings + self.num_latent = hparams.num_latent + self.tau = hparams.tau + self.with_bow = hparams.with_bow + self.hidden_dim = hparams.hidden_dim + self.num_heads = hparams.num_heads + self.num_layers = hparams.num_layers + self.padding_idx = hparams.padding_idx + self.dropout = hparams.dropout + self.embed_dropout = hparams.embed_dropout + self.attn_dropout = hparams.attn_dropout + self.ff_dropout = hparams.ff_dropout + self.use_discriminator = hparams.use_discriminator + self.weight_sharing = hparams.weight_sharing + self.pos_trainable = hparams.pos_trainable + self.two_layer_predictor = hparams.two_layer_predictor + self.bidirectional_context = hparams.bidirectional_context + self.label_smooth = hparams.label_smooth + self.initializer_range = hparams.initializer_range + + self.embedder = Embedder(self.full_name(), + self.hidden_dim, + self.num_token_embeddings, + self.num_pos_embeddings, + self.num_type_embeddings, + self.num_turn_embeddings, + padding_idx=self.padding_idx, + dropout=self.embed_dropout, + pos_trainable=self.pos_trainable) + self.embed_layer_norm = LayerNorm(self.full_name(), + begin_norm_axis=2, + epsilon=1e-12, + param_attr=fluid.ParamAttr( + regularizer=fluid.regularizer.L2Decay(0.0)), + bias_attr=fluid.ParamAttr( + regularizer=fluid.regularizer.L2Decay(0.0))) + + self.layers = [] + for i in range(hparams.num_layers): + layer = TransformerBlock(self.full_name(), + self.hidden_dim, + self.num_heads, + self.dropout, + self.attn_dropout, + self.ff_dropout) + self.layers.append(layer) + self.add_sublayer(f"layer_{i}", layer) + + if self.num_latent > 0: + self.post_network = FC(name_scope=self.full_name() + ".post_network", + size=self.num_latent, + bias_attr=False) + + if self.use_discriminator: + self.dis_ratio = hparams.dis_ratio + self.discriminator = FC(name_scope=self.full_name() + ".discriminator", + size=1, + act="sigmoid") + + if self.two_layer_predictor: + self.pre_predictor = FC(name_scope=self.full_name() + ".pre_predictor", + size=self.hidden_dim, + num_flatten_dims=2, + act="gelu") + if self.num_latent > 0 and self.with_bow: + self.pre_bow_predictor = FC(name_scope=self.full_name() + ".pre_bow_predictor", + size=self.hidden_dim, + act="gelu") + if not self.weight_sharing: + self.predictor = FC(name_scope=self.full_name() + ".predictor", + size=self.num_token_embeddings, + num_flatten_dims=2, + bias_attr=False) + if self.num_latent > 0 and 
self.with_bow: + self.bow_predictor = FC(name_scope=self.full_name() + ".bow_predictor", + size=self.num_token_embeddings, + bias_attr=False) + + self.max_grad_norm = hparams.max_grad_norm + if self.max_grad_norm is not None: + self.grad_clip = fluid.dygraph_grad_clip.GradClipByGlobalNorm(hparams.max_grad_norm) + else: + self.grad_clip = None + self.weight_decay = hparams.weight_decay + self.optimizer = fluid.optimizer.AdamOptimizer( + learning_rate=hparams.lr, + regularization=fluid.regularizer.L2Decay(self.weight_decay)) + + self._dtype = dtype + + # DataDistributed + self.before_backward_fn = None + self.after_backward_fn = None + return + + def _create_parameters(self): + """ Create model's paramters. """ + if self.num_latent > 0: + self.mask_embed = self.create_parameter( + attr=fluid.ParamAttr( + name="mask_embed", + initializer=fluid.initializer.NormalInitializer(scale=self.initializer_range)), + shape=[1, 1, self.hidden_dim], + dtype=self._dtype) + self.latent_embeddings = self.create_parameter( + attr=fluid.ParamAttr( + name="latent_embeddings", + initializer=fluid.initializer.NormalInitializer(scale=self.initializer_range)), + shape=[self.num_latent, self.hidden_dim], + dtype=self._dtype) + + sequence_mask = np.tri(self.num_pos_embeddings, self.num_pos_embeddings, dtype=self._dtype) + self.sequence_mask = self.create_parameter( + attr=fluid.ParamAttr( + name="sequence_mask", + initializer=fluid.initializer.NumpyArrayInitializer(sequence_mask), + trainable=False), + shape=sequence_mask.shape, + dtype=sequence_mask.dtype) + return + + def _load_params(self): + """ Load saved paramters. """ + if self.init_checkpoint is not None: + print(f"Loading parameters from {self.init_checkpoint}") + if hasattr(fluid, "load_dygraph"): + # >= 1.6.0 compatible + models, optimizers = fluid.load_dygraph(self.init_checkpoint) + else: + models, optimizers = fluid.dygraph.load_persistables(self.init_checkpoint) + parameters = {param.name: param for param in self.parameters()} + for name, param in models.items(): + if name in parameters: + if param.shape != parameters[name].shape: + print(f"part of parameter({name}) random normlize initialize") + if hasattr(param, "numpy"): + arr = param.numpy() + else: + value = param.value() + tensor = value.get_tensor() + arr = np.array(tensor) + z = np.random.normal(scale=self.initializer_range, + size=parameters[name].shape).astype("float32") + if name == "Model/UnifiedTransformer_0/Embedder_0/Embedding_0.w_0": + z[-param.shape[0]:] = arr + else: + z[:param.shape[0]] = arr + z = fluid.dygraph.to_variable(z) + models[name] = z + for name in parameters: + if name not in models: + if parameters[name].trainable: + print(f"parameter({name}) random normlize initialize") + z = np.random.normal(scale=self.initializer_range, + size=parameters[name].shape).astype("float32") + models[name] = fluid.dygraph.to_variable(z) + else: + models[name] = parameters[name] + self.load_dict(models) + print(f"Loaded parameters from {self.init_checkpoint}") + + def _create_mask(self, input_mask, append_head=False, auto_regressive=False): + """ + Create attention mask. 
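+        Builds a [batch_size, seq_len, seq_len] mask from the outer product of
+        the token mask with itself (one row/column larger when append_head is
+        True); an auto-regressive lower-triangular constraint can be applied,
+        and the result is inverted so that 1 marks a masked-out position.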
+ + @param : input_mask + @type : Variable(shape: [batch_size, max_seq_len]) + + @param : auto_regressive + @type : bool + """ + input_mask = fluid.layers.unsqueeze(input=input_mask, axes=[2]) + seq_len = input_mask.shape[1] + + input_mask = layers.cast(input_mask, self._dtype) + mask1 = layers.expand(input_mask, [1, 1, seq_len]) + mask2 = layers.transpose(mask1, [0, 2, 1]) + mask = layers.elementwise_mul(mask1, mask2) + + if append_head: + mask = layers.concat([mask[:, :1, :], mask], axis=1) + mask = layers.concat([mask[:, :, :1], mask], axis=2) + seq_len += 1 + + if auto_regressive: + seq_mask = self.sequence_mask[:seq_len, :seq_len] + mask = layers.elementwise_mul(mask, seq_mask) + + mask = 1 - mask + return mask + + def _join_mask(self, mask1, mask2): + """ Merge source attention mask and target attention mask. + + @param : mask1 : source attention mask + @type : Variable(shape: [batch_size, max_src_len, max_src_len]) + + @param : mask1 : target attention mask + @type : Variable(shape: [batch_size, max_tgt_len, max_tgt_len]) + """ + batch_size = mask1.shape[0] + seq_len1 = mask1.shape[1] + seq_len2 = mask2.shape[1] + seq_len = seq_len1 + seq_len2 + + mask_lu = mask1 + mask_ru = layers.fill_constant([batch_size, seq_len1, seq_len2], self._dtype, 1) + mask3 = layers.expand(mask2[:, :, :1], [1, 1, seq_len1]) + mask4 = layers.expand(mask1[:, :1], [1, seq_len2, 1]) + mask_lb = mask3 + mask4 - mask3 * mask4 + mask_rb = mask2 + mask_u = layers.concat([mask_lu, mask_ru], axis=2) + mask_b = layers.concat([mask_lb, mask_rb], axis=2) + mask = layers.concat([mask_u, mask_b], axis=1) + return mask + + def _posteriori_network(self, input_mask, embed, batch_size, src_len, tgt_len): + """ Basic posteriori network implement. """ + mask_embed = self.mask_embed + mask_embed = layers.expand(mask_embed, [batch_size, 1, 1]) + mask_embed = self.embed_layer_norm(mask_embed) + post_embed = layers.concat([mask_embed, embed], axis=1) + + mask = self._create_mask(input_mask, auto_regressive=not self.bidirectional_context, + append_head=True) + + for layer in self.layers: + post_embed = layer(post_embed, mask, None) + + post_embed = post_embed[:, 0] + post_logits = self.post_network(post_embed) + post_probs = layers.softmax(post_logits, axis=-1) + post_logits = layers.log(post_probs) + return post_embed, post_probs, post_logits + + def _discriminator_network(self, input_mask, embed, batch_size, src_len, tgt_len, pos_embed): + """ Basic discriminator network implement. 
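+        Negative examples are built by rotating the target embeddings (and
+        masks) one position within the batch, so every source is paired with
+        another example's response; with batch_size == 1 there is no real
+        negative.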
""" + # if batch_size <= 1: + # raise ValueError("Warmming: If you use discriminator loss in traning, the batch_size must be greater than 1.") + + src_embed = embed[:, :src_len] + tgt_embed = embed[:, src_len:] + if batch_size > 1: + neg_tgt_embed = layers.concat([tgt_embed[1:], tgt_embed[:1]], axis=0) + else: + # Cannot train discriminator if batch_size == 1 + neg_tgt_embed = tgt_embed + neg_embed = layers.concat([src_embed, neg_tgt_embed], axis=1) + + # Create generation network mask + src_mask = input_mask[:, :src_len] + tgt_mask = input_mask[:, src_len:] + if batch_size > 1: + neg_tgt_mask = layers.concat([tgt_mask[1:], tgt_mask[:1]], axis=0) + else: + # Cannot train discriminator if batch_size == 1 + neg_tgt_mask = tgt_mask + neg_mask = layers.concat([src_mask, neg_tgt_mask], axis=1) + mask = self._create_mask(neg_mask, auto_regressive=not self.bidirectional_context, + append_head=True) + + mask_embed = self.mask_embed + mask_embed = layers.expand(mask_embed, [batch_size, 1, 1]) + mask_embed = self.embed_layer_norm(mask_embed) + neg_embed= layers.concat([mask_embed, neg_embed], axis=1) + + for layer in self.layers: + neg_embed = layer(neg_embed, mask, None) + + neg_embed = neg_embed[:, 0] + + pos_probs = self.discriminator(pos_embed) + neg_probs = self.discriminator(neg_embed) + + return pos_probs, neg_probs + + def _generation_network(self, input_mask, embed, batch_size, src_len, tgt_len, latent_embed): + """ Basic generation network implement. """ + if self.num_latent > 0: + latent_embed = F.unsqueeze(latent_embed, [1]) + latent_embed = self.embed_layer_norm(latent_embed) + dec_embed = layers.concat([latent_embed, embed], axis=1) + else: + dec_embed = embed + + # Create generation network mask + src_mask = input_mask[:, :src_len] + tgt_mask = input_mask[:, src_len:] + enc_mask = self._create_mask(src_mask, auto_regressive=not self.bidirectional_context, + append_head=self.num_latent > 0) + dec_mask = self._create_mask(tgt_mask, auto_regressive=True) + mask = self._join_mask(enc_mask, dec_mask) + + for layer in self.layers: + dec_embed = layer(dec_embed, mask, None) + + if self.num_latent > 0: + latent_embed = dec_embed[:, 0] + else: + latent_embed = None + dec_embed = dec_embed[:, -tgt_len:] + if self.two_layer_predictor: + dec_embed = self.pre_predictor(dec_embed) + if self.weight_sharing: + token_embedding = self.embedder.token_embedding.weight + dec_logits = layers.matmul( + x=dec_embed, + y=token_embedding, + transpose_y=True + ) + else: + dec_logits = self.predictor(dec_embed) + + dec_probs = layers.softmax(dec_logits, axis=-1) + + return latent_embed, dec_probs + + def _forward(self, inputs, is_training): + """ Real forward process of model in different mode(train/test). 
""" + outputs = {} + + src_token = inputs["src_token"] + src_mask = inputs["src_mask"] + src_pos = inputs["src_pos"] + src_type = inputs["src_type"] + src_turn = inputs["src_turn"] + + tgt_token = inputs["tgt_token"][:, :-1] + tgt_mask = inputs["tgt_mask"][:, :-1] + tgt_pos = inputs["tgt_pos"][:, :-1] + tgt_type = inputs["tgt_type"][:, :-1] + tgt_turn = inputs["tgt_turn"][:, :-1] + + input_mask = layers.concat([src_mask, tgt_mask], axis=1) + input_mask.stop_gradient = True + src_embed = self.embedder(src_token, src_pos, src_type, src_turn) + tgt_embed = self.embedder(tgt_token, tgt_pos, tgt_type, tgt_turn) + embed = layers.concat([src_embed, tgt_embed], axis=1) + embed = self.embed_layer_norm(embed) + + batch_size = src_token.shape[0] + src_len = src_token.shape[1] + tgt_len = tgt_token.shape[1] + + if self.num_latent > 0: + post_embed, post_probs, post_logits = self._posteriori_network( + input_mask, embed, batch_size, src_len, tgt_len) + outputs["post_logits"] = post_logits + + if self.use_discriminator: + pos_probs, neg_probs = self._discriminator_network( + input_mask, embed, batch_size, src_len, tgt_len, post_embed) + outputs["pos_probs"] = pos_probs + outputs["neg_probs"] = neg_probs + + if is_training: + z = F.gumbel_softmax(post_logits, self.tau) + else: + indices = layers.argmax(post_logits, axis=1) + z = layers.one_hot(F.unsqueeze(indices, [1]), self.num_latent) + latent_embeddings = self.latent_embeddings + latent_embed = layers.matmul(z, latent_embeddings) + outputs["latent_embed"] = latent_embed + else: + latent_embed = None + + latent_embed, dec_probs = self._generation_network( + input_mask, embed, batch_size, src_len, tgt_len, latent_embed) + outputs["dec_probs"] = dec_probs + + if self.num_latent > 0 and self.with_bow: + if self.two_layer_predictor: + latent_embed = self.pre_bow_predictor(latent_embed) + bow_logits = self.bow_predictor(latent_embed) + bow_probs = layers.softmax(bow_logits) + outputs["bow_probs"] = bow_probs + + return outputs + + def _collect_metrics(self, inputs, outputs): + """ Calculate loss function by using inputs and outputs. 
""" + metrics = {} + + tgt_len = layers.reduce_sum(layers.reduce_sum(inputs["tgt_mask"], dim=1) - 1) + tgt_len.stop_gradient = True + + label = inputs["tgt_token"][:, 1:] + if self.label_smooth > 0: + one_hot_label = layers.one_hot(label, self.num_token_embeddings) + smooth_label = layers.label_smooth(one_hot_label, epsilon=self.label_smooth, + dtype=self._dtype) + nll = layers.cross_entropy(outputs["dec_pred"], smooth_label, soft_label=True, + ignore_index=self.padding_idx) + else: + nll = layers.cross_entropy(outputs["dec_probs"], label, ignore_index=self.padding_idx) + nll = layers.reduce_sum(nll, dim=1) + token_nll = layers.reduce_sum(nll) / tgt_len + nll = layers.reduce_mean(nll) + metrics["nll"] = nll + metrics["token_nll"] = token_nll + loss = nll + + if self.num_latent > 0 and self.with_bow: + bow_probs = F.unsqueeze(outputs["bow_probs"], [1]) + bow_probs = layers.expand(bow_probs, [1, label.shape[1], 1]) + if self.label_smooth > 0: + bow = layers.cross_entropy(bow_probs, smooth_label, soft_label=True, + ignore_index=self.padding_idx) + else: + bow = layers.cross_entropy(bow_probs, label, ignore_index=self.padding_idx) + bow = layers.reduce_sum(bow, dim=1) + token_bow = layers.reduce_sum(bow) / tgt_len + bow = layers.reduce_mean(bow) + metrics["bow"] = bow + metrics["token_bow"] = token_bow + loss = loss + bow + + if self.num_latent > 0 and self.use_discriminator: + dis = 0.0 - (layers.log(outputs["pos_probs"]) + layers.log(1.0 - outputs["neg_probs"])) + dis = layers.reduce_mean(dis) + metrics["dis"] = dis + loss = loss + dis * self.dis_ratio + + metrics["loss"] = loss + metrics["token_num"] = tgt_len + return metrics + + def _optimize(self, loss): + """ Optimize loss function and update model. """ + if self.before_backward_fn is not None: + loss = self.before_backward_fn(loss) + loss.backward() + if self.after_backward_fn is not None: + self.after_backward_fn() + self.optimizer.minimize(loss, + grad_clip=self.grad_clip, + parameter_list=self.parameters()) + self.clear_gradients() + return + + def _init_state(self, inputs): + """ Initialize decode state. 
""" + state = {} + + src_token = inputs["src_token"] + src_mask = inputs["src_mask"] + src_pos = inputs["src_pos"] + src_type = inputs["src_type"] + src_turn = inputs["src_turn"] + + batch_size = src_token.shape[0] + seq_len = src_token.shape[1] + + src_embed = self.embedder(src_token, src_pos, src_type, src_turn) + src_embed = self.embed_layer_norm(src_embed) + + mask = self._create_mask(src_mask, append_head=self.num_latent > 0) + + if self.num_latent > 0: + src_embed = F.unsqueeze(src_embed, [1]) + src_embed = layers.expand(src_embed, [1, self.num_latent, 1, 1]) + src_embed = layers.reshape(src_embed, [-1, seq_len, self.hidden_dim]) + + latent_embed = self.latent_embeddings + latent_embed = F.unsqueeze(latent_embed, [1]) + latent_embed = layers.expand(latent_embed, [batch_size, 1, 1]) + latent_embed = self.embed_layer_norm(latent_embed) + + enc_out = layers.concat([latent_embed, src_embed], axis=1) + + mask = F.unsqueeze(mask, [1]) + mask = layers.expand(mask, [1, self.num_latent, 1, 1]) + mask = layers.reshape(mask, [-1, seq_len + 1, seq_len + 1]) + else: + enc_out = src_embed + + cache = {} + for l, layer in enumerate(self.layers): + cache[f"layer_{l}"] = {} + enc_out = layer(enc_out, mask, cache[f"layer_{l}"]) + + state["cache"] = cache + state["mask"] = mask[:, :1] + if self.num_latent > 0: + state["batch_size"] = batch_size * self.num_latent + shape = [batch_size * self.num_latent, 1, 1] + else: + state["batch_size"] = batch_size + shape = [batch_size, 1, 1] + state["pred_mask"] = layers.ones(shape, self._dtype) + state["pred_pos"] = layers.zeros(shape, "int64") + state["pred_type"] = layers.zeros(shape, "int64") + state["pred_turn"] = layers.zeros(shape, "int64") + + if "tgt_token" in inputs and self.num_latent > 0: + tgt_token = inputs["tgt_token"][:, :-1] + tgt_mask = inputs["tgt_mask"][:, :-1] + tgt_pos = inputs["tgt_pos"][:, :-1] + tgt_type = inputs["tgt_type"][:, :-1] + tgt_turn = inputs["tgt_turn"][:, :-1] + + input_mask = layers.concat([src_mask, tgt_mask], axis=1) + input_mask.stop_gradient = True + src_embed = self.embedder(src_token, src_pos, src_type, src_turn) + tgt_embed = self.embedder(tgt_token, tgt_pos, tgt_type, tgt_turn) + embed = layers.concat([src_embed, tgt_embed], axis=1) + embed = self.embed_layer_norm(embed) + + batch_size = src_token.shape[0] + src_len = src_token.shape[1] + tgt_len = tgt_token.shape[1] + + post_embed, post_probs, post_logits = self._posteriori_network( + input_mask, embed, batch_size, src_len, tgt_len) + state["post_probs"] = post_probs + + return state + + def _decode(self, state): + """ Decoding one time stamp. 
""" + # shape: [batch_size, 1, seq_len] + mask = state["mask"] + + # shape: [batch_size, 1] + pred_token = state["pred_token"] + pred_mask = state["pred_mask"] + pred_pos = state["pred_pos"] + pred_type = state["pred_type"] + pred_turn = state["pred_turn"] + + # list of shape(len: num_layers): [batch_size, seq_len, hidden_dim] + cache = state["cache"] + + pred_embed = self.embedder(pred_token, pred_pos, pred_type, pred_turn) + pred_embed = self.embed_layer_norm(pred_embed) + + # shape: [batch_size, 1, seq_len + 1] + mask = layers.concat([mask, 1 - pred_mask], axis=2) + + # shape: [batch_size, 1, hidden_dim] + for l, layer in enumerate(self.layers): + pred_embed = layer(pred_embed, mask, cache[f"layer_{l}"]) + + # shape: [batch_size, 1, vocab_size] + if self.two_layer_predictor: + pred_embed = self.pre_predictor(pred_embed) + if self.weight_sharing: + token_embedding = self.embedder.token_embedding.weight + pred_logits = layers.matmul( + x=pred_embed, + y=token_embedding, + transpose_y=True + ) + else: + pred_logits = self.predictor(pred_embed) + pred_logits = pred_logits[: , 0] + pred_probs = layers.softmax(pred_logits, axis=1) + pred_logits = layers.log(pred_probs) + + state["mask"] = mask + return pred_logits, state + + def _ranking(self, inputs, predictions): + """ Reranking generated responses. """ + src_token = inputs["src_token"] + src_mask = inputs["src_mask"] + src_pos = inputs["src_pos"] + src_type = inputs["src_type"] + src_turn = inputs["src_turn"] + src_embed = self.embedder(src_token, src_pos, src_type, src_turn) + + batch_size, num_latent, tgt_seq_len = predictions.shape + + # shape: [batch_size, num_latent, seq_len, 1] + preds_token = F.unsqueeze(predictions, [3]) + preds_mask = F.not_equal(preds_token, self.padding_idx, "int64") + preds_pos = layers.range(0, tgt_seq_len, 1, dtype="float32") + preds_pos = F.unsqueeze(preds_pos, [0, 0, 1]) + preds_pos = layers.expand(preds_pos, [batch_size, num_latent, 1, 1]) + preds_pos = layers.cast(preds_pos, "int64") + preds_type = layers.zeros_like(preds_token) + preds_turn = layers.zeros_like(preds_token) + + scores = [] + for i in range(num_latent): + pred_token = preds_token[:, i] + pred_mask = preds_mask[:, i] + pred_pos = preds_pos[:, i] + pred_type = preds_type[:, i] + pred_turn = preds_turn[:, i] + + input_mask = layers.concat([src_mask, pred_mask], axis=1) + input_mask.stop_gradient = True + pred_embed = self.embedder(pred_token, pred_pos, pred_type, pred_turn) + embed = layers.concat([src_embed, pred_embed], axis=1) + embed = self.embed_layer_norm(embed) + + mask_embed = self.mask_embed + mask_embed = layers.expand(mask_embed, [batch_size, 1, 1]) + mask_embed = self.embed_layer_norm(mask_embed) + + out = layers.concat([mask_embed, embed], axis=1) + mask = self._create_mask(input_mask, append_head=True) + + for layer in self.layers: + out = layer(out, mask, None) + + mask_embed = out[:, 0] + score = self.discriminator(mask_embed) + scores.append(score[:, 0]) + scores = layers.stack(scores, axis=1) + return scores + + def _infer(self, inputs): + """ Real inference process of model. """ + results = {} + + # Initial decode state. + state = self._init_state(inputs) + if "post_probs" in state: + results["post_probs"] = state.pop("post_probs") + + # Generation process. 
+ gen_results = self.generator(self._decode, state) + results.update(gen_results) + + if self.num_latent > 0: + batch_size = state["batch_size"] // self.num_latent + results["scores"] = layers.reshape(results["scores"], [batch_size, self.num_latent]) + results["log_p"] = results["scores"] + results["src"] = layers.reshape(inputs["src_token"], [batch_size, -1]) + if "tgt_token" in inputs: + results["tgt"] = layers.reshape(inputs["tgt_token"], [batch_size, -1]) + results["preds"] = layers.reshape(results["preds"], [batch_size, self.num_latent, -1]) + if self.use_discriminator: + results["scores"] = self._ranking(inputs, results["preds"]) + else: + batch_size = state["batch_size"] + if "tgt_token" in inputs: + results["tgt"] = layers.reshape(inputs["tgt_token"], [batch_size, -1]) + return results + + +UnifiedTransformer.register("UnifiedTransformer") diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/modules/embedder.py b/PaddleNLP/Research/Dialogue-PLATO/plato/modules/embedder.py new file mode 100644 index 0000000000000000000000000000000000000000..bfebcc875473de4d73f85c35bd5d9ce6c4b4502b --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/modules/embedder.py @@ -0,0 +1,79 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Embedder class. +""" + +import paddle.fluid as fluid +from paddle.fluid.dygraph import Embedding +from paddle.fluid.dygraph import Layer +import paddle.fluid.layers as layers + +import plato.modules.functions as F + + +class Embedder(Layer): + """ + Composite embedding layer. 
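+
+    Looks up token, position, type and turn ids in four embedding tables of the
+    same hidden_dim, sums the four embeddings element-wise, and applies dropout.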
+ """ + + def __init__(self, + name_scope, + hidden_dim, + num_token_embeddings, + num_pos_embeddings, + num_type_embeddings, + num_turn_embeddings, + padding_idx=None, + dropout=0.1, + pos_trainable=False): + super().__init__(name_scope) + + self.token_embedding = Embedding(name_scope=self.full_name(), + size=[num_token_embeddings, hidden_dim]) + self.pos_embedding = Embedding(name_scope=self.full_name(), + size=[num_pos_embeddings, hidden_dim], + param_attr=fluid.ParamAttr(trainable=pos_trainable)) + self.type_embedding = Embedding(name_scope=self.full_name(), + size=[num_type_embeddings, hidden_dim]) + self.turn_embedding = Embedding(name_scope=self.full_name(), + size=[num_turn_embeddings, hidden_dim]) + self.dropout = dropout + return + + def forward(self, token_inp, pos_inp, type_inp, turn_inp): + embed = self.token_embedding(token_inp) + \ + self.pos_embedding(pos_inp) + \ + self.type_embedding(type_inp) + \ + self.turn_embedding(turn_inp) + embed = F.dropout(embed, self.dropout) + return embed + + +def main(): + import numpy as np + + place = fluid.CPUPlace() + with fluid.dygraph.guard(place): + model = Embedder("Embedder", 10, 20, 20, 20, 20) + token_inp = fluid.dygraph.to_variable(np.random.randint(0, 19, [10, 10]).astype("int64")) + pos_inp = fluid.dygraph.to_variable(np.random.randint(0, 19, [10, 10]).astype("int64")) + type_inp = fluid.dygraph.to_variable(np.random.randint(0, 19, [10, 10]).astype("int64")) + turn_inp = fluid.dygraph.to_variable(np.random.randint(0, 19, [10, 10]).astype("int64")) + out = model(token_inp, pos_inp, type_inp, turn_inp) + print(out) + + +if __name__ == "__main__": + main() diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/modules/feedforward.py b/PaddleNLP/Research/Dialogue-PLATO/plato/modules/feedforward.py new file mode 100644 index 0000000000000000000000000000000000000000..b083c0060db510f8183f4ef35d1db0a4b9643a46 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/modules/feedforward.py @@ -0,0 +1,66 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +FeedForward class. +""" + +import paddle.fluid as fluid +from paddle.fluid.dygraph import FC +from paddle.fluid.dygraph import Layer +import paddle.fluid.layers as layers + +import plato.modules.functions as F + + +class FeedForward(Layer): + """ + Positional feed forward layer. 
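+
+    Two position-wise linear projections (hidden_dim -> inner_dim -> hidden_dim)
+    with GELU activation after the first projection and dropout in between.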
+ """ + + def __init__(self, name_scope, hidden_dim, inner_dim, dropout): + super().__init__(name_scope) + + self.hidden_dim = hidden_dim + self.inner_dim = inner_dim + self.linear_hidden = FC(name_scope=self.full_name(), + size=inner_dim, + num_flatten_dims=2, + act="gelu") + self.linear_out = FC(name_scope=self.full_name(), + size=hidden_dim, + num_flatten_dims=2) + self.dropout = dropout + return + + def forward(self, x): + out = self.linear_hidden(x) + out = F.dropout(out, self.dropout) + out = self.linear_out(out) + return out + + +def main(): + import numpy as np + + place = fluid.CPUPlace() + with fluid.dygraph.guard(place): + model = FeedForward("FeedForward", 10, 20, 0.5) + inp = np.random.rand(2, 3, 10).astype("float32") + inp = fluid.dygraph.to_variable(inp) + out = model(inp) + print(out) + + +if __name__ == "__main__": + main() diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/modules/functions.py b/PaddleNLP/Research/Dialogue-PLATO/plato/modules/functions.py new file mode 100644 index 0000000000000000000000000000000000000000..d6b418e34a86333d62055a4af5c26c1b8365bb21 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/modules/functions.py @@ -0,0 +1,67 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Helpful functions. +""" + +import numpy as np +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def unsqueeze(input, axes): + """ Implement unsqueeze in dygraph mode. """ + # return layers.unsqueeze(input, axes) + # op:unsqueeze has bug in dygraph + axes = [axis if axis >= 0 else axis + len(input.shape) + 1 for axis in axes] + axes = sorted(axes, reverse=True) + shape = list(input.shape) + for axis in axes: + shape.insert(axis, 1) + return layers.reshape(input, shape) + + +def gumbel_softmax(input, tau=1, eps=1e-10): + """ Basic implement of gumbel_softmax. """ + U = fluid.dygraph.to_variable(np.random.rand(*input.shape)) + # U = layers.uniform_random(input.shape, dtype=input.dtype, min=0.0, max=1.0) + # U.stop_gradient = True + gumbel = 0.0 - layers.log(eps - layers.log(U + eps)) + y = input + gumbel + return layers.softmax(y / tau) + + +def equal(x, y, dtype=None): + """ Implement equal in dygraph mode. """ + # if not isinstance(y, fluid.framework.Variable): + # y = layers.fill_constant(x.shape, x.dtype, y) + # return layers.cast(layers.equal(x, y), dtype) + if dtype is None: + dtype = "float32" + if isinstance(x, fluid.framework.Variable): + x = x.numpy() + if isinstance(y, fluid.framework.Variable): + y = y.numpy() + out = np.equal(x, y).astype(dtype) + return fluid.dygraph.to_variable(out) + + +def not_equal(x, y, dtype=None): + """ Implement not_equal in dygraph mode. """ + return 1 - equal(x, y, dtype) + + +def dropout(x, p): + """ Implement dropout function like tensorflow/pytorch. 
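+
+    Uses the "upscale_in_train" implementation, so activations are rescaled
+    during training and no extra scaling is needed at inference time.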
""" + return layers.dropout(x, p, dropout_implementation="upscale_in_train") diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/modules/layer_norm.py b/PaddleNLP/Research/Dialogue-PLATO/plato/modules/layer_norm.py new file mode 100644 index 0000000000000000000000000000000000000000..af439b12317bd18988b9b7b86cba01beb131f143 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/modules/layer_norm.py @@ -0,0 +1,91 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +LayerNorm layer. +""" + +# from paddle.fluid.dygraph import LayerNorm + +from six.moves import reduce + +import paddle.fluid as fluid +import paddle.fluid.layers as layers +from paddle.fluid.dygraph import Layer +import logging + +class LayerNorm(Layer): + """ Implement LayerNorm in dygraph mode. """ + + def __init__(self, + name_scope, + scale=True, + shift=True, + begin_norm_axis=1, + epsilon=1e-05, + param_attr=None, + bias_attr=None, + act=None): + super().__init__(name_scope) + self._scale = scale + self._shift = shift + self._begin_norm_axis = begin_norm_axis + self._epsilon = epsilon + self._param_attr = param_attr + self._bias_attr = bias_attr + self._act = act + return + + def _build_once(self, input): + """ Create parameters. """ + self._dtype = self._helper.input_dtype(input) + input_shape = input.shape + param_shape = [ + reduce(lambda x, y: x * y, input_shape[self._begin_norm_axis:]) + ] + if self._scale: + self._scale_w = self.create_parameter( + attr=self._param_attr, + shape=param_shape, + dtype=self._dtype, + default_initializer=fluid.initializer.Constant(1.0)) + else: + if self._param_attr: + logging.warn("param_attr are only avaliable with scale is True") + + if self._shift: + assert self._bias_attr is not False + self._bias_w = self.create_parameter( + attr=self._bias_attr, + shape=param_shape, + dtype=self._dtype, + is_bias=True) + else: + if self._bias_attr: + logging.warn("bias_attr are only avaliable with shift is True") + return + + def forward(self, x): + """ Forward process of LayerNorm. 
""" + mean = layers.reduce_mean(x, + dim=list(range(self._begin_norm_axis, len(x.shape))), + keep_dim=True) + shift_x = layers.elementwise_sub(x=x, y=mean, axis=0) + variance = layers.reduce_mean(layers.square(shift_x), + dim=list(range(self._begin_norm_axis, len(x.shape))), + keep_dim=True) + r_stdev = layers.rsqrt(variance + self._epsilon) + norm_x = layers.elementwise_mul(x=shift_x, y=r_stdev, axis=0) + out = layers.elementwise_mul(x=norm_x, y=self._scale_w, axis=-1) + out = layers.elementwise_add(x=out, y=self._bias_w, axis=-1) + return out diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/modules/multihead_attention.py b/PaddleNLP/Research/Dialogue-PLATO/plato/modules/multihead_attention.py new file mode 100644 index 0000000000000000000000000000000000000000..1fee956cff1a7577dd1a1784b88734aa1314dcd0 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/modules/multihead_attention.py @@ -0,0 +1,119 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +MultiheadAttention class. +""" + +import paddle.fluid as fluid +from paddle.fluid.dygraph import Layer +from paddle.fluid.dygraph import FC +import paddle.fluid.layers as layers + +import plato.modules.functions as F + + +class MultiheadAttention(Layer): + """ + Multi head attention layer. + """ + + def __init__(self, name_scope, hidden_dim, num_heads, dropout): + assert hidden_dim % num_heads == 0 + super().__init__(name_scope) + + self.hidden_dim = hidden_dim + self.num_heads = num_heads + self.head_dim = hidden_dim // num_heads + self.scale = self.head_dim ** -0.5 + self.linear_qkv = FC(name_scope=self.full_name(), + size=hidden_dim * 3, + num_flatten_dims=2) + self.linear_out = FC(name_scope=self.full_name(), + size=hidden_dim, + num_flatten_dims=2) + self.dropout = dropout + return + + def _split_heads(self, x, is_key=False): + x = layers.reshape( + x=x, shape=[0, 0, self.num_heads, self.head_dim] + ) + x = layers.transpose(x=x, perm=[0, 2, 3, 1] if is_key else [0, 2, 1, 3]) + return x + + def _merge_heads(self, x): + x = layers.transpose(x=x, perm=[0, 2, 1, 3]) + x = layers.reshape(x=x, shape=[0, 0, self.hidden_dim]) + return x + + def _attn(self, query, key, value, mask): + # shape: [batch_size, num_head, seq_len, seq_len] + scores = layers.matmul(x=query, y=key, alpha=self.scale) + + if mask is not None: + mask = F.unsqueeze(mask, [1]) + mask = layers.expand(mask, [1, self.num_heads, 1, 1]) + mask.stop_gradient = True + scores = (1 - mask) * scores + layers.scale(mask, scale=-1e10) + + attn = layers.softmax(scores, axis=-1) + attn = F.dropout(attn, self.dropout) + + if mask is not None: + attn = (1 - mask) * attn + + out = layers.matmul(x=attn, y=value) + return out + + def forward(self, inp, mask=None, cache=None): + """ Forward process of self attention. 
""" + # shape: [batch_size, seq_len, 3 * hidden_dim] + qkv = self.linear_qkv(inp) + query, key, value = layers.split(qkv, num_or_sections=3, dim=2) + + + # shape: [batch_size, num_head, seq_len, head_dim] + query = self._split_heads(query) + # shape: [batch_size, num_head, head_dim, seq_len] + key = self._split_heads(key, is_key=True) + # shape: [batch_size, num_head, seq_len, head_dim] + value = self._split_heads(value) + + if cache is not None: + if "key" in cache and "value" in cache: + key = layers.concat([cache["key"], key], axis=3) + value = layers.concat([cache["value"], value], axis=2) + cache["key"] = key + cache["value"] = value + + out = self._attn(query, key, value, mask) + out = self._merge_heads(out) + out = self.linear_out(out) + return out + + +def main(): + import numpy as np + + place = fluid.CPUPlace() + with fluid.dygraph.guard(place): + model = MultiheadAttention("MultiheadAttention", 10, 2, 0.5) + inp = np.random.rand(2, 3, 10).astype("float32") + inp = fluid.dygraph.to_variable(inp) + out = model(inp, inp, inp) + print(out) + + +if __name__ == "__main__": + main() diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/modules/parallel.py b/PaddleNLP/Research/Dialogue-PLATO/plato/modules/parallel.py new file mode 100644 index 0000000000000000000000000000000000000000..e574168120db03efa64f509d16809caec1679011 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/modules/parallel.py @@ -0,0 +1,263 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Parallel class. +""" + +from collections import OrderedDict +import os + +import numpy as np +from paddle.fluid import core +from paddle.fluid.dygraph import layers +from paddle.fluid.dygraph import parallel_helper +import paddle.fluid.framework as framework +from paddle.fluid.layers import collective +from paddle.fluid.dygraph.base import to_variable, no_grad + +ParallelStrategy = core.ParallelStrategy + + +def prepare_context(strategy=None): + """ Copy codes. """ + if strategy is None: + strategy = ParallelStrategy() + strategy.nranks = Env().nranks + strategy.local_rank = Env().local_rank + strategy.trainer_endpoints = Env().trainer_endpoints + strategy.current_endpoint = Env().current_endpoint + if strategy.nranks < 2: + return + assert framework.in_dygraph_mode() is True, \ + "dygraph.parallel.prepare_context should be used with dygrahp mode." + place = framework._current_expected_place() + assert place is not None, \ + "dygraph.parallel.prepare_context should be used in fluid.dygraph.guard(place) guard." + if isinstance(place, core.CUDAPlace): + parallel_helper._set_parallel_ctx( + core.NCCLParallelContext(strategy, place)) + else: + # TODO(Yancey1989): add Gloo Parallel Context to support CPU parallel computation + assert ("Only support CUDAPlace for now.") + parallel_helper._init_parallel_ctx() + return strategy + + +class Env(object): + """ Copy codes. 
""" + def __init__(self): + self._nranks = int(os.getenv("PADDLE_TRAINERS_NUM", "1")) + self._local_rank = int(os.getenv("PADDLE_TRAINER_ID", "0")) + self._dev_id = int(os.getenv("FLAGS_selected_gpus", "0")) + self._trainer_endpoints = os.getenv("PADDLE_TRAINER_ENDPOINTS", + "").split(",") + self._current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT", "") + + @property + def nranks(self): + """ Copy codes. """ + return self._nranks + + @property + def local_rank(self): + """ Copy codes. """ + return self._local_rank + + @property + def dev_id(self): + """ Copy codes. """ + return self._dev_id + + @property + def current_endpoint(self): + """ Copy codes. """ + return self._current_endpoint + + @property + def trainer_endpoints(self): + """ Copy codes. """ + return self._trainer_endpoints + + +class DataParallel(layers.Layer): + """ + Runs the module with data parallelism. + + Currently, DataParallel only supports to run the dynamic graph + with multi-process. The usage is: + `python -m paddle.distributed.launch --gpus 2 dynamic_graph_test.py`. + And the content of `dynamic_graph_test.py` is the code of examples. + + Examples: + .. code-block:: python + + import numpy as np + import paddle.fluid as fluid + import paddle.fluid.dygraph as dygraph + from paddle.fluid.optimizer import AdamOptimizer + from paddle.fluid.dygraph.nn import FC + from paddle.fluid.dygraph.base import to_variable + + place = fluid.CUDAPlace(0) + with fluid.dygraph.guard(place=place): + + # prepare the data parallel context + strategy=dygraph.parallel.prepare_context() + + fc_layer = FC("FC", 10, act="softmax") + adam = fluid.optimizer.AdamOptimizer() + + # make the module become the data parallelism module + fc_layer = dygraph.parallel.DataParallel(fc_layer, strategy) + + x_data = np.random.random(size=[10, 1]).astype(np.float32) + data = to_variable(x_data) + + hidden = fc_layer(data) + avg_loss = fluid.layers.mean(hidden) + + # scale the loss according to the number of trainers. + avg_loss = fc_layer.scale_loss(avg_loss) + + avg_loss.backward() + + # collect the gradients of trainers. + fc_layer.apply_collective_grads() + + adam.minimize(avg_loss) + fc_layer.clear_gradients() + + Args: + layers(Layer): The module that should be executed by data parallel. + strategy(ParallelStrategy): The strategy of data parallelism. + + Returns: + Layer: The data paralleled module. + """ + + def __init__(self, layers, strategy): + super(DataParallel, + self).__init__(layers.full_name() + "_data_parallel") + + self._layers = layers + self._strategy = strategy + + def forward(self, *inputs, **kwargs): + return self._layers(*inputs, **kwargs) + + def __call__(self, *args, **kwargs): + # Reimplement __call__ function + if not self._built: + self._built = True + + outputs = self.forward(*args, **kwargs) + return outputs + + def scale_loss(self, loss): + """ + Scale the loss. In data parallel mode, the loss should be scale with + the number of trainers. If not in data parallel mode, return the loss + directly. + + Args: + loss(Layer): The loss of the current Model. + + Returns: + Layer: the scaled loss. 
+ """ + if not self._is_data_parallel_mode(): + return loss + + loss_scale = to_variable( + np.array([self._strategy.nranks]).astype("float32")) + loss_scale.stop_gradient = True + loss = loss / loss_scale + return loss + + def _coalesce_tensors(self, var_groups): + from paddle.fluid.layers import nn + coalesced_grads_and_grad_vars = [] + for group_id, grad_vars in var_groups.items(): + flattened_vars = [] + g_var_shapes = [] + for g_var in grad_vars: + g_var_shapes.append(g_var.shape) + flattened_vars.append( + nn.reshape( + x=g_var, shape=[np.prod(g_var.shape)], inplace=True)) + coalesced_grad = nn.concat(flattened_vars) + coalesced_grads_and_grad_vars.append( + [coalesced_grad, grad_vars, g_var_shapes]) + return coalesced_grads_and_grad_vars + + def _split_tensors(self, coalesced_grads_and_grad_vars): + from paddle.fluid.layers import nn + for coalesced_grad, origin_grad_vars, grad_shapes in coalesced_grads_and_grad_vars: + grad_var_len = [np.prod(g_shape) for g_shape in grad_shapes] + self._helper.main_program.current_block().append_op( + type='split', + inputs={'X': coalesced_grad}, + outputs={'Out': origin_grad_vars}, + attrs={'sections': grad_var_len, + 'axis': 0}) + for g_var, g_shape in zip(origin_grad_vars, grad_shapes): + nn.reshape(x=g_var, shape=g_shape, inplace=True) + + @no_grad + def apply_collective_grads(self): + """ + AllReduce the Parameters' gradient. + """ + if not self._is_data_parallel_mode(): + return + + grad_var_set = set() + grad_vars = [] + for param in self._layers.parameters(): + # NOTE(zcd): The grad_ivar maybe no generated. + if param.trainable and param._grad_ivar(): + g_var = param._grad_ivar() + grad_vars.append(g_var) + assert g_var not in grad_var_set + grad_var_set.add(g_var) + + # FIXME(zcd): the type of the var should be LoDTensor, i.e + # the gradients should be dense, otherwise, the following + # logic should be updated. + # 128 MB as a group + mega_bytes = 128 * 1024 * 1024 + group_idx = 0 + memory_counter = 0 + grad_var_groups = OrderedDict() + dtype = grad_vars[0].dtype + for g_var in grad_vars: + # Note: the dtype of the same group should be the same. + bytes = np.prod(g_var.shape) * core.size_of_dtype(g_var.dtype) + if memory_counter < mega_bytes and dtype == g_var.dtype: + memory_counter += bytes + else: + memory_counter = bytes + group_idx += 1 + grad_var_groups.setdefault(group_idx, []).append(g_var) + + coalesced_grads_and_vars = self._coalesce_tensors(grad_var_groups) + + for coalesced_grad, g_vars, g_shapes in coalesced_grads_and_vars: + collective._allreduce( + coalesced_grad, coalesced_grad, sync_mode=False) + + self._split_tensors(coalesced_grads_and_vars) + + def _is_data_parallel_mode(self): + return self._strategy.nranks > 1 diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/modules/transformer_block.py b/PaddleNLP/Research/Dialogue-PLATO/plato/modules/transformer_block.py new file mode 100644 index 0000000000000000000000000000000000000000..b105c75d83560ea01c3c4464db6ce1c28c14fa43 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/modules/transformer_block.py @@ -0,0 +1,100 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +TransformerBlock class. +""" + +import paddle.fluid as fluid +from paddle.fluid.dygraph import FC +from paddle.fluid.dygraph import Layer +import paddle.fluid.layers as layers + +from plato.modules.feedforward import FeedForward +from plato.modules.layer_norm import LayerNorm +from plato.modules.multihead_attention import MultiheadAttention +import plato.modules.functions as F + + +class TransformerBlock(Layer): + """ + Transformer block module. + """ + + def __init__(self, name_scope, hidden_dim, num_heads, dropout, attn_dropout, ff_dropout): + super().__init__(name_scope) + + self.attn = MultiheadAttention(name_scope=self.full_name(), + hidden_dim=hidden_dim, + num_heads=num_heads, + dropout=attn_dropout) + self.attn_norm = LayerNorm(name_scope=self.full_name(), + begin_norm_axis=2, + epsilon=1e-12, + param_attr=fluid.ParamAttr( + regularizer=fluid.regularizer.L2Decay(0.0)), + bias_attr=fluid.ParamAttr( + regularizer=fluid.regularizer.L2Decay(0.0))) + self.ff = FeedForward(name_scope=self.full_name(), + hidden_dim=hidden_dim, + inner_dim=4 * hidden_dim, + dropout=ff_dropout) + self.ff_norm = LayerNorm(name_scope=self.full_name(), + begin_norm_axis=2, + epsilon=1e-12, + param_attr=fluid.ParamAttr( + regularizer=fluid.regularizer.L2Decay(0.0)), + bias_attr=fluid.ParamAttr( + regularizer=fluid.regularizer.L2Decay(0.0))) + self.dropout = dropout + return + + def forward(self, inp, mask=None, cache=None): + """ + Forward process on one transformer layer. + + @param : x + @type : Variable(shape: [batch_size, seq_len, hidden_size]) + + @param : memory + @type : Variable(shape: [batch_size, seq_len, hidden_size]) + + @param : mask + + @param : cache + """ + attn_out = self.attn(inp, mask, cache) + attn_out = F.dropout(attn_out, self.dropout) + attn_out = self.attn_norm(attn_out + inp) + + ff_out = self.ff(attn_out) + ff_out = F.dropout(ff_out, self.dropout) + ff_out = self.ff_norm(ff_out + attn_out) + + return ff_out + + +def main(): + import numpy as np + + place = fluid.CPUPlace() + with fluid.dygraph.guard(place): + model = TransformerBlock("TransformerBlock", 10, 2, 0.5, 0.5, 0.5) + inp = np.random.rand(2, 3, 10).astype("float32") + inp = fluid.dygraph.to_variable(inp) + out = model(inp, inp) + print(out) + + +if __name__ == "__main__": + main() diff --git a/PaddleNLP/Research/Dialogue-PLATO/plato/trainer.py b/PaddleNLP/Research/Dialogue-PLATO/plato/trainer.py new file mode 100644 index 0000000000000000000000000000000000000000..f464323f7e75677842c7272f705fab743c11249f --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/plato/trainer.py @@ -0,0 +1,362 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +""" +Trainer class. +""" + +import json +import logging +import os +import sys +import time + +import numpy as np +import paddle +import paddle.fluid as fluid +import paddle.fluid.dygraph as dygraph +from tqdm import tqdm + +from plato.args import str2bool +from plato.data.data_loader import DataLoader +from plato.metrics.metrics_tracker import MetricsTracker +from plato.metrics.metrics import bleu +from plato.metrics.metrics import distinct +import plato.modules.parallel as parallel + + +def get_logger(log_path, name="default"): + logger = logging.getLogger(name) + logger.propagate = False + logger.setLevel(logging.DEBUG) + + formatter = logging.Formatter("%(message)s") + + sh = logging.StreamHandler(sys.stdout) + sh.setFormatter(formatter) + logger.addHandler(sh) + + fh = logging.FileHandler(log_path, mode="w") + fh.setFormatter(formatter) + logger.addHandler(fh) + + return logger + + +def evaluate_generation_result(results): + tgt = [result["tgt"].split(" ") for result in results] + pred = [result["preds"][np.argmax(result["scores"])] + if isinstance(result["preds"], list) + else result["preds"] + for result in results] + pred = [p.split(" ") for p in pred] + metrics = {} + metrics_tracker = MetricsTracker() + + bleu1, bleu2 = bleu(pred, tgt) + metrics.update({"bleu_1": bleu1, "bleu_2": bleu2}) + + intra_dist1, intra_dist2, inter_dist1, inter_dist2 = distinct(pred) + metrics.update({"intra_dist_1": intra_dist1, + "intra_dist_2": intra_dist2, + "inter_dist_1": inter_dist1, + "inter_dist_2": inter_dist2}) + + avg_len = sum(map(len, pred)) / len(pred) + metrics.update({"len": avg_len}) + + metrics_tracker.update(metrics, num_samples=1) + return metrics_tracker + + +def save(model, model_path): + if isinstance(model, parallel.DataParallel): + model = model._layers + if hasattr(fluid, "save_dygraph"): + # >= 1.6.0 compatible + fluid.save_dygraph(model.state_dict(), model_path) + fluid.save_dygraph(model.optimizer.state_dict(), model_path) + else: + dygraph.save_persistables(model.state_dict(), model_path, optimizers=model.optimizer) + return + + +class Trainer(object): + + @classmethod + def add_cmdline_argument(cls, parser): + """ Add the cmdline arguments of trainer. 
""" + group = parser.add_argument_group("Trainer") + group.add_argument("--use_data_distributed", type=str2bool, default=False, + help="Whether to use data distributed for parallel training.") + group.add_argument("--valid_metric_name", type=str, default="-loss", + help="The validation metric determining which checkpoint is the best.") + group.add_argument("--num_epochs", type=int, default=10, + help="Total number of training epochs to perform.") + group.add_argument("--save_dir", type=str, required=True, + help="The output directory where the model will be saved.") + group.add_argument("--batch_size", type=int, default=8, + help="Total batch size for training/evaluation/inference.") + group.add_argument("--log_steps", type=int, default=100, + help="The number of training steps to output current metrics " + "on past training dataset.") + group.add_argument("--valid_steps", type=int, default=2000, + help="The number of training steps to perform a evaluation " + "on validation datasets.") + group.add_argument("--save_checkpoint", type=str2bool, default=True, + help="Whether to save one checkpoints for each training epoch.") + group.add_argument("--save_summary", type=str2bool, default=False, + help="Whether to save metrics summary for visualDL module.") + DataLoader.add_cmdline_argument(group) + return group + + def __init__(self, model, to_tensor, hparams, logger=None): + # Use data distributed + if hparams.use_data_distributed: + strategy = parallel.prepare_context() + if strategy is not None: + parallel_model = parallel.DataParallel(model, strategy) + model.before_backward_fn = parallel_model.scale_loss + model.after_backward_fn = parallel_model.apply_collective_grads + model = parallel_model + + self.model = model + self.to_tensor = to_tensor + + self.is_decreased_valid_metric = hparams.valid_metric_name[0] == "-" + self.valid_metric_name = hparams.valid_metric_name[1:] + self.num_epochs = hparams.num_epochs + self.save_dir = hparams.save_dir + self.log_steps = hparams.log_steps + self.valid_steps = hparams.valid_steps + self.save_checkpoint = hparams.save_checkpoint + self.save_summary = hparams.save_summary + + if not os.path.exists(self.save_dir): + os.makedirs(self.save_dir) + + self.logger = logger or get_logger(os.path.join(self.save_dir, "trainer.log"), "trainer") + + if self.save_summary: + from visualdl import LogWriter + self.summary_logger = LogWriter(os.path.join(self.save_dir, "summary"), sync_cycle=10000) + self.train_summary = {} + self.valid_summary = {} + + self.batch_metrics_tracker = MetricsTracker() + self.token_metrics_tracker = MetricsTracker() + + self.best_valid_metric = float("inf" if self.is_decreased_valid_metric else "-inf") + self.epoch = 0 + self.batch_num = 0 + + def train_epoch(self, train_iter, valid_iter, infer_iter=None, infer_parse_dict=None): + """ + Train an epoch. 
+ + @param train_iter + @type : DataLoader + + @param valid_iter + @type : DataLoader + + @param infer_iter + @type : DataLoader + + @param infer_parse_dict + @type : dict of function + """ + self.epoch += 1 + num_batches = len(train_iter) + self.batch_metrics_tracker.clear() + self.token_metrics_tracker.clear() + times = [] + for batch_id, (batch, batch_size) in enumerate(train_iter, 1): + batch = type(batch)(map(lambda kv: (kv[0], self.to_tensor(kv[1])), batch.items())) + batch["epoch"] = self.epoch + batch["num_steps"] = self.batch_num + + # Do a training iteration + start_time = time.time() + metrics = self.model(batch, is_training=True) + token_num = metrics.pop("token_num", None) + elapsed = time.time() - start_time + times.append(elapsed) + + batch_metrics = {k: v for k, v in metrics.items() if "token" not in k} + token_metrics = {k: v for k, v in metrics.items() if "token" in k} + self.batch_metrics_tracker.update(batch_metrics, batch_size) + self.token_metrics_tracker.update(token_metrics, token_num) + self.batch_num += 1 + + if self.log_steps and batch_id % self.log_steps == 0: + batch_metrics_message = self.batch_metrics_tracker.value() + token_metrics_message = self.token_metrics_tracker.value() + message_prefix = f"[Train][{self.epoch}][{batch_id}/{num_batches}]" + avg_time = f"AVG_Time-{sum(times[-self.log_steps:]) / self.log_steps:.3f}" + message = " ".join([message_prefix, batch_metrics_message, token_metrics_message, + avg_time]) + self.logger.info(message) + + if self.save_summary: + with self.summary_logger.mode("train"): + for k, v in self.batch_metrics_tracker.items(): + if k not in self.train_summary: + self.train_summary[k] = self.summary_logger.scalar(k) + scalar = self.train_summary[k] + scalar.add_record(self.batch_num, v) + for k, v in self.token_metrics_tracker.items(): + if k not in self.train_summary: + self.train_summary[k] = self.summary_logger.scalar(k) + scalar = self.train_summary[k] + scalar.add_record(self.batch_num, v) + + if self.valid_steps and valid_iter is not None and \ + batch_id % self.valid_steps == 0: + self.evaluate(valid_iter) + + if valid_iter is not None: + self.evaluate(valid_iter) + + if infer_iter is not None and infer_parse_dict is not None: + self.infer(infer_iter, infer_parse_dict) + + return + + def infer(self, data_iter, parse_dict, num_batches=None): + """ + Inference interface. + + @param : data_iter + @type : DataLoader + + @param : parse_dict + @type : dict of function + + @param : num_batches : the number of batch to infer + @type : int/None + """ + self.logger.info("Generation starts ...") + infer_save_file = os.path.join(self.save_dir, f"infer_{self.epoch}.result.json") + + # Inference + infer_results = [] + batch_cnt = 0 + begin_time = time.time() + for batch, batch_size in tqdm(data_iter, total=num_batches): + batch = type(batch)(map(lambda kv: (kv[0], self.to_tensor(kv[1])), batch.items())) + + result = self.model.infer(inputs=batch) + batch_result = {} + + def to_list(batch): + """ Parse list. 
""" + return batch.tolist() + + # parse + for k in result: + if k in parse_dict: + parse_fn = parse_dict[k] + else: + parse_fn = to_list + if result[k] is not None: + batch_result[k] = parse_fn(result[k]) + + for vs in zip(*batch_result.values()): + infer_result = {} + for k, v in zip(batch_result.keys(), vs): + infer_result[k] = v + infer_results.append(infer_result) + + batch_cnt += 1 + if batch_cnt == num_batches: + break + + self.logger.info(f"Saved inference results to {infer_save_file}") + with open(infer_save_file, "w") as fp: + json.dump(infer_results, fp, indent=2) + infer_metrics_tracker = evaluate_generation_result(infer_results) + metrics_message = infer_metrics_tracker.summary() + message_prefix = f"[Infer][{self.epoch}]" + time_cost = f"TIME-{time.time() - begin_time:.3f}" + message = " ".join([message_prefix, metrics_message, time_cost]) + self.logger.info(message) + return + + def evaluate(self, data_iter, need_save=True): + """ + Evaluation interface + + @param : data_iter + @type : DataLoader + + @param : need_save + @type : bool + """ + if isinstance(self.model, parallel.DataParallel): + need_save = need_save and parallel.Env().local_rank == 0 + + # Evaluation + begin_time = time.time() + batch_metrics_tracker = MetricsTracker() + token_metrics_tracker = MetricsTracker() + for batch, batch_size in data_iter: + batch = type(batch)(map(lambda kv: (kv[0], self.to_tensor(kv[1])), batch.items())) + metrics = self.model(batch, is_training=False) + token_num = int(metrics.pop("token_num")) + batch_metrics = {k: v for k, v in metrics.items() if "token" not in k} + token_metrics = {k: v for k, v in metrics.items() if "token" in k} + batch_metrics_tracker.update(batch_metrics, batch_size) + token_metrics_tracker.update(token_metrics, token_num) + batch_metrics_message = batch_metrics_tracker.summary() + token_metrics_message = token_metrics_tracker.summary() + message_prefix = f"[Valid][{self.epoch}]" + time_cost = f"TIME-{time.time() - begin_time:.3f}" + message = " ".join([message_prefix, batch_metrics_message, token_metrics_message, time_cost]) + self.logger.info(message) + + if need_save: + # Check valid metric + cur_valid_metric = batch_metrics_tracker.get(self.valid_metric_name) + if self.is_decreased_valid_metric: + is_best = cur_valid_metric < self.best_valid_metric + else: + is_best = cur_valid_metric > self.best_valid_metric + if is_best: + # Save current best model + self.best_valid_metric = cur_valid_metric + best_model_path = os.path.join(self.save_dir, "best.model") + save(self.model, best_model_path) + self.logger.info( + f"Saved best model to '{best_model_path}' with new best valid metric " + f"{self.valid_metric_name.upper()}-{self.best_valid_metric:.3f}") + + # Save checkpoint + if self.save_checkpoint: + model_file = os.path.join(self.save_dir, f"epoch_{self.epoch}.model") + save(self.model, model_file) + + if self.save_summary: + with self.summary_logger.mode("valid"): + for k, v in self.batch_metrics_tracker.items(): + if k not in self.valid_summary: + self.valid_summary[k] = self.summary_logger.scalar(k) + scalar = self.valid_summary[k] + scalar.add_record(self.batch_num, v) + for k, v in self.token_metrics_tracker.items(): + if k not in self.valid_summary: + self.valid_summary[k] = self.summary_logger.scalar(k) + scalar = self.valid_summary[k] + scalar.add_record(self.batch_num, v) + + return diff --git a/PaddleNLP/Research/Dialogue-PLATO/preprocess.py b/PaddleNLP/Research/Dialogue-PLATO/preprocess.py new file mode 100644 index 
0000000000000000000000000000000000000000..0e8c2bd86d13e055480666437833fa0561d02b3f --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/preprocess.py @@ -0,0 +1,66 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Preprocess script. +""" + +import os +import argparse + +from plato.args import str2bool +from plato.args import parse_args +from plato.data.dataset import Dataset +from plato.data.field import BPETextField + + +def main(): + parser = argparse.ArgumentParser() + + BPETextField.add_cmdline_argument(parser) + Dataset.add_cmdline_argument(parser) + + args = parse_args(parser) + + raw_train_file = os.path.join(args.data_dir, "dial.train") + raw_valid_file = os.path.join(args.data_dir, "dial.valid") + raw_test_file = os.path.join(args.data_dir, "dial.test") + train_file = raw_train_file + f".{args.tokenizer_type}.jsonl" + valid_file = raw_valid_file + f".{args.tokenizer_type}.jsonl" + test_file = raw_test_file + f".{args.tokenizer_type}.jsonl" + + bpe = BPETextField(args.BPETextField) + + BUILD_EXAMPLES_FN = { + "multi": bpe.build_examples_multi_turn, + "multi_knowledge": bpe.build_examples_multi_turn_with_knowledge + } + build_examples_fn = BUILD_EXAMPLES_FN[args.data_type] + + if os.path.exists(raw_valid_file) and not os.path.exists(valid_file): + valid_examples = build_examples_fn(raw_valid_file, data_type="valid") + bpe.save_examples(valid_examples, valid_file) + + if os.path.exists(raw_test_file) and not os.path.exists(test_file): + test_examples = build_examples_fn(raw_test_file, data_type="test") + bpe.save_examples(test_examples, test_file) + + if os.path.exists(raw_train_file) and not os.path.exists(train_file): + train_examples = build_examples_fn(raw_train_file, data_type="train") + bpe.save_examples(train_examples, train_file) + + return + + +if __name__ == "__main__": + main() diff --git a/PaddleNLP/Research/Dialogue-PLATO/run.py b/PaddleNLP/Research/Dialogue-PLATO/run.py new file mode 100644 index 0000000000000000000000000000000000000000..a45410d8b2e3b014a0ccf5c6b0224112b353e88f --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/run.py @@ -0,0 +1,162 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Running scripts. 
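+
+Command-line entry point: builds the BPETextField and data loaders, constructs
+the model and Trainer inside a dygraph guard, and dispatches to training
+(--do_train), evaluation (--do_test) and/or inference (--do_infer).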
+""" + +import argparse +import json +import os + +import numpy as np +import paddle.fluid as fluid + +from plato.args import parse_args +from plato.args import str2bool +from plato.data.data_loader import DataLoader +from plato.data.dataset import Dataset +from plato.data.dataset import LazyDataset +from plato.data.field import BPETextField +from plato.trainer import Trainer +from plato.models.model_base import ModelBase +from plato.models.generator import Generator +import plato.modules.parallel as parallel + + +def main(): + parser = argparse.ArgumentParser() + + parser.add_argument("--do_train", type=str2bool, default=False, + help="Whether to run trainning.") + parser.add_argument("--do_test", type=str2bool, default=False, + help="Whether to run evaluation on the test dataset.") + parser.add_argument("--do_infer", type=str2bool, default=False, + help="Whether to run inference on the test dataset.") + parser.add_argument("--num_infer_batches", type=int, default=None, + help="The number of batches need to infer.\n" + "Stay 'None': infer on entrie test dataset.") + parser.add_argument("--hparams_file", type=str, default=None, + help="Loading hparams setting from file(.json format).") + BPETextField.add_cmdline_argument(parser) + Dataset.add_cmdline_argument(parser) + Trainer.add_cmdline_argument(parser) + ModelBase.add_cmdline_argument(parser) + Generator.add_cmdline_argument(parser) + + hparams = parse_args(parser) + + if hparams.hparams_file and os.path.exists(hparams.hparams_file): + print(f"Loading hparams from {hparams.hparams_file} ...") + hparams.load(hparams.hparams_file) + print(f"Loaded hparams from {hparams.hparams_file}") + + print(json.dumps(hparams, indent=2)) + + if not os.path.exists(hparams.save_dir): + os.makedirs(hparams.save_dir) + hparams.save(os.path.join(hparams.save_dir, "hparams.json")) + + bpe = BPETextField(hparams.BPETextField) + hparams.Model.num_token_embeddings = bpe.vocab_size + + generator = Generator.create(hparams.Generator, bpe=bpe) + + COLLATE_FN = { + "multi": bpe.collate_fn_multi_turn, + "multi_knowledge": bpe.collate_fn_multi_turn_with_knowledge + } + collate_fn = COLLATE_FN[hparams.data_type] + + # Loading datasets + if hparams.do_train: + raw_train_file = os.path.join(hparams.data_dir, "dial.train") + train_file = raw_train_file + f".{hparams.tokenizer_type}.jsonl" + assert os.path.exists(train_file), f"{train_file} isn't exist" + train_dataset = LazyDataset(train_file) + train_loader = DataLoader(train_dataset, hparams.Trainer, collate_fn=collate_fn, is_train=True) + raw_valid_file = os.path.join(hparams.data_dir, "dial.valid") + valid_file = raw_valid_file + f".{hparams.tokenizer_type}.jsonl" + assert os.path.exists(valid_file), f"{valid_file} isn't exist" + valid_dataset = LazyDataset(valid_file) + valid_loader = DataLoader(valid_dataset, hparams.Trainer, collate_fn=collate_fn) + + if hparams.do_infer or hparams.do_test: + raw_test_file = os.path.join(hparams.data_dir, "dial.test") + test_file = raw_test_file + f".{hparams.tokenizer_type}.jsonl" + assert os.path.exists(test_file), f"{test_file} isn't exist" + test_dataset = LazyDataset(test_file) + test_loader = DataLoader(test_dataset, hparams.Trainer, collate_fn=collate_fn, is_test=hparams.do_infer) + + def to_tensor(array): + return fluid.dygraph.to_variable(array) + + if hparams.use_data_distributed: + place = fluid.CUDAPlace(parallel.Env().dev_id) + else: + place = fluid.CUDAPlace(0) + + with fluid.dygraph.guard(place): + # Construct Model + model = ModelBase.create("Model", hparams, 
generator=generator) + + # Construct Trainer + trainer = Trainer(model, to_tensor, hparams.Trainer) + + if hparams.do_train: + # Training process + for epoch in range(hparams.num_epochs): + trainer.train_epoch(train_loader, valid_loader) + + if hparams.do_test: + # Validation process + trainer.evaluate(test_loader, need_save=False) + + if hparams.do_infer: + # Inference process + def split(xs, sep, pad): + """ Split id list by separator. """ + out, o = [], [] + for x in xs: + if x == pad: + continue + if x != sep: + o.append(x) + else: + if len(o) > 0: + out.append(list(o)) + o = [] + if len(o) > 0: + out.append(list(o)) + assert(all(len(o) > 0 for o in out)) + return out + + def parse_context(batch): + """ Parse context. """ + return bpe.denumericalize([split(xs, bpe.eos_id, bpe.pad_id) + for xs in batch.tolist()]) + + def parse_text(batch): + """ Parse text. """ + return bpe.denumericalize(batch.tolist()) + + infer_parse_dict = { + "src": parse_context, + "tgt": parse_text, + "preds": parse_text + } + trainer.infer(test_loader, infer_parse_dict, num_batches=hparams.num_infer_batches) + + +if __name__ == "__main__": + main() diff --git a/PaddleNLP/Research/Dialogue-PLATO/scripts/DSTC7_AVSD/infer.sh b/PaddleNLP/Research/Dialogue-PLATO/scripts/DSTC7_AVSD/infer.sh new file mode 100644 index 0000000000000000000000000000000000000000..76610ebaa1b85cbb91b443ce49796060cda0d9d7 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/scripts/DSTC7_AVSD/infer.sh @@ -0,0 +1,34 @@ +#!/bin/bash +set -ux + +SAVE_DIR=outputs/DSTC7_AVSD.infer +VOCAB_PATH=model/Bert/vocab.txt +DATA_DIR=data/DSTC7_AVSD +INIT_CHECKPOINT=outputs/DSTC7_AVSD/best.model +DATA_TYPE=multi_knowledge + +# CUDA environment settings. +export CUDA_VISIBLE_DEVICES=0 + +# Paddle environment settings. +export FLAGS_fraction_of_gpu_memory_to_use=0.1 +export FLAGS_eager_delete_scope=True +export FLAGS_eager_delete_tensor_gb=0.0 + +python -u \ + ./preprocess.py \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE + +python -u \ + ./run.py \ + --do_infer true \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE \ + --batch_size 4 \ + --num_type_embeddings 3 \ + --use_discriminator true \ + --init_checkpoint $INIT_CHECKPOINT \ + --save_dir $SAVE_DIR diff --git a/PaddleNLP/Research/Dialogue-PLATO/scripts/DSTC7_AVSD/train.sh b/PaddleNLP/Research/Dialogue-PLATO/scripts/DSTC7_AVSD/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..b2bd742deb79fa0c30e0b0a16de1d5e77c9103f3 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/scripts/DSTC7_AVSD/train.sh @@ -0,0 +1,49 @@ +#!/bin/bash +set -ux + +SAVE_DIR=outputs/DSTC7_AVSD +VOCAB_PATH=model/Bert/vocab.txt +DATA_DIR=data/DSTC7_AVSD +INIT_CHECKPOINT=model/PLATO +DATA_TYPE=multi_knowledge +USE_VISUALDL=false + +# CUDA environment settings. +export CUDA_VISIBLE_DEVICES=0 + +# Paddle environment settings. +export FLAGS_fraction_of_gpu_memory_to_use=0.1 +export FLAGS_eager_delete_scope=True +export FLAGS_eager_delete_tensor_gb=0.0 + +python -u \ + ./preprocess.py \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE + +if [[ "$USE_VISUALDL" = true ]]; then + visualdl --logdir=$SAVE_DIR/summary --port=8083 --host=`hostname` & + VISUALDL_PID=$! 
+fi + +python -u \ + ./run.py \ + --do_train true \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE \ + --batch_size 4 \ + --valid_steps 2000 \ + --num_type_embeddings 3 \ + --use_discriminator true \ + --num_epoch 20 \ + --lr 1e-5 \ + --save_checkpoint false \ + --save_summary $USE_VISUALDL \ + --init_checkpoint $INIT_CHECKPOINT \ + --save_dir $SAVE_DIR + +if [[ $USE_VISUALDL = true ]]; then + kill $VISUALDL_PID +fi diff --git a/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/baseline_infer.sh b/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/baseline_infer.sh new file mode 100644 index 0000000000000000000000000000000000000000..65a8f1be1fa8e5965c64036646865e589fba0668 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/baseline_infer.sh @@ -0,0 +1,35 @@ +#!/bin/bash +set -ux + +SAVE_DIR=outputs/DailyDialog.baseline.infer +VOCAB_PATH=model/Bert/vocab.txt +DATA_DIR=data/DailyDialog +INIT_CHECKPOINT=outputs/DailyDialog.baseline/best.model +DATA_TYPE=multi + +# CUDA environment settings. +export CUDA_VISIBLE_DEVICES=0 + +# Paddle environment settings. +export FLAGS_fraction_of_gpu_memory_to_use=0.1 +export FLAGS_eager_delete_scope=True +export FLAGS_eager_delete_tensor_gb=0.0 + +python -u \ + ./preprocess.py \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE + +python -u \ + ./run.py \ + --do_infer true \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE \ + --batch_size 48 \ + --num_latent 0 \ + --num_type_embeddings 2 \ + --init_checkpoint $INIT_CHECKPOINT \ + --length_average true \ + --save_dir $SAVE_DIR diff --git a/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/baseline_train.sh b/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/baseline_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..f7593df31991e63c5627d09a8b96c7844a1afeb9 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/baseline_train.sh @@ -0,0 +1,49 @@ +#!/bin/bash +set -ux + +SAVE_DIR=outputs/DailyDialog.baseline +VOCAB_PATH=model-baseline/Bert/vocab.txt +DATA_DIR=data/DailyDialog +INIT_CHECKPOINT=model-baseline/PLATO.baseline +DATA_TYPE=multi +USE_VISUALDL=false + +# CUDA environment settings. +export CUDA_VISIBLE_DEVICES=2 + +# Paddle environment settings. +export FLAGS_fraction_of_gpu_memory_to_use=0.1 +export FLAGS_eager_delete_scope=True +export FLAGS_eager_delete_tensor_gb=0.0 + +python -u \ + ./preprocess.py \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE + +if [[ "$USE_VISUALDL" = true ]]; then + visualdl --logdir=$SAVE_DIR/summary --port=8083 --host=`hostname` & + VISUALDL_PID=$! 
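+    # The run.py call below passes --num_latent 0, i.e. no discrete latent
+    # categories, so this script trains the baseline variant without PLATO's
+    # latent variable, initialized from model-baseline/PLATO.baseline.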
+fi + +python -u \ + ./run.py \ + --do_train true \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE \ + --batch_size 2 \ + --valid_steps 2000 \ + --num_type_embeddings 2 \ + --num_latent 0 \ + --num_epoch 20 \ + --lr 1e-5 \ + --save_checkpoint false \ + --save_summary $USE_VISUALDL \ + --init_checkpoint $INIT_CHECKPOINT \ + --save_dir $SAVE_DIR + +if [[ $USE_VISUALDL = true ]]; then + kill $VISUALDL_PID +fi diff --git a/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/infer.sh b/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/infer.sh new file mode 100644 index 0000000000000000000000000000000000000000..7857a17503c78ac5db5505c2dc63eff14b3370dd --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/infer.sh @@ -0,0 +1,35 @@ +#!/bin/bash +set -ux + +SAVE_DIR=outputs/DailyDialog.infer +VOCAB_PATH=model/Bert/vocab.txt +DATA_DIR=data/DailyDialog +INIT_CHECKPOINT=outputs/DailyDialog/best.model +DATA_TYPE=multi + +# CUDA environment settings. +export CUDA_VISIBLE_DEVICES=0 + +# Paddle environment settings. +export FLAGS_fraction_of_gpu_memory_to_use=0.1 +export FLAGS_eager_delete_scope=True +export FLAGS_eager_delete_tensor_gb=0.0 + +python -u \ + ./preprocess.py \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE + +python -u \ + ./run.py \ + --do_infer true \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE \ + --batch_size 4 \ + --num_type_embeddings 2 \ + --num_latent 20 \ + --use_discriminator true \ + --init_checkpoint $INIT_CHECKPOINT \ + --save_dir $SAVE_DIR diff --git a/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/multi_gpu_train.sh b/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/multi_gpu_train.sh new file mode 100644 index 0000000000000000000000000000000000000000..446b149657cdd708ccfe7272df96c8e06dbd135d --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/multi_gpu_train.sh @@ -0,0 +1,55 @@ +#!/bin/bash +set -ux + +SAVE_DIR=outputs/DailyDialog +VOCAB_PATH=model/Bert/vocab.txt +DATA_DIR=data/DailyDialog +INIT_CHECKPOINT=model/PLATO +DATA_TYPE=multi +USE_VISUALDL=false + +# CUDA environment settings. +export CUDA_VISIBLE_DEVICES=0,1 + +# Paddle environment settings. +export FLAGS_fraction_of_gpu_memory_to_use=0.1 +export FLAGS_eager_delete_scope=True +export FLAGS_eager_delete_tensor_gb=0.0 + +if [[ ! -e $DATA_DIR/dial.train.jsonl ]]; then + python -u \ + ./preprocess.py \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE +fi + +if [[ "$USE_VISUALDL" = true ]]; then + visualdl --logdir=$SAVE_DIR/summary --port=8083 --host=`hostname` & + VISUALDL_PID=$! 
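+    # The paddle.distributed.launch command below starts one run.py worker per
+    # visible GPU (CUDA_VISIBLE_DEVICES=0,1 here); with --use_data_distributed
+    # true each worker binds to its own card via parallel.Env().dev_id, as set
+    # up in run.py above.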
+fi + +python -m \ + paddle.distributed.launch \ + --log_dir $SAVE_DIR \ + --started_port 8888 \ + ./run.py \ + --use_data_distributed true \ + --do_train true \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE \ + --batch_size 6 \ + --valid_steps 2000 \ + --num_type_embeddings 2 \ + --use_discriminator true \ + --num_epoch 20 \ + --lr 1e-5 \ + --save_checkpoint false \ + --save_summary $USE_VISUALDL \ + --init_checkpoint $INIT_CHECKPOINT \ + --save_dir $SAVE_DIR + +if [[ $USE_VISUALDL = true ]]; then + kill $VISUALDL_PID +fi diff --git a/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/topk_infer.sh b/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/topk_infer.sh new file mode 100644 index 0000000000000000000000000000000000000000..a550f40c232e0a12493300f22eacfaf40681d89a --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/topk_infer.sh @@ -0,0 +1,39 @@ +#!/bin/bash +set -ux + +SAVE_DIR=outputs/DailyDialog.infer +VOCAB_PATH=model/Bert/vocab.txt +DATA_DIR=data/DailyDialog +INIT_CHECKPOINT=outputs/DailyDialog/best.model +DATA_TYPE=multi + +# CUDA environment settings. +export CUDA_VISIBLE_DEVICES=0 + +# Paddle environment settings. +export FLAGS_fraction_of_gpu_memory_to_use=0.1 +export FLAGS_eager_delete_scope=True +export FLAGS_eager_delete_tensor_gb=0.0 + +if [[ ! -e $DATA_DIR/dial.test.jsonl ]]; then + python -u \ + ./preprocess.py \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE +fi + +python -u \ + ./run.py \ + --do_infer true \ + --generator TopKSampling \ + --top_k_num 10 \ + --sampling_temperate 0.8 \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE \ + --batch_size 16 \ + --num_type_embeddings 2 \ + --use_discriminator true \ + --init_checkpoint $INIT_CHECKPOINT \ + --save_dir $SAVE_DIR diff --git a/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/train.sh b/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..cc53d39bb8a60c11cab33938bcda9451526e1fdb --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/scripts/DailyDialog/train.sh @@ -0,0 +1,49 @@ +#!/bin/bash +set -ux + +SAVE_DIR=outputs/DailyDialog +VOCAB_PATH=model/Bert/vocab.txt +DATA_DIR=data/DailyDialog +INIT_CHECKPOINT=model/PLATO +DATA_TYPE=multi +USE_VISUALDL=false + +# CUDA environment settings. +export CUDA_VISIBLE_DEVICES=0 + +# Paddle environment settings. +export FLAGS_fraction_of_gpu_memory_to_use=0.1 +export FLAGS_eager_delete_scope=True +export FLAGS_eager_delete_tensor_gb=0.0 + +python -u \ + ./preprocess.py \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE + +if [[ "$USE_VISUALDL" = true ]]; then + visualdl --logdir=$SAVE_DIR/summary --port=8083 --host=`hostname` & + VISUALDL_PID=$! 
+fi + +python -u \ + ./run.py \ + --do_train true \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE \ + --batch_size 6 \ + --valid_steps 2000 \ + --num_type_embeddings 2 \ + --use_discriminator true \ + --num_epoch 20 \ + --lr 1e-5 \ + --save_checkpoint false \ + --save_summary $USE_VISUALDL \ + --init_checkpoint $INIT_CHECKPOINT \ + --save_dir $SAVE_DIR + +if [[ $USE_VISUALDL = true ]]; then + kill $VISUALDL_PID +fi diff --git a/PaddleNLP/Research/Dialogue-PLATO/scripts/PersonaChat/infer.sh b/PaddleNLP/Research/Dialogue-PLATO/scripts/PersonaChat/infer.sh new file mode 100644 index 0000000000000000000000000000000000000000..06aa1e3f6b5171c8efbda0476ebfe7a7369ea597 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/scripts/PersonaChat/infer.sh @@ -0,0 +1,36 @@ +#!/bin/bash +set -ux + +SAVE_DIR=outputs/PersonaChat.infer +VOCAB_PATH=model/Bert/vocab.txt +DATA_DIR=data/PersonaChat +INIT_CHECKPOINT=outputs/PersonaChat/best.model +DATA_TYPE=multi_knowledge + +# CUDA environment settings. +export CUDA_VISIBLE_DEVICES=0 + +# Paddle environment settings. +export FLAGS_fraction_of_gpu_memory_to_use=0.1 +export FLAGS_eager_delete_scope=True +export FLAGS_eager_delete_tensor_gb=0.0 + +python -u \ + ./preprocess.py \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE + +python -u \ + ./run.py \ + --do_infer true \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE \ + --batch_size 2 \ + --num_type_embeddings 3 \ + --use_discriminator true \ + --init_checkpoint $INIT_CHECKPOINT \ + --save_dir $SAVE_DIR + +python -u ./tools/knowledge_f1.py $SAVE_DIR/infer_0.result.json $DATA_DIR/dial.test diff --git a/PaddleNLP/Research/Dialogue-PLATO/scripts/PersonaChat/train.sh b/PaddleNLP/Research/Dialogue-PLATO/scripts/PersonaChat/train.sh new file mode 100644 index 0000000000000000000000000000000000000000..480024f4ce85bda65251b1674108295b23ced985 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/scripts/PersonaChat/train.sh @@ -0,0 +1,49 @@ +#!/bin/bash +set -ux + +SAVE_DIR=outputs/PersonaChat +VOCAB_PATH=model/Bert/vocab.txt +DATA_DIR=data/PersonaChat +INIT_CHECKPOINT=model/PLATO +DATA_TYPE=multi_knowledge +USE_VISUALDL=false + +# CUDA environment settings. +export CUDA_VISIBLE_DEVICES=0 + +# Paddle environment settings. +export FLAGS_fraction_of_gpu_memory_to_use=0.1 +export FLAGS_eager_delete_scope=True +export FLAGS_eager_delete_tensor_gb=0.0 + +python -u \ + ./preprocess.py \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE + +if [[ "$USE_VISUALDL" = true ]]; then + visualdl --logdir=$SAVE_DIR/summary --port=8083 --host=`hostname` & + VISUALDL_PID=$! 
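+    # Like DSTC7_AVSD, this knowledge-grounded corpus is trained with
+    # --num_type_embeddings 3 (DailyDialog uses 2); the extra type embedding
+    # presumably marks the persona/knowledge segment of the input.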
+fi + +python -u \ + ./run.py \ + --do_train true \ + --vocab_path $VOCAB_PATH \ + --data_dir $DATA_DIR \ + --data_type $DATA_TYPE \ + --batch_size 4 \ + --valid_steps 2000 \ + --num_type_embeddings 3 \ + --use_discriminator true \ + --num_epoch 20 \ + --lr 1e-5 \ + --save_checkpoint false \ + --save_summary $USE_VISUALDL \ + --init_checkpoint $INIT_CHECKPOINT \ + --save_dir $SAVE_DIR + +if [[ $USE_VISUALDL = true ]]; then + kill $VISUALDL_PID +fi diff --git a/PaddleNLP/Research/Dialogue-PLATO/tools/dstc7_avsd_eval.py b/PaddleNLP/Research/Dialogue-PLATO/tools/dstc7_avsd_eval.py new file mode 100644 index 0000000000000000000000000000000000000000..01d39ff119c812311926d90373ae8282bf74ac0d --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/tools/dstc7_avsd_eval.py @@ -0,0 +1,90 @@ +import sys +import math +import json + +import numpy as np + +from pycocoevalcap.bleu.bleu import Bleu +from pycocoevalcap.rouge.rouge import Rouge +from pycocoevalcap.cider.cider import Cider +from pycocoevalcap.meteor.meteor import Meteor + +def_scorers = [ + (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]), + (Meteor(),"METEOR"), + (Rouge(), "ROUGE_L"), + (Cider(), "CIDEr") +] + +best_scorers = [ + (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]), + (Meteor(),"METEOR"), + (Rouge(), "ROUGE_L") +] + +def score_fn(ref, sample, scorers=def_scorers): + # ref and sample are both dict + + final_scores = {} + for scorer, method in scorers: + # print('computing %s score with COCO-EVAL...'%(scorer.method())) + score, scores = scorer.compute_score(ref, sample) + if type(score) == list: + for m, s in zip(method, score): + final_scores[m] = s + else: + final_scores[method] = score + return final_scores + +from collections import defaultdict +chosen_by_scores = defaultdict(int) +chosen_by_best = defaultdict(int) + +acc = 0 + +with open(sys.argv[1]) as file: + datas = json.load(file) + +cnt = 0 +all_refs = dict() +all_cands = dict() + +for data in datas: + ref = list(map(lambda x : x.strip(), data['tgt'].split('|'))) + + # if False: + best_pred = '' + best_score = -1e9 + best_idx = -1 + for i, pred in enumerate(data['preds']): + refs = dict() + cands = dict() + refs[0] = ref + cands[0] = [pred] + ret = score_fn(refs, cands, best_scorers) + score = sum(map(lambda x : ret[x], ret)) + if score > best_score: + best_idx = i + best_score = score + best_pred = pred + chosen_by_best[best_idx] += 1 + + idx = np.argmax(data['scores']) + chosen_by_scores[idx] += 1 + chosen_pred = data['preds'][idx] + + if idx == best_idx: + acc += 1 + + all_refs[cnt] = ref + all_cands[cnt] = [chosen_pred] + cnt += 1 + +print(f"Acc: {acc / len(datas)}") +for i in range(20): + print(f"{i} {chosen_by_scores[i]} {chosen_by_best[i]}" + f" {chosen_by_scores[i] / len(datas):.4f}" + f" {chosen_by_scores[i] / chosen_by_best[i]:.4f}") +res = score_fn(all_refs, all_cands) +for name in res: + print(f"{name}: {res[name]:.4f}") diff --git a/PaddleNLP/Research/Dialogue-PLATO/tools/knowledge_f1.py b/PaddleNLP/Research/Dialogue-PLATO/tools/knowledge_f1.py new file mode 100644 index 0000000000000000000000000000000000000000..fa1a6946954f423a1df279a6c556961fe7966dc0 --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/tools/knowledge_f1.py @@ -0,0 +1,69 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Calculate Knowledge f1. +""" + +import sys +import json + +import numpy as np + +eval_file = sys.argv[1] +test_file = sys.argv[2] + +cnt = 0 +res = 0.0 +r = 0.0 +p = 0.0 +stopwords = set() +with open("./tools/stopwords.txt") as f: + for line in f: + word = line.strip() + stopwords.add(word) + +with open(eval_file) as f: + for result, line in zip(json.load(f), open(test_file)): + cnt += 1 + if "scores" in result: + pred = result["preds"][np.argmax(result["scores"])] + else: + pred = result["preds"][0] + knowledges, _, reply = line.strip().split('\t') + + words = set() + for sent in knowledges.split(" __eou__ "): + for word in sent.split(): + words.add(word) + words = words - stopwords + k_len = len(words) + + pred1 = set(pred.split()) + pred1 = pred1 - stopwords + pred_len = len(pred1) + overlap = len(words & pred1) + + if overlap == 0: + continue + + recall = float(overlap) / k_len + r += recall + precison = float(overlap) / pred_len + p += precison + res += 2*recall*precison/(recall+precison) +print(f"Recall:{r/cnt}") +print(f"Precison:{p/cnt}") +print(f"F1:{res/cnt}") +print("Recall/Precision/F1:{:0,.4f}/{:0,.4f}/{:0,.4f}".format(r/cnt, p/cnt, res/cnt)) + diff --git a/PaddleNLP/Research/Dialogue-PLATO/tools/stopwords.txt b/PaddleNLP/Research/Dialogue-PLATO/tools/stopwords.txt new file mode 100644 index 0000000000000000000000000000000000000000..08777dd3195a4e688b7d202692c37842d70dffcd --- /dev/null +++ b/PaddleNLP/Research/Dialogue-PLATO/tools/stopwords.txt @@ -0,0 +1,284 @@ +a +according +about +above +across +after +again +against +all +almost +alone +along +already +also +although +always +among +an +and +another +any +are +around +as +ask +asked +asking +asks +at +away +b +back +backed +backing +backs +be +became +because +become +becomes +been +began +being +beings +between +both +but +by +c +can +cannot +certain +certainly +come +could +d +did +differ +different +differently +do +does +done +during +e +each +either +even +evenly +ever +every +f +felt +find +finds +for +from +further +furthered +furthering +furthers +g +gave +general +generally +get +gets +give +given +gives +go +going +got +h +had +has +have +having +he +her +here +herself +him +himself +his +how +however +i +if +in +into +is +it +its +itself +j +just +k +keep +keeps +kind +knew +know +known +knows +l +let +lets +likely +m +may +me +might +mostly +much +must +my +myself +n +need +needed +needing +needs +never +no +nobody +non +noone +not +nothing +now +nowhere +o +of +on +once +one +only +or +other +others +our +out +over +overall +p +per +perhaps +put +puts +q +r +rather +really +s +seem +seemed +seeming +seems +shall +she +should +showed +showing +shows +since +so +some +still +still +such +sure +t +take +taken +than +that +the +their +them +then +there +therefore +particularly +nevertheless +these +they +thing +things +think +thinks +this +those +though +thought +thoughts +through +thus +try +trying +tried +to +anyway +anymore +together +too +took +toward +u +under +until +up +upon +us +use +used +uses +v +very +w +want +wanted +wanting +wants +was +way +ways +we +well +wells +went +were +what +when 
+where +whether +which +while +who +whole +whose +why +will +with +within +would +x +y +yet +you +your +yours +z +. +am +like +love +favorite +work +, +enjoy +'m +'re +great diff --git a/PaddleNLP/Research/EMNLP2019-MAL/README.md b/PaddleNLP/Research/EMNLP2019-MAL/README.md new file mode 100755 index 0000000000000000000000000000000000000000..9100f30f0f1ca8ffe7d8e78f76f26b121ebbdb77 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/README.md @@ -0,0 +1,64 @@ +# Multi-agent Learning for Neural Machine Translation(MAL) + +## 简介 + +MAL是百度翻译团队近期提出的首个多智能体端到端联合学习框架,该框架显著提升了单智能体学习能力,在多个机器翻译测试集上刷新了当前最好结果。 该框架投稿并被EMNLP2019录用 [Multi-agent Learning for Neural Machine Translation](https://www.aclweb.org/anthology/D19-1079.pdf)。 具体结构如下: + +

+
+MAL整体框架 +

+ +这个repo包含了PaddlePaddle版本的MAL实现,框架在论文的基础上做了一些修改,在WMT英德2014测试集上BLEU达到30.04,超过了论文中的结果,在不改变模型结构的基础上,刷新了SOTA。 + +### 实验结果 + +#### WMT 英德 + +| Models | En-De | +| :------------- | :---------: | +| [ConvS2S](https://pdfs.semanticscholar.org/bb3e/bc09b65728d6eced04929df72a006fb5210b.pdf) | 25.20 | +| [Transformer](https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf) | 28.40 | +| [Rel](https://www.aclweb.org/anthology/N18-2074.pdf) | 29.20 | +| [DynamicConv](https://openreview.net/pdf?id=SkVhlh09tX) | 29.70 | +| L2R | 28.88 | +| MAL-L2R | **30.04** | + + +## 运行 + +### 环境 + +运行环境需要满足如下要求: ++ python 2.7 ++ paddlepaddle-gpu (1.6.1) + + CUDA, CuDNN and NCCL (CUDA 9.0, CuDNN v7 and NCCL 2.3.5) + + WMT英德的实验结果复现需要56张 32G V100, 运行30W步左右。 + +### 数据准备 + 运行get_data.sh脚本拉取原始数据并做预处理,形成训练需要的文件格式 + ``` + sh get_data.sh + ``` +### 模型运行 + 在运行前,需要配置CUDA, CuDNN, NCCL的路径,具体路径修改在env/env.sh + 调用train.sh运行MAL,产出的模型在output下,模型会边训练,边预测,针对训练过程中解码出来的文件,可以调用evaluate.sh来测BLEU + 在train.sh中有个参数是distributed_args,这里需要使用者根据自身机器的情况来改变,需要修改的有nproc_per_node和selected_gpus,nproc_per_node代表每台机器需要使用几张卡,selected_gpus为gpu的卡号,例如一台8卡的v100,使用8张卡跑训练,那么nproc_per_node设置为8,selected_gpus为0, 1, 2, 3, 4, 5, 6, 7 + ``` + sh train.sh ip1,ip2,ip3...(机器的ip地址,不要写127.0.0.1,填写hostname -i的结果) & + sh evaluate.sh file_path(预测出的文件,在output路径下) + ``` +### 复现论文中结果 + 我们提供了MAL在英德任务上训练出的模型,调用infer.sh可以观察到最终结果(因为测试集需要提前生成,所以在调用infer.sh前,请先调用get_data.sh,同时也需要设置好CUDA, CuDNN路径) + ``` + sh infer.sh + ``` + +### 代码结构 + 我们主要的代码均在src文件夹中 + train.py 训练的入口文件 + infer.py 模型预测入口 + config.py 定义了该项目模型的相关配置,包括具体模型类别、以及模型的超参数 + reader.py 定义了读入数据的功能 + bleu_hook.py BLEU计算脚本 diff --git a/PaddleNLP/Research/EMNLP2019-MAL/env/cloud_job_conf.conf b/PaddleNLP/Research/EMNLP2019-MAL/env/cloud_job_conf.conf new file mode 100644 index 0000000000000000000000000000000000000000..1eb89d774fdea3e762084d303ccd2db51c338a8d --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/env/cloud_job_conf.conf @@ -0,0 +1,47 @@ +#### +################################## User Define Configuration ########################### +################################## Data Configuration ################################## +#type of storage cluster +#storage_type = "afs" +#attention: files for training should be put on hdfs +##the list contains all file locations should be specified here +#fs_name = "afs://xingtian.afs.baidu.com:9902" +##If force_reuse_output_path is True ,paddle will remove output_path without check output_path exist +#force_reuse_output_path = "True" +##ugi of hdfs +#fs_ugi = "NLP_KM_Data,NLP_km_2018" + +#the initial model path on hdfs used to init parameters +#init_model_path= + +#the initial model path for pservers +#pserver_model_dir= + +#which pass +#pserver_model_pass= + +#example of above 2 args: +#if set pserver_model_dir to /app/paddle/models +#and set pserver_model_pass to 123 +#then rank 0 will download model from /app/paddle/models/rank-00000/pass-00123/ +#and rank 1 will download model from /app/paddle/models/rank-00001/pass-00123/, etc. 
+##train data path on hdfs +#train_data_path = "/user/NLP_KM_Data/gongweibao/transformer/paddle_training_data/train_data" +##test data path on hdfs, can be null or not setted +#test_data_path = "/app/inf/mpi/bml-guest/paddle-platform/dataset/mnist/data/test/" +#the output directory on hdfs +#output_path = "/user/NLP_KM_Data/gongweibao/transformer/output" +#add datareader to thirdparty +#thirdparty_path = "/user/NLP_KM_Data/gongweibao/transformer/thirdparty" +FLAGS_rpc_deadline=3000000 +#whl_name=paddlepaddle_ab57d3_post97_gpu-0.0.0-cp27-cp27mu-linux_x86_64.whl +#dataset_path=/user/NLP_KM_Data/gongweibao/transformer/small/paddle_training_data + +PROFILE=0 +FUSE=1 +NCCL_COMM_NUM=2 +NUM_THREADS=3 +USE_HIERARCHICAL_ALLREDUCE=True +NUM_CARDS=8 +NUM_EPOCHS=100 +BATCH_SIZE=4096 diff --git a/PaddleNLP/Research/EMNLP2019-MAL/env/env.sh b/PaddleNLP/Research/EMNLP2019-MAL/env/env.sh new file mode 100644 index 0000000000000000000000000000000000000000..6822d193f6bdfe2f1ccbe3d4e793c25684618537 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/env/env.sh @@ -0,0 +1,26 @@ +#!/bin/bash +export BASE_PATH="$PWD" + +#NCCL +export NCCL_DEBUG=INFO +export NCCL_IB_GID_INDEX=3 +#export NCCL_IB_RETRY_CNT=0 + +#PADDLE +export FLAGS_fraction_of_gpu_memory_to_use=0.98 +export FLAGS_sync_nccl_allreduce=0 +export FLAGS_eager_delete_tensor_gb=0.0 + +#Cudnn +#export FLAGS_cudnn_exhaustive_search=1 +export LD_LIBRARY_PATH=/home/work/cuda-9.0/lib64:$LD_LIBRARY_PATH +export LD_LIBRARY_PATH=/home/work/cudnn/cudnn_v7/cuda/lib64:$LD_LIBRARY_PATH +export LD_LIBRARY_PATH="${BASE_PATH}/nccl_2.3.5/lib/:$LD_LIBRARY_PATH" +#proxy +unset https_proxy http_proxy + +# GLOG +export GLOG_v=1 +#export GLOG_vmodule=fused_all_reduce_op_handle=10,all_reduce_op_handle=10,alloc_continuous_space_op=10,fuse_all_reduce_op_pass=10,alloc_continuous_space_for_grad_pass=10,fast_threaded_ssa_graph_executor=10,threaded_ssa_graph_executor=10,backward_op_deps_pass=10,graph=10 +export GLOG_logtostderr=1 + diff --git a/PaddleNLP/Research/EMNLP2019-MAL/env/utils.sh b/PaddleNLP/Research/EMNLP2019-MAL/env/utils.sh new file mode 100644 index 0000000000000000000000000000000000000000..46038e8f361ebf8506989c1f9e674c41776d4eb8 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/env/utils.sh @@ -0,0 +1,44 @@ +#!/bin/bash + +set -u + +function check_iplist() { + + if [ ${iplist:-} ]; then + #paddle envs + export PADDLE_PSERVER_PORT=9184 + export PADDLE_TRAINER_IPS=${iplist} + #export PADDLE_CURRENT_IP=`/sbin/ip a | grep inet | grep global | awk '{print $2}' | sed 's/\/[0-9][0-9].*$//g'` + export PADDLE_CURRENT_IP=`hostname -i` + + iparray=(${iplist//,/ }) + for i in "${!iparray[@]}"; do + echo $i + if [ ${iparray[$i]} == ${PADDLE_CURRENT_IP} ]; then + export PADDLE_TRAINER_ID=$i + fi + done + + export TRAINING_ROLE=TRAINER + #export PADDLE_PSERVERS=127.0.0.1 + export PADDLE_INIT_TRAINER_COUNT=${#iparray[@]} + export PADDLE_PORT=${PADDLE_PSERVER_PORT} + export PADDLE_TRAINERS=${PADDLE_TRAINER_IPS} + export POD_IP=${PADDLE_CURRENT_IP} + export PADDLE_TRAINERS_NUM=${PADDLE_INIT_TRAINER_COUNT} + #is local + export PADDLE_IS_LOCAL=0 + echo "****************************************************" + + #paddle debug envs + export GLOG_v=0 + export GLOG_logtostderr=1 + + #nccl debug envs + export NCCL_DEBUG=INFO + #export NCCL_IB_DISABLE=1 + #export NCCL_IB_GDR_LEVEL=4 + export NCCL_IB_GID_INDEX=3 + #export NCCL_SOCKET_IFNAME=eth2 + fi +} diff --git a/PaddleNLP/Research/EMNLP2019-MAL/evaluate.sh b/PaddleNLP/Research/EMNLP2019-MAL/evaluate.sh new file mode 100755 index 
0000000000000000000000000000000000000000..822c0ce55f3b2fa7e614ebb937f2b883fcdf6089 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/evaluate.sh @@ -0,0 +1,9 @@ +#! /bin/sh + +path=$1 + +python ./src/id2word.py data/vocab.source.32000 < ${path} > ${path}_word +head -n 3003 ${path}_word > ${path}_word_tmp +mv ${path}_word_tmp ${path}_word +cat ${path}_word | sed 's/@@ //g' > ${path}.trans.post +python ./src/bleu_hook.py --reference wmt16_en_de/newstest2014.tok.de --translation ${path}.trans.post diff --git a/PaddleNLP/Research/EMNLP2019-MAL/get_data.sh b/PaddleNLP/Research/EMNLP2019-MAL/get_data.sh new file mode 100755 index 0000000000000000000000000000000000000000..9cb421d71e6ac3546acf6b7dd222e964b4def443 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/get_data.sh @@ -0,0 +1,34 @@ +#! /bin/sh + +tmp_dir=wmt16_en_de +data_dir=data +source_file=train.tok.clean.bpe.32000.en +target_file=train.tok.clean.bpe.32000.de +source_vocab_size=32000 +target_vocab_size=32000 +num_shards=100 + +if [ ! -d wmt16_en_de ] +then + mkdir wmt16_en_de +fi + +wget https://baidu-nlp.bj.bcebos.com/EMNLP2019-MAL/wmt16_en_de.tar.gz -O wmt16_en_de/wmt16_en_de.tar.gz +tar -zxf wmt16_en_de/wmt16_en_de.tar.gz -C wmt16_en_de + +if [ ! -d $data_dir ] +then + mkdir data +fi + +if [ ! -d testset ] +then + mkdir testset +fi + + +cp wmt16_en_de/vocab.bpe.32000 data/vocab.source.32000 + +python ./src/gen_records.py --tmp_dir ${tmp_dir} --data_dir ${data_dir} --source_train_files ${source_file} --target_train_files ${target_file} --source_vocab_size ${source_vocab_size} --target_vocab_size ${target_vocab_size} --num_shards ${num_shards} --token True --onevocab True + +python ./src/preprocess/gen_utils.py --vocab $data_dir/vocab.source.${source_vocab_size} --testset ${tmp_dir}/newstest2014.tok.bpe.32000.en --output ./testset/testfile diff --git a/PaddleNLP/Research/EMNLP2019-MAL/images/arch.png b/PaddleNLP/Research/EMNLP2019-MAL/images/arch.png new file mode 100644 index 0000000000000000000000000000000000000000..4c0b0f1385458184fb0eee92e6440235a5ff5666 Binary files /dev/null and b/PaddleNLP/Research/EMNLP2019-MAL/images/arch.png differ diff --git a/PaddleNLP/Research/EMNLP2019-MAL/infer.sh b/PaddleNLP/Research/EMNLP2019-MAL/infer.sh new file mode 100755 index 0000000000000000000000000000000000000000..4a9bf970930e526cb085e4f8428c78675bae3225 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/infer.sh @@ -0,0 +1,32 @@ +#! 
/bin/sh + +export LD_LIBRARY_PATH=/home/work/cuda-9.0/lib64:/home/work/cudnn/cudnn_v7/cuda/lib64:/home/work/cuda-9.0/extras/CUPTI/lib64:$LD_LIBRARY_PATH + +wget https://baidu-nlp.bj.bcebos.com/EMNLP2019-MAL/checkpoint.best.tgz +tar -zxf checkpoint.best.tgz + +infer(){ + CUDA_VISIBLE_DEVICES=$1 python -u src/infer.py \ + --val_file_pattern $3 \ + --vocab_size $4 \ + --special_token '' '' '' \ + --use_mem_opt True \ + --use_delay_load True \ + --infer_batch_size 16 \ + --decode_alpha 0.3 \ + d_model 1024 \ + d_inner_hid 4096 \ + n_head 16 \ + prepostprocess_dropout 0.0 \ + attention_dropout 0.0 \ + relu_dropout 0.0 \ + model_path $2 \ + beam_size 4 \ + max_out_len 306 \ + max_length 256 +} + +infer 0 checkpoint.best testset/testfile 37007 + +sh evaluate.sh trans/forward_checkpoint.best +grep "BLEU_cased" trans/* diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/args.py b/PaddleNLP/Research/EMNLP2019-MAL/src/args.py new file mode 100644 index 0000000000000000000000000000000000000000..e28bea6d9864f5269312c69c5d598d74845df0f7 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/args.py @@ -0,0 +1,70 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import six +import argparse + + +def str2bool(v): + """ + because argparse does not support to parse "true, False" as python + boolean directly + """ + return v.lower() in ("true", "t", "1") + + +class ArgumentGroup(object): + """ + ArgumentGroup + """ + def __init__(self, parser, title, des): + self._group = parser.add_argument_group(title=title, description=des) + + def add_arg(self, name, type, default, help, positional_arg=False, **kwargs): + """ + add_arg + """ + prefix = "" if positional_arg else "--" + type = str2bool if type == bool else type + self._group.add_argument( + prefix + name, + default=default, + type=type, + help=help + ' Default: %(default)s.', + **kwargs) + + +def print_arguments(args): + """ + print_arguments + """ + print('----------- Configuration Arguments -----------') + for arg, value in sorted(six.iteritems(vars(args))): + print('%s: %s' % (arg, value)) + print('------------------------------------------------') + + +def inv_arguments(args): + """ + inv_arguments + """ + print('[Warning] Only keyword argument type is supported.') + args_list = [] + for arg, value in sorted(six.iteritems(vars(args))): + args_list.extend(['--' + str(arg), str(value)]) + return args_list diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/attention.py b/PaddleNLP/Research/EMNLP2019-MAL/src/attention.py new file mode 100644 index 0000000000000000000000000000000000000000..643cefe864b493e7ba60359862bde5090071c168 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/attention.py @@ -0,0 +1,133 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License + +import paddle.fluid as fluid +import paddle.fluid.layers as layers +from paddle.fluid.layer_helper import LayerHelper as LayerHelper + +def generate_relative_positions_matrix(length, max_relative_position, cache=False): + if not cache: + range_vec = layers.range(0, length, 1, 'int32') + range_vec.stop_gradient = True + shapes = layers.shape(range_vec) + range_vec = layers.reshape(range_vec, shape=[1, shapes[0]]) + range_mat = layers.expand(range_vec, [shapes[0], 1]) + distance_mat = range_mat - layers.transpose(range_mat, [1, 0]) + else: + distance_mat = layers.range(-1 * length+1, 1, 1, 'int32') + distance_mat.stop_gradient = True + shapes = layers.shape(distance_mat) + distance_mat = layers.reshape(distance_mat, [1, shapes[0]]) + + distance_mat_clipped = layers.clip(layers.cast(distance_mat, dtype="float32"), float(-max_relative_position), float(max_relative_position)) + final_mat = layers.cast(distance_mat_clipped, dtype = 'int32') + max_relative_position + return final_mat + + +def generate_relative_positions_embeddings(length, depth, max_relative_position, name, cache=False): + relative_positions_matrix = generate_relative_positions_matrix( + length, max_relative_position, cache=cache) + + y = layers.reshape(relative_positions_matrix, [-1]) + y.stop_gradient = True + vocab_size = max_relative_position * 2 + 1 + #embeddings_table = layers.create_parameter(shape=[vocab_size, depth], dtype='float32', default_initializer=fluid.initializer.Constant(1.2345), name=name) + embeddings_table = layers.create_parameter(shape=[vocab_size, depth], dtype='float32', name=name) + #layers.Print(embeddings_table, message = "embeddings_table=====") + embeddings_1 = layers.gather(embeddings_table, y) + embeddings = layers.reshape(embeddings_1, [-1, length, depth]) + return embeddings + + +def _relative_attention_inner(q, k, v, transpose): + batch_size = layers.shape(q)[0] + heads = layers.shape(q)[1] + length = layers.shape(q)[2] + + xy_matmul = layers.matmul(q, k, transpose_y=transpose) + x_t = layers.transpose(q, [2, 0, 1, 3]) + x_t_r = layers.reshape(x_t, [length, batch_size * heads, -1]) + x_tz_matmul = layers.matmul(x_t_r, v, transpose_y = transpose) + x_tz_matmul_r = layers.reshape(x_tz_matmul, [length, batch_size, heads, -1]) + x_tz_matmul_r_t = layers.transpose(x_tz_matmul_r, [1, 2, 0, 3]) + return xy_matmul + x_tz_matmul_r_t + +def _dot_product_relative(q, k, v, bias, dropout=0.1, cache=None, params_type="normal"): + depth_constant = int(k.shape[3]) + heads = layers.shape(k)[1] + length = layers.shape(k)[2] + + max_relative_position = 4 + pre_name = "relative_positions_" + if params_type == "fixed": + pre_name = "fixed_relative_positions_" + elif params_type == "new": + pre_name = "new_relative_positions_" + relations_keys = generate_relative_positions_embeddings( + length, depth_constant, max_relative_position, name=pre_name + "keys", + cache=cache is not None) + + relations_values = generate_relative_positions_embeddings( + length, depth_constant, 
max_relative_position, + name = pre_name + "values", + cache=cache is not None) + + logits = _relative_attention_inner(q, k, relations_keys, True) + + if bias is not None: logits += bias + weights = layers.softmax(logits, name = "attention_weights") + weights = layers.dropout(weights, dropout_prob=float(dropout)) + output = _relative_attention_inner(weights, v, relations_values, False) + return output + +def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key ** -0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout( + weights, + dropout_prob=dropout_rate, + seed=ModelHyperParams.dropout_seed, + is_test=False, dropout_implementation='upscale_in_train') + out = layers.matmul(weights, v) + return out + +if __name__ == "__main__": + batch_size = 2 + heads = 8 + length = 5 + depth = 3 + cpu = fluid.core.CPUPlace() + exe = fluid.Executor(cpu) + startup_prog = fluid.Program() + train_prog = fluid.Program() + with fluid.program_guard(train_prog, startup_prog): + with fluid.unique_name.guard("forward"): + x = layers.reshape(layers.cast(layers.range(0, 18, 1, "int32"), dtype = "float32"), shape =[-1, 3, 3]) + y = layers.reshape(layers.cast(layers.range(0, 2, 1, "int32"), dtype = "float32"), shape =[-1, 1]) + z = x * y + + exe.run(startup_prog) + outs = exe.run(train_prog, fetch_list=[x, y, z]) + print outs[0] + print outs[1] + print outs[2] + + diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/beam_search.py b/PaddleNLP/Research/EMNLP2019-MAL/src/beam_search.py new file mode 100644 index 0000000000000000000000000000000000000000..2673baf54099999d6cc472be061ea56af1e68c96 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/beam_search.py @@ -0,0 +1,206 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + +INF = 1. 
* 1e9 + +class BeamSearch(object): + """ + beam_search class + """ + def __init__(self, beam_size, batch_size, alpha, vocab_size, hidden_size): + self.beam_size = beam_size + self.batch_size = batch_size + self.alpha = alpha + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.gather_top2k_append_index = layers.range(0, 2 * self.batch_size * beam_size, 1, 'int64') // \ + (2 * self.beam_size) * (self.beam_size) + + self.gather_topk_append_index = layers.range(0, self.batch_size * beam_size, 1, 'int64') // \ + self.beam_size * (2 * self.beam_size) + + self.gather_finish_topk_append_index = layers.range(0, self.batch_size * beam_size, 1, 'int64') // \ + self.beam_size * (3 * self.beam_size) + + self.eos_id = layers.fill_constant([self.batch_size, 2 * self.beam_size], 'int64', value=1) + self.get_alive_index = layers.range(0, self.batch_size, 1, 'int64') * self.beam_size + + + def gather_cache(self, kv_caches, select_id): + """ + gather cache + """ + for index in xrange(len(kv_caches)): + kv_cache = kv_caches[index] + select_k = layers.gather(kv_cache['k'], [select_id]) + select_v = layers.gather(kv_cache['v'], [select_id]) + layers.assign(select_k, kv_caches[index]['k']) + layers.assign(select_v, kv_caches[index]['v']) + + + # topk_seq, topk_scores, topk_log_probs, topk_finished, cache + def compute_topk_scores_and_seq(self, sequences, scores, scores_to_gather, flags, pick_finish=False, cache=None): + """ + compute_topk_scores_and_seq + """ + topk_scores, topk_indexes = layers.topk(scores, k=self.beam_size) #[batch_size, beam_size] + if not pick_finish: + flat_topk_indexes = layers.reshape(topk_indexes, [-1]) + self.gather_topk_append_index + flat_sequences = layers.reshape(sequences, [2 * self.batch_size * self.beam_size, -1]) + else: + flat_topk_indexes = layers.reshape(topk_indexes, [-1]) + self.gather_finish_topk_append_index + flat_sequences = layers.reshape(sequences, [3 * self.batch_size * self.beam_size, -1]) + + topk_seq = layers.gather(flat_sequences, [flat_topk_indexes]) + topk_seq = layers.reshape(topk_seq, [self.batch_size, self.beam_size, -1]) + + flat_flags = layers.reshape(flags, [-1]) + topk_flags = layers.gather(flat_flags, [flat_topk_indexes]) + topk_flags = layers.reshape(topk_flags, [-1, self.beam_size]) + + flat_scores = layers.reshape(scores_to_gather, [-1]) + topk_gathered_scores = layers.gather(flat_scores, [flat_topk_indexes]) + topk_gathered_scores = layers.reshape(topk_gathered_scores, [-1, self.beam_size]) + + if cache: + self.gather_cache(cache, flat_topk_indexes) + + return topk_seq, topk_gathered_scores, topk_flags, cache + + + def grow_topk(self, i, logits, alive_seq, alive_log_probs, cache, enc_output, enc_bias): + """ + grow_topk + """ + logits = layers.reshape(logits, [self.batch_size, self.beam_size, -1]) + + candidate_log_probs = layers.log(layers.softmax(logits, axis=2)) + log_probs = candidate_log_probs + layers.unsqueeze(alive_log_probs, axes=[2]) + + base_1 = layers.cast(i, 'float32') + 6.0 + base_1 /= 6.0 + length_penalty = layers.pow(base_1, self.alpha) + #length_penalty = layers.pow(((5.0 + layers.cast(i+1, 'float32')) / 6.0), self.alpha) + + curr_scores = log_probs / length_penalty + flat_curr_scores = layers.reshape(curr_scores, [self.batch_size, self.beam_size * self.vocab_size]) + + topk_scores, topk_ids = layers.topk(flat_curr_scores, k=self.beam_size * 2) + + topk_log_probs = topk_scores * length_penalty + + select_beam_index = topk_ids // self.vocab_size + select_id = topk_ids % self.vocab_size + + #layers.Print(select_id, 
message="select_id", summarize=1024) + #layers.Print(topk_scores, message="topk_scores", summarize=10000000) + + flat_select_beam_index = layers.reshape(select_beam_index, [-1]) + self.gather_top2k_append_index + + topk_seq = layers.gather(alive_seq, [flat_select_beam_index]) + topk_seq = layers.reshape(topk_seq, [self.batch_size, 2 * self.beam_size, -1]) + + + #concat with current ids + topk_seq = layers.concat([topk_seq, layers.unsqueeze(select_id, axes=[2])], axis=2) + topk_finished = layers.cast(layers.equal(select_id, self.eos_id), 'float32') + + #gather cache + self.gather_cache(cache, flat_select_beam_index) + + #topk_seq: [batch_size, 2*beam_size, i+1] + #topk_log_probs, topk_scores, topk_finished: [batch_size, 2*beam_size] + return topk_seq, topk_log_probs, topk_scores, topk_finished, cache + + + def grow_alive(self, curr_seq, curr_scores, curr_log_probs, curr_finished, cache): + """ + grow_alive + """ + finish_float_flag = layers.cast(curr_finished, 'float32') + finish_float_flag = finish_float_flag * -INF + curr_scores += finish_float_flag + + return self.compute_topk_scores_and_seq(curr_seq, curr_scores, + curr_log_probs, curr_finished, cache=cache) + + + def grow_finished(self, i, finished_seq, finished_scores, finished_flags, curr_seq, + curr_scores, curr_finished): + """ + grow_finished + """ + finished_seq = layers.concat([finished_seq, + layers.fill_constant([self.batch_size, self.beam_size, 1], dtype='int64', value=0)], + axis=2) + + curr_scores = curr_scores + (1.0 - layers.cast(curr_finished, 'int64')) * -INF + + curr_finished_seq = layers.concat([finished_seq, curr_seq], axis=1) + curr_finished_scores = layers.concat([finished_scores, curr_scores], axis=1) + curr_finished_flags = layers.concat([finished_flags, curr_finished], axis=1) + + return self.compute_topk_scores_and_seq(curr_finished_seq, curr_finished_scores, + curr_finished_scores, curr_finished_flags, + pick_finish=True) + + + def inner_func(self, i, logits, alive_seq, alive_log_probs, finished_seq, finished_scores, + finished_flags, cache, enc_output, enc_bias): + """ + inner_func + """ + topk_seq, topk_log_probs, topk_scores, topk_finished, cache = self.grow_topk( + i, logits, alive_seq, alive_log_probs, cache, enc_output, enc_bias) + + alive_seq, alive_log_probs, _, cache = self.grow_alive( + topk_seq, topk_scores, topk_log_probs, topk_finished, cache) + #layers.Print(alive_seq, message="alive_seq", summarize=1024) + + finished_seq, finished_scores, finished_flags, _ = self.grow_finished( + i, finished_seq, finished_scores, finished_flags, topk_seq, topk_scores, topk_finished) + + return alive_seq, alive_log_probs, finished_seq, finished_scores, finished_flags, cache + + + def is_finished(self, step_idx, source_length, alive_log_probs, finished_scores, finished_in_finished): + """ + is_finished + """ + base_1 = layers.cast(source_length, 'float32') + 55.0 + base_1 /= 6.0 + max_length_penalty = layers.pow(base_1, self.alpha) + + flat_alive_log_probs = layers.reshape(alive_log_probs, [-1]) + lower_bound_alive_scores_1 = layers.gather(flat_alive_log_probs, [self.get_alive_index]) + + lower_bound_alive_scores = lower_bound_alive_scores_1 / max_length_penalty + + lowest_score_of_finished_in_finish = layers.reduce_min(finished_scores * finished_in_finished, dim=1) + + finished_in_finished = layers.cast(finished_in_finished, 'bool') + lowest_score_of_finished_in_finish += \ + ((1.0 - layers.cast(layers.reduce_any(finished_in_finished, 1), 'float32')) * -INF) + + #print lowest_score_of_finished_in_finish + 
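+        # Early-stopping check: decoding may stop once, for every example in
+        # the batch, even the worst finished hypothesis scores higher than the
+        # best score any alive hypothesis could still reach (its log-prob under
+        # the maximum length penalty computed above); otherwise decoding
+        # continues until decode_length = source_length + 50 steps.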
bound_is_met = layers.reduce_all(layers.greater_than(lowest_score_of_finished_in_finish, + lower_bound_alive_scores)) + + decode_length = source_length + 50 + length_cond = layers.less_than(x=step_idx, y=decode_length) + + return layers.logical_and(x=layers.logical_not(bound_is_met), y=length_cond) diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/bleu_hook.py b/PaddleNLP/Research/EMNLP2019-MAL/src/bleu_hook.py new file mode 100644 index 0000000000000000000000000000000000000000..05e722aeddb54ed3716fa822716976a0be5b5483 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/bleu_hook.py @@ -0,0 +1,209 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division + +import collections +import math +import os +import re +import sys +import time +import unicodedata + +# Dependency imports + +import numpy as np +import six +from six.moves import range +from six.moves import zip + +from preprocess import text_encoder + + +def _get_ngrams(segment, max_order): + """Extracts all n-grams up to a given maximum order from an input segment. + + Args: + segment: text segment from which n-grams will be extracted. + max_order: maximum length in tokens of the n-grams returned by this + methods. + + Returns: + The Counter containing all n-grams up to max_order in segment + with a count of how many times each n-gram occurred. + """ + ngram_counts = collections.Counter() + for order in range(1, max_order + 1): + for i in range(0, len(segment) - order + 1): + ngram = tuple(segment[i:i + order]) + ngram_counts[ngram] += 1 + return ngram_counts + + +def compute_bleu(reference_corpus, + translation_corpus, + max_order=4, + use_bp=True): + """Computes BLEU score of translated segments against one or more references. + + Args: + reference_corpus: list of references for each translation. Each + reference should be tokenized into a list of tokens. + translation_corpus: list of translations to score. Each translation + should be tokenized into a list of tokens. + max_order: Maximum n-gram order to use when computing BLEU score. + use_bp: boolean, whether to apply brevity penalty. + + Returns: + BLEU score. 
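+
+    Note: the returned score is the geometric mean of the modified n-gram
+    precisions p_1..p_max_order (with smoothing for orders that have no
+    matches), multiplied by the brevity penalty
+    BP = exp(1 - reference_len / translation_len) when the translation is
+    shorter than the reference, and BP = 1 otherwise.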
+ """ + reference_length = 0 + translation_length = 0 + bp = 1.0 + geo_mean = 0 + + matches_by_order = [0] * max_order + possible_matches_by_order = [0] * max_order + precisions = [] + + for (references, translations) in zip(reference_corpus, translation_corpus): + reference_length += len(references) + translation_length += len(translations) + ref_ngram_counts = _get_ngrams(references, max_order) + translation_ngram_counts = _get_ngrams(translations, max_order) + + overlap = dict((ngram, + min(count, translation_ngram_counts[ngram])) + for ngram, count in ref_ngram_counts.items()) + + for ngram in overlap: + matches_by_order[len(ngram) - 1] += overlap[ngram] + for ngram in translation_ngram_counts: + possible_matches_by_order[len(ngram)-1] += translation_ngram_counts[ngram] + precisions = [0] * max_order + smooth = 1.0 + for i in range(0, max_order): + if possible_matches_by_order[i] > 0: + precisions[i] = matches_by_order[i] / possible_matches_by_order[i] + if matches_by_order[i] > 0: + precisions[i] = matches_by_order[i] / possible_matches_by_order[i] + else: + smooth *= 2 + precisions[i] = 1.0 / (smooth * possible_matches_by_order[i]) + else: + precisions[i] = 0.0 + + if max(precisions) > 0: + p_log_sum = sum(math.log(p) for p in precisions if p) + geo_mean = math.exp(p_log_sum / max_order) + + if use_bp: + ratio = (translation_length + 1e-6) / reference_length + bp = math.exp(1 - 1. / ratio) if ratio < 1.0 else 1.0 + bleu = geo_mean * bp + return np.float32(bleu) + + +class UnicodeRegex(object): + """Ad-hoc hack to recognize all punctuation and symbols.""" + + def __init__(self): + punctuation = self.property_chars("P") + self.nondigit_punct_re = re.compile(r"([^\d])([" + punctuation + r"])") + self.punct_nondigit_re = re.compile(r"([" + punctuation + r"])([^\d])") + self.symbol_re = re.compile("([" + self.property_chars("S") + "])") + + def property_chars(self, prefix): + """ + get unicode of specified chars + """ + return "".join(six.unichr(x) for x in range(sys.maxunicode) + if unicodedata.category(six.unichr(x)).startswith(prefix)) + + +uregex = UnicodeRegex() + + +def bleu_tokenize(string): + r"""Tokenize a string following the official BLEU implementation. + + See https://github.com/moses-smt/mosesdecoder/" + "blob/master/scripts/generic/mteval-v14.pl#L954-L983 + In our case, the input string is expected to be just one line + and no HTML entities de-escaping is needed. + So we just tokenize on punctuation and symbols, + except when a punctuation is preceded and followed by a digit + (e.g. a comma/dot as a thousand/decimal separator). + + Note that a number (e.g. a year) followed by a dot at the end of sentence + is NOT tokenized, + i.e. the dot stays with the number because `s/(\p{P})(\P{N})/ $1 $2/g` + does not match this case (unless we add a space after each sentence). + However, this error is already in the original mteval-v14.pl + and we want to be consistent with it. 
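+
+    For example, bleu_tokenize("It costs 1,000 dollars.") yields
+    ["It", "costs", "1,000", "dollars", "."], while in "It was 2019." the
+    final dot stays attached to the year: ["It", "was", "2019."].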
+ + Args: + string: the input string + + Returns: + a list of tokens + """ + string = uregex.nondigit_punct_re.sub(r"\1 \2 ", string) + string = uregex.punct_nondigit_re.sub(r" \1 \2", string) + string = uregex.symbol_re.sub(r" \1 ", string) + return string.split() + + +def bleu_wrapper(ref_filename, hyp_filename, case_sensitive=False): + """Compute BLEU for two files (reference and hypothesis translation).""" + ref_lines = text_encoder.native_to_unicode( + open(ref_filename, "r").read()).splitlines() + hyp_lines = text_encoder.native_to_unicode( + open(hyp_filename, "r").read()).splitlines() + assert len(ref_lines) == len(hyp_lines) + if not case_sensitive: + ref_lines = [x.lower() for x in ref_lines] + hyp_lines = [x.lower() for x in hyp_lines] + ref_tokens = [bleu_tokenize(x) for x in ref_lines] + hyp_tokens = [bleu_tokenize(x) for x in hyp_lines] + + return compute_bleu(ref_tokens, hyp_tokens) + + +if __name__ == "__main__": + import argparse + parser = argparse.ArgumentParser("Calc BLEU.") + + parser.add_argument( + "--reference", + type=str, + required=True, + help="path of reference.") + + parser.add_argument( + "--translation", + type=str, + required=True, + help="path of translation.") + args = parser.parse_args() + + bleu_uncased = 100 * bleu_wrapper(args.reference, args.translation, case_sensitive=False) + bleu_cased = 100 * bleu_wrapper(args.reference, args.translation, case_sensitive=True) + + f = open("%s.bleu" % args.translation, 'w') + f.write("BLEU_uncased = %6.2f\n" % (bleu_uncased)) + f.write("BLEU_cased = %6.2f\n" % (bleu_cased)) + f.close() diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/config.py b/PaddleNLP/Research/EMNLP2019-MAL/src/config.py new file mode 100644 index 0000000000000000000000000000000000000000..77ed70cc08eb685101349698af78bd4036591a96 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/config.py @@ -0,0 +1,315 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +class TrainTaskConfig(object): + """ + TrainTaskConfig + """ + # support both CPU and GPU now. + use_gpu = True + # the epoch number to train. + pass_num = 30 + # the number of sequences contained in a mini-batch. + # deprecated, set batch_size in args. + batch_size = 32 + # the hyper parameters for Adam optimizer. + # This static learning_rate will be multiplied to the LearningRateScheduler + # derived learning rate the to get the final learning rate. + learning_rate = 4.0 + beta1 = 0.9 + beta2 = 0.997 + eps = 1e-9 + # the parameters for learning rate scheduling. + warmup_steps = 8000 + # the weight used to mix up the ground-truth distribution and the fixed + # uniform distribution in label smoothing when training. + # Set this as zero if label smoothing is not wanted. + label_smooth_eps = 0.1 + # the directory for saving trained models. + model_dir = "trained_models" + # the directory for saving checkpoints. + ckpt_dir = "trained_ckpts" + # the directory for loading checkpoint. 
+ # If provided, continue training from the checkpoint. + ckpt_path = None + # the parameter to initialize the learning rate scheduler. + # It should be provided if use checkpoints, since the checkpoint doesn't + # include the training step counter currently. + start_step = 0 + # the frequency to save trained models. + save_freq = 5000 + # the frequency to copy unfixed parameters to fixed parameters + fixed_freq = 50000 + beta = 0.7 + +class InferTaskConfig(object): + """ + InferTaskConfig + """ + use_gpu = True + # the number of examples in one run for sequence generation. + batch_size = 10 + # the parameters for beam search. + beam_size = 5 + max_out_len = 256 + # the number of decoded sentences to output. + n_best = 1 + # the flags indicating whether to output the special tokens. + output_bos = False + output_eos = False + output_unk = True + # the directory for loading the trained model. + model_path = "trained_models/pass_1.infer.model" + decode_alpha = 0.6 + + +class ModelHyperParams(object): + """ + ModelHyperParams + """ + # These following five vocabularies related configurations will be set + # automatically according to the passed vocabulary path and special tokens. + # size of source word dictionary. + src_vocab_size = 10000 + # size of target word dictionay + trg_vocab_size = 10000 + # index for token + bos_idx = 0 + # index for token + eos_idx = 1 + # index for token + unk_idx = 2 + # max length of sequences deciding the size of position encoding table. + max_length = 256 + # the dimension for word embeddings, which is also the last dimension of + # the input and output of multi-head attention, position-wise feed-forward + # networks, encoder and decoder. + d_model = 1024 + # size of the hidden layer in position-wise feed-forward networks. + d_inner_hid = 4096 + # the dimension that keys are projected to for dot-product attention. + d_key = 64 + # the dimension that values are projected to for dot-product attention. + d_value = 64 + # number of head used in multi-head attention. + n_head = 16 + # number of sub-layers to be stacked in the encoder and decoder. + n_layer = 6 + # dropout rates of different modules. + prepostprocess_dropout = 0.1 + attention_dropout = 0.1 + relu_dropout = 0.1 + # to process before each sub-layer + preprocess_cmd = "n" # layer normalization + # to process after each sub-layer + postprocess_cmd = "da" # dropout + residual connection + # random seed used in dropout for CE. + dropout_seed = None + # the flag indicating whether to share embedding and softmax weights. + # vocabularies in source and target should be same for weight sharing. + weight_sharing = True + embedding_sharing = True + +class DenseModelHyperParams(object): + """ + DenseModelHyperParams + """ + # These following five vocabularies related configurations will be set + # automatically according to the passed vocabulary path and special tokens. + # size of source word dictionary. + src_vocab_size = 37007 + # size of target word dictionay + trg_vocab_size = 37007 + # index for token + bos_idx = 0 + # index for token + eos_idx = 1 + # index for token + unk_idx = 2 + # max length of sequences deciding the size of position encoding table. + max_length = 256 + # the dimension for word embeddings, which is also the last dimension of + # the input and output of multi-head attention, position-wise feed-forward + # networks, encoder and decoder. + d_model = 512 + # size of the hidden layer in position-wise feed-forward networks. 
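+    # (2048 = 4 * d_model here, the conventional Transformer feed-forward width.)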
+ d_inner_hid = 2048 + # the dimension that keys are projected to for dot-product attention. + d_key = 64 + # the dimension that values are projected to for dot-product attention. + d_value = 64 + # number of head used in multi-head attention. + n_head = 8 + # number of sub-layers to be stacked in the encoder and decoder. + n_layer = 6 + enc_n_layer = 25 + # dropout rates of different modules. + prepostprocess_dropout = 0.1 + attention_dropout = 0.1 + relu_dropout = 0.1 + # to process before each sub-layer + preprocess_cmd = "n" # layer normalization + # to process after each sub-layer + postprocess_cmd = "da" # dropout + residual connection + # random seed used in dropout for CE. + dropout_seed = None + # the flag indicating whether to share embedding and softmax weights. + # vocabularies in source and target should be same for weight sharing. + weight_sharing = True + embedding_sharing = True + +def merge_cfg_from_list(cfg_list, g_cfgs): + """ + Set the above global configurations using the cfg_list. + """ + assert len(cfg_list) % 2 == 0 + for key, value in zip(cfg_list[0::2], cfg_list[1::2]): + for g_cfg in g_cfgs: + if hasattr(g_cfg, key): + try: + value = eval(value) + except Exception: # for file path + pass + setattr(g_cfg, key, value) + break + + +# The placeholder for batch_size in compile time. Must be -1 currently to be +# consistent with some ops' infer-shape output in compile time, such as the +# sequence_expand op used in beamsearch decoder. +batch_size = -1 +# The placeholder for squence length in compile time. +seq_len = ModelHyperParams.max_length +# Here list the data shapes and data types of all inputs. +# The shapes here act as placeholder and are set to pass the infer-shape in +# compile time. +input_descs = { + # The actual data shape of src_word is: + # [batch_size, max_src_len_in_batch, 1] + "src_word": [(batch_size, seq_len, 1), "int64", 2], + # The actual data shape of src_pos is: + # [batch_size, max_src_len_in_batch, 1] + "src_pos": [(batch_size, seq_len, 1), "int64"], + # This input is used to remove attention weights on paddings in the + # encoder. + # The actual data shape of src_slf_attn_bias is: + # [batch_size, n_head, max_src_len_in_batch, max_src_len_in_batch] + "src_slf_attn_bias": [(batch_size, ModelHyperParams.n_head, seq_len, + seq_len), "float32"], + "dense_src_slf_attn_bias": [(batch_size, DenseModelHyperParams.n_head, seq_len, + seq_len), "float32"], + # The actual data shape of trg_word is: + # [batch_size, max_trg_len_in_batch, 1] + "trg_word": [(batch_size, seq_len, 1), "int64", + 2], # lod_level is only used in fast decoder. + "reverse_trg_word": [(batch_size, seq_len, 1), "int64", + 2], # lod_level is only used in fast decoder. + # The actual data shape of trg_pos is: + # [batch_size, max_trg_len_in_batch, 1] + "trg_pos": [(batch_size, seq_len, 1), "int64"], + # This input is used to remove attention weights on paddings and + # subsequent words in the decoder. + # The actual data shape of trg_slf_attn_bias is: + # [batch_size, n_head, max_trg_len_in_batch, max_trg_len_in_batch] + "trg_slf_attn_bias": [(batch_size, ModelHyperParams.n_head, seq_len, + seq_len), "float32"], + "dense_trg_slf_attn_bias": [(batch_size, DenseModelHyperParams.n_head, seq_len, + seq_len), "float32"], + # This input is used to remove attention weights on paddings of the source + # input in the encoder-decoder attention. 
+ # The actual data shape of trg_src_attn_bias is: + # [batch_size, n_head, max_trg_len_in_batch, max_src_len_in_batch] + "trg_src_attn_bias": [(batch_size, ModelHyperParams.n_head, seq_len, + seq_len), "float32"], + "dense_trg_src_attn_bias": [(batch_size, DenseModelHyperParams.n_head, seq_len, + seq_len), "float32"], + # This input is used in independent decoder program for inference. + # The actual data shape of enc_output is: + # [batch_size, max_src_len_in_batch, d_model] + "enc_output": [(batch_size, seq_len, ModelHyperParams.d_model), "float32"], + # The actual data shape of label_word is: + # [batch_size * max_trg_len_in_batch, 1] + "lbl_word": [(batch_size * seq_len, 1), "int64"], + "reverse_lbl_word": [(batch_size * seq_len, 1), "int64"], + "eos_position": [(batch_size * seq_len, 1), "int64"], + # This input is used to mask out the loss of paddding tokens. + # The actual data shape of label_weight is: + # [batch_size * max_trg_len_in_batch, 1] + "lbl_weight": [(batch_size * seq_len, 1), "float32"], + # This input is used in beam-search decoder. + "init_score": [(batch_size, 1), "float32"], + # This input is used in beam-search decoder for the first gather + # (cell states updation) + "init_idx": [(batch_size, ), "int32"], + "decode_length": [(batch_size, ), "int64"], +} +# Names of word embedding table which might be reused for weight sharing. +dense_word_emb_param_names = ( + "src_word_emb_table", + "trg_word_emb_table", ) +# Names of position encoding table which will be initialized externally. +dense_pos_enc_param_names = ( + "dense_src_pos_enc_table", + "dense_trg_pos_enc_table", ) +# Names of word embedding table which might be reused for weight sharing. +word_emb_param_names = ( + "src_word_emb_table", + "trg_word_emb_table", ) +# Names of position encoding table which will be initialized externally. +pos_enc_param_names = ( + "src_pos_enc_table", + "trg_pos_enc_table", ) +# separated inputs for different usages. +encoder_data_input_fields = ( + "src_word", + "src_pos", + "src_slf_attn_bias", ) +# separated inputs for different usages. +dense_encoder_data_input_fields = ( + "src_word", + "src_pos", + "dense_src_slf_attn_bias", ) +decoder_data_input_fields = ( + "trg_word", + "reverse_trg_word", + "trg_pos", + "trg_slf_attn_bias", + "trg_src_attn_bias", + "enc_output", ) +dense_decoder_data_input_fields = ( + "trg_word", + "reverse_trg_word", + "trg_pos", + "dense_trg_slf_attn_bias", + "dense_trg_src_attn_bias", + "enc_output", ) +label_data_input_fields = ( + "lbl_word", + "lbl_weight", + "reverse_lbl_word", + "eos_position") +dense_bias_input_fields = ( + "dense_src_slf_attn_bias", + "dense_trg_slf_attn_bias", + "dense_trg_src_attn_bias") +# In fast decoder, trg_pos (only containing the current time step) is generated +# by ops and trg_slf_attn_bias is not needed. +fast_encoder_data_input_fields = ( + "src_word", + "src_pos", + "src_slf_attn_bias", + "dense_src_slf_attn_bias", ) + +fast_decoder_data_input_fields = ( + "decode_length", ) diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/dense_model.py b/PaddleNLP/Research/EMNLP2019-MAL/src/dense_model.py new file mode 100644 index 0000000000000000000000000000000000000000..ceb1319e18c53b7c77829d2c6f273a57fec595a9 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/dense_model.py @@ -0,0 +1,958 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License + +from functools import partial +import numpy as np + +import paddle.fluid as fluid +import paddle.fluid.layers as layers +from paddle.fluid.layer_helper import LayerHelper as LayerHelper + +from config import * +from beam_search import BeamSearch + +INF = 1. * 1e5 + +def layer_norm(x, begin_norm_axis=1, epsilon=1e-6, param_attr=None, bias_attr=None): + """ + layer_norm + """ + helper = LayerHelper('layer_norm', **locals()) + mean = layers.reduce_mean(x, dim=begin_norm_axis, keep_dim=True) + shift_x = layers.elementwise_sub(x=x, y=mean, axis=0) + variance = layers.reduce_mean(layers.square(shift_x), dim=begin_norm_axis, keep_dim=True) + r_stdev = layers.rsqrt(variance + epsilon) + norm_x = layers.elementwise_mul(x=shift_x, y=r_stdev, axis=0) + + param_shape = [reduce(lambda x, y: x * y, norm_x.shape[begin_norm_axis:])] + param_dtype = norm_x.dtype + scale = helper.create_parameter( + attr=param_attr, + shape=param_shape, + dtype=param_dtype, + default_initializer=fluid.initializer.Constant(1.)) + bias = helper.create_parameter( + attr=bias_attr, + shape=param_shape, + dtype=param_dtype, + is_bias=True, + default_initializer=fluid.initializer.Constant(0.)) + + out = layers.elementwise_mul(x=norm_x, y=scale, axis=-1) + out = layers.elementwise_add(x=out, y=bias, axis=-1) + + return out + +def dense_position_encoding_init(n_position, d_pos_vec): + """ + Generate the initial values for the sinusoid position encoding table. + """ + channels = d_pos_vec + position = np.arange(n_position) + num_timescales = channels // 2 + log_timescale_increment = (np.log(float(1e4) / float(1)) / + (num_timescales - 1)) + inv_timescales = np.exp(np.arange( + num_timescales) * -log_timescale_increment) + #num_timescales)) * -log_timescale_increment + scaled_time = np.expand_dims(position, 1) * np.expand_dims(inv_timescales, + 0) + signal = np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1) + signal = np.pad(signal, [[0, 0], [0, np.mod(channels, 2)]], 'constant') + position_enc = signal + return position_enc.astype("float32") + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + attention_type="dot_product",): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError( + "Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. 
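+        Each projection is a bias-free fc applied over the last dimension,
+        mapping [batch, seq_len, d_model] to [batch, seq_len, n_head * d_key]
+        for q and k, and to [batch, seq_len, n_head * d_value] for v.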
+ """ + q = layers.fc(input=queries, + size=d_key * n_head, + bias_attr=False, + num_flatten_dims=2) + k = layers.fc(input=keys, + size=d_key * n_head, + bias_attr=False, + num_flatten_dims=2) + v = layers.fc(input=values, + size=d_value * n_head, + bias_attr=False, + num_flatten_dims=2) + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + if n_head == 1: + return x + + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape( + x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape( + x=trans_x, + shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], + inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key ** -0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout( + weights, + dropout_prob=dropout_rate, + seed=DenseModelHyperParams.dropout_seed, + is_test=False, dropout_implementation='upscale_in_train') + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = layers.concat([cache['k'], k], axis=1) + v = layers.concat([cache['v'], v], axis=1) + layers.assign(k, cache['k']) + layers.assign(v, cache['v']) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, #d_model, + dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. + proj_out = layers.fc(input=out, + size=d_model, + bias_attr=False, + num_flatten_dims=2) + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. 
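+    Conceptually: FFN(x) = relu(x . W1 + b1) . W2 + b2, where W1 projects the
+    model dimension to d_inner_hid and W2 projects back to d_hid; dropout, when
+    enabled, is applied to the hidden (post-ReLU) activations only.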
+ """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act="relu") + if dropout_rate: + hidden = layers.dropout( + hidden, + dropout_prob=dropout_rate, + seed=DenseModelHyperParams.dropout_seed, + is_test=False, dropout_implementation='upscale_in_train') + out = layers.fc(input=hidden, size=d_hid, num_flatten_dims=2) + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. + """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out = layer_norm( + out, + begin_norm_axis=len(out.shape) - 1, + epsilon=1e-6, + param_attr=fluid.initializer.Constant(1.), + bias_attr=fluid.initializer.Constant(0.)) + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout( + out, + dropout_prob=dropout_rate, + seed=DenseModelHyperParams.dropout_seed, + is_test=False, dropout_implementation='upscale_in_train') + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def prepare_encoder_decoder(src_word, + src_pos, + src_vocab_size, + src_emb_dim, + src_max_len, + dropout_rate=0., + word_emb_param_name=None, + training=True, + pos_enc_param_name=None, + is_src=True, + params_type="normal"): + """Add word embeddings and position encodings. + The output tensor has a shape of: + [batch_size, max_src_length_in_batch, d_model]. + This module is used at the bottom of the encoder stacks. + """ + assert params_type == "fixed" or params_type == "normal" or params_type == "new" + pre_name = "densedense" + + if params_type == "fixed": + pre_name = "fixed_densefixed_dense" + elif params_type == "new": + pre_name = "new_densenew_dense" + + src_word_emb = layers.embedding( + src_word, + size=[src_vocab_size, src_emb_dim], + padding_idx=DenseModelHyperParams.bos_idx, # set embedding of bos to 0 + param_attr=fluid.ParamAttr( + name = pre_name + word_emb_param_name, + initializer=fluid.initializer.Normal(0., src_emb_dim ** -0.5)))#, is_sparse=True) + if not is_src and training: + src_word_emb = layers.pad(src_word_emb, [0, 0, 1, 0, 0, 0]) + src_word_emb = layers.scale(x=src_word_emb, scale=src_emb_dim ** 0.5) + src_pos_enc = layers.embedding( + src_pos, + size=[src_max_len, src_emb_dim], + param_attr=fluid.ParamAttr( + trainable=False, name = pre_name + pos_enc_param_name)) + src_pos_enc.stop_gradient = True + enc_input = src_word_emb + src_pos_enc + return layers.dropout( + enc_input, + dropout_prob=dropout_rate, + seed=DenseModelHyperParams.dropout_seed, + is_test=False, dropout_implementation='upscale_in_train') if dropout_rate else enc_input + + +prepare_encoder = partial( + prepare_encoder_decoder, pos_enc_param_name="src_pos_enc_table", is_src=True) +prepare_decoder = partial( + prepare_encoder_decoder, pos_enc_param_name="trg_pos_enc_table", is_src=False) + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd="n", + postprocess_cmd="da"): + """The encoder layers that can be stacked to form a deep encoder. 
+ This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. + """ + attn_output = multi_head_attention( + pre_process_layer(enc_input, preprocess_cmd, + prepostprocess_dropout), None, None, attn_bias, d_key, + d_value, d_model, n_head, attention_dropout) + attn_output = post_process_layer(enc_input, attn_output, postprocess_cmd, + prepostprocess_dropout) + ffd_output = positionwise_feed_forward( + pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout), + d_inner_hid, d_model, relu_dropout) + return post_process_layer(attn_output, ffd_output, postprocess_cmd, + prepostprocess_dropout) + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd="n", + postprocess_cmd="da"): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. + """ + stack_layer_norm = [] + bottom_embedding_output = pre_process_layer(enc_input, preprocess_cmd, prepostprocess_dropout) + stack_layer_norm.append(bottom_embedding_output) + + #zeros = layers.zeros_like(enc_input) + #ones_flag = layers.equal(zeros, zeros) + #ones = layers.cast(ones_flag, 'float32') + + for i in range(n_layer): + enc_output = encoder_layer( + enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, ) + enc_output_2 = pre_process_layer(enc_output, preprocess_cmd, prepostprocess_dropout) + stack_layer_norm.append(enc_output_2) + + pre_output = bottom_embedding_output + for index in xrange(1, len(stack_layer_norm)): + pre_output = pre_output + stack_layer_norm[index] + + # pre_mean + enc_input = pre_output / len(stack_layer_norm) + + enc_output = pre_process_layer(enc_output, preprocess_cmd, + prepostprocess_dropout) + return enc_output + + +def decoder_layer(dec_input, + enc_output, + slf_attn_bias, + dec_enc_attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + cache=None): + """ The layer to be stacked in decoder part. + The structure of this module is similar to that in the encoder part except + a multi-head attention is added to implement encoder-decoder attention. 
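+    Sub-layer order: (1) masked multi-head self-attention over dec_input,
+    optionally reading/writing `cache` for incremental decoding, (2) multi-head
+    attention over enc_output, (3) a position-wise feed-forward network; each
+    sub-layer is wrapped with the pre/post processing (layer norm, dropout,
+    residual) selected by preprocess_cmd/postprocess_cmd.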
+ """ + slf_attn_output = multi_head_attention( + pre_process_layer(dec_input, preprocess_cmd, prepostprocess_dropout), + None, + None, + slf_attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + cache, ) + slf_attn_output = post_process_layer( + dec_input, + slf_attn_output, + postprocess_cmd, + prepostprocess_dropout, ) + enc_attn_output = multi_head_attention( + pre_process_layer(slf_attn_output, preprocess_cmd, + prepostprocess_dropout), + enc_output, + enc_output, + dec_enc_attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, ) + enc_attn_output = post_process_layer( + slf_attn_output, + enc_attn_output, + postprocess_cmd, + prepostprocess_dropout, ) + ffd_output = positionwise_feed_forward( + pre_process_layer(enc_attn_output, preprocess_cmd, + prepostprocess_dropout), + d_inner_hid, + d_model, + relu_dropout, ) + dec_output = post_process_layer( + enc_attn_output, + ffd_output, + postprocess_cmd, + prepostprocess_dropout, ) + return dec_output + + +def decoder(dec_input, + enc_output, + dec_slf_attn_bias, + dec_enc_attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + caches=None): + """ + The decoder is composed of a stack of identical decoder_layer layers. + """ + for i in range(n_layer): + dec_output = decoder_layer( + dec_input, + enc_output, + dec_slf_attn_bias, + dec_enc_attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + cache=None if caches is None else caches[i]) + dec_input = dec_output + dec_output = pre_process_layer(dec_output, preprocess_cmd, + prepostprocess_dropout) + return dec_output + + +def make_all_inputs(input_fields): + """ + Define the input data layers for the transformer model. + """ + inputs = [] + for input_field in input_fields: + input_var = layers.data( + name=input_field, + shape=input_descs[input_field][0], + dtype=input_descs[input_field][1], + lod_level=input_descs[input_field][2] + if len(input_descs[input_field]) == 3 else 0, + append_batch_size=False) + inputs.append(input_var) + return inputs + +def make_all_py_reader_inputs(input_fields, is_test=False): + """ + Define the input data layers for the transformer model. + """ + reader = layers.py_reader( + capacity=20, + name="test_reader" if is_test else "train_reader", + shapes=[dense_input_descs[input_field][0] for input_field in input_fields], + dtypes=[dense_input_descs[input_field][1] for input_field in input_fields], + lod_levels=[ + dense_input_descs[input_field][2] + if len(dense_input_descs[input_field]) == 3 else 0 + for input_field in input_fields + ], use_double_buffer=True) + return layers.read_file(reader), reader + + +def dense_transformer(src_vocab_size, + trg_vocab_size, + max_length, + n_layer, + enc_n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + label_smooth_eps, + use_py_reader=False, + is_test=False, + params_type="normal", + all_data_inputs=None): + """ + transformer + """ + if embedding_sharing: + assert src_vocab_size == trg_vocab_size, ( + "Vocabularies in source and target should be same for weight sharing." 
+ ) + + + data_input_names = encoder_data_input_fields + \ + decoder_data_input_fields[:-1] + label_data_input_fields + dense_bias_input_fields + + if use_py_reader: + all_inputs = all_data_inputs + else: + all_inputs = make_all_inputs(data_input_names) + + enc_inputs_len = len(encoder_data_input_fields) + dec_inputs_len = len(decoder_data_input_fields[:-1]) + enc_inputs = all_inputs[0:enc_inputs_len] + dec_inputs = all_inputs[enc_inputs_len:enc_inputs_len + dec_inputs_len] + real_label = all_inputs[enc_inputs_len + dec_inputs_len] + weights = all_inputs[enc_inputs_len + dec_inputs_len + 1] + reverse_label = all_inputs[enc_inputs_len + dec_inputs_len + 2] + enc_inputs[2] = all_inputs[-3] # dense_src_slf_attn_bias + dec_inputs[3] = all_inputs[-2] # dense_trg_slf_attn_bias + dec_inputs[4] = all_inputs[-1] # dense_trg_src_attn_bias + + enc_output = wrap_encoder( + src_vocab_size, + max_length, + enc_n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + enc_inputs, + params_type=params_type) + + predict = wrap_decoder( + trg_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + dec_inputs, + enc_output, is_train = True if not is_test else False, + params_type=params_type) + + # Padding index do not contribute to the total loss. The weights is used to + # cancel padding index in calculating the loss. + if label_smooth_eps: + label = layers.one_hot(input=real_label, depth=trg_vocab_size) + label = label * (1 - label_smooth_eps) + (1 - label) * ( + label_smooth_eps / (trg_vocab_size - 1)) + label.stop_gradient = True + else: + label = real_label + + cost = layers.softmax_with_cross_entropy( + logits=predict, + label=label, + soft_label=True if label_smooth_eps else False) + weighted_cost = cost * weights + sum_cost = layers.reduce_sum(weighted_cost) + sum_cost.persistable = True + token_num = layers.reduce_sum(weights) + token_num.persistable = True + token_num.stop_gradient = True + avg_cost = sum_cost / token_num + + sen_count = layers.shape(dec_inputs[0])[0] + batch_predict = layers.reshape(predict, shape = [sen_count, -1, DenseModelHyperParams.trg_vocab_size]) + batch_label = layers.reshape(real_label, shape=[sen_count, -1]) + batch_weights = layers.reshape(weights, shape=[sen_count, -1, 1]) + return sum_cost, avg_cost, token_num, batch_predict, cost, sum_cost, batch_label, batch_weights + + +def wrap_encoder(src_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + enc_inputs=None, + params_type="normal"): + """ + The wrapper assembles together all needed layers for the encoder. + """ + if enc_inputs is None: + # This is used to implement independent encoder program in inference. 
+ src_word, src_pos, src_slf_attn_bias = make_all_inputs( + encoder_data_input_fields) + else: + src_word, src_pos, src_slf_attn_bias = enc_inputs + enc_input = prepare_encoder( + src_word, + src_pos, + src_vocab_size, + d_model, + max_length, + prepostprocess_dropout, + word_emb_param_name=dense_word_emb_param_names[0], + params_type=params_type) + enc_output = encoder( + enc_input, + src_slf_attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, ) + return enc_output + + +def wrap_decoder(trg_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + dec_inputs=None, + enc_output=None, + caches=None, is_train=True, params_type="normal"): + """ + The wrapper assembles together all needed layers for the decoder. + """ + if dec_inputs is None: + # This is used to implement independent decoder program in inference. + trg_word, reverse_trg_word, trg_pos, trg_slf_attn_bias, trg_src_attn_bias, enc_output = \ + make_all_inputs(dense_decoder_data_input_fields) + else: + trg_word, reverse_trg_word, trg_pos, trg_slf_attn_bias, trg_src_attn_bias = dec_inputs + + dec_input = prepare_decoder( + trg_word, + trg_pos, + trg_vocab_size, + d_model, + max_length, + prepostprocess_dropout, + word_emb_param_name=dense_word_emb_param_names[0] + if embedding_sharing else dense_word_emb_param_names[1], + training=is_train, + params_type=params_type) + + dec_output = decoder( + dec_input, + enc_output, + trg_slf_attn_bias, + trg_src_attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + caches=caches) + # Reshape to 2D tensor to use GEMM instead of BatchedGEMM + dec_output = layers.reshape( + dec_output, shape=[-1, dec_output.shape[-1]], inplace=True) + + assert params_type == "fixed" or params_type == "normal" or params_type == "new" + pre_name = "densedense" + if params_type == "fixed": + pre_name = "fixed_densefixed_dense" + elif params_type == "new": + pre_name = "new_densenew_dense" + if weight_sharing and embedding_sharing: + predict = layers.matmul( + x=dec_output, + y=fluid.default_main_program().global_block().var( + pre_name + dense_word_emb_param_names[0]), + transpose_y=True) + elif weight_sharing: + predict = layers.matmul( + x=dec_output, + y=fluid.default_main_program().global_block().var( + pre_name + dense_word_emb_param_names[1]), + transpose_y=True) + else: + predict = layers.fc(input=dec_output, + size=trg_vocab_size, + bias_attr=False) + #layers.Print(predict, message="logits", summarize=20) + if dec_inputs is None: + # Return probs for independent decoder program. 
+ predict = layers.softmax(predict) + return predict + + +def get_enc_bias(source_inputs): + """ + get_enc_bias + """ + source_inputs = layers.cast(source_inputs, 'float32') + emb_sum = layers.reduce_sum(layers.abs(source_inputs), dim=-1) + zero = layers.fill_constant([1], 'float32', value=0) + bias = layers.cast(layers.equal(emb_sum, zero), 'float32') * -1e9 + return layers.unsqueeze(layers.unsqueeze(bias, axes=[1]), axes=[1]) + + +def dense_fast_decode( + src_vocab_size, + trg_vocab_size, + max_in_len, + n_layer, + enc_n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + beam_size, + batch_size, + max_out_len, + decode_alpha, + eos_idx, + params_type="normal"): + """ + Use beam search to decode. Caches will be used to store states of history + steps which can make the decoding faster. + """ + + assert params_type == "normal" or params_type == "new" or params_type == "fixed" + data_input_names = dense_encoder_data_input_fields + fast_decoder_data_input_fields + + all_inputs = make_all_inputs(data_input_names) + + enc_inputs_len = len(encoder_data_input_fields) + dec_inputs_len = len(fast_decoder_data_input_fields) + enc_inputs = all_inputs[0:enc_inputs_len] + dec_inputs = all_inputs[enc_inputs_len:enc_inputs_len + dec_inputs_len] + + enc_output = wrap_encoder(src_vocab_size, max_in_len, enc_n_layer, n_head, + d_key, d_value, d_model, d_inner_hid, + prepostprocess_dropout, attention_dropout, + relu_dropout, preprocess_cmd, postprocess_cmd, + weight_sharing, embedding_sharing, enc_inputs, params_type=params_type) + enc_bias = get_enc_bias(enc_inputs[0]) + source_length, = dec_inputs + + def beam_search(enc_output, enc_bias, source_length): + """ + beam_search + """ + max_len = layers.fill_constant( + shape=[1], dtype='int64', value=max_out_len) + step_idx = layers.fill_constant( + shape=[1], dtype='int64', value=0) + cond = layers.less_than(x=step_idx, y=max_len) + while_op = layers.While(cond) + + caches_batch_size = batch_size * beam_size + init_score = np.zeros([1, beam_size]).astype('float32') + init_score[:, 1:] = -INF + initial_log_probs = layers.assign(init_score) + + alive_log_probs = layers.expand(initial_log_probs, [batch_size, 1]) + # alive seq [batch_size, beam_size, 1] + initial_ids = layers.zeros([batch_size, 1, 1], 'float32') + alive_seq = layers.expand(initial_ids, [1, beam_size, 1]) + alive_seq = layers.cast(alive_seq, 'int64') + + enc_output = layers.unsqueeze(enc_output, axes=[1]) + enc_output = layers.expand(enc_output, [1, beam_size, 1, 1]) + enc_output = layers.reshape(enc_output, [caches_batch_size, -1, d_model]) + + tgt_src_attn_bias = layers.unsqueeze(enc_bias, axes=[1]) + tgt_src_attn_bias = layers.expand(tgt_src_attn_bias, [1, beam_size, n_head, 1, 1]) + enc_bias_shape = layers.shape(tgt_src_attn_bias) + tgt_src_attn_bias = layers.reshape(tgt_src_attn_bias, [-1, enc_bias_shape[2], + enc_bias_shape[3], enc_bias_shape[4]]) + + beam_search = BeamSearch(beam_size, batch_size, decode_alpha, trg_vocab_size, d_model) + + caches = [{ + "k": layers.fill_constant( + shape=[caches_batch_size, 0, d_model], + dtype=enc_output.dtype, + value=0), + "v": layers.fill_constant( + shape=[caches_batch_size, 0, d_model], + dtype=enc_output.dtype, + value=0) + } for i in range(n_layer)] + + finished_seq = layers.zeros_like(alive_seq) + finished_scores = layers.fill_constant([batch_size, beam_size], + dtype='float32', value=-INF) + 
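+        # Slots for completed hypotheses: scores start at -INF and flags at 0,
+        # so a hypothesis that emits EOS should replace these placeholders; the
+        # actual update is expected to happen inside beam_search.inner_func
+        # (defined in beam_search.py, not shown here).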
finished_flags = layers.fill_constant([batch_size, beam_size], + dtype='float32', value=0) + + with while_op.block(): + pos = layers.fill_constant([caches_batch_size, 1, 1], dtype='int64', value=1) + pos = layers.elementwise_mul(pos, step_idx, axis=0) + + alive_seq_1 = layers.reshape(alive_seq, [caches_batch_size, -1]) + alive_seq_2 = alive_seq_1[:, -1:] + alive_seq_2 = layers.unsqueeze(alive_seq_2, axes=[1]) + + logits = wrap_decoder( + trg_vocab_size, max_in_len, n_layer, n_head, d_key, + d_value, d_model, d_inner_hid, prepostprocess_dropout, + attention_dropout, relu_dropout, preprocess_cmd, + postprocess_cmd, weight_sharing, embedding_sharing, + dec_inputs=(alive_seq_2, alive_seq_2, pos, None, tgt_src_attn_bias), + enc_output=enc_output, caches=caches, is_train=False, params_type=params_type) + + alive_seq_2, alive_log_probs_2, finished_seq_2, finished_scores_2, finished_flags_2, caches_2 = \ + beam_search.inner_func(step_idx, logits, alive_seq_1, alive_log_probs, finished_seq, + finished_scores, finished_flags, caches, enc_output, + tgt_src_attn_bias) + + layers.increment(x=step_idx, value=1.0, in_place=True) + finish_cond = beam_search.is_finished(step_idx, source_length, alive_log_probs_2, + finished_scores_2, finished_flags_2) + + layers.assign(alive_seq_2, alive_seq) + layers.assign(alive_log_probs_2, alive_log_probs) + layers.assign(finished_seq_2, finished_seq) + layers.assign(finished_scores_2, finished_scores) + layers.assign(finished_flags_2, finished_flags) + + for i in xrange(len(caches_2)): + layers.assign(caches_2[i]["k"], caches[i]["k"]) + layers.assign(caches_2[i]["v"], caches[i]["v"]) + + layers.logical_and(x=cond, y=finish_cond, out=cond) + + finished_flags = layers.reduce_sum(finished_flags, dim=1, keep_dim=True) / beam_size + finished_flags = layers.cast(finished_flags, 'bool') + mask = layers.cast(layers.reduce_any(input=finished_flags, dim=1, keep_dim=True), 'float32') + mask = layers.expand(mask, [1, beam_size]) + + mask2 = 1.0 - mask + finished_seq = layers.cast(finished_seq, 'float32') + alive_seq = layers.cast(alive_seq, 'float32') + #print mask + + finished_seq = layers.elementwise_mul(finished_seq, mask, axis=0) + \ + layers.elementwise_mul(alive_seq, mask2, axis = 0) + finished_seq = layers.cast(finished_seq, 'int32') + finished_scores = layers.elementwise_mul(finished_scores, mask, axis=0) + \ + layers.elementwise_mul(alive_log_probs, mask2) + finished_seq.persistable = True + finished_scores.persistable = True + + return finished_seq, finished_scores + + finished_ids, finished_scores = beam_search(enc_output, enc_bias, source_length) + return finished_ids, finished_scores diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/forward_model.py b/PaddleNLP/Research/EMNLP2019-MAL/src/forward_model.py new file mode 100644 index 0000000000000000000000000000000000000000..61372933546d5192d0b5a5b9d43616396f6dcbe5 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/forward_model.py @@ -0,0 +1,943 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License + +from functools import partial +import numpy as np + +import paddle.fluid as fluid +import paddle.fluid.layers as layers +from paddle.fluid.layer_helper import LayerHelper as LayerHelper + +from config import * +from beam_search import BeamSearch +from attention import _dot_product_relative + +INF = 1. * 1e5 + +def layer_norm(x, begin_norm_axis=1, epsilon=1e-6, param_attr=None, bias_attr=None): + """ + layer_norm + """ + helper = LayerHelper('layer_norm', **locals()) + mean = layers.reduce_mean(x, dim=begin_norm_axis, keep_dim=True) + shift_x = layers.elementwise_sub(x=x, y=mean, axis=0) + variance = layers.reduce_mean(layers.square(shift_x), dim=begin_norm_axis, keep_dim=True) + r_stdev = layers.rsqrt(variance + epsilon) + norm_x = layers.elementwise_mul(x=shift_x, y=r_stdev, axis=0) + + param_shape = [reduce(lambda x, y: x * y, norm_x.shape[begin_norm_axis:])] + param_dtype = norm_x.dtype + scale = helper.create_parameter( + attr=param_attr, + shape=param_shape, + dtype=param_dtype, + default_initializer=fluid.initializer.Constant(1.)) + bias = helper.create_parameter( + attr=bias_attr, + shape=param_shape, + dtype=param_dtype, + is_bias=True, + default_initializer=fluid.initializer.Constant(0.)) + + out = layers.elementwise_mul(x=norm_x, y=scale, axis=-1) + out = layers.elementwise_add(x=out, y=bias, axis=-1) + + return out + +def forward_position_encoding_init(n_position, d_pos_vec): + """ + Generate the initial values for the sinusoid position encoding table. + """ + channels = d_pos_vec + position = np.arange(n_position) + num_timescales = channels // 2 + log_timescale_increment = (np.log(float(1e4) / float(1)) / + (num_timescales - 1)) + inv_timescales = np.exp(np.arange( + num_timescales) * -log_timescale_increment) + #num_timescales)) * -log_timescale_increment + scaled_time = np.expand_dims(position, 1) * np.expand_dims(inv_timescales, + 0) + signal = np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1) + signal = np.pad(signal, [[0, 0], [0, np.mod(channels, 2)]], 'constant') + position_enc = signal + return position_enc.astype("float32") + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + attention_type="dot_product",): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError( + "Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. + """ + q = layers.fc(input=queries, + size=d_key * n_head, + bias_attr=False, + num_flatten_dims=2) + k = layers.fc(input=keys, + size=d_key * n_head, + bias_attr=False, + num_flatten_dims=2) + v = layers.fc(input=values, + size=d_value * n_head, + bias_attr=False, + num_flatten_dims=2) + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. 
Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + if n_head == 1: + return x + + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape( + x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape( + x=trans_x, + shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], + inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key ** -0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout( + weights, + dropout_prob=dropout_rate, + seed=ModelHyperParams.dropout_seed, + is_test=False, dropout_implementation='upscale_in_train') + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = layers.concat([cache['k'], k], axis=1) + v = layers.concat([cache['v'], v], axis=1) + layers.assign(k, cache['k']) + layers.assign(v, cache['v']) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + assert attention_type == "dot_product" or attention_type == "dot_product_relative_encoder" or attention_type == "dot_product_relative_decoder" + if attention_type == "dot_product": + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, #d_model, + dropout_rate) + elif attention_type == "dot_product_relative_encoder": + q = layers.scale(x=q, scale=d_key ** -0.5) + ctx_multiheads = _dot_product_relative(q, k, v, attn_bias, dropout=dropout_rate) + else: + q = layers.scale(x=q, scale=d_key ** -0.5) + ctx_multiheads = _dot_product_relative(q, k, v, attn_bias, dropout=dropout_rate, cache = cache) + + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. + proj_out = layers.fc(input=out, + size=d_model, + bias_attr=False, + num_flatten_dims=2) + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. 
+ """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act="relu") + if dropout_rate: + hidden = layers.dropout( + hidden, + dropout_prob=dropout_rate, + seed=ModelHyperParams.dropout_seed, + is_test=False, dropout_implementation='upscale_in_train') + out = layers.fc(input=hidden, size=d_hid, num_flatten_dims=2) + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. + """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out = layer_norm( + out, + begin_norm_axis=len(out.shape) - 1, + epsilon=1e-6, + param_attr=fluid.initializer.Constant(1.), + bias_attr=fluid.initializer.Constant(0.)) + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout( + out, + dropout_prob=dropout_rate, + seed=ModelHyperParams.dropout_seed, + is_test=False, dropout_implementation='upscale_in_train') + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def prepare_encoder_decoder(src_word, + src_pos, + src_vocab_size, + src_emb_dim, + src_max_len, + dropout_rate=0., + word_emb_param_name=None, + training=True, + pos_enc_param_name=None, + is_src=True, + params_type="normal"): + """Add word embeddings and position encodings. + The output tensor has a shape of: + [batch_size, max_src_length_in_batch, d_model]. + This module is used at the bottom of the encoder stacks. + """ + assert params_type == "fixed" or params_type == "normal" or params_type == "new" + pre_name = "forwardforward" + if params_type == "fixed": + pre_name = "fixed_forwardfixed_forward" + elif params_type == "new": + pre_name = "new_forwardnew_forward" + src_word_emb = layers.embedding( + src_word, + size=[src_vocab_size, src_emb_dim], + padding_idx=ModelHyperParams.bos_idx, # set embedding of bos to 0 + param_attr=fluid.ParamAttr( + name = pre_name + word_emb_param_name, + initializer=fluid.initializer.Normal(0., src_emb_dim ** -0.5)))#, is_sparse=True) + if not is_src and training: + src_word_emb = layers.pad(src_word_emb, [0, 0, 1, 0, 0, 0]) + src_word_emb = layers.scale(x=src_word_emb, scale=src_emb_dim ** 0.5) + src_pos_enc = layers.embedding( + src_pos, + size=[src_max_len, src_emb_dim], + param_attr=fluid.ParamAttr( + trainable=False, name = pre_name + pos_enc_param_name)) + src_pos_enc.stop_gradient = True + enc_input = src_word_emb + src_pos_enc + return layers.dropout( + enc_input, + dropout_prob=dropout_rate, + seed=ModelHyperParams.dropout_seed, + is_test=False, dropout_implementation='upscale_in_train') if dropout_rate else enc_input + + +prepare_encoder = partial( + prepare_encoder_decoder, pos_enc_param_name=pos_enc_param_names[0], is_src=True) +prepare_decoder = partial( + prepare_encoder_decoder, pos_enc_param_name=pos_enc_param_names[1], is_src=False) + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd="n", + postprocess_cmd="da"): + """The encoder layers that can be stacked to form a deep encoder. 
+ This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. + """ + attn_output = multi_head_attention( + pre_process_layer(enc_input, preprocess_cmd, + prepostprocess_dropout), None, None, attn_bias, d_key, + d_value, d_model, n_head, attention_dropout) + attn_output = post_process_layer(enc_input, attn_output, postprocess_cmd, + prepostprocess_dropout) + ffd_output = positionwise_feed_forward( + pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout), + d_inner_hid, d_model, relu_dropout) + return post_process_layer(attn_output, ffd_output, postprocess_cmd, + prepostprocess_dropout) + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd="n", + postprocess_cmd="da"): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. + """ + for i in range(n_layer): + enc_output = encoder_layer( + enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, ) + enc_input = enc_output + enc_output = pre_process_layer(enc_output, preprocess_cmd, + prepostprocess_dropout) + return enc_output + + +def decoder_layer(dec_input, + enc_output, + slf_attn_bias, + dec_enc_attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + cache=None): + """ The layer to be stacked in decoder part. + The structure of this module is similar to that in the encoder part except + a multi-head attention is added to implement encoder-decoder attention. + """ + slf_attn_output = multi_head_attention( + pre_process_layer(dec_input, preprocess_cmd, prepostprocess_dropout), + None, + None, + slf_attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + cache) + slf_attn_output = post_process_layer( + dec_input, + slf_attn_output, + postprocess_cmd, + prepostprocess_dropout, ) + enc_attn_output = multi_head_attention( + pre_process_layer(slf_attn_output, preprocess_cmd, + prepostprocess_dropout), + enc_output, + enc_output, + dec_enc_attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, ) + enc_attn_output = post_process_layer( + slf_attn_output, + enc_attn_output, + postprocess_cmd, + prepostprocess_dropout, ) + ffd_output = positionwise_feed_forward( + pre_process_layer(enc_attn_output, preprocess_cmd, + prepostprocess_dropout), + d_inner_hid, + d_model, + relu_dropout, ) + dec_output = post_process_layer( + enc_attn_output, + ffd_output, + postprocess_cmd, + prepostprocess_dropout, ) + return dec_output + + +def decoder(dec_input, + enc_output, + dec_slf_attn_bias, + dec_enc_attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + caches=None): + """ + The decoder is composed of a stack of identical decoder_layer layers. 
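+    When `caches` is provided (inference), layer i reads and updates caches[i],
+    so keys/values of earlier time steps are reused rather than recomputed at
+    every decoding step.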
+ """ + for i in range(n_layer): + dec_output = decoder_layer( + dec_input, + enc_output, + dec_slf_attn_bias, + dec_enc_attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + cache=None if caches is None else caches[i]) + dec_input = dec_output + dec_output = pre_process_layer(dec_output, preprocess_cmd, + prepostprocess_dropout) + return dec_output + + +def make_all_inputs(input_fields): + """ + Define the input data layers for the transformer model. + """ + inputs = [] + for input_field in input_fields: + input_var = layers.data( + name=input_field, + shape=input_descs[input_field][0], + dtype=input_descs[input_field][1], + lod_level=input_descs[input_field][2] + if len(input_descs[input_field]) == 3 else 0, + append_batch_size=False) + inputs.append(input_var) + return inputs + + +def make_all_py_reader_inputs(input_fields, is_test=False): + """ + Define the input data layers for the transformer model. + """ + reader = layers.py_reader( + capacity=20, + name="test_reader" if is_test else "train_reader", + shapes=[input_descs[input_field][0] for input_field in input_fields], + dtypes=[input_descs[input_field][1] for input_field in input_fields], + lod_levels=[ + input_descs[input_field][2] + if len(input_descs[input_field]) == 3 else 0 + for input_field in input_fields + ], use_double_buffer=True) + return layers.read_file(reader), reader + + +def forward_transformer(src_vocab_size, + trg_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + label_smooth_eps, + use_py_reader=False, + is_test=False, + params_type="normal", + all_data_inputs=None): + """ + transformer + """ + if embedding_sharing: + assert src_vocab_size == trg_vocab_size, ( + "Vocabularies in source and target should be same for weight sharing." + ) + + data_input_names = encoder_data_input_fields + \ + decoder_data_input_fields[:-1] + label_data_input_fields + dense_bias_input_fields + + if use_py_reader: + all_inputs = all_data_inputs + else: + all_inputs = make_all_inputs(data_input_names) + + enc_inputs_len = len(encoder_data_input_fields) + dec_inputs_len = len(decoder_data_input_fields[:-1]) + enc_inputs = all_inputs[0:enc_inputs_len] + dec_inputs = all_inputs[enc_inputs_len:enc_inputs_len + dec_inputs_len] + real_label = all_inputs[enc_inputs_len + dec_inputs_len] + weights = all_inputs[enc_inputs_len + dec_inputs_len + 1] + reverse_label = all_inputs[enc_inputs_len + dec_inputs_len + 2] + + enc_output = wrap_encoder( + src_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + enc_inputs, + params_type=params_type) + + predict = wrap_decoder( + trg_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + dec_inputs, + enc_output, is_train = True if not is_test else False, + params_type=params_type) + + # Padding index do not contribute to the total loss. The weights is used to + # cancel padding index in calculating the loss. 
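+    # With label smoothing, the one-hot target is mixed with a uniform
+    # distribution over the remaining vocabulary entries:
+    #     smoothed = onehot * (1 - eps) + (1 - onehot) * eps / (V - 1)
+    # where eps = label_smooth_eps and V = trg_vocab_size; this is what the
+    # branch below computes before softmax_with_cross_entropy(soft_label=True).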
+ if label_smooth_eps: + label = layers.one_hot(input=real_label, depth=trg_vocab_size) + label = label * (1 - label_smooth_eps) + (1 - label) * ( + label_smooth_eps / (trg_vocab_size - 1)) + label.stop_gradient = True + else: + label = real_label + + cost = layers.softmax_with_cross_entropy( + logits=predict, + label=label, + soft_label=True if label_smooth_eps else False) + weighted_cost = cost * weights + sum_cost = layers.reduce_sum(weighted_cost) + sum_cost.persistable = True + token_num = layers.reduce_sum(weights) + token_num.persistable = True + token_num.stop_gradient = True + avg_cost = sum_cost / token_num + + sen_count = layers.shape(dec_inputs[0])[0] + batch_predict = layers.reshape(predict, shape = [sen_count, -1, ModelHyperParams.trg_vocab_size]) + #batch_label = layers.reshape(real_label, shape=[sen_count, -1]) + batch_weights = layers.reshape(weights, shape=[sen_count, -1, 1]) + return sum_cost, avg_cost, token_num, batch_predict, cost, sum_cost, real_label, batch_weights + + +def wrap_encoder(src_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + enc_inputs=None, + params_type="normal"): + """ + The wrapper assembles together all needed layers for the encoder. + """ + if enc_inputs is None: + # This is used to implement independent encoder program in inference. + src_word, src_pos, src_slf_attn_bias = make_all_inputs( + encoder_data_input_fields) + else: + src_word, src_pos, src_slf_attn_bias = enc_inputs + enc_input = prepare_encoder( + src_word, + src_pos, + src_vocab_size, + d_model, + max_length, + prepostprocess_dropout, + word_emb_param_name=word_emb_param_names[0], + params_type=params_type) + enc_output = encoder( + enc_input, + src_slf_attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, ) + return enc_output + + +def wrap_decoder(trg_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + dec_inputs=None, + enc_output=None, + caches=None, is_train=True, params_type="normal"): + """ + The wrapper assembles together all needed layers for the decoder. + """ + if dec_inputs is None: + # This is used to implement independent decoder program in inference. 
+ trg_word, reverse_trg_word, trg_pos, trg_slf_attn_bias, trg_src_attn_bias, enc_output = \ + make_all_inputs(decoder_data_input_fields) + else: + trg_word, reverse_trg_word, trg_pos, trg_slf_attn_bias, trg_src_attn_bias = dec_inputs + + dec_input = prepare_decoder( + trg_word, + trg_pos, + trg_vocab_size, + d_model, + max_length, + prepostprocess_dropout, + word_emb_param_name=word_emb_param_names[0] + if embedding_sharing else word_emb_param_names[1], + training=is_train, + params_type=params_type) + + dec_output = decoder( + dec_input, + enc_output, + trg_slf_attn_bias, + trg_src_attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + caches=caches) + # Reshape to 2D tensor to use GEMM instead of BatchedGEMM + dec_output = layers.reshape( + dec_output, shape=[-1, dec_output.shape[-1]], inplace=True) + + assert params_type == "fixed" or params_type == "normal" or params_type == "new" + pre_name = "forwardforward" + if params_type == "fixed": + pre_name = "fixed_forwardfixed_forward" + elif params_type == "new": + pre_name = "new_forwardnew_forward" + if weight_sharing and embedding_sharing: + predict = layers.matmul( + x=dec_output, + y=fluid.default_main_program().global_block().var( + pre_name + word_emb_param_names[0]), + transpose_y=True) + elif weight_sharing: + predict = layers.matmul( + x=dec_output, + y=fluid.default_main_program().global_block().var( + pre_name + word_emb_param_names[1]), + transpose_y=True) + else: + predict = layers.fc(input=dec_output, + size=trg_vocab_size, + bias_attr=False) + if dec_inputs is None: + # Return probs for independent decoder program. + predict = layers.softmax(predict) + return predict + + +def get_enc_bias(source_inputs): + """ + get_enc_bias + """ + source_inputs = layers.cast(source_inputs, 'float32') + emb_sum = layers.reduce_sum(layers.abs(source_inputs), dim=-1) + zero = layers.fill_constant([1], 'float32', value=0) + bias = layers.cast(layers.equal(emb_sum, zero), 'float32') * -1e9 + return layers.unsqueeze(layers.unsqueeze(bias, axes=[1]), axes=[1]) + + +def forward_fast_decode( + src_vocab_size, + trg_vocab_size, + max_in_len, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + beam_size, + batch_size, + max_out_len, + decode_alpha, + eos_idx, + params_type="normal"): + """ + Use beam search to decode. Caches will be used to store states of history + steps which can make the decoding faster. 
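+    `decode_alpha` is forwarded to BeamSearch as the length-normalization
+    exponent; with the common GNMT-style penalty this would score a hypothesis
+    of length L as log_prob / ((5 + L) / 6) ** decode_alpha, though the exact
+    form is defined in beam_search.py (assumption, not verified here).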
+ """ + + assert params_type == "normal" or params_type == "new" or params_type == "fixed" + data_input_names = encoder_data_input_fields + fast_decoder_data_input_fields + + all_inputs = make_all_inputs(data_input_names) + + enc_inputs_len = len(encoder_data_input_fields) + dec_inputs_len = len(fast_decoder_data_input_fields) + enc_inputs = all_inputs[0:enc_inputs_len] + dec_inputs = all_inputs[enc_inputs_len:enc_inputs_len + dec_inputs_len] + + enc_output = wrap_encoder(src_vocab_size, max_in_len, n_layer, n_head, + d_key, d_value, d_model, d_inner_hid, + prepostprocess_dropout, attention_dropout, + relu_dropout, preprocess_cmd, postprocess_cmd, + weight_sharing, embedding_sharing, enc_inputs, params_type=params_type) + enc_bias = get_enc_bias(enc_inputs[0]) + source_length, = dec_inputs + + def beam_search(enc_output, enc_bias, source_length): + """ + beam_search + """ + max_len = layers.fill_constant( + shape=[1], dtype='int64', value=max_out_len) + step_idx = layers.fill_constant( + shape=[1], dtype='int64', value=0) + cond = layers.less_than(x=step_idx, y=max_len) + while_op = layers.While(cond) + + caches_batch_size = batch_size * beam_size + init_score = np.zeros([1, beam_size]).astype('float32') + init_score[:, 1:] = -INF + initial_log_probs = layers.assign(init_score) + + alive_log_probs = layers.expand(initial_log_probs, [batch_size, 1]) + # alive seq [batch_size, beam_size, 1] + initial_ids = layers.zeros([batch_size, 1, 1], 'float32') + alive_seq = layers.expand(initial_ids, [1, beam_size, 1]) + alive_seq = layers.cast(alive_seq, 'int64') + + enc_output = layers.unsqueeze(enc_output, axes=[1]) + enc_output = layers.expand(enc_output, [1, beam_size, 1, 1]) + enc_output = layers.reshape(enc_output, [caches_batch_size, -1, d_model]) + + tgt_src_attn_bias = layers.unsqueeze(enc_bias, axes=[1]) + tgt_src_attn_bias = layers.expand(tgt_src_attn_bias, [1, beam_size, n_head, 1, 1]) + enc_bias_shape = layers.shape(tgt_src_attn_bias) + tgt_src_attn_bias = layers.reshape(tgt_src_attn_bias, [-1, enc_bias_shape[2], + enc_bias_shape[3], enc_bias_shape[4]]) + + beam_search = BeamSearch(beam_size, batch_size, decode_alpha, trg_vocab_size, d_model) + + caches = [{ + "k": layers.fill_constant( + shape=[caches_batch_size, 0, d_model], + dtype=enc_output.dtype, + value=0), + "v": layers.fill_constant( + shape=[caches_batch_size, 0, d_model], + dtype=enc_output.dtype, + value=0) + } for i in range(n_layer)] + + finished_seq = layers.zeros_like(alive_seq) + finished_scores = layers.fill_constant([batch_size, beam_size], + dtype='float32', value=-INF) + finished_flags = layers.fill_constant([batch_size, beam_size], + dtype='float32', value=0) + + with while_op.block(): + pos = layers.fill_constant([caches_batch_size, 1, 1], dtype='int64', value=1) + pos = layers.elementwise_mul(pos, step_idx, axis=0) + + alive_seq_1 = layers.reshape(alive_seq, [caches_batch_size, -1]) + alive_seq_2 = alive_seq_1[:, -1:] + alive_seq_2 = layers.unsqueeze(alive_seq_2, axes=[1]) + + logits = wrap_decoder( + trg_vocab_size, max_in_len, n_layer, n_head, d_key, + d_value, d_model, d_inner_hid, prepostprocess_dropout, + attention_dropout, relu_dropout, preprocess_cmd, + postprocess_cmd, weight_sharing, embedding_sharing, + dec_inputs=(alive_seq_2, alive_seq_2, pos, None, tgt_src_attn_bias), + enc_output=enc_output, caches=caches, is_train=False, params_type=params_type) + + alive_seq_2, alive_log_probs_2, finished_seq_2, finished_scores_2, finished_flags_2, caches_2 = \ + beam_search.inner_func(step_idx, logits, alive_seq_1, 
alive_log_probs, finished_seq, + finished_scores, finished_flags, caches, enc_output, + tgt_src_attn_bias) + + layers.increment(x=step_idx, value=1.0, in_place=True) + finish_cond = beam_search.is_finished(step_idx, source_length, alive_log_probs_2, + finished_scores_2, finished_flags_2) + + layers.assign(alive_seq_2, alive_seq) + layers.assign(alive_log_probs_2, alive_log_probs) + layers.assign(finished_seq_2, finished_seq) + layers.assign(finished_scores_2, finished_scores) + layers.assign(finished_flags_2, finished_flags) + + for i in xrange(len(caches_2)): + layers.assign(caches_2[i]["k"], caches[i]["k"]) + layers.assign(caches_2[i]["v"], caches[i]["v"]) + + layers.logical_and(x=cond, y=finish_cond, out=cond) + + finished_flags = layers.reduce_sum(finished_flags, dim=1, keep_dim=True) / beam_size + finished_flags = layers.cast(finished_flags, 'bool') + mask = layers.cast(layers.reduce_any(input=finished_flags, dim=1, keep_dim=True), 'float32') + mask = layers.expand(mask, [1, beam_size]) + + mask2 = 1.0 - mask + finished_seq = layers.cast(finished_seq, 'float32') + alive_seq = layers.cast(alive_seq, 'float32') + #print mask + + finished_seq = layers.elementwise_mul(finished_seq, mask, axis=0) + \ + layers.elementwise_mul(alive_seq, mask2, axis = 0) + finished_seq = layers.cast(finished_seq, 'int32') + finished_scores = layers.elementwise_mul(finished_scores, mask, axis=0) + \ + layers.elementwise_mul(alive_log_probs, mask2) + finished_seq.persistable = True + finished_scores.persistable = True + + return finished_seq, finished_scores + + finished_ids, finished_scores = beam_search(enc_output, enc_bias, source_length) + return finished_ids, finished_scores diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/gen_records.py b/PaddleNLP/Research/EMNLP2019-MAL/src/gen_records.py new file mode 100644 index 0000000000000000000000000000000000000000..8b2e2b1a61466762fd4b220914d65a0648f0286c --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/gen_records.py @@ -0,0 +1,220 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
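Before the record-generation script below, a short note on the decoder above: the tail of the decoding loop falls back to the still-alive beams for any sentence whose beams never produced a finished hypothesis within `max_out_len`. A rough NumPy sketch of that per-sentence selection, with illustrative names and shapes rather than the Fluid tensors:

```
import numpy as np

def select_outputs(finished_seq, finished_scores, alive_seq, alive_log_probs, finished_flags):
    # finished_flags: [batch, beam] in {0, 1}; a sentence uses its finished
    # hypotheses if at least one beam finished, otherwise the alive beams.
    has_finished = finished_flags.any(axis=1, keepdims=True)    # [batch, 1]
    mask = has_finished.astype("float32")                       # 1.0 -> use finished
    seq = np.where(has_finished[..., None], finished_seq, alive_seq)
    scores = finished_scores * mask + alive_log_probs * (1.0 - mask)
    return seq, scores

# toy shapes: batch=2, beam=2, length=3; sentence 0 finished, sentence 1 did not
finished_seq = np.ones((2, 2, 3), dtype="int64")
alive_seq = np.full((2, 2, 3), 7, dtype="int64")
flags = np.array([[1.0, 0.0], [0.0, 0.0]])
seq, scores = select_outputs(finished_seq, np.zeros((2, 2)), alive_seq,
                             np.full((2, 2), -1.0), flags)
print(seq[0, 0], seq[1, 0])   # finished ids for sentence 0, alive ids for sentence 1
```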
+ +import logging +import os, sys +import random +import six +import ast + +class TRDataGen(object): + """record data generator + """ + + def __init__(self, num_shards, data_dir): + self.num_shards = num_shards + self.data_dir = data_dir + + def gen_data_fnames(self, is_train=True): + """generate filenames for train and valid + return: + train_filenames, valid_filenames + """ + if not os.path.isdir(self.data_dir): + try: + os.mkdir(self.data_dir) + except Exception as e: + raise ValueError("%s is exists as one file", self.data_dir) + if is_train: + train_prefix = os.path.join(self.data_dir, "translate-train-%05d-of_unshuffle") + return [train_prefix % i for i in xrange(self.num_shards)] + return [os.path.join(self.data_dir, "translate-dev-00000-of_unshuffle")] + + def generate(self, data_list, is_train=True, is_shuffle=True): + """generating record file + :param data_list: + :param is_train: + :return: + """ + output_filename = self.gen_data_fnames(is_train) + #writers = [tf.python_io.TFRecordWriter(fname) for fname in output_filename] + writers = [open(fname, 'w') for fname in output_filename] + ct = 0 + shard = 0 + for case in data_list: + ct += 1 + if ct % 10000 == 0: + logging.info("Generating case %s ." % ct) + + example = self.to_example(case) + writers[shard].write(example.strip() + "\n") + if is_train: + shard = (shard + 1) % self.num_shards + logging.info("Generating case %s ." % ct) + for writer in writers: + writer.close() + if is_shuffle: + self.shuffle_dataset(output_filename) + + def to_example(self, dictionary): + """ + :param source: + :param target: + :return: + """ + + if "inputs" not in dictionary or "targets" not in dictionary: + raise ValueError("Empty generated field: inputs or target") + + inputs = " ".join(str(x) for x in dictionary["inputs"]) + targets = " ".join(str(x) for x in dictionary["targets"]) + return inputs + "\t" + targets + + def shuffle_dataset(self, filenames): + """ + :return: + """ + logging.info("Shuffling data...") + for fname in filenames: + records = self.read_records(fname) + random.shuffle(records) + out_fname = fname.replace("_unshuffle", "-shuffle") + self.write_records(records, out_fname) + os.remove(fname) + + def read_records(self, filename): + """ + :param filename: + :return: + """ + records = [] + with open(filename, 'r') as reader: + for record in reader: + records.append(record) + if len(records) % 100000 == 0: + logging.info("read: %d", len(records)) + return records + + def write_records(self, records, out_filename): + """ + :param records: + :param out_filename: + :return: + """ + with open(out_filename, 'w') as f: + for count, record in enumerate(records): + f.write(record) + if count > 0 and count % 100000 == 0: + logging.info("write: %d", count) + + +if __name__ == "__main__": + from preprocess.problem import SubwordVocabProblem + from preprocess.problem import TokenVocabProblem + import argparse + + parser = argparse.ArgumentParser("Tips for generating subword.") + parser.add_argument( + "--tmp_dir", + type=str, + required=True, + help="dir that includes original corpus.") + + parser.add_argument( + "--data_dir", + type=str, + required=True, + help="dir that generates training files") + + parser.add_argument( + "--source_train_files", + type=str, + required=True, + help="train file for source") + + parser.add_argument( + "--target_train_files", + type=str, + required=True, + help="train file for target") + + parser.add_argument( + "--source_vocab_size", + type=int, + required=True, + help="source_vocab_size") + + 
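As a side note before the remaining argument definitions: `TRDataGen` above writes one example per line as tab-separated, space-joined token ids, distributes training examples round-robin over the shard files, and then shuffles each shard in memory. A rough standalone sketch of that flow, with hypothetical file names and toy data:

```
import os
import random

def to_example(inputs_ids, target_ids):
    # One record per line: "src ids<TAB>trg ids"
    return " ".join(map(str, inputs_ids)) + "\t" + " ".join(map(str, target_ids))

def write_shards(examples, num_shards, prefix="translate-train"):
    # Round-robin the examples over num_shards files, then shuffle each shard.
    names = ["%s-%05d-of_unshuffle" % (prefix, i) for i in range(num_shards)]
    writers = [open(n, "w") for n in names]
    for i, (src, trg) in enumerate(examples):
        writers[i % num_shards].write(to_example(src, trg) + "\n")
    for w in writers:
        w.close()
    for name in names:
        with open(name) as f:
            lines = f.readlines()
        random.shuffle(lines)
        with open(name.replace("_unshuffle", "-shuffle"), "w") as f:
            f.writelines(lines)
        os.remove(name)

write_shards([([2, 15, 1], [4, 9, 1]), ([7, 1], [8, 3, 1])], num_shards=2)
```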
parser.add_argument( + "--target_vocab_size", + type=int, + required=True, + help="target_vocab_size") + + parser.add_argument( + "--num_shards", + type=int, + default=100, + help="number of shards") + + parser.add_argument( + "--subword", + type=ast.literal_eval, + default=False, + help="subword") + + parser.add_argument( + "--token", + type=ast.literal_eval, + default=False, + help="token") + + parser.add_argument( + "--onevocab", + type=ast.literal_eval, + default=False, + help="share vocab") + + args = parser.parse_args() + print args + + gen = TRDataGen(args.num_shards, args.data_dir) + source_train_files = args.source_train_files.split(",") + target_train_files = args.target_train_files.split(",") + if args.token == args.subword: + print "one of subword or token is True" + import sys + + sys.exit(1) + + LOG_FORMAT = "[%(asctime)s %(levelname)s %(filename)s:%(lineno)d] %(message)s" + logging.basicConfig(stream=sys.stdout, level=logging.INFO, format=LOG_FORMAT) + + if args.subword: + problem = SubwordVocabProblem(args.source_vocab_size, + args.target_vocab_size, + source_train_files, + target_train_files, + None, + None, + args.onevocab) + else: + problem = TokenVocabProblem(args.source_vocab_size, + args.target_vocab_size, + source_train_files, + target_train_files, + None, + None, + args.onevocab) + + gen.generate(problem.generate_data(args.data_dir, args.tmp_dir, True), True, True) diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/id2word.py b/PaddleNLP/Research/EMNLP2019-MAL/src/id2word.py new file mode 100644 index 0000000000000000000000000000000000000000..c7708b484604fa530cbc649c22caaad99f89580b --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/id2word.py @@ -0,0 +1,44 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys + +id2word = {} +ln = sys.stdin + +def load_vocab(file_path): + start_index = 0 + f = open(file_path, 'r') + + for line in f: + line = line.strip() + id2word[start_index] = line + start_index += 1 + f.close() + +if __name__=="__main__": + load_vocab(sys.argv[1]) + while True: + line = ln.readline().strip() + if not line: + break + + split_res = line.split(" ") + output_str = "" + for item in split_res: + output_str += id2word[int(item.strip())] + output_str += " " + output_str = output_str.strip() + print output_str + diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/infer.py b/PaddleNLP/Research/EMNLP2019-MAL/src/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..bbd6517b9d849bc04018d46134a8064588845778 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/infer.py @@ -0,0 +1,470 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import ast +import multiprocessing +import numpy as np +import os +from functools import partial + +import contextlib +import time +import paddle.fluid.profiler as profiler + +import paddle +import paddle.fluid as fluid + +import forward_model +import reader +import sys +from config import * +from forward_model import wrap_encoder as encoder +from forward_model import wrap_decoder as decoder +from forward_model import forward_fast_decode +from dense_model import dense_fast_decode +from relative_model import relative_fast_decode +from forward_model import forward_position_encoding_init +from reader import * + + +def parse_args(): + """ + parse_args + """ + parser = argparse.ArgumentParser("Training for Transformer.") + parser.add_argument( + "--val_file_pattern", + type=str, + required=True, + help="The pattern to match test data files.") + parser.add_argument( + "--batch_size", + type=int, + default=50, + help="The number of examples in one run for sequence generation.") + parser.add_argument( + "--pool_size", + type=int, + default=10000, + help="The buffer size to pool data.") + parser.add_argument( + "--special_token", + type=str, + default=["", "", ""], + nargs=3, + help="The , and tokens in the dictionary.") + parser.add_argument( + "--token_delimiter", + type=lambda x: str(x.encode().decode("unicode-escape")), + default=" ", + help="The delimiter used to split tokens in source or target sentences. " + "For EN-DE BPE data we provided, use spaces as token delimiter. 
") + parser.add_argument( + "--use_mem_opt", + type=ast.literal_eval, + default=True, + help="The flag indicating whether to use memory optimization.") + parser.add_argument( + "--use_py_reader", + type=ast.literal_eval, + default=False, + help="The flag indicating whether to use py_reader.") + parser.add_argument( + "--use_parallel_exe", + type=ast.literal_eval, + default=False, + help="The flag indicating whether to use ParallelExecutor.") + parser.add_argument( + "--use_candidate", + type=ast.literal_eval, + default=False, + help="The flag indicating whether to use candidates.") + parser.add_argument( + "--common_ids", + type=str, + default="", + help="The file path of common ids.") + parser.add_argument( + 'opts', + help='See config.py for all options', + default=None, + nargs=argparse.REMAINDER) + parser.add_argument( + "--use_delay_load", + type=ast.literal_eval, + default=True, + help= + "The flag indicating whether to load all data into memories at once.") + parser.add_argument( + "--vocab_size", + type=str, + required=True, + help="Size of Vocab.") + parser.add_argument( + "--infer_batch_size", + type=int, + help="Infer batch_size") + parser.add_argument( + "--decode_alpha", + type=float, + help="decode_alpha") + + args = parser.parse_args() + # Append args related to dict + #src_dict = reader.DataReader.load_dict(args.src_vocab_fpath) + #trg_dict = reader.DataReader.load_dict(args.trg_vocab_fpath) + #dict_args = [ + # "src_vocab_size", str(len(src_dict)), "trg_vocab_size", + # str(len(trg_dict)), "bos_idx", str(src_dict[args.special_token[0]]), + # "eos_idx", str(src_dict[args.special_token[1]]), "unk_idx", + # str(src_dict[args.special_token[2]]) + #] + voc_size = args.vocab_size + dict_args = [ + "src_vocab_size", voc_size, + "trg_vocab_size", voc_size, + "bos_idx", str(0), + "eos_idx", str(1), + "unk_idx", str(int(voc_size) - 1) + ] + merge_cfg_from_list(args.opts + dict_args, + [InferTaskConfig, ModelHyperParams]) + return args + + +def post_process_seq(seq, + bos_idx=ModelHyperParams.bos_idx, + eos_idx=ModelHyperParams.eos_idx, + output_bos=InferTaskConfig.output_bos, + output_eos=InferTaskConfig.output_eos): + """ + Post-process the beam-search decoded sequence. Truncate from the first + and remove the and tokens currently. + """ + eos_pos = len(seq) - 1 + for i, idx in enumerate(seq): + if idx == eos_idx: + eos_pos = i + break + seq = [ + idx for idx in seq[:eos_pos + 1] + if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx) + ] + return seq + + +def prepare_batch_input(insts, data_input_names, src_pad_idx, bos_idx, n_head, + d_model): + """ + Put all padded data needed by beam search decoder into a dict. + """ + src_word, src_pos, src_slf_attn_bias, src_max_len = pad_batch_data( + [inst[0] for inst in insts], src_pad_idx, n_head, is_target=False) + source_length = np.asarray([src_max_len], dtype="int64") + src_word = src_word.reshape(-1, src_max_len, 1) + src_pos = src_pos.reshape(-1, src_max_len, 1) + + data_input_dict = dict( + zip(data_input_names, [ + src_word, src_pos, src_slf_attn_bias, source_length + ])) + + return data_input_dict + + +def prepare_feed_dict_list(data_generator, count): + """ + Prepare the list of feed dict for multi-devices. 
+ """ + feed_dict_list = [] + if data_generator is not None: # use_py_reader == False + data_input_names = encoder_data_input_fields + fast_decoder_data_input_fields + data = next(data_generator) + for idx, data_buffer in enumerate(data): + data_input_dict = prepare_batch_input( + data_buffer, data_input_names, ModelHyperParams.bos_idx, + ModelHyperParams.bos_idx, ModelHyperParams.n_head, + ModelHyperParams.d_model) + feed_dict_list.append(data_input_dict) + return feed_dict_list if len(feed_dict_list) == count else None + + +def prepare_dense_feed_dict_list(data_generator, count): + """ + Prepare the list of feed dict for multi-devices. + """ + feed_dict_list = [] + if data_generator is not None: # use_py_reader == False + data_input_names = dense_encoder_data_input_fields + fast_decoder_data_input_fields + data = next(data_generator) + for idx, data_buffer in enumerate(data): + data_input_dict = prepare_batch_input( + data_buffer, data_input_names, DenseModelHyperParams.bos_idx, + DenseModelHyperParams.bos_idx, DenseModelHyperParams.n_head, + DenseModelHyperParams.d_model) + feed_dict_list.append(data_input_dict) + return feed_dict_list if len(feed_dict_list) == count else None + + +def prepare_infer_feed_dict_list(data_generator, count): + feed_dict_list = [] + if data_generator is not None: # use_py_reader == False + data_input_names = encoder_data_input_fields + fast_decoder_data_input_fields + dense_data_input_names = dense_encoder_data_input_fields + fast_decoder_data_input_fields + data = next(data_generator) + for idx, data_buffer in enumerate(data): + dense_data_input_dict = prepare_batch_input( + data_buffer, dense_data_input_names, DenseModelHyperParams.bos_idx, + DenseModelHyperParams.bos_idx, DenseModelHyperParams.n_head, + DenseModelHyperParams.d_model) + + data_input_dict = prepare_batch_input(data_buffer, data_input_names, + ModelHyperParams.bos_idx, ModelHyperParams.bos_idx, + ModelHyperParams.n_head, ModelHyperParams.d_model) + + for key in dense_data_input_dict: + if key not in data_input_dict: + data_input_dict[key] = dense_data_input_dict[key] + + feed_dict_list.append(data_input_dict) + return feed_dict_list if len(feed_dict_list) == count else None + + + +def get_trans_res(batch_size, out_list, final_list): + """ + Get trans + """ + for index in xrange(batch_size): + seq = out_list[index][0] #top1 seq + + if 1 not in seq: + res = seq[1:-1] + else: + res = seq[1:seq.index(1)] + + res = map(str, res) + final_list.append(" ".join(res)) + + +def fast_infer(args): + """ + Inference by beam search decoder based solely on Fluid operators. 
+ """ + test_prog = fluid.Program() + startup_prog = fluid.Program() + + #with fluid.program_guard(test_prog, startup_prog): + with fluid.unique_name.guard("new_forward"): + out_ids1, out_scores1 = forward_fast_decode( + ModelHyperParams.src_vocab_size, + ModelHyperParams.trg_vocab_size, + ModelHyperParams.max_length + 50, + ModelHyperParams.n_layer, + ModelHyperParams.n_head, + ModelHyperParams.d_key, + ModelHyperParams.d_value, + ModelHyperParams.d_model, + ModelHyperParams.d_inner_hid, + ModelHyperParams.prepostprocess_dropout, + ModelHyperParams.attention_dropout, + ModelHyperParams.relu_dropout, + ModelHyperParams.preprocess_cmd, + ModelHyperParams.postprocess_cmd, + ModelHyperParams.weight_sharing, + ModelHyperParams.embedding_sharing, + InferTaskConfig.beam_size, + args.infer_batch_size, + InferTaskConfig.max_out_len, + args.decode_alpha, + ModelHyperParams.eos_idx, + params_type="new" + ) + + with fluid.unique_name.guard("new_relative_position"): + out_ids2, out_scores2 = relative_fast_decode( + ModelHyperParams.src_vocab_size, + ModelHyperParams.trg_vocab_size, + ModelHyperParams.max_length + 50, + ModelHyperParams.n_layer, + ModelHyperParams.n_head, + ModelHyperParams.d_key, + ModelHyperParams.d_value, + ModelHyperParams.d_model, + ModelHyperParams.d_inner_hid, + ModelHyperParams.prepostprocess_dropout, + ModelHyperParams.attention_dropout, + ModelHyperParams.relu_dropout, + ModelHyperParams.preprocess_cmd, + ModelHyperParams.postprocess_cmd, + ModelHyperParams.weight_sharing, + ModelHyperParams.embedding_sharing, + InferTaskConfig.beam_size, + args.infer_batch_size, + InferTaskConfig.max_out_len, + args.decode_alpha, + ModelHyperParams.eos_idx, + params_type="new" + ) + + DenseModelHyperParams.src_vocab_size = ModelHyperParams.src_vocab_size + DenseModelHyperParams.trg_vocab_size = ModelHyperParams.trg_vocab_size + DenseModelHyperParams.weight_sharing = ModelHyperParams.weight_sharing + DenseModelHyperParams.embedding_sharing = ModelHyperParams.embedding_sharing + with fluid.unique_name.guard("new_dense"): + out_ids3, out_scores3 = dense_fast_decode( + DenseModelHyperParams.src_vocab_size, + DenseModelHyperParams.trg_vocab_size, + DenseModelHyperParams.max_length + 50, + DenseModelHyperParams.n_layer, + DenseModelHyperParams.enc_n_layer, + DenseModelHyperParams.n_head, + DenseModelHyperParams.d_key, + DenseModelHyperParams.d_value, + DenseModelHyperParams.d_model, + DenseModelHyperParams.d_inner_hid, + DenseModelHyperParams.prepostprocess_dropout, + DenseModelHyperParams.attention_dropout, + DenseModelHyperParams.relu_dropout, + DenseModelHyperParams.preprocess_cmd, + DenseModelHyperParams.postprocess_cmd, + DenseModelHyperParams.weight_sharing, + DenseModelHyperParams.embedding_sharing, + InferTaskConfig.beam_size, + args.infer_batch_size, + InferTaskConfig.max_out_len, + args.decode_alpha, + ModelHyperParams.eos_idx, + params_type="new" + ) + + test_prog = fluid.default_main_program().clone(for_test=True) + # This is used here to set dropout to the test mode. 
+ + if InferTaskConfig.use_gpu: + place = fluid.CUDAPlace(0) + dev_count = fluid.core.get_cuda_device_count() + else: + place = fluid.CPUPlace() + dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count())) + exe = fluid.Executor(place) + exe.run(fluid.default_startup_program()) + + fluid.io.load_params( + exe, + InferTaskConfig.model_path, + main_program=test_prog) + + + if args.use_mem_opt: + fluid.memory_optimize(test_prog) + + exec_strategy = fluid.ExecutionStrategy() + # For faster executor + exec_strategy.use_experimental_executor = True + exec_strategy.num_threads = 1 + build_strategy = fluid.BuildStrategy() + + # data reader settings for inference + args.use_token_batch = False + #args.sort_type = reader.SortType.NONE + args.shuffle = False + args.shuffle_batch = False + + dev_count = 1 + lines_cnt = len(open(args.val_file_pattern, 'r').readlines()) + data_reader = line_reader(args.val_file_pattern, args.infer_batch_size, dev_count, + token_delimiter=args.token_delimiter, + max_len=ModelHyperParams.max_length, + parse_line=parse_src_line) + + test_data = prepare_data_generator( + args, + is_test=True, + count=dev_count, + pyreader=None, + batch_size=args.infer_batch_size, data_reader=data_reader) + + data_generator = test_data() + iter_num = 0 + + if not os.path.exists("trans"): + os.mkdir("trans") + + model_name = InferTaskConfig.model_path.split("/")[-1] + forward_res = open(os.path.join("trans", "forward_%s" % model_name), 'w') + relative_res = open(os.path.join("trans", "relative_%s" % model_name), 'w') + dense_res = open(os.path.join("trans", "dense_%s" % model_name), 'w') + + forward_list = [] + relative_list = [] + dense_list = [] + with profile_context(False): + while True: + try: + feed_dict_list = prepare_infer_feed_dict_list(data_generator, dev_count) + + forward_seq_ids, relative_seq_ids, dense_seq_ids = exe.run( + program=test_prog, + fetch_list=[out_ids1.name, out_ids2.name, out_ids3.name], + feed=feed_dict_list[0] + if feed_dict_list is not None else None, + return_numpy=False, + use_program_cache=False) + + fseq_ids = np.asarray(forward_seq_ids).tolist() + rseq_ids = np.asarray(relative_seq_ids).tolist() + dseq_ids = np.asarray(dense_seq_ids).tolist() + + get_trans_res(args.infer_batch_size, fseq_ids, forward_list) + get_trans_res(args.infer_batch_size, rseq_ids, relative_list) + get_trans_res(args.infer_batch_size, dseq_ids, dense_list) + + + except (StopIteration, fluid.core.EOFException): + break + forward_list = forward_list[:lines_cnt] + relative_list = relative_list[:lines_cnt] + dense_list = dense_list[:lines_cnt] + + forward_res.writelines("\n".join(forward_list)) + forward_res.flush() + forward_res.close() + + relative_res.writelines("\n".join(relative_list)) + relative_res.flush() + relative_res.close() + + dense_res.writelines("\n".join(dense_list)) + dense_res.flush() + dense_res.close() + + +@contextlib.contextmanager +def profile_context(profile=True): + """ + profile_context + """ + if profile: + with profiler.profiler('All', 'total', './profile_dir/profile_file_tmp'): + yield + else: + yield + + +if __name__ == "__main__": + args = parse_args() + fast_infer(args) + diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/launch.py b/PaddleNLP/Research/EMNLP2019-MAL/src/launch.py new file mode 100644 index 0000000000000000000000000000000000000000..f877816f008d9714195557bf6cd3be6c295b1972 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/launch.py @@ -0,0 +1,146 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import sys +import subprocess +import commands +import os +import six +import copy +import argparse +import time + +from args import ArgumentGroup, print_arguments, inv_arguments + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +multip_g = ArgumentGroup(parser, "multiprocessing", + "start paddle training using multi-processing mode.") +multip_g.add_arg("node_ips", str, None, + "paddle trainer ips") +multip_g.add_arg("node_id", int, None, + "the trainer id of the node for multi-node distributed training.") +multip_g.add_arg("print_config", bool, True, + "print the config of multi-processing mode.") +multip_g.add_arg("current_node_ip", str, None, + "the ip of current node.") +multip_g.add_arg("split_log_path", str, "log", + "log path for each trainer.") +multip_g.add_arg("log_prefix", str, "", + "the prefix name of job log.") +multip_g.add_arg("nproc_per_node", int, 8, + "the number of process to use on each node.") +multip_g.add_arg("selected_gpus", str, "0,1,2,3,4,5,6,7", + "the gpus selected to use.") +multip_g.add_arg("training_script", str, None, "the program/script to be lauched " + "in parallel followed by all the arguments", positional_arg=True) +multip_g.add_arg("training_script_args", str, None, + "training script args", positional_arg=True, nargs=argparse.REMAINDER) +# yapf: enable + + +def start_procs(args): + """ + start_procs + """ + procs = [] + log_fns = [] + + default_env = os.environ.copy() + + node_id = args.node_id + node_ips = [x.strip() for x in args.node_ips.split(',')] + current_ip = args.current_node_ip + num_nodes = len(node_ips) + selected_gpus = [x.strip() for x in args.selected_gpus.split(',')] + selected_gpu_num = len(selected_gpus) + + all_trainer_endpoints = "" + for ip in node_ips: + for i in range(args.nproc_per_node): + if all_trainer_endpoints != "": + all_trainer_endpoints += "," + all_trainer_endpoints += "%s:617%d" % (ip, i) + + nranks = num_nodes * args.nproc_per_node + gpus_per_proc = args.nproc_per_node % selected_gpu_num + if gpus_per_proc == 0: + gpus_per_proc = selected_gpu_num / args.nproc_per_node + else: + gpus_per_proc = selected_gpu_num / args.nproc_per_node + 1 + + selected_gpus_per_proc = [selected_gpus[i:i + gpus_per_proc] + for i in range(0, len(selected_gpus), gpus_per_proc)] + + if args.print_config: + print("all_trainer_endpoints: ", all_trainer_endpoints, + ", node_id: ", node_id, + ", current_ip: ", current_ip, + ", num_nodes: ", num_nodes, + ", node_ips: ", node_ips, + ", gpus_per_proc: ", gpus_per_proc, + ", selected_gpus_per_proc: ", selected_gpus_per_proc, + ", nranks: ", nranks) + + current_env = copy.copy(default_env) + procs = [] + cmds = [] + log_fns = [] + for i in range(0, args.nproc_per_node): + trainer_id = node_id * args.nproc_per_node + i + current_env.update({ + "FLAGS_selected_gpus": "%s" % ",".join([str(s) for s in selected_gpus_per_proc[i]]), + "PADDLE_TRAINER_ID": "%d" % trainer_id, + "PADDLE_CURRENT_ENDPOINT": "%s:617%d" % (current_ip, i), + 
"PADDLE_TRAINERS_NUM": "%d" % nranks, + "PADDLE_TRAINER_ENDPOINTS": all_trainer_endpoints, + "PADDLE_NODES_NUM": "%d" % num_nodes + }) + + cmd = [sys.executable, "-u", + args.training_script] + args.training_script_args + cmds.append(cmd) + + if args.split_log_path: + fn = open("%s/%sjob.log.%d" % (args.split_log_path, args.log_prefix, trainer_id), "a") + log_fns.append(fn) + process = subprocess.Popen(cmd, env=current_env, stdout=fn, stderr=fn) + else: + process = subprocess.Popen(cmd, env=current_env) + procs.append(process) + + for i in range(len(procs)): + proc = procs[i] + proc.wait() + if len(log_fns) > 0: + log_fns[i].close() + if proc.returncode != 0: + raise subprocess.CalledProcessError(returncode=procs[i].returncode, + cmd=cmds[i]) + else: + print("proc %d finsh" % i) + + +def main(args): + """ + main_func + """ + if args.print_config: + print_arguments(args) + start_procs(args) + + +if __name__ == "__main__": + lanch_args = parser.parse_args() + main(lanch_args) diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/__init__.py b/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e945bdba5e379f14cdf29c76da2fc3baf34dd48e --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/__init__.py @@ -0,0 +1,14 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/gen_utils.py b/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/gen_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..9ff897ed54ac77b19a42cfbd8f3ac935fdb5f10b --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/gen_utils.py @@ -0,0 +1,208 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os +import sys +import logging +import argparse +from text_encoder import SubwordTextEncoder, TokenTextEncoder +from text_encoder import EOS_ID + + +def get_or_generate_vocab(data_dir, tmp_dir, vocab_filename, vocab_size, + sources, file_byte_budget=1e6): + """Generate a vocabulary from the datasets in sources.""" + + def generate(): + """Generate lines for vocabulary generation.""" + logging.info("Generating vocab from: %s", str(sources)) + for source in sources: + for lang_file in source[1]: + logging.info("Reading file: %s" % lang_file) + + filepath = os.path.join(tmp_dir, lang_file) + with open(filepath, mode="r") as source_file: + file_byte_budget_ = file_byte_budget + counter = 0 + countermax = int(os.path.getsize(filepath) / file_byte_budget_ / 2) + logging.info("countermax: %d" % countermax) + for line in source_file: + if counter < countermax: + counter += 1 + else: + if file_byte_budget_ <= 0: + break + line = line.strip() + file_byte_budget_ -= len(line) + counter = 0 + yield line + + return get_or_generate_vocab_inner(data_dir, vocab_filename, vocab_size, + generate()) + + +def get_or_generate_vocab_inner(data_dir, vocab_filename, vocab_size, + generator, max_subtoken_length=None, + reserved_tokens=None): + """Inner implementation for vocab generators. + + Args: + data_dir: The base directory where data and vocab files are stored. If None, + then do not save the vocab even if it doesn't exist. + vocab_filename: relative filename where vocab file is stored + vocab_size: target size of the vocabulary constructed by SubwordTextEncoder + generator: a generator that produces tokens from the vocabulary + max_subtoken_length: an optional integer. Set this to a finite value to + avoid quadratic costs during vocab building. + reserved_tokens: List of reserved tokens. `text_encoder.RESERVED_TOKENS` + should be a prefix of `reserved_tokens`. If `None`, defaults to + `RESERVED_TOKENS`. + + Returns: + A SubwordTextEncoder vocabulary object. 
+ """ + if data_dir and vocab_filename: + vocab_filepath = os.path.join(data_dir, vocab_filename) + if os.path.exists(vocab_filepath): + logging.info("Found vocab file: %s", vocab_filepath) + return SubwordTextEncoder(vocab_filepath) + else: + vocab_filepath = None + + logging.info("Generating vocab file: %s", vocab_filepath) + vocab = SubwordTextEncoder.build_from_generator( + generator, vocab_size, max_subtoken_length=max_subtoken_length, + reserved_tokens=reserved_tokens) + + if vocab_filepath: + if not os.path.exists(data_dir): + os.makedirs(data_dir) + vocab.store_to_file(vocab_filepath) + + return vocab + + +def txt_line_iterator(fname): + """ + generator for line + :param fname: + :return: + """ + with open(fname, 'r') as f: + for line in f: + yield line.strip() + + +def txt2txt_generator(source_fname, target_fname): + """ + + :param source_fname: + :param target_fname: + :return: + """ + for source, target in zip( + txt_line_iterator(source_fname), + txt_line_iterator(target_fname) + ): + yield {"inputs": source, "targets": target} + + +def txt2txt_encoder(sample_generator, vocab, target_vocab=None): + """ + + :param sample_generator: + :param vocab: + :param target_vocab: + :return: + """ + target_vocab = target_vocab or vocab + for sample in sample_generator: + sample["inputs"] = vocab.encode(sample["inputs"]) + sample["inputs"].append(EOS_ID) + sample["targets"] = target_vocab.encode(sample["targets"]) + sample["targets"].append(EOS_ID) + yield sample + + +def txt_encoder(filename, batch_size=1, vocab=None): + """ + + :param sample_generator: + :param vocab: + :return: + """ + def pad_mini_batch(batch): + """ + + :param batch: + :return: + """ + lens = map(lambda x: len(x), batch) + max_len = max(lens) + for i in range(len(batch)): + batch[i] = batch[i] + [0] * (max_len - lens[i]) + return batch + + fp = open(filename, 'r') + samples = [] + batches = [] + ct = 0 + for sample in fp: + sample = sample.strip() + + if vocab: + sample = vocab.encode(sample) + else: + sample = [int(s) for s in sample] + #sample.append(EOS_ID) + batches.append(sample) + ct += 1 + if ct % batch_size == 0: + batches = pad_mini_batch(batches) + samples.extend(batches) + batches = [] + if ct % batch_size != 0: + batches += [batches[-1]] * (batch_size - ct % batch_size) + batches = pad_mini_batch(batches) + samples.extend(batches) + return samples + +if __name__ == "__main__": + parser = argparse.ArgumentParser("Tips for generating testset") + parser.add_argument( + "--vocab", + type=str, + required=True, + help="The path of source vocab.") + + parser.add_argument( + "--testset", + type=str, + required=True, + help="The path of testset.") + + parser.add_argument( + "--output", + type=str, + required=True, + help="The path of result.") + + args = parser.parse_args() + token = TokenTextEncoder(args.vocab) + samples = txt_encoder(args.testset, 1, token) + + with open(args.output, 'w') as f: + for sample in samples: + res = [str(item) for item in sample] + f.write("%s\n" % " ".join(res)) diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/problem.py b/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/problem.py new file mode 100644 index 0000000000000000000000000000000000000000..595be317423f4c11155b970782febc5b2e8a2189 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/problem.py @@ -0,0 +1,297 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from gen_utils import get_or_generate_vocab +from gen_utils import txt_line_iterator +import os, sys +from gen_utils import txt2txt_encoder +from gen_utils import txt2txt_generator +from text_encoder import TokenTextEncoder +import logging + +LOG_FORMAT = "[%(asctime)s %(levelname)s %(filename)s:%(lineno)d] %(message)s" +logging.basicConfig(stream=sys.stdout, level=logging.INFO, format=LOG_FORMAT) + +class GenSubword(object): + """ + gen subword + """ + + def __init__(self, + vocab_size=8000, + training_dataset_filenames="train.txt"): + """ + + :param vocab_size: + :param vocab_name: + :param training_dataset_filenames: list + """ + self.vocab_size = vocab_size + self.vocab_name = "vocab.%s" % self.vocab_size + if not isinstance(training_dataset_filenames, list): + training_dataset_filenames = [training_dataset_filenames] + self.training_dataset_filenames = training_dataset_filenames + + def generate_data(self, data_dir, tmp_dir): + """ + + :param data_dir: target dir(includes vocab file) + :param tmp_dir: original dir(includes training dataset filenames) + :return: + """ + data_set = [["", self.training_dataset_filenames]] + source_vocab = get_or_generate_vocab( + data_dir, + tmp_dir, + self.vocab_name, + self.vocab_size, + data_set, + file_byte_budget=1e8) + source_vocab.store_to_file(os.path.join(data_dir, self.vocab_name)) + + +class SubwordVocabProblem(object): + """subword input""" + + def __init__(self, + source_vocab_size=8000, + target_vocab_size=8000, + source_train_filenames="train.src", + target_train_filenames="train.tgt", + source_dev_filenames="dev.src", + target_dev_filenames="dev.tgt", + one_vocab=False): + """ + + :param source_vocab_size: + :param target_vocab_size: + :param source_train_filenames: + :param target_train_filenames: + :param source_dev_filenames: + :param target_dev_filenames: + """ + self.source_vocab_size = source_vocab_size + self.target_vocab_size = target_vocab_size + self.source_vocab_name = "vocab.source.%s" % self.source_vocab_size + self.target_vocab_name = "vocab.target.%s" % self.target_vocab_size + if not isinstance(source_train_filenames, list): + source_train_filenames = [source_train_filenames] + if not isinstance(target_train_filenames, list): + target_train_filenames = [target_train_filenames] + if not isinstance(source_dev_filenames, list): + source_dev_filenames = [source_dev_filenames] + if not isinstance(target_dev_filenames, list): + target_dev_filenames = [target_dev_filenames] + self.source_train_filenames = source_train_filenames + self.target_train_filenames = target_train_filenames + self.source_dev_filenames = source_dev_filenames + self.target_dev_filenames = target_dev_filenames + self.one_vocab = one_vocab + + def generate_data(self, data_dir, tmp_dir, is_train=True): + """ + + :param data_dir: + :param tmp_dir: + :return: + """ + self.source_train_ds = [["", self.source_train_filenames]] + self.target_train_ds = [["", self.target_train_filenames]] + logging.info("building source 
vocab ...") + logging.info(self.one_vocab) + if not self.one_vocab: + source_vocab = get_or_generate_vocab(data_dir, tmp_dir, + self.source_vocab_name, + self.source_vocab_size, + self.source_train_ds, + file_byte_budget=1e8) + logging.info("building target vocab ...") + target_vocab = get_or_generate_vocab(data_dir, tmp_dir, + self.target_vocab_name, + self.target_vocab_size, + self.target_train_ds, + file_byte_budget=1e8) + else: + train_ds = [["", self.source_train_filenames + self.target_train_filenames]] + source_vocab = get_or_generate_vocab(data_dir, tmp_dir, + self.source_vocab_name, + self.source_vocab_size, + train_ds, + file_byte_budget=1e8) + target_vocab = source_vocab + target_vocab.store_to_file(os.path.join(data_dir, self.target_vocab_name)) + pair_filenames = [self.source_train_filenames, self.target_train_filenames] + if not is_train: + pair_filenames = [self.source_dev_filenames, self.target_dev_filenames] + self.compile_data(tmp_dir, pair_filenames, is_train) + source_fname = "train.lang1" if is_train else "dev.lang1" + target_fname = "train.lang2" if is_train else "dev.lang2" + source_fname = os.path.join(tmp_dir, source_fname) + target_fname = os.path.join(tmp_dir, target_fname) + return txt2txt_encoder(txt2txt_generator(source_fname, target_fname), + source_vocab, + target_vocab) + + def compile_data(self, tmp_dir, pair_filenames, is_train=True): + """ + combine the input files + :param tmp_dir: + :param pair_filenames: + :param is_train: + :return: + """ + filename = "train.lang1" if is_train else "dev.lang1" + out_file_1 = open(os.path.join(tmp_dir, filename), "w") + filename = "train.lang2" if is_train else "dev.lang2" + out_file_2 = open(os.path.join(tmp_dir, filename), "w") + for file1, file2 in zip(pair_filenames[0], pair_filenames[1]): + for line in txt_line_iterator(os.path.join(tmp_dir, file1)): + out_file_1.write(line + "\n") + for line in txt_line_iterator(os.path.join(tmp_dir, file2)): + out_file_2.write(line + "\n") + out_file_2.close() + out_file_1.close() + + +class TokenVocabProblem(object): + """token input""" + + def __init__(self, + source_vocab_size=8000, + target_vocab_size=8000, + source_train_filenames="train.src", + target_train_filenames="train.tgt", + source_dev_filenames="dev.src", + target_dev_filenames="dev.tgt", + one_vocab=False): + """ + + :param source_vocab_size: + :param target_vocab_size: + :param source_train_filenames: + :param target_train_filenames: + :param source_dev_filenames: + :param target_dev_filenames: + """ + self.source_vocab_size = source_vocab_size + self.target_vocab_size = target_vocab_size + self.source_vocab_name = "vocab.source.%s" % self.source_vocab_size + self.target_vocab_name = "vocab.target.%s" % self.target_vocab_size + if not isinstance(source_train_filenames, list): + source_train_filenames = [source_train_filenames] + if not isinstance(target_train_filenames, list): + target_train_filenames = [target_train_filenames] + if not isinstance(source_dev_filenames, list): + source_dev_filenames = [source_dev_filenames] + if not isinstance(target_dev_filenames, list): + target_dev_filenames = [target_dev_filenames] + self.source_train_filenames = source_train_filenames + self.target_train_filenames = target_train_filenames + self.source_dev_filenames = source_dev_filenames + self.target_dev_filenames = target_dev_filenames + self.one_vocab = one_vocab + + + def add_exsits_vocab(self, filename): + """ + :param filename + """ + token_list = [] + with open(filename) as f: + for line in f: + line = line.strip() + 
token_list.append(line) + token_list.append("UNK") + return token_list + + + def generate_data(self, data_dir, tmp_dir, is_train=True): + """ + + :param data_dir: + :param tmp_dir: + :return: + """ + self.source_train_ds = [["", self.source_train_filenames]] + self.target_train_ds = [["", self.target_train_filenames]] + + pair_filenames = [self.source_train_filenames, self.target_train_filenames] + if not is_train: + pair_filenames = [self.source_dev_filenames, self.target_dev_filenames] + self.compile_data(tmp_dir, pair_filenames, is_train) + source_fname = "train.lang1" if is_train else "dev.lang1" + target_fname = "train.lang2" if is_train else "dev.lang2" + source_fname = os.path.join(tmp_dir, source_fname) + target_fname = os.path.join(tmp_dir, target_fname) + if is_train: + source_vocab_path = os.path.join(data_dir, self.source_vocab_name) + target_vocab_path = os.path.join(data_dir, self.target_vocab_name) + if not self.one_vocab: + if os.path.exists(source_vocab_path) and os.path.exists(target_vocab_path): + logging.info("found source vocab ...") + source_vocab = TokenTextEncoder(None, vocab_list=self.add_exsits_vocab(source_vocab_path)) + + logging.info("found target vocab ...") + target_vocab = TokenTextEncoder(None, vocab_list=self.add_exsits_vocab(target_vocab_path)) + else: + logging.info("building source vocab ...") + source_vocab = TokenTextEncoder.build_from_corpus(source_fname, + self.source_vocab_size) + os.makedirs(data_dir) + logging.info("building target vocab ...") + target_vocab = TokenTextEncoder.build_from_corpus(target_fname, + self.target_vocab_size) + else: + if os.path.exists(source_vocab_path): + logging.info("found source vocab ...") + source_vocab = TokenTextEncoder(None, vocab_list=self.add_exsits_vocab(source_vocab_path)) + else: + source_vocab = TokenTextEncoder.build_from_corpus([source_fname, target_fname], + self.source_vocab_size) + logging.info("building target vocab ...") + target_vocab = source_vocab + + source_vocab.store_to_file(source_vocab_path) + target_vocab.store_to_file(target_vocab_path) + else: + source_vocab = TokenTextEncoder(os.path.join(data_dir, self.source_vocab_name)) + target_vocab = TokenTextEncoder(os.path.join(data_dir, self.target_vocab_name)) + + return txt2txt_encoder(txt2txt_generator(source_fname, target_fname), + source_vocab, + target_vocab) + + def compile_data(self, tmp_dir, pair_filenames, is_train=True): + """ + combine the input files + :param tmp_dir: + :param pair_filenames: + :param is_train: + :return: + """ + filename = "train.lang1" if is_train else "dev.lang1" + out_file_1 = open(os.path.join(tmp_dir, filename), "w") + filename = "train.lang2" if is_train else "dev.lang2" + out_file_2 = open(os.path.join(tmp_dir, filename), "w") + for file1, file2 in zip(pair_filenames[0], pair_filenames[1]): + for line in txt_line_iterator(os.path.join(tmp_dir, file1)): + out_file_1.write(line + "\n") + for line in txt_line_iterator(os.path.join(tmp_dir, file2)): + out_file_2.write(line + "\n") + out_file_2.close() + out_file_1.close() + + +if __name__ == "__main__": + gen_sub = GenSubword().generate_data("train_data", "../asr/") diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/subword_decode.py b/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/subword_decode.py new file mode 100644 index 0000000000000000000000000000000000000000..2cc77647fdb25f306cabd6159895b9f6423fa4bd --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/subword_decode.py @@ -0,0 +1,214 @@ +# Copyright (c) 2019 PaddlePaddle 
Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import sys +import logging +import argparse +from text_encoder import SubwordTextEncoder +from text_encoder import EOS_ID + + +def get_or_generate_vocab(data_dir, tmp_dir, vocab_filename, vocab_size, + sources, file_byte_budget=1e6): + """Generate a vocabulary from the datasets in sources.""" + + def generate(): + """Generate lines for vocabulary generation.""" + logging.info("Generating vocab from: %s", str(sources)) + for source in sources: + for lang_file in source[1]: + logging.info("Reading file: %s" % lang_file) + + filepath = os.path.join(tmp_dir, lang_file) + with open(filepath, mode="r") as source_file: + file_byte_budget_ = file_byte_budget + counter = 0 + countermax = int(os.path.getsize(filepath) / file_byte_budget_ / 2) + logging.info("countermax: %d" % countermax) + for line in source_file: + if counter < countermax: + counter += 1 + else: + if file_byte_budget_ <= 0: + break + line = line.strip() + file_byte_budget_ -= len(line) + counter = 0 + yield line + + return get_or_generate_vocab_inner(data_dir, vocab_filename, vocab_size, + generate()) + + +def get_or_generate_vocab_inner(data_dir, vocab_filename, vocab_size, + generator, max_subtoken_length=None, + reserved_tokens=None): + """Inner implementation for vocab generators. + + Args: + data_dir: The base directory where data and vocab files are stored. If None, + then do not save the vocab even if it doesn't exist. + vocab_filename: relative filename where vocab file is stored + vocab_size: target size of the vocabulary constructed by SubwordTextEncoder + generator: a generator that produces tokens from the vocabulary + max_subtoken_length: an optional integer. Set this to a finite value to + avoid quadratic costs during vocab building. + reserved_tokens: List of reserved tokens. `text_encoder.RESERVED_TOKENS` + should be a prefix of `reserved_tokens`. If `None`, defaults to + `RESERVED_TOKENS`. + + Returns: + A SubwordTextEncoder vocabulary object. 
+ """ + if data_dir and vocab_filename: + vocab_filepath = os.path.join(data_dir, vocab_filename) + if os.path.exists(vocab_filepath): + logging.info("Found vocab file: %s", vocab_filepath) + return SubwordTextEncoder(vocab_filepath) + else: + vocab_filepath = None + + logging.info("Generating vocab file: %s", vocab_filepath) + vocab = SubwordTextEncoder.build_from_generator( + generator, vocab_size, max_subtoken_length=max_subtoken_length, + reserved_tokens=reserved_tokens) + + if vocab_filepath: + if not os.path.exists(data_dir): + os.makedirs(data_dir) + vocab.store_to_file(vocab_filepath) + + return vocab + + +def txt_line_iterator(fname): + """ + generator for line + :param fname: + :return: + """ + with open(fname, 'r') as f: + for line in f: + yield line.strip() + + +def txt2txt_generator(source_fname, target_fname): + """ + + :param source_fname: + :param target_fname: + :return: + """ + for source, target in zip( + txt_line_iterator(source_fname), + txt_line_iterator(target_fname) + ): + yield {"inputs": source, "targets": target} + + +def txt2txt_encoder(sample_generator, vocab, target_vocab=None): + """ + + :param sample_generator: + :param vocab: + :param target_vocab: + :return: + """ + target_vocab = target_vocab or vocab + for sample in sample_generator: + sample["inputs"] = vocab.encode(sample["inputs"]) + sample["inputs"].append(EOS_ID) + sample["targets"] = target_vocab.encode(sample["targets"]) + sample["targets"].append(EOS_ID) + yield sample + + +def txt_encoder(filename, batch_size=1, vocab=None): + """ + + :param sample_generator: + :param vocab: + :return: + """ + def pad_mini_batch(batch): + """ + + :param batch: + :return: + """ + lens = map(lambda x: len(x), batch) + max_len = max(lens) + for i in range(len(batch)): + batch[i] = batch[i] + [0] * (max_len - lens[i]) + return batch + + fp = open(filename, 'r') + samples = [] + batches = [] + ct = 0 + for sample in fp: + sample = sample.strip() + + if vocab: + sample = vocab.encode(sample) + else: + sample = [int(s) for s in sample] + #sample.append(EOS_ID) + batches.append(sample) + ct += 1 + if ct % batch_size == 0: + batches = pad_mini_batch(batches) + samples.extend(batches) + batches = [] + if ct % batch_size != 0: + batches += [batches[-1]] * (batch_size - ct % batch_size) + batches = pad_mini_batch(batches) + samples.extend(batches) + return samples + +if __name__ == "__main__": + parser = argparse.ArgumentParser("Tips for generating testset") + parser.add_argument( + "--vocab", + type=str, + required=True, + help="The path of source vocab.") + + parser.add_argument( + "--input", + type=str, + required=True, + help="The path of testset.") + + parser.add_argument( + "--output", + type=str, + required=True, + help="The path of result.") + + args = parser.parse_args() + subword = SubwordTextEncoder(args.vocab) + + samples = [] + with open(args.input, 'r') as f: + for line in f: + line = line.strip() + ids_list = [int(num) for num in line.split(" ")] + samples.append(ids_list) + + with open(args.output, 'w') as f: + for sample in samples: + ret = subword.decode(sample) + f.write("%s\n" % ret) diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/text_encoder.py b/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/text_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..590c02a0596a3da250023d1a790120e951d0b646 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/text_encoder.py @@ -0,0 +1,926 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import six +import os +import re +import logging +from tokenizer import encode as tokenizer_encode +from tokenizer import decode as tokenizer_decode +from itertools import chain +import collections + +PAD = "" +EOS = "" +RESERVED_TOKENS = [PAD, EOS] +NUM_RESERVED_TOKENS = len(RESERVED_TOKENS) +PAD_ID = RESERVED_TOKENS.index(PAD) # Normally 0 +EOS_ID = RESERVED_TOKENS.index(EOS) # Normally 1 + +if six.PY2: + RESERVED_TOKENS_BYTES = RESERVED_TOKENS +else: + RESERVED_TOKENS_BYTES = [bytes(PAD, "ascii"), bytes(EOS, "ascii")] + +# Regular expression for unescaping token strings. +# '\u' is converted to '_' +# '\\' is converted to '\' +# '\213;' is converted to unichr(213) +_UNESCAPE_REGEX = re.compile(r"\\u|\\\\|\\([0-9]+);") +_ESCAPE_CHARS = set(u"\\_u;0123456789") + + +def strip_ids(ids, ids_to_strip): + """Strip ids_to_strip from the end ids.""" + ids = list(ids) + while ids[-1] in ids_to_strip: + ids.pop() + return ids + + +def native_to_unicode(s): + """ + + :param s: + :return: + """ + return s if is_unicode(s) else to_unicode(s) + + +def is_unicode(s): + """ + + :param s: + :return: + """ + if six.PY2: + if isinstance(s, unicode): + return True + else: + if isinstance(s, str): + return True + return False + + +def unicode_to_native(s): + """ + + :param s: + :return: + """ + if six.PY2: + return s.encode("utf-8") if is_unicode(s) else s + else: + return s + + +def to_unicode(s, ignore_errors=False): + """ + + :param s: + :param ignore_errors: + :return: + """ + if is_unicode(s): + """ + + """ + return s + error_mode = "ignore" if ignore_errors else "strict" + return s.decode("utf-8", errors=error_mode) + + +def _escape_token(token, alphabet): + """Escape away underscores and OOV characters and append '_'. + + This allows the token to be expressed as the concatenation of a list + of subtokens from the vocabulary. The underscore acts as a sentinel + which allows us to invertibly concatenate multiple such lists. + + Args: + token: A unicode string to be escaped. + alphabet: A set of all characters in the vocabulary's alphabet. + + Returns: + escaped_token: An escaped unicode string. + + Raises: + ValueError: If the provided token is not unicode. + """ + if not isinstance(token, six.text_type): + raise ValueError("Expected string type for token, got %s" % type(token)) + + token = token.replace(u"\\", u"\\\\").replace(u"_", u"\\u") + ret = [c if c in alphabet and c != u"\n" else r"\%d;" % ord(c) for c in token] + return u"".join(ret) + "_" + + +def _unescape_token(escaped_token): + """Inverse of _escape_token(). + + Args: + escaped_token: a unicode string + + Returns: + token: a unicode string + """ + + def match(m): + """ + + :param m: + :return: + """ + if m.group(1) is None: + return u"_" if m.group(0) == u"\\u" else u"\\" + + try: + return six.unichr(int(m.group(1))) + except (ValueError, OverflowError) as _: + return u"\u3013" # Unicode for undefined character. 
+ + trimmed = escaped_token[:-1] if escaped_token.endswith("_") else escaped_token + return _UNESCAPE_REGEX.sub(match, trimmed) + + +class TextEncoder(object): + """Base class for converting from ints to/from human readable strings.""" + + def __init__(self, num_reserved_ids=NUM_RESERVED_TOKENS): + self._num_reserved_ids = num_reserved_ids + + @property + def num_reserved_ids(self): + """ + + :return: + """ + return self._num_reserved_ids + + def encode(self, s): + """Transform a human-readable string into a sequence of int ids. + + The ids should be in the range [num_reserved_ids, vocab_size). Ids [0, + num_reserved_ids) are reserved. + + EOS is not appended. + + Args: + s: human-readable string to be converted. + + Returns: + ids: list of integers + """ + return [int(w) + self._num_reserved_ids for w in s.split()] + + def decode(self, ids, strip_extraneous=False): + """Transform a sequence of int ids into a human-readable string. + + EOS is not expected in ids. + + Args: + ids: list of integers to be converted. + strip_extraneous: bool, whether to strip off extraneous tokens + (EOS and PAD). + + Returns: + s: human-readable string. + """ + if strip_extraneous: + ids = strip_ids(ids, list(range(self._num_reserved_ids or 0))) + return " ".join(self.decode_list(ids)) + + def decode_list(self, ids): + """Transform a sequence of int ids into a their string versions. + + This method supports transforming individual input/output ids to their + string versions so that sequence to/from text conversions can be visualized + in a human readable format. + + Args: + ids: list of integers to be converted. + + Returns: + strs: list of human-readable string. + """ + decoded_ids = [] + for id_ in ids: + if 0 <= id_ < self._num_reserved_ids: + decoded_ids.append(RESERVED_TOKENS[int(id_)]) + else: + decoded_ids.append(id_ - self._num_reserved_ids) + return [str(d) for d in decoded_ids] + + @property + def vocab_size(self): + """ + + :return: + """ + raise NotImplementedError() + + +class SubwordTextEncoder(TextEncoder): + """Class for invertibly encoding text using a limited vocabulary. + + Invertibly encodes a native string as a sequence of subtokens from a limited + vocabulary. + + A SubwordTextEncoder is built from a corpus (so it is tailored to the text in + the corpus), and stored to a file. See text_encoder_build_subword.py. + + It can then be loaded and used to encode/decode any text. + + Encoding has four phases: + + 1. Tokenize into a list of tokens. Each token is a unicode string of either + all alphanumeric characters or all non-alphanumeric characters. We drop + tokens consisting of a single space that are between two alphanumeric + tokens. + + 2. Escape each token. This escapes away special and out-of-vocabulary + characters, and makes sure that each token ends with an underscore, and + has no other underscores. + + 3. Represent each escaped token as a the concatenation of a list of subtokens + from the limited vocabulary. Subtoken selection is done greedily from + beginning to end. That is, we construct the list in order, always picking + the longest subtoken in our vocabulary that matches a prefix of the + remaining portion of the encoded token. + + 4. Concatenate these lists. This concatenation is invertible due to the + fact that the trailing underscores indicate when one list is finished. + + """ + + def __init__(self, filename=None): + """Initialize and read from a file, if provided. + + Args: + filename: filename from which to read vocab. 
If None, do not load a + vocab + """ + self._alphabet = set() + self.filename = filename + if filename is not None: + self._load_from_file(filename) + super(SubwordTextEncoder, self).__init__(num_reserved_ids=None) + + def encode(self, s): + """Converts a native string to a list of subtoken ids. + + Args: + s: a native string. + Returns: + a list of integers in the range [0, vocab_size) + """ + return self._tokens_to_subtoken_ids( + tokenizer_encode(native_to_unicode(s))) + + def encode_without_tokenizing(self, token_text): + """Converts string to list of subtoken ids without calling tokenizer. + + This treats `token_text` as a single token and directly converts it + to subtoken ids. This may be useful when the default tokenizer doesn't + do what we want (e.g., when encoding text with tokens composed of lots of + nonalphanumeric characters). It is then up to the caller to make sure that + raw text is consistently converted into tokens. Only use this if you are + sure that `encode` doesn't suit your needs. + + Args: + token_text: A native string representation of a single token. + Returns: + A list of subword token ids; i.e., integers in the range [0, vocab_size). + """ + return self._tokens_to_subtoken_ids([native_to_unicode(token_text)]) + + def decode(self, ids, strip_extraneous=False): + """Converts a sequence of subtoken ids to a native string. + + Args: + ids: a list of integers in the range [0, vocab_size) + strip_extraneous: bool, whether to strip off extraneous tokens + (EOS and PAD). + + Returns: + a native string + """ + if strip_extraneous: + ids = strip_ids(ids, list(range(self._num_reserved_ids or 0))) + return unicode_to_native( + tokenizer_decode(self._subtoken_ids_to_tokens(ids))) + + def decode_list(self, ids): + """ + + :param ids: + :return: + """ + return [self._subtoken_id_to_subtoken_string(s) for s in ids] + + @property + def vocab_size(self): + """The subtoken vocabulary size.""" + return len(self._all_subtoken_strings) + + def _tokens_to_subtoken_ids(self, tokens): + """Converts a list of tokens to a list of subtoken ids. + + Args: + tokens: a list of strings. + Returns: + a list of integers in the range [0, vocab_size) + """ + ret = [] + for token in tokens: + ret.extend(self._token_to_subtoken_ids(token)) + return ret + + def _token_to_subtoken_ids(self, token): + """Converts token to a list of subtoken ids. + + Args: + token: a string. + Returns: + a list of integers in the range [0, vocab_size) + """ + cache_location = hash(token) % self._cache_size + cache_key, cache_value = self._cache[cache_location] + if cache_key == token: + return cache_value + ret = self._escaped_token_to_subtoken_ids( + _escape_token(token, self._alphabet)) + self._cache[cache_location] = (token, ret) + return ret + + def _subtoken_ids_to_tokens(self, subtokens): + """Converts a list of subtoken ids to a list of tokens. + + Args: + subtokens: a list of integers in the range [0, vocab_size) + Returns: + a list of strings. 
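+      Tokens are recovered by concatenating the subtoken strings, splitting
+      on the trailing '_' sentinels, and unescaping each piece.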
+ """ + concatenated = "".join( + [self._subtoken_id_to_subtoken_string(s) for s in subtokens]) + split = concatenated.split("_") + ret = [] + for t in split: + if t: + unescaped = _unescape_token(t + "_") + if unescaped: + ret.append(unescaped) + return ret + + def _subtoken_id_to_subtoken_string(self, subtoken): + """Converts a subtoken integer ID to a subtoken string.""" + if 0 <= subtoken < self.vocab_size: + return self._all_subtoken_strings[subtoken] + return u"" + + def _escaped_token_to_subtoken_strings(self, escaped_token): + """Converts an escaped token string to a list of subtoken strings. + + Args: + escaped_token: An escaped token as a unicode string. + Returns: + A list of subtokens as unicode strings. + """ + # NOTE: This algorithm is greedy; it won't necessarily produce the "best" + # list of subtokens. + ret = [] + start = 0 + token_len = len(escaped_token) + while start < token_len: + for end in range( + min(token_len, start + self._max_subtoken_len), start, -1): + subtoken = escaped_token[start:end] + if subtoken in self._subtoken_string_to_id: + ret.append(subtoken) + start = end + break + + else: # Did not break + # If there is no possible encoding of the escaped token then one of the + # characters in the token is not in the alphabet. This should be + # impossible and would be indicative of a bug. + assert False, "Token substring not found in subtoken vocabulary." + + return ret + + def _escaped_token_to_subtoken_ids(self, escaped_token): + """Converts an escaped token string to a list of subtoken IDs. + + Args: + escaped_token: An escaped token as a unicode string. + Returns: + A list of subtoken IDs as integers. + """ + return [ + self._subtoken_string_to_id[subtoken] + for subtoken in self._escaped_token_to_subtoken_strings(escaped_token) + ] + + @classmethod + def build_from_generator(cls, + generator, + target_vocab_size, + max_subtoken_length=None, + reserved_tokens=None): + """Builds a SubwordTextEncoder from the generated text. + + Args: + generator: yields text. + target_vocab_size: int, approximate vocabulary size to create. + max_subtoken_length: Maximum length of a subtoken. If this is not set, + then the runtime and memory use of creating the vocab is quadratic in + the length of the longest token. If this is set, then it is instead + O(max_subtoken_length * length of longest token). + reserved_tokens: List of reserved tokens. The global variable + `RESERVED_TOKENS` must be a prefix of `reserved_tokens`. If this + argument is `None`, it will use `RESERVED_TOKENS`. + + Returns: + SubwordTextEncoder with `vocab_size` approximately `target_vocab_size`. + """ + token_counts = collections.defaultdict(int) + for item in generator: + for tok in tokenizer_encode(native_to_unicode(item)): + token_counts[tok] += 1 + encoder = cls.build_to_target_size( + target_vocab_size, token_counts, 1, 1e3, + max_subtoken_length=max_subtoken_length, + reserved_tokens=reserved_tokens) + return encoder + + @classmethod + def build_to_target_size(cls, + target_size, + token_counts, + min_val, + max_val, + max_subtoken_length=None, + reserved_tokens=None, + num_iterations=4): + """Builds a SubwordTextEncoder that has `vocab_size` near `target_size`. + + Uses simple recursive binary search to find a minimum token count that most + closely matches the `target_size`. + + Args: + target_size: Desired vocab_size to approximate. + token_counts: A dictionary of token counts, mapping string to int. + min_val: An integer; lower bound for the minimum token count. 
+ max_val: An integer; upper bound for the minimum token count. + max_subtoken_length: Maximum length of a subtoken. If this is not set, + then the runtime and memory use of creating the vocab is quadratic in + the length of the longest token. If this is set, then it is instead + O(max_subtoken_length * length of longest token). + reserved_tokens: List of reserved tokens. The global variable + `RESERVED_TOKENS` must be a prefix of `reserved_tokens`. If this + argument is `None`, it will use `RESERVED_TOKENS`. + num_iterations: An integer; how many iterations of refinement. + + Returns: + A SubwordTextEncoder instance. + + Raises: + ValueError: If `min_val` is greater than `max_val`. + """ + if min_val > max_val: + raise ValueError("Lower bound for the minimum token count " + "is greater than the upper bound.") + if target_size < 1: + raise ValueError("Target size must be positive.") + + if reserved_tokens is None: + reserved_tokens = RESERVED_TOKENS + + def bisect(min_val, max_val): + """Bisection to find the right size.""" + present_count = (max_val + min_val) // 2 + logging.info("Trying min_count %d" % present_count) + subtokenizer = cls() + subtokenizer.build_from_token_counts( + token_counts, present_count, num_iterations, + max_subtoken_length=max_subtoken_length, + reserved_tokens=reserved_tokens) + + # Being within 1% of the target size is ok. + is_ok = abs(subtokenizer.vocab_size - target_size) * 100 < target_size + # If min_val == max_val, we can't do any better than this. + if is_ok or min_val >= max_val or present_count < 2: + return subtokenizer + + if subtokenizer.vocab_size > target_size: + other_subtokenizer = bisect(present_count + 1, max_val) + else: + other_subtokenizer = bisect(min_val, present_count - 1) + + if other_subtokenizer is None: + return subtokenizer + + if (abs(other_subtokenizer.vocab_size - target_size) < + abs(subtokenizer.vocab_size - target_size)): + return other_subtokenizer + return subtokenizer + + return bisect(min_val, max_val) + + def build_from_token_counts(self, + token_counts, + min_count, + num_iterations=4, + reserved_tokens=None, + max_subtoken_length=None): + """Train a SubwordTextEncoder based on a dictionary of word counts. + + Args: + token_counts: a dictionary of Unicode strings to int. + min_count: an integer - discard subtokens with lower counts. + num_iterations: an integer. how many iterations of refinement. + reserved_tokens: List of reserved tokens. The global variable + `RESERVED_TOKENS` must be a prefix of `reserved_tokens`. If this + argument is `None`, it will use `RESERVED_TOKENS`. + max_subtoken_length: Maximum length of a subtoken. If this is not set, + then the runtime and memory use of creating the vocab is quadratic in + the length of the longest token. If this is set, then it is instead + O(max_subtoken_length * length of longest token). + + Raises: + ValueError: if reserved is not 0 or len(RESERVED_TOKENS). In this case, it + is not clear what the space is being reserved for, or when it will be + filled in. + """ + if reserved_tokens is None: + reserved_tokens = RESERVED_TOKENS + else: + # There is not complete freedom in replacing RESERVED_TOKENS. + for default, proposed in zip(RESERVED_TOKENS, reserved_tokens): + if default != proposed: + raise ValueError("RESERVED_TOKENS must be a prefix of " + "reserved_tokens.") + + # Initialize the alphabet. Note, this must include reserved tokens or it can + # result in encoding failures. 
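+    # (The escape characters themselves are added later, in
+    # _init_alphabet_from_tokens, so only corpus characters and reserved-token
+    # characters need to be gathered here.)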
+ alphabet_tokens = chain(six.iterkeys(token_counts), + [native_to_unicode(t) for t in reserved_tokens]) + + self._init_alphabet_from_tokens(alphabet_tokens) + + # Bootstrap the initial list of subtokens with the characters from the + # alphabet plus the escaping characters. + self._init_subtokens_from_list(list(self._alphabet), + reserved_tokens=reserved_tokens) + + # We build iteratively. On each iteration, we segment all the words, + # then count the resulting potential subtokens, keeping the ones + # with high enough counts for our new vocabulary. + if min_count < 1: + min_count = 1 + for i in range(num_iterations): + logging.info("Iteration {0}".format(i)) + + # Collect all substrings of the encoded token that break along current + # subtoken boundaries. + subtoken_counts = collections.defaultdict(int) + for token, count in six.iteritems(token_counts): + escaped_token = _escape_token(token, self._alphabet) + subtokens = self._escaped_token_to_subtoken_strings(escaped_token) + start = 0 + for subtoken in subtokens: + last_position = len(escaped_token) + 1 + if max_subtoken_length is not None: + last_position = min(last_position, start + max_subtoken_length) + + for end in range(start + 1, last_position): + new_subtoken = escaped_token[start:end] + subtoken_counts[new_subtoken] += count + start += len(subtoken) + + # Array of sets of candidate subtoken strings, by length. + len_to_subtoken_strings = [] + for subtoken_string, count in six.iteritems(subtoken_counts): + lsub = len(subtoken_string) + if count >= min_count: + while len(len_to_subtoken_strings) <= lsub: + len_to_subtoken_strings.append(set()) + len_to_subtoken_strings[lsub].add(subtoken_string) + + # Consider the candidates longest to shortest, so that if we accept + # a longer subtoken string, we can decrement the counts of its prefixes. + new_subtoken_strings = [] + for lsub in range(len(len_to_subtoken_strings) - 1, 0, -1): + subtoken_strings = len_to_subtoken_strings[lsub] + for subtoken_string in subtoken_strings: + count = subtoken_counts[subtoken_string] + if count >= min_count: + # Exclude alphabet tokens here, as they must be included later, + # explicitly, regardless of count. + if subtoken_string not in self._alphabet: + new_subtoken_strings.append((count, subtoken_string)) + for l in range(1, lsub): + subtoken_counts[subtoken_string[:l]] -= count + + # Include the alphabet explicitly to guarantee all strings are encodable. + new_subtoken_strings.extend((subtoken_counts.get(a, 0), a) + for a in self._alphabet) + new_subtoken_strings.sort(reverse=True) + + # Reinitialize to the candidate vocabulary. + new_subtoken_strings = [subtoken for _, subtoken in new_subtoken_strings] + if reserved_tokens: + new_subtoken_strings = reserved_tokens + new_subtoken_strings + + self._init_subtokens_from_list(new_subtoken_strings) + logging.info("vocab_size = %d" % self.vocab_size) + + @property + def all_subtoken_strings(self): + """ + + :return: + """ + return tuple(self._all_subtoken_strings) + + def dump(self): + """Debugging dump of the current subtoken vocabulary.""" + subtoken_strings = [(i, s) + for s, i in six.iteritems(self._subtoken_string_to_id)] + print(u", ".join(u"{0} : '{1}'".format(i, s) + for i, s in sorted(subtoken_strings))) + + def _init_subtokens_from_list(self, subtoken_strings, reserved_tokens=None): + """Initialize token information from a list of subtoken strings. + + Args: + subtoken_strings: a list of subtokens + reserved_tokens: List of reserved tokens. 
We must have `reserved_tokens` + as None or the empty list, or else the global variable `RESERVED_TOKENS` + must be a prefix of `reserved_tokens`. + + Raises: + ValueError: if reserved is not 0 or len(RESERVED_TOKENS). In this case, it + is not clear what the space is being reserved for, or when it will be + filled in. + """ + if reserved_tokens is None: + reserved_tokens = [] + + if reserved_tokens: + self._all_subtoken_strings = reserved_tokens + subtoken_strings + else: + self._all_subtoken_strings = subtoken_strings + + # we remember the maximum length of any subtoken to avoid having to + # check arbitrarily long strings. + self._max_subtoken_len = max([len(s) for s in subtoken_strings]) + self._subtoken_string_to_id = { + s: i + len(reserved_tokens) + for i, s in enumerate(subtoken_strings) if s + } + # Initialize the cache to empty. + self._cache_size = 2 ** 20 + self._cache = [(None, None)] * self._cache_size + + def _init_alphabet_from_tokens(self, tokens): + """Initialize alphabet from an iterable of token or subtoken strings.""" + # Include all characters from all tokens in the alphabet to guarantee that + # any token can be encoded. Additionally, include all escaping characters. + self._alphabet = {c for token in tokens for c in token} + self._alphabet |= _ESCAPE_CHARS + + def _load_from_file_object(self, f): + """Load from a file object. + + Args: + f: File object to load vocabulary from + """ + subtoken_strings = [] + for line in f: + s = line.strip() + # Some vocab files wrap words in single quotes, but others don't + if ((s.startswith("'") and s.endswith("'")) or + (s.startswith("\"") and s.endswith("\""))): + s = s[1:-1] + subtoken_strings.append(native_to_unicode(s)) + self._init_subtokens_from_list(subtoken_strings) + self._init_alphabet_from_tokens(subtoken_strings) + + def _load_from_file(self, filename): + """Load from a vocab file.""" + if not os.path.exists(filename): + raise ValueError("File %s not found" % filename) + with open(filename, 'r') as f: + self._load_from_file_object(f) + + def store_to_file(self, filename, add_single_quotes=True): + """ + + :param filename: + :param add_single_quotes: + :return: + """ + with open(filename, "w") as f: + for subtoken_string in self._all_subtoken_strings: + if add_single_quotes: + f.write("'" + unicode_to_native(subtoken_string) + "'\n") + else: + f.write(unicode_to_native(subtoken_string) + "\n") + + +class TokenTextEncoder(TextEncoder): + """Encoder based on a user-supplied vocabulary (file or list).""" + + def __init__(self, + vocab_filename, + reverse=False, + vocab_list=None, + replace_oov="UNK", + num_reserved_ids=NUM_RESERVED_TOKENS): + """Initialize from a file or list, one token per line. + + Handling of reserved tokens works as follows: + - When initializing from a list, we add reserved tokens to the vocab. + - When initializing from a file, we do not add reserved tokens to the vocab. + - When saving vocab files, we save reserved tokens to the file. + + Args: + vocab_filename: If not None, the full filename to read vocab from. If this + is not None, then vocab_list should be None. + reverse: Boolean indicating if tokens should be reversed during encoding + and decoding. + vocab_list: If not None, a list of elements of the vocabulary. If this is + not None, then vocab_filename should be None. + replace_oov: If not None, every out-of-vocabulary token seen when + encoding will be replaced by this string (which must be in vocab). + num_reserved_ids: Number of IDs to save for reserved tokens like . 
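+      Example (illustrative):
+        encoder = TokenTextEncoder(None, vocab_list=["hello", "world"])
+        encoder.encode("hello world")  # -> [2, 3]; ids 0 and 1 are reserved
+        encoder.decode([2, 3])         # -> "hello world"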
+ """ + super(TokenTextEncoder, self).__init__(num_reserved_ids=num_reserved_ids) + self._reverse = reverse + self._replace_oov = replace_oov + if vocab_filename: + self._init_vocab_from_file(vocab_filename) + else: + assert vocab_list is not None + self._init_vocab_from_list(vocab_list) + + @classmethod + def build_from_corpus(cls, filenames, vocab_size): + """ + + :param filenames: + :param vocab_size: + :return: + """ + + def create_dictionary(names, lim=0): + """ + :param name: + :param lim: + :return: + """ + global_counter = collections.Counter() + for name in names: + fd = open(name) + for line in fd: + words = line.strip().split() + words = filter(lambda x: x != "-1", words) + global_counter.update(words) + if lim <= 2: + lim = len(global_counter) + 3 + vocab_count = global_counter.most_common(lim - 3) + total_counts = sum(global_counter.values()) + coverage = 100.0 * sum([count for word, count in vocab_count]) / total_counts + logging.info("coverage: %s" % coverage) + + vocab_table = ["", ""] + for i, (word, count) in enumerate(vocab_count): + vocab_table.append(word) + vocab_table.append("UNK") + return vocab_table + + if not isinstance(filenames, list): filenames = [filenames] + vocab = cls(None, + vocab_list=create_dictionary(filenames, vocab_size), + replace_oov="UNK") + return vocab + + def encode(self, s): + """Converts a space-separated string of tokens to a list of ids.""" + sentence = s + tokens = sentence.strip().split() + if self._replace_oov is not None: + tokens = [t if t in self._token_to_id else self._replace_oov + for t in tokens] + ret = [self._token_to_id[tok] for tok in tokens] + return ret[::-1] if self._reverse else ret + + def decode(self, ids, strip_extraneous=False): + """ + + :param ids: + :param strip_extraneous: + :return: + """ + return " ".join(self.decode_list(ids)) + + def decode_list(self, ids): + """ + + :param ids: + :return: + """ + seq = reversed(ids) if self._reverse else ids + return [self._safe_id_to_token(i) for i in seq] + + @property + def vocab_size(self): + """ + + :return: + """ + return len(self._id_to_token) + + def _safe_id_to_token(self, idx): + """ + + :param idx: + :return: + """ + return self._id_to_token.get(idx, "ID_%d" % idx) + + def _init_vocab_from_file(self, filename): + """Load vocab from a file. + + Args: + filename: The file to load vocabulary from. + """ + with open(filename, 'r') as f: + tokens = [token.strip() for token in f.readlines()] + + def token_gen(): + """token gen""" + for token in tokens: + yield token + + self._init_vocab(token_gen(), add_reserved_tokens=False) + + def _init_vocab_from_list(self, vocab_list): + """Initialize tokens from a list of tokens. + + It is ok if reserved tokens appear in the vocab list. They will be + removed. The set of tokens in vocab_list should be unique. + + Args: + vocab_list: A list of tokens. 
+ """ + + def token_gen(): + """token gen""" + for token in vocab_list: + if token not in RESERVED_TOKENS: + yield token + + self._init_vocab(token_gen()) + + def _init_vocab(self, token_generator, add_reserved_tokens=True): + """Initialize vocabulary with tokens from token_generator.""" + + self._id_to_token = {} + non_reserved_start_index = 0 + + if add_reserved_tokens: + self._id_to_token.update(enumerate(RESERVED_TOKENS)) + non_reserved_start_index = len(RESERVED_TOKENS) + + self._id_to_token.update( + enumerate(token_generator, start=non_reserved_start_index)) + + # _token_to_id is the reverse of _id_to_token + self._token_to_id = dict((v, k) + for k, v in six.iteritems(self._id_to_token)) + + def store_to_file(self, filename): + """Write vocab file to disk. + + Vocab files have one token per line. The file ends in a newline. Reserved + tokens are written to the vocab file as well. + + Args: + filename: Full path of the file to store the vocab to. + """ + with open(filename, "w") as f: + for i in range(len(self._id_to_token)): + f.write(self._id_to_token[i] + "\n") diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/tokenizer.py b/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/tokenizer.py new file mode 100644 index 0000000000000000000000000000000000000000..74f6fb39c1c4bd9255df6838f0ca05c74cfb3334 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/preprocess/tokenizer.py @@ -0,0 +1,166 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import collections +import sys +import glob +import unicodedata +import six +import logging +from six.moves import range # pylint: disable=redefined-builtin + +# Conversion between Unicode and UTF-8, if required (on Python2) +_native_to_unicode = (lambda s: s.decode("utf-8")) if six.PY2 else (lambda s: s) + +# This set contains all letter and number characters. +_ALPHANUMERIC_CHAR_SET = set( + six.unichr(i) for i in range(sys.maxunicode) + if (unicodedata.category(six.unichr(i)).startswith("L") or + unicodedata.category(six.unichr(i)).startswith("N"))) + + +def encode(text): + """Encode a unicode string as a list of tokens. + + Args: + text: a unicode string + Returns: + a list of tokens as Unicode strings + """ + if not text: + return [] + ret = [] + token_start = 0 + # Classify each character in the input string + is_alnum = [c in _ALPHANUMERIC_CHAR_SET for c in text] + for pos in range(1, len(text)): + if is_alnum[pos] != is_alnum[pos - 1]: + token = text[token_start:pos] + if token != u" " or token_start == 0: + ret.append(token) + token_start = pos + final_token = text[token_start:] + ret.append(final_token) + return ret + + +def decode(tokens): + """Decode a list of tokens to a unicode string. 
+ + Args: + tokens: a list of Unicode strings + Returns: + a unicode string + """ + token_is_alnum = [t[0] in _ALPHANUMERIC_CHAR_SET for t in tokens] + ret = [] + for i, token in enumerate(tokens): + if i > 0 and token_is_alnum[i - 1] and token_is_alnum[i]: + ret.append(u" ") + ret.append(token) + return "".join(ret) + + +def _read_filepattern(filepattern, max_lines=None, split_on_newlines=True): + """Reads files matching a wildcard pattern, yielding the contents. + + Args: + filepattern: A wildcard pattern matching one or more files. + max_lines: If set, stop reading after reading this many lines. + split_on_newlines: A boolean. If true, then split files by lines and strip + leading and trailing whitespace from each line. Otherwise, treat each + file as a single string. + + Yields: + The contents of the files as lines, if split_on_newlines is True, or + the entire contents of each file if False. + """ + filenames = sorted(glob.glob(filepattern)) + lines_read = 0 + for filename in filenames: + with open(filename, 'r') as f: + if split_on_newlines: + for line in f: + yield line.strip() + lines_read += 1 + if max_lines and lines_read >= max_lines: + return + + else: + if max_lines: + doc = [] + for line in f: + doc.append(line) + lines_read += 1 + if max_lines and lines_read >= max_lines: + yield "".join(doc) + return + yield "".join(doc) + + else: + yield f.read() + + +def corpus_token_counts( + text_filepattern, corpus_max_lines, split_on_newlines=True): + """Read the corpus and compute a dictionary of token counts. + + Args: + text_filepattern: A pattern matching one or more files. + corpus_max_lines: An integer; maximum total lines to read. + split_on_newlines: A boolean. If true, then split files by lines and strip + leading and trailing whitespace from each line. Otherwise, treat each + file as a single string. + + Returns: + a dictionary mapping token to count. + """ + counts = collections.Counter() + for doc in _read_filepattern( + text_filepattern, + max_lines=corpus_max_lines, + split_on_newlines=split_on_newlines): + counts.update(encode(_native_to_unicode(doc))) + + return counts + + +def vocab_token_counts(text_filepattern, max_lines): + """Read a vocab file and return a dictionary of token counts. + + Reads a two-column CSV file of tokens and their frequency in a dataset. The + tokens are presumed to be generated by encode() or the equivalent. + + Args: + text_filepattern: A pattern matching one or more files. + max_lines: An integer; maximum total lines to read. + + Returns: + a dictionary mapping token to count. + """ + ret = {} + for i, line in enumerate( + _read_filepattern(text_filepattern, max_lines=max_lines)): + if "," not in line: + logging.warning("Malformed vocab line #%d '%s'", i, line) + continue + + token, count = line.rsplit(",", 1) + ret[_native_to_unicode(token)] = int(count) + + return ret diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/reader.py b/PaddleNLP/Research/EMNLP2019-MAL/src/reader.py new file mode 100644 index 0000000000000000000000000000000000000000..3a443efdcbe8b27d440a4b8dd018281e8eabe362 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/reader.py @@ -0,0 +1,617 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import glob +import six +import os +import tarfile +import random + +import numpy as np + + +from collections import defaultdict + +def batching_scheme(batch_size, + max_length, + min_length_bucket=8, + length_bucket_step=1.1, + drop_long_sequences=False, + shard_multiplier=1, + length_multiplier=1, + min_length=0): + """A batching scheme based on model hyperparameters. + + Every batch containins a number of sequences divisible by `shard_multiplier`. + + Args: + batch_size: int, total number of tokens in a batch. + max_length: int, sequences longer than this will be skipped. Defaults to + batch_size. + min_length_bucket: int + length_bucket_step: float greater than 1.0 + drop_long_sequences: bool, if True, then sequences longer than + `max_length` are dropped. This prevents generating batches with + more than the usual number of tokens, which can cause out-of-memory + errors. + shard_multiplier: an integer increasing the batch_size to suit splitting + across datashards. + length_multiplier: an integer multiplier that is used to increase the + batch sizes and sequence length tolerance. + min_length: int, sequences shorter than this will be skipped. + + Returns: + A dictionary with parameters that can be passed to input_pipeline: + * boundaries: list of bucket boundaries + * batch_sizes: list of batch sizes for each length bucket + * max_length: int, maximum length of an example + + Raises: + ValueError: If min_length > max_length + """ + + def _bucket_boundaries(max_length, min_length=8, length_bucket_step=1.1): + assert length_bucket_step > 1.0 + x = min_length + boundaries = [] + while x < max_length: + boundaries.append(x) + x = max(x + 1, int(x * length_bucket_step)) + return boundaries + + max_length = max_length or batch_size + if max_length < min_length: + raise ValueError("max_length must be greater or equal to min_length") + + boundaries = _bucket_boundaries(max_length, min_length_bucket, + length_bucket_step) + boundaries = [boundary * length_multiplier for boundary in boundaries] + max_length *= length_multiplier + + batch_sizes = [ + max(1, batch_size // length) for length in boundaries + [max_length] + ] + max_batch_size = max(batch_sizes) + # Since the Datasets API only allows a single constant for window_size, + # and it needs divide all bucket_batch_sizes, we pick a highly-compoisite + # window size and then round down all batch sizes to divisors of that window + # size, so that a window can always be divided evenly into batches. + # TODO(noam): remove this when Dataset API improves. 
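+  # Illustrative example: with batch_size=4096 and min_length_bucket=8, the
+  # largest raw batch size is 4096 // 8 = 512; the window size is then the
+  # largest highly composite number <= 3 * 512 (i.e. 1260), and each raw batch
+  # size is rounded down to a divisor of 1260 (e.g. 512 -> 420).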
+ highly_composite_numbers = [ + 1, 2, 4, 6, 12, 24, 36, 48, 60, 120, 180, 240, 360, 720, 840, 1260, 1680, + 2520, 5040, 7560, 10080, 15120, 20160, 25200, 27720, 45360, 50400, 55440, + 83160, 110880, 166320, 221760, 277200, 332640, 498960, 554400, 665280, + 720720, 1081080, 1441440, 2162160, 2882880, 3603600, 4324320, 6486480, + 7207200, 8648640, 10810800, 14414400, 17297280, 21621600, 32432400, + 36756720, 43243200, 61261200, 73513440, 110270160 + ] + window_size = max( + [i for i in highly_composite_numbers if i <= 3 * max_batch_size]) + divisors = [i for i in xrange(1, window_size + 1) if window_size % i == 0] + batch_sizes = [max([d for d in divisors if d <= bs]) for bs in batch_sizes] + window_size *= shard_multiplier + batch_sizes = [bs * shard_multiplier for bs in batch_sizes] + # The Datasets API splits one window into multiple batches, which + # produces runs of many consecutive batches of the same size. This + # is bad for training. To solve this, we will shuffle the batches + # using a queue which must be several times as large as the maximum + # number of batches per window. + max_batches_per_window = window_size // min(batch_sizes) + shuffle_queue_size = max_batches_per_window * 3 + + ret = { + "boundaries": boundaries, + "batch_sizes": batch_sizes, + "min_length": min_length, + "max_length": (max_length if drop_long_sequences else 10 ** 9), + "shuffle_queue_size": shuffle_queue_size, + } + return ret + + +def bucket_by_sequence_length(data_reader, + example_length_fn, + bucket_boundaries, + bucket_batch_sizes, + trainer_nums, + trainer_id): + """Bucket entries in dataset by length. + + Args: + dataset: Dataset of dict. + example_length_fn: function from example to int, determines the length of + the example, which will determine the bucket it goes into. + bucket_boundaries: list, boundaries of the buckets. + bucket_batch_sizes: list, batch size per bucket. + + Returns: + Dataset of padded and batched examples. + """ + def example_to_bucket_id(example): + """ + get bucket_id + """ + seq_length = example_length_fn(example) + boundaries = list(bucket_boundaries) + buckets_min = [np.iinfo(np.int32).min] + boundaries + buckets_max = boundaries + [np.iinfo(np.int32).max] + for i in range(len(buckets_min)): + if buckets_min[i] <= seq_length and seq_length < buckets_max[i]: + bucket_id = i + return bucket_id + + def window_size_fn(bucket_id): + """ + get window size + """ + window_size = bucket_batch_sizes[bucket_id] + return window_size + + def group_by_window(reader, key_func, window_size_func, drop_last=False): + """ + group the line by length + """ + groups = defaultdict(list) + + def impl(): + """ + impl + """ + for e in reader(): + key = key_func(e) + window_size = window_size_func(key) + groups[key].append(e) + if len(groups[key]) == window_size: + each_size = window_size / trainer_nums + res = groups[key][trainer_id * each_size: (trainer_id + 1) * each_size] + yield res + groups[key] = [] + if drop_last: + groups.clear() + + return impl + + reader = group_by_window(data_reader, example_to_bucket_id, window_size_fn) + return reader + + +def shuffle(reader, buf_size): + """ + Creates a data reader whose data output is shuffled. + + Output from the iterator that created by original reader will be + buffered into shuffle buffer, and then shuffled. The size of shuffle buffer + is determined by argument buf_size. + + :param reader: the original reader whose output will be shuffled. + :type reader: callable + :param buf_size: shuffle buffer size. 
+ :type buf_size: int + + :return: the new reader whose output is shuffled. + :rtype: callable + """ + + def data_reader(): + """ + data_reader + """ + buf = [] + for e in reader(): + buf.append(e) + if len(buf) >= buf_size: + random.shuffle(buf) + for b in buf: + yield b + buf = [] + + if len(buf) > 0: + random.shuffle(buf) + for b in buf: + yield b + + return data_reader + + +def sort(reader, buf_size, cmp=None, key=None, reverse=False): + """ + Creates a data reader whose data output is sorted. + + Output from the iterator that created by original reader will be + buffered into sort buffer, and then sorted. The size of sort buffer + is determined by argument buf_size. + + :param reader: the original reader whose output will be sorted. + :type reader: callable + :param buf_size: shuffle buffer size. + :type buf_size: int + + :return: the new reader whose output is sorted. + :rtype: callable + """ + + def data_reader(): + """ + data_reader + """ + buf = [] + for e in reader(): + buf.append(e) + if len(buf) >= buf_size: + buf = sorted(buf, cmp, key, reverse) + for b in buf: + yield b + buf = [] + + if len(buf) > 0: + sorted(buf, cmp, key, reverse) + for b in buf: + yield b + + return data_reader + + +def batch_by_token(reader, batch_size, len_fun, drop_last=False): + """ + Create a batched reader. + + :param reader: the data reader to read from. + :type reader: callable + :param batch_size: size of each mini-batch + :type batch_size: int + :param drop_last: drop the last batch, if the size of last batch is not equal to batch_size. + :type drop_last: bool + :return: the batched reader. + :rtype: callable + """ + + def batch_reader(): + """ + batch_reader + """ + r = reader() + b = [] + max_len = 0 + for instance in r: + cur_len = len_fun(instance) + max_len = max(max_len, cur_len) + if max_len * (len(b) + 1) > batch_size: + yield b + b = [instance] + max_len = cur_len + else: + b.append(instance) + if drop_last == False and len(b) != 0: + yield b + + # Batch size check + batch_size = int(batch_size) + if batch_size <= 0: + raise ValueError("batch_size should be a positive integeral value, " + "but got batch_size={}".format(batch_size)) + + return batch_reader + + +def parse_line(line, max_len, min_len=0, field_delimiter="\t", token_delimiter=" "): + """ + parse training data + """ + src, trg = line.strip("\n").split(field_delimiter) + src_ids = [int(token) for token in src.split(token_delimiter)] + trg_ids = [int(token) for token in trg.split(token_delimiter)] + reverse_trg_ids = trg_ids[::-1] + reverse_trg_ids = reverse_trg_ids[1:] + reverse_trg_ids.append(1) + inst_max_len = max(len(src_ids), len(trg_ids)) + inst_min_len = min(len(src_ids), len(trg_ids)) + if inst_max_len <= max_len and inst_min_len > min_len: + return src_ids, [0] + trg_ids[:-1], trg_ids, [0] + reverse_trg_ids[:-1], reverse_trg_ids + else: + return None + + +def repeat(reader, count=-1): + """ + repeat + """ + def data_reader(): + """ + repeat data + """ + time = count + while time != 0: + for e in reader(): + yield e + time -= 1 + + return data_reader + + +def parse_src_line(line, max_len, min_len=0, token_delimiter=" "): + """ + parse infer data + """ + src = line.strip("\n") + src_ids = [int(token) for token in src.split(token_delimiter)] + inst_max_len = inst_min_len = len(src_ids) + if inst_max_len < max_len and inst_min_len > min_len: + src_ids.append(1) + return [src_ids] + else: + src_ids = src_ids[:max_len - 1] + src_ids.append(1) + return [src_ids] + + +def interleave_reader(fpattern, cycle_length, 
block_length=1, **kwargs): + """ + cycle reader + """ + # refer to: + # https://www.tensorflow.org/api_docs/python/tf/contrib/data/parallel_interleave?hl=zh_cn + # https://www.tensorflow.org/api_docs/python/tf/data/Dataset?hl=zh_cn#interleave + fpaths = glob.glob(fpattern) + fpaths = sorted(fpaths) + if 'parse_line' in kwargs: + + parse_line = kwargs.pop('parse_line') + + class Worker(object): # mimic a worker thread + """ + each worker wrap a file + """ + def __init__(self): + self.input = None + self.iter = None + + def set_input(self, input_arg): + """ + set file reader + """ + if self.iter is not None: + self.iter.close() + self.input = input_arg + self.iter = open(input_arg, 'rb') + + def get_next(self): + """ + get next data + """ + return next(self.iter) + + def data_reader(): + """ + generate data + """ + num_workers = cycle_length # + prefetched + workers = [] + # Indices in `workers` of iterators to interleave. + interleave_indices = [] + # Indices in `workers` of prefetched iterators. + staging_indices = [] + # EnsureWorkerThreadsStarted + for i in range(num_workers): + if i >= len(fpaths): + break + workers.append(Worker()) + workers[i].set_input(fpaths[i]) + if i < cycle_length: + interleave_indices.append(i) + else: + staging_indices.append(i) + input_index = len(workers) # index for files + next_index = 0 # index for worker + block_count = 0 # counter for the number of instances from one block + # + while True: # break while when all inputs end + can_produce_elements = False + # The for loop only fully runs when all workers ending. + # Otherwise, run one step then break the for loop, or + # find the first possible unended iterator by setting next_index + # or go to the step of loop. + for i in range(len(interleave_indices)): + index = (next_index + i) % len(interleave_indices) + current_worker_index = interleave_indices[index] + current_worker = workers[current_worker_index] + + try: + line = current_worker.get_next() + if six.PY3: + line = line.decode() + inst = parse_line(line, **kwargs) + if inst is not None: + yield inst + next_index = index + block_count += 1 + if block_count == block_length: + # advance to the next iterator + next_index = (index + 1) % len(interleave_indices) + block_count = 0 + can_produce_elements = True + break + except (StopIteration,): # This iterator has reached the end. 
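+                    # The current worker's file is exhausted: hand it the next
+                    # unread file (if any), move it to the staging list, and
+                    # promote a staged worker into this slot of the interleave
+                    # rotation.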
+ if input_index < len(fpaths): # get a new iterator and skip + current_worker.set_input(fpaths[input_index]) + staging_indices.append(current_worker_index) + if len(staging_indices) > 0: # pop_front + interleave_indices[index] = staging_indices[0] + staging_indices = staging_indices[1:] + + input_index += 1 + # advance to the next iterator + next_index = (index + 1) % len(interleave_indices) + block_count = 0 + can_produce_elements = True + break + # else: advance to the next iterator by loop step + + if not can_produce_elements: + # all inputs end, triggered when all iterators have reached the end + break + + return data_reader + + +def line_reader(fpattern, batch_size, dev_count, **kwargs): + """ + cycle reader + """ + + fpaths = glob.glob(fpattern) + #np.random.shuffle(fpaths) + #random.shuffle(fpaths) + if "parse_line" in kwargs: + parse_line = kwargs.pop('parse_line') + + def data_reader(): + """ + data_reader + """ + res = [] + total_size = batch_size * dev_count + for fpath in fpaths: + if not os.path.isfile(fpath): + raise IOError("Invalid file: %s" % fpath) + with open(fpath, "rb") as f: + for line in f: + if six.PY3: + line = line.decode() + inst = parse_line(line, **kwargs) + res.append(inst) + if len(res) == total_size: + yield res + res = [] + if len(res) > 0: + pad_count = total_size - len(res) + for index in xrange(pad_count): + res.append(res[-1]) + yield res + + return data_reader + + +def prepare_data_generator(args, is_test, count, pyreader, batch_size=None, + data_reader=None, py_reader_provider_wrapper=None): + """ + Data generator wrapper for DataReader. If use py_reader, set the data + provider for py_reader + """ + def stack(data_reader, count, clip_last=True): + """ + Data generator for multi-devices + """ + def __impl__(): + res = [] + for item in data_reader(): + res.append(item) + if len(res) == count: + yield res + res = [] + if len(res) == count: + yield res + elif not clip_last: + data = [] + for item in res: + data += item + if len(data) > count: + inst_num_per_part = len(data) // count + yield [ + data[inst_num_per_part * i:inst_num_per_part * (i + 1)] + for i in range(count) + ] + + return __impl__ + + def split(data_reader, count): + """ + split for multi-gpu + """ + def __impl__(): + for item in data_reader(): + inst_num_per_part = len(item) // count + for i in range(count): + yield item[inst_num_per_part * i:inst_num_per_part * (i + 1 + )] + + return __impl__ + + if not args.use_token_batch: + # to make data on each device have similar token number + data_reader = split(data_reader, count) + #if args.use_py_reader: + if pyreader: + pyreader.decorate_tensor_provider( + py_reader_provider_wrapper(data_reader)) + data_reader = None + else: # Data generator for multi-devices + data_reader = stack(data_reader, count) + return data_reader + + +def pad_batch_data(insts, + pad_idx, + n_head, + is_target=False, + is_label=False, + return_attn_bias=True, + return_max_len=True, + return_num_token=False): + """ + Pad the instances to the max sequence length in batch, and generate the + corresponding position data and attention bias. + """ + return_list = [] + max_len = max(len(inst) for inst in insts) + # Any token included in dict can be used to pad, since the paddings' loss + # will be masked out by weights and make no effect on parameter gradients. + inst_data = np.array( + [inst + [pad_idx] * (max_len - len(inst)) for inst in insts]) + return_list += [inst_data.astype("int64").reshape([-1, 1])] + if is_label: # label weight + inst_weight = np.array( + [[1.] 
* len(inst) + [0.] * (max_len - len(inst)) for inst in insts]) + return_list += [inst_weight.astype("float32").reshape([-1, 1])] + else: # position data + inst_pos = np.array([ + list(range(0, len(inst))) + [0] * (max_len - len(inst)) + for inst in insts + ]) + return_list += [inst_pos.astype("int64").reshape([-1, 1])] + if return_attn_bias: + if is_target: + # This is used to avoid attention on paddings and subsequent + # words. + slf_attn_bias_data = np.ones((inst_data.shape[0], max_len, max_len)) + slf_attn_bias_data = np.triu(slf_attn_bias_data, + 1).reshape([-1, 1, max_len, max_len]) + slf_attn_bias_data = np.tile(slf_attn_bias_data, + [1, n_head, 1, 1]) * [-1e9] + else: + # This is used to avoid attention on paddings. + slf_attn_bias_data = np.array([[0] * len(inst) + [-1e9] * + (max_len - len(inst)) + for inst in insts]) + slf_attn_bias_data = np.tile( + slf_attn_bias_data.reshape([-1, 1, 1, max_len]), + [1, n_head, max_len, 1]) + return_list += [slf_attn_bias_data.astype("float32")] + if return_max_len: + return_list += [max_len] + if return_num_token: + num_token = 0 + for inst in insts: + num_token += len(inst) + return_list += [num_token] + return return_list if len(return_list) > 1 else return_list[0] diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/relative_model.py b/PaddleNLP/Research/EMNLP2019-MAL/src/relative_model.py new file mode 100644 index 0000000000000000000000000000000000000000..24a6fa53409a483702d343c932d275e49f70147a --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/relative_model.py @@ -0,0 +1,954 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from functools import partial +import numpy as np + +import paddle.fluid as fluid +import paddle.fluid.layers as layers +from paddle.fluid.layer_helper import LayerHelper as LayerHelper + +from config import * +from beam_search import BeamSearch +from attention import _dot_product_relative + +INF = 1. 
* 1e5 + +def layer_norm(x, begin_norm_axis=1, epsilon=1e-6, param_attr=None, bias_attr=None): + """ + layer_norm + """ + helper = LayerHelper('layer_norm', **locals()) + mean = layers.reduce_mean(x, dim=begin_norm_axis, keep_dim=True) + shift_x = layers.elementwise_sub(x=x, y=mean, axis=0) + variance = layers.reduce_mean(layers.square(shift_x), dim=begin_norm_axis, keep_dim=True) + r_stdev = layers.rsqrt(variance + epsilon) + norm_x = layers.elementwise_mul(x=shift_x, y=r_stdev, axis=0) + + param_shape = [reduce(lambda x, y: x * y, norm_x.shape[begin_norm_axis:])] + param_dtype = norm_x.dtype + scale = helper.create_parameter( + attr=param_attr, + shape=param_shape, + dtype=param_dtype, + default_initializer=fluid.initializer.Constant(1.)) + bias = helper.create_parameter( + attr=bias_attr, + shape=param_shape, + dtype=param_dtype, + is_bias=True, + default_initializer=fluid.initializer.Constant(0.)) + + out = layers.elementwise_mul(x=norm_x, y=scale, axis=-1) + out = layers.elementwise_add(x=out, y=bias, axis=-1) + + return out#norm_x * scale + bias + +def relative_position_encoding_init(n_position, d_pos_vec): + """ + Generate the initial values for the sinusoid position encoding table. + """ + channels = d_pos_vec + position = np.arange(n_position) + num_timescales = channels // 2 + log_timescale_increment = (np.log(float(1e4) / float(1)) / + (num_timescales - 1)) + inv_timescales = np.exp(np.arange( + num_timescales) * -log_timescale_increment) + #num_timescales)) * -log_timescale_increment + scaled_time = np.expand_dims(position, 1) * np.expand_dims(inv_timescales, + 0) + signal = np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1) + signal = np.pad(signal, [[0, 0], [0, np.mod(channels, 2)]], 'constant') + position_enc = signal + return position_enc.astype("float32") + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + attention_type="dot_product", + params_type = "normal"): + """ + Multi-Head Attention. Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError( + "Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. + """ + q = layers.fc(input=queries, + size=d_key * n_head, + bias_attr=False, + num_flatten_dims=2) + k = layers.fc(input=keys, + size=d_key * n_head, + bias_attr=False, + num_flatten_dims=2) + v = layers.fc(input=values, + size=d_value * n_head, + bias_attr=False, + num_flatten_dims=2) + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + if n_head == 1: + return x + + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. 
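+            # Here: [bs, seq_len, n_head * hidden] -> [bs, seq_len, n_head, hidden];
+            # the transpose below then moves n_head in front of seq_len.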
+ reshaped = layers.reshape( + x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape( + x=trans_x, + shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], + inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key ** -0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout( + weights, + dropout_prob=dropout_rate, + seed=ModelHyperParams.dropout_seed, + is_test=False, dropout_implementation='upscale_in_train') + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. + k = layers.concat([cache['k'], k], axis=1) + v = layers.concat([cache['v'], v], axis=1) + layers.assign(k, cache['k']) + layers.assign(v, cache['v']) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + assert attention_type == "dot_product" or attention_type == "dot_product_relative_encoder" or attention_type == "dot_product_relative_decoder" + if attention_type == "dot_product": + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, #d_model, + dropout_rate) + elif attention_type == "dot_product_relative_encoder": + q = layers.scale(x=q, scale=d_key ** -0.5) + ctx_multiheads = _dot_product_relative(q, k, v, attn_bias, dropout=dropout_rate, params_type = params_type) + else: + q = layers.scale(x=q, scale=d_key ** -0.5) + ctx_multiheads = _dot_product_relative(q, k, v, attn_bias, dropout=dropout_rate, cache = cache, params_type = params_type) + + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. + proj_out = layers.fc(input=out, + size=d_model, + bias_attr=False, + num_flatten_dims=2) + return proj_out + + +def positionwise_feed_forward(x, d_inner_hid, d_hid, dropout_rate): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. 
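+    Concretely, FFN(x) = relu(x * W1 + b1) * W2 + b2, where the inner layer
+    has d_inner_hid units and the output layer has d_hid units.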
+ """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act="relu") + if dropout_rate: + hidden = layers.dropout( + hidden, + dropout_prob=dropout_rate, + seed=ModelHyperParams.dropout_seed, + is_test=False, dropout_implementation='upscale_in_train') + out = layers.fc(input=hidden, size=d_hid, num_flatten_dims=2) + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0.): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. + """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out = layer_norm( + out, + begin_norm_axis=len(out.shape) - 1, + epsilon=1e-6, + param_attr=fluid.initializer.Constant(1.), + bias_attr=fluid.initializer.Constant(0.)) + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout( + out, + dropout_prob=dropout_rate, + seed=ModelHyperParams.dropout_seed, + is_test=False, dropout_implementation='upscale_in_train') + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def prepare_encoder_decoder(src_word, + src_pos, + src_vocab_size, + src_emb_dim, + src_max_len, + dropout_rate=0., + word_emb_param_name=None, + training=True, + pos_enc_param_name=None, + is_src=True, + params_type="normal"): + """Add word embeddings and position encodings. + The output tensor has a shape of: + [batch_size, max_src_length_in_batch, d_model]. + This module is used at the bottom of the encoder stacks. + """ + assert params_type == "fixed" or params_type == "normal" or params_type == "new" + pre_name = "relative_positionrelative_position" + if params_type == "fixed": + pre_name = "fixed_relative_positionfixed_relative_position" + elif params_type == "new": + pre_name = "new_relative_positionnew_relative_position" + src_word_emb = layers.embedding( + src_word, + size=[src_vocab_size, src_emb_dim], + padding_idx=ModelHyperParams.bos_idx, # set embedding of bos to 0 + param_attr=fluid.ParamAttr( + name = pre_name + word_emb_param_name, + initializer=fluid.initializer.Normal(0., src_emb_dim ** -0.5)))#, is_sparse=True) + if not is_src and training: + src_word_emb = layers.pad(src_word_emb, [0, 0, 1, 0, 0, 0]) + src_word_emb = layers.scale(x=src_word_emb, scale=src_emb_dim ** 0.5) + src_pos_enc = layers.embedding( + src_pos, + size=[src_max_len, src_emb_dim], + param_attr=fluid.ParamAttr( + trainable=False, name = pre_name + pos_enc_param_name)) + src_pos_enc.stop_gradient = True + enc_input = src_word_emb + src_pos_enc + return layers.dropout( + enc_input, + dropout_prob=dropout_rate, + seed=ModelHyperParams.dropout_seed, + is_test=False, dropout_implementation='upscale_in_train') if dropout_rate else enc_input + + +prepare_encoder = partial( + prepare_encoder_decoder, pos_enc_param_name=pos_enc_param_names[0], is_src=True) +prepare_decoder = partial( + prepare_encoder_decoder, pos_enc_param_name=pos_enc_param_names[1], is_src=False) + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd="n", + postprocess_cmd="da", + params_type="normal"): + """The encoder layers that can be stacked to form a deep encoder. 
+ This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. + """ + attn_output = multi_head_attention( + pre_process_layer(enc_input, preprocess_cmd, + prepostprocess_dropout), None, None, attn_bias, d_key, + d_value, d_model, n_head, attention_dropout, attention_type = "dot_product_relative_encoder", params_type = params_type) + attn_output = post_process_layer(enc_input, attn_output, postprocess_cmd, + prepostprocess_dropout) + ffd_output = positionwise_feed_forward( + pre_process_layer(attn_output, preprocess_cmd, prepostprocess_dropout), + d_inner_hid, d_model, relu_dropout) + return post_process_layer(attn_output, ffd_output, postprocess_cmd, + prepostprocess_dropout) + + +def encoder(enc_input, + attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd="n", + postprocess_cmd="da", + params_type="normal"): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer. + """ + for i in range(n_layer): + enc_output = encoder_layer( + enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + params_type=params_type) + enc_input = enc_output + enc_output = pre_process_layer(enc_output, preprocess_cmd, + prepostprocess_dropout) + return enc_output + + +def decoder_layer(dec_input, + enc_output, + slf_attn_bias, + dec_enc_attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + cache=None, + params_type="normal"): + """ The layer to be stacked in decoder part. + The structure of this module is similar to that in the encoder part except + a multi-head attention is added to implement encoder-decoder attention. + """ + slf_attn_output = multi_head_attention( + pre_process_layer(dec_input, preprocess_cmd, prepostprocess_dropout), + None, + None, + slf_attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + cache, + attention_type="dot_product_relative_decoder", + params_type=params_type) + slf_attn_output = post_process_layer( + dec_input, + slf_attn_output, + postprocess_cmd, + prepostprocess_dropout, ) + enc_attn_output = multi_head_attention( + pre_process_layer(slf_attn_output, preprocess_cmd, prepostprocess_dropout), + enc_output, + enc_output, + dec_enc_attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + params_type=params_type) + enc_attn_output = post_process_layer( + slf_attn_output, + enc_attn_output, + postprocess_cmd, + prepostprocess_dropout, ) + ffd_output = positionwise_feed_forward( + pre_process_layer(enc_attn_output, preprocess_cmd, + prepostprocess_dropout), + d_inner_hid, + d_model, + relu_dropout, ) + dec_output = post_process_layer( + enc_attn_output, + ffd_output, + postprocess_cmd, + prepostprocess_dropout, ) + return dec_output + + +def decoder(dec_input, + enc_output, + dec_slf_attn_bias, + dec_enc_attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + caches=None, + params_type="normal"): + """ + The decoder is composed of a stack of identical decoder_layer layers. 
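+ When `caches` is provided, its i-th entry is passed to the i-th layer so that
+ self-attention can reuse and extend the keys/values stored from previous
+ decoding steps.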
+ """ + for i in range(n_layer): + dec_output = decoder_layer( + dec_input, + enc_output, + dec_slf_attn_bias, + dec_enc_attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + cache=None if caches is None else caches[i], + params_type=params_type) + dec_input = dec_output + dec_output = pre_process_layer(dec_output, preprocess_cmd, + prepostprocess_dropout) + return dec_output + + +def make_all_inputs(input_fields): + """ + Define the input data layers for the transformer model. + """ + inputs = [] + for input_field in input_fields: + input_var = layers.data( + name=input_field, + shape=input_descs[input_field][0], + dtype=input_descs[input_field][1], + lod_level=input_descs[input_field][2] + if len(input_descs[input_field]) == 3 else 0, + append_batch_size=False) + inputs.append(input_var) + return inputs + + +def make_all_py_reader_inputs(input_fields, is_test=False): + """ + Define the input data layers for the transformer model. + """ + reader = layers.py_reader( + capacity=20, + name="test_reader" if is_test else "train_reader", + shapes=[input_descs[input_field][0] for input_field in input_fields], + dtypes=[input_descs[input_field][1] for input_field in input_fields], + lod_levels=[ + input_descs[input_field][2] + if len(input_descs[input_field]) == 3 else 0 + for input_field in input_fields + ], use_double_buffer=True) + return layers.read_file(reader), reader + + +def relative_transformer(src_vocab_size, + trg_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + label_smooth_eps, + use_py_reader=False, + is_test=False, + params_type="normal", + all_data_inputs = None): + """ + transformer + """ + if embedding_sharing: + assert src_vocab_size == trg_vocab_size, ( + "Vocabularies in source and target should be same for weight sharing." + ) + + data_input_names = encoder_data_input_fields + \ + decoder_data_input_fields[:-1] + label_data_input_fields + dense_bias_input_fields + + if use_py_reader: + all_inputs = all_data_inputs + else: + all_inputs = make_all_inputs(data_input_names) + + enc_inputs_len = len(encoder_data_input_fields) + dec_inputs_len = len(decoder_data_input_fields[:-1]) + enc_inputs = all_inputs[0:enc_inputs_len] + dec_inputs = all_inputs[enc_inputs_len:enc_inputs_len + dec_inputs_len] + real_label = all_inputs[enc_inputs_len + dec_inputs_len] + weights = all_inputs[enc_inputs_len + dec_inputs_len + 1] + reverse_label = all_inputs[enc_inputs_len + dec_inputs_len + 2] + + enc_output = wrap_encoder( + src_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + enc_inputs, + params_type=params_type) + + predict = wrap_decoder( + trg_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + dec_inputs, + enc_output, is_train = True if not is_test else False, + params_type=params_type) + + # Padding index do not contribute to the total loss. The weights is used to + # cancel padding index in calculating the loss. 
+ if label_smooth_eps: + label = layers.one_hot(input=real_label, depth=trg_vocab_size) + label = label * (1 - label_smooth_eps) + (1 - label) * ( + label_smooth_eps / (trg_vocab_size - 1)) + label.stop_gradient = True + else: + label = real_label + + cost = layers.softmax_with_cross_entropy( + logits=predict, + label=label, + soft_label=True if label_smooth_eps else False) + weighted_cost = cost * weights + sum_cost = layers.reduce_sum(weighted_cost) + sum_cost.persistable = True + token_num = layers.reduce_sum(weights) + token_num.persistable = True + token_num.stop_gradient = True + avg_cost = sum_cost / token_num + + sen_count = layers.shape(dec_inputs[0])[0] + batch_predict = layers.reshape(predict, shape = [sen_count, -1, ModelHyperParams.trg_vocab_size]) + batch_label = layers.reshape(real_label, shape=[sen_count, -1]) + batch_weights = layers.reshape(weights, shape=[sen_count, -1, 1]) + return sum_cost, avg_cost, token_num, batch_predict, cost, sum_cost, batch_label, batch_weights + + +def wrap_encoder(src_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + enc_inputs=None, + params_type="normal"): + """ + The wrapper assembles together all needed layers for the encoder. + """ + if enc_inputs is None: + # This is used to implement independent encoder program in inference. + src_word, src_pos, src_slf_attn_bias = make_all_inputs( + encoder_data_input_fields) + else: + src_word, src_pos, src_slf_attn_bias = enc_inputs + enc_input = prepare_encoder( + src_word, + src_pos, + src_vocab_size, + d_model, + max_length, + prepostprocess_dropout, + word_emb_param_name=word_emb_param_names[0], + params_type=params_type) + enc_output = encoder( + enc_input, + src_slf_attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + params_type=params_type) + return enc_output + + +def wrap_decoder(trg_vocab_size, + max_length, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + dec_inputs=None, + enc_output=None, + caches=None, is_train=True, params_type="normal"): + """ + The wrapper assembles together all needed layers for the decoder. + """ + if dec_inputs is None: + # This is used to implement independent decoder program in inference. 
+ trg_word, reverse_trg_word, trg_pos, trg_slf_attn_bias, trg_src_attn_bias, enc_output = \ + make_all_inputs(decoder_data_input_fields) + else: + trg_word, reverse_trg_word, trg_pos, trg_slf_attn_bias, trg_src_attn_bias = dec_inputs + + dec_input = prepare_decoder( + trg_word, + trg_pos, + trg_vocab_size, + d_model, + max_length, + prepostprocess_dropout, + word_emb_param_name=word_emb_param_names[0] + if embedding_sharing else word_emb_param_names[1], + training=is_train, + params_type=params_type) + + dec_output = decoder( + dec_input, + enc_output, + trg_slf_attn_bias, + trg_src_attn_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + caches=caches, + params_type=params_type) + # Reshape to 2D tensor to use GEMM instead of BatchedGEMM + dec_output = layers.reshape( + dec_output, shape=[-1, dec_output.shape[-1]], inplace=True) + + assert params_type == "fixed" or params_type == "normal" or params_type == "new" + pre_name = "relative_positionrelative_position" + if params_type == "fixed": + pre_name = "fixed_relative_positionfixed_relative_position" + elif params_type == "new": + pre_name = "new_relative_positionnew_relative_position" + if weight_sharing and embedding_sharing: + predict = layers.matmul( + x=dec_output, + y=fluid.default_main_program().global_block().var( + pre_name + word_emb_param_names[0]), + transpose_y=True) + elif weight_sharing: + predict = layers.matmul( + x=dec_output, + y=fluid.default_main_program().global_block().var( + pre_name + word_emb_param_names[1]), + transpose_y=True) + else: + predict = layers.fc(input=dec_output, + size=trg_vocab_size, + bias_attr=False) + if dec_inputs is None: + # Return probs for independent decoder program. + predict = layers.softmax(predict) + return predict + + +def get_enc_bias(source_inputs): + """ + get_enc_bias + """ + source_inputs = layers.cast(source_inputs, 'float32') + emb_sum = layers.reduce_sum(layers.abs(source_inputs), dim=-1) + zero = layers.fill_constant([1], 'float32', value=0) + bias = layers.cast(layers.equal(emb_sum, zero), 'float32') * -1e9 + return layers.unsqueeze(layers.unsqueeze(bias, axes=[1]), axes=[1]) + + +def relative_fast_decode( + src_vocab_size, + trg_vocab_size, + max_in_len, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + weight_sharing, + embedding_sharing, + beam_size, + batch_size, + max_out_len, + decode_alpha, + eos_idx, + params_type="normal"): + """ + Use beam search to decode. Caches will be used to store states of history + steps which can make the decoding faster. 
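+ At every step the top `beam_size` extensions per sentence are kept alive,
+ finished hypotheses are accumulated separately, and the two sets are merged
+ once decoding stops.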
+ """ + + assert params_type == "normal" or params_type == "new" or params_type == "fixed" + data_input_names = encoder_data_input_fields + fast_decoder_data_input_fields + + all_inputs = make_all_inputs(data_input_names) + + enc_inputs_len = len(encoder_data_input_fields) + dec_inputs_len = len(fast_decoder_data_input_fields) + enc_inputs = all_inputs[0:enc_inputs_len] + dec_inputs = all_inputs[enc_inputs_len:enc_inputs_len + dec_inputs_len] + + enc_output = wrap_encoder(src_vocab_size, max_in_len, n_layer, n_head, + d_key, d_value, d_model, d_inner_hid, + prepostprocess_dropout, attention_dropout, + relu_dropout, preprocess_cmd, postprocess_cmd, + weight_sharing, embedding_sharing, enc_inputs, params_type=params_type) + enc_bias = get_enc_bias(enc_inputs[0]) + source_length, = dec_inputs + + def beam_search(enc_output, enc_bias, source_length): + """ + beam_search + """ + max_len = layers.fill_constant( + shape=[1], dtype='int64', value=max_out_len) + step_idx = layers.fill_constant( + shape=[1], dtype='int64', value=0) + cond = layers.less_than(x=step_idx, y=max_len) + while_op = layers.While(cond) + + caches_batch_size = batch_size * beam_size + init_score = np.zeros([1, beam_size]).astype('float32') + init_score[:, 1:] = -INF + initial_log_probs = layers.assign(init_score) + + alive_log_probs = layers.expand(initial_log_probs, [batch_size, 1]) + # alive seq [batch_size, beam_size, 1] + initial_ids = layers.zeros([batch_size, 1, 1], 'float32') + alive_seq = layers.expand(initial_ids, [1, beam_size, 1]) + alive_seq = layers.cast(alive_seq, 'int64') + + enc_output = layers.unsqueeze(enc_output, axes=[1]) + enc_output = layers.expand(enc_output, [1, beam_size, 1, 1]) + enc_output = layers.reshape(enc_output, [caches_batch_size, -1, d_model]) + + tgt_src_attn_bias = layers.unsqueeze(enc_bias, axes=[1]) + tgt_src_attn_bias = layers.expand(tgt_src_attn_bias, [1, beam_size, n_head, 1, 1]) + enc_bias_shape = layers.shape(tgt_src_attn_bias) + tgt_src_attn_bias = layers.reshape(tgt_src_attn_bias, [-1, enc_bias_shape[2], + enc_bias_shape[3], enc_bias_shape[4]]) + + beam_search = BeamSearch(beam_size, batch_size, decode_alpha, trg_vocab_size, d_model) + + caches = [{ + "k": layers.fill_constant( + shape=[caches_batch_size, 0, d_model], + dtype=enc_output.dtype, + value=0), + "v": layers.fill_constant( + shape=[caches_batch_size, 0, d_model], + dtype=enc_output.dtype, + value=0) + } for i in range(n_layer)] + + finished_seq = layers.zeros_like(alive_seq) + finished_scores = layers.fill_constant([batch_size, beam_size], + dtype='float32', value=-INF) + finished_flags = layers.fill_constant([batch_size, beam_size], + dtype='float32', value=0) + + with while_op.block(): + pos = layers.fill_constant([caches_batch_size, 1, 1], dtype='int64', value=1) + pos = layers.elementwise_mul(pos, step_idx, axis=0) + + alive_seq_1 = layers.reshape(alive_seq, [caches_batch_size, -1]) + alive_seq_2 = alive_seq_1[:, -1:] + alive_seq_2 = layers.unsqueeze(alive_seq_2, axes=[1]) + + logits = wrap_decoder( + trg_vocab_size, max_in_len, n_layer, n_head, d_key, + d_value, d_model, d_inner_hid, prepostprocess_dropout, + attention_dropout, relu_dropout, preprocess_cmd, + postprocess_cmd, weight_sharing, embedding_sharing, + dec_inputs=(alive_seq_2, alive_seq_2, pos, None, tgt_src_attn_bias), + enc_output=enc_output, caches=caches, is_train=False, params_type=params_type) + + alive_seq_2, alive_log_probs_2, finished_seq_2, finished_scores_2, finished_flags_2, caches_2 = \ + beam_search.inner_func(step_idx, logits, alive_seq_1, 
alive_log_probs, finished_seq, + finished_scores, finished_flags, caches, enc_output, + tgt_src_attn_bias) + + layers.increment(x=step_idx, value=1.0, in_place=True) + finish_cond = beam_search.is_finished(step_idx, source_length, alive_log_probs_2, + finished_scores_2, finished_flags_2) + + layers.assign(alive_seq_2, alive_seq) + layers.assign(alive_log_probs_2, alive_log_probs) + layers.assign(finished_seq_2, finished_seq) + layers.assign(finished_scores_2, finished_scores) + layers.assign(finished_flags_2, finished_flags) + + for i in xrange(len(caches_2)): + layers.assign(caches_2[i]["k"], caches[i]["k"]) + layers.assign(caches_2[i]["v"], caches[i]["v"]) + + layers.logical_and(x=cond, y=finish_cond, out=cond) + + finished_flags = layers.reduce_sum(finished_flags, dim=1, keep_dim=True) / beam_size + finished_flags = layers.cast(finished_flags, 'bool') + mask = layers.cast(layers.reduce_any(input=finished_flags, dim=1, keep_dim=True), 'float32') + mask = layers.expand(mask, [1, beam_size]) + + mask2 = 1.0 - mask + finished_seq = layers.cast(finished_seq, 'float32') + alive_seq = layers.cast(alive_seq, 'float32') + #print mask + + finished_seq = layers.elementwise_mul(finished_seq, mask, axis=0) + \ + layers.elementwise_mul(alive_seq, mask2, axis = 0) + finished_seq = layers.cast(finished_seq, 'int32') + finished_scores = layers.elementwise_mul(finished_scores, mask, axis=0) + \ + layers.elementwise_mul(alive_log_probs, mask2) + finished_seq.persistable = True + finished_scores.persistable = True + + return finished_seq, finished_scores + + finished_ids, finished_scores = beam_search(enc_output, enc_bias, source_length) + return finished_ids, finished_scores diff --git a/PaddleNLP/Research/EMNLP2019-MAL/src/train.py b/PaddleNLP/Research/EMNLP2019-MAL/src/train.py new file mode 100644 index 0000000000000000000000000000000000000000..4b9a48b1ed4bd3a5c6eb783f450fbd0d1a1f4895 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/src/train.py @@ -0,0 +1,1099 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
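+
+# Training entry for multi-agent joint training: three student agents (forward,
+# relative-position and dense Transformers under the "new_*" name scopes) are
+# trained together, while periodically synchronized "fixed_*" copies provide an
+# ensembled teacher distribution for distillation.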
+ +import argparse +import ast +import copy +import logging +import multiprocessing +import os +import six +import sys +import time +import random +import math + +import numpy as np +import paddle.fluid as fluid +import paddle.fluid.layers as layers +import paddle.fluid.layers.nn as nn +import paddle.fluid.layers.tensor as tensor +from paddle.fluid.framework import default_main_program + +import reader +from reader import * +from config import * +from forward_model import forward_transformer, forward_position_encoding_init, forward_fast_decode, make_all_py_reader_inputs +from dense_model import dense_transformer, dense_fast_decode +from relative_model import relative_transformer, relative_fast_decode + +def parse_args(): + """ + parse_args + """ + parser = argparse.ArgumentParser("Training for Transformer.") + parser.add_argument( + "--train_file_pattern", + type=str, + required=True, + help="The pattern to match training data files.") + parser.add_argument( + "--val_file_pattern", + type=str, + help="The pattern to match validation data files.") + parser.add_argument( + "--ckpt_path", + type=str, + help="The pattern to match training data files.") + parser.add_argument( + "--infer_batch_size", + type=int, + help="Infer batch_size") + parser.add_argument( + "--decode_alpha", + type=float, + help="decode_alpha") + parser.add_argument( + "--beam_size", + type=int, + help="Infer beam_size") + parser.add_argument( + "--use_token_batch", + type=ast.literal_eval, + default=True, + help="The flag indicating whether to " + "produce batch data according to token number.") + parser.add_argument( + "--batch_size", + type=int, + default=4096, + help="The number of sequences contained in a mini-batch, or the maximum " + "number of tokens (include paddings) contained in a mini-batch. Note " + "that this represents the number on single device and the actual batch " + "size for multi-devices will multiply the device number.") + parser.add_argument( + "--pool_size", + type=int, + default=200000, + help="The buffer size to pool data.") + parser.add_argument( + "--num_threads", + type=int, + default=2, + help="The number of threads which executor use.") + parser.add_argument( + "--use_fp16", + type=ast.literal_eval, + default=True, + help="Use fp16 or not" + ) + + parser.add_argument( + "--nccl_comm_num", + type=int, + default=1, + help="The number of threads which executor use.") + + parser.add_argument( + "--sort_type", + default="pool", + choices=("global", "pool", "none"), + help="The grain to sort by length: global for all instances; pool for " + "instances in pool; none for no sort.") + parser.add_argument( + "--use_hierarchical_allreduce", + default=False, + type=ast.literal_eval, + help="Use hierarchical allreduce or not.") + parser.add_argument( + "--hierarchical_allreduce_inter_nranks", + default=8, + type=int, + help="interranks.") + parser.add_argument( + "--shuffle", + type=ast.literal_eval, + default=True, + help="The flag indicating whether to shuffle instances in each pass.") + parser.add_argument( + "--shuffle_batch", + type=ast.literal_eval, + default=True, + help="The flag indicating whether to shuffle the data batches.") + parser.add_argument( + "--special_token", + type=str, + default=["", "", ""], + nargs=3, + help="The , and tokens in the dictionary.") + parser.add_argument( + "--token_delimiter", + type=lambda x: str(x.encode().decode("unicode-escape")), + default=" ", + help="The delimiter used to split tokens in source or target sentences. 
" + "For EN-DE BPE data we provided, use spaces as token delimiter. ") + parser.add_argument( + 'opts', + help='See config.py for all options', + default=None, + nargs=argparse.REMAINDER) + parser.add_argument( + '--local', + type=ast.literal_eval, + default=False, + help='Whether to run as local mode.') + parser.add_argument( + '--device', + type=str, + default='GPU', + choices=['CPU', 'GPU'], + help="The device type.") + parser.add_argument( + '--sync', type=ast.literal_eval, default=True, help="sync mode.") + parser.add_argument( + "--enable_ce", + type=ast.literal_eval, + default=False, + help="The flag indicating whether to run the task " + "for continuous evaluation.") + parser.add_argument( + "--use_mem_opt", + type=ast.literal_eval, + default=True, + help="The flag indicating whether to use memory optimization.") + parser.add_argument( + "--use_py_reader", + type=ast.literal_eval, + default=True, + help="The flag indicating whether to use py_reader.") + parser.add_argument( + "--fetch_steps", + type=int, + default=100, + help="The frequency to fetch and print output.") + parser.add_argument( + "--use_delay_load", + type=ast.literal_eval, + default=True, + help= + "The flag indicating whether to load all data into memories at once.") + parser.add_argument( + "--src_vocab_size", + type=str, + required=True, + help="Size of src Vocab.") + parser.add_argument( + "--tgt_vocab_size", + type=str, + required=True, + help="Size of tgt Vocab.") + parser.add_argument( + "--restore_step", + type=int, + default=0, + help="The step number of checkpoint to restore training.") + parser.add_argument( + "--fuse", + type=int, + default=0, + help="Use fusion or not.") + + args = parser.parse_args() + + src_voc_size = args.src_vocab_size + trg_voc_size = args.tgt_vocab_size + if args.use_delay_load: + dict_args = [ + "src_vocab_size", src_voc_size, + "trg_vocab_size", trg_voc_size, + "bos_idx", str(0), + "eos_idx", str(1), + "unk_idx", str(int(src_voc_size) - 1) + ] + else: + src_dict = reader.DataReader.load_dict(args.src_vocab_fpath) + trg_dict = reader.DataReader.load_dict(args.trg_vocab_fpath) + dict_args = [ + "src_vocab_size", str(len(src_dict)), "trg_vocab_size", + str(len(trg_dict)), "bos_idx", str(src_dict[args.special_token[0]]), + "eos_idx", str(src_dict[args.special_token[1]]), "unk_idx", + str(src_dict[args.special_token[2]]) + ] + merge_cfg_from_list(args.opts + dict_args, + [TrainTaskConfig, ModelHyperParams]) + return args + + +def prepare_batch_input(insts, data_input_names, src_pad_idx, trg_pad_idx, + n_head, d_model): + """ + Put all padded data needed by training into a dict. 
+ """ + src_word, src_pos, src_slf_attn_bias, src_max_len = pad_batch_data( + [inst[0] for inst in insts], src_pad_idx, n_head, is_target=False) + src_word = src_word.reshape(-1, src_max_len, 1) + src_pos = src_pos.reshape(-1, src_max_len, 1) + + trg_word, trg_pos, trg_slf_attn_bias, trg_max_len = pad_batch_data( + [inst[1] for inst in insts], trg_pad_idx, n_head, is_target=True) + trg_word = trg_word.reshape(-1, trg_max_len, 1) + trg_word = trg_word[:, 1:, :] + trg_pos = trg_pos.reshape(-1, trg_max_len, 1) + + trg_src_attn_bias = np.tile(src_slf_attn_bias[:, :, ::src_max_len, :], + [1, 1, trg_max_len, 1]).astype("float32") + + lbl_word, lbl_weight, num_token = pad_batch_data( + [inst[2] for inst in insts], + trg_pad_idx, + n_head, + is_target=False, + is_label=True, + return_attn_bias=False, + return_max_len=False, + return_num_token=True) + + # reverse_target + reverse_trg_word, _, _, _ = pad_batch_data( + [inst[3] for inst in insts], trg_pad_idx, n_head, is_target=True) + reverse_trg_word = reverse_trg_word.reshape(-1, trg_max_len, 1) + reverse_trg_word = reverse_trg_word[:, 1:, :] + + reverse_lbl_word, _, _ = pad_batch_data( + [inst[4] for inst in insts], + trg_pad_idx, + n_head, + is_target=False, + is_label=True, + return_attn_bias=False, + return_max_len=False, + return_num_token=True) + + eos_position = [] + meet_eos = False + for word_id in reverse_lbl_word: + if word_id[0] == 1 and not meet_eos: + meet_eos = True + eos_position.append([1]) + elif word_id[0] == 1 and meet_eos: + eos_position.append([0]) + else: + meet_eos = False + eos_position.append([0]) + + data_input_dict = dict( + zip(data_input_names, [ + src_word, src_pos, src_slf_attn_bias, trg_word, reverse_trg_word, trg_pos, + trg_slf_attn_bias, trg_src_attn_bias, lbl_word, lbl_weight, reverse_lbl_word, np.asarray(eos_position, dtype = "int64") + ])) + + return data_input_dict, np.asarray([num_token], dtype="float32") + + +def prepare_feed_dict_list(data_generator, count, num_tokens=None, num_insts=None): + """ + Prepare the list of feed dict for multi-devices. + """ + feed_dict_list = [] + eos_idx = ModelHyperParams.eos_idx + n_head = ModelHyperParams.n_head + d_model = ModelHyperParams.d_model + max_length = ModelHyperParams.max_length + dense_n_head = DenseModelHyperParams.n_head + dense_d_model = DenseModelHyperParams.d_model + + if data_generator is not None: # use_py_reader == False + dense_data_input_names = dense_encoder_data_input_fields + \ + dense_decoder_data_input_fields[:-1] + dense_label_data_input_fields + data_input_names = encoder_data_input_fields + \ + decoder_data_input_fields[:-1] + label_data_input_fields + data = next(data_generator) + for idx, data_buffer in enumerate(data): + data_input_dict, num_token = prepare_batch_input( + data_buffer, data_input_names, eos_idx, + eos_idx, n_head, + d_model) + dense_data_input_dict, _ = prepare_batch_input( + data_buffer, dense_data_input_names, eos_idx, + eos_idx, dense_n_head, + dense_d_model) + data_input_dict.update(dense_data_input_dict) # merge dict + feed_dict_list.append(data_input_dict) + if isinstance(num_tokens, list): num_tokens.append(num_token) + if isinstance(num_insts, list): num_insts.append(len(data_buffer)) + + return feed_dict_list if len(feed_dict_list) == count else None + + +def py_reader_provider_wrapper(data_reader): + """ + Data provider needed by fluid.layers.py_reader. 
+ """ + + def py_reader_provider(): + """ + py_reader_provider + """ + eos_idx = ModelHyperParams.eos_idx + n_head = ModelHyperParams.n_head + d_model = ModelHyperParams.d_model + max_length = ModelHyperParams.max_length + dense_n_head = DenseModelHyperParams.n_head + dense_d_model = DenseModelHyperParams.d_model + + data_input_names = encoder_data_input_fields + \ + decoder_data_input_fields[:-1] + label_data_input_fields + dense_data_input_names = dense_encoder_data_input_fields + \ + dense_decoder_data_input_fields[:-1] + label_data_input_fields + + new_data_input_names = data_input_names + dense_bias_input_fields + + for batch_id, data in enumerate(data_reader()): + data_input_dict, num_token = prepare_batch_input( + data, data_input_names, eos_idx, + eos_idx, n_head, + d_model) + dense_data_input_dict, _ = prepare_batch_input( + data, dense_data_input_names, eos_idx, + eos_idx, dense_n_head, + dense_d_model) + data_input_dict["dense_src_slf_attn_bias"] = dense_data_input_dict["dense_src_slf_attn_bias"] + data_input_dict["dense_trg_slf_attn_bias"] = dense_data_input_dict["dense_trg_slf_attn_bias"] + data_input_dict["dense_trg_src_attn_bias"] = dense_data_input_dict["dense_trg_src_attn_bias"] + total_dict = dict(data_input_dict.items()) + yield [total_dict[item] for item in new_data_input_names] + + return py_reader_provider + + +from infer import prepare_feed_dict_list as infer_prepare_feed_dict_list +from infer import prepare_dense_feed_dict_list as infer_prepare_dense_feed_dict_list +def test_context(exe, train_exe, dev_count, agent_name, args): + # Context to do validation. + test_prog = fluid.Program() + startup_prog = fluid.Program() + if args.enable_ce: + test_prog.random_seed = 1000 + startup_prog.random_seed = 1000 + with fluid.program_guard(test_prog, startup_prog): + if agent_name == "new_forward": + with fluid.unique_name.guard("new_forward"): + out_ids1, out_scores1 = forward_fast_decode( + ModelHyperParams.src_vocab_size, + ModelHyperParams.trg_vocab_size, + ModelHyperParams.max_length + 50, + ModelHyperParams.n_layer, + ModelHyperParams.n_head, + ModelHyperParams.d_key, + ModelHyperParams.d_value, + ModelHyperParams.d_model, + ModelHyperParams.d_inner_hid, + ModelHyperParams.prepostprocess_dropout, + ModelHyperParams.attention_dropout, + ModelHyperParams.relu_dropout, + ModelHyperParams.preprocess_cmd, + ModelHyperParams.postprocess_cmd, + ModelHyperParams.weight_sharing, + ModelHyperParams.embedding_sharing, + args.beam_size, + args.infer_batch_size, + InferTaskConfig.max_out_len, + args.decode_alpha, + ModelHyperParams.eos_idx, + params_type="new" + ) + elif agent_name == "new_relative_position": + with fluid.unique_name.guard("new_relative_position"): + out_ids2, out_scores2 = relative_fast_decode( + ModelHyperParams.src_vocab_size, + ModelHyperParams.trg_vocab_size, + ModelHyperParams.max_length + 50, + ModelHyperParams.n_layer, + ModelHyperParams.n_head, + ModelHyperParams.d_key, + ModelHyperParams.d_value, + ModelHyperParams.d_model, + ModelHyperParams.d_inner_hid, + ModelHyperParams.prepostprocess_dropout, + ModelHyperParams.attention_dropout, + ModelHyperParams.relu_dropout, + ModelHyperParams.preprocess_cmd, + ModelHyperParams.postprocess_cmd, + ModelHyperParams.weight_sharing, + ModelHyperParams.embedding_sharing, + args.beam_size, + args.infer_batch_size, + InferTaskConfig.max_out_len, + args.decode_alpha, + ModelHyperParams.eos_idx, + params_type="new" + ) + + elif agent_name == "new_dense": + DenseModelHyperParams.src_vocab_size = 
ModelHyperParams.src_vocab_size + DenseModelHyperParams.trg_vocab_size = ModelHyperParams.trg_vocab_size + DenseModelHyperParams.weight_sharing = ModelHyperParams.weight_sharing + DenseModelHyperParams.embedding_sharing = ModelHyperParams.embedding_sharing + with fluid.unique_name.guard("new_dense"): + out_ids3, out_scores3 = dense_fast_decode( + DenseModelHyperParams.src_vocab_size, + DenseModelHyperParams.trg_vocab_size, + DenseModelHyperParams.max_length + 50, + DenseModelHyperParams.n_layer, + DenseModelHyperParams.enc_n_layer, + DenseModelHyperParams.n_head, + DenseModelHyperParams.d_key, + DenseModelHyperParams.d_value, + DenseModelHyperParams.d_model, + DenseModelHyperParams.d_inner_hid, + DenseModelHyperParams.prepostprocess_dropout, + DenseModelHyperParams.attention_dropout, + DenseModelHyperParams.relu_dropout, + DenseModelHyperParams.preprocess_cmd, + DenseModelHyperParams.postprocess_cmd, + DenseModelHyperParams.weight_sharing, + DenseModelHyperParams.embedding_sharing, + args.beam_size, + args.infer_batch_size, + InferTaskConfig.max_out_len, + args.decode_alpha, + ModelHyperParams.eos_idx, + params_type="new" + ) + + test_prog = test_prog.clone(for_test=True) + + dev_count = 1 + file_pattern = "%s" % (args.val_file_pattern) + lines_cnt = len(open(file_pattern, 'r').readlines()) + data_reader = line_reader(file_pattern, args.infer_batch_size, dev_count, + token_delimiter=args.token_delimiter, + max_len=ModelHyperParams.max_length, + parse_line=parse_src_line) + + test_data = prepare_data_generator(args, is_test=True, count=dev_count, pyreader=None, + batch_size=args.infer_batch_size, data_reader=data_reader) + + def test(step_id, exe=exe): + + f = "" + if agent_name == "new_relative_position": + f = open("./output/new_relative_position_iter_%d.trans" % (step_id), 'w') + elif agent_name == "new_forward": + f = open("./output/new_forward_iter_%d.trans" % (step_id), 'w') + elif agent_name == "new_dense": + f = open("./output/new_dense_iter_%d.trans" % (step_id), 'w') + + data_generator = test_data() + trans_list = [] + while True: + try: + feed_dict_list = infer_prepare_feed_dict_list(data_generator, 1) if agent_name != "new_dense" else infer_prepare_dense_feed_dict_list(data_generator, 1) + if agent_name == "new_forward": + seq_ids, seq_scores = exe.run( + fetch_list=[out_ids1.name, out_scores1.name], + feed=feed_dict_list, + program=test_prog, + return_numpy=True) + elif agent_name == "new_relative_position": + seq_ids, seq_scores = exe.run( + fetch_list=[out_ids2.name, out_scores2.name], + feed=feed_dict_list, + program=test_prog, + return_numpy=True) + elif agent_name == "new_dense": + seq_ids, seq_scores = exe.run( + fetch_list=[out_ids3.name, out_scores3.name], + feed=feed_dict_list, + program=test_prog, + return_numpy=True) + + seq_ids = seq_ids.tolist() + for index in xrange(args.infer_batch_size): + seq = seq_ids[index][0] + if 1 not in seq: + res = seq[1:-1] + else: + res = seq[1: seq.index(1)] + res = map(str, res) + trans_list.append(" ".join(res)) + except (StopIteration, fluid.core.EOFException): + # The current pass is over. 
+ break + trans_list = trans_list[:lines_cnt] + for trans in trans_list: + f.write("%s\n" % trans) + + f.flush() + f.close() + return test + +def get_tensor_by_prefix(pre_name, param_name_list): + tensors_list = [] + for param_name in param_name_list: + if pre_name in param_name: + tensors_list.append(fluid.global_scope().find_var(param_name).get_tensor()) + + if pre_name == "fixed_relative_positionfixed_relative_position": + tensors_list.append(fluid.global_scope().find_var("fixed_relative_positions_keys").get_tensor()) + tensors_list.append(fluid.global_scope().find_var("fixed_relative_positions_values").get_tensor()) + elif pre_name == "new_relative_positionnew_relative_position": + tensors_list.append(fluid.global_scope().find_var("new_relative_positions_keys").get_tensor()) + tensors_list.append(fluid.global_scope().find_var("new_relative_positions_values").get_tensor()) + + return tensors_list + + +def train_loop(exe, + train_prog, + startup_prog, + args, + dev_count, + avg_cost, + teacher_cost, + single_model_sum_cost, + single_model_avg_cost, + token_num, + pyreader, place, + nccl2_num_trainers=1, + nccl2_trainer_id=0, + scaled_cost=None, + loss_scaling=None + ): + """ + train_loop + """ + # Initialize the parameters. + if TrainTaskConfig.ckpt_path: + exe.run(startup_prog) + logging.info("load checkpoint from {}".format(TrainTaskConfig.ckpt_path)) + fluid.io.load_params(exe, TrainTaskConfig.ckpt_path, main_program=train_prog) + else: + logging.info("init fluid.framework.default_startup_program") + exe.run(startup_prog) + + param_list = train_prog.block(0).all_parameters() + param_name_list = [p.name for p in param_list ] + + logging.info("begin reader") + batch_scheme = batching_scheme(args.batch_size, 256, shard_multiplier=nccl2_num_trainers) + tf_data = bucket_by_sequence_length( + repeat( + interleave_reader( + args.train_file_pattern, + cycle_length=8, + token_delimiter=args.token_delimiter, + max_len=ModelHyperParams.max_length, + parse_line=parse_line, + ), -1), + lambda x:max(len(x[0]), len(x[1])), + batch_scheme["boundaries"], + batch_scheme["batch_sizes"], + nccl2_num_trainers, + nccl2_trainer_id + ) + args.use_token_batch = False + train_data = prepare_data_generator( + args, is_test=False, count=dev_count, pyreader=pyreader, data_reader=tf_data, \ + py_reader_provider_wrapper=py_reader_provider_wrapper) + + # For faster executor + exec_strategy = fluid.ExecutionStrategy() + exec_strategy.use_experimental_executor = True + exec_strategy.num_threads = 1 + exec_strategy.num_iteration_per_drop_scope = 20 + build_strategy = fluid.BuildStrategy() + build_strategy.enable_inplace = True + build_strategy.fuse_all_optimizer_ops = False + build_strategy.fuse_all_reduce_ops = False + build_strategy.enable_backward_optimizer_op_deps = True + if args.fuse: + build_strategy.fuse_all_reduce_ops = True + + + trainer_id = nccl2_trainer_id + train_exe = fluid.ParallelExecutor( + use_cuda=TrainTaskConfig.use_gpu, + loss_name=avg_cost.name, + main_program=train_prog, + build_strategy=build_strategy, + exec_strategy=exec_strategy, + num_trainers=nccl2_num_trainers, + trainer_id=nccl2_trainer_id) + + if args.val_file_pattern is not None: + new_forward_test = test_context(exe, train_exe, dev_count, "new_forward", args) + new_dense_test = test_context(exe, train_exe, dev_count, "new_dense", args) + new_relative_position_test = test_context(exe, train_exe, dev_count, "new_relative_position", args) + + # the best cross-entropy value with label smoothing + loss_normalizer = -((1. 
- TrainTaskConfig.label_smooth_eps) * np.log( + (1. - TrainTaskConfig.label_smooth_eps + )) + TrainTaskConfig.label_smooth_eps * + np.log(TrainTaskConfig.label_smooth_eps / ( + ModelHyperParams.trg_vocab_size - 1) + 1e-20)) + + # set recovery step + step_idx = args.restore_step if args.restore_step else 0 + if step_idx != 0: + var = fluid.global_scope().find_var("@LR_DECAY_COUNTER@").get_tensor() + recovery_step = np.array([step_idx]).astype("int64") + var.set(recovery_step, fluid.CPUPlace()) + step = np.array(var)[0] + + + # set pos encoding + model_prefix = ["fixed_forward", "fixed_relative_position", + "new_forward", "new_relative_position"] + for pos_enc_param_name in pos_enc_param_names: + for prefix in model_prefix: + pos_name = prefix * 2 + pos_enc_param_name + pos_enc_param = fluid.global_scope().find_var( + pos_name).get_tensor() + + pos_enc_param.set( + forward_position_encoding_init( + ModelHyperParams.max_length + 50, + ModelHyperParams.d_model), place) + + model_prefix_2 = ["fixed_dense", "new_dense"] + for pos_enc_param_name in pos_enc_param_names: + for prefix in model_prefix_2: + pos_name = prefix * 2 + pos_enc_param_name + pos_enc_param = fluid.global_scope().find_var( + pos_name).get_tensor() + + pos_enc_param.set( + forward_position_encoding_init( + DenseModelHyperParams.max_length + 50, + DenseModelHyperParams.d_model), place) + + + logging.info("begin train") + for pass_id in six.moves.xrange(TrainTaskConfig.pass_num): + pass_start_time = time.time() + avg_batch_time = time.time() + + pyreader.start() + data_generator = None + + batch_id = 0 + while True: + try: + num_tokens = [] + num_insts = [] + feed_dict_list = prepare_feed_dict_list(data_generator, + dev_count, num_tokens, num_insts) + + num_token = np.sum(num_tokens).reshape([-1]) + num_inst = np.sum(num_insts).reshape([-1]) + + outs = train_exe.run( + fetch_list=[avg_cost.name, token_num.name, teacher_cost.name] + if (step_idx == 0 or step_idx % args.fetch_steps == (args.fetch_steps - 1)) else [], + feed=feed_dict_list) + + if (step_idx == 0 or step_idx % args.fetch_steps == (args.fetch_steps - 1)): + single_model_total_avg_cost, token_num_val = np.array(outs[0]), np.array(outs[1]) + teacher = np.array(outs[2]) + + if step_idx == 0: + logging.info( + ("step_idx: %d, epoch: %d, batch: %d, teacher loss: %f, avg loss: %f, " + "normalized loss: %f, ppl: %f" + (", batch size: %d" if num_inst else "")) % + ((step_idx, pass_id, batch_id, teacher, single_model_total_avg_cost, + single_model_total_avg_cost - loss_normalizer, + np.exp([min(single_model_total_avg_cost, 100)])) + ((num_inst,) if num_inst else ()))) + else: + logging.info( + ("step_idx: %d, epoch: %d, batch: %d, teacher loss: %f, avg loss: %f, " + "normalized loss: %f, ppl: %f, speed: %.2f step/s" + \ + (", batch size: %d" if num_inst else "")) % + ((step_idx, pass_id, batch_id, teacher, single_model_total_avg_cost, + single_model_total_avg_cost - loss_normalizer, + np.exp([min(single_model_total_avg_cost, 100)]), + args.fetch_steps / (time.time() - avg_batch_time)) + ((num_inst,) if num_inst else ()))) + avg_batch_time = time.time() + + if step_idx % TrainTaskConfig.fixed_freq == (TrainTaskConfig.fixed_freq - 1): + logging.info("copy parameters to fixed parameters when step_idx is {}".format(step_idx)) + + fixed_forward_tensors = get_tensor_by_prefix("fixed_forwardfixed_forward", param_name_list) + new_forward_tensors = get_tensor_by_prefix("new_forwardnew_forward", param_name_list) + fixed_dense_tensors = get_tensor_by_prefix("fixed_densefixed_dense", 
param_name_list) + new_dense_tensors = get_tensor_by_prefix("new_densenew_dense", param_name_list) + fixed_relative_tensors = get_tensor_by_prefix("fixed_relative_positionfixed_relative_position", param_name_list) + new_relative_tensors = get_tensor_by_prefix("new_relative_positionnew_relative_position", param_name_list) + + for (fixed_tensor, new_tensor) in zip(fixed_forward_tensors, new_forward_tensors): + fixed_tensor.set(np.array(new_tensor), place) + for (fixed_tensor, new_tensor) in zip(fixed_relative_tensors, new_relative_tensors): + fixed_tensor.set(np.array(new_tensor), place) + for (fixed_tensor, new_tensor) in zip(fixed_dense_tensors, new_dense_tensors): + fixed_tensor.set(np.array(new_tensor), place) + + if step_idx % TrainTaskConfig.save_freq == (TrainTaskConfig.save_freq - 1): + if trainer_id == 0: + fluid.io.save_params( + exe, + os.path.join(TrainTaskConfig.model_dir, + "iter_" + str(step_idx) + ".infer.model"),train_prog) + + if args.val_file_pattern is not None: + train_exe.drop_local_exe_scopes() + new_dense_test(step_idx) + new_forward_test(step_idx) + new_relative_position_test(step_idx) + + batch_id += 1 + step_idx += 1 + except (StopIteration, fluid.core.EOFException): + break + + +def train(args): + """ + train + """ + is_local = os.getenv("PADDLE_IS_LOCAL", "1") + if is_local == '0': + args.local = False + print(args) + + if args.device == 'CPU': + TrainTaskConfig.use_gpu = False + + training_role = os.getenv("TRAINING_ROLE", "TRAINER") + gpus = os.getenv("FLAGS_selected_gpus").split(",") + gpu_id = int(gpus[0]) + + if training_role == "PSERVER" or (not TrainTaskConfig.use_gpu): + place = fluid.CPUPlace() + dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count())) + else: + place = fluid.CUDAPlace(gpu_id) + dev_count = len(gpus) + + exe = fluid.Executor(place) + + train_prog = fluid.Program() + startup_prog = fluid.Program() + + if args.enable_ce: + train_prog.random_seed = 1000 + startup_prog.random_seed = 1000 + + with fluid.program_guard(train_prog, startup_prog): + logits_list = [] + + data_input_names = encoder_data_input_fields + \ + decoder_data_input_fields[:-1] + label_data_input_fields + dense_bias_input_fields + + all_data_inputs, pyreader = make_all_py_reader_inputs(data_input_names) + with fluid.unique_name.guard("new_forward"): + new_forward_sum_cost, new_forward_avg_cost, new_forward_token_num, new_forward_logits, new_forward_xent, new_forward_loss, new_forward_label, new_forward_non_zeros = forward_transformer( + ModelHyperParams.src_vocab_size, + ModelHyperParams.trg_vocab_size, + ModelHyperParams.max_length + 50, + ModelHyperParams.n_layer, + ModelHyperParams.n_head, + ModelHyperParams.d_key, + ModelHyperParams.d_value, + ModelHyperParams.d_model, + ModelHyperParams.d_inner_hid, + ModelHyperParams.prepostprocess_dropout, + ModelHyperParams.attention_dropout, + ModelHyperParams.relu_dropout, + ModelHyperParams.preprocess_cmd, + ModelHyperParams.postprocess_cmd, + ModelHyperParams.weight_sharing, + ModelHyperParams.embedding_sharing, + TrainTaskConfig.label_smooth_eps, + use_py_reader=True, + is_test=False, + params_type="new", + all_data_inputs=all_data_inputs) + + with fluid.unique_name.guard("new_relative_position"): + new_relative_position_sum_cost, new_relative_position_avg_cost, new_relative_position_token_num, new_relative_position_logits, new_relative_position_xent, new_relative_position_loss, new_relative_position_label, new_relative_position_non_zeros = relative_transformer( + ModelHyperParams.src_vocab_size, + 
ModelHyperParams.trg_vocab_size, + ModelHyperParams.max_length + 50, + ModelHyperParams.n_layer, + ModelHyperParams.n_head, + ModelHyperParams.d_key, + ModelHyperParams.d_value, + ModelHyperParams.d_model, + ModelHyperParams.d_inner_hid, + ModelHyperParams.prepostprocess_dropout, + ModelHyperParams.attention_dropout, + ModelHyperParams.relu_dropout, + ModelHyperParams.preprocess_cmd, + ModelHyperParams.postprocess_cmd, + ModelHyperParams.weight_sharing, + ModelHyperParams.embedding_sharing, + TrainTaskConfig.label_smooth_eps, + use_py_reader=args.use_py_reader, + is_test=False, + params_type="new", + all_data_inputs=all_data_inputs) + + DenseModelHyperParams.src_vocab_size = ModelHyperParams.src_vocab_size + DenseModelHyperParams.trg_vocab_size = ModelHyperParams.trg_vocab_size + DenseModelHyperParams.weight_sharing = ModelHyperParams.weight_sharing + DenseModelHyperParams.embedding_sharing = ModelHyperParams.embedding_sharing + + with fluid.unique_name.guard("new_dense"): + new_dense_sum_cost, new_dense_avg_cost, new_dense_token_num, new_dense_logits, new_dense_xent, new_dense_loss, new_dense_label, _ = dense_transformer( + DenseModelHyperParams.src_vocab_size, + DenseModelHyperParams.trg_vocab_size, + DenseModelHyperParams.max_length + 50, + DenseModelHyperParams.n_layer, + DenseModelHyperParams.enc_n_layer, + DenseModelHyperParams.n_head, + DenseModelHyperParams.d_key, + DenseModelHyperParams.d_value, + DenseModelHyperParams.d_model, + DenseModelHyperParams.d_inner_hid, + DenseModelHyperParams.prepostprocess_dropout, + DenseModelHyperParams.attention_dropout, + DenseModelHyperParams.relu_dropout, + DenseModelHyperParams.preprocess_cmd, + DenseModelHyperParams.postprocess_cmd, + DenseModelHyperParams.weight_sharing, + DenseModelHyperParams.embedding_sharing, + TrainTaskConfig.label_smooth_eps, + use_py_reader=args.use_py_reader, + is_test=False, + params_type="new", + all_data_inputs=all_data_inputs) + + with fluid.unique_name.guard("fixed_forward"): + fixed_forward_sum_cost, fixed_forward_avg_cost, fixed_forward_token_num, fixed_forward_logits, fixed_forward_xent, fixed_forward_loss, fixed_forward_label, fixed_forward_non_zeros = forward_transformer( + ModelHyperParams.src_vocab_size, + ModelHyperParams.trg_vocab_size, + ModelHyperParams.max_length + 50, + ModelHyperParams.n_layer, + ModelHyperParams.n_head, + ModelHyperParams.d_key, + ModelHyperParams.d_value, + ModelHyperParams.d_model, + ModelHyperParams.d_inner_hid, + ModelHyperParams.prepostprocess_dropout, + ModelHyperParams.attention_dropout, + ModelHyperParams.relu_dropout, + ModelHyperParams.preprocess_cmd, + ModelHyperParams.postprocess_cmd, + ModelHyperParams.weight_sharing, + ModelHyperParams.embedding_sharing, + TrainTaskConfig.label_smooth_eps, + use_py_reader=args.use_py_reader, + is_test=False, + params_type="fixed", + all_data_inputs=all_data_inputs) + logits_list.append(fixed_forward_logits) + + + DenseModelHyperParams.src_vocab_size = ModelHyperParams.src_vocab_size + DenseModelHyperParams.trg_vocab_size = ModelHyperParams.trg_vocab_size + DenseModelHyperParams.weight_sharing = ModelHyperParams.weight_sharing + DenseModelHyperParams.embedding_sharing = ModelHyperParams.embedding_sharing + + with fluid.unique_name.guard("fixed_dense"): + fixed_dense_sum_cost, fixed_dense_avg_cost, fixed_dense_token_num, fixed_dense_logits, fixed_dense_xent, fixed_dense_loss, fixed_dense_label, _ = dense_transformer( + DenseModelHyperParams.src_vocab_size, + DenseModelHyperParams.trg_vocab_size, + DenseModelHyperParams.max_length + 50, 
+ DenseModelHyperParams.n_layer, + DenseModelHyperParams.enc_n_layer, + DenseModelHyperParams.n_head, + DenseModelHyperParams.d_key, + DenseModelHyperParams.d_value, + DenseModelHyperParams.d_model, + DenseModelHyperParams.d_inner_hid, + DenseModelHyperParams.prepostprocess_dropout, + DenseModelHyperParams.attention_dropout, + DenseModelHyperParams.relu_dropout, + DenseModelHyperParams.preprocess_cmd, + DenseModelHyperParams.postprocess_cmd, + DenseModelHyperParams.weight_sharing, + DenseModelHyperParams.embedding_sharing, + TrainTaskConfig.label_smooth_eps, + use_py_reader=args.use_py_reader, + is_test=False, + params_type="fixed", + all_data_inputs=all_data_inputs) + logits_list.append(fixed_dense_logits) + + with fluid.unique_name.guard("fixed_relative_position"): + fixed_relative_sum_cost, fixed_relative_avg_cost, fixed_relative_token_num, fixed_relative_logits, fixed_relative_xent, fixed_relative_loss, fixed_relative_label, _ = relative_transformer( + ModelHyperParams.src_vocab_size, + ModelHyperParams.trg_vocab_size, + ModelHyperParams.max_length + 50, + ModelHyperParams.n_layer, + ModelHyperParams.n_head, + ModelHyperParams.d_key, + ModelHyperParams.d_value, + ModelHyperParams.d_model, + ModelHyperParams.d_inner_hid, + ModelHyperParams.prepostprocess_dropout, + ModelHyperParams.attention_dropout, + ModelHyperParams.relu_dropout, + ModelHyperParams.preprocess_cmd, + ModelHyperParams.postprocess_cmd, + ModelHyperParams.weight_sharing, + ModelHyperParams.embedding_sharing, + TrainTaskConfig.label_smooth_eps, + use_py_reader=args.use_py_reader, + is_test=False, + params_type="fixed", + all_data_inputs=all_data_inputs) + logits_list.append(fixed_relative_logits) + + # normalizing + confidence = 1.0 - TrainTaskConfig.label_smooth_eps + low_confidence = (1.0 - confidence) / (ModelHyperParams.trg_vocab_size - 1) + normalizing = -(confidence * math.log(confidence) + (ModelHyperParams.trg_vocab_size - 1) * + low_confidence * math.log(low_confidence + 1e-20)) + + batch_size = layers.shape(new_forward_logits)[0] + seq_length = layers.shape(new_forward_logits)[1] + trg_voc_size = layers.shape(new_forward_logits)[2] + + # ensemble + teacher_logits = logits_list[0] + for index in xrange(1, len(logits_list)): + teacher_logits += logits_list[index] + + teacher_logits = teacher_logits / len(logits_list) + + # new_target + new_target = layers.softmax(teacher_logits) + new_target.stop_gradient = True + + # agent_1: forward + fdistill_xent = layers.softmax_with_cross_entropy( + logits=new_forward_logits, + label=new_target, + soft_label=True) + fdistill_xent -= normalizing + fdistill_loss = layers.reduce_sum(fdistill_xent * new_forward_non_zeros) / new_forward_token_num + + # agent_2: relative + rdistill_xent = layers.softmax_with_cross_entropy( + logits=new_relative_position_logits, + label=new_target, + soft_label=True) + rdistill_xent -= normalizing + rdistill_loss = layers.reduce_sum(rdistill_xent * new_forward_non_zeros) / new_forward_token_num + + # agent_3: dense + ddistill_xent = layers.softmax_with_cross_entropy( + logits=new_dense_logits, + label=new_target, + soft_label=True) + ddistill_xent -= normalizing + ddistill_loss = layers.reduce_sum(ddistill_xent * new_forward_non_zeros) / new_forward_token_num + + + teacher_loss = fixed_forward_avg_cost + fixed_dense_avg_cost + fixed_relative_avg_cost + avg_cost = TrainTaskConfig.beta * new_forward_avg_cost + (1.0 - TrainTaskConfig.beta) * fdistill_loss + TrainTaskConfig.beta * new_relative_position_avg_cost + (1.0 - TrainTaskConfig.beta) * 
rdistill_loss + TrainTaskConfig.beta * new_dense_avg_cost + (1.0 - TrainTaskConfig.beta) * ddistill_loss + teacher_loss + + + avg_cost.persistable = True + teacher_loss.persistable = True + + optimizer = None + if args.sync: + lr_decay = fluid.layers.learning_rate_scheduler.noam_decay( + ModelHyperParams.d_model, TrainTaskConfig.warmup_steps) + logging.info("before adam") + + with fluid.default_main_program()._lr_schedule_guard(): + learning_rate = lr_decay * TrainTaskConfig.learning_rate + optimizer = fluid.optimizer.Adam( + learning_rate=learning_rate, + beta1=TrainTaskConfig.beta1, + beta2=TrainTaskConfig.beta2, + epsilon=TrainTaskConfig.eps) + else: + optimizer = fluid.optimizer.SGD(0.003) + if args.use_fp16: + #black_varnames={"src_slf_attn_bias", "trg_slf_attn_bias", "trg_src_attn_bias", "dense_src_slf_attn_bias", "dense_trg_slf_attn_bias", "dense_trg_src_attn_bias"} + #amp_lists=fluid.contrib.mixed_precision.AutoMixedPrecisionLists(custom_black_varnames=black_varnames, + # custom_black_list=["dropout"]) + #optimizer = fluid.contrib.mixed_precision.decorate(optimizer, amp_lists=amp_lists, + optimizer = fluid.contrib.mixed_precision.decorate(optimizer, + init_loss_scaling=32768, incr_every_n_steps=2000, + use_dynamic_loss_scaling=True) + + optimizer.minimize(avg_cost) + + loss_scaling=None + scaled_cost=None + if args.use_fp16: + scaled_cost = optimizer.get_scaled_loss() + loss_scaling = optimizer.get_loss_scaling() + + if args.local: + logging.info("local start_up:") + train_loop(exe, train_prog, startup_prog, args, dev_count, avg_cost, teacher_loss, new_relative_position_sum_cost, new_relative_position_avg_cost, + new_relative_position_token_num, pyreader, place) + else: + trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0")) + worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS") + current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT") + worker_endpoints = worker_endpoints_env.split(",") + trainers_num = len(worker_endpoints) + + logging.info("worker_endpoints:{} trainers_num:{} current_endpoint:{} \ + trainer_id:{}".format(worker_endpoints, trainers_num, + current_endpoint, trainer_id)) + + config = fluid.DistributeTranspilerConfig() + config.mode = "nccl2" + if args.nccl_comm_num > 1: + config.nccl_comm_num = args.nccl_comm_num + if args.use_hierarchical_allreduce and trainers_num > args.hierarchical_allreduce_inter_nranks: + logging.info("use_hierarchical_allreduce") + config.use_hierarchical_allreduce=args.use_hierarchical_allreduce + + config.hierarchical_allreduce_inter_nranks=8 + if config.hierarchical_allreduce_inter_nranks > 1: + config.hierarchical_allreduce_inter_nranks=args.hierarchical_allreduce_inter_nranks + + assert config.hierarchical_allreduce_inter_nranks > 1 + assert trainers_num % config.hierarchical_allreduce_inter_nranks == 0 + + config.hierarchical_allreduce_exter_nranks = \ + trainers_num / config.hierarchical_allreduce_inter_nranks + + t = fluid.DistributeTranspiler(config=config) + t.transpile( + trainer_id, trainers=worker_endpoints_env, + current_endpoint=current_endpoint, program=train_prog, + startup_program=startup_prog) + + train_loop(exe, train_prog, startup_prog, args, dev_count, avg_cost, teacher_loss, + new_relative_position_sum_cost, new_relative_position_avg_cost, new_relative_position_token_num, pyreader, place, trainers_num, trainer_id, scaled_cost=scaled_cost, loss_scaling=loss_scaling) + + +if __name__ == "__main__": + LOG_FORMAT = "[%(asctime)s %(levelname)s %(filename)s:%(lineno)d] %(message)s" + logging.basicConfig( + 
stream=sys.stdout, level=logging.DEBUG, format=LOG_FORMAT) + logging.getLogger().setLevel(logging.INFO) + + args = parse_args() + train(args) diff --git a/PaddleNLP/Research/EMNLP2019-MAL/train.sh b/PaddleNLP/Research/EMNLP2019-MAL/train.sh new file mode 100755 index 0000000000000000000000000000000000000000..13e4de5fb33950804ec842a69aecd8ec445a57f4 --- /dev/null +++ b/PaddleNLP/Research/EMNLP2019-MAL/train.sh @@ -0,0 +1,70 @@ +#!/bin/bash +source ./env/env.sh +source ./env/utils.sh +source ./env/cloud_job_conf.conf + +iplist=$1 +#iplist=`echo $nodelist | xargs | sed 's/ /,/g'` + +if [ ! -d log ] +then + mkdir log +fi + +export GLOG_vmodule=fuse_all_reduce_op_pass=10,alloc_continuous_space_for_grad_pass=10 + +if [[ ${FUSE} == "1" ]]; then + export FLAGS_fuse_parameter_memory_size=64 #MB +fi + +set -ux +check_iplist + +distributed_args="" +if [[ ${NUM_CARDS} == "1" ]]; then + distributed_args="--selected_gpus 0" +fi + +node_ips=${PADDLE_TRAINERS} + +distributed_args="--node_ips ${PADDLE_TRAINERS} --node_id ${PADDLE_TRAINER_ID} --current_node_ip ${POD_IP} --nproc_per_node 8 --selected_gpus 0,1,2,3,4,5,6,7" +export FLAGS_eager_delete_tensor_gb=0.0 +export FLAGS_fraction_of_gpu_memory_to_use=0.1 +export NCCL_IB_GID_INDEX=3 +export NCCL_IB_RETRY_CNT=10 +export FLAGS_sync_nccl_allreduce=0 + +BATCH_SIZE=1250 +python -u ./src/launch.py ${distributed_args} \ + ./src/train.py \ + --src_vocab_size 37007 \ + --tgt_vocab_size 37007 \ + --train_file_pattern 'data/translate-train-*' \ + --token_delimiter ' ' \ + --batch_size ${BATCH_SIZE} \ + --use_py_reader True \ + --use_delay_load True \ + --nccl_comm_num ${NCCL_COMM_NUM} \ + --use_hierarchical_allreduce ${USE_HIERARCHICAL_ALLREDUCE} \ + --fetch_steps 50 \ + --fuse ${FUSE} \ + --val_file_pattern 'testset/testfile' \ + --infer_batch_size 32 \ + --decode_alpha 0.3 \ + --beam_size 4 \ + --use_fp16 True \ + learning_rate 2.0 \ + warmup_steps 8000 \ + beta2 0.997 \ + d_model 1024 \ + d_inner_hid 4096 \ + n_head 16 \ + prepostprocess_dropout 0.3 \ + attention_dropout 0.1 \ + relu_dropout 0.1 \ + embedding_sharing True \ + pass_num 100 \ + max_length 256 \ + save_freq 5000 \ + model_dir 'output' + diff --git a/PaddleNLP/dialogue_domain_classification/README.MD b/PaddleNLP/dialogue_domain_classification/README.MD new file mode 100755 index 0000000000000000000000000000000000000000..c5753194a8356e3f7f6c1a32cc9c5fed2206d656 --- /dev/null +++ b/PaddleNLP/dialogue_domain_classification/README.MD @@ -0,0 +1,223 @@ +# Paddle NLP(对话领域分类器) + + + +## 模型简介 + +​ 在对话业务场景中,完整的对话能力往往由多个领域的语义解析bot组成并提供,对话领域分类器能够根据业务场景需求,将流量分发到对应领域的语义解析bot。对话领域分类器不但能够节省机器资源,流量只分发到所属领域的bot,避免了无效流量调用bot; 同时,对话领域分类器的精准分发,过滤了无效的解析结果,也使得最终的解析结果更加精确。 + + + + +## 快速开始 +**目前模型要求使用PaddlePaddle 1.6及以上版本或适当的develop版本运行。** + +### 1. Paddle版本安装 + +本项目训练模块兼容Python2.7.x以及Python3.7.x, 依赖PaddlePaddle 1.6版本以及CentOS系统环境, 安装请参考官网 [快速安装](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/install/index_cn.html)。 + +注意:该模型同时支持cpu和gpu训练和预测,用户可以根据自身需求,选择安装对应的paddlepaddle-gpu或paddlepaddle版本。 + +> Warning: GPU 和 CPU 版本的 PaddlePaddle 分别是 paddlepaddle-gpu 和 paddlepaddle,请安装时注意区别。 + + +### 2. 代码安装 + +克隆工具集代码库到本地 + +```shell +git clone https://github.com/PaddlePaddle/models.git +cd models/PaddleNLP/dialogue_domain_classification +``` + + + +### 3. 
数据准备 + +本项目提供了部分涉及的数据集,通过运行以下指令可以快速下载。运行指令后会生成`data/input`目录,`data/input`目录下有训练集数据(train.txt)、开发集数据(eval.txt)、测试集数据(test.txt),对应词典(char.dict),领域词表(domain.dict) 以及模型配置文件(model.conf) + +```shell +mkdir -p data/input +wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/dialogue_domain_classification-dataset-1.0.0.tar.gz +tar -zxvf dialogue_domain_classification-dataset-1.0.0.tar.gz -C ./data/input +``` + +**数据格式说明** + + +1. 数据格式 + +输入和输出的数据格式相同。 + +数据格式为: query \t domain_1 \002 domain_2 (多个标签, 使用\002分隔开) + +指定输入数据的文件夹: 参数`data_dir` + +训练文件: train.txt +验证集: eval.txt +测试集: test.txt + +指定输出结果的文件夹: 参数`save_dir` + +测试集预测结果为: test.rst + +2. 模型配置 + +参数`config_path` 指定模型配置文件地址, 格式如下: +```shell +[model] +emb_dim = 128 +win_sizes = [5, 5, 5] +hid_dim = 160 +hid_dim2 = 160 +``` + + + +### 4. 模型下载 + +针对于"打电话, 天气, 火车票预订, 机票预订, 音乐"这5个领域数据,我们开源了一个使用CharCNN训练好的对话领域分类模型,使用以下指令可以对模型进行下载。 + +```model +mkdir -p model +wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/dialogue_domain_classification-model-1.0.0.tar.gz +tar -zxvf dialogue_domain_classification-model-1.0.0.tar.gz -C ./model +``` + +### 5. 脚本参数说明 + +通过执行如下指令,可以查看入口脚本文件所需要的参数以及说明,指令如下: + `export PATH="/path/to/your/python:$PATH"; python run_classifier.py --help ` + +```shell +1. 模型参数 +--init_checkpoint # 指定热启动加载的checkpoint模型, Default: None. +--checkpoints # 指定保存checkpoints的地址,Default: ./checkpoints. +--config_path # 指定模型配置文件,Default: ./data/input/model.conf. +--build_dict # 是否根据训练数据建立char字典和domain字典,Default: False + +2. 训练参数 +--epoch # 训练的轮次,Default: 100. +--learning_rate # 学习率, Default: 0.1. +--save_steps # 保存模型的频率,每x个steps保存一次模型,Default: 1000. +--validation_steps # 模型评估的频率,每x个steps在验证集上验证模型的效果,Default: 100. +--random_seed # 随机数种子,Default: 7 +--threshold # 领域置信度阈值,当置信度超过阈值,预测结果出对应的领域标签。 Default: 0.1. +--cpu_num # 当使用cpu训练时的线程数(当use_cuda=False才起作用)。 Default: 3. + +3. logging +--skip_steps # 训练时打印loss的频率,每x个steps打印一次loss,Default: 10. + +4. 数据 +--data_dir # 数据集的目录,其中train.txt为训练集,eval.txt为验证集,test.txt为测试集。Default: ./data/input/ +--save_dir # 模型产出的目录, Default: ./data/output/ +--max_seq_len # 最大句子长度,超过会进行截断,Default: 50. +--batch_size # 批大小, Default: 64. + +5. 脚本运行配置 +--use_cuda # 是否使用GPU,Default: False +--do_train # 是否进行训练,Default: True +--do_eval # 是否进行验证,Default: True +--do_test # 是否进行测试,Default: True +``` + + + +### 6. 模型训练 + +用户可以基于示例数据构建训练集和开发集,可以运行下面的命令,进行模型训练和开发集验证。 + +``` +sh run.sh train +``` + +> Warning1: 可以参考`run.sh`脚本以及第5节的**脚本参数说明**, 对默认参数进行修改。 + +> Warning2: CPU多线程以及GPU多卡训练时,每个step训练分别给每一个CPU核或者GPU卡提供一个batch数据,实际上的batch_size为单核的线程数倍或者单卡的多卡数倍。 + + +### 7. 模型评估 + +基于已有的预训练模型和数据,可以运行下面的命令进行测试,查看训练的模型在验证集(test.tsv)上的评测结果 + +``` +sh run.sh eval +``` + +> Warning: 可以参考`run.sh`脚本以及第5节的**脚本参数说明**, 对默认参数进行修改。 + +### 8. 模型推断 + +``` +sh run.sh test +``` +> Warning: 可以参考`run.sh`脚本以及第5节的**脚本参数说明**, 对默认参数进行修改。 + + + +## 进阶使用 + + + +### 1. 任务定义与建模 + +在真实复杂业务场景中,语义解析服务往往由多个不同领域的语义解析bot组成,从而同时满足多个场景下的语义解析需求。例如:同时能查天气、播放音乐、查询股票等多种功能的对话bot。 + +与此同时用户输入的query句子形式各样,而且存在很多歧义。比如用户输入的query为`“下雨了”`, 这条query的语义解析既属于`天气`领域, 又属于`音乐`领域(薛之谦的歌曲)。针对这种多歧义的情况,业务上常见的方法是将query进行"广播",即同时请求每一个语义解析bot,再对返回的解析结果进行粗排,得到最终的语义解析结果。 + +对话领域分类器能够处理同一query同时命中多个领域的情况,根据对话领域分类器的解析结果,可以对query进行有效的分发到各个领域的bot。对话领域分类器对query进行有效的分发,可以避免"广播"式调用带来的资源浪费,大量的节省了机器资源;同时也提高了最终粗排后的语义解析结果的准确率。 + + +对话领域分类模型解决了一个多标签分类(Multilabel Classification)的问题, 将用户输入的文本作为模型的输入,分类器会预测出输入文本对应的每一个标签的置信度,从而得到多标签结果,并依次对query分发。 + + + +### 2. 
模型原理介绍 + +对话领域分类器的大体结构如下图所示:用户输入通过`输入层`进行向量化后,作为`分类器模型`的输入;`分类器`最终输出一个多标签结果`[label_1, label_2, ..., label_n]`,其维度为`n`。(训练数据定义的训练领域共有`n-1`个,每一个领域对应一个标签,另有一个标签表示背景,即不属于任何一个训练领域。) + +其中每个`label_i`的概率在0到1之间,且所有label的概率之和不恒为1,它表示当前输入属于第`i`个领域的概率。最后可以人为地为每一个label的概率设置阈值,从而得到多标签分类的结果。 + +![net](./imgs/nets.png) + +**评估指标说明** + +传统的二分类任务中,通常使用准确率、召回率和F1值对模型效果进行评估。 + +
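+作为参考,上文提到的准确率(precision)、召回率(recall)和F1值在二分类任务中的标准定义整理如下(仅为文字形式的回顾,与上文含义一致):
+
+```
+precision = TP / (TP + FP)
+recall    = TP / (TP + FN)
+F1        = 2 * precision * recall / (precision + recall)
+```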

+ +
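+针对上文"为每一个label的概率设置阈值"的做法,下面给出一段最小的 NumPy 示意代码(非项目源码,仅用于说明;其判定逻辑与后文 `run_classifier.py` 生成 `test.rst` 时一致:标签下标0为背景标签 `SYS_OTHER`,阈值取脚本默认值0.1,示例概率为虚构数据):
+
+```python
+import numpy as np
+
+def multi_label_decision(probs, threshold=0.1):
+    """对每个label的概率做阈值判定, 得到多标签预测结果(下标0为SYS_OTHER)。"""
+    probs = np.asarray(probs, dtype="float32")
+    pred = np.zeros_like(probs, dtype="int64")
+    pred[1:] = (probs[1:] > threshold).astype("int64")  # 每个领域独立判定
+    if pred.sum() == 0:                                  # 没有任何领域命中, 回退到背景标签
+        pred[0] = 1
+    return pred
+
+# probs 依次对应 [SYS_OTHER, 领域1, 领域2, 领域3, 领域4, 领域5]
+print(multi_label_decision([0.02, 0.85, 0.40, 0.01, 0.00, 0.03]))  # 输出 [0 1 1 0 0 0]
+```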

+ + +**该项目中对于正负样本的定义** + +在多标签分类任务中,我们将样本分为正样本(Pos)与负样本(Neg)两种。如果样本包含了领域标签,表示需要分发到至少1个bot进行解析,则为正样本;反之,样本不包含任何领域标签流量,表示不需要分发,则为负样本。 + +我们的对话领域分类器在保证了原有解析效果的基础之上,有效的降低机器资源的消耗。即在保证正样本召回率的情况下,尽可能提高准确率。 + + +**该项目中样本预测正确的定义** +1. 如果`正确结果`不包含领域标签, 则`预测结果`也不包含领域标签时,预测正确。 +2. 如果`正确结果`包含领域标签, 则`预测结果`包含`正确结果`的所有领域标签时(即`预测结果`的标签是`正确结果`的超集,预测正确。 + + + + +### 3. 代码结构说明 + +``` +├── run_classifier.py:该项目的主函数,封装包括训练、预测、评估的部分 +├── nets.py : 定义了模型所使用的网络结构 +├── utils.py:定义了其他常用的功能函数 +├── run.sh: 启动主函数的demo脚本 +``` + + +### 4. 如何组建自己的模型 +可以根据自己的需求,组建自定义的模型,具体方法如下所示: + +1. 定义自己的对话领域模型,可以在 ./nets.py 中添加自己的网络结构。 + +2. 定义自己的领域对话数据,可以参考**第3节数据准备**的数据格式,准备自己的训练数据。 + +3. 模型训练、评估、预测的逻辑,需要在[run.sh](./run.sh)中修改对应的模型路径、数据路径和词典路径等参数,具体说明请参照**第5节的脚本参数说明**. diff --git a/PaddleNLP/dialogue_domain_classification/imgs/function.png b/PaddleNLP/dialogue_domain_classification/imgs/function.png new file mode 100755 index 0000000000000000000000000000000000000000..40a236d2dc681ec79ecad68303fbc4ff081db56b Binary files /dev/null and b/PaddleNLP/dialogue_domain_classification/imgs/function.png differ diff --git a/PaddleNLP/dialogue_domain_classification/imgs/nets.png b/PaddleNLP/dialogue_domain_classification/imgs/nets.png new file mode 100755 index 0000000000000000000000000000000000000000..812003b1fc4b181b33395f18902742929e6f522d Binary files /dev/null and b/PaddleNLP/dialogue_domain_classification/imgs/nets.png differ diff --git a/PaddleNLP/dialogue_domain_classification/nets.py b/PaddleNLP/dialogue_domain_classification/nets.py new file mode 100755 index 0000000000000000000000000000000000000000..77912b3b0cda2fcda2f6e264478b909e3472ff77 --- /dev/null +++ b/PaddleNLP/dialogue_domain_classification/nets.py @@ -0,0 +1,96 @@ +""" +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" + +import paddle.fluid as fluid +import paddle + +def textcnn_net_multi_label(data, + label, + dict_dim, + emb_dim=128, + hid_dim=128, + hid_dim2=96, + class_dim=2, + win_sizes=None, + is_infer=False, + threshold=0.5, + max_seq_len=100): + """ + multi labels Textcnn_net + """ + init_bound = 0.1 + initializer = fluid.initializer.Uniform(low=-init_bound, high=init_bound) + #gradient_clip = fluid.clip.GradientClipByNorm(10.0) + gradient_clip = None + regularizer = fluid.regularizer.L2DecayRegularizer( + regularization_coeff=1e-4) + seg_param_attrs = fluid.ParamAttr(name="seg_weight", + learning_rate=640.0, + initializer=initializer, + gradient_clip=gradient_clip, + trainable=True) + fc_param_attrs_1 = fluid.ParamAttr(name="fc_weight_1", + learning_rate=1.0, + regularizer=regularizer, + initializer=initializer, + gradient_clip=gradient_clip, + trainable=True) + fc_param_attrs_2 = fluid.ParamAttr(name="fc_weight_2", + learning_rate=1.0, + regularizer=regularizer, + initializer=initializer, + gradient_clip=gradient_clip, + trainable=True) + + if win_sizes is None: + win_sizes = [1, 2, 3] + + # embedding layer + + emb = fluid.embedding(input=data, size=[dict_dim, emb_dim], param_attr=seg_param_attrs) + + # convolution layer + convs = [] + for cnt, win_size in enumerate(win_sizes): + emb = fluid.layers.reshape(x=emb, shape=[-1, 1, max_seq_len, emb_dim], inplace=True) + filter_size = (win_size, emb_dim) + cnn_param_attrs = fluid.ParamAttr(name="cnn_weight" + str(cnt), + learning_rate=1.0, + regularizer=regularizer, + initializer=initializer, + trainable=True) + conv_out = fluid.layers.conv2d(input=emb, num_filters=hid_dim, filter_size=filter_size, act="relu", \ + param_attr=cnn_param_attrs) + pool_out = fluid.layers.pool2d( + input=conv_out, + pool_type='max', + pool_stride=1, + global_pooling=True) + convs.append(pool_out) + convs_out = fluid.layers.concat(input=convs, axis=1) + + # full connect layer + fc_1 = fluid.layers.fc(input=[pool_out], size=hid_dim2, act=None, param_attr=fc_param_attrs_1) + # sigmoid layer + fc_2 = fluid.layers.fc(input=[fc_1], size=class_dim, act=None, param_attr=fc_param_attrs_2) + prediction = fluid.layers.sigmoid(fc_2) + if is_infer: + return prediction + + cost = fluid.layers.sigmoid_cross_entropy_with_logits(x=fc_2, label=label) + avg_cost = fluid.layers.mean(x=cost) + pred_label = fluid.layers.ceil(fluid.layers.thresholded_relu(prediction, threshold)) + return [avg_cost, prediction, pred_label, label] diff --git a/PaddleNLP/dialogue_domain_classification/run.sh b/PaddleNLP/dialogue_domain_classification/run.sh new file mode 100755 index 0000000000000000000000000000000000000000..81efc2b64fb417435ff1bee10d52ba10f1138e5e --- /dev/null +++ b/PaddleNLP/dialogue_domain_classification/run.sh @@ -0,0 +1,118 @@ +export PATH="/home/guohongjie/tmp/paddle/paddle_release_home/python/bin/:$PATH" + + + + + +# CPU setting +:< 0: + pred_pos_num += 1 + if len(actual_labels) > 0: + pos_num += 1 + if set(actual_labels).issubset(set(pred_labels)): + tp += 1 + true_cnt += 1 + elif len(pred_labels) == 0 and len(actual_labels) == 0: + true_cnt += 1 + try: + precision = tp * 1.0 / pred_pos_num + recall = tp * 1.0 / pos_num + f1 = 2 * precision * recall / (recall + precision) + except Exception as e: + precision = 0 + recall = 0 + f1 = 0 + acc = true_cnt * 1.0 / total + logger.info("tp, pred_pos_num, pos_num, total") + logger.info("%d, %d, %d, %d" % (tp, pred_pos_num, pos_num, total)) + logger.info("%s result is : precision is %f, recall is %f, f1_score is %f, acc is %f" % (eval_phase, 
precision, \ + recall, f1, acc)) + + +def train(args, train_exe, build_res, place): + """[train the net] + + Arguments: + args {[type]} -- [description] + train_exe {[type]} -- [description] + compiled_prog{[type]} -- [description] + build_res {[type]} -- [description] + place {[type]} -- [description] + """ + global DEV_COUNT + compiled_prog = build_res["compiled_prog"] + cost = build_res["cost"] + prediction = build_res["prediction"] + pred_label = build_res["pred_label"] + label = build_res["label"] + fetch_list = [cost.name, prediction.name, pred_label.name, label.name] + train_pyreader = build_res["train_pyreader"] + train_prog = build_res["train_prog"] + steps = 0 + time_begin = time.time() + test_exe = train_exe + logger.info("Begin training") + for i in range(args.epoch): + try: + for data in train_pyreader(): + avg_cost_np, avg_pred_np, pred_label, label = train_exe.run(feed=data, program=compiled_prog, \ + fetch_list=fetch_list) + steps += 1 + if steps % int(args.skip_steps) == 0: + time_end = time.time() + used_time = time_end - time_begin + get_score(pred_label, label, eval_phase = "Train") + logger.info('loss is {}'.format(avg_cost_np)) + logger.info("epoch: %d, step: %d, speed: %f steps/s" % (i, steps, args.skip_steps / used_time)) + time_begin = time.time() + if steps % args.save_steps == 0: + save_path = os.path.join(args.checkpoints, + "step_" + str(steps)) + fluid.io.save_persistables(train_exe, save_path, train_prog) + logger.info("[save]step %d : save at %s" % (steps, save_path)) + if steps % args.validation_steps == 0: + if args.do_eval: + evaluate(args, test_exe, build_res, "eval") + if args.do_test: + evaluate(args, test_exe, build_res, "test") + except Exception as e: + logger.exception(str(e)) + logger.error("Train error : %s" % str(e)) + exit(1) + save_path = os.path.join(args.checkpoints, "step_" + str(steps)) + fluid.io.save_persistables(train_exe, save_path, train_prog) + logger.info("[save]step %d : save at %s" % (steps, save_path)) + + +def evaluate(args, test_exe, build_res, eval_phase, save_result=False, id2intent=None): + """[evaluate on dev/test dataset] + + Arguments: + args {[type]} -- [description] + test_exe {[type]} -- [description] + test_prog {[type]} -- [description] + build_res {[type]} -- [description] + place {[type]} -- [description] + eval_phase {[type]} -- [description] + + Keyword Arguments: + threshold {float} -- [description] (default: {0.5}) + save_result {bool} -- [description] (default: {False}) + id2intent {[type]} -- [description] (default: {None}) + """ + place = build_res["test_place"] + threshold = args.threshold + cost = build_res["cost"] + prediction = build_res["prediction"] + pred_label = build_res["pred_label"] + label = build_res["label"] + fetch_list = [cost.name, prediction.name, pred_label.name, label.name] + total_cost, total_acc, pred_prob_list, pred_label_list, label_list = [], [], [], [], [] + if eval_phase == "eval": + test_prog = build_res["eval_compiled_prog"] + test_pyreader = build_res["eval_pyreader"] + elif eval_phase == "test": + test_prog = build_res["test_compiled_prog"] + test_pyreader = build_res["test_pyreader"] + else: + exit(1) + logger.info("-----------------------------------------------------------") + for data in test_pyreader(): + avg_cost_np, avg_pred_np, pred_label, label= test_exe.run(program=test_prog, fetch_list=fetch_list, feed=data, \ + return_numpy=True) + total_cost.append(avg_cost_np) + pred_prob_list.extend(avg_pred_np) + pred_label_list.extend(pred_label) + label_list.extend(label) + + 
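+    # pred_prob_list holds the per-label sigmoid probabilities, pred_label_list the
+    # thresholded 0/1 predictions and label_list the multi-hot ground-truth labels,
+    # accumulated over the whole eval/test set for the save_result branch and get_score() below.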
if save_result: + logger.info("save result at : %s" % args.save_dir + "/" + eval_phase + ".rst") + save_dir = args.save_dir + if not os.path.exists(save_dir): + logger.warning("save dir not exists, and create it") + os.makedirs(save_dir) + fin = codecs.open(os.path.join(args.data_dir, eval_phase + ".txt"), "r", encoding="utf8") + fout = codecs.open(args.save_dir + "/" + eval_phase + ".rst", "w", encoding="utf8") + for line in pred_prob_list: + query = fin.readline().rsplit("\t", 1)[0] + res = [] + for i in range(1, len(line)): + if line[i] > threshold: + #res.append(id2intent[i]+":"+str(line[i])) + res.append(id2intent[i]) + if len(res) == 0: + res.append(id2intent[0]) + fout.write("%s\t%s\n" % (query, "\2".join(sorted(res)))) + fout.close() + fin.close() + + logger.info("[%s] result: " % eval_phase) + get_score(pred_label_list, label_list, eval_phase) + logger.info('loss is {}'.format(sum(total_cost) * 1.0 / len(total_cost))) + logger.info("-----------------------------------------------------------") + + + +def create_net(args, flow_data, class_dim, dict_dim, place, model_name="textcnn_net", is_infer=False): + """[create network and pyreader] + + Arguments: + flow_data {[type]} -- [description] + class_dim {[type]} -- [description] + dict_dim {[type]} -- [description] + place {[type]} -- [description] + + Keyword Arguments: + model_name {str} -- [description] (default: {"textcnn_net"}) + is_infer {bool} -- [description] (default: {False}) + + Returns: + [type] -- [description] + """ + if model_name == "textcnn_net": + model = textcnn_net_multi_label + else: + return + char_list = fluid.data(name="char", shape=[None, args.max_seq_len, 1], dtype="int64", lod_level=0) + label = fluid.data(name="label", shape=[None, class_dim], dtype="float32", lod_level=0) # label data + reader = fluid.io.PyReader(feed_list=[char_list, label], capacity=args.batch_size * 10, iterable=True, \ + return_list=False) + output = model(char_list, label, dict_dim, + emb_dim=flow_data["model"]["emb_dim"], + hid_dim=flow_data["model"]["hid_dim"], + hid_dim2=flow_data["model"]["hid_dim2"], + class_dim=class_dim, + win_sizes=flow_data["model"]["win_sizes"], + is_infer=is_infer, + threshold=args.threshold, + max_seq_len=args.max_seq_len) + if is_infer: + prediction = output + return [reader, prediction] + else: + avg_cost, prediction, pred_label, label = output[0], output[1], output[2], output[3] + return [reader, avg_cost, prediction, pred_label, label] + + +def build_data_reader(args, char_dict, intent_dict): + """[decorate samples for pyreader] + + Arguments: + args {[type]} -- [description] + char_dict {[type]} -- [description] + intent_dict {[type]} -- [description] + + Returns: + [type] -- [description] + """ + reader_res = {} + if args.do_train: + train_processor = DataReader(char_dict, intent_dict, args.max_seq_len) + train_data_generator = train_processor.prepare_data( + data_path=args.data_dir + "train.txt", + batch_size=args.batch_size, + mode='train') + reader_res["train_data_generator"] = train_data_generator + num_train_examples = train_processor._get_num_examples() + logger.info("Num train examples: %d" % num_train_examples) + logger.info("Num train steps: %d" % (math.ceil(num_train_examples * 1.0 / args.batch_size) * \ + args.epoch // DEV_COUNT)) + if math.ceil(num_train_examples * 1.0 / args.batch_size) // DEV_COUNT <= 0: + logger.error("Num of train steps is less than 0 or equals to 0, exit") + exit(1) + if args.do_eval: + eval_processor = DataReader(char_dict, intent_dict, args.max_seq_len) + 
eval_data_generator = eval_processor.prepare_data( + data_path=args.data_dir + "eval.txt", + batch_size=args.batch_size, + mode='eval') + reader_res["eval_data_generator"] = eval_data_generator + num_eval_examples = eval_processor._get_num_examples() + logger.info("Num eval examples: %d" % num_eval_examples) + if args.do_test: + test_processor = DataReader(char_dict, intent_dict, args.max_seq_len) + test_data_generator = test_processor.prepare_data( + data_path=args.data_dir + "test.txt", + batch_size=args.batch_size, + mode='test') + reader_res["test_data_generator"] = test_data_generator + return reader_res + + +def build_graph(args, model_config, num_labels, dict_dim, place, test_place, reader_res): + """[build paddle graph] + + Arguments: + args {[type]} -- [description] + model_config {[type]} -- [description] + num_labels {[type]} -- [description] + dict_dim {[type]} -- [description] + place {[type]} -- [description] + reader_res {[type]} -- [description] + + Returns: + [type] -- [description] + """ + res = {} + cost, prediction, pred_label, label = None, None, None, None + train_prog = fluid.default_main_program() + + startup_prog = fluid.default_startup_program() + eval_prog = train_prog.clone(for_test=True) + test_prog = train_prog.clone(for_test=True) + train_prog.random_seed = args.random_seed + startup_prog.random_seed = args.random_seed + if args.do_train: + with fluid.program_guard(train_prog, startup_prog): + with fluid.unique_name.guard(): + train_pyreader, cost, prediction, pred_label, label = create_net(args, model_config, num_labels, \ + dict_dim, place, model_name="textcnn_net") + train_pyreader.decorate_sample_list_generator(reader_res['train_data_generator'], places=place) + res["train_pyreader"] = train_pyreader + sgd_optimizer = fluid.optimizer.SGD(learning_rate=fluid.layers.exponential_decay( + learning_rate=args.learning_rate, decay_steps=1000, decay_rate=0.5, staircase=True)) + sgd_optimizer.minimize(cost) + if args.do_eval: + with fluid.program_guard(eval_prog, startup_prog): + with fluid.unique_name.guard(): + eval_pyreader, cost, prediction, pred_label, label = create_net(args, model_config, num_labels, \ + dict_dim, test_place, model_name="textcnn_net") + eval_pyreader.decorate_sample_list_generator(reader_res['eval_data_generator'], places=test_place) + res["eval_pyreader"] = eval_pyreader + if args.do_test: + with fluid.program_guard(test_prog, startup_prog): + with fluid.unique_name.guard(): + test_pyreader, cost, prediction, pred_label, label = create_net(args, model_config, num_labels, \ + dict_dim, test_place, model_name="textcnn_net") + test_pyreader.decorate_sample_list_generator(reader_res['test_data_generator'], places=test_place) + res["test_pyreader"] = test_pyreader + res["cost"] = cost + res["prediction"] = prediction + res["label"] = label + res["pred_label"] = pred_label + res["train_prog"] =train_prog + res["eval_prog"] = eval_prog + res["test_prog"] = test_prog + + + return res + + +def main(args): + """ + Main Function + """ + global DEV_COUNT + startup_prog = fluid.default_startup_program() + random.seed(args.random_seed) + model_config = ConfigReader.read_conf(args.config_path) + if args.use_cuda: + test_place = fluid.cuda_places(0) + place = fluid.cuda_places() + DEV_COUNT = fluid.core.get_cuda_device_count() + else: + test_place = fluid.cpu_places(1) + os.environ['CPU_NUM'] = str(args.cpu_num) + place = fluid.cpu_places() + DEV_COUNT = args.cpu_num + logger.info("Dev Num is %s" % str(DEV_COUNT)) + exe = fluid.Executor(place[0]) + if 
args.do_train and args.build_dict: + DataProcesser.build_dict(args.data_dir + "train.txt", args.data_dir) + # read dict + char_dict = DataProcesser.read_dict(args.data_dir + "char.dict") + dict_dim = len(char_dict) + intent_dict = DataProcesser.read_dict(args.data_dir + "domain.dict") + id2intent = {} + for key, value in intent_dict.items(): + id2intent[int(value)] = key + num_labels = len(intent_dict) + # build model + reader_res = build_data_reader(args, char_dict, intent_dict) + build_res = build_graph(args, model_config, num_labels, dict_dim, place, test_place, reader_res) + build_res["place"] = place + build_res["test_place"] = test_place + if not (args.do_train or args.do_eval or args.do_test): + raise ValueError("For args `do_train`, `do_eval` and `do_test`, at " + "least one of them must be True.") + + exe.run(startup_prog) + if args.init_checkpoint and args.init_checkpoint != "None": + try: + init_checkpoint(exe, args.init_checkpoint, main_program=startup_prog) + logger.info("Load model from %s" % args.init_checkpoint) + except Exception as e: + logger.exception(str(e)) + logger.error("Faild load model from %s [%s]" % (args.init_checkpoint, str(e))) + build_strategy = fluid.compiler.BuildStrategy() + build_strategy.fuse_all_reduce_ops = False + exec_strategy = fluid.ExecutionStrategy() + exec_strategy.num_threads = 1 + # add compiled prog + if args.do_train: + compiled_prog = fluid.compiler.CompiledProgram(build_res["train_prog"]).with_data_parallel( \ + loss_name=build_res["cost"].name, \ + build_strategy=build_strategy, \ + exec_strategy=exec_strategy) + build_res["compiled_prog"] = compiled_prog + if args.do_test: + test_compiled_prog = fluid.compiler.CompiledProgram(build_res["test_prog"]) + build_res["test_compiled_prog"] = test_compiled_prog + if args.do_eval: + eval_compiled_prog = fluid.compiler.CompiledProgram(build_res["eval_prog"]) + build_res["eval_compiled_prog"] = eval_compiled_prog + + if args.do_train: + train(args, exe, build_res, place) + if args.do_eval: + evaluate(args, exe, build_res, "eval", \ + save_result=True, id2intent=id2intent) + if args.do_test: + evaluate(args, exe, build_res, "test",\ + save_result=True, id2intent=id2intent) + + + +if __name__ == "__main__": + logger.info("the paddle version is %s" % paddle.__version__) + check_version('1.6.0') + print_arguments(args) + main(args) diff --git a/PaddleNLP/dialogue_domain_classification/utils.py b/PaddleNLP/dialogue_domain_classification/utils.py new file mode 100755 index 0000000000000000000000000000000000000000..2c839a2ccc605fae3c602f241586fda2838fea15 --- /dev/null +++ b/PaddleNLP/dialogue_domain_classification/utils.py @@ -0,0 +1,354 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +""" +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+""" + +from __future__ import unicode_literals +import sys +import os +import random +import paddle +import logging +import paddle.fluid as fluid +import numpy as np +import collections +import six +import codecs +try: + import configparser as cp +except ImportError: + import ConfigParser as cp + + +random_seed = 7 +logger = logging.getLogger() +format = "%(asctime)s - %(name)s - %(levelname)s -%(filename)s-%(lineno)4d -%(message)s" +# format = "%(levelname)8s: %(asctime)s: %(filename)s:%(lineno)4d %(message)s" +logging.basicConfig(format=format) +logger.setLevel(logging.INFO) +logger = logging.getLogger('Paddle-DDC') + + +def str2bool(v): + """[ because argparse does not support to parse "true, False" as python + boolean directly] + Arguments: + v {[type]} -- [description] + Returns: + [type] -- [description] + """ + return v.lower() in ("true", "t", "1") + + +def to_lodtensor(data, place): + """ + convert ot LODtensor + """ + seq_lens = [len(seq) for seq in data] + cur_len = 0 + lod = [cur_len] + for l in seq_lens: + cur_len += l + lod.append(cur_len) + flattened_data = np.concatenate(data, axis=0).astype("int64") + flattened_data = flattened_data.reshape([len(flattened_data), 1]) + res = fluid.LoDTensor() + res.set(flattened_data, place) + res.set_lod([lod]) + return res + + +class ArgumentGroup(object): + """[ArgumentGroup] + + Arguments: + object {[type]} -- [description] + """ + def __init__(self, parser, title, des): + self._group = parser.add_argument_group(title=title, description=des) + + def add_arg(self, name, type, default, help, **kwargs): + """[add_arg] + + Arguments: + name {[type]} -- [description] + type {[type]} -- [description] + default {[type]} -- [description] + help {[type]} -- [description] + """ + type = str2bool if type == bool else type + self._group.add_argument( + "--" + name, + default=default, + type=type, + help=help + ' Default: %(default)s.', + **kwargs) + + +class DataReader(object): + """[get data generator for dataset] + + Arguments: + object {[type]} -- [description] + + Returns: + [type] -- [description] + """ + def __init__(self, char_vocab, intent_dict, max_len): + self._char_vocab = char_vocab + self._intent_dict = intent_dict + self._oov_id = 0 + self.intent_size = len(intent_dict) + self.all_data = [] + self.max_len = max_len + self.padding_id = 0 + + def _get_num_examples(self): + return len(self.all_data) + + def prepare_data(self, data_path, batch_size, mode): + """ + prepare data + """ + # print word_dict_path + # assert os.path.exists( + # word_dict_path), "The given word dictionary dose not exist." + assert os.path.exists(data_path), "The given data file does not exist." 
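+        # For mode == "train" the reader shuffles samples with a buffer of
+        # batch_size * 100 before batching; for eval/test the original file order
+        # is kept and samples are only grouped into batches.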
+ if mode == "train": + train_reader = fluid.io.batch(paddle.reader.shuffle(self.data_reader(data_path, self.max_len, shuffle=True), + buf_size=batch_size * 100), batch_size) + # train_reader = fluid.io.batch(self.data_reader(data_path), batch_size) + return train_reader + else: + test_reader = fluid.io.batch(self.data_reader(data_path, self.max_len), batch_size) + return test_reader + + def data_reader(self, file_path, max_len, shuffle=False): + """ + Convert query into id list + use fixed voc + """ + + for line in codecs.open(file_path, "r", encoding="utf8"): + line = line.strip() + if isinstance(line, six.binary_type): + line = line.decode("utf8", errors="ignore") + query, intent = line.split("\t") + char_id_list = list(map(lambda x: 0 if x not in self._char_vocab else int(self._char_vocab[x]), \ + list(query))) + if len(char_id_list) < max_len: + char_id_list.extend([self.padding_id] * (max_len - len(char_id_list))) + char_id_list = char_id_list[:max_len] + intent_id_list = [self.padding_id] * self.intent_size + for item in intent.split('\2'): + intent_id_list[int(self._intent_dict[item])] = 1 + self.all_data.append([char_id_list, intent_id_list]) + if shuffle: + random.seed(random_seed) + random.shuffle(self.all_data) + def reader(): + """ + reader + """ + for char_id_list, intent_id_list in self.all_data: + # print char_id_list, intent_id + yield char_id_list, intent_id_list + return reader + + +class DataProcesser(object): + """[file process methods] + + Arguments: + object {[type]} -- [description] + + Returns: + [type] -- [description] + """ + @staticmethod + def read_dict(filename): + """ + read_dict: key\2value + """ + res_dict = {} + for line in codecs.open(filename, encoding="utf8"): + try: + if isinstance(line, six.binary_type): + line = line.strip().decode("utf8") + line = line.strip() + key, value = line.strip().split("\2") + res_dict[key] = value + except Exception as err: + logger.error(str(err)) + logger.error("read dict[%s] failed" % filename) + return res_dict + + @staticmethod + def build_dict(filename, save_dir, min_num_char=2, min_num_intent=2): + """[build_dict from file] + + Arguments: + filename {[type]} -- [description] + save_dir {[type]} -- [description] + + Keyword Arguments: + min_num_char {int} -- [description] (default: {2}) + min_num_intent {int} -- [description] (default: {2}) + """ + char_dict = {} + intent_dict = {} + # readfile + for line in codecs.open(filename): + line = line.strip() + if isinstance(line, six.binary_type): + line = line.strip().decode("utf8", errors="ignore") + query, intents = line.split("\t") + # read query + for char_item in list(query): + if char_item not in char_dict: + char_dict[char_item] = 0 + char_dict[char_item] += 1 + # read intents + for intent in intents.split('\002'): + if intent not in intent_dict: + intent_dict[intent] = 0 + intent_dict[intent] += 1 + # save char dict + with codecs.open("%s/char.dict" % save_dir, "w", encoding="utf8") as f_out: + f_out.write("PAD\0020\n") + f_out.write("OOV\0021\n") + char_id = 2 + for key, value in char_dict.items(): + if value >= min_num_char: + if isinstance(key, six.binary_type): + key = key.encode("utf8") + f_out.write("%s\002%d\n" % (key, char_id)) + char_id += 1 + # save intent dict + with codecs.open("%s/domain.dict" % save_dir, "w", encoding="utf8") as f_out: + f_out.write("SYS_OTHER\0020\n") + intent_id = 1 + for key, value in intent_dict.items(): + if value >= min_num_intent and key != u'SYS_OTHER': + if isinstance(key, six.binary_type): + key = key.encode("utf8") + 
f_out.write("%s\002%d\n" % (key, intent_id)) + intent_id += 1 + + + +class ConfigReader(object): + """[read model config file] + + Arguments: + object {[type]} -- [description] + + Returns: + [type] -- [description] + """ + + @staticmethod + def read_conf(conf_file): + """[read_conf] + + Arguments: + conf_file {[type]} -- [description] + + Returns: + [type] -- [description] + """ + flow_data = collections.defaultdict(lambda: {}) + class2key = set(["model"]) + param_conf = cp.ConfigParser() + param_conf.read(conf_file) + for section in param_conf.sections(): + if section not in class2key: + continue + for option in param_conf.items(section): + flow_data[section][option[0]] = eval(option[1]) + return flow_data + + +def init_pretraining_params(exe, + pretraining_params_path, + main_program, + use_fp16=False): + """load params of pretrained model, NOT including moment, learning_rate""" + assert os.path.exists(pretraining_params_path + ), "[%s] cann't be found." % pretraining_params_path + + def _existed_params(var): + if not isinstance(var, fluid.framework.Parameter): + return False + return os.path.exists(os.path.join(pretraining_params_path, var.name)) + + fluid.io.load_vars( + exe, + pretraining_params_path, + main_program=main_program, + predicate=_existed_params) + print("Load pretraining parameters from {}.".format( + pretraining_params_path)) + + +def init_checkpoint(exe, init_checkpoint_path, main_program): + """ + Init CheckPoint + """ + assert os.path.exists( + init_checkpoint_path), "[%s] cann't be found." % init_checkpoint_path + + def existed_persitables(var): + """ + If existed presitabels + """ + if not fluid.io.is_persistable(var): + return False + return os.path.exists(os.path.join(init_checkpoint_path, var.name)) + + fluid.io.load_vars( + exe, + init_checkpoint_path, + main_program=main_program, + predicate=existed_persitables) + print ("Load model from {}".format(init_checkpoint_path)) + +def print_arguments(args): + """ + Print Arguments + """ + print('----------- Configuration Arguments -----------') + for arg, value in sorted(six.iteritems(vars(args))): + print('%s: %s' % (arg, value)) + print('------------------------------------------------') + + +def check_version(version='1.6.0'): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." 
\ + + try: + fluid.require_version(version) + except Exception as e: + logger.error(err) + sys.exit(1) + + diff --git a/PaddleNLP/emotion_detection/utils.py b/PaddleNLP/emotion_detection/utils.py index 4420d71e12539c281bb507b44c54c98768ca9771..2477f98a7413445f4e1b67153ffe8c1e8ebe362c 100644 --- a/PaddleNLP/emotion_detection/utils.py +++ b/PaddleNLP/emotion_detection/utils.py @@ -119,7 +119,7 @@ def data_reader(file_path, word_dict, num_examples, phrase, epoch, max_seq_len): Reader function """ for idx in range(epoch): - if phrase == "train": + if phrase == "train" and 'ce_mode' not in os.environ: random.shuffle(all_data) for wids, label, seq_len in all_data: yield wids, label, seq_len diff --git a/PaddleNLP/language_model/args.py b/PaddleNLP/language_model/args.py index 8014bb521ec4f276603c45702d4090048235646c..8713d78186005cb87d3ce23bc83cf567f429a17e 100644 --- a/PaddleNLP/language_model/args.py +++ b/PaddleNLP/language_model/args.py @@ -59,6 +59,12 @@ def parse_args(): type=str2bool, default=False, help='Whether profiling the trainning [True|False]') + parser.add_argument( + '--enable_auto_fusion', + type=str2bool, + default=False, + help='Whether enable fusion_group [True|False]. It is a experimental feature.' + ) parser.add_argument( '--use_dataloader', type=str2bool, @@ -80,5 +86,12 @@ def parse_args(): parser.add_argument('--enable_ce', action='store_true') parser.add_argument('--batch_size', type=int, default=0, help='batch size') parser.add_argument('--max_epoch', type=int, default=0, help='max epoch') + + # NOTE: args for profiler, used for benchmark + parser.add_argument( + '--profiler_path', + type=str, + default='/tmp/paddingrnn.profile', + help='the profiler output file path. used for benchmark') args = parser.parse_args() return args diff --git a/PaddleNLP/language_model/run.sh b/PaddleNLP/language_model/run.sh index 851c89771053e61beb8acf9e6a1d7309bc13d030..f010d910233ffcd59c62d63553dc803b19e06c36 100644 --- a/PaddleNLP/language_model/run.sh +++ b/PaddleNLP/language_model/run.sh @@ -6,7 +6,7 @@ function run_train() { python train.py \ --data_path data/simple-examples/data/ \ --model_type small \ - --use_gpu True + --use_gpu True \ #--init_from_pretrain_model models/0/params } diff --git a/PaddleNLP/language_model/train.py b/PaddleNLP/language_model/train.py index 0c3c3d8b94e0ed038b719cf8b5ec40ac7166610c..f00604f3aa740eaa7c740ca82282ec0eae180916 100644 --- a/PaddleNLP/language_model/train.py +++ b/PaddleNLP/language_model/train.py @@ -22,9 +22,10 @@ import os import random import math import contextlib - +from distutils.dir_util import mkpath import paddle import paddle.fluid as fluid +from paddle.fluid import profiler import paddle.fluid.framework as framework import paddle.fluid.profiler as profiler from paddle.fluid.executor import Executor @@ -50,9 +51,9 @@ SEED = 123 @contextlib.contextmanager -def profile_context(profile=True): +def profile_context(profile=True, profiler_path='/tmp/paddingrnn.profile'): if profile: - with profiler.profiler('All', 'total', '/tmp/paddingrnn.profile'): + with profiler.profiler('All', 'total', profiler_path): yield else: yield @@ -111,6 +112,9 @@ def main(): config = RNNConfig(args) + if not os.path.exists(args.save_model_dir): + mkpath(args.save_model_dir) + # define train program main_program = fluid.Program() startup_program = fluid.Program() @@ -121,7 +125,6 @@ def main(): res_vars = lm_model.lm_model( config.hidden_size, config.vocab_size, - config.batch_size, num_layers=config.num_layers, num_steps=config.num_steps, 
init_scale=config.init_scale, @@ -156,7 +159,6 @@ def main(): lm_model.lm_model( config.hidden_size, config.vocab_size, - config.batch_size, num_layers=config.num_layers, num_steps=config.num_steps, init_scale=config.init_scale, @@ -189,6 +191,12 @@ def main(): build_strategy = fluid.BuildStrategy() build_strategy.fuse_all_optimizer_ops = True + try: + fluid.require_version(min_version='1.7.0') + build_strategy.enable_auto_fusion = args.enable_auto_fusion + except Exception as e: + logger.info("PaddlePaddle version 1.7.0 or higher is " + "required when you want to enable fusion_group.") if args.parallel: train_program = fluid.compiler.CompiledProgram( @@ -206,11 +214,12 @@ def main(): train_data, valid_data, test_data = ptb_data def generate_init_data(): + batch_size = config.batch_size * device_count init_hidden = np.zeros( - (config.num_layers, config.batch_size, config.hidden_size), + (batch_size, config.num_layers, config.hidden_size), dtype='float32') init_cell = np.zeros( - (config.num_layers, config.batch_size, config.hidden_size), + (batch_size, config.num_layers, config.hidden_size), dtype='float32') return init_hidden, init_cell @@ -244,8 +253,8 @@ def main(): def eval(data): # when eval the batch_size set to 1 - eval_data_iter = reader.get_data_iter(data, config.batch_size, - config.num_steps) + eval_data_iter = reader.get_data_iter(data, config.batch_size * + device_count, config.num_steps) total_loss = 0.0 iters = 0 init_hidden, init_cell = generate_init_data() @@ -277,8 +286,8 @@ def main(): def train_an_epoch(epoch_id, batch_times): # get train epoch size log_interval = get_log_interval(len(train_data)) - train_data_iter = reader.get_data_iter(train_data, config.batch_size, - config.num_steps) + train_data_iter = reader.get_data_iter(train_data, config.batch_size * + device_count, config.num_steps) total_loss = 0 iters = 0 @@ -307,7 +316,6 @@ def main(): lr = np.array(fetch_outs[1]) init_hidden = np.array(fetch_outs[2]) init_cell = np.array(fetch_outs[3]) - total_loss += cost_train iters += config.num_steps if batch_id > 0 and batch_id % log_interval == 0: @@ -315,6 +323,12 @@ def main(): print( "-- Epoch:[%d]; Batch:[%d]; Time: %.5f s; ppl: %.5f, lr: %.5f" % (epoch_id, batch_id, batch_time, ppl[0], lr[0])) + + # profiler tools for benchmark + if args.profile and batch_id == log_interval: + profiler.reset_profiler() + elif args.profile and batch_id == (log_interval + 5): + break ppl = np.exp(total_loss / iters) return ppl @@ -368,6 +382,11 @@ def main(): % (epoch_id, batch_id, batch_time, ppl[0], lr[0])) batch_id += 1 + # profiler tools for benchmark + if args.profile and batch_id == log_interval: + profiler.reset_profiler() + elif args.profile and batch_id == (log_interval + 5): + break except fluid.core.EOFException: dataloader.reset() @@ -379,7 +398,7 @@ def main(): if args.use_dataloader: def data_gen(): - data_iter_size = config.batch_size // device_count + data_iter_size = config.batch_size train_batches = reader.get_data_iter(train_data, data_iter_size, config.num_steps) for batch in train_batches: @@ -425,31 +444,37 @@ def main(): print("ptblm\tlstm_language_model_%s_loss_card%d\t%s" % (args.rnn_model, device_count, train_ppl[0])) - # NOTE(zjl): sometimes we have not enough data for eval if batch_size is large, i.e., 2100 - # Just skip to avoid error - def is_valid_data(data, batch_size, num_steps): - data_len = len(data) - batch_len = data_len // batch_size - epoch_size = (batch_len - 1) // num_steps - return epoch_size >= 1 - - valid_data_valid = 
is_valid_data(valid_data, config.batch_size, - config.num_steps) - if valid_data_valid: - valid_ppl = eval(valid_data) - print("Valid ppl: %.5f" % valid_ppl[0]) - else: - print( - 'WARNING: length of valid_data is {}, which is not enough for batch_size {} and num_steps {}'. - format( - len(valid_data), config.batch_size, config.num_steps)) + if not args.profile: + # NOTE(zjl): sometimes we have not enough data for eval if batch_size is large, i.e., 2100 + # Just skip to avoid error + def is_valid_data(data, batch_size, num_steps): + data_len = len(data) + batch_len = data_len // batch_size + epoch_size = (batch_len - 1) // num_steps + return epoch_size >= 1 + + valid_data_valid = is_valid_data(valid_data, config.batch_size, + config.num_steps) + if valid_data_valid: + valid_ppl = eval(valid_data) + print("Valid ppl: %.5f" % valid_ppl[0]) + else: + print( + 'WARNING: length of valid_data is {}, which is not enough for batch_size {} and num_steps {}'. + format( + len(valid_data), config.batch_size, + config.num_steps)) + + save_model_dir = os.path.join(args.save_model_dir, + str(epoch_id)) + if not os.path.exists(save_model_dir): + mkpath(save_model_dir) + save_model_dir = os.path.join(save_model_dir, 'params') - save_model_dir = os.path.join(args.save_model_dir, - str(epoch_id), "params") - fluid.save(main_program, save_model_dir) - print("Saved model to: %s.\n" % save_model_dir) + fluid.save(main_program, save_model_dir) + print("Saved model to: %s.\n" % save_model_dir) - with profile_context(args.profile): + with profile_context(args.profile, args.profiler_path): train() test_ppl = eval(test_data) diff --git a/PaddleNLP/models/language_model/lm_model.py b/PaddleNLP/models/language_model/lm_model.py index ff668b0c5d38b47b682a985447a7446874543038..c66b77b7dc6c37b32926fc697bb35e12a24d8850 100644 --- a/PaddleNLP/models/language_model/lm_model.py +++ b/PaddleNLP/models/language_model/lm_model.py @@ -26,7 +26,6 @@ from paddle.fluid.contrib.layers import basic_lstm def lm_model(hidden_size, vocab_size, - batch_size, num_layers=2, num_steps=20, init_scale=0.1, @@ -185,45 +184,15 @@ def lm_model(hidden_size, pre_cell = cell_array[k] weight_1 = weight_1_arr[k] bias = bias_arr[k] - nn = layers.concat([input, pre_hidden], 1) gate_input = layers.matmul(x=nn, y=weight_1) gate_input = layers.elementwise_add(gate_input, bias) i, j, f, o = layers.split(gate_input, num_or_sections=4, dim=-1) - try: - from paddle.fluid.contrib.layers import fused_elemwise_activation - # fluid.contrib.layers.fused_elemwise_activation can do a fused - # operation, like: - # 1) x + sigmoid(y); x + tanh(y) - # 2) tanh(x + y) - # Now the unary operation supported in this fused op is limit, and - # we will extent this operation to support more unary operations and - # do this kind of fusion automitically in future version of paddle.fluid. 
- # layers.sigmoid(i) * layers.tanh(j) - tmp0 = fused_elemwise_activation( - x=layers.tanh(j), - y=i, - functor_list=['elementwise_mul', 'sigmoid'], - save_intermediate_out=False) - # pre_cell * layers.sigmoid(f) - tmp1 = fused_elemwise_activation( - x=pre_cell, - y=f, - functor_list=['elementwise_mul', 'sigmoid'], - save_intermediate_out=False) - c = tmp0 + tmp1 - # layers.tanh(c) * layers.sigmoid(o) - m = fused_elemwise_activation( - x=layers.tanh(c), - y=o, - functor_list=['elementwise_mul', 'sigmoid'], - save_intermediate_out=False) - except ImportError: - c = pre_cell * layers.sigmoid(f) + layers.sigmoid( - i) * layers.tanh(j) - m = layers.tanh(c) * layers.sigmoid(o) + c = pre_cell * layers.sigmoid(f) + layers.sigmoid( + i) * layers.tanh(j) + m = layers.tanh(c) * layers.sigmoid(o) hidden_array[k] = m cell_array[k] = c @@ -254,11 +223,8 @@ def lm_model(hidden_size, return real_res, last_hidden, last_cell - batch_size_each = batch_size // fluid.core.get_cuda_device_count() - x = fluid.data( - name="x", shape=[batch_size_each, num_steps, 1], dtype='int64') - y = fluid.data( - name="y", shape=[batch_size_each * num_steps, 1], dtype='int64') + x = fluid.data(name="x", shape=[None, num_steps, 1], dtype='int64') + y = fluid.data(name="y", shape=[None, 1], dtype='int64') if use_dataloader: dataloader = fluid.io.DataLoader.from_generator( @@ -269,16 +235,18 @@ def lm_model(hidden_size, init_hidden = fluid.data( name="init_hidden", - shape=[num_layers, batch_size_each, hidden_size], + shape=[None, num_layers, hidden_size], dtype='float32') init_cell = fluid.data( name="init_cell", - shape=[num_layers, batch_size_each, hidden_size], + shape=[None, num_layers, hidden_size], dtype='float32') - init_cell.persistable = True init_hidden.persistable = True + init_hidden = layers.transpose(init_hidden, perm=[1, 0, 2]) + init_cell = layers.transpose(init_cell, perm=[1, 0, 2]) + init_hidden_reshape = layers.reshape( init_hidden, shape=[num_layers, -1, hidden_size]) init_cell_reshape = layers.reshape( @@ -373,9 +341,8 @@ def lm_model(hidden_size, # can be used directly in next batch. This can avoid the fetching of # last_hidden and last_cell and feeding of init_hidden and init_cell in # each training step. - layers.assign(input=last_cell, output=init_cell) - layers.assign(input=last_hidden, output=init_hidden) - + last_hidden = layers.transpose(last_hidden, perm=[1, 0, 2]) + last_cell = layers.transpose(last_cell, perm=[1, 0, 2]) feeding_list = ['x', 'y', 'init_hidden', 'init_cell'] if use_dataloader: return loss, last_hidden, last_cell, feeding_list, dataloader diff --git a/PaddleNLP/models/model_check.py b/PaddleNLP/models/model_check.py index 4469be4c0d903c97ac05e3733e2842ccd8984f4d..51713452a7f0b1019c7b8b7d37d24e0c5f15c77c 100644 --- a/PaddleNLP/models/model_check.py +++ b/PaddleNLP/models/model_check.py @@ -33,6 +33,21 @@ def check_cuda(use_cuda, err = \ except Exception as e: pass +def check_version(): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." 
\ + + try: + fluid.require_version('1.6.0') + except Exception as e: + print(err) + sys.exit(1) + def check_version(): """ diff --git a/PaddleNLP/preprocess/padding.py b/PaddleNLP/preprocess/padding.py index 6094562d396181349bebac9e883f6fca9dc71afc..82171e68eb3af3513eaf4655c740a06bb1112d57 100644 --- a/PaddleNLP/preprocess/padding.py +++ b/PaddleNLP/preprocess/padding.py @@ -69,7 +69,7 @@ def pad_batch_data(insts, if return_seq_lens: seq_lens = np.array([len(inst) for inst in insts]) - return_list += [seq_lens.astype("int64").reshape([-1, 1])] + return_list += [seq_lens.astype("int64").reshape([-1])] return return_list if len(return_list) > 1 else return_list[0] diff --git a/PaddleNLP/sentiment_classification/download.py b/PaddleNLP/sentiment_classification/download.py index 8c28013f95a6639f248a56939be635e55e8ac41e..cc0a2a160f455cfa0e34ddebe8dee2d9b1502f9e 100644 --- a/PaddleNLP/sentiment_classification/download.py +++ b/PaddleNLP/sentiment_classification/download.py @@ -56,7 +56,7 @@ def download(url, filename): retry = 0 retry_limit = 3 chunk_size = 4096 - while not (os.path.exists(filename): + while not os.path.exists(filename): if retry < retry_limit: retry += 1 else: @@ -104,17 +104,17 @@ def download_model(dir_path): if not os.path.exists(dir_path): os.makedirs(dir_path) url = BASE_URL + MODEL_NAME - model_path = os.path.join(dir_path, model) + model_path = os.path.join(dir_path, MODEL_NAME) print("Downloading model: %s" % url) # download model - download(url, model_path, MODEL_NAME) + download(url, model_path) # extract model.tar.gz print("Extracting model: %s" % model_path) extract(model_path, dir_path) os.remove(model_path) if __name__ == "__main__": - if len(sys) != 2: + if len(sys.argv) != 2: usage() sys.exit(1) diff --git a/PaddleNLP/sentiment_classification/run_ernie.sh b/PaddleNLP/sentiment_classification/run_ernie.sh index de014be6e199f8e4494a1d32ea0baa4d192073c8..4ab92f87474988db06e4ce656726e25e95703b3c 100644 --- a/PaddleNLP/sentiment_classification/run_ernie.sh +++ b/PaddleNLP/sentiment_classification/run_ernie.sh @@ -2,7 +2,7 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.95 export FLAGS_enable_parallel_graph=1 export FLAGS_sync_nccl_allreduce=1 -export CUDA_VISIBLE_DEVICES=12 +export CUDA_VISIBLE_DEVICES=0 export CPU_NUM=1 ERNIE_PRETRAIN=./ernie_pretrain_model/ DATA_PATH=./senta_data diff --git a/PaddleNLP/similarity_net/reader.py b/PaddleNLP/similarity_net/reader.py index 8a3bfa9cd6385cc03e0b433b202e432e0c342d27..c9a68a8d8317f6842e58d6a9f79aec0b4d617cc2 100644 --- a/PaddleNLP/similarity_net/reader.py +++ b/PaddleNLP/similarity_net/reader.py @@ -28,7 +28,7 @@ class SimNetProcessor(object): self.valid_label = np.array([]) self.test_label = np.array([]) - def get_reader(self, mode): + def get_reader(self, mode, epoch=0): """ Get Reader """ @@ -85,34 +85,35 @@ class SimNetProcessor(object): title = [0] yield [query, title] else: - with io.open(self.args.train_data_dir, "r", - encoding="utf8") as file: - for line in file: - query, pos_title, neg_title = line.strip().split("\t") - if len(query) == 0 or len(pos_title) == 0 or len( - neg_title) == 0: - logging.warning( - "line not match format in test file") - continue - query = [ - self.vocab[word] for word in query.split(" ") - if word in self.vocab - ] - pos_title = [ - self.vocab[word] for word in pos_title.split(" ") - if word in self.vocab - ] - neg_title = [ - self.vocab[word] for word in neg_title.split(" ") - if word in self.vocab - ] - if len(query) == 0: - query = [0] - if len(pos_title) == 0: - pos_title = [0] - 
if len(neg_title) == 0: - neg_title = [0] - yield [query, pos_title, neg_title] + for idx in range(epoch): + with io.open(self.args.train_data_dir, "r", + encoding="utf8") as file: + for line in file: + query, pos_title, neg_title = line.strip().split("\t") + if len(query) == 0 or len(pos_title) == 0 or len( + neg_title) == 0: + logging.warning( + "line not match format in test file") + continue + query = [ + self.vocab[word] for word in query.split(" ") + if word in self.vocab + ] + pos_title = [ + self.vocab[word] for word in pos_title.split(" ") + if word in self.vocab + ] + neg_title = [ + self.vocab[word] for word in neg_title.split(" ") + if word in self.vocab + ] + if len(query) == 0: + query = [0] + if len(pos_title) == 0: + pos_title = [0] + if len(neg_title) == 0: + neg_title = [0] + yield [query, pos_title, neg_title] def reader_with_pointwise(): """ @@ -166,30 +167,31 @@ class SimNetProcessor(object): title = [0] yield [query, title] else: - with io.open(self.args.train_data_dir, "r", - encoding="utf8") as file: - for line in file: - query, title, label = line.strip().split("\t") - if len(query) == 0 or len(title) == 0 or len( - label) == 0 or not label.isdigit() or int( - label) not in [0, 1]: - logging.warning( - "line not match format in test file") - continue - query = [ - self.vocab[word] for word in query.split(" ") - if word in self.vocab - ] - title = [ - self.vocab[word] for word in title.split(" ") - if word in self.vocab - ] - label = int(label) - if len(query) == 0: - query = [0] - if len(title) == 0: - title = [0] - yield [query, title, label] + for idx in range(epoch): + with io.open(self.args.train_data_dir, "r", + encoding="utf8") as file: + for line in file: + query, title, label = line.strip().split("\t") + if len(query) == 0 or len(title) == 0 or len( + label) == 0 or not label.isdigit() or int( + label) not in [0, 1]: + logging.warning( + "line not match format in test file") + continue + query = [ + self.vocab[word] for word in query.split(" ") + if word in self.vocab + ] + title = [ + self.vocab[word] for word in title.split(" ") + if word in self.vocab + ] + label = int(label) + if len(query) == 0: + query = [0] + if len(title) == 0: + title = [0] + yield [query, title, label] if self.args.task_mode == "pairwise": return reader_with_pairwise diff --git a/PaddleNLP/similarity_net/run_classifier.py b/PaddleNLP/similarity_net/run_classifier.py index 158aa398b06e01565c49fa46e8073d4fed714424..944bb1117bde232cdb7b6631428376832a0937ad 100644 --- a/PaddleNLP/similarity_net/run_classifier.py +++ b/PaddleNLP/similarity_net/run_classifier.py @@ -92,11 +92,6 @@ def train(conf_dict, args): """ train processic """ - if args.enable_ce: - SEED = 102 - fluid.default_startup_program().random_seed = SEED - fluid.default_main_program().random_seed = SEED - # loading vocabulary vocab = utils.load_vocab(args.vocab_path) # get vocab size @@ -124,6 +119,12 @@ def train(conf_dict, args): startup_prog = fluid.Program() train_program = fluid.Program() + # used for continuous evaluation + if args.enable_ce: + SEED = 102 + startup_prog.random_seed = SEED + train_program.random_seed = SEED + simnet_process = reader.SimNetProcessor(args, vocab) if args.task_mode == "pairwise": # Build network @@ -140,7 +141,7 @@ def train(conf_dict, args): optimizer.ops(avg_cost) # Get Reader - get_train_examples = simnet_process.get_reader("train") + get_train_examples = simnet_process.get_reader("train",epoch=args.epoch) if args.do_valid: test_prog = fluid.Program() with 
fluid.program_guard(test_prog, startup_prog): @@ -164,7 +165,7 @@ def train(conf_dict, args): optimizer.ops(avg_cost) # Get Feeder and Reader - get_train_examples = simnet_process.get_reader("train") + get_train_examples = simnet_process.get_reader("train",epoch=args.epoch) if args.do_valid: test_prog = fluid.Program() with fluid.program_guard(test_prog, startup_prog): @@ -218,63 +219,67 @@ def train(conf_dict, args): global_step = 0 ce_info = [] train_exe = exe - for epoch_id in range(args.epoch): + #for epoch_id in range(args.epoch): + # used for continuous evaluation + if args.enable_ce: + train_batch_data = fluid.io.batch(get_train_examples, args.batch_size, drop_last=False) + else: train_batch_data = fluid.io.batch( fluid.io.shuffle( - get_train_examples, buf_size=10000), + get_train_examples, buf_size=10000), args.batch_size, drop_last=False) - train_pyreader.decorate_paddle_reader(train_batch_data) - train_pyreader.start() - exe.run(startup_prog) - losses = [] - start_time = time.time() - while True: - try: - global_step += 1 - fetch_list = [avg_cost.name] - avg_loss = train_exe.run(program=train_program, fetch_list = fetch_list) - if args.do_valid and global_step % args.validation_steps == 0: - get_valid_examples = simnet_process.get_reader("valid") - valid_result = valid_and_test(test_prog,test_pyreader,get_valid_examples,simnet_process,"valid",exe,[pred.name]) - if args.compute_accuracy: - valid_auc, valid_acc = valid_result - logging.info( - "global_steps: %d, valid_auc: %f, valid_acc: %f" % - (global_step, valid_auc, valid_acc)) - else: - valid_auc = valid_result - logging.info("global_steps: %d, valid_auc: %f" % - (global_step, valid_auc)) - if global_step % args.save_steps == 0: - model_save_dir = os.path.join(args.output_dir, - conf_dict["model_path"]) - model_path = os.path.join(model_save_dir, str(global_step)) - - if not os.path.exists(model_save_dir): - os.makedirs(model_save_dir) - if args.task_mode == "pairwise": - feed_var_names = [left.name, pos_right.name] - target_vars = [left_feat, pos_score] - else: - feed_var_names = [ - left.name, - right.name, - ] - target_vars = [left_feat, pred] - fluid.io.save_inference_model(model_path, feed_var_names, - target_vars, exe, - test_prog) - logging.info("saving infer model in %s" % model_path) - losses.append(np.mean(avg_loss[0])) - - except fluid.core.EOFException: - train_pyreader.reset() - break - end_time = time.time() - logging.info("epoch: %d, loss: %f, used time: %d sec" % - (epoch_id, np.mean(losses), end_time - start_time)) - ce_info.append([np.mean(losses), end_time - start_time]) + train_pyreader.decorate_paddle_reader(train_batch_data) + train_pyreader.start() + exe.run(startup_prog) + losses = [] + start_time = time.time() + while True: + try: + global_step += 1 + fetch_list = [avg_cost.name] + avg_loss = train_exe.run(program=train_program, fetch_list = fetch_list) + losses.append(np.mean(avg_loss[0])) + if args.do_valid and global_step % args.validation_steps == 0: + get_valid_examples = simnet_process.get_reader("valid") + valid_result = valid_and_test(test_prog,test_pyreader,get_valid_examples,simnet_process,"valid",exe,[pred.name]) + if args.compute_accuracy: + valid_auc, valid_acc = valid_result + logging.info( + "global_steps: %d, valid_auc: %f, valid_acc: %f, valid_loss: %f" % + (global_step, valid_auc, valid_acc, np.mean(losses))) + else: + valid_auc = valid_result + logging.info("global_steps: %d, valid_auc: %f, valid_loss: %f" % + (global_step, valid_auc, np.mean(losses))) + if global_step % 
args.save_steps == 0: + model_save_dir = os.path.join(args.output_dir, + conf_dict["model_path"]) + model_path = os.path.join(model_save_dir, str(global_step)) + + if not os.path.exists(model_save_dir): + os.makedirs(model_save_dir) + if args.task_mode == "pairwise": + feed_var_names = [left.name, pos_right.name] + target_vars = [left_feat, pos_score] + else: + feed_var_names = [ + left.name, + right.name, + ] + target_vars = [left_feat, pred] + fluid.io.save_inference_model(model_path, feed_var_names, + target_vars, exe, + test_prog) + logging.info("saving infer model in %s" % model_path) + + except fluid.core.EOFException: + train_pyreader.reset() + break + end_time = time.time() + #logging.info("epoch: %d, loss: %f, used time: %d sec" % + #(epoch_id, np.mean(losses), end_time - start_time)) + ce_info.append([np.mean(losses), end_time - start_time]) #final save logging.info("the final step is %s" % global_step) model_save_dir = os.path.join(args.output_dir, @@ -295,14 +300,14 @@ def train(conf_dict, args): target_vars, exe, test_prog) logging.info("saving infer model in %s" % model_path) - + # used for continuous evaluation if args.enable_ce: card_num = get_cards() ce_loss = 0 ce_time = 0 try: - ce_loss = ce_info[-2][0] - ce_time = ce_info[-2][1] + ce_loss = ce_info[-1][0] + ce_time = ce_info[-1][1] except: logging.info("ce info err!") print("kpis\teach_step_duration_%s_card%s\t%s" % diff --git a/PaddleRec/ctr/Paddle_baseline_KDD2019/generate_test.py b/PaddleRec/ctr/Paddle_baseline_KDD2019/generate_test.py index 8c39950f9329b4f0a2d4fe2512b1833bd796ccf3..66bf13d250e5487696177f6d710cd2ec73944d97 100644 --- a/PaddleRec/ctr/Paddle_baseline_KDD2019/generate_test.py +++ b/PaddleRec/ctr/Paddle_baseline_KDD2019/generate_test.py @@ -12,7 +12,6 @@ # See the License for the specific language governing permissions and # limitations under the License. 
- import argparse import logging import numpy as np @@ -22,12 +21,12 @@ import os os.environ["CUDA_VISIBLE_DEVICES"] = "" import paddle import paddle.fluid as fluid -logging.basicConfig( - format='%(asctime)s - %(levelname)s - %(message)s') +logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s') logger = logging.getLogger("fluid") logger.setLevel(logging.INFO) num_context_feature = 22 + def parse_args(): parser = argparse.ArgumentParser(description="PaddlePaddle DeepFM example") parser.add_argument( @@ -59,6 +58,7 @@ def parse_args(): return parser.parse_args() + def to_lodtensor(data, place): seq_lens = [len(seq) for seq in data] cur_len = 0 @@ -72,7 +72,6 @@ def to_lodtensor(data, place): res.set(flattened_data, place) res.set_lod([lod]) - return res @@ -91,7 +90,8 @@ def data2tensor(data, place): sparse_data = to_lodtensor([x[1 + i] for x in data], place) feed_dict["context" + str(i)] = sparse_data - context_fm = to_lodtensor(np.array([x[-2] for x in data]).astype("float32"), place) + context_fm = to_lodtensor( + np.array([x[-2] for x in data]).astype("float32"), place) feed_dict["context_fm"] = context_fm y_data = np.array([x[-1] for x in data]).astype("int64") @@ -99,6 +99,7 @@ def data2tensor(data, place): feed_dict["label"] = y_data return feed_dict + def test(): args = parse_args() @@ -112,13 +113,14 @@ def test(): exe = fluid.Executor(place) whole_filelist = ["./out/normed_test_session.txt"] - test_files = whole_filelist[int(0.0 * len(whole_filelist)):int(1.0 * len(whole_filelist))] - + test_files = whole_filelist[int(0.0 * len(whole_filelist)):int(1.0 * len( + whole_filelist))] epochs = 1 for i in range(epochs): - cur_model_path = args.model_path + "/epoch" + str(1) + ".model" + cur_model_path = os.path.join(args.model_path, + "epoch" + str(1) + ".model") with open("./testres/res" + str(i), 'w') as r: with fluid.scope_guard(test_scope): [inference_program, feed_target_names, fetch_targets] = \ @@ -129,9 +131,11 @@ def test(): for batch_id, data in enumerate(test_reader()): print(len(data[0])) feed_dict = data2tensor(data, place) - loss_val, auc_val, accuracy, predict, _ = exe.run(inference_program, - feed=feed_dict, - fetch_list=fetch_targets, return_numpy=False) + loss_val, auc_val, accuracy, predict, _ = exe.run( + inference_program, + feed=feed_dict, + fetch_list=fetch_targets, + return_numpy=False) x = np.array(predict) for j in range(x.shape[0]): diff --git a/PaddleRec/ctr/Paddle_baseline_KDD2019/infer.py b/PaddleRec/ctr/Paddle_baseline_KDD2019/infer.py index 5c58ecf1380bd339601a7692bedce11a587dd940..c218ce0fccc94ee595fd1681b54258d8d6ce43c0 100644 --- a/PaddleRec/ctr/Paddle_baseline_KDD2019/infer.py +++ b/PaddleRec/ctr/Paddle_baseline_KDD2019/infer.py @@ -12,8 +12,7 @@ import paddle.fluid as fluid import map_reader from network_conf import ctr_deepfm_dataset -logging.basicConfig( - format='%(asctime)s - %(levelname)s - %(message)s') +logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s') logger = logging.getLogger("fluid") logger.setLevel(logging.INFO) @@ -91,15 +90,20 @@ def infer(): place = fluid.CPUPlace() inference_scope = fluid.core.Scope() - filelist = ["%s/%s" % (args.data_path, x) for x in os.listdir(args.data_path)] + filelist = [ + "%s/%s" % (args.data_path, x) for x in os.listdir(args.data_path) + ] from map_reader import MapDataset map_dataset = MapDataset() map_dataset.setup(args.sparse_feature_dim) exe = fluid.Executor(place) - whole_filelist = ["raw_data/part-%d" % x for x in range(len(os.listdir("raw_data")))] + 
whole_filelist = [ + "raw_data/part-%d" % x for x in range(len(os.listdir("raw_data"))) + ] #whole_filelist = ["./out/normed_train09", "./out/normed_train10", "./out/normed_train11"] - test_files = whole_filelist[int(0.0 * len(whole_filelist)):int(1.0 * len(whole_filelist))] + test_files = whole_filelist[int(0.0 * len(whole_filelist)):int(1.0 * len( + whole_filelist))] # file_groups = [whole_filelist[i:i+train_thread_num] for i in range(0, len(whole_filelist), train_thread_num)] @@ -110,7 +114,8 @@ def infer(): epochs = 2 for i in range(epochs): - cur_model_path = args.model_path + "/epoch" + str(i + 1) + ".model" + cur_model_path = os.path.join(args.model_path, + "epoch" + str(i + 1) + ".model") with fluid.scope_guard(inference_scope): [inference_program, feed_target_names, fetch_targets] = \ fluid.io.load_inference_model(cur_model_path, exe) @@ -120,9 +125,11 @@ def infer(): test_reader = map_dataset.infer_reader(test_files, 1000, 100000) for batch_id, data in enumerate(test_reader()): - loss_val, auc_val, accuracy, predict, label = exe.run(inference_program, - feed=data2tensor(data, place), - fetch_list=fetch_targets, return_numpy=False) + loss_val, auc_val, accuracy, predict, label = exe.run( + inference_program, + feed=data2tensor(data, place), + fetch_list=fetch_targets, + return_numpy=False) #print(np.array(predict)) #x = np.array(predict) diff --git a/PaddleRec/ctr/Paddle_baseline_KDD2019/local_train.py b/PaddleRec/ctr/Paddle_baseline_KDD2019/local_train.py index 62241f2b11fd2cdf642b6b2e778e3ecb7681d41a..9d7e9452a14e08d293d77bc41fc2806ca9c0d1a2 100644 --- a/PaddleRec/ctr/Paddle_baseline_KDD2019/local_train.py +++ b/PaddleRec/ctr/Paddle_baseline_KDD2019/local_train.py @@ -6,33 +6,36 @@ import paddle.fluid as fluid import sys from network_confv6 import ctr_deepfm_dataset - NUM_CONTEXT_FEATURE = 22 DIM_USER_PROFILE = 10 DIM_DENSE_FEATURE = 3 -PYTHON_PATH = "/home/yaoxuefeng/whls/paddle_release_home/python/bin/python" # this is mine change yours +PYTHON_PATH = "/home/yaoxuefeng/whls/paddle_release_home/python/bin/python" # this is mine change yours + def train(): args = parse_args() if not os.path.isdir(args.model_output_dir): os.mkdir(args.model_output_dir) - + #set the input format for our model. 
Note that you need to carefully modify them when you define a new network #user_profile = fluid.layers.data( - #name="user_profile", shape=[DIM_USER_PROFILE], dtype='int64', lod_level=1) + #name="user_profile", shape=[DIM_USER_PROFILE], dtype='int64', lod_level=1) dense_feature = fluid.layers.data( name="dense_feature", shape=[DIM_DENSE_FEATURE], dtype='float32') context_feature = [ - fluid.layers.data(name="context" + str(i), shape=[1], lod_level=1, dtype="int64") - for i in range(0, NUM_CONTEXT_FEATURE)] + fluid.layers.data( + name="context" + str(i), shape=[1], lod_level=1, dtype="int64") + for i in range(0, NUM_CONTEXT_FEATURE) + ] context_feature_fm = fluid.layers.data( name="context_fm", shape=[1], dtype='int64', lod_level=1) label = fluid.layers.data(name='label', shape=[1], dtype='int64') print("ready to network") #self define network - loss, auc_var, batch_auc_var, accuracy, predict = ctr_deepfm_dataset(dense_feature, context_feature, context_feature_fm, label, - args.embedding_size, args.sparse_feature_dim) + loss, auc_var, batch_auc_var, accuracy, predict = ctr_deepfm_dataset( + dense_feature, context_feature, context_feature_fm, label, + args.embedding_size, args.sparse_feature_dim) print("ready to optimize") optimizer = fluid.optimizer.SGD(learning_rate=1e-4) @@ -42,7 +45,8 @@ def train(): exe.run(fluid.default_startup_program()) #use dataset api for much faster speed dataset = fluid.DatasetFactory().create_dataset() - dataset.set_use_var([dense_feature] + context_feature + [context_feature_fm] + [label]) + dataset.set_use_var([dense_feature] + context_feature + + [context_feature_fm] + [label]) #self define how to process generated training insatnces in map_reader.py pipe_command = PYTHON_PATH + " map_reader.py %d" % args.sparse_feature_dim dataset.set_pipe_command(pipe_command) @@ -50,27 +54,36 @@ def train(): thread_num = 1 dataset.set_thread(thread_num) #self define how to split training files for example:"split -a 2 -d -l 200000 normed_train.txt normed_train" - whole_filelist = ["./out/normed_train%d" % x for x in range(len(os.listdir("out")))] - whole_filelist = ["./out/normed_train00", "./out/normed_train01", "./out/normed_train02", "./out/normed_train03", - "./out/normed_train04", "./out/normed_train05", "./out/normed_train06", "./out/normed_train07", - "./out/normed_train08", - "./out/normed_train09", "./out/normed_train10", "./out/normed_train11"] + whole_filelist = [ + "./out/normed_train%d" % x for x in range(len(os.listdir("out"))) + ] + whole_filelist = [ + "./out/normed_train00", "./out/normed_train01", "./out/normed_train02", + "./out/normed_train03", "./out/normed_train04", "./out/normed_train05", + "./out/normed_train06", "./out/normed_train07", "./out/normed_train08", + "./out/normed_train09", "./out/normed_train10", "./out/normed_train11" + ] print("ready to epochs") epochs = 10 for i in range(epochs): print("start %dth epoch" % i) dataset.set_filelist(whole_filelist[:int(len(whole_filelist))]) #print the informations you want by setting fetch_list and fetch_info - exe.train_from_dataset(program=fluid.default_main_program(), - dataset=dataset, - fetch_list=[auc_var, accuracy, predict, label], - fetch_info=["auc", "accuracy", "predict", "label"], - debug=False) - model_dir = args.model_output_dir + '/epoch' + str(i + 1) + ".model" + exe.train_from_dataset( + program=fluid.default_main_program(), + dataset=dataset, + fetch_list=[auc_var, accuracy, predict, label], + fetch_info=["auc", "accuracy", "predict", "label"], + debug=False) + model_dir = 
os.path.join(args.model_output_dir, + '/epoch' + str(i + 1) + ".model") sys.stderr.write("epoch%d finished" % (i + 1)) #save model - fluid.io.save_inference_model(model_dir, [dense_feature.name] + [x.name for x in context_feature] + [context_feature_fm.name] + [label.name], - [loss, auc_var, accuracy, predict, label], exe) + fluid.io.save_inference_model( + model_dir, + [dense_feature.name] + [x.name for x in context_feature] + + [context_feature_fm.name] + [label.name], + [loss, auc_var, accuracy, predict, label], exe) if __name__ == '__main__': diff --git a/PaddleRec/ctr/dcn/README.md b/PaddleRec/ctr/dcn/README.md index 560acee8ab535d05831275e036d1fcfe541bd555..1ac51ddc132932ef691226874fbdbdf9a5d01c8e 100644 --- a/PaddleRec/ctr/dcn/README.md +++ b/PaddleRec/ctr/dcn/README.md @@ -10,6 +10,7 @@ ├── network.py # 网络结构 ├── config.py # 参数配置 ├── reader.py # 读取数据相关的函数 +├── utils.py # 通用函数 ├── data/ ├── download.sh # 下载数据脚本 ├── preprocess.py # 数据预处理脚本 @@ -23,7 +24,7 @@ DCN模型介绍可以参阅论文[Deep & Cross Network for Ad Click Predictions](https://arxiv.org/abs/1708.05123) ## 环境 -- PaddlePaddle 1.6 +- **目前模型库下模型均要求使用PaddlePaddle 1.6及以上版本或适当的develop版本** ## 数据下载 @@ -31,7 +32,7 @@ DCN模型介绍可以参阅论文[Deep & Cross Network for Ad Click Predictions] 数据下载命令 ```bash -cd data && sh download.sh +cd data && python download.py ``` ## 数据处理 @@ -69,13 +70,14 @@ loss: [0.44703564] auc_val: [0.80654419] ## 多机训练 首先使用命令下载并预处理小规模样例数据集: ```bash -cd dist_data && sh dist_download.sh && cd .. +cd dist_data && python dist_download.py && cd .. ``` 运行命令本地模拟多机场景,默认使用2 X 2,即2个pserver,2个trainer的方式组网训练。 **注意:在多机训练中,建议使用Paddle 1.6版本以上或[最新版本](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/install/Tables.html#whl-dev)。** ```bash +# 该sh不支持Windows sh cluster_train.sh ``` 参数说明: @@ -101,7 +103,7 @@ python infer.py --model_output_dir cluster_model --test_epoch 10 --test_valid_da - 0号trainer保存模型参数 - 每次训练完成后需要手动停止pserver进程,使用以下命令查看pserver进程: - + >ps -ef | grep python - + - 数据读取使用dataset模式,目前仅支持运行在Linux环境下 diff --git a/PaddleRec/ctr/dcn/cluster_train.py b/PaddleRec/ctr/dcn/cluster_train.py index 1b136ed935b2b9896d26949a78e0390fb0233755..5f8cc95f6def6656328b9d1929eac9008a36ff67 100644 --- a/PaddleRec/ctr/dcn/cluster_train.py +++ b/PaddleRec/ctr/dcn/cluster_train.py @@ -7,6 +7,13 @@ from collections import OrderedDict import paddle.fluid as fluid from network import DCN +import utils + + +def boolean_string(s): + if s.lower() not in {'false', 'true'}: + raise ValueError('Not a valid boolean string') + return s.lower() == 'true' def parse_args(): @@ -61,7 +68,7 @@ def parse_args(): help='Cross net l2 regularizer coefficient') parser.add_argument( '--use_bn', - type=bool, + type=boolean_string, default=True, help='Whether use batch norm in dnn part') parser.add_argument( @@ -161,7 +168,8 @@ def train(): fetch_info=['total_loss', 'avg_logloss', 'auc'], debug=False, print_period=args.print_steps) - model_dir = args.model_output_dir + '/epoch_' + str(epoch_id + 1) + model_dir = os.path.join(args.model_output_dir, + 'epoch_' + str(epoch_id + 1)) sys.stderr.write('epoch%d is finished and takes %f s\n' % ( (epoch_id + 1), time.time() - start)) if args.trainer_id == 0: # only trainer 0 save model @@ -194,4 +202,5 @@ def train(): if __name__ == "__main__": + utils.check_version() train() diff --git a/PaddleRec/ctr/dcn/config.py b/PaddleRec/ctr/dcn/config.py index e17b9dc5037ef00b4419988cf22ff70500f5844a..8dac8ef375f94549c77e6e4ab751ea9cada376b7 100644 --- a/PaddleRec/ctr/dcn/config.py +++ b/PaddleRec/ctr/dcn/config.py @@ -1,11 +1,15 @@ 
-#!/usr/bin/env python -# coding: utf-8 import argparse """ global params """ +def boolean_string(s): + if s.lower() not in {'false', 'true'}: + raise ValueError('Not a valid boolean string') + return s.lower() == 'true' + + def parse_args(): parser = argparse.ArgumentParser(description="PaddleFluid DCN demo") parser.add_argument( @@ -63,7 +67,7 @@ def parse_args(): help='Cross net l2 regularizer coefficient') parser.add_argument( '--use_bn', - type=bool, + type=boolean_string, default=True, help='Whether use batch norm in dnn part') parser.add_argument( @@ -75,5 +79,7 @@ def parse_args(): parser.add_argument( '--clip_by_norm', type=float, default=100.0, help="gradient clip norm") parser.add_argument('--print_steps', type=int, default=100) + parser.add_argument( + '--enable_ce', action='store_true', help='If set, run the task with continuous evaluation logs.') return parser.parse_args() diff --git a/PaddleRec/ctr/dcn/data/download.py b/PaddleRec/ctr/dcn/data/download.py new file mode 100644 index 0000000000000000000000000000000000000000..b2fedfe83625970d0e47b9db0a373a99b457fe61 --- /dev/null +++ b/PaddleRec/ctr/dcn/data/download.py @@ -0,0 +1,24 @@ +import os +import sys +import io + +LOCAL_PATH = os.path.dirname(os.path.abspath(__file__)) +TOOLS_PATH = os.path.join(LOCAL_PATH, "..", "..", "tools") +sys.path.append(TOOLS_PATH) + +from tools import download_file_and_uncompress + +if __name__ == '__main__': + trainfile = 'train.txt' + url = "https://s3-eu-west-1.amazonaws.com/kaggle-display-advertising-challenge-dataset/dac.tar.gz" + + print("download and extract starting...") + download_file_and_uncompress(url) + print("download and extract finished") + + count = 0 + for _ in io.open(trainfile, 'r', encoding='utf-8'): + count += 1 + + print("total records: %d" % count) + print("done") diff --git a/PaddleRec/ctr/dcn/data/download.sh b/PaddleRec/ctr/dcn/data/download.sh deleted file mode 100755 index 9d96285bf4c303f50a88e4c0767a757e6cfa1727..0000000000000000000000000000000000000000 --- a/PaddleRec/ctr/dcn/data/download.sh +++ /dev/null @@ -1,23 +0,0 @@ -#!/bin/bash - -workdir=$(cd $(dirname $0); pwd) - -cd $workdir - -trainfile='train.txt' - -echo "data dir:" ${workdir} - -cd $workdir - -echo "download data starting..." -wget --no-check-certificate -c https://s3-eu-west-1.amazonaws.com/kaggle-display-advertising-challenge-dataset/dac.tar.gz -echo "download finished" - -echo "extracting ..." 
-tar xzf dac.tar.gz >/dev/null 2>&1 -wc -l $trainfile | awk '{print $1}' > line_nums.log - -echo "extract finished" -echo "total records: "`cat line_nums.log` -echo "done" diff --git a/PaddleRec/ctr/dcn/data/preprocess.py b/PaddleRec/ctr/dcn/data/preprocess.py index 6ee5044bac370979b4416528bbcd35411ea37a4e..dd23c7ddb42a99123cdaa450199d210925ca206f 100644 --- a/PaddleRec/ctr/dcn/data/preprocess.py +++ b/PaddleRec/ctr/dcn/data/preprocess.py @@ -1,5 +1,3 @@ -#!/usr/bin/env python -# coding: utf-8 from __future__ import print_function, absolute_import, division import os @@ -19,7 +17,6 @@ VOCAB_DIR = 'vocab' TRAIN_DIR = 'train' TEST_VALID_DIR = 'test_valid' SPLIT_RATIO = 0.9 -LINE_NUMS = "line_nums.log" FREQ_THR = 10 INT_COLUMN_NAMES = ['I' + str(i) for i in range(1, 14)] @@ -113,11 +110,13 @@ def split_data(): fout.close() data_dir = TEST_VALID_DIR cur_part_idx = int(line_idx / 200000) - fout = open(data_dir + '/part-' + str(cur_part_idx), 'w') + fout = open( + os.path.join(data_dir, 'part-' + str(cur_part_idx)), 'w') if line_idx % 200000 == 0 and line_idx != 0: fout.close() cur_part_idx = int(line_idx / 200000) - fout = open(data_dir + '/part-' + str(cur_part_idx), 'w') + fout = open( + os.path.join(data_dir, 'part-' + str(cur_part_idx)), 'w') fout.write(line) fout.close() fin.close() diff --git a/PaddleRec/ctr/dcn/dist_data/dist_download.py b/PaddleRec/ctr/dcn/dist_data/dist_download.py new file mode 100644 index 0000000000000000000000000000000000000000..662982f6d6738ad90accd6b03dca7a21eb9fb3ae --- /dev/null +++ b/PaddleRec/ctr/dcn/dist_data/dist_download.py @@ -0,0 +1,19 @@ +from __future__ import print_function +import os +import sys +LOCAL_PATH = os.path.dirname(os.path.abspath(__file__)) +TOOLS_PATH = os.path.join(LOCAL_PATH, "..", "..", "tools") +sys.path.append(TOOLS_PATH) + +from tools import download_file_and_uncompress + +if __name__ == '__main__': + url = "https://paddlerec.bj.bcebos.com/deepfm%2Fdist_data_demo.tar.gz" + + print("download and extract starting...") + download_file_and_uncompress(url, savename="dist_data_demo.tar.gz") + print("download and extract finished") + + print("preprocessing...") + os.system("python dist_preprocess.py") + print("preprocess done") \ No newline at end of file diff --git a/PaddleRec/ctr/dcn/dist_data/dist_download.sh b/PaddleRec/ctr/dcn/dist_data/dist_download.sh deleted file mode 100755 index 78b3841ae9076942d317b9da0bcb3f0cd65a6be9..0000000000000000000000000000000000000000 --- a/PaddleRec/ctr/dcn/dist_data/dist_download.sh +++ /dev/null @@ -1,7 +0,0 @@ -#!/bin/bash - -# download small demo dataset -wget --no-check-certificate https://paddlerec.bj.bcebos.com/deepfm%2Fdist_data_demo.tar.gz -O dist_data_demo.tar.gz -tar xzvf dist_data_demo.tar.gz -# preprocess demo dataset -python dist_preprocess.py diff --git a/PaddleRec/ctr/dcn/dist_data/dist_preprocess.py b/PaddleRec/ctr/dcn/dist_data/dist_preprocess.py index 6a4c801c0233384decfa8035d8d967257088d775..afad881b48270ac00443ad5ce6273fa3d8216862 100644 --- a/PaddleRec/ctr/dcn/dist_data/dist_preprocess.py +++ b/PaddleRec/ctr/dcn/dist_data/dist_preprocess.py @@ -1,5 +1,3 @@ -#!/usr/bin/env python -# coding: utf-8 from __future__ import print_function, absolute_import, division import os @@ -21,7 +19,6 @@ TEST_DIR = 'dist_test_valid_data' TRAIN_FILE = os.path.join(TRAIN_DIR, 'tr') TEST_FILE = os.path.join(TEST_DIR, 'ev') SPLIT_RATIO = 0.9 -LINE_NUMS = "line_nums.log" FREQ_THR = 10 INT_COLUMN_NAMES = ['I' + str(i) for i in range(1, 14)] diff --git a/PaddleRec/ctr/dcn/infer.py 
b/PaddleRec/ctr/dcn/infer.py index 7d6fea628bf47af5599c818e4b4948d440f17846..260bbe908ced74b7084a2be2390dd00f0812aaf1 100644 --- a/PaddleRec/ctr/dcn/infer.py +++ b/PaddleRec/ctr/dcn/infer.py @@ -1,5 +1,3 @@ -#!/usr/bin/env python -# coding: utf-8 import logging import random @@ -16,6 +14,7 @@ from config import parse_args from reader import CriteoDataset from network import DCN from collections import OrderedDict +import utils logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s') logger = logging.getLogger('fluid') @@ -45,7 +44,8 @@ def infer(): startup_program = fluid.framework.Program() test_program = fluid.framework.Program() - cur_model_path = args.model_output_dir + '/epoch_' + args.test_epoch + cur_model_path = os.path.join(args.model_output_dir, + 'epoch_' + args.test_epoch) with fluid.scope_guard(inference_scope): with fluid.framework.program_guard(test_program, startup_program): @@ -67,11 +67,8 @@ def infer(): dirname=cur_model_path, main_program=fluid.default_main_program()) - auc_states_names = ['_generated_var_2', '_generated_var_3'] - for name in auc_states_names: - param = inference_scope.var(name).get_tensor() - param_array = np.zeros(param._get_dims()).astype("int64") - param.set(param_array, place) + for var in dcn_model.auc_states: # reset auc states + set_zero(var.name, scope=inference_scope, place=place) loss_all = 0 num_ins = 0 @@ -93,5 +90,23 @@ def infer(): ) +def set_zero(var_name, + scope=fluid.global_scope(), + place=fluid.CPUPlace(), + param_type="int64"): + """ + Set tensor of a Variable to zero. + Args: + var_name(str): name of Variable + scope(Scope): Scope object, default is fluid.global_scope() + place(Place): Place object, default is fluid.CPUPlace() + param_type(str): param data type, default is int64 + """ + param = scope.var(var_name).get_tensor() + param_array = np.zeros(param._get_dims()).astype(param_type) + param.set(param_array, place) + + if __name__ == '__main__': + utils.check_version() infer() diff --git a/PaddleRec/ctr/dcn/local_train.py b/PaddleRec/ctr/dcn/local_train.py index 48ff7689468bf4de016fcaf83a29dd07ea2d6a70..d01d702f6b5cb985cbb168db006096d70bd68d4a 100644 --- a/PaddleRec/ctr/dcn/local_train.py +++ b/PaddleRec/ctr/dcn/local_train.py @@ -1,5 +1,3 @@ -#!/usr/bin/env python -# coding: utf-8 from __future__ import print_function, absolute_import, division import os import random @@ -11,6 +9,7 @@ import paddle.fluid as fluid from config import parse_args from network import DCN +import utils """ train DCN model """ @@ -22,6 +21,12 @@ def train(args): :param args: hyperparams of model :return: """ + # ce + if args.enable_ce: + SEED = 102 + fluid.default_main_program().random_seed = SEED + fluid.default_startup_program().random_seed = SEED + cat_feat_dims_dict = OrderedDict() for line in open(args.cat_feat_num): spls = line.strip().split() @@ -74,7 +79,8 @@ def train(args): fetch_info=['total_loss', 'avg_logloss', 'auc'], debug=False, print_period=args.print_steps) - model_dir = args.model_output_dir + '/epoch_' + str(epoch_id + 1) + model_dir = os.path.join(args.model_output_dir, + 'epoch_' + str(epoch_id + 1)) sys.stderr.write('epoch%d is finished and takes %f s\n' % ( (epoch_id + 1), time.time() - start)) fluid.io.save_persistables( @@ -86,4 +92,5 @@ def train(args): if __name__ == '__main__': args = parse_args() print(args) + utils.check_version() train(args) diff --git a/PaddleRec/ctr/dcn/network.py b/PaddleRec/ctr/dcn/network.py index 0589e0a1cb4208717683174fb008a69019d897fd..ffa399edc61134146b3cca33d1e5f0adfc733659 
100644 --- a/PaddleRec/ctr/dcn/network.py +++ b/PaddleRec/ctr/dcn/network.py @@ -1,5 +1,3 @@ -#!/usr/bin/env python -# coding: utf-8 from __future__ import print_function, absolute_import, division import paddle.fluid as fluid from collections import OrderedDict @@ -40,13 +38,13 @@ class DCN(object): def build_network(self, is_test=False): # data input - self.target_input = fluid.layers.data( - name='label', shape=[1], dtype='float32') + self.target_input = fluid.data( + name='label', shape=[None, 1], dtype='float32') data_dict = OrderedDict() for feat_name in self.feat_dims_dict: - data_dict[feat_name] = fluid.layers.data( - name=feat_name, shape=[1], dtype='float32') + data_dict[feat_name] = fluid.data( + name=feat_name, shape=[None, 1], dtype='float32') self.net_input = self._create_embedding_input(data_dict) @@ -64,9 +62,8 @@ class DCN(object): # auc prob_2d = fluid.layers.concat([1 - self.prob, self.prob], 1) label_int = fluid.layers.cast(self.target_input, 'int64') - auc_var, batch_auc_var, auc_states = fluid.layers.auc(input=prob_2d, - label=label_int, - slide_steps=0) + auc_var, batch_auc_var, self.auc_states = fluid.layers.auc( + input=prob_2d, label=label_int, slide_steps=0) self.auc_var = auc_var # logloss @@ -120,7 +117,7 @@ class DCN(object): def _create_embedding_input(self, data_dict): # sparse embedding - sparse_emb_dict = OrderedDict((name, fluid.layers.embedding( + sparse_emb_dict = OrderedDict((name, fluid.embedding( input=fluid.layers.cast( data_dict[name], dtype='int64'), size=[ diff --git a/PaddleRec/ctr/dcn/reader.py b/PaddleRec/ctr/dcn/reader.py index aa915f6edbaf2cda51e764645f719fa965f3bc5d..291fc988edb3683aec5ef529ec78fbd87897fc72 100644 --- a/PaddleRec/ctr/dcn/reader.py +++ b/PaddleRec/ctr/dcn/reader.py @@ -1,5 +1,3 @@ -#!/usr/bin/env python -# coding: utf-8 """ dataset and reader """ diff --git a/PaddleRec/ctr/dcn/utils.py b/PaddleRec/ctr/dcn/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..779b129e574e4611f29202f71d73856ecb888069 --- /dev/null +++ b/PaddleRec/ctr/dcn/utils.py @@ -0,0 +1,24 @@ +import sys +import paddle.fluid as fluid +import logging + +logging.basicConfig() +logger = logging.getLogger(__name__) + +__all__ = ['check_version'] + + +def check_version(): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." \ + + try: + fluid.require_version('1.6.0') + except Exception as e: + logger.error(err) + sys.exit(1) diff --git a/PaddleRec/ctr/deepfm/README.md b/PaddleRec/ctr/deepfm/README.md index a9847c8073913cd102d3b18a19ea52b53a1922af..ace75e5405e001acc8ba67285a9f6c79726bb01f 100644 --- a/PaddleRec/ctr/deepfm/README.md +++ b/PaddleRec/ctr/deepfm/README.md @@ -15,7 +15,7 @@ This model implementation reproduces the result of the paper "DeepFM: A Factoriz ``` ## Environment -- PaddlePaddle 1.6 +- **Now all models in PaddleRec require PaddlePaddle version 1.6 or higher, or suitable develop version.** ## Download and preprocess data @@ -25,7 +25,7 @@ To preprocess the raw dataset, we min-max normalize continuous features to [0, 1 Download and preprocess data: ```bash -cd data && sh download_preprocess.sh && cd .. +cd data && python download_preprocess.py && cd .. ``` After executing these commands, 3 folders "train_data", "test_data" and "aid_data" will be generated. 
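A minimal sketch of what this preprocessing produces for each record: continuous features are min-max scaled to [0, 1] and categorical values are looked up in the generated feature dictionary. The column ranges, min/max constants and the `encode` helper below are illustrative only, not the repository's reader code:

```python
import pickle

# Illustrative constants; the real per-column min/max values and the dictionary
# are produced by preprocess.py (the dictionary is saved as aid_data/feat_dict_10.pkl2).
cont_min = [0, -3, 0]
cont_max = [5775, 257675, 65535]
feat_dict = pickle.load(open("aid_data/feat_dict_10.pkl2", "rb"))

def encode(features):
    """features: one raw record split on tab; features[0] is the click label."""
    feat_idx, feat_value = [], []
    for i in range(1, 4):  # continuous columns (illustrative subset of I1-I13)
        if features[i] == "":
            feat_idx.append(0)        # 0 acts as the padding / missing index
            feat_value.append(0.0)
        else:
            feat_idx.append(feat_dict[i])
            feat_value.append((float(features[i]) - cont_min[i - 1]) /
                              (cont_max[i - 1] - cont_min[i - 1]))
    for i in range(4, 6):  # categorical columns (illustrative subset of C1-C26)
        key = features[i]
        feat_idx.append(feat_dict.get(key, 0))   # infrequent or unseen values map to 0
        feat_value.append(1.0 if key in feat_dict else 0.0)
    return feat_idx, feat_value, [int(features[0])]
```

Each sample thus becomes a fixed-length (feat_idx, feat_value, label) triple that the dataset pipeline feeds to the model.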
The folder "train_data" contains 90% of the raw data, while the rest 10% is in "test_data". The folder "aid_data" contains a created feature dictionary "feat_dict.pkl2". @@ -58,12 +58,13 @@ We emulate distributed training on a local machine. In default, we use 2 X 2,i ### Download and preprocess distributed demo dataset This small demo dataset(a few lines from Criteo dataset) only test if distributed training can train. ```bash -cd dist_data && sh dist_data_download.sh && cd .. +cd dist_data && python dist_data_download.py && cd .. ``` ### Distributed Train and Infer Train ```bash +# 该sh不支持Windows sh cluster_train.sh ``` params of cluster_train.sh: @@ -80,7 +81,7 @@ other params explained in cluster_train.py Infer ```bash -python infer.py --model_output_dir cluster_model --test_epoch 10 --test_data_dir=dist_data/dist_test_data --feat_dict='dist_data/aid_data/feat_dict_10.pkl2' +python infer.py --model_output_dir cluster_model --test_epoch 10 --num_feat 141443 --test_data_dir=dist_data/dist_test_data --feat_dict='dist_data/aid_data/feat_dict_10.pkl2' ``` Notes: @@ -89,7 +90,7 @@ Notes: - The first trainer(with trainer_id 0) saves model params. - After each training, pserver processes should be stop manually. You can use command below: - + >ps -ef | grep python - We use Dataset API to load data,it's only supported on Linux now. diff --git a/PaddleRec/ctr/deepfm/args.py b/PaddleRec/ctr/deepfm/args.py index 22dbfaa1d7f9acc6ac594159f5b601fb2a4c29f8..d28e64ec0a4cc8a5d7967d710fc2ddc864cec12f 100644 --- a/PaddleRec/ctr/deepfm/args.py +++ b/PaddleRec/ctr/deepfm/args.py @@ -67,5 +67,7 @@ def parse_args(): '--reg', type=float, default=1e-4, help=' (default: 1e-4)') parser.add_argument('--num_field', type=int, default=39) parser.add_argument('--num_feat', type=int, default=1086460) # 2090493 + parser.add_argument( + '--enable_ce', action='store_true', help='If set, run the task with continuous evaluation logs.') return parser.parse_args() diff --git a/PaddleRec/ctr/deepfm/cluster_train.py b/PaddleRec/ctr/deepfm/cluster_train.py index 23985ebefab5d2ae0f1e92a32bb86dc0fd63c14e..c0509d460b48184b6c4e0727a5f82d225d5a7a54 100644 --- a/PaddleRec/ctr/deepfm/cluster_train.py +++ b/PaddleRec/ctr/deepfm/cluster_train.py @@ -5,6 +5,7 @@ import time from network_conf import ctr_deepfm_model import paddle.fluid as fluid +import utils def parse_args(): @@ -114,9 +115,9 @@ def train(): if args.trainer_id == 0 and not os.path.isdir(args.model_output_dir): os.mkdir(args.model_output_dir) - loss, auc, data_list = ctr_deepfm_model(args.embedding_size, args.num_field, - args.num_feat, args.layer_sizes, - args.act, args.reg, args.is_sparse) + loss, auc, data_list, auc_states = ctr_deepfm_model( + args.embedding_size, args.num_field, args.num_feat, args.layer_sizes, + args.act, args.reg, args.is_sparse) optimizer = fluid.optimizer.SGD( learning_rate=args.lr, regularization=fluid.regularizer.L2DecayRegularizer(args.reg)) @@ -132,7 +133,7 @@ def train(): dataset.set_batch_size(args.batch_size) dataset.set_thread(args.num_thread) train_filelist = [ - args.train_data_dir + '/' + x + os.path.join(args.train_data_dir, x) for x in os.listdir(args.train_data_dir) ] @@ -151,11 +152,12 @@ def train(): exe.train_from_dataset( program=main_program, dataset=dataset, - fetch_list=[loss], - fetch_info=['epoch %d batch loss' % (epoch_id + 1)], - print_period=20, + fetch_list=[loss, auc], + fetch_info=['epoch %d batch loss' % (epoch_id + 1), "auc"], + print_period=5, debug=False) - model_dir = args.model_output_dir + '/epoch_' + str(epoch_id 
+ 1) + model_dir = os.path.join(args.model_output_dir, + 'epoch_' + str(epoch_id + 1)) sys.stderr.write('epoch%d is finished and takes %f s\n' % ( (epoch_id + 1), time.time() - start)) if args.trainer_id == 0: # only trainer 0 save model @@ -188,4 +190,5 @@ def train(): if __name__ == "__main__": + utils.check_version() train() diff --git a/PaddleRec/ctr/deepfm/data/download_preprocess.py b/PaddleRec/ctr/deepfm/data/download_preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..054610236c7516d1fe521c70145a64af95b25def --- /dev/null +++ b/PaddleRec/ctr/deepfm/data/download_preprocess.py @@ -0,0 +1,25 @@ +import os +import shutil +import sys + +LOCAL_PATH = os.path.dirname(os.path.abspath(__file__)) +TOOLS_PATH = os.path.join(LOCAL_PATH, "..", "..", "tools") +sys.path.append(TOOLS_PATH) + +from tools import download_file_and_uncompress, download_file + +if __name__ == '__main__': + url = "https://s3-eu-west-1.amazonaws.com/kaggle-display-advertising-challenge-dataset/dac.tar.gz" + url2 = "https://paddlerec.bj.bcebos.com/deepfm%2Ffeat_dict_10.pkl2" + + print("download and extract starting...") + download_file_and_uncompress(url) + download_file(url2, "./aid_data/feat_dict_10.pkl2", True) + print("download and extract finished") + + print("preprocessing...") + os.system("python preprocess.py") + print("preprocess done") + + shutil.rmtree("raw_data") + print("done") diff --git a/PaddleRec/ctr/deepfm/data/download_preprocess.sh b/PaddleRec/ctr/deepfm/data/download_preprocess.sh deleted file mode 100644 index eed4ffe2d12d24f2d1b403306aeb7bf74aa3f1d5..0000000000000000000000000000000000000000 --- a/PaddleRec/ctr/deepfm/data/download_preprocess.sh +++ /dev/null @@ -1,10 +0,0 @@ -#!/bin/bash - -wget --no-check-certificate https://s3-eu-west-1.amazonaws.com/kaggle-display-advertising-challenge-dataset/dac.tar.gz -wget --no-check-certificate https://paddlerec.bj.bcebos.com/deepfm%2Ffeat_dict_10.pkl2 -O ./aid_data/feat_dict_10.pkl2 || rm -f ./aid_data/feat_dict_10.pkl2 -tar zxf dac.tar.gz >/dev/null 2>&1 -rm -f dac.tar.gz - -python preprocess.py -rm *.txt -rm -r raw_data diff --git a/PaddleRec/ctr/deepfm/dist_data/dist_data_download.py b/PaddleRec/ctr/deepfm/dist_data/dist_data_download.py new file mode 100644 index 0000000000000000000000000000000000000000..63e2756db389f65c17280b85437cab69e159dbda --- /dev/null +++ b/PaddleRec/ctr/deepfm/dist_data/dist_data_download.py @@ -0,0 +1,22 @@ +import os +import shutil +import sys + +LOCAL_PATH = os.path.dirname(os.path.abspath(__file__)) +TOOLS_PATH = os.path.join(LOCAL_PATH, "..", "..", "tools") +sys.path.append(TOOLS_PATH) + +from tools import download_file_and_uncompress + +if __name__ == '__main__': + url = "https://paddlerec.bj.bcebos.com/deepfm%2Fdist_data_demo.tar.gz" + + print("download and extract starting...") + download_file_and_uncompress(url, savename="dist_data_demo.tar.gz") + print("download and extract finished") + + print("preprocessing...") + os.system("python preprocess_dist.py") + print("preprocess done") + + print("done") \ No newline at end of file diff --git a/PaddleRec/ctr/deepfm/dist_data/dist_data_download.sh b/PaddleRec/ctr/deepfm/dist_data/dist_data_download.sh deleted file mode 100644 index 6db0dbdc92ad55cfa2fe8c653ede88b4b9e77eba..0000000000000000000000000000000000000000 --- a/PaddleRec/ctr/deepfm/dist_data/dist_data_download.sh +++ /dev/null @@ -1,7 +0,0 @@ -#!/bin/bash - -# download small demo dataset -wget --no-check-certificate https://paddlerec.bj.bcebos.com/deepfm%2Fdist_data_demo.tar.gz -O 
dist_data_demo.tar.gz -tar xzvf dist_data_demo.tar.gz -# preprocess dataset -python preprocess_dist.py diff --git a/PaddleRec/ctr/deepfm/infer.py b/PaddleRec/ctr/deepfm/infer.py index c5ceb564ddc482626887ee9bc12f252b5ff7e6fa..2b7e29a77013e2bdbfd865cffe89ac926dd1486e 100644 --- a/PaddleRec/ctr/deepfm/infer.py +++ b/PaddleRec/ctr/deepfm/infer.py @@ -11,6 +11,7 @@ import paddle.fluid as fluid from args import parse_args from criteo_reader import CriteoDataset from network_conf import ctr_deepfm_model +import utils logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s') logger = logging.getLogger('fluid') @@ -24,7 +25,8 @@ def infer(): inference_scope = fluid.Scope() test_files = [ - args.test_data_dir + '/' + x for x in os.listdir(args.test_data_dir) + os.path.join(args.test_data_dir, x) + for x in os.listdir(args.test_data_dir) ] criteo_dataset = CriteoDataset() criteo_dataset.setup(args.feat_dict) @@ -33,11 +35,12 @@ def infer(): startup_program = fluid.framework.Program() test_program = fluid.framework.Program() - cur_model_path = args.model_output_dir + '/epoch_' + args.test_epoch + cur_model_path = os.path.join(args.model_output_dir, + 'epoch_' + args.test_epoch) with fluid.scope_guard(inference_scope): with fluid.framework.program_guard(test_program, startup_program): - loss, auc, data_list = ctr_deepfm_model( + loss, auc, data_list, auc_states = ctr_deepfm_model( args.embedding_size, args.num_field, args.num_feat, args.layer_sizes, args.act, args.reg) @@ -48,11 +51,8 @@ def infer(): dirname=cur_model_path, main_program=fluid.default_main_program()) - auc_states_names = ['_generated_var_2', '_generated_var_3'] - for name in auc_states_names: - param = inference_scope.var(name).get_tensor() - param_array = np.zeros(param._get_dims()).astype("int64") - param.set(param_array, place) + for var in auc_states: # reset auc states + set_zero(var.name, scope=inference_scope, place=place) loss_all = 0 num_ins = 0 @@ -70,5 +70,23 @@ def infer(): ) +def set_zero(var_name, + scope=fluid.global_scope(), + place=fluid.CPUPlace(), + param_type="int64"): + """ + Set tensor of a Variable to zero. 
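+    Typically called on each tensor in auc_states so the AUC metric starts from zero before a restored model is evaluated.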
+ Args: + var_name(str): name of Variable + scope(Scope): Scope object, default is fluid.global_scope() + place(Place): Place object, default is fluid.CPUPlace() + param_type(str): param data type, default is int64 + """ + param = scope.var(var_name).get_tensor() + param_array = np.zeros(param._get_dims()).astype(param_type) + param.set(param_array, place) + + if __name__ == '__main__': + utils.check_version() infer() diff --git a/PaddleRec/ctr/deepfm/local_train.py b/PaddleRec/ctr/deepfm/local_train.py index b6edf9742297a822f300461e075aef55282ca4a9..001f625eb21dd6587266b5338910274e5d4fc7b0 100644 --- a/PaddleRec/ctr/deepfm/local_train.py +++ b/PaddleRec/ctr/deepfm/local_train.py @@ -6,10 +6,17 @@ from network_conf import ctr_deepfm_model import time import numpy import pickle +import utils def train(): args = parse_args() + # add ce + if args.enable_ce: + SEED = 102 + fluid.default_main_program().random_seed = SEED + fluid.default_startup_program().random_seed = SEED + print('---------- Configuration Arguments ----------') for key, value in args.__dict__.items(): print(key + ':' + str(value)) @@ -17,9 +24,9 @@ def train(): if not os.path.isdir(args.model_output_dir): os.mkdir(args.model_output_dir) - loss, auc, data_list = ctr_deepfm_model(args.embedding_size, args.num_field, - args.num_feat, args.layer_sizes, - args.act, args.reg) + loss, auc, data_list, auc_states = ctr_deepfm_model( + args.embedding_size, args.num_field, args.num_feat, args.layer_sizes, + args.act, args.reg) optimizer = fluid.optimizer.SGD( learning_rate=args.lr, regularization=fluid.regularizer.L2DecayRegularizer(args.reg)) @@ -35,7 +42,8 @@ def train(): dataset.set_batch_size(args.batch_size) dataset.set_thread(args.num_thread) train_filelist = [ - args.train_data_dir + '/' + x for x in os.listdir(args.train_data_dir) + os.path.join(args.train_data_dir, x) + for x in os.listdir(args.train_data_dir) ] print('---------------------------------------------') @@ -45,11 +53,12 @@ def train(): exe.train_from_dataset( program=fluid.default_main_program(), dataset=dataset, - fetch_list=[loss], - fetch_info=['epoch %d batch loss' % (epoch_id + 1)], + fetch_list=[loss, auc], + fetch_info=['epoch %d batch loss' % (epoch_id + 1), "auc"], print_period=1000, debug=False) - model_dir = args.model_output_dir + '/epoch_' + str(epoch_id + 1) + model_dir = os.path.join(args.model_output_dir, + 'epoch_' + str(epoch_id + 1)) sys.stderr.write('epoch%d is finished and takes %f s\n' % ( (epoch_id + 1), time.time() - start)) fluid.io.save_persistables( @@ -59,4 +68,5 @@ def train(): if __name__ == '__main__': + utils.check_version() train() diff --git a/PaddleRec/ctr/deepfm/network_conf.py b/PaddleRec/ctr/deepfm/network_conf.py index 480a0c753c069a36058ce3b9efab671ccb77ab9b..609ea12f4f4399f563ccf9ad7954be7d01682adc 100644 --- a/PaddleRec/ctr/deepfm/network_conf.py +++ b/PaddleRec/ctr/deepfm/network_conf.py @@ -11,12 +11,12 @@ def ctr_deepfm_model(embedding_size, is_sparse=False): init_value_ = 0.1 - raw_feat_idx = fluid.layers.data( - name='feat_idx', shape=[num_field], dtype='int64') - raw_feat_value = fluid.layers.data( - name='feat_value', shape=[num_field], dtype='float32') - label = fluid.layers.data( - name='label', shape=[1], dtype='float32') # None * 1 + raw_feat_idx = fluid.data( + name='feat_idx', shape=[None, num_field], dtype='int64') + raw_feat_value = fluid.data( + name='feat_value', shape=[None, num_field], dtype='float32') + label = fluid.data( + name='label', shape=[None, 1], dtype='float32') # None * 1 feat_idx = 
fluid.layers.reshape(raw_feat_idx, [-1, 1]) # (None * num_field) * 1 @@ -25,7 +25,7 @@ def ctr_deepfm_model(embedding_size, # -------------------- first order term -------------------- - first_weights_re = fluid.layers.embedding( + first_weights_re = fluid.embedding( input=feat_idx, is_sparse=is_sparse, dtype='float32', @@ -41,7 +41,7 @@ def ctr_deepfm_model(embedding_size, # -------------------- second order term -------------------- - feat_embeddings_re = fluid.layers.embedding( + feat_embeddings_re = fluid.embedding( input=feat_idx, is_sparse=is_sparse, dtype='float32', @@ -111,4 +111,5 @@ def ctr_deepfm_model(embedding_size, label=label_int, slide_steps=0) - return batch_cost, auc_var, [raw_feat_idx, raw_feat_value, label] + return batch_cost, auc_var, [raw_feat_idx, raw_feat_value, + label], auc_states diff --git a/PaddleRec/ctr/deepfm/utils.py b/PaddleRec/ctr/deepfm/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..779b129e574e4611f29202f71d73856ecb888069 --- /dev/null +++ b/PaddleRec/ctr/deepfm/utils.py @@ -0,0 +1,24 @@ +import sys +import paddle.fluid as fluid +import logging + +logging.basicConfig() +logger = logging.getLogger(__name__) + +__all__ = ['check_version'] + + +def check_version(): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." \ + + try: + fluid.require_version('1.6.0') + except Exception as e: + logger.error(err) + sys.exit(1) diff --git a/PaddleRec/ctr/deepfm_dygraph/README.md b/PaddleRec/ctr/deepfm_dygraph/README.md new file mode 100644 index 0000000000000000000000000000000000000000..fc9b14aa8aa2486af81f43d4bc077646d4012112 --- /dev/null +++ b/PaddleRec/ctr/deepfm_dygraph/README.md @@ -0,0 +1,73 @@ + +# DeepFM动态图 + +以下是本例的简要目录结构及说明: + +```text +. +├── README.md # 文档 +├── train.py # 本地训练脚本 +├── infer.py # 本地预测脚本 +├── network.py # 网络结构 +├── data_reader.py # 读取数据相关的函数 +├── utility.py # 参数设置和通用函数 +├── data/ + ├── download_preprocess.py # 下载并预处理数据脚本 + ├── preprocess.py # 数据预处理脚本 + +``` + +## 介绍 +本模型使用PaddlePaddle **动态图** 复现了DeepFM模型。 + +DeepFM模型介绍可以参阅论文[DeepFM: A Factorization-Machine based Neural Network for CTR Prediction](https://arxiv.org/abs/1703.04247) + +## 环境 +- **目前本模型要求使用PaddlePaddle 1.7或最新develop版本** + +## 数据下载和预处理 + +我们在[Criteo](https://www.kaggle.com/c/criteo-display-ad-challenge/)数据集训练测试DeepFM。整个数据集包含约4500万条记录。每一行第一列是label,表示该条广告是否被点击,剩下的是13个整数型特征(I1 - I13)和26个离散型特征(C1 - C26)。 + +通过min-max normalize将连续特征转换到 [0, 1]区间,并去除离散型特征中出现少于10次的特征。整个数据集被划分为两部分:90%用来训练,10%用来评估模型效果。 + +下载并预处理数据命令: +```bash +cd data && python download_preprocess.py && cd .. 
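+# download_preprocess.py fetches the Criteo dac.tar.gz archive and a prebuilt feat_dict_10.pkl2 (saved to aid_data/),
+# then runs preprocess.py and removes the intermediate raw_data/ directory when it finishes.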
+``` + +执行完命令后将生成三个文件夹: train_data, test_data和aid_data。 + +train_data包含90%数据,test_data包含剩下的10%数据,aid_data中有一个生成或下载(节约用户生成特征字典时间)的特征字典feat_dict_10.pkl2。 + +## 训练模型 + +```bash +CUDA_VISIBLE_DEVICES=0 python -u train.py > train.log 2>&1 & +``` + +每一轮数据训练结束后会测试模型效果。 + +加载已经存在的模型并继续训练: + +```bash +# 加载保存的epoch_0并继续训练 +CUDA_VISIBLE_DEVICES=0 python -u train.py --checkpoint=models/epoch_0 > train.log 2>&1 & +``` + +## 预测模型 + +```bash +CUDA_VISIBLE_DEVICES=0 python infer.py --checkpoint=models/epoch_0 +``` + +加载models/epoch_0的模型,对test_data中数据进行预测,评估模型效果。注意:最后一行才是整个test数据集的auc。 + +## 效果 +```text +test auc of epoch 0 is 0.802877 +``` + +第一轮数据训练结束后,test auc为0.802877。 + +继续训练模型易出现过拟合现象,可以通过评估模型选择效果最好的模型作为最终训练结果。 diff --git a/PaddleRec/ctr/deepfm_dygraph/data/aid_data/train_file_idx.txt b/PaddleRec/ctr/deepfm_dygraph/data/aid_data/train_file_idx.txt new file mode 100644 index 0000000000000000000000000000000000000000..680c603c9753ea1bd7be35303d5c8c337b250152 --- /dev/null +++ b/PaddleRec/ctr/deepfm_dygraph/data/aid_data/train_file_idx.txt @@ -0,0 +1 @@ +[156, 51, 24, 103, 195, 35, 188, 16, 224, 173, 116, 3, 226, 11, 64, 94, 6, 70, 197, 164, 220, 77, 172, 194, 227, 12, 65, 129, 39, 38, 75, 210, 215, 36, 46, 185, 76, 222, 108, 78, 120, 71, 33, 189, 135, 97, 90, 219, 105, 205, 136, 167, 106, 29, 157, 125, 217, 121, 175, 143, 200, 45, 179, 37, 86, 140, 225, 47, 20, 228, 4, 209, 177, 178, 171, 58, 48, 118, 9, 149, 55, 192, 82, 17, 43, 54, 93, 96, 159, 216, 18, 206, 223, 104, 132, 182, 60, 109, 28, 180, 44, 166, 128, 27, 163, 141, 229, 102, 150, 7, 83, 198, 41, 191, 114, 117, 122, 161, 130, 174, 176, 160, 201, 49, 112, 69, 165, 95, 133, 92, 59, 110, 151, 203, 67, 169, 21, 66, 80, 22, 23, 152, 40, 127, 111, 186, 72, 26, 190, 42, 0, 63, 53, 124, 137, 85, 126, 196, 187, 208, 98, 25, 15, 170, 193, 168, 202, 31, 146, 147, 113, 32, 204, 131, 68, 84, 213, 19, 81, 79, 162, 199, 107, 50, 2, 207, 10, 181, 144, 139, 134, 62, 155, 142, 214, 212, 61, 52, 101, 99, 158, 145, 13, 153, 56, 184, 221] \ No newline at end of file diff --git a/PaddleRec/ctr/deepfm_dygraph/data/download_preprocess.py b/PaddleRec/ctr/deepfm_dygraph/data/download_preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..a193e2fc3d50a9b94407b0eed2031f9a6cd28259 --- /dev/null +++ b/PaddleRec/ctr/deepfm_dygraph/data/download_preprocess.py @@ -0,0 +1,27 @@ +import os +import shutil +import sys + +LOCAL_PATH = os.path.dirname(os.path.abspath(__file__)) +TOOLS_PATH = os.path.join(LOCAL_PATH, "..", "..", "tools") +sys.path.append(TOOLS_PATH) + +from tools import download_file_and_uncompress, download_file + +if __name__ == '__main__': + url = "https://s3-eu-west-1.amazonaws.com/kaggle-display-advertising-challenge-dataset/dac.tar.gz" + url2 = "https://paddlerec.bj.bcebos.com/deepfm%2Ffeat_dict_10.pkl2" + + print("download and extract starting...") + download_file_and_uncompress(url) + if not os.path.exists("aid_data"): + os.makedirs("aid_data") + download_file(url2, "./aid_data/feat_dict_10.pkl2", True) + print("download and extract finished") + + print("preprocessing...") + os.system("python preprocess.py") + print("preprocess done") + + shutil.rmtree("raw_data") + print("done") diff --git a/PaddleRec/ctr/deepfm_dygraph/data/preprocess.py b/PaddleRec/ctr/deepfm_dygraph/data/preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..b98141ea3a803bf03038c37a17b6086b0ae3ffb8 --- /dev/null +++ b/PaddleRec/ctr/deepfm_dygraph/data/preprocess.py @@ -0,0 +1,120 @@ +from __future__ import division +import os +import 
numpy +from collections import Counter +import shutil +import pickle + + +def get_raw_data(intput_file, raw_data, ins_per_file): + if not os.path.isdir(raw_data): + os.mkdir(raw_data) + + fin = open(intput_file, 'r') + fout = open(os.path.join(raw_data, 'part-0'), 'w') + for line_idx, line in enumerate(fin): + if line_idx % ins_per_file == 0 and line_idx != 0: + fout.close() + cur_part_idx = int(line_idx / ins_per_file) + fout = open( + os.path.join(raw_data, 'part-' + str(cur_part_idx)), 'w') + fout.write(line) + fout.close() + fin.close() + + +def split_data(raw_data, aid_data, train_data, test_data): + split_rate_ = 0.9 + dir_train_file_idx_ = os.path.join(aid_data, 'train_file_idx.txt') + filelist_ = [ + os.path.join(raw_data, 'part-%d' % x) + for x in range(len(os.listdir(raw_data))) + ] + + if not os.path.exists(dir_train_file_idx_): + train_file_idx = list( + numpy.random.choice( + len(filelist_), int(len(filelist_) * split_rate_), False)) + with open(dir_train_file_idx_, 'w') as fout: + fout.write(str(train_file_idx)) + else: + with open(dir_train_file_idx_, 'r') as fin: + train_file_idx = eval(fin.read()) + + for idx in range(len(filelist_)): + if idx in train_file_idx: + shutil.move(filelist_[idx], train_data) + else: + shutil.move(filelist_[idx], test_data) + + +def get_feat_dict(intput_file, aid_data, print_freq=100000, total_ins=45000000): + freq_ = 10 + dir_feat_dict_ = os.path.join(aid_data, 'feat_dict_' + str(freq_) + '.pkl2') + continuous_range_ = range(1, 14) + categorical_range_ = range(14, 40) + + if not os.path.exists(dir_feat_dict_): + # print('generate a feature dict') + # Count the number of occurrences of discrete features + feat_cnt = Counter() + with open(intput_file, 'r') as fin: + for line_idx, line in enumerate(fin): + if line_idx % print_freq == 0: + print(r'generating feature dict {:.2f} %'.format(( + line_idx / total_ins) * 100)) + features = line.rstrip('\n').split('\t') + for idx in categorical_range_: + if features[idx] == '': continue + feat_cnt.update([features[idx]]) + + # Only retain discrete features with high frequency + dis_feat_set = set() + for feat, ot in feat_cnt.items(): + if ot >= freq_: + dis_feat_set.add(feat) + + # Create a dictionary for continuous and discrete features + feat_dict = {} + tc = 1 + # Continuous features + for idx in continuous_range_: + feat_dict[idx] = tc + tc += 1 + for feat in dis_feat_set: + feat_dict[feat] = tc + tc += 1 + # Save dictionary + with open(dir_feat_dict_, 'wb') as fout: + pickle.dump(feat_dict, fout, protocol=2) + print('args.num_feat ', len(feat_dict) + 1) + + +def preprocess(input_file, + outdir, + ins_per_file, + total_ins=None, + print_freq=None): + train_data = os.path.join(outdir, "train_data") + test_data = os.path.join(outdir, "test_data") + aid_data = os.path.join(outdir, "aid_data") + raw_data = os.path.join(outdir, "raw_data") + if not os.path.isdir(train_data): + os.mkdir(train_data) + if not os.path.isdir(test_data): + os.mkdir(test_data) + if not os.path.isdir(aid_data): + os.mkdir(aid_data) + + if print_freq is None: + print_freq = 10 * ins_per_file + + get_raw_data(input_file, raw_data, ins_per_file) + split_data(raw_data, aid_data, train_data, test_data) + get_feat_dict(input_file, aid_data, print_freq, total_ins) + + print('Done!') + + +if __name__ == '__main__': + preprocess('train.txt', './', 200000, 45000000) diff --git a/PaddleRec/ctr/deepfm_dygraph/data_reader.py b/PaddleRec/ctr/deepfm_dygraph/data_reader.py new file mode 100644 index 
0000000000000000000000000000000000000000..7c9d9abcd8f1b64adf903703f5eec995e04016ae --- /dev/null +++ b/PaddleRec/ctr/deepfm_dygraph/data_reader.py @@ -0,0 +1,75 @@ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import pickle +import random + +import paddle + + +class DataGenerator(object): + def __init__(self, feat_dict_path): + # min-max of continuous features in Criteo dataset + self.cont_min_ = [0, -3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] + self.cont_max_ = [ + 5775, 257675, 65535, 969, 23159456, 431037, 56311, 6047, 29019, 46, + 231, 4008, 7393 + ] + self.cont_diff_ = [ + self.cont_max_[i] - self.cont_min_[i] + for i in range(len(self.cont_min_)) + ] + self.continuous_range_ = range(1, 14) + self.categorical_range_ = range(14, 40) + self.feat_dict_ = pickle.load(open(feat_dict_path, 'rb')) + + def _process_line(self, line): + features = line.rstrip('\n').split('\t') + feat_idx = [] + feat_value = [] + for idx in self.continuous_range_: + if features[idx] == '': + feat_idx.append(0) + feat_value.append(0.0) + else: + feat_idx.append(self.feat_dict_[idx]) + feat_value.append( + (float(features[idx]) - self.cont_min_[idx - 1]) / + self.cont_diff_[idx - 1]) + for idx in self.categorical_range_: + if features[idx] == '' or features[idx] not in self.feat_dict_: + feat_idx.append(0) + feat_value.append(0.0) + else: + feat_idx.append(self.feat_dict_[features[idx]]) + feat_value.append(1.0) + label = [int(features[0])] + return feat_idx, feat_value, label + + def train_reader(self, file_list, batch_size, cycle, shuffle=True): + def _reader(): + if shuffle: + random.shuffle(file_list) + while True: + for fn in file_list: + for line in open(fn, 'r'): + yield self._process_line(line) + if not cycle: + break + + return paddle.batch(_reader, batch_size=batch_size) + + +def data_reader(batch_size, + file_list, + feat_dict_path, + cycle=False, + shuffle=False, + data_type="train"): + generator = DataGenerator(feat_dict_path) + + if data_type != "train" and data_type != "test": + print("data type only support train | test") + raise Exception("data type only support train | test") + return generator.train_reader(file_list, batch_size, cycle, shuffle=shuffle) diff --git a/PaddleRec/ctr/deepfm_dygraph/infer.py b/PaddleRec/ctr/deepfm_dygraph/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..7feec87a529fea1b21026c4555d4b1b4dff8f3fb --- /dev/null +++ b/PaddleRec/ctr/deepfm_dygraph/infer.py @@ -0,0 +1,90 @@ +from __future__ import print_function + +import os + +import numpy as np +import paddle.fluid as fluid +from paddle.fluid.dygraph.base import to_variable +import logging +import time + +import data_reader +import utility as utils +from network import DeepFM + +logging.basicConfig( + format='%(asctime)s - %(levelname)s - %(message)s', level=logging.INFO) +logger = logging.getLogger(__name__) + + +def infer(args): + if args.use_gpu: + place = fluid.CUDAPlace(0) + else: + place = fluid.CPUPlace() + with fluid.dygraph.guard(place): + deepfm = DeepFM(args) + + test_filelist = [ + os.path.join(args.test_data_dir, x) + for x in os.listdir(args.test_data_dir) + ] + + test_reader = data_reader.data_reader( + args.batch_size, test_filelist, args.feat_dict, data_type="test") + + # load model + if args.checkpoint: + model_dict, optimizer_dict = fluid.dygraph.load_dygraph( + args.checkpoint) + deepfm.set_dict(model_dict) + logger.info("load model {} finished.".format(args.checkpoint)) + else: + logger.error("no model to load!") 
+ logger.error("please set model to load in --checkpoint first.") + exit(1) + + def eval(): + deepfm.eval() + logger.info("start eval model.") + total_step = 0 + batch_begin = time.time() + auc_metric_test = fluid.metrics.Auc("ROC") + for data in test_reader(): + total_step += 1 + raw_feat_idx, raw_feat_value, label = zip(*data) + raw_feat_idx = np.array(raw_feat_idx, dtype=np.int64) + raw_feat_value = np.array(raw_feat_value, dtype=np.float32) + label = np.array(label, dtype=np.int64) + raw_feat_idx, raw_feat_value, label = [ + to_variable(i) + for i in [raw_feat_idx, raw_feat_value, label] + ] + + predict = deepfm(raw_feat_idx, raw_feat_value, label) + + # for auc + predict_2d = fluid.layers.concat([1 - predict, predict], 1) + auc_metric_test.update( + preds=predict_2d.numpy(), labels=label.numpy()) + + if total_step > 0 and total_step % 100 == 0: + logger.info( + "TEST --> batch: {} auc: {:.6f} speed: {:.2f} ins/s". + format(total_step, + auc_metric_test.eval(), 100 * args.batch_size / ( + time.time() - batch_begin))) + batch_begin = time.time() + + logger.info("test auc is %.6f" % auc_metric_test.eval()) + + begin = time.time() + eval() + logger.info("test finished, cost %f s" % (time.time() - begin)) + + +if __name__ == '__main__': + args = utils.parse_args() + utils.print_arguments(args) + + infer(args) diff --git a/PaddleRec/ctr/deepfm_dygraph/network.py b/PaddleRec/ctr/deepfm_dygraph/network.py new file mode 100644 index 0000000000000000000000000000000000000000..e954d1b81e8eba694662342b9a95f8fe29a6f61c --- /dev/null +++ b/PaddleRec/ctr/deepfm_dygraph/network.py @@ -0,0 +1,125 @@ +import math + +import paddle.fluid as fluid +from paddle.fluid.dygraph.nn import Linear, Embedding + + +class DeepFM(fluid.dygraph.Layer): + def __init__(self, args): + super(DeepFM, self).__init__() + self.args = args + self.init_value_ = 0.1 + + self.fm = FM(args) + self.dnn = DNN(args) + + def forward(self, raw_feat_idx, raw_feat_value, label): + feat_idx = fluid.layers.reshape(raw_feat_idx, + [-1, 1]) # (None * num_field) * 1 + feat_value = fluid.layers.reshape( + raw_feat_value, + [-1, self.args.num_field, 1]) # None * num_field * 1 + + y_first_order, y_second_order, feat_embeddings = self.fm(feat_idx, + feat_value) + y_dnn = self.dnn(feat_embeddings) + + predict = fluid.layers.sigmoid(y_first_order + y_second_order + y_dnn) + + return predict + + +class FM(fluid.dygraph.Layer): + def __init__(self, args): + super(FM, self).__init__() + self.args = args + self.init_value_ = 0.1 + self.embedding_w = Embedding( + size=[self.args.num_feat + 1, 1], + dtype='float32', + padding_idx=0, + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.TruncatedNormalInitializer( + loc=0.0, scale=self.init_value_), + regularizer=fluid.regularizer.L1DecayRegularizer( + self.args.reg))) + self.embedding = Embedding( + size=[self.args.num_feat + 1, self.args.embedding_size], + dtype='float32', + padding_idx=0, + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.TruncatedNormalInitializer( + loc=0.0, + scale=self.init_value_ / + math.sqrt(float(self.args.embedding_size))))) + + def forward(self, feat_idx, feat_value): + # -------------------- first order term -------------------- + first_weights_re = self.embedding_w(feat_idx) + first_weights = fluid.layers.reshape( + first_weights_re, + shape=[-1, self.args.num_field, 1]) # None * num_field * 1 + y_first_order = fluid.layers.reduce_sum(first_weights * feat_value, 1) + + # -------------------- second order term -------------------- + feat_embeddings_re = 
self.embedding(feat_idx) + feat_embeddings = fluid.layers.reshape( + feat_embeddings_re, + shape=[-1, self.args.num_field, self.args.embedding_size + ]) # None * num_field * embedding_size + feat_embeddings = feat_embeddings * feat_value # None * num_field * embedding_size + + # sum_square part + summed_features_emb = fluid.layers.reduce_sum( + feat_embeddings, 1) # None * embedding_size + summed_features_emb_square = fluid.layers.square( + summed_features_emb) # None * embedding_size + + # square_sum part + squared_features_emb = fluid.layers.square( + feat_embeddings) # None * num_field * embedding_size + squared_sum_features_emb = fluid.layers.reduce_sum( + squared_features_emb, 1) # None * embedding_size + + y_second_order = 0.5 * fluid.layers.reduce_sum( + summed_features_emb_square - squared_sum_features_emb, + 1, + keep_dim=True) # None * 1 + + return y_first_order, y_second_order, feat_embeddings + + +class DNN(fluid.dygraph.Layer): + def __init__(self, args): + super(DNN, self).__init__() + self.args = args + self.init_value_ = 0.1 + sizes = [self.args.num_field * self.args.embedding_size] + self.args.layer_sizes + [1] + acts = [self.args.act + for _ in range(len(self.args.layer_sizes))] + [None] + w_scales = [ + self.init_value_ / math.sqrt(float(10)) + for _ in range(len(self.args.layer_sizes)) + ] + [self.init_value_] + self.linears = [] + for i in range(len(self.args.layer_sizes) + 1): + linear = Linear( + sizes[i], + sizes[i + 1], + act=acts[i], + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.TruncatedNormalInitializer( + loc=0.0, scale=w_scales[i])), + bias_attr=fluid.ParamAttr( + initializer=fluid.initializer.TruncatedNormalInitializer( + loc=0.0, scale=self.init_value_))) + self.add_sublayer('linear_%d' % i, linear) + self.linears.append(linear) + + def forward(self, feat_embeddings): + y_dnn = fluid.layers.reshape( + feat_embeddings, + [-1, self.args.num_field * self.args.embedding_size]) + for linear in self.linears: + y_dnn = linear(y_dnn) + return y_dnn diff --git a/PaddleRec/ctr/deepfm_dygraph/train.py b/PaddleRec/ctr/deepfm_dygraph/train.py new file mode 100644 index 0000000000000000000000000000000000000000..97a7626cc325fac708e37913229c51ea09acbd44 --- /dev/null +++ b/PaddleRec/ctr/deepfm_dygraph/train.py @@ -0,0 +1,156 @@ +from __future__ import print_function + +import os + +import numpy as np +import paddle.fluid as fluid +import time +from paddle.fluid.dygraph.base import to_variable +import logging + +import data_reader +import utility as utils +from network import DeepFM + +logging.basicConfig( + format='%(asctime)s - %(levelname)s - %(message)s', level=logging.INFO) +logger = logging.getLogger(__name__) + + +def train(args): + if args.use_gpu: + place = fluid.CUDAPlace(0) + else: + place = fluid.CPUPlace() + with fluid.dygraph.guard(place): + deepfm = DeepFM(args) + + train_filelist = [ + os.path.join(args.train_data_dir, x) + for x in os.listdir(args.train_data_dir) + ] + test_filelist = [ + os.path.join(args.test_data_dir, x) + for x in os.listdir(args.test_data_dir) + ] + + train_reader = data_reader.data_reader( + args.batch_size, train_filelist, args.feat_dict, data_type="train") + test_reader = data_reader.data_reader( + args.batch_size, test_filelist, args.feat_dict, data_type="test") + + def eval(epoch): + deepfm.eval() + logger.info("start eval model.") + total_step = 0.0 + auc_metric_test = fluid.metrics.Auc("ROC") + for data in test_reader(): + total_step += 1 + raw_feat_idx, raw_feat_value, label = zip(*data) + raw_feat_idx = 
np.array(raw_feat_idx, dtype=np.int64) + raw_feat_value = np.array(raw_feat_value, dtype=np.float32) + label = np.array(label, dtype=np.int64) + raw_feat_idx, raw_feat_value, label = [ + to_variable(i) + for i in [raw_feat_idx, raw_feat_value, label] + ] + + predict = deepfm(raw_feat_idx, raw_feat_value, label) + + # for auc + predict_2d = fluid.layers.concat([1 - predict, predict], 1) + auc_metric_test.update( + preds=predict_2d.numpy(), labels=label.numpy()) + + logger.info("test auc of epoch %d is %.6f" % + (epoch, auc_metric_test.eval())) + + optimizer = fluid.optimizer.Adam( + parameter_list=deepfm.parameters(), + regularization=fluid.regularizer.L2DecayRegularizer(args.reg)) + + # load model if exists + start_epoch = 0 + if args.checkpoint: + model_dict, optimizer_dict = fluid.dygraph.load_dygraph( + args.checkpoint) + deepfm.set_dict(model_dict) + optimizer.set_dict(optimizer_dict) + start_epoch = int( + os.path.basename(args.checkpoint).split("_")[ + -1]) + 1 # get next train epoch + logger.info("load model {} finished.".format(args.checkpoint)) + + for epoch in range(start_epoch, args.num_epoch): + begin = time.time() + batch_begin = time.time() + batch_id = 0 + total_loss = 0.0 + auc_metric = fluid.metrics.Auc("ROC") + logger.info("training epoch {} start.".format(epoch)) + + for data in train_reader(): + raw_feat_idx, raw_feat_value, label = zip(*data) + raw_feat_idx = np.array(raw_feat_idx, dtype=np.int64) + raw_feat_value = np.array(raw_feat_value, dtype=np.float32) + label = np.array(label, dtype=np.int64) + raw_feat_idx, raw_feat_value, label = [ + to_variable(i) + for i in [raw_feat_idx, raw_feat_value, label] + ] + + predict = deepfm(raw_feat_idx, raw_feat_value, label) + + loss = fluid.layers.log_loss( + input=predict, + label=fluid.layers.cast( + label, dtype="float32")) + batch_loss = fluid.layers.reduce_sum(loss) + + total_loss += batch_loss.numpy().item() + + batch_loss.backward() + optimizer.minimize(batch_loss) + deepfm.clear_gradients() + + # for auc + predict_2d = fluid.layers.concat([1 - predict, predict], 1) + auc_metric.update( + preds=predict_2d.numpy(), labels=label.numpy()) + + if batch_id > 0 and batch_id % 100 == 0: + logger.info( + "epoch: {}, batch_id: {}, loss: {:.6f}, auc: {:.6f}, speed: {:.2f} ins/s". 
+ format(epoch, batch_id, total_loss / args.batch_size / + 100, + auc_metric.eval(), 100 * args.batch_size / ( + time.time() - batch_begin))) + batch_begin = time.time() + total_loss = 0.0 + + batch_id += 1 + logger.info("epoch %d is finished and takes %f s" % + (epoch, time.time() - begin)) + # save model and optimizer + logger.info("going to save epoch {} model and optimizer.".format( + epoch)) + fluid.dygraph.save_dygraph( + deepfm.state_dict(), + model_path=os.path.join(args.model_output_dir, + "epoch_" + str(epoch))) + fluid.dygraph.save_dygraph( + optimizer.state_dict(), + model_path=os.path.join(args.model_output_dir, + "epoch_" + str(epoch))) + logger.info("save epoch {} finished.".format(epoch)) + # eval model + deepfm.eval() + eval(epoch) + deepfm.train() + + +if __name__ == '__main__': + args = utils.parse_args() + utils.print_arguments(args) + + train(args) diff --git a/PaddleRec/ctr/deepfm_dygraph/utility.py b/PaddleRec/ctr/deepfm_dygraph/utility.py new file mode 100644 index 0000000000000000000000000000000000000000..31a80bbb8b1d21edd7e88ce9440c741732e9b81f --- /dev/null +++ b/PaddleRec/ctr/deepfm_dygraph/utility.py @@ -0,0 +1,98 @@ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import argparse +import distutils.util + +import numpy as np +import six + + +def parse_args(): + parser = argparse.ArgumentParser(description="deepfm dygraph") + parser.add_argument( + '--train_data_dir', + type=str, + default='data/train_data', + help='The path of train data (default: data/train_data)') + parser.add_argument( + '--test_data_dir', + type=str, + default='data/test_data', + help='The path of test data (default: models)') + parser.add_argument( + '--model_output_dir', + type=str, + default='models', + help='The path for model to store (default: models)') + parser.add_argument( + '--checkpoint', + type=str, + default='', + help='The path for model and optimizer to load (default: "")') + parser.add_argument( + '--feat_dict', + type=str, + default='data/aid_data/feat_dict_10.pkl2', + help='The path of feat_dict') + parser.add_argument( + '--num_epoch', + type=int, + default=10, + help="The number of epochs to train (default: 10)") + parser.add_argument( + '--batch_size', + type=int, + default=4096, + help="The size of mini-batch (default:4096)") + parser.add_argument( + '--use_gpu', type=distutils.util.strtobool, default=True) + + parser.add_argument( + '--embedding_size', + type=int, + default=10, + help="The size for embedding layer (default:10)") + parser.add_argument( + '--layer_sizes', + nargs='+', + type=int, + default=[400, 400, 400], + help='The size of each layers (default: [400, 400, 400])') + parser.add_argument( + '--act', + type=str, + default='relu', + help='The activation of each layers (default: relu)') + parser.add_argument( + '--lr', type=float, default=1e-3, help='Learning rate (default: 1e-3)') + parser.add_argument( + '--reg', type=float, default=1e-4, help=' (default: 1e-4)') + parser.add_argument('--num_field', type=int, default=39) + parser.add_argument('--num_feat', type=int, default=1086460) # 2090493 + + return parser.parse_args() + + +def print_arguments(args): + """Print argparse's arguments. + Usage: + .. code-block:: python + parser = argparse.ArgumentParser() + parser.add_argument("name", default="Jonh", type=str, help="User name.") + args = parser.parse_args() + print_arguments(args) + :param args: Input argparse.Namespace for printing. 
+ :type args: argparse.Namespace + """ + print("----------- Configuration Arguments -----------") + for arg, value in sorted(six.iteritems(vars(args))): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def to_numpy(data): + flattened_data = np.concatenate(data, axis=0).astype("int64") + flattened_data = flattened_data.reshape([len(flattened_data), 1]) + return flattened_data diff --git a/PaddleRec/ctr/dnn/README.cn.md b/PaddleRec/ctr/dnn/README.cn.md index 47d3b22099886b40499be3168f4f2759c9e3facd..f93d386939af94096a276e70f4776ed9c875a09e 100644 --- a/PaddleRec/ctr/dnn/README.cn.md +++ b/PaddleRec/ctr/dnn/README.cn.md @@ -15,6 +15,7 @@ ``` ## 运行环境 +**要求使用PaddlePaddle 1.6及以上版本或适当的develop版本。** 需要先安装PaddlePaddle Fluid,然后运行: ```shell @@ -28,7 +29,7 @@ pip install -r requirements.txt 下载数据集: ```bash -cd data && ./download.sh && cd .. +cd data && python download.py && cd .. ``` ## 模型 @@ -48,13 +49,12 @@ python train.py \ 2>&1 | tee train.log ``` -训练到第1轮的第40000个batch后,测试的AUC为0.801178,误差(cost)为0.445196。 - ### 分布式训练 本地启动一个2 trainer 2 pserver的分布式训练任务,分布式场景下训练数据会按照trainer的id进行切分,保证trainer之间的训练数据不会重叠,提高训练效率 ```bash +# 该sh不支持Windows sh cluster_train.sh ``` @@ -64,10 +64,13 @@ sh cluster_train.sh 对测试集进行预测: ```bash python infer.py \ - --model_path models/pass-0/ \ - --data_path data/raw/valid.txt + --model_path models/pass-2/ \ + --data_path data/raw/train.txt ``` -注意:infer.py跑完最后输出的AUC才是整个预测文件的整体AUC。 + +加载pass-2的模型, 预期测试AUC为`0.794`。 + +注意:infer.py跑完最后输出的AUC才是整个预测文件的整体AUC。train.txt文件在reader.py中被分为训练和测试两部分,所以这里数据不会和训练重叠。 ## 在百度云上运行集群训练 1. 参考文档 [在百度云上启动Fluid分布式训练](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/user_guides/howto/training/train_on_baidu_cloud_cn.rst) 在百度云上部署一个CPU集群。 diff --git a/PaddleRec/ctr/dnn/README.md b/PaddleRec/ctr/dnn/README.md index 022fa47c9248269f184b316ba7642324cbdc8717..2599a20779e48bf5d03233359fb40cef1070bec8 100644 --- a/PaddleRec/ctr/dnn/README.md +++ b/PaddleRec/ctr/dnn/README.md @@ -20,6 +20,7 @@ factorization machines, please refer to the paper [factorization machines](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf) ## Environment +**Now all models in PaddleRec require PaddlePaddle version 1.6 or higher, or suitable develop version.** You should install PaddlePaddle Fluid first, and run: ```shell @@ -38,7 +39,7 @@ categorical features. For the test dataset, the labels are omitted. Download dataset: ```bash -cd data && ./download.sh && cd .. +cd data && python download.py && cd .. ``` ## Model @@ -63,15 +64,13 @@ python train.py \ 2>&1 | tee train.log ``` -After training pass 1 batch 40000, the testing AUC is `0.801178` and the testing -cost is `0.445196`. - ### Distributed Train Run a 2 pserver 2 trainer distribute training on a single machine. In distributed training setting, training data is splited by trainer_id, so that training data do not overlap among trainers ```bash +# this shell not support Windows sh cluster_train.sh ``` @@ -81,9 +80,12 @@ The command line options for infering can be listed by `python infer.py -h`. To make inference for the test dataset: ```bash python infer.py \ - --model_path models/ \ + --model_path models/pass-2 \ --data_path data/raw/train.txt ``` + +Load models in `models/pass-2`, the expected testing Auc is `0.794`. + Note: The AUC value in the last log info is the total AUC for all test dataset. Here, train.txt is splited inside the reader.py so that validation data does not have overlap with training data. 
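
The reason only the last logged value is meaningful is that the streaming AUC metric accumulates its statistics across batches, so each intermediate log already covers every batch seen so far. Below is a minimal, self-contained sketch of that accumulation pattern using the same `fluid.metrics.Auc` helper that the dygraph DeepFM code in this change relies on; the random predictions and labels are purely illustrative and not part of the actual reader.

```python
import numpy as np
import paddle.fluid as fluid

# Streaming AUC: the metric object keeps its statistics between update()
# calls, so eval() always reports the AUC over every batch fed in so far,
# not just the latest one.
auc_metric = fluid.metrics.Auc("ROC")

np.random.seed(0)
for batch_id in range(3):
    # Illustrative random data: predict is P(label=1) for a batch of 8 samples.
    predict = np.random.uniform(0, 1, size=(8, 1)).astype("float32")
    predict_2d = np.concatenate([1 - predict, predict], axis=1)
    labels = np.random.randint(0, 2, size=(8, 1)).astype("int64")

    auc_metric.update(preds=predict_2d, labels=labels)
    # This value covers all batches seen so far, not only the current one.
    print("running AUC after batch %d: %.6f" % (batch_id, auc_metric.eval()))

# Only this final value is the AUC over the full (toy) dataset.
print("overall AUC: %.6f" % auc_metric.eval())
```

To get a fresh AUC for a new dataset, construct a new metric object, or reset the accumulated `auc_states` variables to zero before evaluation, as the updated static-graph `infer.py` in this change does.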
## Train on Baidu Cloud diff --git a/PaddleRec/ctr/dnn/data/download.py b/PaddleRec/ctr/dnn/data/download.py new file mode 100644 index 0000000000000000000000000000000000000000..b88191ffa85cd9ce36d4b5d20e8ec30c8434e5c9 --- /dev/null +++ b/PaddleRec/ctr/dnn/data/download.py @@ -0,0 +1,28 @@ +import os +import shutil +import sys +import glob + +LOCAL_PATH = os.path.dirname(os.path.abspath(__file__)) +TOOLS_PATH = os.path.join(LOCAL_PATH, "..", "..", "tools") +sys.path.append(TOOLS_PATH) + +from tools import download_file_and_uncompress + +if __name__ == '__main__': + url = "https://s3-eu-west-1.amazonaws.com/kaggle-display-advertising-challenge-dataset/dac.tar.gz" + + print("download and extract starting...") + download_file_and_uncompress(url) + print("download and extract finished") + + if os.path.exists("raw"): + shutil.rmtree("raw") + os.mkdir("raw") + + # mv ./*.txt raw/ + files = glob.glob("*.txt") + for f in files: + shutil.move(f, "raw") + + print("done") diff --git a/PaddleRec/ctr/dnn/infer.py b/PaddleRec/ctr/dnn/infer.py index 2f622629ec487b076f382f141061804a760c71ac..33dd00dffae68826f449d7cb1457ef8152dfe764 100644 --- a/PaddleRec/ctr/dnn/infer.py +++ b/PaddleRec/ctr/dnn/infer.py @@ -10,6 +10,7 @@ import paddle.fluid as fluid import reader from network_conf import ctr_dnn_model +import utils logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s') logger = logging.getLogger("fluid") @@ -59,36 +60,37 @@ def infer(): startup_program = fluid.framework.Program() test_program = fluid.framework.Program() - with fluid.framework.program_guard(test_program, startup_program): - loss, auc_var, batch_auc_var, _, data_list = ctr_dnn_model( - args.embedding_size, args.sparse_feature_dim, False) + with fluid.scope_guard(inference_scope): + with fluid.framework.program_guard(test_program, startup_program): + loss, auc_var, batch_auc_var, _, data_list, auc_states = ctr_dnn_model( + args.embedding_size, args.sparse_feature_dim, False) - exe = fluid.Executor(place) + exe = fluid.Executor(place) - feeder = fluid.DataFeeder(feed_list=data_list, place=place) + feeder = fluid.DataFeeder(feed_list=data_list, place=place) - fluid.io.load_persistables( - executor=exe, - dirname=args.model_path, - main_program=fluid.default_main_program()) + fluid.io.load_persistables( + executor=exe, + dirname=args.model_path, + main_program=fluid.default_main_program()) - def set_zero(var_name): - param = inference_scope.var(var_name).get_tensor() - param_array = np.zeros(param._get_dims()).astype("int64") - param.set(param_array, place) + def set_zero(var_name): + param = inference_scope.var(var_name).get_tensor() + param_array = np.zeros(param._get_dims()).astype("int64") + param.set(param_array, place) - auc_states_names = ['_generated_var_2', '_generated_var_3'] - for name in auc_states_names: - set_zero(name) + for var in auc_states: + set_zero(var.name) - for batch_id, data in enumerate(test_reader()): - loss_val, auc_val = exe.run(test_program, - feed=feeder.feed(data), - fetch_list=[loss, auc_var]) - if batch_id % 100 == 0: - logger.info("TEST --> batch: {} loss: {} auc: {}".format( - batch_id, loss_val / args.batch_size, auc_val)) + for batch_id, data in enumerate(test_reader()): + loss_val, auc_val = exe.run(test_program, + feed=feeder.feed(data), + fetch_list=[loss, auc_var]) + if batch_id % 100 == 0: + logger.info("TEST --> batch: {} loss: {} auc: {}".format( + batch_id, loss_val / args.batch_size, auc_val)) if __name__ == '__main__': + utils.check_version() infer() diff --git 
a/PaddleRec/ctr/dnn/network_conf.py b/PaddleRec/ctr/dnn/network_conf.py index bb23d4844a8785ddb4dde49e4b5d83b1359b9e07..56e93d85f8afff7f917ad76766ddcc31f83925f0 100644 --- a/PaddleRec/ctr/dnn/network_conf.py +++ b/PaddleRec/ctr/dnn/network_conf.py @@ -31,20 +31,24 @@ def ctr_deepfm_model(factor_size, sparse_feature_dim, dense_feature_dim, """ sparse_fm_layer """ - first_embeddings = fluid.layers.embedding( + first_embeddings = fluid.embedding( input=input, dtype='float32', size=[emb_dict_size, 1], is_sparse=True) + first_embeddings = fluid.layers.squeeze( + input=first_embeddings, axes=[1]) first_order = fluid.layers.sequence_pool( input=first_embeddings, pool_type='sum') - nonzero_embeddings = fluid.layers.embedding( + nonzero_embeddings = fluid.embedding( input=input, dtype='float32', size=[emb_dict_size, factor_size], param_attr=fm_param_attr, is_sparse=True) + nonzero_embeddings = fluid.layers.squeeze( + input=nonzero_embeddings, axes=[1]) summed_features_emb = fluid.layers.sequence_pool( input=nonzero_embeddings, pool_type='sum') summed_features_emb_square = fluid.layers.square(summed_features_emb) @@ -57,8 +61,8 @@ def ctr_deepfm_model(factor_size, sparse_feature_dim, dense_feature_dim, summed_features_emb_square - squared_sum_features_emb) return first_order, second_order - dense_input = fluid.layers.data( - name="dense_input", shape=[dense_feature_dim], dtype='float32') + dense_input = fluid.data( + name="dense_input", shape=[None, dense_feature_dim], dtype='float32') sparse_input_ids = [ fluid.layers.data( @@ -66,7 +70,7 @@ def ctr_deepfm_model(factor_size, sparse_feature_dim, dense_feature_dim, for i in range(1, 27) ] - label = fluid.layers.data(name='label', shape=[1], dtype='int64') + label = fluid.data(name='label', shape=[None, 1], dtype='int64') datas = [dense_input] + sparse_input_ids + [label] @@ -96,6 +100,7 @@ def ctr_deepfm_model(factor_size, sparse_feature_dim, dense_feature_dim, size=[sparse_feature_dim, factor_size], param_attr=sparse_fm_param_attr, is_sparse=True) + emb = fluid.layers.squeeze(input=emb, axes=[1]) return fluid.layers.sequence_pool(input=emb, pool_type='average') sparse_embed_seq = list(map(embedding_layer, sparse_input_ids)) @@ -139,7 +144,7 @@ def ctr_deepfm_model(factor_size, sparse_feature_dim, dense_feature_dim, def ctr_dnn_model(embedding_size, sparse_feature_dim, use_py_reader=True): def embedding_layer(input): """embedding_layer""" - emb = fluid.layers.embedding( + emb = fluid.embedding( input=input, is_sparse=True, # you need to patch https://github.com/PaddlePaddle/Paddle/pull/14190 @@ -149,18 +154,19 @@ def ctr_dnn_model(embedding_size, sparse_feature_dim, use_py_reader=True): param_attr=fluid.ParamAttr( name="SparseFeatFactors", initializer=fluid.initializer.Uniform())) + emb = fluid.layers.squeeze(input=emb, axes=[1]) return fluid.layers.sequence_pool(input=emb, pool_type='average') - dense_input = fluid.layers.data( - name="dense_input", shape=[dense_feature_dim], dtype='float32') + dense_input = fluid.data( + name="dense_input", shape=[None, dense_feature_dim], dtype='float32') sparse_input_ids = [ - fluid.layers.data( - name="C" + str(i), shape=[1], lod_level=1, dtype='int64') + fluid.data( + name="C" + str(i), shape=[None, 1], lod_level=1, dtype='int64') for i in range(1, 27) ] - label = fluid.layers.data(name='label', shape=[1], dtype='int64') + label = fluid.data(name='label', shape=[None, 1], dtype='int64') words = [dense_input] + sparse_input_ids + [label] @@ -195,16 +201,16 @@ def ctr_dnn_model(embedding_size, sparse_feature_dim, 
use_py_reader=True): initializer=fluid.initializer.Normal( scale=1 / math.sqrt(fc2.shape[1])))) predict = fluid.layers.fc(input=fc3, - size=2, - act='softmax', + size=1, + act='sigmoid', param_attr=fluid.ParamAttr( initializer=fluid.initializer.Normal( scale=1 / math.sqrt(fc3.shape[1])))) - cost = fluid.layers.cross_entropy(input=predict, label=words[-1]) + cost = fluid.layers.log_loss(input=predict, label=fluid.layers.cast(words[-1], dtype="float32")) avg_cost = fluid.layers.reduce_sum(cost) accuracy = fluid.layers.accuracy(input=predict, label=words[-1]) auc_var, batch_auc_var, auc_states = \ fluid.layers.auc(input=predict, label=words[-1], num_thresholds=2 ** 12, slide_steps=20) - return avg_cost, auc_var, batch_auc_var, py_reader, words + return avg_cost, auc_var, batch_auc_var, py_reader, words, auc_states diff --git a/PaddleRec/ctr/dnn/reader.py b/PaddleRec/ctr/dnn/reader.py index d6eb1a77cb73c43577063f1255ffca308c29fcf1..8568964701dd53322975e666379da449fe577cf1 100644 --- a/PaddleRec/ctr/dnn/reader.py +++ b/PaddleRec/ctr/dnn/reader.py @@ -1,3 +1,6 @@ +import mmh3 + + class Dataset: def __init__(self): pass @@ -43,7 +46,8 @@ class CriteoDataset(Dataset): self.cont_diff_[idx - 1]) for idx in self.categorical_range_: sparse_feature.append([ - hash(str(idx) + features[idx]) % self.hash_dim_ + mmh3.hash(str(idx) + features[idx]) % + self.hash_dim_ ]) label = [int(features[0])] diff --git a/PaddleRec/ctr/dnn/requirements.txt b/PaddleRec/ctr/dnn/requirements.txt index dca9a909647e3b066931de2909c2d1e65c78c995..d4db0c1a7973b3aacffdcbffd42dec0f00eaa607 100644 --- a/PaddleRec/ctr/dnn/requirements.txt +++ b/PaddleRec/ctr/dnn/requirements.txt @@ -1 +1,2 @@ click +mmh3 diff --git a/PaddleRec/ctr/dnn/train.py b/PaddleRec/ctr/dnn/train.py index 69e51b9db9bec7d38658e442354363b0e72301c4..599bcad18793e865bce6b028f89245594ba0c70f 100644 --- a/PaddleRec/ctr/dnn/train.py +++ b/PaddleRec/ctr/dnn/train.py @@ -13,6 +13,7 @@ import paddle.fluid as fluid import reader from network_conf import ctr_dnn_model from multiprocessing import cpu_count +import utils # disable gpu training for this example os.environ["CUDA_VISIBLE_DEVICES"] = "" @@ -172,8 +173,8 @@ def train_loop(args, train_program, py_reader, loss, auc_var, batch_auc_var, .format(pass_id, batch_id, loss_val / args.batch_size, auc_val, batch_auc_val)) if batch_id % 1000 == 0 and batch_id != 0: - model_dir = args.model_output_dir + '/batch-' + str( - batch_id) + model_dir = os.path.join(args.model_output_dir, + 'batch-' + str(batch_id)) if args.trainer_id == 0: fluid.io.save_persistables( executor=exe, @@ -187,7 +188,7 @@ def train_loop(args, train_program, py_reader, loss, auc_var, batch_auc_var, total_time += time.time() - pass_start - model_dir = args.model_output_dir + '/pass-' + str(pass_id) + model_dir = os.path.join(args.model_output_dir, 'pass-' + str(pass_id)) if args.trainer_id == 0: fluid.io.save_persistables( executor=exe, @@ -214,7 +215,7 @@ def train(): if not os.path.isdir(args.model_output_dir): os.mkdir(args.model_output_dir) - loss, auc_var, batch_auc_var, py_reader, _ = ctr_dnn_model( + loss, auc_var, batch_auc_var, py_reader, _, auc_states = ctr_dnn_model( args.embedding_size, args.sparse_feature_dim) optimizer = fluid.optimizer.Adam(learning_rate=1e-4) optimizer.minimize(loss) @@ -269,4 +270,5 @@ def get_cards(args): if __name__ == '__main__': + utils.check_version() train() diff --git a/PaddleRec/ctr/dnn/utils.py b/PaddleRec/ctr/dnn/utils.py new file mode 100644 index 
0000000000000000000000000000000000000000..779b129e574e4611f29202f71d73856ecb888069 --- /dev/null +++ b/PaddleRec/ctr/dnn/utils.py @@ -0,0 +1,24 @@ +import sys +import paddle.fluid as fluid +import logging + +logging.basicConfig() +logger = logging.getLogger(__name__) + +__all__ = ['check_version'] + + +def check_version(): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." \ + + try: + fluid.require_version('1.6.0') + except Exception as e: + logger.error(err) + sys.exit(1) diff --git a/PaddleRec/ctr/tools/tools.py b/PaddleRec/ctr/tools/tools.py new file mode 100644 index 0000000000000000000000000000000000000000..da34a027c027d6809603869f946499ec45edf8e6 --- /dev/null +++ b/PaddleRec/ctr/tools/tools.py @@ -0,0 +1,133 @@ +import os +import time +import shutil +import requests +import sys +import tarfile +import zipfile +import platform +import functools + +lasttime = time.time() +FLUSH_INTERVAL = 0.1 + +LOCAL_PATH = os.path.dirname(os.path.abspath(__file__)) + + +def get_platform(): + return platform.platform() + + +def is_windows(): + return get_platform().lower().startswith("windows") + + +def progress(str, end=False): + global lasttime + if end: + str += "\n" + lasttime = 0 + if time.time() - lasttime >= FLUSH_INTERVAL: + sys.stdout.write("\r%s" % str) + lasttime = time.time() + sys.stdout.flush() + + +def download_file(url, savepath, print_progress): + r = requests.get(url, stream=True) + total_length = r.headers.get('content-length') + + if total_length is None: + with open(savepath, 'wb') as f: + shutil.copyfileobj(r.raw, f) + else: + with open(savepath, 'wb') as f: + dl = 0 + total_length = int(total_length) + starttime = time.time() + if print_progress: + print("Downloading %s" % os.path.basename(savepath)) + for data in r.iter_content(chunk_size=4096): + dl += len(data) + f.write(data) + if print_progress: + done = int(50 * dl / total_length) + progress("[%-50s] %.2f%%" % + ('=' * done, float(100 * dl) / total_length)) + if print_progress: + progress("[%-50s] %.2f%%" % ('=' * 50, 100), end=True) + + +def _uncompress_file(filepath, extrapath, delete_file, print_progress): + if print_progress: + print("Uncompress %s" % os.path.basename(filepath)) + + if filepath.endswith("zip"): + handler = _uncompress_file_zip + elif filepath.endswith("tgz"): + handler = _uncompress_file_tar + else: + handler = functools.partial(_uncompress_file_tar, mode="r") + + for total_num, index, rootpath in handler(filepath, extrapath): + if print_progress: + done = int(50 * float(index) / total_num) + progress("[%-50s] %.2f%%" % + ('=' * done, float(100 * index) / total_num)) + if print_progress: + progress("[%-50s] %.2f%%" % ('=' * 50, 100), end=True) + + if delete_file: + os.remove(filepath) + + return rootpath + + +def _uncompress_file_zip(filepath, extrapath): + files = zipfile.ZipFile(filepath, 'r') + filelist = files.namelist() + rootpath = filelist[0] + total_num = len(filelist) + for index, file in enumerate(filelist): + files.extract(file, extrapath) + yield total_num, index, rootpath + files.close() + yield total_num, index, rootpath + + +def _uncompress_file_tar(filepath, extrapath, mode="r:gz"): + files = tarfile.open(filepath, mode) + filelist = files.getnames() + total_num = len(filelist) + rootpath = filelist[0] + for index, file in enumerate(filelist): + files.extract(file, 
extrapath) + yield total_num, index, rootpath + files.close() + yield total_num, index, rootpath + + +def download_file_and_uncompress(url, + savepath=None, + savename=None, + extrapath=None, + print_progress=True, + cover=False, + delete_file=False): + if savepath is None: + savepath = "." + + if extrapath is None: + extrapath = "." + + if savename is None: + savename = url.split("/")[-1] + savepath = os.path.join(savepath, savename) + + if cover: + if os.path.exists(savepath): + shutil.rmtree(savepath) + + if not os.path.exists(savepath): + download_file(url, savepath, print_progress) + _ = _uncompress_file(savepath, extrapath, delete_file, print_progress) diff --git a/PaddleRec/ctr/xdeepfm/README.md b/PaddleRec/ctr/xdeepfm/README.md index 9b2475cd789db6e298db31194713e27064849d8b..57341fc10e62ae1b5be80f21546a6d31d8610cde 100644 --- a/PaddleRec/ctr/xdeepfm/README.md +++ b/PaddleRec/ctr/xdeepfm/README.md @@ -8,11 +8,11 @@ demo数据集,在data目录下执行命令,下载数据 ```bash -sh download.sh +python download.py ``` ## 环境 -- PaddlePaddle 1.6 +- **要求使用PaddlePaddle 1.6及以上版本或适当的develop版本。** ## 单机训练 ```bash @@ -39,6 +39,7 @@ test_epoch设置加载第10轮训练的模型。 数据下载同上面命令。 ```bash +# 该sh不支持Windows sh cluster_train.sh ``` 参数说明: @@ -64,7 +65,7 @@ python infer.py --model_output_dir cluster_model --test_epoch 10 --use_gpu=0 - 0号trainer保存模型参数 - 每次训练完成后需要手动停止pserver进程,使用以下命令查看pserver进程: - + >ps -ef | grep python - 数据读取使用dataset模式,目前仅支持运行在Linux环境下 diff --git a/PaddleRec/ctr/xdeepfm/args.py b/PaddleRec/ctr/xdeepfm/args.py index 70f0863f98b8c8dbcd27bb510ae2f5644699fb3f..8b30dbc0dc040680b362d6282e850fb450fc4e3f 100644 --- a/PaddleRec/ctr/xdeepfm/args.py +++ b/PaddleRec/ctr/xdeepfm/args.py @@ -75,5 +75,7 @@ def parse_args(): required=False, default=False, help='embedding will use sparse or not, (default: False)') + parser.add_argument( + '--enable_ce', action='store_true', help='If set, run the task with continuous evaluation logs.') return parser.parse_args() diff --git a/PaddleRec/ctr/xdeepfm/cluster_train.py b/PaddleRec/ctr/xdeepfm/cluster_train.py index 97135b89b71209833304cd968176c0725b1ce49c..e1d318b54d0dfca614cc90d8ad2350a29d5e1c06 100644 --- a/PaddleRec/ctr/xdeepfm/cluster_train.py +++ b/PaddleRec/ctr/xdeepfm/cluster_train.py @@ -5,6 +5,7 @@ import time import network_conf import paddle.fluid as fluid +import utils def parse_args(): @@ -121,7 +122,7 @@ def train(): if not os.path.isdir(args.model_output_dir): os.mkdir(args.model_output_dir) - loss, auc, data_list = eval('network_conf.' + args.model_name)( + loss, auc, data_list, auc_states = eval('network_conf.' 
+ args.model_name)( args.embedding_size, args.num_field, args.num_feat, args.layer_sizes_dnn, args.act, args.reg, args.layer_sizes_cin, args.is_sparse) @@ -138,7 +139,7 @@ def train(): dataset.set_pipe_command('python criteo_reader.py') dataset.set_batch_size(args.batch_size) dataset.set_filelist([ - args.train_data_dir + '/' + x + os.path.join(args.train_data_dir, x) for x in os.listdir(args.train_data_dir) ]) @@ -160,7 +161,8 @@ def train(): fetch_info=['loss', 'auc'], debug=False, print_period=args.print_steps) - model_dir = args.model_output_dir + '/epoch_' + str(epoch_id + 1) + model_dir = os.path.join(args.model_output_dir, + 'epoch_' + str(epoch_id + 1)) sys.stderr.write('epoch%d is finished and takes %f s\n' % ( (epoch_id + 1), time.time() - start)) if args.trainer_id == 0: # only trainer 0 save model @@ -193,4 +195,5 @@ def train(): if __name__ == "__main__": + utils.check_version() train() diff --git a/PaddleRec/ctr/xdeepfm/data/download.py b/PaddleRec/ctr/xdeepfm/data/download.py new file mode 100644 index 0000000000000000000000000000000000000000..4b21696704d28f97cc152e1d4d9ffa96c61c6854 --- /dev/null +++ b/PaddleRec/ctr/xdeepfm/data/download.py @@ -0,0 +1,28 @@ +import os +import shutil +import sys + +LOCAL_PATH = os.path.dirname(os.path.abspath(__file__)) +TOOLS_PATH = os.path.join(LOCAL_PATH, "..", "..", "tools") +sys.path.append(TOOLS_PATH) + +from tools import download_file_and_uncompress, download_file + +if __name__ == '__main__': + url_train = "https://paddlerec.bj.bcebos.com/xdeepfm%2Ftr" + url_test = "https://paddlerec.bj.bcebos.com/xdeepfm%2Fev" + + train_dir = "train_data" + test_dir = "test_data" + + if not os.path.exists(train_dir): + os.mkdir(train_dir) + if not os.path.exists(test_dir): + os.mkdir(test_dir) + + print("download and extract starting...") + download_file(url_train, "./train_data/tr", True) + download_file(url_test, "./test_data/ev", True) + print("download and extract finished") + + print("done") diff --git a/PaddleRec/ctr/xdeepfm/data/download.sh b/PaddleRec/ctr/xdeepfm/data/download.sh deleted file mode 100644 index 95438938b6fd701c55b316faf9716fc4d11de8a7..0000000000000000000000000000000000000000 --- a/PaddleRec/ctr/xdeepfm/data/download.sh +++ /dev/null @@ -1,12 +0,0 @@ -#!/bin/bash - -if [ ! -d "train_data" ]; then - mkdir train_data -fi - -if [ ! 
-d "test_data" ]; then - mkdir test_data -fi - -wget --no-check-certificate https://paddlerec.bj.bcebos.com/xdeepfm%2Fev -O ./test_data/ev -wget --no-check-certificate https://paddlerec.bj.bcebos.com/xdeepfm%2Ftr -O ./train_data/tr diff --git a/PaddleRec/ctr/xdeepfm/infer.py b/PaddleRec/ctr/xdeepfm/infer.py index fe2fc8d326ae7cb3e489c3bd765cc17903ac5321..489d6337fe6a5abfc28a05707c74a56a2f9d513c 100644 --- a/PaddleRec/ctr/xdeepfm/infer.py +++ b/PaddleRec/ctr/xdeepfm/infer.py @@ -8,6 +8,7 @@ import paddle.fluid as fluid from args import parse_args from criteo_reader import CriteoDataset import network_conf +import utils logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s') logger = logging.getLogger('fluid') @@ -25,7 +26,8 @@ def infer(): inference_scope = fluid.Scope() test_files = [ - args.test_data_dir + '/' + x for x in os.listdir(args.test_data_dir) + os.path.join(args.test_data_dir, x) + for x in os.listdir(args.test_data_dir) ] criteo_dataset = CriteoDataset() test_reader = paddle.batch( @@ -33,13 +35,16 @@ def infer(): startup_program = fluid.framework.Program() test_program = fluid.framework.Program() - cur_model_path = args.model_output_dir + '/epoch_' + args.test_epoch + cur_model_path = os.path.join(args.model_output_dir, + 'epoch_' + args.test_epoch) with fluid.scope_guard(inference_scope): with fluid.framework.program_guard(test_program, startup_program): - loss, auc, data_list = eval('network_conf.' + args.model_name)( - args.embedding_size, args.num_field, args.num_feat, - args.layer_sizes_dnn, args.act, args.reg, args.layer_sizes_cin) + loss, auc, data_list, auc_states = eval( + 'network_conf.' + args.model_name)( + args.embedding_size, args.num_field, args.num_feat, + args.layer_sizes_dnn, args.act, args.reg, + args.layer_sizes_cin) exe = fluid.Executor(place) feeder = fluid.DataFeeder(feed_list=data_list, place=place) @@ -48,11 +53,8 @@ def infer(): dirname=cur_model_path, main_program=fluid.default_main_program()) - auc_states_names = ['_generated_var_2', '_generated_var_3'] - for name in auc_states_names: - param = inference_scope.var(name).get_tensor() - param_array = np.zeros(param._get_dims()).astype("int64") - param.set(param_array, place) + for var in auc_states: # reset auc states + set_zero(var.name, scope=inference_scope, place=place) loss_all = 0 num_ins = 0 @@ -71,5 +73,23 @@ def infer(): ) +def set_zero(var_name, + scope=fluid.global_scope(), + place=fluid.CPUPlace(), + param_type="int64"): + """ + Set tensor of a Variable to zero. 
+ Args: + var_name(str): name of Variable + scope(Scope): Scope object, default is fluid.global_scope() + place(Place): Place object, default is fluid.CPUPlace() + param_type(str): param data type, default is int64 + """ + param = scope.var(var_name).get_tensor() + param_array = np.zeros(param._get_dims()).astype(param_type) + param.set(param_array, place) + + if __name__ == '__main__': + utils.check_version() infer() diff --git a/PaddleRec/ctr/xdeepfm/local_train.py b/PaddleRec/ctr/xdeepfm/local_train.py index 8c548d49ca34a45718b1b872412a9b3781fccad7..c5fc8a2eef2a05b0651e4b24c951cde50ebad976 100644 --- a/PaddleRec/ctr/xdeepfm/local_train.py +++ b/PaddleRec/ctr/xdeepfm/local_train.py @@ -4,15 +4,22 @@ import paddle.fluid as fluid import sys import network_conf import time +import utils def train(): args = parse_args() + # add ce + if args.enable_ce: + SEED = 102 + fluid.default_main_program().random_seed = SEED + fluid.default_startup_program().random_seed = SEED + print(args) if not os.path.isdir(args.model_output_dir): os.mkdir(args.model_output_dir) - loss, auc, data_list = eval('network_conf.' + args.model_name)( + loss, auc, data_list, auc_states = eval('network_conf.' + args.model_name)( args.embedding_size, args.num_field, args.num_feat, args.layer_sizes_dnn, args.act, args.reg, args.layer_sizes_cin) optimizer = fluid.optimizer.SGD( @@ -25,7 +32,8 @@ def train(): dataset.set_pipe_command('python criteo_reader.py') dataset.set_batch_size(args.batch_size) dataset.set_filelist([ - args.train_data_dir + '/' + x for x in os.listdir(args.train_data_dir) + os.path.join(args.train_data_dir, x) + for x in os.listdir(args.train_data_dir) ]) if args.use_gpu == 1: @@ -46,7 +54,8 @@ def train(): fetch_info=['loss', 'auc'], debug=False, print_period=args.print_steps) - model_dir = args.model_output_dir + '/epoch_' + str(epoch_id + 1) + model_dir = os.path.join(args.model_output_dir, + 'epoch_' + str(epoch_id + 1)) sys.stderr.write('epoch%d is finished and takes %f s\n' % ( (epoch_id + 1), time.time() - start)) fluid.io.save_persistables( @@ -56,4 +65,5 @@ def train(): if __name__ == '__main__': + utils.check_version() train() diff --git a/PaddleRec/ctr/xdeepfm/network_conf.py b/PaddleRec/ctr/xdeepfm/network_conf.py index 1cdc5c74990ca9a41865c31fdf1dd00822d4764c..0a0b43dbc8744957930b23dea0c5f878b0bb3d44 100644 --- a/PaddleRec/ctr/xdeepfm/network_conf.py +++ b/PaddleRec/ctr/xdeepfm/network_conf.py @@ -14,18 +14,18 @@ def ctr_xdeepfm_model(embedding_size, initer = fluid.initializer.TruncatedNormalInitializer( loc=0.0, scale=init_value_) - raw_feat_idx = fluid.layers.data( - name='feat_idx', shape=[num_field], dtype='int64') - raw_feat_value = fluid.layers.data( - name='feat_value', shape=[num_field], dtype='float32') - label = fluid.layers.data( - name='label', shape=[1], dtype='float32') # None * 1 + raw_feat_idx = fluid.data( + name='feat_idx', shape=[None, num_field], dtype='int64') + raw_feat_value = fluid.data( + name='feat_value', shape=[None, num_field], dtype='float32') + label = fluid.data( + name='label', shape=[None, 1], dtype='float32') # None * 1 feat_idx = fluid.layers.reshape(raw_feat_idx, [-1, 1]) # (None * num_field) * 1 feat_value = fluid.layers.reshape( raw_feat_value, [-1, num_field, 1]) # None * num_field * 1 - feat_embeddings = fluid.layers.embedding( + feat_embeddings = fluid.embedding( input=feat_idx, is_sparse=is_sparse, dtype='float32', @@ -39,7 +39,7 @@ def ctr_xdeepfm_model(embedding_size, # -------------------- linear -------------------- - weights_linear = 
fluid.layers.embedding( + weights_linear = fluid.embedding( input=feat_idx, is_sparse=is_sparse, dtype='float32', @@ -134,4 +134,5 @@ def ctr_xdeepfm_model(embedding_size, label=label_int, slide_steps=0) - return batch_cost, auc_var, [raw_feat_idx, raw_feat_value, label] + return batch_cost, auc_var, [raw_feat_idx, raw_feat_value, + label], auc_states diff --git a/PaddleRec/ctr/xdeepfm/utils.py b/PaddleRec/ctr/xdeepfm/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..779b129e574e4611f29202f71d73856ecb888069 --- /dev/null +++ b/PaddleRec/ctr/xdeepfm/utils.py @@ -0,0 +1,24 @@ +import sys +import paddle.fluid as fluid +import logging + +logging.basicConfig() +logger = logging.getLogger(__name__) + +__all__ = ['check_version'] + + +def check_version(): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." \ + + try: + fluid.require_version('1.6.0') + except Exception as e: + logger.error(err) + sys.exit(1) diff --git a/PaddleRec/gnn/train.py b/PaddleRec/gnn/train.py index 32277d33e03b98916d5d487d0ac55e459b238b7e..790d23d902c426a29b4ab649a9e20f68721cacc7 100644 --- a/PaddleRec/gnn/train.py +++ b/PaddleRec/gnn/train.py @@ -101,10 +101,8 @@ def train(): feed_list = [e.name for e in feed_datas] if use_parallel: - exec_strategy = fluid.ExecutionStrategy() - exec_strategy.num_threads = 1 if os.name == 'nt' else 0 train_exe = fluid.ParallelExecutor( - use_cuda=use_cuda, loss_name=loss.name, exec_strategy=exec_strategy) + use_cuda=use_cuda, loss_name=loss.name) else: train_exe = exe diff --git a/PaddleRec/gru4rec/README.md b/PaddleRec/gru4rec/README.md index b60e4f292a5e426436bd09f79bba2a8b7add88bf..15a9b106e7d891942f2cfee975361ade0a79c44b 100644 --- a/PaddleRec/gru4rec/README.md +++ b/PaddleRec/gru4rec/README.md @@ -45,6 +45,9 @@ session-based推荐应用场景非常广泛,比如用户的商品浏览、新 运行样例程序可跳过'RSC15 数据下载及预处理'部分 + +**要求使用PaddlePaddle 1.6及以上版本或适当的develop版本。** + 同时推荐用户参考[ IPython Notebook demo](https://aistudio.baidu.com/aistudio/projectDetail/122296) ## RSC15 数据下载及预处理 @@ -278,7 +281,7 @@ model:model_r@20/epoch_10 recall@20:0.681 time_cost(s):12.2 可参考cluster_train.py 配置其他多机环境 -运行命令本地模拟多机场景 +运行命令本地模拟多机场景, 暂不支持windows ``` sh cluster_train.sh ``` diff --git a/PaddleRec/gru4rec/convert_format.py b/PaddleRec/gru4rec/convert_format.py index b5db511ef087e59724e765f9fc9275fda6428b27..7bca1d527ab756903382c1314e945e8ddff8a7a9 100644 --- a/PaddleRec/gru4rec/convert_format.py +++ b/PaddleRec/gru4rec/convert_format.py @@ -2,8 +2,8 @@ import sys def convert_format(input, output): - with open(input) as rf: - with open(output, "w") as wf: + with open(input, "r", encoding='utf-8') as rf: + with open(output, "w", encoding='utf-8') as wf: last_sess = -1 sign = 1 i = 0 diff --git a/PaddleRec/gru4rec/dy_graph/README.md b/PaddleRec/gru4rec/dy_graph/README.md new file mode 100644 index 0000000000000000000000000000000000000000..bc0caa7cf5eb5bb05ef6e722837bae5dd123d89c --- /dev/null +++ b/PaddleRec/gru4rec/dy_graph/README.md @@ -0,0 +1,23 @@ +# gru4rec 动态图实现 + +# 环境配置 +paddle 1.7 + + + +# 下载数据 +``` +wget https://paddlerec.bj.bcebos.com/gru4rec/dy_graph/data_rsc15.tar +tar xvf data_rsc15.tar +``` + +# 数据格式 +数据格式及预处理处理同静态图相同。 + +# 训练及预测 + +``` +CUDA_VISIBLE_DEVICES=0 nohup sh run_gru.sh > log 2>&1 & +``` + +每一轮训练完都会进行预测。 diff --git a/PaddleRec/gru4rec/dy_graph/args.py b/PaddleRec/gru4rec/dy_graph/args.py 
new file mode 100644 index 0000000000000000000000000000000000000000..ad33ea1a27155c81678f72ee46e6448e60a6ee45 --- /dev/null +++ b/PaddleRec/gru4rec/dy_graph/args.py @@ -0,0 +1,55 @@ +# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import argparse +import distutils.util + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument( + "--model_type", + type=str, + default="small", + help="model_type [test|small|medium|large]") + parser.add_argument( + "--rnn_model", + type=str, + default="static", + help="model_type [static|padding|cudnn]") + parser.add_argument( + "--data_path", type=str, help="all the data for train,valid,test") + parser.add_argument('--para_init', action='store_true') + parser.add_argument( + '--use_gpu', type=bool, default=False, help='whether using gpu') + parser.add_argument( + '--log_path', + help='path of the log file. If not set, logs are printed to console') + parser.add_argument( + '--save_model_dir', + type=str, + default="models", + help='dir of the saved model.') + parser.add_argument( + '--init_from_pretrain_model', + type=str, + default=None, + help='dir to init model.') + parser.add_argument('--ce', action='store_true', help="run ce") + args = parser.parse_args() + return args diff --git a/PaddleRec/gru4rec/dy_graph/gru4rec_dy.py b/PaddleRec/gru4rec/dy_graph/gru4rec_dy.py new file mode 100644 index 0000000000000000000000000000000000000000..2a87e0bf9bfb1308d6c52a804dc125322fea6846 --- /dev/null +++ b/PaddleRec/gru4rec/dy_graph/gru4rec_dy.py @@ -0,0 +1,464 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from __future__ import print_function + +import os +import unittest +import paddle.fluid as fluid +import paddle.fluid.core as core +from paddle.fluid.dygraph.nn import Embedding +import paddle.fluid.framework as framework +from paddle.fluid.optimizer import SGDOptimizer +from paddle.fluid.optimizer import AdagradOptimizer +from paddle.fluid.dygraph.base import to_variable +import numpy as np +import six + +import reader +import model_check +import time + +from args import * + +import sys +if sys.version[0] == '2': + reload(sys) + sys.setdefaultencoding("utf-8") + + +class SimpleGRURNN(fluid.Layer): + def __init__(self, + hidden_size, + num_steps, + num_layers=2, + init_scale=0.1, + dropout=None): + super(SimpleGRURNN, self).__init__() + self._hidden_size = hidden_size + self._num_layers = num_layers + self._init_scale = init_scale + self._dropout = dropout + self._num_steps = num_steps + + self.weight_1_arr = [] + self.weight_2_arr = [] + self.weight_3_arr = [] + self.bias_1_arr = [] + self.bias_2_arr = [] + self.mask_array = [] + + for i in range(self._num_layers): + weight_1 = self.create_parameter( + attr=fluid.ParamAttr( + initializer=fluid.initializer.UniformInitializer( + low=-self._init_scale, high=self._init_scale)), + shape=[self._hidden_size * 2, self._hidden_size * 2], + dtype="float32", + default_initializer=fluid.initializer.UniformInitializer( + low=-self._init_scale, high=self._init_scale)) + self.weight_1_arr.append(self.add_parameter('w1_%d' % i, weight_1)) + weight_2 = self.create_parameter( + attr=fluid.ParamAttr( + initializer=fluid.initializer.UniformInitializer( + low=-self._init_scale, high=self._init_scale)), + shape=[self._hidden_size, self._hidden_size], + dtype="float32", + default_initializer=fluid.initializer.UniformInitializer( + low=-self._init_scale, high=self._init_scale)) + self.weight_2_arr.append(self.add_parameter('w2_%d' % i, weight_2)) + weight_3 = self.create_parameter( + attr=fluid.ParamAttr( + initializer=fluid.initializer.UniformInitializer( + low=-self._init_scale, high=self._init_scale)), + shape=[self._hidden_size, self._hidden_size], + dtype="float32", + default_initializer=fluid.initializer.UniformInitializer( + low=-self._init_scale, high=self._init_scale)) + self.weight_3_arr.append(self.add_parameter('w3_%d' % i, weight_3)) + bias_1 = self.create_parameter( + attr=fluid.ParamAttr( + initializer=fluid.initializer.UniformInitializer( + low=-self._init_scale, high=self._init_scale)), + shape=[self._hidden_size * 2], + dtype="float32", + default_initializer=fluid.initializer.Constant(0.0)) + self.bias_1_arr.append(self.add_parameter('b1_%d' % i, bias_1)) + bias_2 = self.create_parameter( + attr=fluid.ParamAttr( + initializer=fluid.initializer.UniformInitializer( + low=-self._init_scale, high=self._init_scale)), + shape=[self._hidden_size * 1], + dtype="float32", + default_initializer=fluid.initializer.Constant(0.0)) + self.bias_2_arr.append(self.add_parameter('b2_%d' % i, bias_2)) + + def forward(self, input_embedding, init_hidden=None): + hidden_array = [] + + for i in range(self._num_layers): + hidden_array.append(init_hidden[i]) + + res = [] + for index in range(self._num_steps): + step_input = input_embedding[:, index, :] + for k in range(self._num_layers): + pre_hidden = hidden_array[k] + weight_1 = self.weight_1_arr[k] + weight_2 = self.weight_2_arr[k] + weight_3 = self.weight_3_arr[k] + bias_1 = self.bias_1_arr[k] + bias_2 = self.bias_2_arr[k] + + nn = fluid.layers.concat([step_input, pre_hidden], 1) + gate_input = 
fluid.layers.matmul(x=nn, y=weight_1) + gate_input = fluid.layers.elementwise_add(gate_input, bias_1) + u, r = fluid.layers.split(gate_input, num_or_sections=2, dim=-1) + hidden_c = fluid.layers.tanh( + fluid.layers.elementwise_add( + fluid.layers.matmul( + x=step_input, y=weight_2) + fluid.layers.matmul( + x=(fluid.layers.sigmoid(r) * pre_hidden), + y=weight_3), + bias_2)) + hidden_state = fluid.layers.sigmoid(u) * pre_hidden + ( + 1.0 - fluid.layers.sigmoid(u)) * hidden_c + hidden_array[k] = hidden_state + step_input = hidden_state + + if self._dropout is not None and self._dropout > 0.0: + step_input = fluid.layers.dropout( + step_input, + dropout_prob=self._dropout, + dropout_implementation='upscale_in_train') + res.append(step_input) + real_res = fluid.layers.concat(res, 1) + real_res = fluid.layers.reshape( + real_res, [-1, self._num_steps, self._hidden_size]) + last_hidden = fluid.layers.concat(hidden_array, 1) + last_hidden = fluid.layers.reshape( + last_hidden, shape=[-1, self._num_layers, self._hidden_size]) + last_hidden = fluid.layers.transpose(x=last_hidden, perm=[1, 0, 2]) + return real_res, last_hidden + + +class PtbModel(fluid.Layer): + def __init__(self, + name_scope, + hidden_size, + vocab_size, + num_layers=2, + num_steps=20, + init_scale=0.1, + dropout=None): + #super(PtbModel, self).__init__(name_scope) + super(PtbModel, self).__init__() + self.hidden_size = hidden_size + self.vocab_size = vocab_size + self.init_scale = init_scale + self.num_layers = num_layers + self.num_steps = num_steps + self.dropout = dropout + self.simple_gru_rnn = SimpleGRURNN( + #self.full_name(), + hidden_size, + num_steps, + num_layers=num_layers, + init_scale=init_scale, + dropout=dropout) + self.embedding = Embedding( + #self.full_name(), + size=[vocab_size, hidden_size], + dtype='float32', + is_sparse=False, + param_attr=fluid.ParamAttr( + name='embedding_para', + initializer=fluid.initializer.UniformInitializer( + low=-init_scale, high=init_scale))) + self.softmax_weight = self.create_parameter( + attr=fluid.ParamAttr(), + shape=[self.hidden_size, self.vocab_size], + dtype="float32", + default_initializer=fluid.initializer.UniformInitializer( + low=-self.init_scale, high=self.init_scale)) + self.softmax_bias = self.create_parameter( + attr=fluid.ParamAttr(), + shape=[self.vocab_size], + dtype="float32", + default_initializer=fluid.initializer.UniformInitializer( + low=-self.init_scale, high=self.init_scale)) + + def build_once(self, input, label, init_hidden): + pass + + def forward(self, input, label, init_hidden): + + init_h = fluid.layers.reshape( + init_hidden, shape=[self.num_layers, -1, self.hidden_size]) + + x_emb = self.embedding(input) + + x_emb = fluid.layers.reshape( + x_emb, shape=[-1, self.num_steps, self.hidden_size]) + if self.dropout is not None and self.dropout > 0.0: + x_emb = fluid.layers.dropout( + x_emb, + dropout_prob=self.dropout, + dropout_implementation='upscale_in_train') + rnn_out, last_hidden = self.simple_gru_rnn(x_emb, init_h) + + projection = fluid.layers.matmul(rnn_out, self.softmax_weight) + projection = fluid.layers.elementwise_add(projection, self.softmax_bias) + loss = fluid.layers.softmax_with_cross_entropy( + logits=projection, label=label, soft_label=False) + pre_2d = fluid.layers.reshape(projection, shape=[-1, self.vocab_size]) + label_2d = fluid.layers.reshape(label, shape=[-1, 1]) + acc = fluid.layers.accuracy(input=pre_2d, label=label_2d, k=20) + loss = fluid.layers.reshape(loss, shape=[-1, self.num_steps]) + loss = fluid.layers.reduce_mean(loss, 
dim=[0]) + loss = fluid.layers.reduce_sum(loss) + + return loss, last_hidden, acc + + def debug_emb(self): + + np.save("emb_grad", self.x_emb.gradient()) + + +def train_ptb_lm(): + args = parse_args() + + # check if set use_gpu=True in paddlepaddle cpu version + model_check.check_cuda(args.use_gpu) + # check if paddlepaddle version is satisfied + model_check.check_version() + + model_type = args.model_type + + vocab_size = 37484 + if model_type == "test": + num_layers = 1 + batch_size = 2 + hidden_size = 10 + num_steps = 4 + init_scale = 0.1 + max_grad_norm = 5.0 + epoch_start_decay = 1 + max_epoch = 1 + dropout = 0.0 + lr_decay = 0.5 + base_learning_rate = 1.0 + elif model_type == "small": + num_layers = 2 + batch_size = 20 + hidden_size = 200 + num_steps = 20 + init_scale = 0.1 + max_grad_norm = 5.0 + epoch_start_decay = 4 + max_epoch = 2 + dropout = 0.0 + lr_decay = 0.5 + base_learning_rate = 1.0 + elif model_type == "gru4rec": + num_layers = 1 + batch_size = 500 + hidden_size = 100 + num_steps = 10 + init_scale = 0.1 + max_grad_norm = 5.0 + epoch_start_decay = 10 + max_epoch = 5 + dropout = 0.0 + lr_decay = 0.5 + base_learning_rate = 0.05 + elif model_type == "medium": + num_layers = 2 + batch_size = 20 + hidden_size = 650 + num_steps = 35 + init_scale = 0.05 + max_grad_norm = 5.0 + epoch_start_decay = 6 + max_epoch = 39 + dropout = 0.5 + lr_decay = 0.8 + base_learning_rate = 1.0 + elif model_type == "large": + num_layers = 2 + batch_size = 20 + hidden_size = 1500 + num_steps = 35 + init_scale = 0.04 + max_grad_norm = 10.0 + epoch_start_decay = 14 + max_epoch = 55 + dropout = 0.65 + lr_decay = 1.0 / 1.15 + base_learning_rate = 1.0 + else: + print("model type not support") + return + + with fluid.dygraph.guard(core.CUDAPlace(0)): + if args.ce: + print("ce mode") + seed = 33 + np.random.seed(seed) + fluid.default_startup_program().random_seed = seed + fluid.default_main_program().random_seed = seed + max_epoch = 1 + ptb_model = PtbModel( + "ptb_model", + hidden_size=hidden_size, + vocab_size=vocab_size, + num_layers=num_layers, + num_steps=num_steps, + init_scale=init_scale, + dropout=dropout) + + if args.init_from_pretrain_model: + if not os.path.exists(args.init_from_pretrain_model + '.pdparams'): + print(args.init_from_pretrain_model) + raise Warning("The pretrained params do not exist.") + return + fluid.load_dygraph(args.init_from_pretrain_model) + print("finish initing model from pretrained params from %s" % + (args.init_from_pretrain_model)) + + dy_param_updated = dict() + dy_param_init = dict() + dy_loss = None + last_hidden = None + + data_path = args.data_path + print("begin to load data") + ptb_data = reader.get_ptb_data(data_path) + print("finished load data") + train_data, valid_data, test_data = ptb_data + + batch_len = len(train_data) // batch_size + total_batch_size = (batch_len - 1) // num_steps + print("total_batch_size:", total_batch_size) + log_interval = total_batch_size // 20 + + bd = [] + lr_arr = [base_learning_rate] + for i in range(1, max_epoch): + bd.append(total_batch_size * i) + new_lr = base_learning_rate * (lr_decay** + max(i + 1 - epoch_start_decay, 0.0)) + lr_arr.append(new_lr) + + sgd = AdagradOptimizer( + parameter_list=ptb_model.parameters(), + learning_rate=fluid.layers.piecewise_decay( + boundaries=bd, values=lr_arr)) + + print("parameters:--------------------------------") + for para in ptb_model.parameters(): + print(para.name) + print("parameters:--------------------------------") + + def eval(model, data): + print("begion to eval") + total_loss = 
0.0 + iters = 0.0 + init_hidden_data = np.zeros( + (num_layers, batch_size, hidden_size), dtype='float32') + + model.eval() + train_data_iter = reader.get_data_iter(data, batch_size, num_steps) + init_hidden = to_variable(init_hidden_data) + accum_num_recall = 0.0 + for batch_id, batch in enumerate(train_data_iter): + x_data, y_data = batch + x_data = x_data.reshape((-1, num_steps, 1)) + y_data = y_data.reshape((-1, num_steps, 1)) + x = to_variable(x_data) + y = to_variable(y_data) + dy_loss, last_hidden, acc = ptb_model(x, y, init_hidden) + + out_loss = dy_loss.numpy() + acc_ = acc.numpy()[0] + accum_num_recall += acc_ + if batch_id % 1 == 0: + print("batch_id:%d recall@20:%.4f" % + (batch_id, accum_num_recall / (batch_id + 1))) + + init_hidden = last_hidden + + total_loss += out_loss + iters += num_steps + + print("eval finished") + ppl = np.exp(total_loss / iters) + print("recall@20 ", accum_num_recall / (batch_id + 1)) + if args.ce: + print("kpis\ttest_ppl\t%0.3f" % ppl[0]) + + grad_clip = fluid.dygraph_grad_clip.GradClipByGlobalNorm(max_grad_norm) + for epoch_id in range(max_epoch): + ptb_model.train() + total_loss = 0.0 + iters = 0.0 + init_hidden_data = np.zeros( + (num_layers, batch_size, hidden_size), dtype='float32') + + train_data_iter = reader.get_data_iter(train_data, batch_size, + num_steps) + init_hidden = to_variable(init_hidden_data) + + start_time = time.time() + for batch_id, batch in enumerate(train_data_iter): + x_data, y_data = batch + x_data = x_data.reshape((-1, num_steps, 1)) + y_data = y_data.reshape((-1, num_steps, 1)) + x = to_variable(x_data) + y = to_variable(y_data) + dy_loss, last_hidden, acc = ptb_model(x, y, init_hidden) + + out_loss = dy_loss.numpy() + acc_ = acc.numpy()[0] + + init_hidden = last_hidden + dy_loss.backward() + sgd.minimize(dy_loss, grad_clip=grad_clip) + ptb_model.clear_gradients() + total_loss += out_loss + iters += num_steps + + if batch_id > 0 and batch_id % 100 == 1: + ppl = np.exp(total_loss / iters) + print( + "-- Epoch:[%d]; Batch:[%d]; ppl: %.5f, acc: %.5f, lr: %.5f" + % (epoch_id, batch_id, ppl[0], acc_, + sgd._global_learning_rate().numpy())) + + print("one ecpoh finished", epoch_id) + print("time cost ", time.time() - start_time) + ppl = np.exp(total_loss / iters) + print("-- Epoch:[%d]; ppl: %.5f" % (epoch_id, ppl[0])) + if args.ce: + print("kpis\ttrain_ppl\t%0.3f" % ppl[0]) + save_model_dir = os.path.join(args.save_model_dir, + str(epoch_id), 'params') + fluid.save_dygraph(ptb_model.state_dict(), save_model_dir) + print("Saved model to: %s.\n" % save_model_dir) + eval(ptb_model, test_data) + + #eval(ptb_model, test_data) + + +train_ptb_lm() diff --git a/PaddleRec/gru4rec/dy_graph/model_check.py b/PaddleRec/gru4rec/dy_graph/model_check.py new file mode 100644 index 0000000000000000000000000000000000000000..106c28e6ddbc0d3d784396017ba70a2b40121f44 --- /dev/null +++ b/PaddleRec/gru4rec/dy_graph/model_check.py @@ -0,0 +1,58 @@ +#encoding=utf8 +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import sys +import paddle +import paddle.fluid as fluid + + +def check_cuda(use_cuda, err = \ + "\nYou can not set use_cuda = True in the model because you are using paddlepaddle-cpu.\n \ + Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_cuda = False to run models on CPU.\n" + ): + """ + Log error and exit when set use_gpu=true in paddlepaddle + cpu version. + """ + try: + if use_cuda == True and fluid.is_compiled_with_cuda() == False: + print(err) + sys.exit(1) + except Exception as e: + pass + + +def check_version(): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." \ + + try: + fluid.require_version('1.6.0') + except Exception as e: + print(err) + sys.exit(1) + + +if __name__ == "__main__": + check_cuda(True) + + check_cuda(False) + + check_cuda(True, "This is only for testing.") diff --git a/PaddleRec/gru4rec/dy_graph/reader.py b/PaddleRec/gru4rec/dy_graph/reader.py new file mode 100644 index 0000000000000000000000000000000000000000..06b504e05a1fa259307fb673b98a3b7bea1021b6 --- /dev/null +++ b/PaddleRec/gru4rec/dy_graph/reader.py @@ -0,0 +1,85 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import collections +import os +import sys +import numpy as np + +EOS = "
" + + +def build_vocab(filename): + + vocab_dict = {} + ids = 0 + vocab_dict[EOS] = ids + ids += 1 + + with open(filename, "r") as f: + for line in f.readlines(): + for w in line.strip().split(): + if w not in vocab_dict: + vocab_dict[w] = ids + ids += 1 + + print("vocab word num", ids) + + return vocab_dict + + +def file_to_ids(src_file, src_vocab): + + src_data = [] + with open(src_file, "r") as f_src: + for line in f_src.readlines(): + arra = line.strip().split() + ids = [src_vocab[w] for w in arra if w in src_vocab] + + src_data += ids + [0] + return src_data + + +def get_ptb_data(data_path=None): + + train_file = os.path.join(data_path, "ptb.train.txt") + valid_file = os.path.join(data_path, "ptb.valid.txt") + test_file = os.path.join(data_path, "ptb.test.txt") + + vocab_dict = build_vocab(train_file) + train_ids = file_to_ids(train_file, vocab_dict) + valid_ids = file_to_ids(valid_file, vocab_dict) + test_ids = file_to_ids(test_file, vocab_dict) + + return train_ids, valid_ids, test_ids + + +def get_data_iter(raw_data, batch_size, num_steps): + data_len = len(raw_data) + raw_data = np.asarray(raw_data, dtype="int64") + + batch_len = data_len // batch_size + + data = raw_data[0:batch_size * batch_len].reshape((batch_size, batch_len)) + + epoch_size = (batch_len - 1) // num_steps + for i in range(epoch_size): + start = i * num_steps + x = np.copy(data[:, i * num_steps:(i + 1) * num_steps]) + y = np.copy(data[:, i * num_steps + 1:(i + 1) * num_steps + 1]) + + yield (x, y) diff --git a/PaddleRec/gru4rec/dy_graph/run_gru.sh b/PaddleRec/gru4rec/dy_graph/run_gru.sh new file mode 100644 index 0000000000000000000000000000000000000000..dc37e3b41605f1add7e7dd6006b0313c42bd5f86 --- /dev/null +++ b/PaddleRec/gru4rec/dy_graph/run_gru.sh @@ -0,0 +1,2 @@ +python -u gru4rec_dy.py --data_path data/ --model_type gru4rec + diff --git a/PaddleRec/gru4rec/infer.py b/PaddleRec/gru4rec/infer.py index bc459c28a9b24761b202dc5d8110d583322abdeb..032205cf7b6f9cc1015583e13a29c2361f889897 100644 --- a/PaddleRec/gru4rec/infer.py +++ b/PaddleRec/gru4rec/infer.py @@ -71,6 +71,7 @@ def infer(test_reader, use_cuda, model_path): if __name__ == "__main__": + utils.check_version() args = parse_args() start_index = args.start_index last_index = args.last_index diff --git a/PaddleRec/gru4rec/infer_sample_neg.py b/PaddleRec/gru4rec/infer_sample_neg.py index 48458e82b4fe2bbc7141c3e45469b8414d87ece4..b77f3685576e129926a8529d99adbc06185acd91 100644 --- a/PaddleRec/gru4rec/infer_sample_neg.py +++ b/PaddleRec/gru4rec/infer_sample_neg.py @@ -84,6 +84,7 @@ def infer(args, vocab_size, test_reader, use_cuda): if __name__ == "__main__": + utils.check_version() args = parse_args() start_index = args.start_index last_index = args.last_index diff --git a/PaddleRec/gru4rec/net.py b/PaddleRec/gru4rec/net.py index 6a715443ff1e72ae77aba51d5eaffe4eefee9687..4bdfcdfa9182e7b0311873f5c3b88a18e7451ad3 100644 --- a/PaddleRec/gru4rec/net.py +++ b/PaddleRec/gru4rec/net.py @@ -10,12 +10,12 @@ def all_vocab_network(vocab_size, gru_lr_x = 1.0 fc_lr_x = 1.0 # Input data - src_wordseq = fluid.layers.data( - name="src_wordseq", shape=[1], dtype="int64", lod_level=1) - dst_wordseq = fluid.layers.data( - name="dst_wordseq", shape=[1], dtype="int64", lod_level=1) + src_wordseq = fluid.data( + name="src_wordseq", shape=[None, 1], dtype="int64", lod_level=1) + dst_wordseq = fluid.data( + name="dst_wordseq", shape=[None, 1], dtype="int64", lod_level=1) - emb = fluid.layers.embedding( + emb = fluid.embedding( input=src_wordseq, size=[vocab_size, hid_size], 
param_attr=fluid.ParamAttr( @@ -56,19 +56,20 @@ def train_bpr_network(vocab_size, neg_size, hid_size, drop_out=0.2): gru_lr_x = 1.0 fc_lr_x = 1.0 # Input data - src = fluid.layers.data(name="src", shape=[1], dtype="int64", lod_level=1) - pos_label = fluid.layers.data( - name="pos_label", shape=[1], dtype="int64", lod_level=1) - label = fluid.layers.data( - name="label", shape=[neg_size + 1], dtype="int64", lod_level=1) + src = fluid.data(name="src", shape=[None, 1], dtype="int64", lod_level=1) + pos_label = fluid.data( + name="pos_label", shape=[None, 1], dtype="int64", lod_level=1) + label = fluid.data( + name="label", shape=[None, neg_size + 1], dtype="int64", lod_level=1) - emb_src = fluid.layers.embedding( + emb_src = fluid.embedding( input=src, size=[vocab_size, hid_size], param_attr=fluid.ParamAttr( name="emb", initializer=fluid.initializer.XavierInitializer(), learning_rate=emb_lr_x)) + emb_src = fluid.layers.squeeze(input=emb_src, axes=[1]) emb_src_drop = fluid.layers.dropout(emb_src, dropout_prob=drop_out) @@ -90,7 +91,7 @@ def train_bpr_network(vocab_size, neg_size, hid_size, drop_out=0.2): gru_h0_drop = fluid.layers.dropout(gru_h0, dropout_prob=drop_out) label_re = fluid.layers.sequence_reshape(input=label, new_dim=1) - emb_label = fluid.layers.embedding( + emb_label1 = fluid.embedding( input=label_re, size=[vocab_size, hid_size], param_attr=fluid.ParamAttr( @@ -98,6 +99,7 @@ def train_bpr_network(vocab_size, neg_size, hid_size, drop_out=0.2): initializer=fluid.initializer.XavierInitializer(), learning_rate=emb_lr_x)) + emb_label = fluid.layers.squeeze(input=emb_label1, axes=[1]) emb_label_drop = fluid.layers.dropout(emb_label, dropout_prob=drop_out) gru_exp = fluid.layers.expand( @@ -120,19 +122,20 @@ def train_cross_entropy_network(vocab_size, neg_size, hid_size, drop_out=0.2): gru_lr_x = 1.0 fc_lr_x = 1.0 # Input data - src = fluid.layers.data(name="src", shape=[1], dtype="int64", lod_level=1) - pos_label = fluid.layers.data( - name="pos_label", shape=[1], dtype="int64", lod_level=1) - label = fluid.layers.data( - name="label", shape=[neg_size + 1], dtype="int64", lod_level=1) + src = fluid.data(name="src", shape=[None, 1], dtype="int64", lod_level=1) + pos_label = fluid.data( + name="pos_label", shape=[None, 1], dtype="int64", lod_level=1) + label = fluid.data( + name="label", shape=[None, neg_size + 1], dtype="int64", lod_level=1) - emb_src = fluid.layers.embedding( + emb_src = fluid.embedding( input=src, size=[vocab_size, hid_size], param_attr=fluid.ParamAttr( name="emb", initializer=fluid.initializer.XavierInitializer(), learning_rate=emb_lr_x)) + emb_src = fluid.layers.squeeze(input=emb_src, axes=[1]) emb_src_drop = fluid.layers.dropout(emb_src, dropout_prob=drop_out) @@ -154,13 +157,14 @@ def train_cross_entropy_network(vocab_size, neg_size, hid_size, drop_out=0.2): gru_h0_drop = fluid.layers.dropout(gru_h0, dropout_prob=drop_out) label_re = fluid.layers.sequence_reshape(input=label, new_dim=1) - emb_label = fluid.layers.embedding( + emb_label1 = fluid.embedding( input=label_re, size=[vocab_size, hid_size], param_attr=fluid.ParamAttr( name="emb", initializer=fluid.initializer.XavierInitializer(), learning_rate=emb_lr_x)) + emb_label = fluid.layers.squeeze(input=emb_label1, axes=[1]) emb_label_drop = fluid.layers.dropout(emb_label, dropout_prob=drop_out) @@ -180,8 +184,8 @@ def train_cross_entropy_network(vocab_size, neg_size, hid_size, drop_out=0.2): def infer_network(vocab_size, batch_size, hid_size, dropout=0.2): - src = fluid.layers.data(name="src", shape=[1], 
dtype="int64", lod_level=1) - emb_src = fluid.layers.embedding( + src = fluid.data(name="src", shape=[None, 1], dtype="int64", lod_level=1) + emb_src = fluid.embedding( input=src, size=[vocab_size, hid_size], param_attr="emb") emb_src_drop = fluid.layers.dropout( emb_src, dropout_prob=dropout, is_test=True) @@ -198,20 +202,18 @@ def infer_network(vocab_size, batch_size, hid_size, dropout=0.2): gru_h0_drop = fluid.layers.dropout( gru_h0, dropout_prob=dropout, is_test=True) - all_label = fluid.layers.data( - name="all_label", - shape=[vocab_size, 1], - dtype="int64", - append_batch_size=False) - emb_all_label = fluid.layers.embedding( + all_label = fluid.data( + name="all_label", shape=[vocab_size, 1], dtype="int64") + emb_all_label = fluid.embedding( input=all_label, size=[vocab_size, hid_size], param_attr="emb") + emb_all_label = fluid.layers.squeeze(input=emb_all_label, axes=[1]) emb_all_label_drop = fluid.layers.dropout( emb_all_label, dropout_prob=dropout, is_test=True) all_pre = fluid.layers.matmul( gru_h0_drop, emb_all_label_drop, transpose_y=True) - pos_label = fluid.layers.data( - name="pos_label", shape=[1], dtype="int64", lod_level=1) + pos_label = fluid.data( + name="pos_label", shape=[None, 1], dtype="int64", lod_level=1) acc = fluid.layers.accuracy(input=all_pre, label=pos_label, k=20) return acc diff --git a/PaddleRec/gru4rec/text2paddle.py b/PaddleRec/gru4rec/text2paddle.py index 563a8cadbbed335d9fccabff8401c3e20f90f6d4..b26b171fec98d25b1b8c3d5438834cc8aaecb048 100644 --- a/PaddleRec/gru4rec/text2paddle.py +++ b/PaddleRec/gru4rec/text2paddle.py @@ -2,6 +2,12 @@ import sys import six import collections import os +import sys +import io +if six.PY2: + reload(sys) + sys.setdefaultencoding('utf-8') + def word_count(input_file, word_freq=None): """ @@ -25,11 +31,11 @@ def build_dict(min_word_freq=0, train_dir="", test_dir=""): word_freq = collections.defaultdict(int) files = os.listdir(train_dir) for fi in files: - with open(train_dir + '/' + fi, "r") as f: + with io.open(os.path.join(train_dir, fi), "r") as f: word_freq = word_count(f, word_freq) files = os.listdir(test_dir) for fi in files: - with open(test_dir + '/' + fi, "r") as f: + with io.open(os.path.join(test_dir, fi), "r") as f: word_freq = word_count(f, word_freq) word_freq = [x for x in six.iteritems(word_freq) if x[1] > min_word_freq] @@ -39,38 +45,50 @@ def build_dict(min_word_freq=0, train_dir="", test_dir=""): return word_idx -def write_paddle(word_idx, train_dir, test_dir, output_train_dir, output_test_dir): +def write_paddle(word_idx, train_dir, test_dir, output_train_dir, + output_test_dir): files = os.listdir(train_dir) if not os.path.exists(output_train_dir): os.mkdir(output_train_dir) for fi in files: - with open(train_dir + '/' + fi, "r") as f: - with open(output_train_dir + '/' + fi, "w") as wf: + with io.open(os.path.join(train_dir, fi), "r") as f: + with io.open(os.path.join(output_train_dir, fi), "w") as wf: for l in f: l = l.strip().split() l = [word_idx.get(w) for w in l] for w in l: - wf.write(str(w) + " ") - wf.write("\n") + wf.write(str2file(str(w) + " ")) + wf.write(str2file("\n")) files = os.listdir(test_dir) if not os.path.exists(output_test_dir): os.mkdir(output_test_dir) for fi in files: - with open(test_dir + '/' + fi, "r") as f: - with open(output_test_dir + '/' + fi, "w") as wf: + with io.open(os.path.join(test_dir, fi), "r", encoding='utf-8') as f: + with io.open( + os.path.join(output_test_dir, fi), "w", + encoding='utf-8') as wf: for l in f: l = l.strip().split() l = [word_idx.get(w) for w 
in l] for w in l: - wf.write(str(w) + " ") - wf.write("\n") + wf.write(str2file(str(w) + " ")) + wf.write(str2file("\n")) + + +def str2file(str): + if six.PY2: + return str.decode("utf-8") + else: + return str + -def text2paddle(train_dir, test_dir, output_train_dir, output_test_dir, output_vocab): +def text2paddle(train_dir, test_dir, output_train_dir, output_test_dir, + output_vocab): vocab = build_dict(0, train_dir, test_dir) - with open(output_vocab, "w") as wf: - wf.write(str(len(vocab)) + "\n") - #wf.write(str(vocab)) + print("vocab size:", str(len(vocab))) + with io.open(output_vocab, "w", encoding='utf-8') as wf: + wf.write(str2file(str(len(vocab)) + "\n")) write_paddle(vocab, train_dir, test_dir, output_train_dir, output_test_dir) @@ -79,4 +97,5 @@ test_dir = sys.argv[2] output_train_dir = sys.argv[3] output_test_dir = sys.argv[4] output_vocab = sys.argv[5] -text2paddle(train_dir, test_dir, output_train_dir, output_test_dir, output_vocab) +text2paddle(train_dir, test_dir, output_train_dir, output_test_dir, + output_vocab) diff --git a/PaddleRec/gru4rec/train.py b/PaddleRec/gru4rec/train.py index b43926b69eaf002380a261a0689be91ec3f6ff90..686f7a117730ebc96f315717d6ebfc7b158444d0 100644 --- a/PaddleRec/gru4rec/train.py +++ b/PaddleRec/gru4rec/train.py @@ -58,8 +58,8 @@ def train(): """ do training """ args = parse_args() if args.enable_ce: - fluid.default_startup_program().random_seed = SEED - fluid.default_main_program().random_seed = SEED + fluid.default_startup_program().random_seed = SEED + fluid.default_main_program().random_seed = SEED hid_size = args.hid_size train_dir = args.train_dir vocab_path = args.vocab_path @@ -143,17 +143,16 @@ def train(): if args.use_cuda: gpu_num = device[1] print("kpis\teach_pass_duration_gpu%s\t%s" % - (gpu_num, total_time / epoch_idx)) - print("kpis\ttrain_ppl_gpu%s\t%s" % - (gpu_num, ce_ppl)) + (gpu_num, total_time / epoch_idx)) + print("kpis\ttrain_ppl_gpu%s\t%s" % (gpu_num, ce_ppl)) else: cpu_num = device[1] threads_num = device[2] print("kpis\teach_pass_duration_cpu%s_thread%s\t%s" % - (cpu_num, threads_num, total_time / epoch_idx)) + (cpu_num, threads_num, total_time / epoch_idx)) print("kpis\ttrain_ppl_cpu%s_thread%s\t%s" % - (cpu_num, threads_num, ce_ppl)) - + (cpu_num, threads_num, ce_ppl)) + print("finish training") @@ -166,7 +165,8 @@ def get_device(args): threads_num = os.environ.get('NUM_THREADS', 1) cpu_num = os.environ.get('CPU_NUM', 1) return "cpu", int(cpu_num), int(threads_num) - + if __name__ == "__main__": + utils.check_version() train() diff --git a/PaddleRec/gru4rec/train_sample_neg.py b/PaddleRec/gru4rec/train_sample_neg.py index 2642452024810fe16cfa1154e273febdb1d63254..fbb687052771fd6ca642fda7637c103e120136ff 100644 --- a/PaddleRec/gru4rec/train_sample_neg.py +++ b/PaddleRec/gru4rec/train_sample_neg.py @@ -128,4 +128,5 @@ def train(): if __name__ == "__main__": + utils.check_version() train() diff --git a/PaddleRec/gru4rec/utils.py b/PaddleRec/gru4rec/utils.py index 1cd6a313b2a5097b16c473722737e0e6936f4e31..424ebf78490a7b5220ec1a3c2a1b045f28638f35 100644 --- a/PaddleRec/gru4rec/utils.py +++ b/PaddleRec/gru4rec/utils.py @@ -6,6 +6,7 @@ import numpy as np import paddle.fluid as fluid import paddle import os +import io def to_lodtensor(data, place): @@ -86,7 +87,7 @@ def to_lodtensor_bpr_test(raw_data, vocab_size, place): def get_vocab_size(vocab_path): - with open(vocab_path, "r") as rf: + with io.open(vocab_path, "r", encoding='utf-8') as rf: line = rf.readline() return int(line.strip()) @@ -99,7 +100,7 @@ def 
prepare_data(file_dir, is_train=True): """ prepare the English Pann Treebank (PTB) data """ print("start constuct word dict") - if is_train: + if is_train and 'ce_mode' not in os.environ: vocab_size = get_vocab_size(vocab_path) reader = sort_batch( paddle.reader.shuffle( @@ -110,12 +111,28 @@ def prepare_data(file_dir, batch_size * 20) else: vocab_size = get_vocab_size(vocab_path) - reader = paddle.batch( + reader = fluid.io.batch( test( file_dir, buffer_size, data_type=DataType.SEQ), batch_size) return vocab_size, reader +def check_version(): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." \ + + try: + fluid.require_version('1.6.0') + except Exception as e: + logger.error(err) + sys.exit(1) + + def sort_batch(reader, batch_size, sort_group_size, drop_last=False): """ Create a batched reader. @@ -170,7 +187,8 @@ def reader_creator(file_dir, n, data_type): def reader(): files = os.listdir(file_dir) for fi in files: - with open(file_dir + '/' + fi, "r") as f: + with io.open( + os.path.join(file_dir, fi), "r", encoding='utf-8') as f: for l in f: if DataType.SEQ == data_type: l = l.strip().split() diff --git a/PaddleRec/multi-task/MMoE/README.md b/PaddleRec/multi-task/MMoE/README.md new file mode 100644 index 0000000000000000000000000000000000000000..27a018976ce2020d314059007c88073e99f350fc --- /dev/null +++ b/PaddleRec/multi-task/MMoE/README.md @@ -0,0 +1,28 @@ +# MMoE + +##简介 + +MMoE是经典的多任务(multi-task)模型,原论文[Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts](https://www.kdd.org/kdd2018/accepted-papers/view/modeling-task-relationships-in-multi-task-learning-with-multi-gate-mixture-) 发表于KDD 2018. + +多任务模型通过学习不同任务的联系和差异,可提高每个任务的学习效率和质量。多任务学习的的框架广泛采用shared-bottom的结构,不同任务间共用底部的隐层。这种结构本质上可以减少过拟合的风险,但是效果上可能受到任务差异和数据分布带来的影响。论文中提出了一个Multi-gate Mixture-of-Experts(MMoE)的多任务学习结构。MMoE模型刻画了任务相关性,基于共享表示来学习特定任务的函数,避免了明显增加参数的缺点。(https://zhuanlan.zhihu.com/p/55752344) + +我们基于实际工业界场景实现了MMoE的核心思想。 + +## 配置 +1.6 及以上 + +## 数据 + +我们采用了随机数据作为训练数据,可以根据自己的数据调整data部分。 + +## 训练 + +``` +python mmoe_train.py +``` + +# 未来工作 + +1. 添加预测部分 + +2. 添加公开数据集的结果 diff --git a/PaddleRec/multi-task/MMoE/args.py b/PaddleRec/multi-task/MMoE/args.py new file mode 100644 index 0000000000000000000000000000000000000000..2887b54ac71a63e56acb3f1d3ed73b6cd51dc582 --- /dev/null +++ b/PaddleRec/multi-task/MMoE/args.py @@ -0,0 +1,35 @@ +# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import argparse +import distutils.util + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument( + "--base_lr", type=float, default=0.01, help="learning_rate") + parser.add_argument("--batch_size", type=int, default=5, help="batch_size") + parser.add_argument("--dict_dim", type=int, default=64, help="dict dim") + parser.add_argument( + "--emb_dim", type=int, default=100, help="embedding_dim") + parser.add_argument( + '--use_gpu', type=bool, default=False, help='whether using gpu') + parser.add_argument('--ce', action='store_true', help="run ce") + args = parser.parse_args() + return args diff --git a/PaddleRec/multi-task/MMoE/mmoe_train.py b/PaddleRec/multi-task/MMoE/mmoe_train.py new file mode 100644 index 0000000000000000000000000000000000000000..713ca393486b1f246819f3fc73ad12dcce6a617c --- /dev/null +++ b/PaddleRec/multi-task/MMoE/mmoe_train.py @@ -0,0 +1,149 @@ +import paddle.fluid as fluid +import numpy as np +import time +from args import * + + +def fc_layers(input, layers, acts, prefix): + fc_layers_input = [input] + fc_layers_size = layers + fc_layers_act = acts + init_range = 0.2 + scales_tmp = [input.shape[1]] + fc_layers_size + scales = [] + for i in range(len(scales_tmp)): + scales.append(init_range / (scales_tmp[i]**0.5)) + for i in range(len(fc_layers_size)): + name = prefix + "_" + str(i) + fc = fluid.layers.fc( + input = fc_layers_input[-1], + size = fc_layers_size[i], + act = fc_layers_act[i], + param_attr = \ + fluid.ParamAttr(learning_rate=1.0, \ + initializer=fluid.initializer.NormalInitializer(loc=0.0, scale=1.0 * scales[i])), + bias_attr = \ + fluid.ParamAttr(learning_rate=1.0, \ + initializer=fluid.initializer.NormalInitializer(loc=0.0, scale=1.0 * scales[i])), + name=name) + fc_layers_input.append(fc) + return fc_layers_input[-1] + + +def mmoe_layer(inputs, expert_num=8, gate_num=3): + + expert_out = [] + expert_nn = [3] + expert_act = ['relu'] + for i in range(0, expert_num): + cur_expert = fc_layers(inputs, expert_nn, expert_act, + 'expert_' + str(i)) + expert_out.append(cur_expert) + expert_concat = fluid.layers.concat(expert_out, axis=1) + expert_concat = fluid.layers.reshape(expert_concat, + [-1, expert_num, expert_nn[-1]]) + + outs = [] + for i in range(0, gate_num): + cur_gate = fluid.layers.fc(input=inputs, + size=expert_num, + act='softmax', + name='gate_' + str(i)) + cur_gate_expert = fluid.layers.elementwise_mul( + expert_concat, cur_gate, axis=0) + cur_gate_expert = fluid.layers.reduce_sum(cur_gate_expert, dim=1) + cur_fc = fc_layers(cur_gate_expert, [64, 32, 16, 1], + ['relu', 'relu', 'relu', None], 'out_' + str(i)) + outs.append(cur_fc) + return outs + + +def model(dict_dim, emb_dim): + label_like = fluid.layers.data( + name="label_like", + shape=[-1, 1], + dtype="int64", + lod_level=0, + append_batch_size=False) + label_comment = fluid.layers.data( + name="label_comment", + shape=[-1, 1], + dtype="int64", + lod_level=0, + append_batch_size=False) + label_share = fluid.layers.data( + name="label_share", + shape=[-1, 1], + dtype="int64", + lod_level=0, + append_batch_size=False) + + a_data = fluid.layers.data( + name="a", shape=[-1, 1], dtype="int64", append_batch_size=False) + emb = fluid.layers.embedding(input=a_data, size=[dict_dim, emb_dim]) + + outs = mmoe_layer(emb, expert_num=8, gate_num=3) + + output_like = fluid.layers.sigmoid( + fluid.layers.clip( + outs[0], min=-15.0, max=15.0), 
name="output_like") + output_comment = fluid.layers.sigmoid( + fluid.layers.clip( + outs[1], min=-15.0, max=15.0), name="output_comment") + output_share = fluid.layers.sigmoid( + fluid.layers.clip( + outs[2], min=-15.0, max=15.0), name="output_share") + + cost_like = fluid.layers.log_loss( + input=output_like, + label=fluid.layers.cast( + x=label_like, dtype='float32')) + cost_comment = fluid.layers.log_loss( + input=output_comment, + label=fluid.layers.cast( + x=label_comment, dtype='float32')) + cost_share = fluid.layers.log_loss( + input=output_share, + label=fluid.layers.cast( + x=label_share, dtype='float32')) + + avg_cost_like = fluid.layers.mean(x=cost_like) + avg_cost_comment = fluid.layers.mean(x=cost_comment) + avg_cost_share = fluid.layers.mean(x=cost_share) + + cost = avg_cost_like + avg_cost_comment + avg_cost_share + return cost, [a_data, label_like, label_comment, label_share] + + +args = parse_args() +batch_size = args.batch_size +dict_dim = args.dict_dim +emb_dim = args.emb_dim + +print("batch_size:[%d], dict_dim:[%d], emb_dim:[%d], learning_rate:[%.4f]" % + (batch_size, dict_dim, emb_dim, args.base_lr)) + +loss, data_list = model(dict_dim, emb_dim) +sgd = fluid.optimizer.SGD(learning_rate=args.base_lr) +sgd.minimize(loss) +place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace() +feeder = fluid.DataFeeder(feed_list=data_list, place=place) +exe = fluid.Executor(place) +exe.run(fluid.default_startup_program()) +for batch_id in range(100): + data = [ + np.random.randint( + 2, size=(batch_size, 1)).astype('int64') for i in range(4) + ] + begin = time.time() + loss_data, = exe.run(fluid.default_main_program(), + feed={ + "a": data[0], + "label_like": data[1], + "label_comment": data[2], + "label_share": data[3] + }, + fetch_list=[loss.name]) + end = time.time() + print("batch_id:[%d], loss:[%.5f], batch_time:[%.5f s]" % + (batch_id, float(np.array(loss_data)), end - begin)) diff --git a/PaddleRec/multiview_simnet/README.md b/PaddleRec/multiview_simnet/README.md index a3bceaba646d13b11e66c475e35e091d4cd68d03..1c682eed1b630de7047f7002e0d709cf715d1e14 100644 --- a/PaddleRec/multiview_simnet/README.md +++ b/PaddleRec/multiview_simnet/README.md @@ -3,6 +3,8 @@ ## Introduction In personalized recommendation scenario, a user often is provided with several items from personalized interest matching model. In real world application, a user may have multiple views of features, say user-id, age, click-history of items, search queries. A item, e.g. news, may also have multiple views of features like news title, news category, images in news and so on. Multi-view Simnet is matching a model that combine users' and items' multiple views of features into one unified model. The model can be used in many industrial product like Baidu's feed news. The model is adapted from the paper A Multi-View Deep Learning(MV-DNN) Approach for Cross Domain User Modeling in Recommendation Systems, WWW 2015. The difference between our model and the MV-DNN is that we also consider multiple feature views of users. 
+**Now all models in PaddleRec require PaddlePaddle version 1.6 or higher, or suitable develop version.** + We also recommend users to take a look at the [IPython Notebook demo](https://aistudio.baidu.com/aistudio/projectDetail/122294) ## Dataset diff --git a/PaddleRec/multiview_simnet/infer.py b/PaddleRec/multiview_simnet/infer.py index 7b5bb080d278ba5fffbe678841037b71b02b3069..e9136588add2815c65e5b9e7e707de1f6fce8707 100644 --- a/PaddleRec/multiview_simnet/infer.py +++ b/PaddleRec/multiview_simnet/infer.py @@ -31,6 +31,22 @@ logger = logging.getLogger("fluid") logger.setLevel(logging.INFO) +def check_version(): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." \ + + try: + fluid.require_version('1.6.0') + except Exception as e: + logger.error(err) + sys.exit(1) + + def parse_args(): parser = argparse.ArgumentParser("multi-view simnet") parser.add_argument("--train_file", type=str, help="Training file") @@ -116,4 +132,5 @@ def main(): if __name__ == "__main__": + check_version() main() diff --git a/PaddleRec/multiview_simnet/nets.py b/PaddleRec/multiview_simnet/nets.py index fed177844bdd247d163aee9e8625cd0ec74378b3..104101e2b8f6548f4ddce289b97794d569eefc41 100644 --- a/PaddleRec/multiview_simnet/nets.py +++ b/PaddleRec/multiview_simnet/nets.py @@ -13,7 +13,6 @@ # limitations under the License. import paddle.fluid as fluid -import paddle.fluid.layers.nn as nn import paddle.fluid.layers.tensor as tensor import paddle.fluid.layers.control_flow as cf import paddle.fluid.layers.io as io @@ -26,7 +25,7 @@ class BowEncoder(object): self.param_name = "" def forward(self, emb): - return nn.sequence_pool(input=emb, pool_type='sum') + return fluid.layers.sequence_pool(input=emb, pool_type='sum') class CNNEncoder(object): @@ -53,7 +52,6 @@ class CNNEncoder(object): pool_type=self.pool_type, param_attr=self.param_name + ".param", bias_attr=self.param_name + ".bias") - class GrnnEncoder(object): @@ -64,19 +62,18 @@ class GrnnEncoder(object): self.hidden_size = hidden_size def forward(self, emb): - fc0 = nn.fc( - input=emb, - size=self.hidden_size * 3, - param_attr=self.param_name + "_fc.w", - bias_attr=False) - - gru_h = nn.dynamic_gru( + fc0 = fluid.layers.fc(input=emb, + size=self.hidden_size * 3, + param_attr=self.param_name + "_fc.w", + bias_attr=False) + + gru_h = fluid.layers.dynamic_gru( input=fc0, size=self.hidden_size, is_reverse=False, param_attr=self.param_name + ".param", bias_attr=self.param_name + ".bias") - return nn.sequence_pool(input=gru_h, pool_type='max') + return fluid.layers.sequence_pool(input=gru_h, pool_type='max') '''this is a very simple Encoder factory @@ -119,40 +116,40 @@ class MultiviewSimnet(object): def get_correct(self, x, y): less = tensor.cast(cf.less_than(x, y), dtype='float32') - correct = nn.reduce_sum(less) + correct = fluid.layers.reduce_sum(less) return correct def train_net(self): # input fields for query, pos_title, neg_title q_slots = [ - io.data( - name="q%d" % i, shape=[1], lod_level=1, dtype='int64') + fluid.data( + name="q%d" % i, shape=[None, 1], lod_level=1, dtype='int64') for i in range(len(self.query_encoders)) ] pt_slots = [ - io.data( - name="pt%d" % i, shape=[1], lod_level=1, dtype='int64') + fluid.data( + name="pt%d" % i, shape=[None, 1], lod_level=1, dtype='int64') for i in range(len(self.title_encoders)) ] nt_slots = [ - 
io.data( - name="nt%d" % i, shape=[1], lod_level=1, dtype='int64') + fluid.data( + name="nt%d" % i, shape=[None, 1], lod_level=1, dtype='int64') for i in range(len(self.title_encoders)) ] # lookup embedding for each slot q_embs = [ - nn.embedding( + fluid.embedding( input=query, size=self.emb_shape, param_attr="emb") for query in q_slots ] pt_embs = [ - nn.embedding( + fluid.embedding( input=title, size=self.emb_shape, param_attr="emb") for title in pt_slots ] nt_embs = [ - nn.embedding( + fluid.embedding( input=title, size=self.emb_shape, param_attr="emb") for title in nt_slots ] @@ -169,21 +166,30 @@ class MultiviewSimnet(object): ] # concat multi view for query, pos_title, neg_title - q_concat = nn.concat(q_encodes) - pt_concat = nn.concat(pt_encodes) - nt_concat = nn.concat(nt_encodes) + q_concat = fluid.layers.concat(q_encodes) + pt_concat = fluid.layers.concat(pt_encodes) + nt_concat = fluid.layers.concat(nt_encodes) # projection of hidden layer - q_hid = nn.fc(q_concat, size=self.hidden_size, param_attr='q_fc.w', bias_attr='q_fc.b') - pt_hid = nn.fc(pt_concat, size=self.hidden_size, param_attr='t_fc.w', bias_attr='t_fc.b') - nt_hid = nn.fc(nt_concat, size=self.hidden_size, param_attr='t_fc.w', bias_attr='t_fc.b') + q_hid = fluid.layers.fc(q_concat, + size=self.hidden_size, + param_attr='q_fc.w', + bias_attr='q_fc.b') + pt_hid = fluid.layers.fc(pt_concat, + size=self.hidden_size, + param_attr='t_fc.w', + bias_attr='t_fc.b') + nt_hid = fluid.layers.fc(nt_concat, + size=self.hidden_size, + param_attr='t_fc.w', + bias_attr='t_fc.b') # cosine of hidden layers - cos_pos = nn.cos_sim(q_hid, pt_hid) - cos_neg = nn.cos_sim(q_hid, nt_hid) + cos_pos = fluid.layers.cos_sim(q_hid, pt_hid) + cos_neg = fluid.layers.cos_sim(q_hid, nt_hid) # pairwise hinge_loss - loss_part1 = nn.elementwise_sub( + loss_part1 = fluid.layers.elementwise_sub( tensor.fill_constant_batch_size_like( input=cos_pos, shape=[-1, 1], @@ -191,37 +197,37 @@ class MultiviewSimnet(object): dtype='float32'), cos_pos) - loss_part2 = nn.elementwise_add(loss_part1, cos_neg) + loss_part2 = fluid.layers.elementwise_add(loss_part1, cos_neg) - loss_part3 = nn.elementwise_max( + loss_part3 = fluid.layers.elementwise_max( tensor.fill_constant_batch_size_like( input=loss_part2, shape=[-1, 1], value=0.0, dtype='float32'), loss_part2) - avg_cost = nn.mean(loss_part3) + avg_cost = fluid.layers.mean(loss_part3) correct = self.get_correct(cos_neg, cos_pos) return q_slots + pt_slots + nt_slots, avg_cost, correct def pred_net(self, query_fields, pos_title_fields, neg_title_fields): q_slots = [ - io.data( - name="q%d" % i, shape=[1], lod_level=1, dtype='int64') + fluid.data( + name="q%d" % i, shape=[None, 1], lod_level=1, dtype='int64') for i in range(len(self.query_encoders)) ] pt_slots = [ - io.data( - name="pt%d" % i, shape=[1], lod_level=1, dtype='int64') + fluid.data( + name="pt%d" % i, shape=[None, 1], lod_level=1, dtype='int64') for i in range(len(self.title_encoders)) ] # lookup embedding for each slot q_embs = [ - nn.embedding( + fluid.embedding( input=query, size=self.emb_shape, param_attr="emb") for query in q_slots ] pt_embs = [ - nn.embedding( + fluid.embedding( input=title, size=self.emb_shape, param_attr="emb") for title in pt_slots ] @@ -233,11 +239,18 @@ class MultiviewSimnet(object): self.title_encoders[i].forward(emb) for i, emb in enumerate(pt_embs) ] # concat multi view for query, pos_title, neg_title - q_concat = nn.concat(q_encodes) - pt_concat = nn.concat(pt_encodes) + q_concat = fluid.layers.concat(q_encodes) + pt_concat = 
fluid.layers.concat(pt_encodes) # projection of hidden layer - q_hid = nn.fc(q_concat, size=self.hidden_size, param_attr='q_fc.w', bias_attr='q_fc.b') - pt_hid = nn.fc(pt_concat, size=self.hidden_size, param_attr='t_fc.w', bias_attr='t_fc.b') + q_hid = fluid.layers.fc(q_concat, + size=self.hidden_size, + param_attr='q_fc.w', + bias_attr='q_fc.b') + pt_hid = fluid.layers.fc(pt_concat, + size=self.hidden_size, + param_attr='t_fc.w', + bias_attr='t_fc.b') + # cosine of hidden layers - cos = nn.cos_sim(q_hid, pt_hid) + cos = fluid.layers.cos_sim(q_hid, pt_hid) return cos diff --git a/PaddleRec/multiview_simnet/train.py b/PaddleRec/multiview_simnet/train.py index f098fd109e8813ffbfb40753122acbef3cd896a6..8f4072addf7ecbbfda162298e87a834eb2855637 100644 --- a/PaddleRec/multiview_simnet/train.py +++ b/PaddleRec/multiview_simnet/train.py @@ -88,6 +88,22 @@ def parse_args(): return parser.parse_args() +def check_version(): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." \ + + try: + fluid.require_version('1.6.0') + except Exception as e: + logger.error(err) + sys.exit(1) + + def start_train(args): if args.enable_ce: SEED = 102 @@ -145,7 +161,7 @@ def start_train(args): # only for ce if args.enable_ce: threads_num, cpu_num = get_cards(args) - epoch_idx = args.epochs + epoch_idx = args.epochs ce_loss = 0 try: ce_loss = ce_info[-2] @@ -153,9 +169,9 @@ def start_train(args): logger.error("ce info error") print("kpis\teach_pass_duration_cpu%s_thread%s\t%s" % - (cpu_num, threads_num, total_time / epoch_idx)) + (cpu_num, threads_num, total_time / epoch_idx)) print("kpis\ttrain_loss_cpu%s_thread%s\t%s" % - (cpu_num, threads_num, ce_loss)) + (cpu_num, threads_num, ce_loss)) def get_cards(args): @@ -170,4 +186,5 @@ def main(): if __name__ == "__main__": + check_version() main() diff --git a/PaddleRec/ssr/README.md b/PaddleRec/ssr/README.md index d0b4dfb41b4cea19efa42c4a233c9544349d1770..6abc52405a3bf6a288bae2b3675d84fe33bd00ac 100644 --- a/PaddleRec/ssr/README.md +++ b/PaddleRec/ssr/README.md @@ -12,6 +12,10 @@ Sequence Semantic Retrieval(SSR) Model shares the similar idea with Multi-Rate D - The idea of SSR is to model a user's personalized interest of an item through matching model structure, and the representation of a news item can be computed online even the news item does not exist in training dataset. - With the representation of news items, we are able to build an vector indexing service online for news prediction and this is the retrieval part of SSR. +## Version +**Now all models in PaddleRec require PaddlePaddle version 1.6 or higher, or suitable develop version.** + + ## Dataset Dataset preprocessing follows the method of [GRU4Rec Project](https://github.com/PaddlePaddle/models/tree/develop/fluid/PaddleRec/gru4rec). Note that you should reuse scripts from GRU4Rec project for data preprocessing. @@ -39,7 +43,7 @@ cpu 单机多卡训练 CPU_NUM=10 python train.py --train_dir train_data --use_cuda 0 --parallel 1 --batch_size 50 --model_dir model_output --num_devices 10 ``` -本地模拟多机训练 +本地模拟多机训练, 不支持windows. 
``` bash sh cluster_train.sh ``` diff --git a/PaddleRec/ssr/infer.py b/PaddleRec/ssr/infer.py index 09dee039f4da1e08de0169b3370aff174c89556b..3a44fad7196336a71ce0ed484d5869b1633541f4 100644 --- a/PaddleRec/ssr/infer.py +++ b/PaddleRec/ssr/infer.py @@ -120,6 +120,7 @@ def infer(args, vocab_size, test_reader): if __name__ == "__main__": + utils.check_version() args = parse_args() start_index = args.start_index last_index = args.last_index diff --git a/PaddleRec/ssr/nets.py b/PaddleRec/ssr/nets.py index 4df23573c91fcf16a4ef95d1bab1ac01e437d148..7b78adae3b45626f10f99b57654501b5f09f19a1 100644 --- a/PaddleRec/ssr/nets.py +++ b/PaddleRec/ssr/nets.py @@ -26,7 +26,7 @@ class BowEncoder(object): self.param_name = "" def forward(self, emb): - return nn.sequence_pool(input=emb, pool_type='sum') + return fluid.layers.sequence_pool(input=emb, pool_type='sum') class GrnnEncoder(object): @@ -37,18 +37,18 @@ class GrnnEncoder(object): self.hidden_size = hidden_size def forward(self, emb): - fc0 = nn.fc(input=emb, - size=self.hidden_size * 3, - param_attr=self.param_name + "_fc.w", - bias_attr=False) + fc0 = fluid.layers.fc(input=emb, + size=self.hidden_size * 3, + param_attr=self.param_name + "_fc.w", + bias_attr=False) - gru_h = nn.dynamic_gru( + gru_h = fluid.layers.dynamic_gru( input=fc0, size=self.hidden_size, is_reverse=False, param_attr=self.param_name + ".param", bias_attr=self.param_name + ".bias") - return nn.sequence_pool(input=gru_h, pool_type='max') + return fluid.layers.sequence_pool(input=gru_h, pool_type='max') class PairwiseHingeLoss(object): @@ -56,12 +56,12 @@ class PairwiseHingeLoss(object): self.margin = margin def forward(self, pos, neg): - loss_part1 = nn.elementwise_sub( + loss_part1 = fluid.layers.elementwise_sub( tensor.fill_constant_batch_size_like( input=pos, shape=[-1, 1], value=self.margin, dtype='float32'), pos) - loss_part2 = nn.elementwise_add(loss_part1, neg) - loss_part3 = nn.elementwise_max( + loss_part2 = fluid.layers.elementwise_add(loss_part1, neg) + loss_part3 = fluid.layers.elementwise_max( tensor.fill_constant_batch_size_like( input=loss_part2, shape=[-1, 1], value=0.0, dtype='float32'), loss_part2) @@ -82,40 +82,41 @@ class SequenceSemanticRetrieval(object): def get_correct(self, x, y): less = tensor.cast(cf.less_than(x, y), dtype='float32') - correct = nn.reduce_sum(less) + correct = fluid.layers.reduce_sum(less) return correct def train(self): - user_data = io.data(name="user", shape=[1], dtype="int64", lod_level=1) - pos_item_data = io.data( - name="p_item", shape=[1], dtype="int64", lod_level=1) - neg_item_data = io.data( - name="n_item", shape=[1], dtype="int64", lod_level=1) - user_emb = nn.embedding( + user_data = fluid.data( + name="user", shape=[None, 1], dtype="int64", lod_level=1) + pos_item_data = fluid.data( + name="p_item", shape=[None, 1], dtype="int64", lod_level=1) + neg_item_data = fluid.data( + name="n_item", shape=[None, 1], dtype="int64", lod_level=1) + user_emb = fluid.embedding( input=user_data, size=self.emb_shape, param_attr="emb.item") - pos_item_emb = nn.embedding( + pos_item_emb = fluid.embedding( input=pos_item_data, size=self.emb_shape, param_attr="emb.item") - neg_item_emb = nn.embedding( + neg_item_emb = fluid.embedding( input=neg_item_data, size=self.emb_shape, param_attr="emb.item") user_enc = self.user_encoder.forward(user_emb) pos_item_enc = self.item_encoder.forward(pos_item_emb) neg_item_enc = self.item_encoder.forward(neg_item_emb) - user_hid = nn.fc(input=user_enc, - size=self.hidden_size, - param_attr='user.w', - 
bias_attr="user.b") - pos_item_hid = nn.fc(input=pos_item_enc, - size=self.hidden_size, - param_attr='item.w', - bias_attr="item.b") - neg_item_hid = nn.fc(input=neg_item_enc, - size=self.hidden_size, - param_attr='item.w', - bias_attr="item.b") - cos_pos = nn.cos_sim(user_hid, pos_item_hid) - cos_neg = nn.cos_sim(user_hid, neg_item_hid) + user_hid = fluid.layers.fc(input=user_enc, + size=self.hidden_size, + param_attr='user.w', + bias_attr="user.b") + pos_item_hid = fluid.layers.fc(input=pos_item_enc, + size=self.hidden_size, + param_attr='item.w', + bias_attr="item.b") + neg_item_hid = fluid.layers.fc(input=neg_item_enc, + size=self.hidden_size, + param_attr='item.w', + bias_attr="item.b") + cos_pos = fluid.layers.cos_sim(user_hid, pos_item_hid) + cos_neg = fluid.layers.cos_sim(user_hid, neg_item_hid) hinge_loss = self.pairwise_hinge_loss.forward(cos_pos, cos_neg) - avg_cost = nn.mean(hinge_loss) + avg_cost = fluid.layers.mean(hinge_loss) correct = self.get_correct(cos_neg, cos_pos) return [user_data, pos_item_data, diff --git a/PaddleRec/ssr/reader.py b/PaddleRec/ssr/reader.py index 15989fd8cec366b2c3b71672f134035c42bf79da..6d73440bea1699733ace31d0a5f21943f7665918 100644 --- a/PaddleRec/ssr/reader.py +++ b/PaddleRec/ssr/reader.py @@ -13,6 +13,7 @@ # limitations under the License. import random +import io class Dataset: @@ -33,7 +34,7 @@ class YoochooseVocab(Vocab): def load(self, filelist): idx = 0 for f in filelist: - with open(f, "r") as fin: + with io.open(f, "r", encoding='utf-8') as fin: for line in fin: group = line.strip().split() for item in group: @@ -64,7 +65,7 @@ class YoochooseDataset(Dataset): def _reader_creator(self, filelist, is_train): def reader(): for f in filelist: - with open(f, 'r') as fin: + with io.open(f, 'r', encoding='utf-8') as fin: line_idx = 0 for line in fin: ids = line.strip().split() diff --git a/PaddleRec/ssr/train.py b/PaddleRec/ssr/train.py index 1c0c9f8cc3ed6750d21ba43985fb142dc527cf00..a3e85d63ac65baf0e59cfa65d6049761bb2581d2 100644 --- a/PaddleRec/ssr/train.py +++ b/PaddleRec/ssr/train.py @@ -68,9 +68,9 @@ def get_cards(args): def train(args): if args.enable_ce: - SEED = 102 - fluid.default_startup_program().random_seed = SEED - fluid.default_main_program().random_seed = SEED + SEED = 102 + fluid.default_startup_program().random_seed = SEED + fluid.default_main_program().random_seed = SEED use_cuda = True if args.use_cuda else False parallel = True if args.parallel else False print("use_cuda:", use_cuda, "parallel:", parallel) @@ -136,17 +136,16 @@ def train(args): if args.use_cuda: gpu_num = device[1] print("kpis\teach_pass_duration_gpu%s\t%s" % - (gpu_num, total_time / epoch_idx)) - print("kpis\ttrain_acc_gpu%s\t%s" % - (gpu_num, ce_acc)) + (gpu_num, total_time / epoch_idx)) + print("kpis\ttrain_acc_gpu%s\t%s" % (gpu_num, ce_acc)) else: cpu_num = device[1] threads_num = device[2] print("kpis\teach_pass_duration_cpu%s_thread%s\t%s" % - (cpu_num, threads_num, total_time / epoch_idx)) + (cpu_num, threads_num, total_time / epoch_idx)) print("kpis\ttrain_acc_cpu%s_thread%s\t%s" % - (cpu_num, threads_num, ce_acc)) - + (cpu_num, threads_num, ce_acc)) + def get_device(args): if args.use_cuda: @@ -157,7 +156,7 @@ def get_device(args): threads_num = os.environ.get('NUM_THREADS', 1) cpu_num = os.environ.get('CPU_NUM', 1) return "cpu", int(cpu_num), int(threads_num) - + def main(): args = parse_args() @@ -165,4 +164,5 @@ def main(): if __name__ == "__main__": + utils.check_version() main() diff --git a/PaddleRec/ssr/utils.py b/PaddleRec/ssr/utils.py index 
4fe9ef470ed0a2a5da7bef6a975f45e5a04ab18e..65571cb08a930d520c82c881bf5c68ca7c53b152 100644 --- a/PaddleRec/ssr/utils.py +++ b/PaddleRec/ssr/utils.py @@ -4,10 +4,11 @@ import os import logging import paddle.fluid as fluid import paddle +import io def get_vocab_size(vocab_path): - with open(vocab_path, "r") as rf: + with io.open(vocab_path, "r", encoding='utf-8') as rf: line = rf.readline() return int(line.strip()) @@ -16,7 +17,7 @@ def construct_train_data(file_dir, vocab_path, batch_size): vocab_size = get_vocab_size(vocab_path) files = [file_dir + '/' + f for f in os.listdir(file_dir)] y_data = reader.YoochooseDataset(vocab_size) - train_reader = paddle.batch( + train_reader = fluid.io.batch( paddle.reader.shuffle( y_data.train(files), buf_size=batch_size * 100), batch_size=batch_size) @@ -27,10 +28,26 @@ def construct_test_data(file_dir, vocab_path, batch_size): vocab_size = get_vocab_size(vocab_path) files = [file_dir + '/' + f for f in os.listdir(file_dir)] y_data = reader.YoochooseDataset(vocab_size) - test_reader = paddle.batch(y_data.test(files), batch_size=batch_size) + test_reader = fluid.io.batch(y_data.test(files), batch_size=batch_size) return test_reader, vocab_size +def check_version(): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." \ + + try: + fluid.require_version('1.6.0') + except Exception as e: + logger.error(err) + sys.exit(1) + + def infer_data(raw_data, place): data = [dat[0] for dat in raw_data] seq_lens = [len(seq) for seq in data] diff --git a/PaddleRec/tagspace/README.md b/PaddleRec/tagspace/README.md index b72a05b8cbf8152f530d18e75b4414c13d514086..1980b06dfd531da3db5691a204a3a0c088785e2d 100644 --- a/PaddleRec/tagspace/README.md +++ b/PaddleRec/tagspace/README.md @@ -26,6 +26,8 @@ TagSpace模型的介绍可以参阅论文[#TagSpace: Semantic Embeddings from Ha Tagspace模型学习文本及标签的embedding表示,应用于工业级的标签推荐,具体应用场景有feed新闻标签推荐。 +**Now all models in PaddleRec require PaddlePaddle version 1.6 or higher, or suitable develop version.** + 同时推荐用户参考[ IPython Notebook demo](https://aistudio.baidu.com/aistudio/projectDetail/122298) ## 数据下载及预处理 @@ -42,6 +44,8 @@ Tagspace模型学习文本及标签的embedding表示,应用于工业级的标 备份数据解压后,将文本数据转为paddle数据,先将数据放到训练数据目录和测试数据目录 ``` +mkdir raw_big_train_data +mkdir raw_big_test_data mv train.csv raw_big_train_data mv test.csv raw_big_test_data ``` diff --git a/PaddleRec/tagspace/infer.py b/PaddleRec/tagspace/infer.py index e8522b095826622721de9f2e329c8c361f6f7c41..66412fc5b20a2146227c39572b53b841a5983a6b 100644 --- a/PaddleRec/tagspace/infer.py +++ b/PaddleRec/tagspace/infer.py @@ -71,6 +71,7 @@ def infer(test_reader, vocab_tag, use_cuda, model_path, epoch): if __name__ == "__main__": + utils.check_version() args = parse_args() start_index = args.start_index last_index = args.last_index diff --git a/PaddleRec/tagspace/net.py b/PaddleRec/tagspace/net.py index 797ae63442643ad1a8ce1f0dcf374eff24dbbe67..479d6620aaf1ddf3b6ca5decf56f85e604bd0878 100644 --- a/PaddleRec/tagspace/net.py +++ b/PaddleRec/tagspace/net.py @@ -2,42 +2,53 @@ import paddle.fluid as fluid import paddle.fluid.layers.nn as nn import paddle.fluid.layers.tensor as tensor import paddle.fluid.layers.control_flow as cf -import paddle.fluid.layers.io as io -def network(vocab_text_size, vocab_tag_size, emb_dim=10, hid_dim=1000, win_size=5, margin=0.1, neg_size=5): + +def network(vocab_text_size, + 
vocab_tag_size, + emb_dim=10, + hid_dim=1000, + win_size=5, + margin=0.1, + neg_size=5): """ network definition """ - text = io.data(name="text", shape=[1], lod_level=1, dtype='int64') - pos_tag = io.data(name="pos_tag", shape=[1], lod_level=1, dtype='int64') - neg_tag = io.data(name="neg_tag", shape=[1], lod_level=1, dtype='int64') - text_emb = nn.embedding( - input=text, size=[vocab_text_size, emb_dim], param_attr="text_emb") - pos_tag_emb = nn.embedding( - input=pos_tag, size=[vocab_tag_size, emb_dim], param_attr="tag_emb") - neg_tag_emb = nn.embedding( - input=neg_tag, size=[vocab_tag_size, emb_dim], param_attr="tag_emb") + text = fluid.data(name="text", shape=[None, 1], lod_level=1, dtype='int64') + pos_tag = fluid.data( + name="pos_tag", shape=[None, 1], lod_level=1, dtype='int64') + neg_tag = fluid.data( + name="neg_tag", shape=[None, 1], lod_level=1, dtype='int64') + text_emb = fluid.embedding( + input=text, size=[vocab_text_size, emb_dim], param_attr="text_emb") + text_emb = fluid.layers.squeeze(input=text_emb, axes=[1]) + pos_tag_emb = fluid.embedding( + input=pos_tag, size=[vocab_tag_size, emb_dim], param_attr="tag_emb") + pos_tag_emb = fluid.layers.squeeze(input=pos_tag_emb, axes=[1]) + neg_tag_emb = fluid.embedding( + input=neg_tag, size=[vocab_tag_size, emb_dim], param_attr="tag_emb") + neg_tag_emb = fluid.layers.squeeze(input=neg_tag_emb, axes=[1]) conv_1d = fluid.nets.sequence_conv_pool( - input=text_emb, - num_filters=hid_dim, - filter_size=win_size, - act="tanh", - pool_type="max", - param_attr="cnn") - text_hid = fluid.layers.fc(input=conv_1d, size=emb_dim, param_attr="text_hid") + input=text_emb, + num_filters=hid_dim, + filter_size=win_size, + act="tanh", + pool_type="max", + param_attr="cnn") + text_hid = fluid.layers.fc(input=conv_1d, + size=emb_dim, + param_attr="text_hid") cos_pos = nn.cos_sim(pos_tag_emb, text_hid) mul_text_hid = fluid.layers.sequence_expand_as(x=text_hid, y=neg_tag_emb) mul_cos_neg = nn.cos_sim(neg_tag_emb, mul_text_hid) - cos_neg_all = fluid.layers.sequence_reshape(input=mul_cos_neg, new_dim=neg_size) + cos_neg_all = fluid.layers.sequence_reshape( + input=mul_cos_neg, new_dim=neg_size) #choose max negtive cosine cos_neg = nn.reduce_max(cos_neg_all, dim=1, keep_dim=True) #calculate hinge loss loss_part1 = nn.elementwise_sub( - tensor.fill_constant_batch_size_like( - input=cos_pos, - shape=[-1, 1], - value=margin, - dtype='float32'), - cos_pos) + tensor.fill_constant_batch_size_like( + input=cos_pos, shape=[-1, 1], value=margin, dtype='float32'), + cos_pos) loss_part2 = nn.elementwise_add(loss_part1, cos_neg) loss_part3 = nn.elementwise_max( tensor.fill_constant_batch_size_like( diff --git a/PaddleRec/tagspace/text2paddle.py b/PaddleRec/tagspace/text2paddle.py index 6aa040c02aae715549593f41dc3bcf0509aa5c6f..14d5369b9509f094730e0e714f4e24dd0d4a9d3d 100644 --- a/PaddleRec/tagspace/text2paddle.py +++ b/PaddleRec/tagspace/text2paddle.py @@ -2,22 +2,27 @@ import sys import six import collections import os -import csv -import re +import sys +if six.PY2: + reload(sys) + sys.setdefaultencoding('utf-8') + -def word_count(column_num, input_file, word_freq=None): +def word_count(input_file, word_freq=None): """ compute word count from corpus """ if word_freq is None: word_freq = collections.defaultdict(int) - data_file = csv.reader(input_file) - for row in data_file: - for w in re.split(r'\W+',row[column_num].strip()): - word_freq[w]+= 1 + + for l in input_file: + for w in l.strip().split(): + word_freq[w] += 1 + return word_freq -def 
build_dict(column_num=2, min_word_freq=0, train_dir="", test_dir=""): + +def build_dict(min_word_freq=0, train_dir="", test_dir=""): """ Build a word dictionary from the corpus, Keys of the dictionary are words, and values are zero-based IDs of these words. @@ -25,12 +30,12 @@ def build_dict(column_num=2, min_word_freq=0, train_dir="", test_dir=""): word_freq = collections.defaultdict(int) files = os.listdir(train_dir) for fi in files: - with open(train_dir + '/' + fi, "r") as f: - word_freq = word_count(column_num, f, word_freq) + with open(os.path.join(train_dir, fi), "r") as f: + word_freq = word_count(f, word_freq) files = os.listdir(test_dir) for fi in files: - with open(test_dir + '/' + fi, "r") as f: - word_freq = word_count(column_num, f, word_freq) + with open(os.path.join(test_dir, fi), "r") as f: + word_freq = word_count(f, word_freq) word_freq = [x for x in six.iteritems(word_freq) if x[1] > min_word_freq] word_freq_sorted = sorted(word_freq, key=lambda x: (-x[1], x[0])) @@ -39,20 +44,17 @@ def build_dict(column_num=2, min_word_freq=0, train_dir="", test_dir=""): return word_idx -def write_paddle(text_idx, tag_idx, train_dir, test_dir, output_train_dir, output_test_dir): +def write_paddle(word_idx, train_dir, test_dir, output_train_dir, + output_test_dir): files = os.listdir(train_dir) if not os.path.exists(output_train_dir): os.mkdir(output_train_dir) for fi in files: - with open(train_dir + '/' + fi, "r") as f: - with open(output_train_dir + '/' + fi, "w") as wf: - data_file = csv.reader(f) - for row in data_file: - tag_raw = re.split(r'\W+', row[0].strip()) - pos_index = tag_idx.get(tag_raw[0]) - wf.write(str(pos_index) + ",") - text_raw = re.split(r'\W+', row[2].strip()) - l = [text_idx.get(w) for w in text_raw] + with open(os.path.join(train_dir, fi), "r") as f: + with open(os.path.join(output_train_dir, fi), "w") as wf: + for l in f: + l = l.strip().split() + l = [word_idx.get(w) for w in l] for w in l: wf.write(str(w) + " ") wf.write("\n") @@ -61,37 +63,29 @@ def write_paddle(text_idx, tag_idx, train_dir, test_dir, output_train_dir, outpu if not os.path.exists(output_test_dir): os.mkdir(output_test_dir) for fi in files: - with open(test_dir + '/' + fi, "r") as f: - with open(output_test_dir + '/' + fi, "w") as wf: - data_file = csv.reader(f) - for row in data_file: - tag_raw = re.split(r'\W+', row[0].strip()) - pos_index = tag_idx.get(tag_raw[0]) - wf.write(str(pos_index) + ",") - text_raw = re.split(r'\W+', row[2].strip()) - l = [text_idx.get(w) for w in text_raw] + with open(os.path.join(test_dir, fi), "r") as f: + with open(os.path.join(output_test_dir, fi), "w") as wf: + for l in f: + l = l.strip().split() + l = [word_idx.get(w) for w in l] for w in l: wf.write(str(w) + " ") wf.write("\n") -def text2paddle(train_dir, test_dir, output_train_dir, output_test_dir, output_vocab_text, output_vocab_tag): - print("start constuct word dict") - vocab_text = build_dict(2, 0, train_dir, test_dir) - with open(output_vocab_text, "w") as wf: - wf.write(str(len(vocab_text)) + "\n") - - vocab_tag = build_dict(0, 0, train_dir, test_dir) - with open(output_vocab_tag, "w") as wf: - wf.write(str(len(vocab_tag)) + "\n") - print("construct word dict done\n") - write_paddle(vocab_text, vocab_tag, train_dir, test_dir, output_train_dir, output_test_dir) +def text2paddle(train_dir, test_dir, output_train_dir, output_test_dir, + output_vocab): + vocab = build_dict(0, train_dir, test_dir) + with open(output_vocab, "w", encoding='utf-8') as wf: + wf.write(str(len(vocab)) + "\n") + 
#wf.write(str(vocab)) + write_paddle(vocab, train_dir, test_dir, output_train_dir, output_test_dir) train_dir = sys.argv[1] test_dir = sys.argv[2] output_train_dir = sys.argv[3] output_test_dir = sys.argv[4] -output_vocab_text = sys.argv[5] -output_vocab_tag = sys.argv[6] -text2paddle(train_dir, test_dir, output_train_dir, output_test_dir, output_vocab_text, output_vocab_tag) +output_vocab = sys.argv[5] +text2paddle(train_dir, test_dir, output_train_dir, output_test_dir, + output_vocab) diff --git a/PaddleRec/tagspace/train.py b/PaddleRec/tagspace/train.py index 419bb1c4b156c148f8bc4bc3a48385b6722f5c68..da3563e096ac3da3cb94721494daf8855be552a9 100644 --- a/PaddleRec/tagspace/train.py +++ b/PaddleRec/tagspace/train.py @@ -56,8 +56,8 @@ def train(): """ do training """ args = parse_args() if args.enable_ce: - fluid.default_startup_program().random_seed = SEED - fluid.default_main_program().random_seed = SEED + fluid.default_startup_program().random_seed = SEED + fluid.default_main_program().random_seed = SEED train_dir = args.train_dir vocab_text_path = args.vocab_text_path vocab_tag_path = args.vocab_tag_path @@ -114,7 +114,8 @@ def train(): "neg_tag": lod_neg_tag }, fetch_list=[avg_cost.name, correct.name]) - ce_info.append(float(np.sum(correct_val)) / (args.num_devices * batch_size)) + ce_info.append( + float(np.sum(correct_val)) / (args.num_devices * batch_size)) if batch_id % args.print_batch == 0: print("TRAIN --> pass: {} batch_num: {} avg_cost: {}, acc: {}" .format(pass_idx, (batch_id + 10) * batch_size, @@ -128,8 +129,7 @@ def train(): save_dir = "%s/epoch_%d" % (model_dir, epoch_idx) feed_var_names = ["text", "pos_tag"] fetch_vars = [cos_pos] - fluid.io.save_inference_model(save_dir, feed_var_names, fetch_vars, - exe) + fluid.io.save_inference_model(save_dir, feed_var_names, fetch_vars, exe) # only for ce if args.enable_ce: ce_acc = 0 @@ -142,17 +142,16 @@ def train(): if args.use_cuda: gpu_num = device[1] print("kpis\teach_pass_duration_gpu%s\t%s" % - (gpu_num, total_time / epoch_idx)) - print("kpis\ttrain_acc_gpu%s\t%s" % - (gpu_num, ce_acc)) + (gpu_num, total_time / epoch_idx)) + print("kpis\ttrain_acc_gpu%s\t%s" % (gpu_num, ce_acc)) else: cpu_num = device[1] threads_num = device[2] print("kpis\teach_pass_duration_cpu%s_thread%s\t%s" % - (cpu_num, threads_num, total_time / epoch_idx)) + (cpu_num, threads_num, total_time / epoch_idx)) print("kpis\ttrain_acc_cpu%s_thread%s\t%s" % - (cpu_num, threads_num, ce_acc)) - + (cpu_num, threads_num, ce_acc)) + print("finish training") @@ -165,7 +164,8 @@ def get_device(args): threads_num = os.environ.get('NUM_THREADS', 1) cpu_num = os.environ.get('CPU_NUM', 1) return "cpu", int(cpu_num), int(threads_num) - + if __name__ == "__main__": + utils.check_version() train() diff --git a/PaddleRec/tagspace/utils.py b/PaddleRec/tagspace/utils.py index f5b7e64753d331df57e2ef0a86b5a1dff1cea37a..7ae71249e41ec2d6f42be07f3a5a472a1d3ba8a2 100644 --- a/PaddleRec/tagspace/utils.py +++ b/PaddleRec/tagspace/utils.py @@ -8,6 +8,11 @@ import numpy as np import paddle.fluid as fluid import paddle import csv +import io +if six.PY2: + reload(sys) + sys.setdefaultencoding('utf-8') + def to_lodtensor(data, place): """ convert to LODtensor """ @@ -24,12 +29,29 @@ def to_lodtensor(data, place): res.set_lod([lod]) return res + def get_vocab_size(vocab_path): - with open(vocab_path, "r") as rf: + with io.open(vocab_path, "r") as rf: line = rf.readline() return int(line.strip()) +def check_version(): + """ + Log error and exit when the installed version of paddlepaddle is 
+ not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." \ + + try: + fluid.require_version('1.6.0') + except Exception as e: + logger.error(err) + sys.exit(1) + + def prepare_data(file_dir, vocab_text_path, vocab_tag_path, @@ -45,19 +67,25 @@ def prepare_data(file_dir, reader = sort_batch( paddle.reader.shuffle( train( - file_dir, vocab_tag_size, neg_size, - buffer_size, data_type=DataType.SEQ), + file_dir, + vocab_tag_size, + neg_size, + buffer_size, + data_type=DataType.SEQ), buf_size=buffer_size), - batch_size, batch_size * 20) + batch_size, + batch_size * 20) else: vocab_tag_size = get_vocab_size(vocab_tag_path) vocab_text_size = 0 reader = sort_batch( test( file_dir, vocab_tag_size, buffer_size, data_type=DataType.SEQ), - batch_size, batch_size * 20) + batch_size, + batch_size * 20) return vocab_text_size, vocab_tag_size, reader + def sort_batch(reader, batch_size, sort_group_size, drop_last=False): """ Create a batched reader. @@ -107,11 +135,13 @@ def sort_batch(reader, batch_size, sort_group_size, drop_last=False): class DataType(object): SEQ = 2 + def train_reader_creator(file_dir, tag_size, neg_size, n, data_type): def reader(): files = os.listdir(file_dir) for fi in files: - with open(file_dir + '/' + fi, "r") as f: + with io.open( + os.path.join(file_dir, fi), "r", encoding='utf-8') as f: for l in f: l = l.strip().split(",") pos_index = int(l[0]) @@ -123,7 +153,7 @@ def train_reader_creator(file_dir, tag_size, neg_size, n, data_type): max_iter = 100 now_iter = 0 sum_n = 0 - while(sum_n < neg_size) : + while (sum_n < neg_size): now_iter += 1 if now_iter > max_iter: print("error : only one class") @@ -135,13 +165,16 @@ def train_reader_creator(file_dir, tag_size, neg_size, n, data_type): sum_n += 1 if n > 0 and len(text) > n: continue yield text, pos_tag, neg_tag + return reader + def test_reader_creator(file_dir, tag_size, n, data_type): def reader(): files = os.listdir(file_dir) for fi in files: - with open(file_dir + '/' + fi, "r") as f: + with io.open( + os.path.join(file_dir, fi), "r", encoding='utf-8') as f: for l in f: l = l.strip().split(",") pos_index = int(l[0]) @@ -153,11 +186,13 @@ def test_reader_creator(file_dir, tag_size, n, data_type): tag = [] tag.append(ii) yield text, tag, pos_tag + return reader def train(train_dir, tag_size, neg_size, n, data_type=DataType.SEQ): return train_reader_creator(train_dir, tag_size, neg_size, n, data_type) + def test(test_dir, tag_size, n, data_type=DataType.SEQ): return test_reader_creator(test_dir, tag_size, n, data_type) diff --git a/PaddleRec/word2vec/README.md b/PaddleRec/word2vec/README.md index 724e7bc17d21a701c82b82676c89bb2331082262..581c81aca5f9a7b5cd1396a1a719b477f84c766a 100644 --- a/PaddleRec/word2vec/README.md +++ b/PaddleRec/word2vec/README.md @@ -20,6 +20,8 @@ ## 介绍 本例实现了skip-gram模式的word2vector模型。 +**目前模型库下模型均要求使用PaddlePaddle 1.6及以上版本或适当的develop版本。** + 同时推荐用户参考[ IPython Notebook demo](https://aistudio.baidu.com/aistudio/projectDetail/124377) ## 数据下载 @@ -36,7 +38,7 @@ mv 1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tok ```bash mkdir data -wget https://paddlerec.bj.bcebos.com/word2vec/1-billion-word-language-modeling-benchmark-r13output.tar +wget --no-check-certificate https://paddlerec.bj.bcebos.com/word2vec/1-billion-word-language-modeling-benchmark-r13output.tar tar xvf 1-billion-word-language-modeling-benchmark-r13output.tar mv 
1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/ data/ ``` @@ -45,7 +47,7 @@ mv 1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tok ```bash mkdir data -wget https://paddlerec.bj.bcebos.com/word2vec/text.tar +wget --no-check-certificate https://paddlerec.bj.bcebos.com/word2vec/text.tar tar xvf text.tar mv text data/ ``` @@ -95,7 +97,7 @@ python train.py -h OPENBLAS_NUM_THREADS=1 CPU_NUM=5 python train.py --train_data_dir data/convert_text8 --dict_path data/test_build_dict --num_passes 10 --batch_size 100 --model_output_dir v1_cpu5_b100_lr1dir --base_lr 1.0 --print_batch 1000 --with_speed --is_sparse ``` -本地单机模拟多机训练 +本地单机模拟多机训练, 目前暂不支持windows。 ```bash sh cluster_train.sh @@ -106,9 +108,9 @@ sh cluster_train.sh ```bash #全量数据集测试集 -wget https://paddlerec.bj.bcebos.com/word2vec/test_dir.tar +wget --no-check-certificate https://paddlerec.bj.bcebos.com/word2vec/test_dir.tar #样本数据集测试集 -wget https://paddlerec.bj.bcebos.com/word2vec/test_mid_dir.tar +wget --no-check-certificate https://paddlerec.bj.bcebos.com/word2vec/test_mid_dir.tar ``` 预测命令,注意词典名称需要加后缀"_word_to_id_", 此文件是预处理阶段生成的。 diff --git a/PaddleRec/word2vec/infer.py b/PaddleRec/word2vec/infer.py index 1b3290029d620d130d2fe7b7c2bcfd8bbeae54c2..32f65d17de585ef4f1fc14797e300ef7f55ac877 100644 --- a/PaddleRec/word2vec/infer.py +++ b/PaddleRec/word2vec/infer.py @@ -10,6 +10,9 @@ import paddle.fluid as fluid import paddle import net import utils +if six.PY2: + reload(sys) + sys.setdefaultencoding('utf-8') def parse_args(): @@ -76,15 +79,12 @@ def infer_epoch(args, vocab_size, test_reader, use_cuda, i2w): for data in test_reader(): step_id += 1 b_size = len([dat[0] for dat in data]) - wa = np.array( - [dat[0] for dat in data]).astype("int64").reshape( - b_size, 1) - wb = np.array( - [dat[1] for dat in data]).astype("int64").reshape( - b_size, 1) - wc = np.array( - [dat[2] for dat in data]).astype("int64").reshape( - b_size, 1) + wa = np.array([dat[0] for dat in data]).astype( + "int64").reshape(b_size) + wb = np.array([dat[1] for dat in data]).astype( + "int64").reshape(b_size) + wc = np.array([dat[2] for dat in data]).astype( + "int64").reshape(b_size) label = [dat[3] for dat in data] input_word = [dat[4] for dat in data] @@ -93,9 +93,8 @@ def infer_epoch(args, vocab_size, test_reader, use_cuda, i2w): "analogy_a": wa, "analogy_b": wb, "analogy_c": wc, - "all_label": - np.arange(vocab_size).reshape( - vocab_size, 1).astype("int64"), + "all_label": np.arange(vocab_size) + .reshape(vocab_size).astype("int64"), }, fetch_list=[pred.name, values], return_numpy=False) @@ -143,15 +142,12 @@ def infer_step(args, vocab_size, test_reader, use_cuda, i2w): for data in test_reader(): step_id += 1 b_size = len([dat[0] for dat in data]) - wa = np.array( - [dat[0] for dat in data]).astype("int64").reshape( - b_size, 1) - wb = np.array( - [dat[1] for dat in data]).astype("int64").reshape( - b_size, 1) - wc = np.array( - [dat[2] for dat in data]).astype("int64").reshape( - b_size, 1) + wa = np.array([dat[0] for dat in data]).astype( + "int64").reshape(b_size) + wb = np.array([dat[1] for dat in data]).astype( + "int64").reshape(b_size) + wc = np.array([dat[2] for dat in data]).astype( + "int64").reshape(b_size) label = [dat[3] for dat in data] input_word = [dat[4] for dat in data] @@ -162,7 +158,7 @@ def infer_step(args, vocab_size, test_reader, use_cuda, i2w): "analogy_b": wb, "analogy_c": wc, "all_label": - np.arange(vocab_size).reshape(vocab_size, 1), + np.arange(vocab_size).reshape(vocab_size), 
}, fetch_list=[pred.name, values], return_numpy=False) @@ -185,6 +181,7 @@ def infer_step(args, vocab_size, test_reader, use_cuda, i2w): if __name__ == "__main__": + utils.check_version() args = parse_args() start_index = args.start_index last_index = args.last_index diff --git a/PaddleRec/word2vec/net.py b/PaddleRec/word2vec/net.py index ab2abbc76bde8e03c9a6e1e0abb062aa467d2c91..b8379b6696b3253f60079ccc4b4042e253d5ace4 100644 --- a/PaddleRec/word2vec/net.py +++ b/PaddleRec/word2vec/net.py @@ -23,10 +23,10 @@ import paddle.fluid as fluid def skip_gram_word2vec(dict_size, embedding_size, is_sparse=False, neg_num=5): datas = [] - input_word = fluid.layers.data(name="input_word", shape=[1], dtype='int64') - true_word = fluid.layers.data(name='true_label', shape=[1], dtype='int64') - neg_word = fluid.layers.data( - name="neg_label", shape=[neg_num], dtype='int64') + input_word = fluid.data(name="input_word", shape=[None, 1], dtype='int64') + true_word = fluid.data(name='true_label', shape=[None, 1], dtype='int64') + neg_word = fluid.data( + name="neg_label", shape=[None, neg_num], dtype='int64') datas.append(input_word) datas.append(true_word) @@ -37,7 +37,7 @@ def skip_gram_word2vec(dict_size, embedding_size, is_sparse=False, neg_num=5): words = fluid.layers.read_file(py_reader) init_width = 0.5 / embedding_size - input_emb = fluid.layers.embedding( + input_emb = fluid.embedding( input=words[0], is_sparse=is_sparse, size=[dict_size, embedding_size], @@ -45,33 +45,31 @@ def skip_gram_word2vec(dict_size, embedding_size, is_sparse=False, neg_num=5): name='emb', initializer=fluid.initializer.Uniform(-init_width, init_width))) - true_emb_w = fluid.layers.embedding( + true_emb_w = fluid.embedding( input=words[1], is_sparse=is_sparse, size=[dict_size, embedding_size], param_attr=fluid.ParamAttr( name='emb_w', initializer=fluid.initializer.Constant(value=0.0))) - true_emb_b = fluid.layers.embedding( + true_emb_b = fluid.embedding( input=words[1], is_sparse=is_sparse, size=[dict_size, 1], param_attr=fluid.ParamAttr( name='emb_b', initializer=fluid.initializer.Constant(value=0.0))) - neg_word_reshape = fluid.layers.reshape(words[2], shape=[-1, 1]) - neg_word_reshape.stop_gradient = True + input_emb = fluid.layers.squeeze(input=input_emb, axes=[1]) + true_emb_w = fluid.layers.squeeze(input=true_emb_w, axes=[1]) + true_emb_b = fluid.layers.squeeze(input=true_emb_b, axes=[1]) - neg_emb_w = fluid.layers.embedding( - input=neg_word_reshape, + neg_emb_w = fluid.embedding( + input=words[2], is_sparse=is_sparse, size=[dict_size, embedding_size], param_attr=fluid.ParamAttr( name='emb_w', learning_rate=1.0)) - - neg_emb_w_re = fluid.layers.reshape( - neg_emb_w, shape=[-1, neg_num, embedding_size]) - neg_emb_b = fluid.layers.embedding( - input=neg_word_reshape, + neg_emb_b = fluid.embedding( + input=words[2], is_sparse=is_sparse, size=[dict_size, 1], param_attr=fluid.ParamAttr( @@ -86,8 +84,7 @@ def skip_gram_word2vec(dict_size, embedding_size, is_sparse=False, neg_num=5): true_emb_b) input_emb_re = fluid.layers.reshape( input_emb, shape=[-1, 1, embedding_size]) - neg_matmul = fluid.layers.matmul( - input_emb_re, neg_emb_w_re, transpose_y=True) + neg_matmul = fluid.layers.matmul(input_emb_re, neg_emb_w, transpose_y=True) neg_matmul_re = fluid.layers.reshape(neg_matmul, shape=[-1, neg_num]) neg_logits = fluid.layers.elementwise_add(neg_matmul_re, neg_emb_b_vec) #nce loss @@ -111,22 +108,18 @@ def skip_gram_word2vec(dict_size, embedding_size, is_sparse=False, neg_num=5): def infer_network(vocab_size, emb_size): - 
analogy_a = fluid.layers.data(name="analogy_a", shape=[1], dtype='int64') - analogy_b = fluid.layers.data(name="analogy_b", shape=[1], dtype='int64') - analogy_c = fluid.layers.data(name="analogy_c", shape=[1], dtype='int64') - all_label = fluid.layers.data( - name="all_label", - shape=[vocab_size, 1], - dtype='int64', - append_batch_size=False) - emb_all_label = fluid.layers.embedding( + analogy_a = fluid.data(name="analogy_a", shape=[None], dtype='int64') + analogy_b = fluid.data(name="analogy_b", shape=[None], dtype='int64') + analogy_c = fluid.data(name="analogy_c", shape=[None], dtype='int64') + all_label = fluid.data(name="all_label", shape=[vocab_size], dtype='int64') + emb_all_label = fluid.embedding( input=all_label, size=[vocab_size, emb_size], param_attr="emb") - emb_a = fluid.layers.embedding( + emb_a = fluid.embedding( input=analogy_a, size=[vocab_size, emb_size], param_attr="emb") - emb_b = fluid.layers.embedding( + emb_b = fluid.embedding( input=analogy_b, size=[vocab_size, emb_size], param_attr="emb") - emb_c = fluid.layers.embedding( + emb_c = fluid.embedding( input=analogy_c, size=[vocab_size, emb_size], param_attr="emb") target = fluid.layers.elementwise_add( fluid.layers.elementwise_sub(emb_b, emb_a), emb_c) diff --git a/PaddleRec/word2vec/preprocess.py b/PaddleRec/word2vec/preprocess.py index 1d5ad03c0ab2b562bc6cd53d4cac62a16d181e3b..030ac0bea42fef2cd50ea559934f2be450078ea2 100644 --- a/PaddleRec/word2vec/preprocess.py +++ b/PaddleRec/word2vec/preprocess.py @@ -6,6 +6,10 @@ import six import argparse import io import math +import sys +if six.PY2: + reload(sys) + sys.setdefaultencoding('utf-8') prog = re.compile("[^a-z ]", flags=0) @@ -110,10 +114,14 @@ def filter_corpus(args): if not os.path.exists(args.output_corpus_dir): os.makedirs(args.output_corpus_dir) for file in os.listdir(args.input_corpus_dir): - with io.open(args.output_corpus_dir + '/convert_' + file, "w") as wf: + with io.open( + os.path.join(args.output_corpus_dir, 'convert_' + file), + "w", + encoding='utf-8') as wf: with io.open( - args.input_corpus_dir + '/' + file, encoding='utf-8') as rf: - print(args.input_corpus_dir + '/' + file) + os.path.join(args.input_corpus_dir, file), + encoding='utf-8') as rf: + print(os.path.join(args.input_corpus_dir, file)) for line in rf: signal = False line = text_strip(line) diff --git a/PaddleRec/word2vec/train.py b/PaddleRec/word2vec/train.py index 430ec132d2f810eed0025f16e9b87a8f742c455c..929abf4dc4272fec154bc29415d9651c3dbd5b74 100644 --- a/PaddleRec/word2vec/train.py +++ b/PaddleRec/word2vec/train.py @@ -12,6 +12,12 @@ import six import reader from net import skip_gram_word2vec +import utils +import sys +if six.PY2: + reload(sys) + sys.setdefaultencoding('utf-8') + logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s') logger = logging.getLogger("fluid") logger.setLevel(logging.INFO) @@ -78,6 +84,9 @@ def parse_args(): required=False, default=False, help='print speed or not , (default: False)') + parser.add_argument( + '--enable_ce', action='store_true', help='If set, run the task with continuous evaluation logs.') + return parser.parse_args() @@ -189,6 +198,11 @@ def GetFileList(data_path): def train(args): + # add ce + if args.enable_ce: + SEED = 102 + fluid.default_main_program().random_seed = SEED + fluid.default_startup_program().random_seed = SEED if not os.path.isdir(args.model_output_dir): os.mkdir(args.model_output_dir) @@ -224,5 +238,6 @@ def train(args): if __name__ == '__main__': + utils.check_version() args = parse_args() train(args) 
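The `net.py` hunk above moves `skip_gram_word2vec` from `fluid.layers.data`/`fluid.layers.embedding` to `fluid.data`/`fluid.embedding`, squeezes the positive embeddings and drops the explicit reshape of the negative words before the `matmul`. As a sanity check on the resulting shapes, here is a NumPy-only sketch of the same forward pass; the sizes and tables are illustrative, not the model code:

```python
# Shape walk-through of the rewritten skip-gram forward pass, in plain NumPy.
import numpy as np

batch, dict_size, emb_dim, neg_num = 4, 1000, 8, 5
emb = np.random.uniform(-0.5 / emb_dim, 0.5 / emb_dim, (dict_size, emb_dim))
emb_w = np.zeros((dict_size, emb_dim))   # 'emb_w' table starts at 0.0, as in the diff
emb_b = np.zeros((dict_size, 1))         # 'emb_b' table starts at 0.0, as in the diff

input_word = np.random.randint(dict_size, size=(batch,))      # centre words
true_word = np.random.randint(dict_size, size=(batch,))       # observed context words
neg_word = np.random.randint(dict_size, size=(batch, neg_num))

input_emb = emb[input_word]                                    # [batch, emb_dim]
true_emb_w, true_emb_b = emb_w[true_word], emb_b[true_word]    # [batch, emb_dim], [batch, 1]

# positive logit: dot product of centre and context vectors plus bias -> [batch, 1]
true_logits = (input_emb * true_emb_w).sum(axis=1, keepdims=True) + true_emb_b

# negative logits: [batch, 1, emb_dim] x [batch, emb_dim, neg_num] -> [batch, neg_num]
neg_emb_w = emb_w[neg_word]                                    # [batch, neg_num, emb_dim]
neg_emb_b = emb_b[neg_word].squeeze(-1)                        # [batch, neg_num]
neg_logits = np.matmul(input_emb[:, None, :], neg_emb_w.transpose(0, 2, 1))
neg_logits = neg_logits.reshape(batch, neg_num) + neg_emb_b

print(true_logits.shape, neg_logits.shape)                     # (4, 1) (4, 5)
```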
diff --git a/PaddleRec/word2vec/utils.py b/PaddleRec/word2vec/utils.py index 01cd04e493b09e880303d7b0c87f5ed71cf86357..c09e30d7e7115f4772eac94af868d190e6555ea7 100644 --- a/PaddleRec/word2vec/utils.py +++ b/PaddleRec/word2vec/utils.py @@ -7,12 +7,13 @@ import paddle.fluid as fluid import paddle import os import preprocess +import io def BuildWord_IdMap(dict_path): word_to_id = dict() id_to_word = dict() - with open(dict_path, 'r') as f: + with io.open(dict_path, 'r', encoding='utf-8') as f: for line in f: word_to_id[line.split(' ')[0]] = int(line.split(' ')[1]) id_to_word[int(line.split(' ')[1])] = line.split(' ')[0] @@ -22,10 +23,26 @@ def BuildWord_IdMap(dict_path): def prepare_data(file_dir, dict_path, batch_size): w2i, i2w = BuildWord_IdMap(dict_path) vocab_size = len(i2w) - reader = paddle.batch(test(file_dir, w2i), batch_size) + reader = fluid.io.batch(test(file_dir, w2i), batch_size) return vocab_size, reader, i2w +def check_version(): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." \ + + try: + fluid.require_version('1.6.0') + except Exception as e: + logger.error(err) + sys.exit(1) + + def native_to_unicode(s): if _is_unicode(s): return s @@ -75,7 +92,8 @@ def reader_creator(file_dir, word_to_id): def reader(): files = os.listdir(file_dir) for fi in files: - with open(file_dir + '/' + fi, "r") as f: + with io.open( + os.path.join(file_dir, fi), "r", encoding='utf-8') as f: for line in f: if ':' in line: pass diff --git a/PaddleST/README.md b/PaddleST/README.md index 0736b0f71d99be0ed6e1f9bf2a31cf6086234b29..859689626fca2d830a3e8cdd055c3aebb422ee0d 100644 --- a/PaddleST/README.md +++ b/PaddleST/README.md @@ -10,7 +10,8 @@ The full list of frontier research: |研究课题|论文全文|开源地址| |----|----|----| -|MONOPOLY: Learning to Price Public Facilities for Revaluing Private Properties with Large-scale Urban Data|链接: https://pan.baidu.com/s/1mYT3gpqurOpMJ8seLJogDg 提取码: qm1b |https://github.com/PaddlePaddle/models/tree/develop/PaddleST/Research/CIKM2019-MONOPOLY| +|MONOPOLY: Learning to Price Public Facilities for Revaluing Private Properties with Large-scale Urban Data|https://dl.acm.org/doi/10.1145/3357384.3357810|https://github.com/PaddlePaddle/models/tree/develop/PaddleST/Research/CIKM2019-MONOPOLY| +||| ### 前沿应用项目列表如下: @@ -19,3 +20,4 @@ The full list of frontier industrial projects: |应用项目|项目简介|开源地址| |----|----|----| |||| + diff --git a/PaddleST/Research/CIKM2019-MONOPOLY/README.md b/PaddleST/Research/CIKM2019-MONOPOLY/README.md index 8ce54af453af165ebb25137435e96679d1516407..1fe233edf933c91b2706d4234e21ba0569c9d568 100644 --- a/PaddleST/Research/CIKM2019-MONOPOLY/README.md +++ b/PaddleST/Research/CIKM2019-MONOPOLY/README.md @@ -29,7 +29,7 @@ We have conducted extensive experiments with the large-scale urban data of sever 1. paddle安装 - 本项目依赖于Paddle Fluid 1.5.1 及以上版本,请参考[安装指南](http://www.paddlepaddle.org/#quick-start)进行安装 + 本项目依赖于Paddle Fluid 1.6.1 及以上版本,请参考[安装指南](http://www.paddlepaddle.org/#quick-start)进行安装 2. 
下载代码 @@ -74,9 +74,9 @@ We have conducted extensive experiments with the large-scale urban data of sever Please feel free to review our paper :) -链接: https://pan.baidu.com/s/1mYT3gpqurOpMJ8seLJogDg 提取码: qm1b +链接(link): [https://dl.acm.org/citation.cfm?id=3357810] (https://dl.acm.org/citation.cfm?id=3357810) + -[ACM DL](https://dl.acm.org/citation.cfm?id=3357810) ## 引用格式(Paper Citation) diff --git a/PaddleST/Research/CIKM2019-MONOPOLY/conf/house_price/house_price.local.template b/PaddleST/Research/CIKM2019-MONOPOLY/conf/house_price/house_price.local.template index a0e9b5a99ff3701575803c862173b8395b78588d..576aca44387e5b0f4235d6a01e190f4a4b845c06 100644 --- a/PaddleST/Research/CIKM2019-MONOPOLY/conf/house_price/house_price.local.template +++ b/PaddleST/Research/CIKM2019-MONOPOLY/conf/house_price/house_price.local.template @@ -280,7 +280,7 @@ num_in_dimension: ${DEFAULT:num_in_dimension} num_out_dimension: ${DEFAULT:num_out_dimension} # Directory where the results are saved to -eval_dir: ${Train:train_dir}/epoch +eval_dir: ${Train:train_dir}/checkpoint_1 # The number of samples in each batch batch_size: ${DEFAULT:eval_batch_size} diff --git a/PaddleST/Research/CIKM2019-MONOPOLY/nets/house_price/house_price.py b/PaddleST/Research/CIKM2019-MONOPOLY/nets/house_price/house_price.py index f535977f27cc97292292127154c7c46114d667d4..4d21b77ffde0b79adc3be440578b99080c2671e6 100644 --- a/PaddleST/Research/CIKM2019-MONOPOLY/nets/house_price/house_price.py +++ b/PaddleST/Research/CIKM2019-MONOPOLY/nets/house_price/house_price.py @@ -77,8 +77,7 @@ class HousePrice(BaseNet): act=act) return _fc - - def pred_format(self, result): + def pred_format(self, result, **kwargs): """ format pred output """ @@ -118,7 +117,7 @@ class HousePrice(BaseNet): max_house_num = FLAGS.max_house_num max_public_num = FLAGS.max_public_num - + pred_keys = inputs.keys() #step1. get house self feature if FLAGS.with_house_attr: def _get_house_attr(name, attr_vec_size): @@ -136,6 +135,10 @@ class HousePrice(BaseNet): else: #no house attr house_vec = fluid.layers.reshape(inputs["house_business"], [-1, self.city_info.business_num]) + pred_keys.remove('house_wuye') + pred_keys.remove('house_kfs') + pred_keys.remove('house_age') + pred_keys.remove('house_lou') house_self = self.fc_fn(house_vec, 1, act='sigmoid', layer_name='house_self', FLAGS=FLAGS) house_self = fluid.layers.reshape(house_self, [-1, 1]) @@ -192,8 +195,8 @@ class HousePrice(BaseNet): net_output = {"debug_output": debug_output, "model_output": model_output} - model_output['feeded_var_names'] = inputs.keys() - model_output['target_vars'] = [label, pred] + model_output['feeded_var_names'] = pred_keys + model_output['fetch_targets'] = [label, pred] model_output['loss'] = avg_cost #debug_output['pred'] = pred diff --git a/PaddleST/Research/KDD2020-P3AC/README.md b/PaddleST/Research/KDD2020-P3AC/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f945040d3aacd988c4f470cd420d02836fd9de11 --- /dev/null +++ b/PaddleST/Research/KDD2020-P3AC/README.md @@ -0,0 +1,75 @@ +# P3AC + +## 任务说明(Introduction) + +TODO + +## 安装说明(Install Guide) + +### 环境准备 + +1. paddle安装 + + 本项目依赖于Paddle Fluid 1.6.1 及以上版本,请参考[安装指南](http://www.paddlepaddle.org/#quick-start)进行安装 + +2. 
下载代码 + + 克隆数据集代码库到本地, 本代码依赖[Paddle-EPEP框架](https://github.com/PaddlePaddle/epep) + ``` + git clone https://github.com/PaddlePaddle/epep.git + cd epep + git clone https://github.com/PaddlePaddle/models.git + ln -s models/PaddleST/Research/KDD2020-P3AC/conf/poi_qac_personalized conf/poi_qac_personalized + ln -s models/PaddleST/Research/KDD2020-P3AC/datasets/poi_qac_personalized datasets/poi_qac_personalized + ln -s models/PaddleST/Research/KDD2020-P3AC/nets/poi_qac_personalized nets/poi_qac_personalized + ``` + +3. 环境依赖 + + python版本依赖python 2.7 + + +### 实验说明 + +1. 数据准备 + + TODO + ``` + #script to download + ``` + +2. 模型训练 + + ``` + cp conf/poi_qac_personalized/poi_qac_personalized.local.conf.template conf/poi_qac_personalized/poi_qac_personalized.local.conf + sh run.sh -c conf/poi_qac_personalized/poi_qac_personalized.local.conf -m train [ -g 0 ] + ``` + +3. 模型评估 + ``` + pred_gpu=$1 + mode=$2 #query, poi, eval + + if [ $# -lt 2 ];then + exit 1 + fi + + #编辑conf/poi_qac_personalized/poi_qac_personalized.local.conf.template,打开 CUDA_VISIBLE_DEVICES: + cp conf/poi_qac_personalized/poi_qac_personalized.local.conf.template conf/poi_qac_personalized/poi_qac_personalized.local.conf + sed -i "s##$pred_gpu#g" conf/poi_qac_personalized/poi_qac_personalized.local.conf + sed -i "s##$mode#g" conf/poi_qac_personalized/poi_qac_personalized.local.conf + + sh run.sh -c poi_qac_personalized.local -m predict 1>../tmp/$mode-pred$pred_gpu.out 2>../tmp/$mode-pred$pred_gpu.err + ``` + +## 论文下载(Paper Download) + +Please feel free to review our paper :) + +TODO + +## 引用格式(Paper Citation) + +TODO + + diff --git a/PaddleST/Research/KDD2020-P3AC/conf/poi_qac_personalized/poi_qac_personalized.local.conf.template b/PaddleST/Research/KDD2020-P3AC/conf/poi_qac_personalized/poi_qac_personalized.local.conf.template new file mode 100644 index 0000000000000000000000000000000000000000..b89e0f76ff8a69a2fa537ae5b740e8d1658b1cd6 --- /dev/null +++ b/PaddleST/Research/KDD2020-P3AC/conf/poi_qac_personalized/poi_qac_personalized.local.conf.template @@ -0,0 +1,342 @@ +[DEFAULT] +sample_seed: 1234 +# The value in `DEFAULT` section will be referenced by other sections. 
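The template leans on `${section:option}` references such as `${DEFAULT:train_batch_size}`. The syntax matches `configparser`'s `ExtendedInterpolation`; whether the epep loader (`utils.load_conf_file.LoadConfFile`) uses exactly that class is an assumption, but the resolution rule can be sketched in isolation (Python 3 stdlib shown for brevity, although the project itself targets Python 2.7):

```python
# Minimal sketch of how ${section:option} references in the template resolve.
# Assumes configparser-style ExtendedInterpolation; the actual epep loader may differ.
import configparser

conf_text = """
[DEFAULT]
train_batch_size: 128
dataset_name: PoiQacPersonalized

[Train]
batch_size: ${DEFAULT:train_batch_size}
dataset_name: ${DEFAULT:dataset_name}
"""

parser = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
parser.read_string(conf_text)
print(parser.get("Train", "batch_size"))    # -> 128
print(parser.get("Train", "dataset_name"))  # -> PoiQacPersonalized
```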
+# For convinence, we will put the variables which changes frequently here and +# let other section refer them +debug_mode: False +#reader: dataset | pyreader | async | datafeed | sync +#data_reader: dataset +dataset_mode: Memory +#data_reader: datafeed +data_reader: pyreader +py_reader_iterable: False + +#model_type: lstm_net +model_type: cnn_net +vocab_size: 93896 +#emb_dim: 200 +emb_dim: 128 +time_size: 28 +tag_size: 371 +fc_dim: 64 + +emb_lr: 1.0 +base_lr: 0.001 +margin: 0.35 +window_size: 3 +pooling_type: max +#activate: sigmoid +activate: None +use_attention: True +use_personal: True +max_seq_len: 128 +prefix_word_id: True +#print_period: 200 +#TODO personal_resident_drive + neg_only_sample +#query cityid trendency, poi tag/alias +#local-cpu | local-gpu | pserver-cpu | pserver-gpu | nccl2 +platform: local-gpu +# Input settings +dataset_name: PoiQacPersonalized + +CUDA_VISIBLE_DEVICES: 0,1,2,3 +#CUDA_VISIBLE_DEVICES: + +train_batch_size: 128 +#train_batch_size: 2 +eval_batch_size: 2 +#file_list: ../tmp/data/poi/qac/train_data/part-00000 +dataset_dir: ../tmp/data/poi/qac/train_data +#init_train_params: ../tmp/data/poi/qac/tencent_pretrain.words +tag_dict_path: None +qac_dict_path: None +kv_path: None +#qac_dict_path: ./datasets/poi_qac_personalized/qac_term.dict +#tag_dict_path: ./datasets/poi_qac_personalized/poi_tag.dict +#kv_path: ../tmp/data/poi/qac/kv + +# Model settings +model_name: PoiQacPersonalized +preprocessing_name: None +#file_pattern: %s-part-* +file_pattern: part- +num_in_dimension: 3 +num_out_dimension: 4 + +# Learning options +num_samples_train: 100 +num_samples_eval: 10 +max_number_of_steps: 155000 + +[Convert] +# The name of the dataset to convert +dataset_name: ${DEFAULT:dataset_name} + +#dataset_dir: ${DEFAULT:dataset_dir} +dataset_dir: stream + +# The output Records file name prefix. +dataset_split_name: train + +# The number of Records per shard +num_per_shard: 100000 + +# The dimensions of net input vectors, it is just used by svm dataset +# which of input are sparse tensors now +num_in_dimension: ${DEFAULT:num_in_dimension} + +# The output file name pattern with two placeholders ("%s" and "%d"), +# it must correspond to the glob `file_pattern' in Train and Evaluate +# config sections +#file_pattern: %s-part-%05d +file_pattern: part- + + +[Train] +####################### +# Dataset Configure # +####################### +# The name of the dataset to load +dataset_name: ${DEFAULT:dataset_name} + +# The directory where the dataset files are stored +dataset_dir: ${DEFAULT:dataset_dir} + +# dataset_split_name +dataset_split_name: train + +batch_shuffle_size: 128 +#log_exp or hinge +#loss_func: hinge +loss_func: log_exp +neg_sample_num: 5 +reader_batch: True +drop_last_batch: False + +# The glob pattern for data path, `file_pattern' must contain only one "%s" +# which is the placeholder for split name (such as 'train', 'validation') +file_pattern: ${DEFAULT:file_pattern} + +# The file type text or record +file_type: record + +# kv path, used in image_sim +kv_path: ${DEFAULT:kv_path} + +# The number of input sample for training +num_samples: ${DEFAULT:num_samples_train} + +# The number of parallel readers that read data from the dataset +num_readers: 2 + +# The number of threads used to create the batches +num_preprocessing_threads: 2 + +# Number of epochs from dataset source +num_epochs_input: 10 + +########################### +# Basic Train Configure # +########################### +# Directory where checkpoints and event logs are written to. 
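The `[Train]` section above selects `loss_func: log_exp` with `neg_sample_num: 5`; the alternative is the pairwise hinge with the `margin` from `[DEFAULT]`. Both losses are implemented further down in `nets/poi_qac_personalized/qac_personalized.py`; a NumPy sketch of what each computes for one positive and N negative cosine scores (the scores here are made up):

```python
# NumPy sketch of the two ranking losses selectable via `loss_func`.
import numpy as np

pos = np.array([[0.9], [0.4]])                       # [batch, 1] cos(query, pos poi)
neg = np.array([[0.2, 0.1, 0.3], [0.5, 0.6, 0.2]])   # [batch, n] cos(query, neg pois)

def log_exp_loss(pos, neg, gama=15.0):
    # softmax over {pos} U {negs}: -log( exp(g*pos) / (exp(g*pos) + sum_i exp(g*neg_i)) )
    exp_pos = np.exp(gama * pos)                      # [batch, 1]
    exp_neg = np.exp(gama * neg)                      # [batch, n]
    return -np.log(exp_pos / (exp_pos + exp_neg.sum(axis=1, keepdims=True)))

def pairwise_hinge_loss(pos, neg, margin=0.35):
    # max(0, margin - pos + neg), one term per negative
    return np.maximum(0.0, margin - pos + neg)

print(log_exp_loss(pos, neg).ravel())    # small once the positive clearly dominates
print(pairwise_hinge_loss(pos, neg))     # zero once pos beats each neg by the margin
```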
+train_dir: ../tmp/model/poi/qac/save_model +# The max number of ckpt files to store variables +save_max_to_keep: 40 + +# The frequency with which the model is saved, in seconds. +save_model_secs: None + +# The frequency with which the model is saved, in steps. +save_model_steps: 5000 + +# The name of the architecture to train +model_name: ${DEFAULT:model_name} + +# The dimensions of net input vectors, it is just used by svm dataset +# which of input are sparse tensors now +num_in_dimension: ${DEFAULT:num_in_dimension} + +# The dimensions of net output vector, it will be num of classes in image classify task +num_out_dimension: ${DEFAULT:num_out_dimension} + +##################################### +# Training Optimization Configure # +##################################### +# The number of samples in each batch +batch_size: ${DEFAULT:train_batch_size} + +# The maximum number of training steps +max_number_of_steps: ${DEFAULT:max_number_of_steps} + +# The weight decay on the model weights +#weight_decay: 0.00000001 +weight_decay: None + +# The decay to use for the moving average. If left as None, then moving averages are not used +moving_average_decay: None + +# ***************** learning rate options ***************** # + +# Specifies how the learning rate is decayed. One of "fixed", "exponential" or "polynomial" +learning_rate_decay_type: fixed + +# Learning rate decay factor +learning_rate_decay_factor: 0.1 + +# Proportion of training steps to perform linear learning rate warmup for +learning_rate_warmup_proportion: 0.1 + +init_learning_rate: 0 + +learning_rate_warmup_steps: 10000 + +# The minimal end learning rate used by a polynomial decay learning rate +end_learning_rate: 0.0001 + +# Number of epochs after which learning rate decays +num_epochs_per_decay: 10 + +# A boolean, whether or not it should cycle beyond decay_steps +learning_rate_polynomial_decay_cycle: False + +# ******************* optimizer options ******************* # +# The name of the optimizer, one of the following: +# "adadelta", "adagrad", "adam", "ftrl", "momentum", "sgd" or "rmsprop" +#optimizer: weight_decay_adam +optimizer: adam +#optimizer: sgd +# Epsilon term for the optimizer, used for adadelta, adam, rmsprop +opt_epsilon: 1e-8 + +# conf for adadelta +# The decay rate for adadelta +adadelta_rho: 0.95 +# Starting value for the AdaGrad accumulators +adagrad_initial_accumulator_value: 0.1 + +# conf for adam +# The exponential decay rate for the 1st moment estimates +adam_beta1: 0.9 +# The exponential decay rate for the 2nd moment estimates +adam_beta2: 0.997 + +adam_weight_decay: 0.01 +#adam_exclude_from_weight_decay: LayerNorm,layer_norm,bias +# conf for ftrl +# The learning rate power +ftrl_learning_rate_power: -0.1 +# Starting value for the FTRL accumulators +ftrl_initial_accumulator_value: 0.1 +# The FTRL l1 regularization strength +ftrl_l1: 0.0 +# The FTRL l2 regularization strength +ftrl_l2: 0.01 + +# conf for momentum +# The momentum for the MomentumOptimizer and RMSPropOptimizer +momentum: 0.9 + +# conf for rmsprop +# Decay term for RMSProp +rmsprop_decay: 0.9 + + +# Number of model clones to deploy +num_gpus: 3 + +############################# +# Log and Trace Configure # +############################# +# The frequency with which logs are print +log_every_n_steps: 100 + +# The frequency with which logs are trace. 
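With `init_learning_rate: 0` the scheduler branch in the net is skipped; when it is positive, the net builds an Adam optimizer on a Noam warm-up schedule derived from `emb_dim` and `learning_rate_warmup_steps` (see the `noam_decay` call further down in `nets/poi_qac_personalized/qac_personalized.py`). A small standalone sketch of that schedule, with the template's values plugged in for illustration:

```python
# Sketch of the Noam warm-up schedule used when init_learning_rate > 0:
# lr(step) = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5) * init_learning_rate
def noam_lr(step, d_model=128, warmup_steps=10000, init_learning_rate=0.001):
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5) * init_learning_rate

for s in (100, 1000, 10000, 100000):
    print(s, noam_lr(s))   # rises linearly during warm-up, then decays as step^-0.5
```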
+trace_every_n_steps: 1 + + +[Evaluate] +# process mode: pred, eval or export +#proc_name: eval +proc_name: pred + +#data_reader: datafeed +py_reader_iterable: True +#platform: hadoop +platform: local-gpu +qac_dict_path: ./datasets/poi_qac_personalized/qac_term.dict +tag_dict_path: ./datasets/poi_qac_personalized/poi_tag.dict +#kv_path: ../tmp/data/poi/qac/kv +# The directory where the dataset files are stored +#file_list: ../tmp/x.bug +file_list: ../tmp/data/poi/qac/recall_data//part-0 +#file_list: ../tmp/data/poi/qac/ltr_data//part-0 +#dataset_dir: stream_record +# The directory where the model was written to or an absolute path to a checkpoint file +init_pretrain_model: ../tmp/model/poi/qac/save_model_logexp/checkpoint_125000 +#init_pretrain_model: ../tmp/model/poi/qac/save_model_personal_logexp/checkpoint_125000 +#init_pretrain_model: ../tmp/model/poi/qac/save_model_wordid_logexp/checkpoint_125000 +#init_pretrain_model: ../tmp/model/poi/qac/save_model_personal_wordid_logexp/checkpoint_125000 +#init_pretrain_model: ../tmp/model/poi/qac/save_model_attention_logexp/checkpoint_125000 +#init_pretrain_model: ../tmp/model/poi/qac/save_model_attention_personal_logexp/checkpoint_125000 +#init_pretrain_model: ../tmp/model/poi/qac/save_model_attention_wordid_logexp/checkpoint_125000 +#init_pretrain_model: ../tmp/model/poi/qac/save_model_attention_personal_wordid_logexp/checkpoint_125000 +model_type: cnn_net +fc_dim: 64 +use_attention: False +use_personal: False +prefix_word_id: False + +#dump_vec: query +#dump_vec: +dump_vec: eval +# The number of samples in each batch +#batch_size: ${DEFAULT:eval_batch_size} +batch_size: 1 + +# The file type text or record +#file_type: record +file_type: text + +reader_batch: False + +# only exectute evaluation once +eval_once: True + +####################### +# Dataset Configure # +####################### +# The name of the dataset to load +dataset_name: ${DEFAULT:dataset_name} + +# The name of the train/test split +dataset_split_name: validation + +# The glob pattern for data path, `file_pattern' must contain only one "%s" +# which is the placeholder for split name (such as 'train', 'validation') +file_pattern: ${DEFAULT:file_pattern} + +# The number of input sample for evaluation +num_samples: ${DEFAULT:num_samples_eval} + +# The number of parallel readers that read data from the dataset +num_readers: 2 + +# The number of threads used to create the batches +num_preprocessing_threads: 1 + +# Number of epochs from dataset source +num_epochs_input: 1 + +# The name of the architecture to evaluate +model_name: ${DEFAULT:model_name} + +# The dimensions of net input vectors, it is just used by svm dataset +# which of input are sparse tensors now +num_in_dimension: ${DEFAULT:num_in_dimension} + +# The dimensions of net output vector, it will be num of classes in image classify task +num_out_dimension: ${DEFAULT:num_out_dimension} + +# Directory where the results are saved to +eval_dir: ${Train:train_dir}/checkpoint_1 + diff --git a/PaddleST/Research/KDD2020-P3AC/datasets/poi_qac_personalized/__init__.py b/PaddleST/Research/KDD2020-P3AC/datasets/poi_qac_personalized/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/PaddleST/Research/KDD2020-P3AC/datasets/poi_qac_personalized/qac_personalized.py b/PaddleST/Research/KDD2020-P3AC/datasets/poi_qac_personalized/qac_personalized.py new file mode 100644 index 0000000000000000000000000000000000000000..c27db98ceefb5daebb0088d1006e13e2226e643b 
--- /dev/null +++ b/PaddleST/Research/KDD2020-P3AC/datasets/poi_qac_personalized/qac_personalized.py @@ -0,0 +1,577 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +################################################################################ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +################################################################################ + +""" + Specify the brief poi_qac_personalized.py +""" +import os +import sys +import re +import time +import numpy as np +import random +import paddle.fluid as fluid + +from datasets.base_dataset import BaseDataset + +reload(sys) +sys.setdefaultencoding('gb18030') + + +base_rule = re.compile("[\1\2]") + +class PoiQacPersonalized(BaseDataset): + """ + PoiQacPersonalized dataset + """ + def __init__(self, flags): + super(PoiQacPersonalized, self).__init__(flags) + self.inited_dict = False + + def parse_context(self, inputs): + """ + provide input context + """ + + """ + set inputs_kv: please set key as the same as layer.data.name + + notice: + (1) + If user defined "inputs key" is different from layer.data.name, + the frame will rewrite "inputs key" with layer.data.name + (2) + The param "inputs" will be passed to user defined nets class through + the nets class interface function : net(self, FLAGS, inputs), + """ + if self._flags.use_personal: + #inputs['user_loc_geoid'] = fluid.layers.data(name="user_loc_geoid", shape=[40], + # dtype="int64", lod_level=0) #from clk poi + #inputs['user_bound_geoid'] = fluid.layers.data(name="user_bound_geoid", shape=[40], + # dtype="int64", lod_level=0) #from clk poi + #inputs['user_time_id'] = fluid.layers.data(name="user_time_geoid", shape=[1], + # dtype="int64", lod_level=1) #from clk poi + inputs['user_clk_geoid'] = fluid.layers.data(name="user_clk_geoid", shape=[40], + dtype="int64", lod_level=0) #from clk poi + inputs['user_tag_id'] = fluid.layers.data(name="user_tag_id", shape=[1], + dtype="int64", lod_level=1) #from clk poi + inputs['user_resident_geoid'] = fluid.layers.data(name="user_resident_geoid", shape=[40], + dtype="int64", lod_level=0) #home, company + inputs['user_navi_drive'] = fluid.layers.data(name="user_navi_drive", shape=[1], + dtype="int64", lod_level=0) #driver or not + + inputs['prefix_letter_id'] = fluid.layers.data(name="prefix_letter_id", shape=[1], + dtype="int64", lod_level=1) + if self._flags.prefix_word_id: + inputs['prefix_word_id'] = fluid.layers.data(name="prefix_word_id", shape=[1], + dtype="int64", lod_level=1) + inputs['prefix_loc_geoid'] = fluid.layers.data(name="prefix_loc_geoid", shape=[40], + dtype="int64", lod_level=0) + if self._flags.use_personal: + inputs['prefix_time_id'] = fluid.layers.data(name="prefix_time_id", shape=[1], + dtype="int64", lod_level=1) + + inputs['pos_name_letter_id'] = fluid.layers.data(name="pos_name_letter_id", shape=[1], + dtype="int64", lod_level=1) + inputs['pos_name_word_id'] = fluid.layers.data(name="pos_name_word_id", shape=[1], + dtype="int64", 
lod_level=1) + inputs['pos_addr_letter_id'] = fluid.layers.data(name="pos_addr_letter_id", shape=[1], + dtype="int64", lod_level=1) + inputs['pos_addr_word_id'] = fluid.layers.data(name="pos_addr_word_id", shape=[1], + dtype="int64", lod_level=1) + inputs['pos_loc_geoid'] = fluid.layers.data(name="pos_loc_geoid", shape=[40], + dtype="int64", lod_level=0) + if self._flags.use_personal: + inputs['pos_tag_id'] = fluid.layers.data(name="pos_tag_id", shape=[1], + dtype="int64", lod_level=1) + + if self.is_training: + inputs['neg_name_letter_id'] = fluid.layers.data(name="neg_name_letter_id", shape=[1], + dtype="int64", lod_level=1) + inputs['neg_name_word_id'] = fluid.layers.data(name="neg_name_word_id", shape=[1], + dtype="int64", lod_level=1) + inputs['neg_addr_letter_id'] = fluid.layers.data(name="neg_addr_letter_id", shape=[1], + dtype="int64", lod_level=1) + inputs['neg_addr_word_id'] = fluid.layers.data(name="neg_addr_word_id", shape=[1], + dtype="int64", lod_level=1) + inputs['neg_loc_geoid'] = fluid.layers.data(name="neg_loc_geoid", shape=[40], + dtype="int64", lod_level=0) + if self._flags.use_personal: + inputs['neg_tag_id'] = fluid.layers.data(name="neg_tag_id", shape=[1], + dtype="int64", lod_level=1) + else: + #for predict label + inputs['label'] = fluid.layers.data(name="label", shape=[1], + dtype="int64", lod_level=0) + + context = {"inputs": inputs} + + #set debug list, print info during training + #debug_list = [key for key in inputs] + #context["debug_list"] = ["prefix_ids", "label"] + + return context + + def _init_dict(self): + """ + init dict + """ + if self.inited_dict: + return + + if self._flags.platform in ('local-gpu', 'pserver-gpu', 'slurm'): + gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0)) + self.place = fluid.CUDAPlace(gpu_id) + else: + self.place = fluid.CPUPlace() + + self.term_dict = {} + if self._flags.qac_dict_path is not None: + with open(self._flags.qac_dict_path, 'r') as f: + for line in f: + term, term_id = line.strip('\r\n').split('\t') + self.term_dict[term] = int(term_id) + + self.tag_info = {} + if self._flags.tag_dict_path is not None: + with open(self._flags.tag_dict_path, 'r') as f: + for line in f: + tag, level, tid = line.strip('\r\n').split('\t') + self.tag_info[tag] = map(int, tid.split(',')) + + self.user_kv = None + self.poi_kv = None + if self._flags.kv_path is not None: + self.poi_kv = {} + with open(self._flags.kv_path + "/sug_raw.dat", "r") as f: + for line in f: + pid, val = line.strip('\r\n').split('\t', 1) + self.poi_kv[pid] = val + + self.user_kv = {} + with open(self._flags.kv_path + "/user_profile.dat", "r") as f: + for line in f: + uid, val = line.strip('\r\n').split('\t', 1) + self.user_kv[uid] = val + + sys.stderr.write("load user kv:%s\n" % self._flags.kv_path) + + self.inited_dict = True + sys.stderr.write("loaded term dict:%s, tag_dict:%s\n" % (len(self.term_dict), len(self.tag_info))) + + def _get_time_id(self, ts): + """ + get time id:0-27 + """ + ts_struct = time.localtime(ts) + + week = ts_struct[6] + hour = ts_struct[3] + + base = 0 + if hour >= 0 and hour < 6: + base = 0 + elif hour >= 6 and hour < 12: + base = 1 + elif hour >= 12 and hour < 18: + base = 2 + else: + base = 3 + + final = week * 4 + base + return final + + def _pad_batch_data(self, insts, pad_idx, return_max_len=True, return_num_token=False): + """ + Pad the instances to the max sequence length in batch, and generate the + corresponding position data and attention bias. 
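`_get_time_id` above folds the query timestamp into a weekday times 6-hour slot, which is why `time_size` is 28 (7 days x 4 slots) in the template. A standalone sketch of the same mapping:

```python
# Standalone sketch of the _get_time_id bucketing above: weekday (0-6) x 6-hour
# slot (0-3) -> id in [0, 27], matching time_size: 28 in the template.
import time

def time_bucket(ts):
    ts_struct = time.localtime(ts)
    week, hour = ts_struct[6], ts_struct[3]
    base = hour // 6          # 0: 0-5h, 1: 6-11h, 2: 12-17h, 3: 18-23h
    return week * 4 + base

print(time_bucket(time.time()))   # some id between 0 and 27
```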
+ """ + return_list = [] + max_len = max(len(inst) for inst in insts) + # Any token included in dict can be used to pad, since the paddings' loss + # will be masked out by weights and make no effect on parameter gradients. + inst_data = np.array( + [inst + [pad_idx] * (max_len - len(inst)) for inst in insts]) + return_list += [inst_data.astype("int64").reshape([-1, 1])] + + if return_max_len: + return_list += [max_len] + if return_num_token: + num_token = 0 + for inst in insts: + num_token += len(inst) + return_list += [num_token] + return return_list if len(return_list) > 1 else return_list[0] + + def _get_tagid(self, tag_str): + if len(tag_str.strip()) < 1: + return [] + tags = set() + for t in tag_str.split(): + if ':' in t: + t = t.split(':')[0] + t = t.lower() + if t in self.tag_info: + tags.update(self.tag_info[t]) + return list(tags) + + def _get_ids(self, seg_info): + #phraseseg, basicseg = seg_info + + if len(seg_info) < 2: + return [0], [0] + _, bt = [x.split('\3') for x in seg_info] + + rq = "".join(bt) + bl = [t.encode('gb18030') for t in rq.decode('gb18030')] + letter_ids = [] + for t in bl: + letter_ids.append(self.term_dict.get(t.lower(), 1)) + if len(letter_ids) >= self._flags.max_seq_len: + break + + word_ids = [] + for t in bt: + word_ids.append(self.term_dict.get(t.lower(), 1)) + if len(word_ids) >= self._flags.max_seq_len: + break + return letter_ids, word_ids + + def _get_poi_ids(self, poi_str, max_num=0): + if len(poi_str) < 1: + return [] + ids = [] + all_p = poi_str.split('\1') + + pidx = range(0, len(all_p)) + if max_num > 0: + #neg sample: last 10 is negative sampling + if len(all_p) > max_num: + neg_s_idx = len(all_p) - 10 + pidx = [1, 2] + random.sample(pidx[3:neg_s_idx], max_num - 13) + pidx[neg_s_idx:] + else: + pidx = pidx[1:] + bids = set() + for x in pidx: + poi_seg = all_p[x].split('\2') + tagid = [0] + if len(poi_seg) >= 9: + #name, uid, index, name_lid, name_wid, addr_lid, addr_wid, geohash, tagid + bid = poi_seg[1] + name_letter_id = map(int, poi_seg[3].split())[:self._flags.max_seq_len] + name_word_id = map(int, poi_seg[4].split())[:self._flags.max_seq_len] + addr_letter_id = map(int, poi_seg[5].split())[:self._flags.max_seq_len] + addr_word_id = map(int, poi_seg[6].split())[:self._flags.max_seq_len] + ghid = map(int, poi_seg[7].split(',')) + if len(poi_seg[8]) > 0: + tagid = map(int, poi_seg[8].split(',')) + else: + #raw_text: uid, name, addr, xy, tag, alias + bid = poi_seg[0] + name_letter_id, name_word_id = self._get_ids(poi_seg[1]) + addr_letter_id, addr_word_id = self._get_ids(poi_seg[2]) + ghid = map(int, poi_seg[3].split(',')) + if len(poi_seg[4]) > 0: + tagid = map(int, poi_seg[4].split(',')) + + if not self.is_training and name_letter_id == [0]: + continue # empty name + if bid in bids: + continue + bids.add(bid) + ids.append([name_letter_id, name_word_id, addr_letter_id, addr_word_id, ghid, tagid]) + + return ids + + def _get_user_ids(self, cuid, user_str): + if self.user_kv: + if cuid in self.user_kv: + val = self.user_kv[cuid] + drive_conf, clk_p, res_p = val.split('\t') + else: + return [] + else: + if len(user_str) < 1: + return [] + drive_conf, clk_p, res_p = user_str.split('\1') + + ids = [] + conf1, conf2 = drive_conf.split('\2') + is_driver = 0 + if float(conf1) > 0.5 or float(conf2) > 1.5: + is_driver = 1 + + user_clk_geoid = [0] * 40 + user_tag_id = set() + if len(clk_p) > 0: + if self.user_kv: + for p in clk_p.split('\1'): + bid, time, loc, bound = p.split('\2') + if bid in self.poi_kv: + v = self.poi_kv[bid] + v = 
base_rule.sub("", v) + info = v.split('\t') #name, addr, ghid, tag, alias + ghid = map(int, info[2].split(',')) + for i in range(len(user_clk_geoid)): + user_clk_geoid[i] = user_clk_geoid[i] | ghid[i] + user_tag_id.update(self._get_tagid(info[4])) + else: + for p in clk_p.split('\2'): + bid, gh, tags = p.split('\3') + ghid = map(int, gh.split(',')) + for i in range(len(user_clk_geoid)): + user_clk_geoid[i] = user_clk_geoid[i] | ghid[i] + if len(tags) > 0: + user_tag_id.update(tags.split(',')) + if len(user_tag_id) < 1: + user_tag_id = [0] + user_tag_id = map(int, list(user_tag_id)) + ids.append(user_clk_geoid) + ids.append(user_tag_id) + + user_res_geoid = [0] * 40 + if len(res_p) > 0: + if self.user_kv: + for p in res_p.split('\1'): + bid, conf = p.split('\2') + if bid in self.poi_kv: + v = self.poi_kv[bid] + v = base_rule.sub("", v) + info = v.split('\t') #name, addr, ghid, tag, alias + ghid = map(int, info[2].split(',')) + for i in range(len(user_res_geoid)): + user_res_geoid[i] = user_res_geoid[i] | ghid[i] + else: + for p in res_p.split('\2'): + bid, gh, conf = p.split('\3') + ghid = map(int, gh.split(',')) + for i in range(len(user_res_geoid)): + user_res_geoid[i] = user_res_geoid[i] | ghid[i] + ids.append(user_res_geoid) + ids.append([is_driver]) + return ids + + def parse_batch(self, data_gen): + """ + reader_batch must be true: only for train & loss_func is log_exp, other use parse_oneline + pos : neg = 1 : N + """ + batch_data = {} + def _get_lod(k): + #sys.stderr.write("%s\t%s\t%s\n" % (k, " ".join(map(str, batch_data[k][0])), + # " ".join(map(str, batch_data[k][1])) )) + return fluid.create_lod_tensor(np.array(batch_data[k][0]).reshape([-1, 1]), + [batch_data[k][1]], self.place) + + keys = None + for line in data_gen(): + for s in self.parse_oneline(line): + for k, v in s: + if k not in batch_data: + batch_data[k] = [[], []] + + if not isinstance(v[0], list): + v = [v] #pos 1 to N + for j in v: + batch_data[k][0].extend(j) + batch_data[k][1].append(len(j)) + + if keys is None: + keys = [k for k, _ in s] + if len(batch_data[keys[0]][1]) == self._flags.batch_size: + yield [(k, _get_lod(k)) for k in keys] + batch_data = {} + + if not self._flags.drop_last_batch and len(batch_data) != 0: + yield [(k, _get_lod(k)) for k in keys] + + def parse_oneline(self, line): + """ + datareader interface + """ + self._init_dict() + + qid, user, prefix, pos_poi, neg_poi = line.strip("\r\n").split("\t") + cuid, time, loc_cityid, bound_cityid, loc_gh, bound_gh = qid.split('_') + + #step1 + user_input = [] + if self._flags.use_personal: + user_ids = self._get_user_ids(cuid, user) + if len(user_ids) < 1: + user_ids = [[0] * 40, [0], [0] * 40, [0]] + user_input = [("user_clk_geoid", user_ids[0]), \ + ("user_tag_id", user_ids[1]), \ + ("user_resident_geoid", user_ids[2]), \ + ("user_navi_drive", user_ids[3])] + + #step2 + prefix_seg = prefix.split('\2') + prefix_time_id = self._get_time_id(int(time)) + prefix_loc_geoid = [0] * 40 + if len(prefix_seg) >= 4: #query, letterid, wordid, ghid, poslen, neglen + prefix_letter_id = map(int, prefix_seg[1].split())[:self._flags.max_seq_len] + prefix_word_id = map(int, prefix_seg[2].split())[:self._flags.max_seq_len] + loc_gh, bound_gh = prefix_seg[3].split('_') + ghid = map(int, loc_gh.split(',')) + for i in range(len(prefix_loc_geoid)): + prefix_loc_geoid[i] = prefix_loc_geoid[i] | ghid[i] + ghid = map(int, bound_gh.split(',')) + for i in range(len(prefix_loc_geoid)): + prefix_loc_geoid[i] = prefix_loc_geoid[i] | ghid[i] + else: #raw text + prefix_letter_id, 
prefix_word_id = self._get_ids(prefix) + ghid = map(int, loc_gh.split(',')) + for i in range(len(prefix_loc_geoid)): + prefix_loc_geoid[i] = prefix_loc_geoid[i] | ghid[i] + ghid = map(int, bound_gh.split(',')) + for i in range(len(prefix_loc_geoid)): + prefix_loc_geoid[i] = prefix_loc_geoid[i] | ghid[i] + + prefix_input = [("prefix_letter_id", prefix_letter_id), \ + ("prefix_loc_geoid", prefix_loc_geoid)] + + if self._flags.prefix_word_id: + prefix_input.insert(1, ("prefix_word_id", prefix_word_id)) + + if self._flags.use_personal: + prefix_input.append(("prefix_time_id", [prefix_time_id])) + + #step3 + pos_ids = self._get_poi_ids(pos_poi) + pos_num = len(pos_ids) + max_num = 0 + if self.is_training: + max_num = max(20, self._flags.neg_sample_num) #last 10 is neg sample + neg_ids = self._get_poi_ids(neg_poi, max_num=max_num) + #if not train, add all pois + if not self.is_training: + pos_ids.extend(neg_ids) + if len(pos_ids) < 1: + pos_ids.append([[0], [0], [0], [0], [0] * 40, [0]]) + + #step4 + idx = 0 + for pos_id in pos_ids: + pos_input = [("pos_name_letter_id", pos_id[0]), \ + ("pos_name_word_id", pos_id[1]), \ + ("pos_addr_letter_id", pos_id[2]), \ + ("pos_addr_word_id", pos_id[3]), \ + ("pos_loc_geoid", pos_id[4])] + + if self._flags.use_personal: + pos_input.append(("pos_tag_id", pos_id[5])) + + if self.is_training: + if len(neg_ids) > self._flags.neg_sample_num: + #Noise Contrastive Estimation + #if self._flags.neg_sample_num > 3: + # nids_sample = neg_ids[:3] + nids_sample = random.sample(neg_ids, self._flags.neg_sample_num) + else: + nids_sample = neg_ids + + if self._flags.reader_batch: + if len(nids_sample) != self._flags.neg_sample_num: + continue + + neg_batch = [[], [], [], [], [], []] + for neg_id in nids_sample: + for i in range(len(neg_batch)): + neg_batch[i].append(neg_id[i]) + + neg_input = [("neg_name_letter_id", neg_batch[0]), \ + ("neg_name_word_id", neg_batch[1]), \ + ("neg_addr_letter_id", neg_batch[2]), \ + ("neg_addr_word_id", neg_batch[3]), \ + ("neg_loc_geoid", neg_batch[4])] + if self._flags.use_personal: + neg_input.append(("neg_tag_id", neg_batch[5])) + yield user_input + prefix_input + pos_input + neg_input + else: + for neg_id in nids_sample: + neg_input = [("neg_name_letter_id", neg_id[0]), \ + ("neg_name_word_id", neg_id[1]), \ + ("neg_addr_letter_id", neg_id[2]), \ + ("neg_addr_word_id", neg_id[3]), \ + ("neg_loc_geoid", neg_id[4])] + if self._flags.use_personal: + neg_input.append(("neg_tag_id", neg_id[5])) + yield user_input + prefix_input + pos_input + neg_input + else: + label = int(idx < pos_num) + yield user_input + prefix_input + pos_input + [("label", [label])] + + idx += 1 + + +if __name__ == '__main__': + from utils import flags + from utils.load_conf_file import LoadConfFile + FLAGS = flags.FLAGS + flags.DEFINE_custom("conf_file", "./conf/test/test.conf", + "conf file", action=LoadConfFile, sec_name="Train") + + sys.stderr.write('----------- Configuration Arguments -----------\n') + for arg, value in sorted(flags.get_flags_dict().items()): + sys.stderr.write('%s: %s\n' % (arg, value)) + sys.stderr.write('------------------------------------------------\n') + + dataset_instance = PoiQacPersonalized(FLAGS) + def _dump_vec(data, name): + print("%s\t%s" % (name, " ".join(map(str, np.array(data))))) + + def _data_generator(): + """ + stdin sample generator: read from stdin + """ + for line in sys.stdin: + if not line.strip(): + continue + yield line + + if FLAGS.reader_batch: + for sample in dataset_instance.parse_batch(_data_generator): + 
_dump_vec(sample[0][1], 'user_clk_geoid') + _dump_vec(sample[1][1], 'user_tag_id') + _dump_vec(sample[2][1], 'user_resident_geoid') + _dump_vec(sample[3][1], 'user_navi_drive') + _dump_vec(sample[4][1], 'prefix_letter_id') + _dump_vec(sample[5][1], 'prefix_loc_geoid') + _dump_vec(sample[6][1], 'prefix_time_id') + _dump_vec(sample[7][1], 'pos_name_letter_id') + _dump_vec(sample[10][1], 'pos_addr_word_id') + _dump_vec(sample[11][1], 'pos_loc_geoid') + _dump_vec(sample[12][1], 'pos_tag_id') + _dump_vec(sample[13][1], 'neg_name_letter_id or label') + else: + for line in sys.stdin: + for sample in dataset_instance.parse_oneline(line): + _dump_vec(sample[0][1], 'user_clk_geoid') + _dump_vec(sample[1][1], 'user_tag_id') + _dump_vec(sample[2][1], 'user_resident_geoid') + _dump_vec(sample[3][1], 'user_navi_drive') + _dump_vec(sample[4][1], 'prefix_letter_id') + _dump_vec(sample[5][1], 'prefix_loc_geoid') + _dump_vec(sample[6][1], 'prefix_time_id') + _dump_vec(sample[7][1], 'pos_name_letter_id') + _dump_vec(sample[10][1], 'pos_addr_word_id') + _dump_vec(sample[11][1], 'pos_loc_geoid') + _dump_vec(sample[12][1], 'pos_tag_id') + _dump_vec(sample[13][1], 'neg_name_letter_id or label') + diff --git a/PaddleST/Research/KDD2020-P3AC/nets/poi_qac_personalized/__init__.py b/PaddleST/Research/KDD2020-P3AC/nets/poi_qac_personalized/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/PaddleST/Research/KDD2020-P3AC/nets/poi_qac_personalized/qac_personalized.py b/PaddleST/Research/KDD2020-P3AC/nets/poi_qac_personalized/qac_personalized.py new file mode 100644 index 0000000000000000000000000000000000000000..d53b9176ac60125f07e97313566f718d9fb1c588 --- /dev/null +++ b/PaddleST/Research/KDD2020-P3AC/nets/poi_qac_personalized/qac_personalized.py @@ -0,0 +1,659 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +################################################################################ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
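`_pad_batch_data` in the dataset above right-pads each variable-length id list to the batch maximum and reshapes the result into an int64 column vector before feeding it. A NumPy sketch of that helper:

```python
# NumPy sketch of the right-padding done by _pad_batch_data in the dataset above.
import numpy as np

def pad_batch(insts, pad_idx=0):
    max_len = max(len(inst) for inst in insts)
    padded = np.array([inst + [pad_idx] * (max_len - len(inst)) for inst in insts])
    return padded.astype("int64").reshape([-1, 1]), max_len

ids, max_len = pad_batch([[3, 7, 2], [5], [9, 1]])
print(max_len)        # 3
print(ids.ravel())    # [3 7 2 5 0 0 9 1 0]
```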
+################################################################################ + + +""" + Specify the brief poi_qac_personalized.py +""" + +import math +import numpy as np +import logging +import collections +import paddle.fluid as fluid + +from nets.base_net import BaseNet + + +def ffn(input, d_hid, d_size, name="ffn"): + """ + Position-wise Feed-Forward Network + """ + hidden = fluid.layers.fc(input=input, + size=d_hid, + num_flatten_dims=1, + param_attr=fluid.ParamAttr(name=name + '_innerfc_weight'), + bias_attr=fluid.ParamAttr( + name=name + '_innerfc_bias', + initializer=fluid.initializer.Constant(0.)), + act="leaky_relu") + + out = fluid.layers.fc(input=hidden, + size=d_size, + num_flatten_dims=1, + param_attr=fluid.ParamAttr(name=name + '_outerfc_weight'), + bias_attr=fluid.ParamAttr( + name=name + '_outerfc_bias', + initializer=fluid.initializer.Constant(0.))) + return out + + +def dot_product_attention(query, key, value, d_key, q_mask=None, k_mask=None, + dropout_rate=None): + """ + Args: + query: a tensor with shape [batch, Q_time, Q_dimension] + key: a tensor with shape [batch, time, K_dimension] + value: a tensor with shape [batch, time, V_dimension] + + q_lengths: a tensor with shape [batch] + k_lengths: a tensor with shape [batch] + + Returns: + a tensor with shape [batch, query_time, value_dimension] + + Raises: + AssertionError: if Q_dimension not equal to K_dimension when attention + type is dot. + """ + + logits = fluid.layers.matmul(x=query, y=key, transpose_y=True, alpha=d_key**(-0.5)) + + if (q_mask is not None) and (k_mask is not None): + mask = fluid.layers.matmul(x=q_mask, y=k_mask, transpose_y=True) + another_mask = fluid.layers.scale( + mask, + scale=float(2**32 - 1), + bias=float(-1), + bias_after_scale=False) + + logits = mask * logits + another_mask + + attention = fluid.layers.softmax(logits) + if dropout_rate: + attention = fluid.layers.dropout( + input=attention, dropout_prob=dropout_rate, is_test=False) + + atten_out = fluid.layers.matmul(x=attention, y=value) + + return atten_out + + +def safe_cosine_sim(x, y): + """ + fluid.layers.cos_sim maybe nan + avoid nan + """ + l2x = fluid.layers.l2_normalize(x, axis=-1) + l2y = fluid.layers.l2_normalize(y, axis=-1) + cos = fluid.layers.reduce_sum(l2x * l2y, dim=1, keep_dim=True) + return cos + + +def loss_neg_log_of_pos(pos_score, neg_score_n, gama=5.0): + ''' + pos_score: batch_size x 1 + neg_score_n: batch_size x n + ''' + # n x batch_size + neg_score_n = fluid.layers.transpose(neg_score_n, [1, 0]) + # 1 x batch_size + pos_score = fluid.layers.reshape(pos_score, [1, -1]) + + exp_pos_score = fluid.layers.exp(pos_score * gama) + exp_neg_score_n = fluid.layers.exp(neg_score_n * gama) + + ## (n+1) x batch_size + pos_neg_score = fluid.layers.concat([exp_pos_score, exp_neg_score_n], axis=0) + ## 1 x batch_size + exp_sum = fluid.layers.reduce_sum(pos_neg_score, dim=0, keep_dim=True) + ## 1 x batch_size + loss = -1.0 * fluid.layers.log(exp_pos_score / exp_sum) + # batch_size + loss = fluid.layers.reshape(loss, [-1, 1]) + #return [loss, exp_pos_score, exp_neg_score_n, pos_neg_score, exp_sum] + return loss + + +def loss_pairwise_hinge(pos, neg, margin=0.8): + """ + pairwise + """ + loss_part1 = fluid.layers.elementwise_sub( + fluid.layers.fill_constant_batch_size_like( + input=pos, shape=[-1, 1], value=margin, dtype='float32'), pos) + loss_part2 = fluid.layers.elementwise_add(loss_part1, neg) + loss_part3 = fluid.layers.elementwise_max( + fluid.layers.fill_constant_batch_size_like( + input=loss_part2, shape=[-1, 1], 
value=0.0, dtype='float32'), loss_part2) + return loss_part3 + + +class PoiQacPersonalized(BaseNet): + """ + This module provide nets for poi classification + """ + def __init__(self, FLAGS): + super(PoiQacPersonalized, self).__init__(FLAGS) + self.hid_dim = 128 + + def net(self, inputs): + """ + PoiQacPersonalized interface + """ + # debug output info during training + debug_output = collections.OrderedDict() + model_output = {} + net_output = {"debug_output": debug_output, + "model_output": model_output} + + user_input_keys = ['user_clk_geoid', 'user_tag_id', 'user_resident_geoid', 'user_navi_drive'] + + pred_input_keys = ['prefix_letter_id', 'prefix_loc_geoid', 'pos_name_letter_id', + 'pos_name_word_id', 'pos_addr_letter_id', 'pos_addr_word_id', 'pos_loc_geoid'] + query_key_num = 2 + if self._flags.use_personal: + pred_input_keys.insert(2, 'prefix_time_id') + pred_input_keys.append('pos_tag_id') + query_key_num += 2 + if self._flags.prefix_word_id: + pred_input_keys.insert(1, 'prefix_word_id') + query_key_num += 1 + pred_input_keys = user_input_keys + pred_input_keys + query_key_num += len(user_input_keys) + elif self._flags.prefix_word_id: + pred_input_keys.insert(1, 'prefix_word_id') + query_key_num += 1 + + #for p in pred_input_keys: + # debug_output[p] = inputs[p] + + prefix_vec, prefix_pool = self._get_query_vec(inputs) + pos_vec, pos_pool = self._get_poi_vec(inputs, 'pos') + + pos_score = safe_cosine_sim(pos_vec, prefix_vec) + #fluid.layers.Print(pos_score, summarize=10000) + if self.is_training: + neg_vec, neg_pool = self._get_poi_vec(inputs, 'neg') + if self._flags.loss_func == 'log_exp': + neg_vec = fluid.layers.reshape(neg_vec, [-1, self._flags.fc_dim]) + prefix_expand = fluid.layers.reshape(fluid.layers.expand(prefix_vec, [1, + self._flags.neg_sample_num]), [-1, self._flags.fc_dim]) + neg_score = safe_cosine_sim(neg_vec, prefix_expand) + cost = loss_neg_log_of_pos(pos_score, fluid.layers.reshape(neg_score, + [-1, self._flags.neg_sample_num]), 15) + else: + neg_score = safe_cosine_sim(neg_vec, prefix_vec) + cost = loss_pairwise_hinge(pos_score, neg_score, self._flags.margin) + #debug_output["pos_score"] = pos_score + #debug_output["neg_score"] = neg_score + #debug_output['prefix_pool'] = prefix_pool + #debug_output['pos_pool'] = pos_pool + #debug_output['neg_pool'] = neg_pool + + loss = fluid.layers.mean(x=cost) + if self._flags.init_learning_rate > 0: + # define the optimizer + #d_model = 1 / (warmup_steps * (learning_rate ** 2)) + with fluid.default_main_program()._lr_schedule_guard(): + learning_rate = fluid.layers.learning_rate_scheduler.noam_decay( + self._flags.emb_dim, self._flags.learning_rate_warmup_steps + ) * self._flags.init_learning_rate + optimizer = fluid.optimizer.AdamOptimizer( + learning_rate=learning_rate, beta1=self._flags.adam_beta1, + beta2=self._flags.adam_beta2, epsilon=self._flags.opt_epsilon) + logging.info("use noam_decay learning_rate_scheduler for optimizer.") + net_output["optimizer"] = optimizer + + net_output["loss"] = loss + model_output['fetch_targets'] = [inputs["prefix_letter_id"], pos_score] + else: + if self._flags.dump_vec == "query": + model_output['fetch_targets'] = [prefix_vec] + pred_input_keys = pred_input_keys[:query_key_num] + elif self._flags.dump_vec == "poi": + model_output['fetch_targets'] = [prefix_vec, pos_score, pos_vec] + else: + model_output['fetch_targets'] = [inputs["prefix_letter_id"], pos_score, inputs["label"]] + + model_output['feeded_var_names'] = pred_input_keys + + return net_output + + def _get_query_vec(self, 
inputs): + """ + get query & user vec + """ + if self._flags.use_personal: + #user_tag_id + #embedding layer + tag_emb = fluid.layers.embedding(input=inputs['user_tag_id'], is_sparse=True, + size=[self._flags.tag_size, self._flags.emb_dim], + param_attr=fluid.ParamAttr(name="tagid_embedding", learning_rate=self._flags.emb_lr), + padding_idx=0) + tag_emb = fluid.layers.sequence_pool(tag_emb, pool_type="sum") + + user_clk_geoid = fluid.layers.reshape(fluid.layers.cast(inputs['user_clk_geoid'], + dtype="float32"), [-1, 40]) + user_resident_geoid = fluid.layers.reshape(fluid.layers.cast(inputs['user_resident_geoid'], + dtype="float32"), [-1, 40]) + user_profile = fluid.layers.cast(inputs['user_navi_drive'], dtype="float32") + user_pool = fluid.layers.concat([tag_emb, user_clk_geoid, user_resident_geoid, user_profile], + axis=1) + #fc layer + user_vec = fluid.layers.fc(input=user_pool, size=self._flags.emb_dim, act="leaky_relu", + param_attr=fluid.ParamAttr(name='user_fc_weight'), + bias_attr=fluid.ParamAttr(name='user_fc_bias')) + #fluid.layers.Print(user_vec) + + loc_vec = fluid.layers.reshape(fluid.layers.cast(x=inputs['prefix_loc_geoid'], + dtype="float32"), [-1, 40]) + + if self._flags.model_type == "bilstm_net": + network = self.bilstm_net + elif self._flags.model_type == "bow_net": + network = self.bow_net + elif self._flags.model_type == "cnn_net": + network = self.cnn_net + elif self._flags.model_type == "lstm_net": + network = self.lstm_net + elif self._flags.model_type == "gru_net": + network = self.gru_net + else: + raise ValueError("Unknown network type!") + + prefix_letter_pool = network(inputs["prefix_letter_id"], + "wordid_embedding", + self._flags.vocab_size, + self._flags.emb_dim, + hid_dim=self.hid_dim, + fc_dim=0, + emb_lr=self._flags.emb_lr) + if self._flags.use_attention: + #max-pooling + prefix_letter_pool = fluid.layers.sequence_pool(prefix_letter_pool, pool_type="max") + + prefix_vec = prefix_letter_pool + if self._flags.prefix_word_id: + prefix_word_pool = network(inputs["prefix_word_id"], + "wordid_embedding", + self._flags.vocab_size, + self._flags.emb_dim, + hid_dim=self.hid_dim, + fc_dim=0, + emb_lr=self._flags.emb_lr) + if self._flags.use_attention: + #max-pooling + prefix_word_pool = fluid.layers.sequence_pool(prefix_word_pool, pool_type="max") + prefix_pool = fluid.layers.concat([prefix_letter_pool, prefix_word_pool], axis=1) + prefix_vec = fluid.layers.fc(input=prefix_pool, size=self.hid_dim, act="leaky_relu", + param_attr=fluid.ParamAttr(name='prefix_fc_weight'), + bias_attr=fluid.ParamAttr(name='prefix_fc_bias')) + #vector layer + #fluid.layers.Print(inputs["prefix_letter_id"]) + #fluid.layers.Print(inputs["prefix_word_id"]) + #fluid.layers.Print(prefix_vec) + + if self._flags.use_personal: + #prefix_time_id + time_emb = fluid.layers.embedding(input=inputs['prefix_time_id'], is_sparse=True, + size=[self._flags.time_size, self._flags.emb_dim], + param_attr=fluid.ParamAttr(name="timeid_embedding", learning_rate=self._flags.emb_lr)) + time_emb = fluid.layers.sequence_pool(time_emb, pool_type="sum") + context_pool = fluid.layers.concat([prefix_vec, loc_vec, time_emb, user_vec], axis=1) + else: + context_pool = fluid.layers.concat([prefix_vec, loc_vec], axis=1) + context_vec = fluid.layers.fc(input=context_pool, size=self._flags.fc_dim, act=self._flags.activate, + param_attr=fluid.ParamAttr(name='context_fc_weight'), + bias_attr=fluid.ParamAttr(name='context_fc_bias')) + return context_vec, context_pool + + def _get_poi_vec(self, inputs, tag): + """ + get poi vec 
+ context layer: same with query + feature extract layer: same with query, same kernal params + vector layer: fc + """ + name_letter_pool = self.cnn_net(inputs[tag + "_name_letter_id"], + "wordid_embedding", + self._flags.vocab_size, + self._flags.emb_dim, + hid_dim=self.hid_dim, + fc_dim=0, + emb_lr=self._flags.emb_lr) + + name_word_pool = self.cnn_net(inputs[tag + "_name_word_id"], + "wordid_embedding", + self._flags.vocab_size, + self._flags.emb_dim, + hid_dim=self.hid_dim, + fc_dim=0, + emb_lr=self._flags.emb_lr) + + addr_letter_pool = self.cnn_net(inputs[tag + "_addr_letter_id"], + "wordid_embedding", + self._flags.vocab_size, + self._flags.emb_dim, + hid_dim=self.hid_dim, + fc_dim=0, + emb_lr=self._flags.emb_lr) + + addr_word_pool = self.cnn_net(inputs[tag + "_addr_word_id"], + "wordid_embedding", + self._flags.vocab_size, + self._flags.emb_dim, + hid_dim=self.hid_dim, + fc_dim=0, + emb_lr=self._flags.emb_lr) + + #fc layer + loc_vec = fluid.layers.reshape(fluid.layers.cast(x=inputs[tag + '_loc_geoid'], + dtype="float32"), [-1, 40]) + + if self._flags.use_attention: + addr2name_letter_att = dot_product_attention(name_letter_pool, addr_letter_pool, + addr_letter_pool, self.hid_dim) + name2addr_letter_att = dot_product_attention(addr_letter_pool, name_letter_pool, + name_letter_pool, self.hid_dim) + letter_vec = fluid.layers.sequence_concat([addr2name_letter_att, name2addr_letter_att]) + letter_att = ffn(letter_vec, self.hid_dim, self.hid_dim, "inter_ffn") + #max-pooling + name_vec = fluid.layers.sequence_pool(letter_att, pool_type="max") + + addr2name_word_att = dot_product_attention(name_word_pool, addr_word_pool, + addr_word_pool, self.hid_dim) + name2addr_word_att = dot_product_attention(addr_word_pool, name_word_pool, + name_word_pool, self.hid_dim) + word_vec = fluid.layers.sequence_concat([addr2name_word_att, name2addr_word_att]) + word_att = ffn(word_vec, self.hid_dim, self.hid_dim, "inter_ffn") + #max-pooling + addr_vec = fluid.layers.sequence_pool(word_att, pool_type="max") + else: + name_pool = fluid.layers.concat([name_letter_pool, name_word_pool], axis=1) + name_vec = fluid.layers.fc(input=name_pool, size=self.hid_dim, act="leaky_relu", + param_attr=fluid.ParamAttr(name='name_fc_weight'), + bias_attr=fluid.ParamAttr(name='name_fc_bias')) + addr_pool = fluid.layers.concat([addr_letter_pool, addr_word_pool], axis=1) + addr_vec = fluid.layers.fc(input=addr_pool, size=self.hid_dim, act="leaky_relu", + param_attr=fluid.ParamAttr(name='addr_fc_weight'), + bias_attr=fluid.ParamAttr(name='addr_fc_bias')) + + if self._flags.use_personal: + tag_emb = fluid.layers.embedding(input=inputs[tag + '_tag_id'], is_sparse=True, + size=[self._flags.tag_size, self._flags.emb_dim], + param_attr=fluid.ParamAttr(name="tagid_embedding", learning_rate=self._flags.emb_lr), + padding_idx=0) + tag_emb = fluid.layers.sequence_pool(tag_emb, pool_type="sum") + poi_pool = fluid.layers.concat([name_vec, addr_vec, loc_vec, tag_emb], axis=1) + else: + poi_pool = fluid.layers.concat([name_vec, addr_vec, loc_vec], axis=1) + #vector layer + #fluid.layers.Print(inputs[tag + "_name_letter_id"]) + #fluid.layers.Print(inputs[tag + "_name_word_id"]) + #fluid.layers.Print(poi_pool) + poi_vec = fluid.layers.fc(input=poi_pool, size=self._flags.fc_dim, act=self._flags.activate, + param_attr=fluid.ParamAttr(name='poi_fc_weight'), + bias_attr=fluid.ParamAttr(name='poi_fc_bias')) + + return poi_vec, poi_pool + + def train_format(self, result, global_step, epoch_id, batch_id): + """ + result: one batch train narray + """ + 
if global_step == 0 or global_step % self._flags.log_every_n_steps != 0: + return + + #result[0] default is loss. + avg_res = np.mean(np.array(result[0])) + vec = [] + for i in range(1, len(result)): + res = np.array(result[i]) + vec.append("%s#%s" % (res.shape, ' '.join(str(j) for j in res.flatten()))) + logging.info("epoch[%s], global_step[%s], batch_id[%s], extra_info: " + "loss[%s], debug[%s]" % (epoch_id, global_step, batch_id, + avg_res, ";".join(vec))) + + def init_params(self, place): + """ + init embed + """ + def _load_parameter(pretraining_file, vocab_size, word_emb_dim): + pretrain_word2vec = np.zeros([vocab_size, word_emb_dim], dtype=np.float32) + for line in open(pretraining_file, 'r'): + id, _, vec = line.strip('\r\n').split('\t') + pretrain_word2vec[int(id)] = map(float, vec.split()) + + return pretrain_word2vec + + embedding_param = fluid.global_scope().find_var("wordid_embedding").get_tensor() + pretrain_word2vec = _load_parameter(self._flags.init_train_params, + self._flags.vocab_size, self._flags.emb_dim) + embedding_param.set(pretrain_word2vec, place) + logging.info("init pretrain word2vec:%s" % self._flags.init_train_params) + + def pred_format(self, result, **kwargs): + """ + format pred output + """ + if result is None: + return + + if result == '_PRE_': + if self._flags.dump_vec not in ('query', 'poi'): + self.idx2word = {} + with open(self._flags.qac_dict_path, 'r') as f: + for line in f: + term, tag, cnt, is_stop, term_id = line.strip('\r\n').split('\t') + self.idx2word[int(term_id)] = term + return + + if result == '_POST_': + if self._flags.init_pretrain_model is not None: + path = "%s/infer_model" % (self._flags.export_dir) + frame_env = kwargs['frame_env'] + fluid.io.save_inference_model(path, + frame_env.paddle_env['feeded_var_names'], + frame_env.paddle_env['fetch_targets'], + frame_env.paddle_env['exe'], frame_env.paddle_env['program']) + + return + + if self._flags.dump_vec == "query": + prefix_vec = np.array(result[0]) + for q in prefix_vec: + print("qid\t%s" % (" ".join(map(str, q)))) + elif self._flags.dump_vec == "poi": + poi_score = np.array(result[1]) + poi_vec = np.array(result[2]) + for i in range(len(poi_score)): + print("bid\t%s\t%s" % (poi_score[i][0], " ".join(map(str, poi_vec[i])))) + else: + prefix_id = result[0] + pred_score = np.array(result[1]) + label = np.array(result[2]) + for i in range(len(pred_score)): + start = prefix_id.lod()[0][i] + end = prefix_id.lod()[0][i + 1] + words = [] + for idx in np.array(prefix_id)[start:end]: + words.append(self.idx2word.get(idx[0], "UNK")) + print("qid_%s\t%s\t%s" % ("".join(words), label[i][0], pred_score[i][0])) + + def bow_net(self, + data, + layer_name, + dict_dim, + emb_dim=128, + hid_dim=128, + fc_dim=128, emb_lr=0.1): + """ + bow net + """ + # embedding layer + emb = fluid.layers.embedding(input=data, is_sparse=True, size=[dict_dim, emb_dim], + param_attr=fluid.ParamAttr(name=layer_name, learning_rate=emb_lr), padding_idx=0) + + # bow layer + bow = fluid.layers.sequence_pool(input=emb, pool_type='sum') + #bow = fluid.layers.tanh(bow) + #bow = fluid.layers.softsign(bow) + + # full connect layer + if fc_dim > 0: + bow = fluid.layers.fc(input=bow, size=fc_dim, act=self._flags.activate) + return bow + + def cnn_net(self, + data, + layer_name, + dict_dim, + emb_dim=128, + hid_dim=128, + fc_dim=96, + win_size=3, emb_lr=0.1): + """ + conv net + """ + # embedding layer + emb = fluid.layers.embedding(input=data, is_sparse=True, size=[dict_dim, emb_dim], + param_attr=fluid.ParamAttr(name=layer_name, 
learning_rate=emb_lr), padding_idx=0) + + param_attr = fluid.ParamAttr( + name="conv_weight", + initializer=fluid.initializer.TruncatedNormalInitializer(loc=0.0, scale=0.1)) + bias_attr = fluid.ParamAttr( + name="conv_bias", + initializer=fluid.initializer.Constant(0.0)) + + if self._flags.use_attention: + # convolution layer + conv = fluid.layers.sequence_conv( + input=emb, + num_filters=hid_dim, + filter_size=win_size, + param_attr=param_attr, + bias_attr=bias_attr, + act="leaky_relu") #tanh + att = dot_product_attention(conv, conv, conv, hid_dim) + conv = ffn(att, hid_dim, hid_dim, "intra_ffn") + else: + # convolution layer + conv = fluid.nets.sequence_conv_pool( + input=emb, + num_filters=hid_dim, + filter_size=win_size, + param_attr=param_attr, + bias_attr=bias_attr, + act="leaky_relu", #tanh + pool_type="max") + # full connect layer + if fc_dim > 0: + conv = fluid.layers.fc(input=conv, size=fc_dim, act=self._flags.activate) + return conv + + def lstm_net(self, + data, + layer_name, + dict_dim, + emb_dim=128, + hid_dim=128, + fc_dim=96, + emb_lr=0.1): + """ + lstm net + """ + # embedding layer + emb = fluid.layers.embedding(input=data, is_sparse=True, size=[dict_dim, emb_dim], + param_attr=fluid.ParamAttr(name=layer_name, learning_rate=emb_lr), padding_idx=0) + + # Lstm layer + fc0 = fluid.layers.fc(input=emb, size=hid_dim * 4, + param_attr=fluid.ParamAttr(name='lstm_fc_weight'), + bias_attr=fluid.ParamAttr(name='lstm_fc_bias')) + lstm_h, c = fluid.layers.dynamic_lstm(input=fc0, size=hid_dim * 4, is_reverse=False, + param_attr=fluid.ParamAttr(name='lstm_weight'), + bias_attr=fluid.ParamAttr(name='lstm_bias')) + # max pooling layer + lstm = fluid.layers.sequence_pool(input=lstm_h, pool_type='max') + lstm = fluid.layers.tanh(lstm) + + # full connect layer + if fc_dim > 0: + lstm = fluid.layers.fc(input=lstm, size=fc_dim, act=self._flags.activate) + return lstm + + def bilstm_net(self, + data, + layer_name, + dict_dim, + emb_dim=128, + hid_dim=128, + fc_dim=96, + emb_lr=0.1): + """ + bi-Lstm net + """ + # embedding layer + emb = fluid.layers.embedding(input=data, is_sparse=True, size=[dict_dim, emb_dim], + param_attr=fluid.ParamAttr(name=layer_name, learning_rate=emb_lr), padding_idx=0) + + #LSTM layer + fc0 = fluid.layers.fc(input=emb, size=hid_dim * 4) + rfc0 = fluid.layers.fc(input=emb, size=hid_dim * 4) + lstm_h, c = fluid.layers.dynamic_lstm(input=fc0, size=hid_dim * 4, is_reverse=False) + rlstm_h, c = fluid.layers.dynamic_lstm(input=rfc0, size=hid_dim * 4, is_reverse=True) + # extract last layer + lstm_last = fluid.layers.sequence_last_step(input=lstm_h) + rlstm_last = fluid.layers.sequence_last_step(input=rlstm_h) + #lstm_last = fluid.layers.tanh(lstm_last) + #rlstm_last = fluid.layers.tanh(rlstm_last) + # concat layer + bi_lstm = fluid.layers.concat(input=[lstm_last, rlstm_last], axis=1) + + # full connect layer + if fc_dim > 0: + bi_lstm = fluid.layers.fc(input=bi_lstm, size=fc_dim, act=self._flags.activate) + return bi_lstm + + def gru_net(self, + data, + layer_name, + dict_dim, + emb_dim=128, + hid_dim=128, + fc_dim=96, + emb_lr=0.1): + """ + gru net + """ + emb = fluid.layers.embedding(input=data, is_sparse=True, size=[dict_dim, emb_dim], + param_attr=fluid.ParamAttr(name=layer_name, learning_rate=emb_lr), padding_idx=0) + + #gru layer + fc0 = fluid.layers.fc(input=emb, size=hid_dim * 3) + gru = fluid.layers.dynamic_gru(input=fc0, size=hid_dim, is_reverse=False) + gru = fluid.layers.sequence_pool(input=gru, pool_type='max') + #gru = fluid.layers.tanh(gru) + + if fc_dim > 0: + 
gru = fluid.layers.fc(input=gru, size=fc_dim, act=self._flags.activate) + return gru + + diff --git a/PaddleST/Research/KDD2020-P3AC/test/__init__.py b/PaddleST/Research/KDD2020-P3AC/test/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/PaddleSlim/README.md b/PaddleSlim/README.md index eee98bf0a3fd5c550957b41cb35a97a8d2108b16..1b3cc97115d3e2da369f93baadc2ced0143db192 100644 --- a/PaddleSlim/README.md +++ b/PaddleSlim/README.md @@ -1,3 +1,7 @@ +# PaddleSlim新版本已经发布,项目已被迁移到: https://github.com/PaddlePaddle/PaddleSlim + + +

@@ -209,5 +213,5 @@ Paddle-Slim工具库有以下特点: 模型压缩框架支持以下格式模型导出: -- **Paddle Fluid模型格式:** Paddle Fluid模型格式,可通过Paddle框架加载使用。 -- **Paddle Mobile模型格式:** 仅在量化训练策略时使用,兼容[Paddle Mobile](https://github.com/PaddlePaddle/paddle-mobile)的模型格式。 +- **Paddle Fluid模型格式:** Paddle Fluid模型格式,可通过[Paddle](https://github.com/PaddlePaddle/Paddle),[Paddle-Lite](https://github.com/PaddlePaddle/Paddle-Lite)框架加载使用。 +- **Paddle Mobile模型格式:** 仅在量化训练策略时使用,兼容[Paddle Mobile](https://github.com/PaddlePaddle/paddle-mobile)的模型格式(现Paddle Mobile已升级为[Paddle-Lite](https://github.com/PaddlePaddle/Paddle-Lite))。 diff --git a/PaddleSlim/classification/distillation/README.md b/PaddleSlim/classification/distillation/README.md index 13b83a3b07ba626253d0587e5f4e03e0b3b04677..10af9b9cacc64a77ead34a92f68ff9bdfd7ab436 100755 --- a/PaddleSlim/classification/distillation/README.md +++ b/PaddleSlim/classification/distillation/README.md @@ -137,7 +137,7 @@ strategies: | FLOPS | top1_acc/top5_acc | | -------- | ----------------- | | baseline | 70.99%/89.68% | -| 蒸馏后 | - | +| 蒸馏后 | 72.30%/90.98% | #### 训练超参 @@ -153,7 +153,7 @@ strategies: | FLOPS | top1_acc/top5_acc | | -------- | ----------------- | | baseline | 72.15%/90.65% | -| 蒸馏后 | 70.66%/90.42% | +| 蒸馏后 | 70.95%/90.40% | #### 训练超参 @@ -169,7 +169,7 @@ strategies: | FLOPS | top1_acc/top5_acc | | -------- | ----------------- | | baseline | 74.57%/92.14% | -| 蒸馏后 | - | +| 蒸馏后 | 74.48%/91.95% | #### 训练超参 diff --git a/PaddleSlim/classification/distillation/compress.py b/PaddleSlim/classification/distillation/compress.py index 49a800f8cba00abf3faad5976c8a406ed6b1fd20..bf182fa22efad4b95861bed80484ca446050ebb5 100644 --- a/PaddleSlim/classification/distillation/compress.py +++ b/PaddleSlim/classification/distillation/compress.py @@ -34,12 +34,20 @@ add_arg('pretrained_model', str, None, "Whether to use pretraine add_arg('teacher_model', str, None, "Set the teacher network to use.") add_arg('teacher_pretrained_model', str, None, "Whether to use pretrained model.") add_arg('compress_config', str, None, "The config file for compression with yaml format.") +add_arg('enable_ce', bool, False, "If set, run the task with continuous evaluation logs.") + # yapf: enable model_list = [m for m in dir(models) if "__" not in m] def compress(args): + # add ce + if args.enable_ce: + SEED = 1 + fluid.default_main_program().random_seed = SEED + fluid.default_startup_program().random_seed = SEED + image_shape = [int(m) for m in args.image_shape.split(",")] assert args.model in model_list, "{} is not in lists: {}".format(args.model, diff --git a/PaddleSlim/classification/distillation/configs/mobilenetv1_resnet50_distillation.yaml b/PaddleSlim/classification/distillation/configs/mobilenetv1_resnet50_distillation.yaml index 03c0d6ca124f064721784b84254459d46d387746..750cbae23928352fe0903f84327be0efab9b55ef 100644 --- a/PaddleSlim/classification/distillation/configs/mobilenetv1_resnet50_distillation.yaml +++ b/PaddleSlim/classification/distillation/configs/mobilenetv1_resnet50_distillation.yaml @@ -1,10 +1,5 @@ version: 1.0 distillers: - fsp_distiller: - class: 'FSPDistiller' - teacher_pairs: [['res50_res2a_branch2a.conv2d.output.1.tmp_0', 'res50_res3a_branch2a.conv2d.output.1.tmp_0']] - student_pairs: [['depthwise_conv2d_1.tmp_0', 'depthwise_conv2d_2.tmp_0']] - distillation_loss_weight: 1 l2_distiller: class: 'L2Distiller' teacher_feature_map: 'res50_fc_0.tmp_0' @@ -13,7 +8,7 @@ distillers: strategies: distillation_strategy: class: 'DistillationStrategy' - distillers: ['fsp_distiller', 'l2_distiller'] + distillers: 
['l2_distiller'] start_epoch: 0 end_epoch: 130 compressor: diff --git a/PaddleSlim/classification/distillation/configs/resnet34_resnet50_distillation.yaml b/PaddleSlim/classification/distillation/configs/resnet34_resnet50_distillation.yaml index 4d4546e08d1f4ba0fc5fe21745faf22bf6698c7b..e19dc0e9faaaa3b5c9277ae58cb0aa25bdb05ab3 100644 --- a/PaddleSlim/classification/distillation/configs/resnet34_resnet50_distillation.yaml +++ b/PaddleSlim/classification/distillation/configs/resnet34_resnet50_distillation.yaml @@ -1,10 +1,5 @@ version: 1.0 distillers: - fsp_distiller: - class: 'FSPDistiller' - teacher_pairs: [['res50_res2a_branch2a.conv2d.output.1.tmp_0', 'res50_res2b_branch2a.conv2d.output.1.tmp_0'], ['res50_res4b_branch2a.conv2d.output.1.tmp_0', 'res50_res4c_branch2a.conv2d.output.1.tmp_0']] - student_pairs: [['res34_res2a_branch2a.conv2d.output.1.tmp_0', 'res34_res2a_branch1.conv2d.output.1.tmp_0'], ['res34_res4b_branch2a.conv2d.output.1.tmp_0', 'res34_res4c_branch2a.conv2d.output.1.tmp_0']] - distillation_loss_weight: 1 l2_distiller: class: 'L2Distiller' teacher_feature_map: 'res50_fc_0.tmp_0' @@ -13,7 +8,7 @@ distillers: strategies: distillation_strategy: class: 'DistillationStrategy' - distillers: ['fsp_distiller', 'l2_distiller'] + distillers: ['l2_distiller'] start_epoch: 0 end_epoch: 130 compressor: diff --git a/PaddleSlim/classification/eval.py b/PaddleSlim/classification/eval.py index f091a9e970f1e51443f554d8c1ab45c53bebe6e8..6cea460fb99156da42cf9f4718af2228b3e39a4e 100644 --- a/PaddleSlim/classification/eval.py +++ b/PaddleSlim/classification/eval.py @@ -33,20 +33,23 @@ add_arg('model_name', str, "__model__", "model filename for inference model") add_arg('params_name', str, "__params__", "params filename for inference model") # yapf: enable + def eval(args): # parameters from arguments place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace() exe = fluid.Executor(place) - val_program, feed_target_names, fetch_targets = fluid.io.load_inference_model(args.model_path, - exe, - model_filename=args.model_name, - params_filename=args.params_name) + val_program, feed_target_names, fetch_targets = fluid.io.load_inference_model( + args.model_path, + exe, + model_filename=args.model_name, + params_filename=args.params_name) val_reader = paddle.batch(reader.val(), batch_size=128) - feeder = fluid.DataFeeder(place=place, feed_list=feed_target_names, program=val_program) + feeder = fluid.DataFeeder( + place=place, feed_list=feed_target_names, program=val_program) - results=[] + results = [] for batch_id, data in enumerate(val_reader()): # top1_acc, top5_acc @@ -56,8 +59,8 @@ def eval(args): label = [[d[1]] for d in data] feed_data = feeder.feed(image) pred = exe.run(val_program, - feed=feed_data, - fetch_list=fetch_targets) + feed=feed_data, + fetch_list=fetch_targets) pred = np.array(pred[0]) label = np.array(label) sort_array = pred.argsort(axis=1) @@ -68,23 +71,25 @@ def eval(args): for i in range(len(label)): if label[i][0] in top_5_pred[i]: acc_num += 1 - top_5 = acc_num / len(label) + top_5 = float(acc_num) / len(label) results.append([top_1, top_5]) else: # eval "eval model", which inputs are image and label, output is top1 and top5 accuracy result = exe.run(val_program, - feed=feeder.feed(data), - fetch_list=fetch_targets) + feed=feeder.feed(data), + fetch_list=fetch_targets) result = [np.mean(r) for r in result] results.append(result) result = np.mean(np.array(results), axis=0) print("top1_acc/top5_acc= {}".format(result)) sys.stdout.flush() + def main(): args = 
parser.parse_args() print_arguments(args) eval(args) + if __name__ == '__main__': main() diff --git a/PaddleSlim/classification/pruning/README.md b/PaddleSlim/classification/pruning/README.md index 6f8243b513e7123bdd5989dc42baa57e4f7c6c1e..a3141ba525e35ec6ba327785b6ecf61a50b3a5d7 100644 --- a/PaddleSlim/classification/pruning/README.md +++ b/PaddleSlim/classification/pruning/README.md @@ -121,6 +121,8 @@ fc10_weights (1280L, 1000L) ## 示例结果 +注:以下表格中的`model_size`为预测章节介绍的`__params__`文件的大小。 + ### MobileNetV1 | FLOPS |top1_acc/top5_acc| model_size |Paddle Fluid inference time(ms)| Paddle Lite inference time(ms)| @@ -130,13 +132,14 @@ fc10_weights (1280L, 1000L) |-30%|- |- |- |-| |-50%|- |- |- |-| ->训练超参 -batch size: 256 -lr_strategy: piecewise_decay -step_epochs: 30, 60, 90 -num_epochs: 120 -l2_decay: 3e-5 -lr: 0.1 +#### 训练超参 + +- batch size: 256 +- lr_strategy: piecewise_decay +- step_epochs: 30, 60, 90 +- num_epochs: 120 +- l2_decay: 3e-5 +- lr: 0.1 ### MobileNetV2 @@ -147,12 +150,13 @@ lr: 0.1 |-30%|- |- |- |-| |-50%|- |- |- |-| ->训练超参: -batch size: 500 -lr_strategy: cosine_decay -num_epochs: 240 -l2_decay: 4e-5 -lr: 0.1 +#### 训练超参 + +- batch size: 500 +- lr_strategy: cosine_decay +- num_epochs: 240 +- l2_decay: 4e-5 +- lr: 0.1 ### ResNet50 @@ -164,11 +168,17 @@ lr: 0.1 |-30%|- |- |- |-| |-50%|- |- |- |-| ->训练超参 -batch size: 256 -lr_strategy: cosine_decay -num_epochs: 120 -l2_decay: 1e-4 -lr: 0.1 +#### 训练超参 + +- batch size: 256 +- lr_strategy: cosine_decay +- num_epochs: 120 +- l2_decay: 1e-4 +- lr: 0.1 ## FAQ + +### 1. 如何压缩Paddle分类库中的其它模型或自定义的分类模型? + +建议您参考`models/PaddleSlim/classification/models`路径下的模型定义文件添加新的分类模型,您可以从[Paddle图像分类库](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification)拷贝模型定义文件或自己编写模型定义文件。更多细节请参考[分类模型的常规训练方法](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification)和[PaddleSlim使用文档](https://github.com/PaddlePaddle/models/blob/develop/PaddleSlim/docs/usage.md) + diff --git a/PaddleSlim/classification/pruning/compress.py b/PaddleSlim/classification/pruning/compress.py index 77f4f83aaf39e2afe7c214d7c558f6992c0218b4..0247a78f794a09a18ca13adcf60801782ce71660 100644 --- a/PaddleSlim/classification/pruning/compress.py +++ b/PaddleSlim/classification/pruning/compress.py @@ -33,6 +33,7 @@ add_arg('num_epochs', int, 120, "The number of total epochs add_arg('total_images', int, 1281167, "The number of total training images.") parser.add_argument('--step_epochs', nargs='+', type=int, default=[30, 60, 90], help="piecewise decay step") add_arg('config_file', str, None, "The config file for compression with yaml format.") +add_arg('enable_ce', bool, False, "If set, run the task with continuous evaluation logs.") # yapf: enable @@ -68,6 +69,12 @@ def create_optimizer(args): return cosine_decay(args) def compress(args): + # add ce + if args.enable_ce: + SEED = 1 + fluid.default_main_program().random_seed = SEED + fluid.default_startup_program().random_seed = SEED + class_dim=1000 image_shape="3,224,224" image_shape = [int(m) for m in image_shape.split(",")] diff --git a/PaddleSlim/classification/pruning/configs/mobilenet_v2.yaml b/PaddleSlim/classification/pruning/configs/mobilenet_v2.yaml index 8fb8e16d75cfcccf5590f2d46f332726dc34376e..59f7c7e5214fdcc6056561a7f582ceae4fecbd2f 100644 --- a/PaddleSlim/classification/pruning/configs/mobilenet_v2.yaml +++ b/PaddleSlim/classification/pruning/configs/mobilenet_v2.yaml @@ -17,6 +17,6 @@ strategies: # pruned_params: '.*expand_weights' compressor: epoch: 241 - checkpoint_path: './checkpoints/' + 
checkpoint_path: './checkpoints/mobilenet_v2' strategies: - uniform_pruning_strategy diff --git a/PaddleSlim/classification/pruning/run.sh b/PaddleSlim/classification/pruning/run.sh index db02e362bd8b959c36b8eb4ab30a1de5501c5c28..4843dc8c4cf5c604c707ee774f2442b1f6c8f355 100644 --- a/PaddleSlim/classification/pruning/run.sh +++ b/PaddleSlim/classification/pruning/run.sh @@ -38,7 +38,6 @@ nohup python -u compress.py \ --batch_size 256 \ --total_images 1281167 \ --lr_strategy "piecewise_decay" \ ---num_epochs 120 \ --lr 0.1 \ --l2_decay 3e-5 \ --pretrained_model ../pretrain/MobileNetV1_pretrained \ @@ -53,7 +52,6 @@ tailf mobilenet_v1.log #--batch_size 256 \ #--total_images 1281167 \ #--lr_strategy "cosine_decay" \ -#--num_epochs 240 \ #--lr 0.1 \ #--l2_decay 4e-5 \ #--pretrained_model ../pretrain/MobileNetV2_pretrained \ @@ -70,7 +68,6 @@ tailf mobilenet_v1.log #--total_images 1281167 \ #--lr_strategy "cosine_decay" \ #--lr 0.1 \ -#--num_epochs 120 \ #--l2_decay 1e-4 \ #--pretrained_model ../pretrain/ResNet34_pretrained \ #--config_file "./configs/resnet34.yaml" \ diff --git a/PaddleSlim/classification/quantization/README.md b/PaddleSlim/classification/quantization/README.md index 77a812071fd8e48f1bc713c53f8fb5866a5296a7..1d7f00cec38e2c7e94b466cead1852e49b4b5a32 100644 --- a/PaddleSlim/classification/quantization/README.md +++ b/PaddleSlim/classification/quantization/README.md @@ -1,4 +1,4 @@ ->运行该示例前请安装Paddle1.6或更高版本 +>运行该示例前请安装Paddle1.6或更高版本。 本示例中的run.sh脚本仅适用于linux系统,在windows环境下,请参考run.sh内容编写适合windows环境的脚本。 # 分类模型量化压缩示例 @@ -40,7 +40,7 @@ cost = fluid.layers.cross_entropy(input=out, label=label) - use_gpu: 是否使用gpu。如果选择使用GPU,请确保当前环境和Paddle版本支持GPU。默认为True。 - batch_size: 在量化之后,对模型进行fine-tune训练时用的batch size。 -- model: 要压缩的目标模型,该示例支持'MobileNet', 'MobileNetV2'和'ResNet50'。 +- model: 要压缩的目标模型,该示例支持'MobileNet', 'MobileNetV2'和'ResNet34'。 - pretrained_model: 预训练模型的路径,可以从[这里](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification#%E5%B7%B2%E5%8F%91%E5%B8%83%E6%A8%A1%E5%9E%8B%E5%8F%8A%E5%85%B6%E6%80%A7%E8%83%BD)下载。 - config_file: 压缩策略的配置文件。 @@ -49,7 +49,7 @@ cost = fluid.layers.cross_entropy(input=out, label=label) ### 训练时的模型结构 这部分介绍来源于[量化low-level API介绍](https://github.com/PaddlePaddle/models/tree/develop/PaddleSlim/quant_low_level_api#1-%E9%87%8F%E5%8C%96%E8%AE%AD%E7%BB%83low-level-apis%E4%BB%8B%E7%BB%8D)。 -PaddlePaddle框架中有四个和量化相关的IrPass, 分别是QuantizationTransformPass、QuantizationFreezePass、ConvertToInt8Pass以及TransformForMobilePass。在训练时,对网络应用了QuantizationTransformPass,作用是在网络中的conv2d、depthwise_conv2d、mul等算子的各个输入前插入连续的量化op和反量化op,并改变相应反向算子的某些输入。示例图如下: +PaddlePaddle框架中和量化相关的IrPass有QuantizationTransformPass、QuantizationFreezePass、ConvertToInt8Pass。在训练时,对网络应用了QuantizationTransformPass,作用是在网络中的conv2d、depthwise_conv2d、mul等算子的各个输入前插入连续的量化op和反量化op,并改变相应反向算子的某些输入。示例图如下:


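为便于理解,下面给出一个仅作示意的 NumPy 小例子(并非 Paddle 量化算子的真实实现,函数名 `fake_quant_dequant` 为本文假设):它按照 `abs_max` 方式把 float32 Tensor 量化到 8bit 整数网格再反量化回来,这正是训练时插入的量化 op 与反量化 op 所模拟的过程。

```python
import numpy as np

def fake_quant_dequant(x, num_bits=8):
    """模拟量化-反量化:示意训练中插入的量化op/反量化op的作用"""
    n = 2 ** (num_bits - 1) - 1          # 8bit 时为 127
    scale = np.max(np.abs(x))            # abs_max 量化比例
    q = np.clip(np.round(x / scale * n), -n, n)
    return q * scale / n                 # 反量化回 float32,带有量化误差

x = np.random.randn(4, 4).astype("float32")
print(np.max(np.abs(x - fake_quant_dequant(x))))   # 量化引入的最大误差
```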
@@ -65,10 +65,10 @@ PaddlePaddle框架中有四个和量化相关的IrPass, 分别是QuantizationTra >注意:配置文件中的信息不会保存在断点中,重启前对配置文件的修改将会生效。 ### 保存评估和预测模型 -如果在配置文件的量化策略中设置了`float_model_save_path`, `int8_model_save_path`, `mobile_model_save_path`, 在训练结束后,会保存模型量化压缩之后用于评估和预测的模型。接下来介绍这三种模型的区别。 +如果在配置文件的量化策略中设置了`float_model_save_path`, `int8_model_save_path`,在训练结束后,会保存模型量化压缩之后用于评估和预测的模型。接下来介绍这2种模型的区别。 #### FP32模型 -在介绍量化训练时的模型结构时介绍了PaddlePaddle框架中有四个和量化相关的IrPass, 分别是QuantizationTransformPass、QuantizationFreezePass、ConvertToInt8Pass以及TransformForMobilePass。FP32预测模型是在应用QuantizationFreezePass并删除eval_program中多余的operators之后,保存的模型。 +在介绍量化训练时的模型结构时介绍了PaddlePaddle框架中和量化相关的IrPass, 有QuantizationTransformPass、QuantizationFreezePass、ConvertToInt8Pass。FP32预测模型是在应用QuantizationFreezePass并删除eval_program中多余的operators之后,保存的模型。 QuantizationFreezePass主要用于改变IrGraph中量化op和反量化op的顺序,即将类似图1中的量化op和反量化op顺序改变为图2中的布局。除此之外,QuantizationFreezePass还会将`conv2d`、`depthwise_conv2d`、`mul`等算子的权重离线量化为int8_t范围内的值(但数据类型仍为float32),以减少预测过程中对权重的量化操作,示例如图2: @@ -86,20 +86,13 @@ QuantizationFreezePass主要用于改变IrGraph中量化op和反量化op的顺 图3:应用ConvertToInt8Pass后的结果

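下面用一个示意性的 NumPy 片段说明上文对权重的处理方式(仅为示意,并非 Pass 的真实实现,变量名为本文假设):QuantizationFreezePass 之后权重取值已在 int8 范围内但仍以 float32 存储,ConvertToInt8Pass 再把存储类型改为 int8。

```python
import numpy as np

w = np.random.randn(3, 3).astype("float32")
scale = np.max(np.abs(w))               # 该权重的 abs_max 量化比例
w_frozen = np.round(w / scale * 127)    # 取值在 int8 范围内,dtype 仍为 float32(对应 QuantizationFreezePass)
w_int8 = w_frozen.astype("int8")        # 存储类型改为 int8(对应 ConvertToInt8Pass)
print(w_frozen.dtype, w_int8.dtype)     # float32 int8
```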
-#### mobile模型 -经TransformForMobilePass转换后,用户可得到兼容[paddle-lite](https://github.com/PaddlePaddle/Paddle-Lite)移动端预测库的量化模型。paddle-mobile中的量化op和反量化op的名称分别为`quantize`和`dequantize`。`quantize`算子和PaddlePaddle框架中的`fake_quantize_abs_max`算子簇的功能类似,`dequantize` 算子和PaddlePaddle框架中的`fake_dequantize_max_abs`算子簇的功能相同。若选择paddle-mobile执行量化训练输出的模型,则需要将`fake_quantize_abs_max`等算子改为`quantize`算子以及将`fake_dequantize_max_abs`等算子改为`dequantize`算子,示例如图4: - -

-
-图4:应用TransformForMobilePass后的结果 -

- > 综上,可得在量化过程中有以下几种模型结构: + 1. 原始模型 2. 经QuantizationTransformPass之后得到的适用于训练的量化模型结构,在${checkpoint_path}下保存的`eval_model`是这种结构,在训练过程中每个epoch结束时也使用这个网络结构进行评估,虽然这个模型结构不是最终想要的模型结构,但是每个epoch的评估结果可用来挑选模型。 3. 经QuantizationFreezePass之后得到的FP32模型结构,具体结构已在上面进行介绍。本文档中列出的数据集的评估结果是对FP32模型结构进行评估得到的结果。这种模型结构在训练过程中只会保存一次,也就是在量化配置文件中设置的`end_epoch`结束时进行保存,如果想将其他epoch的训练结果转化成FP32模型,可使用脚本 PaddleSlim/classification/quantization/freeze.py进行转化,具体使用方法在[评估](#评估)中介绍。 4. 经ConvertToInt8Pass之后得到的8-bit模型结构,具体结构已在上面进行介绍。这种模型结构在训练过程中只会保存一次,也就是在量化配置文件中设置的`end_epoch`结束时进行保存,如果想将其他epoch的训练结果转化成8-bit模型,可使用脚本 PaddleSlim/classification/quantization/freeze.py进行转化,具体使用方法在[评估](#评估)中介绍。 -5. 经TransformForMobilePass之后得到的mobile模型结构,具体结构已在上面进行介绍。这种模型结构在训练过程中只会保存一次,也就是在量化配置文件中设置的`end_epoch`结束时进行保存,如果想将其他epoch的训练结果转化成mobile模型,可使用脚本 PaddleSlim/classification/quantization/freeze.py进行转化,具体使用方法在[评估](#评估)中介绍。 + ## 评估 @@ -120,11 +113,11 @@ python eval.py \ --model_path ${checkpoint_path}/${epoch_id}/eval_model ``` -在评估之后,选取效果最好的epoch的模型,可使用脚本 PaddleSlim/classification/quantization/freeze.py将该模型转化为以上介绍的三种模型:FP32模型,8-bit模型,mobile模型,需要配置的参数为: +在评估之后,选取效果最好的epoch的模型,可使用脚本 PaddleSlim/classification/quantization/freeze.py将该模型转化为以上介绍的2种模型:FP32模型,8-bit模型,需要配置的参数为: - model_path, 加载的模型路径,`为${checkpoint_path}/${epoch_id}/eval_model/` - weight_quant_type 模型参数的量化方式,和配置文件中的类型保持一致 -- save_path `FP32`, `8-bit`, `mobile`模型的保存路径,分别为 `${save_path}/float/`, `${save_path}/int8/`, `${save_path}/mobile/` +- save_path `FP32`, `8-bit`模型的保存路径,分别为 `${save_path}/float/`, `${save_path}/int8/` 运行命令示例: ``` @@ -166,18 +159,53 @@ python infer.py \ ### PaddleLite预测 FP32模型可使用Paddle-Lite进行加载预测,可参见教程[Paddle-Lite如何加载运行量化模型](https://github.com/PaddlePaddle/Paddle-Lite/wiki/model_quantization)。 -mobile预测模型兼容Paddle-Lite(Paddle-Mobile的升级版), 使用方法可参考[Paddle-Lite文档](https://paddlepaddle.github.io/Paddle-Lite/). 
+## 如何进行部分量化 + +通过在定义op时指定 ``name_scope``为 ``skip_quant``可对这个op跳过量化。比如在PaddleSlim/classification/models/resnet.py中,将某个conv的定义作如下改变: + +原定义: +``` +conv = self.conv_bn_layer( + input=input, + num_filters=64, + filter_size=7, + stride=2, + act='relu', + name=prefix_name + conv1_name) + +``` + +跳过量化时的定义: + +``` + +with fluid.name_scope('skip_quant'): + conv = self.conv_bn_layer( + input=input, + num_filters=64, + filter_size=7, + stride=2, + act='relu', + name=prefix_name + conv1_name) + +``` +在脚本 PaddleSlim/classification/quantization/compress.py 中,统计了``conv`` op的数量和以``fake_quantize``开头的量化op的数量,在对一些``conv`` op跳过之后,可发现以``fake_quantize``开头的量化op的数量变少。 ## 示例结果 +>当前release的结果并非超参调优后的最好结果,仅做示例参考,后续我们会优化当前结果。 + +>注: lite端运行手机信息:Android手机, +型号:BKL-AL20,运行内存RAM:4GB 6GB,CPU核心数:八核 4*A73 2.36GHz+4*A53 1.8GHz,操作系统:EMUI 8.0,CPU品牌:麒麟970 + ### MobileNetV1 | weight量化方式 | activation量化方式| top1_acc/top5_acc |Paddle Fluid inference time(ms)| Paddle Lite inference time(ms)| 模型下载| |---|---|---|---|---| ---| |baseline|- |70.99%/89.68%|- |-| [下载模型](http://paddle-imagenet-models-name.bj.bcebos.com/MobileNetV1_pretrained.tar)| |abs_max|abs_max|70.74%/89.55% |- |-| [下载模型](https://paddle-slim-models.bj.bcebos.com/quantization%2Fmobilenetv1_w_abs_a_abs_7074_8955.tar.gz)| -|abs_max|moving_average_abs_max|70.89%/89.67% |- |-| [下载模型](https://paddle-slim-models.bj.bcebos.com/quantization%2Fmobilenetv1_w_abs_a_move_7089_8967.tar.gz)| +|abs_max|moving_average_abs_max|70.89%/89.67% |5.18|37.65| [下载模型](https://paddle-slim-models.bj.bcebos.com/quantization%2Fmobilenetv1_w_abs_a_move_7089_8967.tar.gz)| |channel_wise_abs_max|abs_max|70.93%/89.65% |- |-|[下载模型](https://paddle-slim-models.bj.bcebos.com/quantization%2Fmobilenetv1_w_chan_a_abs_7093_8965.tar.gz)| >训练超参: @@ -198,19 +226,28 @@ fluid.optimizer.Momentum(momentum=0.9, |---|---|---|---|---| |baseline|- |72.15%/90.65%|- |-| |abs_max|abs_max|- |- |-| -|abs_max|moving_average_abs_max|- |- |-| +|abs_max|moving_average_abs_max|72.19%/90.71%|9.43 |56.09| |channel_wise_abs_max|abs_max|- |- |-| >训练超参: -### ResNet50 +优化器 +``` +fluid.optimizer.Momentum(momentum=0.9, + learning_rate=fluid.layers.piecewise_decay( + boundaries=[5000 * 12], + values=[0.0001, 0.00001]), + regularization=fluid.regularizer.L2Decay(1e-4)) +``` +8卡,batch size 1024,epoch 30, 挑选好的结果 +### ResNet34 -| weight量化方式 | activation量化方式| top1_acc/top5_acc |Paddle Fluid inference time(ms)| Paddle Lite inference time(ms)|模型下载| +| weight量化方式 | activation量化方式| top1_acc/top5_acc |Paddle Fluid inference time(ms)| Paddle Lite inference time(ms)| |---|---|---|---|---|---| -|baseline|- |76.50%/93.00%|- |-|[下载模型](http://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_pretrained.tar)| -|abs_max|abs_max|76.71%/93.10% |- |-|[下载模型](https://paddle-slim-models.bj.bcebos.com/quantization%2Fresnet50_w_abs_a_abs_7670_9310.tar.gz)| -|abs_max|moving_average_abs_max|76.65%/93.12% |- |-|[下载模型](https://paddle-slim-models.bj.bcebos.com/quantization%2Fresnet50_w_abs_a_move_7665_9312.tar.gz) | -|channel_wise_abs_max|abs_max|76.56%/93.05% |- |-| [下载模型](https://paddle-slim-models.bj.bcebos.com/quantization%2Fresnet50_w_chan_a_abs_7656_9304.tar.gz)| +|baseline|- |74.57%/92.14%|- |-| +|abs_max|abs_max|-|- |-| +|abs_max|moving_average_abs_max|74.63%/92.17%|7.20|392.59| +|channel_wise_abs_max|abs_max|-|- |-| >训练超参: diff --git a/PaddleSlim/classification/quantization/compress.py b/PaddleSlim/classification/quantization/compress.py index 4894684fabf3feed3b0ab179861ec921fd56884b..88c8d72ca904ce32e98309acfb1dc7d072dd132f 100644 --- 
a/PaddleSlim/classification/quantization/compress.py +++ b/PaddleSlim/classification/quantization/compress.py @@ -38,8 +38,9 @@ def compress(args): image_shape = "3,224,224" image_shape = [int(m) for m in image_shape.split(",")] - image = fluid.layers.data(name='image', shape=image_shape, dtype='float32') - label = fluid.layers.data(name='label', shape=[1], dtype='int64') + image = fluid.data( + name='image', shape=[None] + image_shape, dtype='float32') + label = fluid.data(name='label', shape=[None, 1], dtype='int64') # model definition model = models.__dict__[args.model]() @@ -95,10 +96,21 @@ def compress(args): eval_fetch_list=val_fetch_list, teacher_programs=[], train_optimizer=opt, + prune_infer_model=[[image.name], [out.name]], distiller_optimizer=None) com_pass.config(args.config_file) com_pass.run() + conv_op_num = 0 + fake_quant_op_num = 0 + for op in com_pass.context.eval_graph.ops(): + if op._op.type == 'conv2d': + conv_op_num += 1 + elif op._op.type.startswith('fake_quantize'): + fake_quant_op_num += 1 + print('conv op num {}'.format(conv_op_num)) + print('fake quant op num {}'.format(fake_quant_op_num)) + def main(): args = parser.parse_args() diff --git a/PaddleSlim/classification/quantization/configs/mobilenet_v1.yaml b/PaddleSlim/classification/quantization/configs/mobilenet_v1.yaml index 2f88ec9c9cdbc38517a678c4478a498a5739fff2..bf06f5aa41c3dc87aa6a48460e07799c875a0057 100644 --- a/PaddleSlim/classification/quantization/configs/mobilenet_v1.yaml +++ b/PaddleSlim/classification/quantization/configs/mobilenet_v1.yaml @@ -5,7 +5,6 @@ strategies: start_epoch: 0 end_epoch: 29 float_model_save_path: './output/mobilenet_v1/float' - mobile_model_save_path: './output/mobilenet_v1/mobile' int8_model_save_path: './output/mobilenet_v1/int8' weight_bits: 8 activation_bits: 8 diff --git a/PaddleSlim/classification/quantization/configs/mobilenet_v2.yaml b/PaddleSlim/classification/quantization/configs/mobilenet_v2.yaml index b3de9344a286d316208ff2c2a9af652a1fe187c4..2c3cd7f366a69536cf64b3df2edec2596f70a6f7 100644 --- a/PaddleSlim/classification/quantization/configs/mobilenet_v2.yaml +++ b/PaddleSlim/classification/quantization/configs/mobilenet_v2.yaml @@ -5,7 +5,6 @@ strategies: start_epoch: 0 end_epoch: 29 float_model_save_path: './output/mobilenet_v2/float' - mobile_model_save_path: './output/mobilenet_v2/mobile' int8_model_save_path: './output/mobilenet_v2/int8' weight_bits: 8 activation_bits: 8 diff --git a/PaddleSlim/classification/quantization/configs/resnet34.yaml b/PaddleSlim/classification/quantization/configs/resnet34.yaml index 5ff6eeb245a23ab2f8956e58f59526f4cd3b64de..4b7aa8b4130f47dbabfcfd1f4a31411273b30b1b 100644 --- a/PaddleSlim/classification/quantization/configs/resnet34.yaml +++ b/PaddleSlim/classification/quantization/configs/resnet34.yaml @@ -3,9 +3,8 @@ strategies: quantization_strategy: class: 'QuantizationStrategy' start_epoch: 0 - end_epoch: 29 + end_epoch: 0 float_model_save_path: './output/resnet34/float' - mobile_model_save_path: './output/resnet34/mobile' int8_model_save_path: './output/resnet34/int8' weight_bits: 8 activation_bits: 8 @@ -14,7 +13,7 @@ strategies: save_in_nodes: ['image'] save_out_nodes: ['fc_0.tmp_2'] compressor: - epoch: 30 + epoch: 1 checkpoint_path: './checkpoints/resnet34/' strategies: - quantization_strategy diff --git a/PaddleSlim/classification/quantization/freeze.py b/PaddleSlim/classification/quantization/freeze.py index 396875c7e1a54b56b5c985952ab4630c37824fc3..a568e5a3154cebabfad1f70a613f20807de83065 100644 --- 
a/PaddleSlim/classification/quantization/freeze.py +++ b/PaddleSlim/classification/quantization/freeze.py @@ -45,81 +45,83 @@ add_arg('save_path', str, './output', 'Path to save inference model') add_arg('weight_quant_type', str, 'abs_max', 'quantization type for weight') # yapf: enable + def eval(args): # parameters from arguments place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace() exe = fluid.Executor(place) - val_program, feed_names, fetch_targets = fluid.io.load_inference_model(args.model_path, - exe, - model_filename="__model__", - params_filename="__params__") + val_program, feed_names, fetch_targets = fluid.io.load_inference_model( + args.model_path, + exe, + model_filename="__model__.infer", + params_filename="__params__") val_reader = paddle.batch(reader.val(), batch_size=128) - feeder = fluid.DataFeeder(place=place, feed_list=feed_names, program=val_program) + feeder = fluid.DataFeeder( + place=place, feed_list=feed_names, program=val_program) - results=[] + results = [] for batch_id, data in enumerate(val_reader()): + image = [[d[0]] for d in data] + label = [[d[1]] for d in data] + feed_data = feeder.feed(image) + pred = exe.run(val_program, feed=feed_data, fetch_list=fetch_targets) + pred = np.array(pred[0]) + label = np.array(label) + sort_array = pred.argsort(axis=1) + top_1_pred = sort_array[:, -1:][:, ::-1] + top_1 = np.mean(label == top_1_pred) + top_5_pred = sort_array[:, -5:][:, ::-1] + acc_num = 0 + for i in range(len(label)): + if label[i][0] in top_5_pred[i]: + acc_num += 1 + top_5 = acc_num / len(label) + results.append([top_1, top_5]) - # top1_acc, top5_acc - result = exe.run(val_program, - feed=feeder.feed(data), - fetch_list=fetch_targets) - result = [np.mean(r) for r in result] - results.append(result) result = np.mean(np.array(results), axis=0) print("top1_acc/top5_acc= {}".format(result)) sys.stdout.flush() + _logger.info("freeze the graph for inference") test_graph = IrGraph(core.Graph(val_program.desc), for_test=True) freeze_pass = QuantizationFreezePass( - scope=fluid.global_scope(), - place=place, - weight_quantize_type=args.weight_quant_type) + scope=fluid.global_scope(), + place=place, + weight_quantize_type=args.weight_quant_type) freeze_pass.apply(test_graph) server_program = test_graph.to_program() fluid.io.save_inference_model( - dirname=os.path.join(args.save_path, 'float'), - feeded_var_names=feed_names, - target_vars=fetch_targets, - executor=exe, - main_program=server_program, - model_filename='model', - params_filename='weights') + dirname=os.path.join(args.save_path, 'float'), + feeded_var_names=feed_names, + target_vars=fetch_targets, + executor=exe, + main_program=server_program, + model_filename='model', + params_filename='weights') _logger.info("convert the weights into int8 type") convert_int8_pass = ConvertToInt8Pass( - scope=fluid.global_scope(), - place=place) + scope=fluid.global_scope(), place=place) convert_int8_pass.apply(test_graph) server_int8_program = test_graph.to_program() fluid.io.save_inference_model( - dirname=os.path.join(args.save_path, 'int8'), - feeded_var_names=feed_names, - target_vars=fetch_targets, - executor=exe, - main_program=server_int8_program, - model_filename='model', - params_filename='weights') - - _logger.info("convert the freezed pass to paddle-lite execution") - mobile_pass = TransformForMobilePass() - mobile_pass.apply(test_graph) - mobile_program = test_graph.to_program() - fluid.io.save_inference_model( - dirname=os.path.join(args.save_path, 'mobile'), - feeded_var_names=feed_names, - 
target_vars=fetch_targets, - executor=exe, - main_program=mobile_program, - model_filename='model', - params_filename='weights') + dirname=os.path.join(args.save_path, 'int8'), + feeded_var_names=feed_names, + target_vars=fetch_targets, + executor=exe, + main_program=server_int8_program, + model_filename='model', + params_filename='weights') + def main(): args = parser.parse_args() print_arguments(args) eval(args) + if __name__ == '__main__': main() diff --git a/PaddleSlim/compress.py b/PaddleSlim/compress.py index 2f5c52c03ddaf9d9afaa94bbf96c11cd1564864c..7481160a878e0f9d84f4ae2f6588a84bf6bf3083 100644 --- a/PaddleSlim/compress.py +++ b/PaddleSlim/compress.py @@ -33,12 +33,19 @@ add_arg('teacher_model', str, None, "Set the teacher network to use add_arg('teacher_pretrained_model', str, None, "Whether to use pretrained model.") add_arg('compress_config', str, None, "The config file for compression with yaml format.") add_arg('quant_only', bool, False, "Only do quantization-aware training.") +add_arg('enable_ce', bool, False, "If set, run the task with continuous evaluation logs.") # yapf: enable model_list = [m for m in dir(models) if "__" not in m] def compress(args): + # add ce + if args.enable_ce: + SEED = 1 + fluid.default_main_program().random_seed = SEED + fluid.default_startup_program().random_seed = SEED + image_shape = [int(m) for m in args.image_shape.split(",")] assert args.model in model_list, "{} is not in lists: {}".format(args.model, diff --git a/PaddleSlim/configs/mobilenetv1_resnet50_distillation.yaml b/PaddleSlim/configs/mobilenetv1_resnet50_distillation.yaml index 43a8db34a80b7dfcae3ba5e9343ad99a3739bf15..3a2b00dc0d3bb9d22a9bdcbd58e620f70906e62f 100644 --- a/PaddleSlim/configs/mobilenetv1_resnet50_distillation.yaml +++ b/PaddleSlim/configs/mobilenetv1_resnet50_distillation.yaml @@ -1,10 +1,5 @@ version: 1.0 distillers: - fsp_distiller: - class: 'FSPDistiller' - teacher_pairs: [['res2a_branch2a.conv2d.output.1.tmp_0', 'res3a_branch2a.conv2d.output.1.tmp_0']] - student_pairs: [['depthwise_conv2d_1.tmp_0', 'conv2d_3.tmp_0']] - distillation_loss_weight: 1 l2_distiller: class: 'L2Distiller' teacher_feature_map: 'res_fc.tmp_0' @@ -13,7 +8,7 @@ distillers: strategies: distillation_strategy: class: 'DistillationStrategy' - distillers: ['fsp_distiller', 'l2_distiller'] + distillers: ['l2_distiller'] start_epoch: 0 end_epoch: 130 compressor: diff --git a/PaddleSlim/docs/tutorial.md b/PaddleSlim/docs/tutorial.md index 0492bde905f9f4d0b5b37e89a1f17d6b87883bd5..6c3a6719b0b627081cb318c9f4a72f8bdd983a4a 100644 --- a/PaddleSlim/docs/tutorial.md +++ b/PaddleSlim/docs/tutorial.md @@ -44,7 +44,7 @@ 表2:模型量化前后精度对比

-目前,学术界主要将量化分为两大类:`Post Training Quantization`和`Quantization Aware Training`。`Post Training Quantization`是指使用KL散度、滑动平均等方法确定量化参数且不需要重新训练的定点量化方法。`Quantization Aware Training`是在训练过程中对量化进行建模以确定量化参数,它与`Post Training Quantization`模式相比可以提供更高的预测精度。本文主要针对`Quantization Aware Training`量化模式进行阐述说明。 +目前,学术界主要将量化分为两大类:`Post Training Quantization`和`Quantization Aware Training`。`Post Training Quantization`是指使用KL散度、滑动平均等方法确定量化参数且不需要重新训练的定点量化方法。`Quantization Aware Training`是在训练过程中对量化进行建模以确定量化参数,它与`Post Training Quantization`模式相比可以提供更高的预测精度。 ### 1.2 量化原理 @@ -59,7 +59,7 @@ $$ M = max(abs(x)) $$ $$ q = \left \lfloor \frac{x}{M} * (n - 1) \right \rceil $ $q = scale * r + b$ 其中`min-max`和`max-abs`被称为量化参数或者量化比例或者量化范围。 -#### 1.2.2 量化训练框架 +#### 1.2.2 量化训练 ##### 1.2.2.1 前向传播 前向传播过程采用模拟量化的方式,具体描述如下: @@ -104,7 +104,7 @@ $$ 因此,量化Pass也会改变相应反向算子的某些输入。 -#### 1.2.3 确定量化参数 +##### 1.2.2.3 确定量化比例系数 存在着两种策略可以计算求取量化比例系数,即动态策略和静态策略。动态策略会在每次迭代过程中计算量化比例系数的值。静态策略则对不同的输入采用相同的量化比例系数。 对于权重而言,在训练过程中采用动态策略。换句话说,在每次迭代过程中量化比例系数均会被重新计算得到直至训练过程结束。 对于激活而言,可以选择动态策略也可以选择静态策略。若选择使用静态策略,则量化比例系数会在训练过程中被评估求得,且在推断过程中被使用(不同的输入均保持不变)。静态策略中的量化比例系数可于训练过程中通过如下三种方式进行评估: @@ -119,7 +119,18 @@ $$ Vt = (1 - k) * V + k * V_{t-1} $$ 式中,$V$ 是当前batch的最大绝对值, $Vt$是滑动平均值。$k$是一个因子,例如其值可取为0.9。 +#### 1.2.4 训练后量化 +训练后量化是基于采样数据,采用KL散度等方法计算量化比例因子的方法。相比量化训练,训练后量化不需要重新训练,可以快速得到量化模型。 + +训练后量化的目标是求取量化比例因子,主要有两种方法:非饱和量化方法 ( No Saturation) 和饱和量化方法 (Saturation)。非饱和量化方法计算FP32类型Tensor中绝对值的最大值`abs_max`,将其映射为127,则量化比例因子等于`abs_max/127`。饱和量化方法使用KL散度计算一个合适的阈值`T` (`0
<T<`abs_max`),将其映射为127,则量化比例因子等于`T/127`。
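下面给出一个计算量化比例因子的示意代码(仅作说明,并非 PostTrainingQuantization 的实现;其中把阈值 `T` 映射到 127 属于本文假设,用 KL 散度选取 `T` 的过程未展示):

```python
import numpy as np

def no_saturation_scale(x, qmax=127):
    # 非饱和方法:将 abs_max 映射为 127,量化比例因子 = abs_max / 127
    return np.max(np.abs(x)) / qmax

def saturation_scale(threshold_t, qmax=127):
    # 饱和方法(假设):将选定的阈值 T (0 < T < abs_max) 映射为 127,
    # 如何用 KL 散度等方法选取 T 此处不展示
    return threshold_t / qmax

acts = np.concatenate([np.random.randn(10000), [30.0]]).astype("float32")
print(no_saturation_scale(acts), saturation_scale(4.0))
```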
- - 算法原理介绍 + + 模型量化概述 | - - 使用文档 + + 模型量化原理 | - - 示例文档 + + 量化训练使用方法和示例 | - - Model Zoo + + 训练后量化使用方法和示例

--- -# 量化训练Low-Level API使用示例 - -## 目录 - -- [量化训练Low-Level APIs介绍](#1-量化训练low-level-apis介绍) -- [基于Low-Level API的量化训练](#2-基于low-level-api的量化训练) - -## 1. 量化训练Low-Level APIs介绍 -量化训练Low-Level APIs主要涉及到PaddlePaddle框架中的四个IrPass,即`QuantizationTransformPass`、`QuantizationFreezePass`、`ConvertToInt8Pass`以及`TransformForMobilePass`。这四个IrPass的具体功能描述如下: - -* `QuantizationTransformPass`: QuantizationTransformPass主要负责在IrGraph的`conv2d`、`depthwise_conv2d`、`mul`等算子的各个输入前插入连续的量化op和反量化op,并改变相应反向算子的某些输入,示例如图1: - -

-
-图1:应用QuantizationTransformPass后的结果 -

- -* `QuantizationFreezePass`:QuantizationFreezePass主要用于改变IrGraph中量化op和反量化op的顺序,即将类似图1中的量化op和反量化op顺序改变为图2中的布局。除此之外,QuantizationFreezePass还会将`conv2d`、`depthwise_conv2d`、`mul`等算子的权重离线量化为int8_t范围内的值(但数据类型仍为float32),以减少预测过程中对权重的量化操作,示例如图2: - -

-
-图2:应用QuantizationFreezePass后的结果 -

- -* `ConvertToInt8Pass`:ConvertToInt8Pass必须在QuantizationFreezePass之后执行,其主要目的是将执行完QuantizationFreezePass后输出的权重类型由`FP32`更改为`INT8`。换言之,用户可以选择将量化后的权重保存为float32类型(不执行ConvertToInt8Pass)或者int8_t类型(执行ConvertToInt8Pass),示例如图3: - -

-
-图3:应用ConvertToInt8Pass后的结果 -

- -* `TransformForMobilePass`:经TransformForMobilePass转换后,用户可得到兼容[paddle-mobile](https://github.com/PaddlePaddle/paddle-mobile)移动端预测库的量化模型。paddle-mobile中的量化op和反量化op的名称分别为`quantize`和`dequantize`。`quantize`算子和PaddlePaddle框架中的`fake_quantize_abs_max`算子簇的功能类似,`dequantize` 算子和PaddlePaddle框架中的`fake_dequantize_max_abs`算子簇的功能相同。若选择paddle-mobile执行量化训练输出的模型,则需要将`fake_quantize_abs_max`等算子改为`quantize`算子以及将`fake_dequantize_max_abs`等算子改为`dequantize`算子,示例如图4: - -

-
-图4:应用TransformForMobilePass后的结果 -

- -## 2. 基于Low-Level API的量化训练 -本小节以ResNet50和MobileNetV1为例,介绍了PaddlePaddle量化训练Low-Level API的使用方法,具体如下: - -1) 执行如下命令clone [Pddle models repo](https://github.com/PaddlePaddle/models): -```bash -git clone https://github.com/PaddlePaddle/models.git -``` - -2) 准备数据集(包括训练数据集和验证数据集)。以ILSVRC2012数据集为例,数据集应包含如下结构: -```bash -data -└──ILSVRC2012 - ├── train - ├── train_list.txt - ├── val - └── val_list.txt -``` -3)切换到`models/PaddleSlim/quant_low_level_api`目录下,修改`run_quant.sh`内容,即将**data_dir**设置为第2)步所准备的数据集路径。最后,执行`run_quant.sh`脚本即可进行量化训练。 - -### 2.1 量化训练Low-Level API使用小结: - -* 参照[quant.py](quant.py)文件的内容,总结使用量化训练Low-Level API的方法如下: -```python -#startup_program = fluid.Program() -#train_program = fluid.Program() -#train_cost = build_program( -# main_prog=train_program, -# startup_prog=startup_program, -# is_train=True) -#build_program( -# main_prog=test_program, -# startup_prog=startup_program, -# is_train=False) -#test_program = test_program.clone(for_test=True) -# The above pseudo code is used to build up the model. -# --------------------------------------------------------------------------------- -# The following code are part of Quantization Aware Training logic: -# 0) Convert Programs to IrGraphs. -main_graph = IrGraph(core.Graph(train_program.desc), for_test=False) -test_graph = IrGraph(core.Graph(test_program.desc), for_test=True) -# 1) Make some quantization transforms in the graph before training and testing. -# According to the weight and activation quantization type, the graph will be added -# some fake quantize operators and fake dequantize operators. -transform_pass = QuantizationTransformPass( - scope=fluid.global_scope(), place=place, - activation_quantize_type=activation_quant_type, - weight_quantize_type=weight_quant_type) -transform_pass.apply(main_graph) -transform_pass.apply(test_graph) -# Compile the train_graph for training. -binary = fluid.CompiledProgram(main_graph.graph).with_data_parallel( - loss_name=train_cost.name, build_strategy=build_strategy) -# Convert the transformed test_graph to test program for testing. -test_prog = test_graph.to_program() -# For training -exe.run(binary, fetch_list=train_fetch_list) -# For testing -exe.run(program=test_prog, fetch_list=test_fetch_list) -# 2) Freeze the graph after training by adjusting the quantize -# operators' order for the inference. -freeze_pass = QuantizationFreezePass( - scope=fluid.global_scope(), - place=place, - weight_quantize_type=weight_quant_type) -freeze_pass.apply(test_graph) -# 3) Convert the weights into int8_t type. -# [This step is optional.] -convert_int8_pass = ConvertToInt8Pass(scope=fluid.global_scope(), place=place) -convert_int8_pass.apply(test_graph) -# 4) Convert the freezed graph for paddle-mobile execution. -# [This step is optional. But, if you execute this step, you must execute the step 3).] 
-mobile_pass = TransformForMobilePass() -mobile_pass.apply(test_graph) -``` -* [run_quant.sh](run_quant.sh)脚本中的命令配置详解: - -```bash - --model:指定量化训练的模型,如MobileNet、ResNet50。 - --pretrained_fp32_model:指定预训练float32模型参数的位置。 - --checkpoint:指定模型断点训练的checkpoint路径。若指定了checkpoint路径,则不应该再指定pretrained_fp32_model路径。 - --use_gpu:选择是否使用GPU训练。 - --data_dir:指定训练数据集和验证数据集的位置。 - --batch_size:设置训练batch size大小。 - --total_images:指定训练数据图像的总数。 - --class_dim:指定类别总数。 - --image_shape:指定图像的尺寸。 - --model_save_dir:指定模型保存的路径。 - --lr_strategy:学习率衰减策略。 - --num_epochs:训练的总epoch数。 - --lr:初始学习率,指定预训练模型参数进行fine-tune时一般设置一个较小的初始学习率。 - --act_quant_type:激活量化类型,可选abs_max, moving_average_abs_max, range_abs_max。 - --wt_quant_type:权重量化类型,可选abs_max, channel_wise_abs_max。 -``` +模型量化是使用更少的比特数表示神经网络的权重和激活的方法,具有加快推理速度、减小存储大小、降低功耗等优点。 -> **备注:** 量化训练结束后,用户可在其指定的模型保存路径下看到float、int8和mobile三个目录。下面对三个目录下保存的模型特点进行解释说明: -> - **float目录:** 参数范围为int8范围但参数数据类型为float32的量化模型。 -> - **int8目录:** 参数范围为int8范围且参数数据类型为int8的量化模型。 -> - **mobile目录:** 参数特点与int8目录相同且兼容[paddle-mobile](https://github.com/PaddlePaddle/paddle-mobile)的量化模型。 -> -> **注意:** 目前PaddlePaddle框架在Server端只支持使用float目录下的量化模型做预测。 +目前,模型量化主要分为量化训练(Quantization Aware Training)和训练后量化(Post Training Quantization)。量化训练是在训练过程中对量化进行建模以确定量化参数,具有为复杂模型提供更高的精度的优点。训练后量化是基于采样数据,采用KL散度等方法计算量化比例因子的方法。它具有无需重新训练、快速获得量化模型的方法。 +模型量化的原理和Low-Level API使用方法可以参考如下文档: +* [模型量化原理](../docs/tutorial.md) +* [量化训练Low-Level API使用方法和示例](./quantization_aware_training.md) +* [训练后量化Low-Level API使用方法和示例](./post_training_quantization.md) diff --git a/PaddleSlim/quant_low_level_api/post_training_quantization.md b/PaddleSlim/quant_low_level_api/post_training_quantization.md new file mode 100644 index 0000000000000000000000000000000000000000..ff9d5d56c6cf19fdb00efccec7a68d633897b32a --- /dev/null +++ b/PaddleSlim/quant_low_level_api/post_training_quantization.md @@ -0,0 +1,147 @@ + + +--- +# 训练后量化Low-Level API使用方法和示例 + +## 目录 + +- [训练后量化使用说明](#1-训练后量化使用说明) +- [训练后量化使用示例](#2-训练后量化使用示例) + +## 1. 
训练后量化使用说明 + +1)**准备模型和校准数据** + +首先,需要准备已经训练好的FP32预测模型,即 `save_inference_model()` 保存的模型。训练后量化读取校准数据进行前向计算,所以需要准备校准数据集。校准数据集应为测试集(或训练集)中具有代表性的一部分,如随机取出的部分数据,这样可以计算得到更加准确的量化比例因子。建议样本数据的数量为100~500。 + +2)**配置校准数据生成器** + +训练后量化内部使用异步数据读取的方式读取校准数据,用户只需要根据模型的输入,配置读取数据的sample_generator。sample_generator是Python生成器,用作`DataLoader.set_sample_generator()`的数据源,**必须每次返回单个样本**。建议参考官方文档[异步数据读取](https://www.paddlepaddle.org.cn/documentation/docs/zh/user_guides/howto/prepare_data/use_py_reader.html)。 + +3)**调用训练后量化** + +机器上安装PaddlePaddle develop分支编译的whl包,然后调用PostTrainingQuantization实现训练后量化,以下对api接口进行详细介绍。 + +``` python +class PostTrainingQuantization( + executor, + sample_generator, + model_dir, + model_filename=None, + params_filename=None, + batch_size=10, + batch_nums=None, + scope=None, + algo="KL", + quantizable_op_type=["conv2d", "depthwise_conv2d", "mul"], + is_full_quantize=False, + is_use_cache_file=False, + cache_dir="./temp_post_training") +``` +调用上述api,传入训练后量化必要的参数。参数说明: +* executor:执行模型的executor,可以在cpu或者gpu上执行。 +* sample_generator:第二步中配置的校准数据生成器。 +* model_dir:待量化模型的路径,其中保存模型文件和权重文件。 +* model_filename:待量化模型的模型文件名,如果模型文件名不是`__model__`,则需要使用model_filename设置模型文件名。 +* params_filename:待量化模型的权重文件名,如果所有权重保存成一个文件,则需要使用params_filename设置权重文件名。 +* batch_size:一次读取样本数据的数量。 +* batch_nums:读取样本数据的次数。如果设置为None,则从sample_generator中读取所有样本数据进行训练后量化;如果设置为非None,则从sample_generator中读取`batch_size*batch_nums`个样本数据。 +* scope:模型运行时使用的scope,默认为None,则会使用global_scope()。 +* algo:计算待量化激活Tensor的量化比例因子的方法。设置为`KL`,则使用KL散度方法,设置为`direct`,则使用abs max方法。默认为`KL`。 +* quantizable_op_type: 需要量化的op类型,默认是`["conv2d", "depthwise_conv2d", "mul"]`,列表中的值可以是任意支持量化的op类型。 +* is_full_quantize:是否进行全量化。设置为True,则对模型中所有支持量化的op进行量化;设置为False,则只对`quantizable_op_type` 中op类型进行量化。目前,支持的量化类型如下:'conv2d', 'depthwise_conv2d', 'mul', "pool2d", "elementwise_add", "concat", "softmax", "argmax", "transpose", "equal", "gather", "greater_equal", "greater_than", "less_equal", "less_than", "mean", "not_equal", "reshape", "reshape2", "bilinear_interp", "nearest_interp", "trilinear_interp", "slice", "squeeze", "elementwise_sub"。 +* is_use_cache_file:是否使用缓存文件。如果设置为True,训练后量化过程中的采样数据会保存到磁盘文件中;如果设置为False,所有采样数据会保存到内存中。当待量化的模型很大或者校准数据数量很大,建议设置is_use_cache_file为True。默认为False。 +* cache_dir:当is_use_cache_file等于True,会将采样数据保存到该文件中。量化完成后,该文件中的临时文件会自动删除。 + +``` +PostTrainingQuantization.quantize() +``` +调用上述接口开始训练后量化。根据样本数量、模型的大小和量化op类型不同,训练后量化需要的时间也不一样。比如使用ImageNet2012数据集中100图片对`MobileNetV1`进行训练后量化,花费大概1分钟。 + +``` +PostTrainingQuantization.save_quantized_model(save_model_path) +``` +调用上述接口保存训练后量化模型,其中save_model_path为保存的路径。 + +**训练后量化支持部分量化功能** +* 方法1:设置quantizable_op_type,则只会对quantizable_op_type中的Op类型进行量化,模型中其他Op类型保持不量化。 +* 方法2:构建网络的时候,将不需要量化的特定Op定义在 `skip_quant` 的name_scope中,则可以跳过特定Op的量化,示例如下。 +```python +with fluid.name_scope('skip_quant'): + pool = fluid.layers.pool2d(input=hidden, pool_size=2, pool_type='avg', pool_stride=2) + # 不对pool2d进行量化 +``` + +## 2. 
训练后量化使用示例 + +下面以MobileNetV1为例,介绍训练后量化Low-Level API的使用方法。 + +> 该示例的代码放在[models/PaddleSlim/quant_low_level_api/](https://github.com/PaddlePaddle/models/tree/develop/PaddleSlim/quant_low_level_api)目录下。如果需要执行该示例,首先clone下来[models](https://github.com/PaddlePaddle/models.git),然后执行[run_post_training_quanzation.sh](run_post_training_quanzation.sh)脚本,最后量化模型保存在`mobilenetv1_int8_model`目录下。 + +1)**准备模型和校准数据** + +安装最新版PaddlePaddle,准备已经训练好的FP32预测模型。 + +准备校准数据,文件结构如下。val文件夹中有100张图片,val_list.txt文件中包含图片的label。 +```bash +samples_100 +└──val +└──val_list.txt +``` + +2)**配置校准数据生成器** + +MobileNetV1的输入是图片和标签,所以配置读取校准数据的sample_generator,每次返回一张图片和一个标签。详细代码在[models/PaddleSlim/reader.py](https://github.com/PaddlePaddle/models/blob/develop/PaddleSlim/reader.py)。 + +3)**调用训练后量化** + +调用训练后量化的核心代码如下,详细代码在[post_training_quantization.py](post_training_quantization.py)。 +``` python +place = fluid.CUDAPlace(0) if args.use_gpu == "True" else fluid.CPUPlace() +exe = fluid.Executor(place) +sample_generator = reader.val(data_dir=args.data_path) + +ptq = PostTrainingQuantization( + executor=exe, + sample_generator=sample_generator, + model_dir=args.model_dir, + model_filename=args.model_filename, + params_filename=args.params_filename, + batch_size=args.batch_size, + batch_nums=args.batch_nums, + algo=args.algo, + is_full_quantize=args.is_full_quantize == "True") +quantized_program = ptq.quantize() +ptq.save_quantized_model(args.save_model_path) +``` +4)**测试训练后量化模型精度** + +使用ImageNet2012测试集中100张图片做校准数据集,对`conv2d`, `depthwise_conv2d`, `mul`, `pool2d`, `elementwise_add`和`concat`进行训练后量化,然后在ImageNet2012验证集上测试。下表列出了常见分类模型训练后量化前后的精度对比。 + +模型 | FP32 Top1 | FP32 Top5 | INT8 Top1 | INT8 Top5| Top1 Diff | Tp5 Diff +-|:-:|:-:|:-:|:-:|:-:|:-: +googlenet | 70.50% | 89.59% | 70.12% | 89.38% | -0.38% | -0.21% +mobilenetv1 | 70.91% | 89.54% | 70.24% | 89.03% | -0.67% | -0.51% +mobilenetv2 | 71.90% | 90.56% | 71.36% | 90.17% | -0.54% | -0.39% +resnet50 | 76.35% | 92.80% | 76.26% | 92.81% | -0.09% | +0.01% +resnet101 | 77.49% | 93.57% | 75.44% | 92.56% | -2.05% | -1.01% +vgg16 | 72.08% | 90.63% | 71.93% | 90.64% | -0.15% | +0.01% +vgg19 | 72.56% | 90.83% | 72.55% | 90.77% | -0.01% | -0.06% diff --git a/PaddleSlim/quant_low_level_api/post_training_quantization.py b/PaddleSlim/quant_low_level_api/post_training_quantization.py new file mode 100644 index 0000000000000000000000000000000000000000..9b68500882794e8d9e8010359da483479266bef6 --- /dev/null +++ b/PaddleSlim/quant_low_level_api/post_training_quantization.py @@ -0,0 +1,63 @@ +import sys +import os +import six +import numpy as np +import argparse +import paddle.fluid as fluid +sys.path.append('..') +import reader +from paddle.fluid.contrib.slim.quantization import PostTrainingQuantization + +parser = argparse.ArgumentParser() +parser.add_argument( + "--model_dir", type=str, default="", help="path/to/fp32_model_params") +parser.add_argument( + "--data_path", type=str, default="/dataset/ILSVRC2012/", help="") +parser.add_argument("--save_model_path", type=str, default="") +parser.add_argument( + "--model_filename", + type=str, + default=None, + help="The name of file to load the inference program, If it is None, the default filename __model__ will be used" +) +parser.add_argument( + "--params_filename", + type=str, + default=None, + help="The name of file to load all parameters, If parameters were saved in separate files, set it as None" +) +parser.add_argument( + "--algo", + type=str, + default="KL", + help="use KL or direct method to quantize the activation tensor, set it as KL or direct" +) 
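+# The flags below mirror the PostTrainingQuantization arguments documented in
+# post_training_quantization.md: is_full_quantize ("True"/"False") toggles full
+# quantization, batch_size and batch_nums control how much calibration data is
+# sampled, and use_gpu selects CUDAPlace(0) versus CPUPlace for the executor.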
+parser.add_argument("--is_full_quantize", type=str, default="False", help="") +parser.add_argument("--batch_size", type=int, default=10, help="") +parser.add_argument("--batch_nums", type=int, default=10, help="") +parser.add_argument("--use_gpu", type=str, default="False", help="") +args = parser.parse_args() + +print("-------------------args----------------------") +for arg, value in sorted(six.iteritems(vars(args))): + print("%s: %s" % (arg, value)) +print("---------------------------------------------") + +place = fluid.CUDAPlace(0) if args.use_gpu == "True" else fluid.CPUPlace() +exe = fluid.Executor(place) +sample_generator = reader.val(data_dir=args.data_path) + +ptq = PostTrainingQuantization( + executor=exe, + sample_generator=sample_generator, + model_dir=args.model_dir, + model_filename=args.model_filename, + params_filename=args.params_filename, + batch_size=args.batch_size, + batch_nums=args.batch_nums, + algo=args.algo, + is_full_quantize=args.is_full_quantize == "True") +quantized_program = ptq.quantize() +ptq.save_quantized_model(args.save_model_path) + +print("post training quantization finish.\n") diff --git a/PaddleSlim/quant_low_level_api/quantization_aware_training.md b/PaddleSlim/quant_low_level_api/quantization_aware_training.md new file mode 100644 index 0000000000000000000000000000000000000000..7ea6946f50912f6345bcec6ee0da7a64f81545c3 --- /dev/null +++ b/PaddleSlim/quant_low_level_api/quantization_aware_training.md @@ -0,0 +1,202 @@ + + +--- +# 量化训练Low-Level API使用示例 + +## 目录 + +- [量化训练Low-Level APIs介绍](#1-量化训练low-level-apis介绍) +- [基于Low-Level API的量化训练](#2-基于low-level-api的量化训练) +## 1. 量化训练Low-Level APIs介绍 +量化训练Low-Level APIs主要涉及到PaddlePaddle框架中的五个IrPass,即`QuantizationTransformPass`、`AddQuantDequantPass`、`QuantizationFreezePass`、`ConvertToInt8Pass`以及`TransformForMobilePass`。这五个IrPass的具体功能描述如下: + +* `QuantizationTransformPass`: QuantizationTransformPass主要负责在IrGraph的`conv2d`、`depthwise_conv2d`、`mul`等算子的各个输入前插入连续的量化op和反量化op,并改变相应反向算子的某些输入,示例如图1。 + +

+
+**图1:应用QuantizationTransformPass后的结果**
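+
+在进入下文对 `quantizable_op_type`、`skip_pattern` 等参数的说明之前,先给出 QuantizationTransformPass 的一个最小使用示意(其中 `train_program`、`place` 假设已按常规方式创建,量化类型取值仅为示例):
+
+```python
+import paddle.fluid as fluid
+from paddle.fluid import core
+from paddle.fluid.framework import IrGraph
+from paddle.fluid.contrib.slim.quantization import QuantizationTransformPass
+
+# 量化相关的 Pass 均作用在 IrGraph 上,先把 Program 转换为 IrGraph
+main_graph = IrGraph(core.Graph(train_program.desc), for_test=False)
+
+# 训练开始前应用该 Pass,为 conv2d/depthwise_conv2d/mul 的输入插入量化、反量化 op
+transform_pass = QuantizationTransformPass(
+    scope=fluid.global_scope(),
+    place=place,
+    activation_quantize_type='moving_average_abs_max',
+    weight_quantize_type='abs_max')
+transform_pass.apply(main_graph)
+```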

+ +QuantizationTransformPass支持对模型中特定类别op进行量化,只需要设置输入参数`quantizable_op_type`,默认`quantizable_op_type=['conv2d', 'depthwise_conv2d', 'mul']`。比如设置`quantizable_op_type=['conv2d']`,则该pass只会对模型中的`conv2d` 进行量化。注意,设置QuantizationTransformPass的`quantizable_op_type` 后, 也需要在QuantizationFreezePass 和 ConvertToInt8Pass传入相同的 `quantizable_op_type` 。 + +QuantizationTransformPass也支持对模型中的个别op不进行量化。如下示例:首先定义 `skip_pattern` ;然后在构建模型时候,在skip_pattern的name_scope中定义不需要量化的op,即示例中的 `conv1` ;最后在调用QuantizationTransformPass的时候,传输设置的`skip_pattern`参数,则可以实现不对 `conv1` 进行量化。 + +``` +# define network +skip_pattern=['skip_quant'] +...... +with fluid.name_scope(skip_pattern[0]): + conv1 = fluid.layers.conv2d( + input=input, + filter_size=filter_size, + num_filters=ch_out, + stride=stride, + padding=padding, + act=None, + bias_attr=bias_attr) +...... +# init QuantizationTransformPass and set skip_pattern +transform_pass = QuantizationTransformPass( + scope=fluid.global_scope(), + place=place, + activation_quantize_type=activation_quant_type, + weight_quantize_type=weight_quantize_type, + skip_pattern=skip_pattern) +# apply QuantizationTransformPass +``` + +* `AddQuantDequantPass` :AddQuantDequantPass主要负责在IrGraph的 `elementwise_add` 和 `pool2d` 等算子的各个输入前插入 `QuantDequant` op,在量化训练中收集待量化op输入 `Tensor` 的量化 `scale` 信息。该Pass使用方法和QuantizationTransformPass相似,同样支持对模型中特定类别op进行量化,支持对模型中的个别op不进行量化。注意,目前PaddleLite还不支持`elementwise_add` 和 `pool2d` 的int8 kernel。 + +* `QuantizationFreezePass`:QuantizationFreezePass主要用于改变IrGraph中量化op和反量化op的顺序,即将类似图1中的量化op和反量化op顺序改变为图2中的布局。除此之外,QuantizationFreezePass还会将`conv2d`、`depthwise_conv2d`、`mul`等算子的权重离线量化为int8_t范围内的值(但数据类型仍为float32),以减少预测过程中对权重的量化操作,示例如图2: + +

+
+**图2:应用QuantizationFreezePass后的结果**
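+
+上文提到 AddQuantDequantPass 的使用方式与 QuantizationTransformPass 相似,下面给出一个示意(沿用前文示例中的 `main_graph` 与 `place`,构造函数的其余可选参数以所用 Paddle 版本的接口为准):
+
+```python
+import paddle.fluid as fluid
+from paddle.fluid.contrib.slim.quantization import AddQuantDequantPass
+
+# 同样需要在训练开始前应用,为 elementwise_add、pool2d 等 op 的输入插入 QuantDequant op,
+# 以便在量化训练过程中统计这些激活 Tensor 的量化 scale
+quant_dequant_pass = AddQuantDequantPass(
+    scope=fluid.global_scope(), place=place)
+quant_dequant_pass.apply(main_graph)
+```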

+ +* `ConvertToInt8Pass`:ConvertToInt8Pass必须在QuantizationFreezePass之后执行,其主要目的是将执行完QuantizationFreezePass后输出的权重类型由`FP32`更改为`INT8`。换言之,用户可以选择将量化后的权重保存为float32类型(不执行ConvertToInt8Pass)或者int8_t类型(执行ConvertToInt8Pass),示例如图3: + +

+
+**图3:应用ConvertToInt8Pass后的结果**
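+
+无论是否执行 ConvertToInt8Pass,处理完成后通常都要把 IrGraph 转回 Program 再保存为预测模型。下面是一个保存量化预测模型的示意(`test_graph`、`exe` 沿用下文 2.1 节示例,feed/fetch 变量名仅为占位,需替换为实际网络中的变量):
+
+```python
+import paddle.fluid as fluid
+
+# test_graph 为已应用 QuantizationFreezePass(及可选的 ConvertToInt8Pass)的 IrGraph
+quant_infer_program = test_graph.to_program()
+fluid.io.save_inference_model(
+    dirname="quant_inference_model",    # 示例目录名
+    feeded_var_names=["image"],         # 占位:替换为实际输入变量名
+    target_vars=[predictions],          # 占位:替换为实际输出变量
+    executor=exe,
+    main_program=quant_infer_program)
+```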

+ +* `TransformForMobilePass`:经TransformForMobilePass转换后,用户可得到兼容[paddle-mobile](https://github.com/PaddlePaddle/paddle-mobile)移动端预测库的量化模型。paddle-mobile中的量化op和反量化op的名称分别为`quantize`和`dequantize`。`quantize`算子和PaddlePaddle框架中的`fake_quantize_abs_max`算子簇的功能类似,`dequantize` 算子和PaddlePaddle框架中的`fake_dequantize_max_abs`算子簇的功能相同。若选择paddle-mobile执行量化训练输出的模型,则需要将`fake_quantize_abs_max`等算子改为`quantize`算子以及将`fake_dequantize_max_abs`等算子改为`dequantize`算子,示例如图4: + +

+
+**图4:应用TransformForMobilePass后的结果**
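+
+应用上述各个 Pass 之后,可以通过遍历 IrGraph 的 op 节点,快速确认量化、反量化算子是否按预期插入或替换(示意,`test_graph` 为前述处理后的 IrGraph):
+
+```python
+from collections import Counter
+
+# 统计图中各类 op 的数量,检查 fake_quantize_* / fake_dequantize_*
+#(或经 TransformForMobilePass 处理后的 quantize / dequantize)是否存在
+op_counts = Counter(op.name() for op in test_graph.all_op_nodes())
+for op_type, num in sorted(op_counts.items()):
+    if "quantize" in op_type:
+        print(op_type, num)
+```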

+ +## 2. 基于Low-Level API的量化训练 +本小节以ResNet50和MobileNetV1为例,介绍了PaddlePaddle量化训练Low-Level API的使用方法,具体如下: + +1) 执行如下命令clone [Pddle models repo](https://github.com/PaddlePaddle/models): +```bash +git clone https://github.com/PaddlePaddle/models.git +``` + +2) 准备数据集(包括训练数据集和验证数据集)。以ILSVRC2012数据集为例,数据集应包含如下结构: +```bash +data +└──ILSVRC2012 + ├── train + ├── train_list.txt + ├── val + └── val_list.txt +``` +3)切换到`models/PaddleSlim/quant_low_level_api`目录下,修改`run_quantization_aware_training.sh`内容,即将**data_dir**设置为第2)步所准备的数据集路径。最后,执行`run_quantization_aware_training.sh`脚本即可进行量化训练。 + +### 2.1 量化训练Low-Level API使用小结: + +* 参照[quantization_aware_training.py](quantization_aware_training.py)文件的内容,总结使用量化训练Low-Level API的方法如下: +```python +#startup_program = fluid.Program() +#train_program = fluid.Program() +#train_cost = build_program( +# main_prog=train_program, +# startup_prog=startup_program, +# is_train=True) +#build_program( +# main_prog=test_program, +# startup_prog=startup_program, +# is_train=False) +#test_program = test_program.clone(for_test=True) +# The above pseudo code is used to build up the model. +# --------------------------------------------------------------------------------- +# The following code are part of Quantization Aware Training logic: +# 0) Convert Programs to IrGraphs. +main_graph = IrGraph(core.Graph(train_program.desc), for_test=False) +test_graph = IrGraph(core.Graph(test_program.desc), for_test=True) +# 1) Make some quantization transforms in the graph before training and testing. +# According to the weight and activation quantization type, the graph will be added +# some fake quantize operators and fake dequantize operators. +transform_pass = QuantizationTransformPass( + scope=fluid.global_scope(), place=place, + activation_quantize_type=activation_quant_type, + weight_quantize_type=weight_quant_type) +transform_pass.apply(main_graph) +transform_pass.apply(test_graph) +# Compile the train_graph for training. +binary = fluid.CompiledProgram(main_graph.graph).with_data_parallel( + loss_name=train_cost.name, build_strategy=build_strategy) +# Convert the transformed test_graph to test program for testing. +test_prog = test_graph.to_program() +# For training +exe.run(binary, fetch_list=train_fetch_list) +# For testing +exe.run(program=test_prog, fetch_list=test_fetch_list) +# 2) Freeze the graph after training by adjusting the quantize +# operators' order for the inference. +freeze_pass = QuantizationFreezePass( + scope=fluid.global_scope(), + place=place, + weight_quantize_type=weight_quant_type) +freeze_pass.apply(test_graph) +# 3) Convert the weights into int8_t type. +# [This step is optional.] +convert_int8_pass = ConvertToInt8Pass(scope=fluid.global_scope(), place=place) +convert_int8_pass.apply(test_graph) +# 4) Convert the freezed graph for paddle-mobile execution. +# [This step is optional. But, if you execute this step, you must execute the step 3).] 
+mobile_pass = TransformForMobilePass() +mobile_pass.apply(test_graph) +``` +* [run_quantization_aware_training.sh](run_quantization_aware_training.sh)脚本中的命令配置详解: + +```bash + --model:指定量化训练的模型,如MobileNet、ResNet50。 + --pretrained_fp32_model:指定预训练float32模型参数的位置。 + --checkpoint:指定模型断点训练的checkpoint路径。若指定了checkpoint路径,则不应该再指定pretrained_fp32_model路径。 + --use_gpu:选择是否使用GPU训练。 + --data_dir:指定训练数据集和验证数据集的位置。 + --batch_size:设置训练batch size大小。 + --total_images:指定训练数据图像的总数。 + --class_dim:指定类别总数。 + --image_shape:指定图像的尺寸。 + --model_save_dir:指定模型保存的路径。 + --lr_strategy:学习率衰减策略。 + --num_epochs:训练的总epoch数。 + --lr:初始学习率,指定预训练模型参数进行fine-tune时一般设置一个较小的初始学习率。 + --act_quant_type:激活量化类型,可选moving_average_abs_max, range_abs_max和abs_max。 + --wt_quant_type:权重量化类型,可选abs_max, channel_wise_abs_max。 +``` + +> **备注:** 量化训练结束后,用户可在其指定的模型保存路径下看到float、int8和mobile三个目录。下面对三个目录下保存的模型特点进行解释说明: +> - **float目录:** 参数范围为int8范围但参数数据类型为float32的量化模型。 +> - **int8目录:** 参数范围为int8范围且参数数据类型为int8的量化模型。 +> - **mobile目录:** 参数特点与int8目录相同且兼容[paddle-mobile](https://github.com/PaddlePaddle/paddle-mobile)的量化模型。 +> +> **注意:** 目前PaddlePaddle框架在Server端只支持使用float目录下的量化模型做预测。 + +### 2.2 测试QAT量化模型精度 + +使用ImageNet2012的训练集进行训练,然后在ImageNet2012验证集上测试。其中,我们对`conv2d`, `depthwise_conv2d`, `mul`, `pool2d`, `elementwise_add`和`concat`进行量化,训练5个epoch。下表列出了常见分类模型QAT量化前后的精度。 + +模型 | FP32 Top1 | FP32 Top5 | INT8 Top1 | INT8 Top5| Top1 Diff | Tp5 Diff +-|:-:|:-:|:-:|:-:|:-:|:-: +googlenet | 70.50% | 89.59% | 69.96% | 89.18% | -0.54% | -0.41% +mobilenetv1 | 70.91% | 89.54% | 70.50% | 89.42% | -0.41% | -0.12% +mobilenetv2 | 71.90% | 90.56% | 72.05% | 90.56% | +0.15% | -0.00% +resnet50 | 76.35% | 92.80% | 76.52% | 92.93% | +0.17% | +0.13% +resnet101 | 77.49% | 93.57% | 77.80% | 93.78% | +0.31% | +0.21% +vgg16 | 72.08% | 90.63% | 71.53% | 89.70% | -0.55% | -0.93% +vgg19 | 72.56% | 90.83% | 71.99% | 89.93% | -0.57% | -0.90% diff --git a/PaddleSlim/quant_low_level_api/quant.py b/PaddleSlim/quant_low_level_api/quantization_aware_training.py similarity index 100% rename from PaddleSlim/quant_low_level_api/quant.py rename to PaddleSlim/quant_low_level_api/quantization_aware_training.py diff --git a/PaddleSlim/quant_low_level_api/run_post_training_quanzation.sh b/PaddleSlim/quant_low_level_api/run_post_training_quanzation.sh new file mode 100644 index 0000000000000000000000000000000000000000..1d9ea31f93b6034d95580ac398fdc6688858905d --- /dev/null +++ b/PaddleSlim/quant_low_level_api/run_post_training_quanzation.sh @@ -0,0 +1,23 @@ +export CUDA_VISIBLE_DEVICES=0 + +root_url="https://paddle-inference-dist.bj.bcebos.com/int8" +mobilenetv1="mobilenetv1_fp32_model" +samples="samples_100" +if [ ! -d ${mobilenetv1} ]; then + wget ${root_url}/${mobilenetv1}.tgz + tar zxf ${mobilenetv1}.tgz +fi +if [ ! 
-d ${samples} ]; then + wget ${root_url}/${samples}.tgz + tar zxf ${samples}.tgz +fi + +python post_training_quantization.py \ + --model_dir=${mobilenetv1} \ + --data_path=${samples} \ + --save_model_path="mobilenetv1_int8_model" \ + --algo="KL" \ + --is_full_quantize=False \ + --batch_size=10 \ + --batch_nums=10 \ + --use_gpu=True \ diff --git a/PaddleSlim/quant_low_level_api/run_quant.sh b/PaddleSlim/quant_low_level_api/run_quantization_aware_training.sh similarity index 100% rename from PaddleSlim/quant_low_level_api/run_quant.sh rename to PaddleSlim/quant_low_level_api/run_quantization_aware_training.sh diff --git a/PaddleSpeech/DeepVoice3/README_cn.md b/PaddleSpeech/DeepVoice3/README_cn.md index 0828c9835747c795aa6398d7f4ad5dcea34aa674..a726a074e782289d2212808d118c637699d79ddf 100644 --- a/PaddleSpeech/DeepVoice3/README_cn.md +++ b/PaddleSpeech/DeepVoice3/README_cn.md @@ -10,7 +10,7 @@ Paddle 实现的 Deepvoice3,一个基于卷积神经网络的语音合成 (Tex ### 安装 paddlepaddle 框架 -为了更快的训练速度和更好的支持,我们推荐使用最新的开发版 paddle。用户可以最新编译的开发版 whl 包,也可以选择从源码编译 Paddle。 +为了更快的训练速度和更好的支持,我们推荐使用最新的 Paddle 开发版。用户也可以最新编译的开发版 whl 包,也可以选择从源码编译 Paddle。 1. 下载最新编译的开发版 whl 包。可以从 [**多版本 wheel 包列表-dev**](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/install/Tables.html#whl-dev) 页面中选择合适的版本。 diff --git a/PaddleSpeech/DeepVoice3/deepvoice3_paddle/conv.py b/PaddleSpeech/DeepVoice3/deepvoice3_paddle/conv.py index 3e43232df3542c9cd12bd8228c0557a362fa4baf..0805135ff8a55163d6a5ea840d46cfa09b1139c2 100644 --- a/PaddleSpeech/DeepVoice3/deepvoice3_paddle/conv.py +++ b/PaddleSpeech/DeepVoice3/deepvoice3_paddle/conv.py @@ -31,7 +31,7 @@ class Conv1D(dg.Layer): def __init__(self, name_scope, - in_cahnnels, + in_channels, num_filters, filter_size=3, dilation=1, @@ -49,7 +49,7 @@ class Conv1D(dg.Layer): else: padding = (dilation * (filter_size - 1)) // 2 - self.in_channels = in_cahnnels + self.in_channels = in_channels self.num_filters = num_filters self.filter_size = filter_size self.dilation = dilation diff --git a/PaddleSpeech/DeepVoice3/deepvoice3_paddle/data.py b/PaddleSpeech/DeepVoice3/deepvoice3_paddle/data.py index 9bdda194a5720435d075169f3bdf0b78e7279b7a..d6dc55db2a6574fa5327f6aa07141d19947ad9ad 100644 --- a/PaddleSpeech/DeepVoice3/deepvoice3_paddle/data.py +++ b/PaddleSpeech/DeepVoice3/deepvoice3_paddle/data.py @@ -273,7 +273,6 @@ def create_batch(batch): x_batch = np.array( [_pad(x[0], max_input_len) for x in batch], dtype=np.int64) - x_batch = np.expand_dims(x_batch, axis=-1) mel_batch = np.array( [_pad_2d( @@ -295,7 +294,6 @@ def create_batch(batch): text_positions = np.array( [_pad(np.arange(1, len(x[0]) + 1), max_input_len) for x in batch], dtype=np.int64) - text_positions = np.expand_dims(text_positions, axis=-1) max_decoder_target_len = max_target_len // r // downsample_step @@ -305,7 +303,6 @@ def create_batch(batch): np.expand_dims( np.arange( s, e, dtype=np.int64), axis=0), (len(batch), 1)) - frame_positions = np.expand_dims(frame_positions, axis=-1) # done flags done = np.array([ @@ -318,7 +315,7 @@ def create_batch(batch): done = np.expand_dims(np.expand_dims(done, axis=1), axis=1) if multi_speaker: - speaker_ids = np.expand_dims(np.array([x[3] for x in batch]), axis=-1) + speaker_ids = np.array([x[3] for x in batch]) return (x_batch, input_lengths, mel_batch, y_batch, text_positions, frame_positions, done, target_lengths, speaker_ids) else: diff --git a/PaddleSpeech/DeepVoice3/deepvoice3_paddle/deepvoice3.py b/PaddleSpeech/DeepVoice3/deepvoice3_paddle/deepvoice3.py index 
83b5b2ef8122457fa130301b85ff5433e514e756..ca6dbbb6105d4e127187632d36d6cad0a1216e4e 100644 --- a/PaddleSpeech/DeepVoice3/deepvoice3_paddle/deepvoice3.py +++ b/PaddleSpeech/DeepVoice3/deepvoice3_paddle/deepvoice3.py @@ -206,7 +206,7 @@ class Encoder(dg.Layer): Encode text sequence. Args: - x (Variable): Shape(B, T_enc, 1), dtype: int64. Ihe input text + x (Variable): Shape(B, T_enc), dtype: int64. Ihe input text indices. T_enc means the timesteps of decoder input x. speaker_embed (Variable, optional): Shape(Batch_size, speaker_dim), dtype: float32. Speaker embeddings. This arg is not None only @@ -591,10 +591,10 @@ class Decoder(dg.Layer): of text inputs for each example. inputs (Variable): Shape(B, C_mel, 1, T_mel), ground truth mel-spectrogram, which is used as decoder inputs when training. - text_positions (Variable): Shape(B, T_enc, 1), dtype: int64. + text_positions (Variable): Shape(B, T_enc), dtype: int64. Positions indices for text inputs for the encoder, where T_enc means the encoder timesteps. - frame_positions (Variable): Shape(B, T_dec // r, 1), dtype: + frame_positions (Variable): Shape(B, T_dec // r), dtype: int64. Positions indices for each decoder time steps. speaker_embed: shape(batch_size, speaker_dim), speaker embedding, only used for multispeaker model. @@ -717,7 +717,7 @@ class Decoder(dg.Layer): values (Variable): shape(B, C_emb, 1, T_enc), the value representation from an encoder, where C_emb means text embedding size. - text_positions (Variable): Shape(B, T_enc, 1), dtype: int64. + text_positions (Variable): Shape(B, T_enc), dtype: int64. Positions indices for text inputs for the encoder, where T_enc means the encoder timesteps. @@ -789,7 +789,7 @@ class Decoder(dg.Layer): while True: frame_pos = fluid.layers.fill_constant( - shape=[B, 1, 1], value=t + 1, dtype="int64") + shape=[B, 1], value=t + 1, dtype="int64") w = self.query_position_rate if self.n_speakers > 1: w = w * fluid.layers.reshape( @@ -1222,19 +1222,19 @@ class DeepVoiceTTS(dg.Layer): Encode text sequence and decode with ground truth mel spectrogram. Args: - text_sequences (Variable): Shape(B, T_enc, 1), dtype: int64. Ihe + text_sequences (Variable): Shape(B, T_enc), dtype: int64. Ihe input text indices. T_enc means the timesteps of text_sequences. valid_lengths (Variable): shape(batch_size,), dtype: int64, valid lengths for each example in text_sequences. mel_inputs (Variable): Shape(B, C_mel, 1, T_mel), ground truth mel-spectrogram, which is used as decoder inputs when training. - speaker_indices (Variable, optional): Shape(Batch_size, 1), + speaker_indices (Variable, optional): Shape(Batch_size), dtype: int64. Speaker index for each example. This arg is not None only when the model is a multispeaker model. - text_positions (Variable): Shape(B, T_enc, 1), dtype: int64. + text_positions (Variable): Shape(B, T_enc), dtype: int64. Positions indices for text inputs for the encoder, where T_enc means the encoder timesteps. - frame_positions (Variable): Shape(B, T_dec // r, 1), dtype: + frame_positions (Variable): Shape(B, T_dec // r), dtype: int64. Positions indices for each decoder time steps. Returns: @@ -1295,12 +1295,12 @@ class DeepVoiceTTS(dg.Layer): Encode text sequence and decode without ground truth mel spectrogram. Args: - text_sequences (Variable): Shape(B, T_enc, 1), dtype: int64. Ihe + text_sequences (Variable): Shape(B, T_enc), dtype: int64. Ihe input text indices. T_enc means the timesteps of text_sequences. - text_positions (Variable): Shape(B, T_enc, 1), dtype: int64. 
+ text_positions (Variable): Shape(B, T_enc), dtype: int64. Positions indices for text inputs for the encoder, where T_enc means the encoder timesteps. - speaker_indices (Variable, optional): Shape(Batch_size, 1), + speaker_indices (Variable, optional): Shape(Batch_size), dtype: int64. Speaker index for each example. This arg is not None only when the model is a multispeaker model. @@ -1423,7 +1423,7 @@ class ConvS2S(dg.Layer): Encode text sequence and decode with ground truth mel spectrogram. Args: - text_sequences (Variable): Shape(B, T_enc, 1), dtype: int64. Ihe + text_sequences (Variable): Shape(B, T_enc), dtype: int64. Ihe input text indices. T_enc means the timesteps of text_sequences. valid_lengths (Variable): shape(batch_size,), dtype: int64, valid lengths for each example in text_sequences. @@ -1432,10 +1432,10 @@ class ConvS2S(dg.Layer): speaker_embed (Variable, optional): Shape(Batch_size, speaker_dim), dtype: float32. Speaker embeddings. This arg is not None only when the model is a multispeaker model. - text_positions (Variable): Shape(B, T_enc, 1), dtype: int64. + text_positions (Variable): Shape(B, T_enc), dtype: int64. Positions indices for text inputs for the encoder, where T_enc means the encoder timesteps. - frame_positions (Variable): Shape(B, T_dec // r, 1), dtype: + frame_positions (Variable): Shape(B, T_dec // r), dtype: int64. Positions indices for each decoder time steps. Returns: @@ -1466,9 +1466,9 @@ class ConvS2S(dg.Layer): Encode text sequence and decode without ground truth mel spectrogram. Args: - text_sequences (Variable): Shape(B, T_enc, 1), dtype: int64. Ihe + text_sequences (Variable): Shape(B, T_enc), dtype: int64. Ihe input text indices. T_enc means the timesteps of text_sequences. - text_positions (Variable): Shape(B, T_enc, 1), dtype: int64. + text_positions (Variable): Shape(B, T_enc), dtype: int64. Positions indices for text inputs for the encoder, where T_enc means the encoder timesteps. 
speaker_embed (Variable, optional): Shape(Batch_size, speaker_dim), diff --git a/PaddleSpeech/DeepVoice3/deepvoice3_paddle/dry_run.py b/PaddleSpeech/DeepVoice3/deepvoice3_paddle/dry_run.py index 9c9518ce8aa73ad12ec225dfd4fcdeb5f5d5995b..c947f5bd2315b73e57f72c4396465f8c61412e35 100644 --- a/PaddleSpeech/DeepVoice3/deepvoice3_paddle/dry_run.py +++ b/PaddleSpeech/DeepVoice3/deepvoice3_paddle/dry_run.py @@ -48,7 +48,7 @@ def dry_run(model): mel_dim = hparams.num_mels x = np.random.randint( - low=0, high=n_vocab, size=(batch_size, enc_length, 1), dtype="int64") + low=0, high=n_vocab, size=(batch_size, enc_length), dtype="int64") input_lengths = np.arange( enc_length - batch_size + 1, enc_length + 1, dtype="int64") mel = np.random.randn(batch_size, mel_dim, 1, mel_length).astype("float32") @@ -60,18 +60,16 @@ def dry_run(model): 0, enc_length, dtype="int64"), (batch_size, 1)) text_mask = text_positions > np.expand_dims(input_lengths, 1) text_positions[text_mask] = 0 - text_positions = np.expand_dims(text_positions, axis=-1) frame_positions = np.tile( np.arange( 1, decoder_length + 1, dtype="int64"), (batch_size, 1)) - frame_positions = np.expand_dims(frame_positions, axis=-1) done = np.zeros(shape=(batch_size, 1, 1, decoder_length), dtype="float32") target_lengths = np.array([snd_sample_length] * batch_size).astype("int64") speaker_ids = np.random.randint( - low=0, high=n_speakers, size=(batch_size, 1), + low=0, high=n_speakers, size=(batch_size), dtype="int64") if n_speakers > 1 else None ismultispeaker = speaker_ids is not None diff --git a/PaddleSpeech/DeepVoice3/deepvoice3_paddle/modules.py b/PaddleSpeech/DeepVoice3/deepvoice3_paddle/modules.py index 0df3ef10d459e8171047135556ca50333512dc6d..082c0ed79ff5e967322801f30ef08b5aed02cb15 100644 --- a/PaddleSpeech/DeepVoice3/deepvoice3_paddle/modules.py +++ b/PaddleSpeech/DeepVoice3/deepvoice3_paddle/modules.py @@ -366,14 +366,14 @@ class PositionEmbedding(dg.Layer): self._dtype = dtype def set_weight(self, array): - assert self.embed._w.shape == list(array.shape), "shape does not match" - self.embed._w._ivar.value().get_tensor().set( - array, fluid.framework._current_expected_place()) + assert self.embed.weight.shape == list( + array.shape), "shape does not match" + self.embed.weight.set_value(array) def forward(self, indices, speaker_position_rate=None): """ Args: - indices (Variable): Shape (B, T, 1), dtype: int64, position + indices (Variable): Shape (B, T), dtype: int64, position indices, where B means the batch size, T means the time steps. speaker_position_rate (Variable | float, optional), position rate. It can be a float point number or a Variable with @@ -384,14 +384,14 @@ class PositionEmbedding(dg.Layer): out (Variable): Shape(B, C_pos), position embedding, where C_pos means position embedding size. 
""" - rad = fluid.layers.transpose(self.embed._w, perm=[1, 0]) + rad = fluid.layers.transpose(self.embed.weight, perm=[1, 0]) batch_size = indices.shape[0] if speaker_position_rate is None: weight = compute_position_embedding(rad) out = self._helper.create_variable_for_type_inference(self._dtype) self._helper.append_op( - type="lookup_table", + type="lookup_table_v2", inputs={"Ids": indices, "W": weight}, outputs={"Out": out}, @@ -417,7 +417,7 @@ class PositionEmbedding(dg.Layer): weight = compute_position_embedding(scaled_rad) out = self._helper.create_variable_for_type_inference(self._dtype) self._helper.append_op( - type="lookup_table", + type="lookup_table_v2", inputs={"Ids": indices, "W": weight}, outputs={"Out": out}, @@ -441,7 +441,7 @@ class PositionEmbedding(dg.Layer): self._dtype) sequence = indices[i] self._helper.append_op( - type="lookup_table", + type="lookup_table_v2", inputs={"Ids": sequence, "W": weight}, outputs={"Out": out}, diff --git a/PaddleSpeech/DeepVoice3/eval_model.py b/PaddleSpeech/DeepVoice3/eval_model.py index 3d800f3398a7bdc78b976fa5e05748fb7d3a79d5..196b8d0c194b80e71eb5b944e6e812678dbfc4e0 100644 --- a/PaddleSpeech/DeepVoice3/eval_model.py +++ b/PaddleSpeech/DeepVoice3/eval_model.py @@ -67,9 +67,9 @@ def tts(model, text, p=0., speaker_id=None): model.eval() sequence = np.array(_frontend.text_to_sequence(text, p=p)).astype("int64") - sequence = np.reshape(sequence, (1, -1, 1)) + sequence = np.reshape(sequence, (1, -1)) text_positions = np.arange(1, sequence.shape[1] + 1, dtype="int64") - text_positions = np.reshape(text_positions, (1, -1, 1)) + text_positions = np.reshape(text_positions, (1, -1)) sequence = dg.to_variable(sequence) text_positions = dg.to_variable(text_positions) @@ -191,8 +191,8 @@ def eval_model(global_step, writer, model, checkpoint_dir, ismultispeaker): # Mel writer.add_image( - "(Eval) Predicted mel spectrogram text{}_{}".format( - idx, speaker_str), + "Eval_Predicted_mel_spectrogram_text{}_{}".format(idx, + speaker_str), prepare_spec_image(mel), global_step, dataformats='HWC') @@ -205,8 +205,8 @@ def eval_model(global_step, writer, model, checkpoint_dir, ismultispeaker): try: writer.add_audio( - "(Eval) Predicted audio signal {}_{}".format(idx, - speaker_str), + "Eval_Predicted_audio_signal_{}_{}".format(idx, + speaker_str), signal, global_step, sample_rate=hparams.sample_rate) @@ -273,7 +273,7 @@ def save_states(global_step, mel_output = mel_outputs[idx].numpy().squeeze().T mel_output = prepare_spec_image(audio._denormalize(mel_output)) writer.add_image( - "Predicted mel spectrogram", + "Predicted_mel_spectrogram", mel_output, global_step, dataformats="HWC") @@ -282,7 +282,7 @@ def save_states(global_step, linear_output = linear_outputs[idx].numpy().squeeze().T spectrogram = prepare_spec_image(audio._denormalize(linear_output)) writer.add_image( - "Predicted linear spectrogram", + "Predicted_linear_spectrogram", spectrogram, global_step, dataformats="HWC") @@ -293,7 +293,7 @@ def save_states(global_step, "step{:09d}_predicted.wav".format(global_step)) try: writer.add_audio( - "Predicted audio signal", + "Predicted_audio_signal", signal, global_step, sample_rate=hparams.sample_rate) @@ -306,7 +306,7 @@ def save_states(global_step, mel_output = mel[idx].numpy().squeeze().T mel_output = prepare_spec_image(audio._denormalize(mel_output)) writer.add_image( - "Target mel spectrogram", + "Target_mel_spectrogram", mel_output, global_step, dataformats="HWC") @@ -315,7 +315,7 @@ def save_states(global_step, linear_output = 
y[idx].numpy().squeeze().T spectrogram = prepare_spec_image(audio._denormalize(linear_output)) writer.add_image( - "Target linear spectrogram", + "Target_linear_spectrogram", spectrogram, global_step, dataformats="HWC") diff --git a/README.md b/README.md index fb2d656209b2a439e463aded1132fe8f7d253310..4701c4dac40421bac64de174725e7c172b975985 100644 --- a/README.md +++ b/README.md @@ -75,9 +75,12 @@ PaddlePaddle 提供了丰富的计算单元,使得用户可以采用模块化 | 模型名称 | 模型简介 | 数据集 | 评估指标 | | ------------------------------------------------------------ | ------------------------------------------------------------ | --------- | --------------- | -| [ICNet](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/icnet) | 主要用于图像实时语义分割,能够兼顾速度和准确性,易于线上部署 | Cityscape | Mean IoU=67.0% | -| [DeepLab V3+](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/deeplabv3%2B) | 通过 encoder-decoder 进行多尺度信息的融合,同时保留了原来的空洞卷积和 ASSP 层, 其骨干网络使用了 Xception 模型,提高了语义分割的健壮性和运行速率 | Cityscape | Mean IoU=78.81% | - +| [ICNet](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/icnet) | 主要用于图像实时语义分割,能够兼顾速度和准确性,易于线上部署 | Cityscapes | Mean IoU=67.0% | +| [DeepLab V3+](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/deeplabv3%2B) | 通过 encoder-decoder 进行多尺度信息的融合,同时保留了原来的空洞卷积和 ASSP 层, 其骨干网络使用了 Xception 模型,提高了语义分割的健壮性和运行速率 | Cityscapes | Mean IoU=78.81% | +| [PSPNet (res101)](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/Research/SemSegPaddle) | 通过利用不同子区域和全局的上下文信息来增强语义分割质量,同时提出deeply supervised 的辅助loss去改善模型的优化 | Cityscapes | Mean IoU = 78.1 | +| [GloRe (res101)](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/Research/SemSegPaddle)| 提出一个轻量级的、可端到端训练的全局推理单元GloRe来高效推理image regions之间的关系,增强了模型上下文建模能力| Cityscapes | Mean IoU = 78.4 | +| [PSPNet (res101)](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/Research/SemSegPaddle) | -| PASCAL Context | Mean IoU = 48.9 | +| [GloRe (res101)](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/Research/SemSegPaddle)| -| PASCAL Context | Mean IoU = 48.4 | ### 关键点检测 人体骨骼关键点检测 (Pose Estimation) 主要检测人体的一些关键点,如关节,五官等,通过关键点描述人体骨骼信息。人体骨骼关键点检测对于描述人体姿态,预测人体行为至关重要。是诸多计算机视觉任务的基础,例如动作分类,异常行为检测,以及自动驾驶等等。 @@ -271,6 +274,7 @@ PaddleSlim 模型压缩工具库的实验结果和模型库见 [详细实验结 | 版本号 | tar包 | zip包 | | ------------- | ------------------------------------------------------------- | ------------------------------------------------------------- | +| models 1.6 | https://paddlepaddle-modles.bj.bcebos.com/models-1.6.tar.gz | https://paddlepaddle-modles.bj.bcebos.com/models-1.6.zip | | models 1.5.1 | https://paddlepaddle-modles.bj.bcebos.com/models-1.5.1.tar.gz | https://paddlepaddle-modles.bj.bcebos.com/models-1.5.1.zip | | models 1.5 | https://paddlepaddle-modles.bj.bcebos.com/models-1.5.tar.gz | https://paddlepaddle-modles.bj.bcebos.com/models-1.5.zip | | models 1.4 | https://paddlepaddle-modles.bj.bcebos.com/models-1.4.tar.gz | https://paddlepaddle-modles.bj.bcebos.com/models-1.4.zip | diff --git a/dygraph/bert/README.md b/dygraph/bert/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b55dd581c69ea97ad11d1781a6d0cbc1e7a21311 --- /dev/null +++ b/dygraph/bert/README.md @@ -0,0 +1,167 @@ +# BERT on PaddlePaddle + +[BERT](https://arxiv.org/abs/1810.04805) 是一个迁移能力很强的通用语义表示模型, 以 [Transformer](https://arxiv.org/abs/1706.03762) 为网络基本组件,以双向 `Masked Language Model` +和 `Next Sentence Prediction` 为训练目标,通过预训练得到通用语义表示,再结合简单的输出层,应用到下游的 NLP 任务,在多个任务上取得了 SOTA 的结果。本项目是 BERT 在 Paddle Fluid 上的开源实现。 + +同时推荐用户参考[ IPython Notebook 
demo](https://aistudio.baidu.com/aistudio/projectDetail/122282) + +### 发布要点 + +1) 动态图BERT模型 + +2)目前仅支持fine-tuning任务,后续会开展对pre-training任务的支持 + +3)数据集目前验证了glue上的部分任务,squad上的任务后续会进行验证 + +4)目前暂不支持FP16/FP32混合精度训练。 + +| Model | Layers | Hidden size | Heads |Parameters | +| :------| :------: | :------: |:------: |:------: | +| [BERT-Base, Uncased](https://baidu-nlp.bj.bcebos.com/DYGRAPH_models%2FBERT%2Fdata.tar.gz) | 12 | 768 |12 |110M | + +每个压缩包都包含了模型配置文件 `bert_config.json`、参数文件夹 `params`、动态图参数文件夹`dygraph_params` 和词汇表 `vocab.txt`; + +## 内容速览 +- [**安装**](#安装) +- [**Fine-Tuning**: 预训练模型如何应用到特定 NLP 任务上](#nlp-任务的-fine-tuning) + - [语句和句对分类任务](#语句和句对分类任务) + +## 目录结构 +```text +. +├── data # 示例数据 +├── model # 模型定义 +├── reader # 数据读取 +├── utils # 辅助文件 +├── batching.py # 构建 batch 脚本 +├── optimization.py # 优化方法定义 +|── run_classifier.py # 分类任务的 fine tuning +|── tokenization.py # 原始文本的 token 化 +|── train.py # 预训练过程的定义 +|── run_classifier_multi_gpu.sh # 预训练任务的启动脚本 +|── run_classifier_single_gpu.sh # 预训练任务的启动脚本 +``` + +## 安装 +本项目依赖于 Paddle Fluid **1.7.0** 及以上版本,请参考[安装指南](http://www.paddlepaddle.org/#quick-start)进行安装。 + +## NLP 任务的 Fine-tuning + +在完成 BERT 模型的预训练后,即可利用预训练参数在特定的 NLP 任务上做 Fine-tuning。以下利用开源的预训练模型,示例如何进行分类任务和阅读理解任务的 Fine-tuning,如果要运行这些任务,请通过 [发布要点](#发布要点) 一节提供的链接预先下载好对应的预训练模型。 + +### 语句和句对分类任务 + +对于 [GLUE 数据](https://gluebenchmark.com/tasks),请下载[文件](https://baidu-nlp.bj.bcebos.com/DYGRAPH_models%2FBERT%2Fdata.tar.gz),并解压到同一个目录。以 GLUE/MNLI 任务为例,启动 Fine-tuning 的方式如下(也可以直接运行run_classifier_single_gpu.sh): + +```shell +#!/bin/bash + +BERT_BASE_PATH="./data/pretrained_models/uncased_L-12_H-768_A-12/" +TASK_NAME='MNLI' +DATA_PATH="./data/glue_data/MNLI/" +CKPT_PATH="./data/saved_model/mnli_models" + +export CUDA_VISIBLE_DEVICES=0 + +# start fine-tuning +python run_classifier.py\ + --task_name ${TASK_NAME} \ + --use_cuda true \ + --do_train true \ + --do_test true \ + --batch_size 64 \ + --init_pretraining_params ${BERT_BASE_PATH}/dygraph_params/ \ + --data_dir ${DATA_PATH} \ + --vocab_path ${BERT_BASE_PATH}/vocab.txt \ + --checkpoints ${CKPT_PATH} \ + --save_steps 1000 \ + --weight_decay 0.01 \ + --warmup_proportion 0.1 \ + --validation_steps 100 \ + --epoch 3 \ + --max_seq_len 128 \ + --bert_config_path ${BERT_BASE_PATH}/bert_config.json \ + --learning_rate 5e-5 \ + --skip_steps 10 \ + --shuffle true + +``` + +这里的 `uncased_L-12_H-768_A-12/` 即是转换后的英文预训练模型,程序会将模型存储在`CKPT_PATH`指定的位置里。 + +### 使用单机多卡进行fine-tuning + +飞桨动态图使用多进程方式进行数据并行和梯度同步,可以参考`run_classifier_multi_gpu.sh`脚本进行单机多卡fine-tuning: + +```shell +#!/bin/bash + +BERT_BASE_PATH="./data/pretrained_models/uncased_L-12_H-768_A-12/" +TASK_NAME='MNLI' +DATA_PATH="./data/glue_data/MNLI/" +CKPT_PATH="./data/saved_model/mnli_models" +GPU_TO_USE="0,1,2,3" + +export CUDA_VISIBLE_DEVICES=$GPU_TO_USE + +# start fine-tuning +python -m paddle.distributed.launch --selected_gpus=$GPU_TO_USE --log_dir ./cls_log run_classifier.py \ + --task_name ${TASK_NAME} \ + --use_cuda true \ + --use_data_parallel true \ + --do_train true \ + --do_test true \ + --batch_size 64 \ + --in_tokens false \ + --init_pretraining_params ${BERT_BASE_PATH}/dygraph_params/ \ + --data_dir ${DATA_PATH} \ + --vocab_path ${BERT_BASE_PATH}/vocab.txt \ + --checkpoints ${CKPT_PATH} \ + --save_steps 1000 \ + --weight_decay 0.01 \ + --warmup_proportion 0.1 \ + --validation_steps 100 \ + --epoch 3 \ + --max_seq_len 128 \ + --bert_config_path ${BERT_BASE_PATH}/bert_config.json \ + --learning_rate 5e-5 \ + --skip_steps 10 \ + --shuffle true +``` + +### 读取训练好的模型进行预测 + 
+可以参考`run_classifier_prediction.sh`脚本,读取训练好的模型进行预测,可参考以下命令: + +```shell +#!/bin/bash + +BERT_BASE_PATH="./data/pretrained_models/uncased_L-12_H-768_A-12/" +TASK_NAME='MNLI' +DATA_PATH="./data/glue_data/MNLI/" +CKPT_PATH="./data/saved_model/mnli_models" + +export CUDA_VISIBLE_DEVICES=0 + +# start testing +python run_classifier.py\ + --task_name ${TASK_NAME} \ + --use_cuda true \ + --do_train false \ + --do_test true \ + --batch_size 64 \ + --in_tokens false \ + --data_dir ${DATA_PATH} \ + --vocab_path ${BERT_BASE_PATH}/vocab.txt \ + --checkpoints ${CKPT_PATH} \ + --save_steps 1000 \ + --weight_decay 0.01 \ + --warmup_proportion 0.1 \ + --validation_steps 100 \ + --epoch 3 \ + --max_seq_len 128 \ + --bert_config_path ${BERT_BASE_PATH}/bert_config.json \ + --learning_rate 5e-5 \ + --skip_steps 10 \ + --shuffle false +``` diff --git a/dygraph/bert/batching.py b/dygraph/bert/batching.py new file mode 100644 index 0000000000000000000000000000000000000000..7a214700a9e2db27900602c235c32e435e7b85fb --- /dev/null +++ b/dygraph/bert/batching.py @@ -0,0 +1,189 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Mask, padding and batching.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np + + +def mask(batch_tokens, total_token_num, vocab_size, CLS=1, SEP=2, MASK=3): + """ + Add mask for batch_tokens, return out, mask_label, mask_pos; + Note: mask_pos responding the batch_tokens after padded; + """ + max_len = max([len(sent) for sent in batch_tokens]) + mask_label = [] + mask_pos = [] + prob_mask = np.random.rand(total_token_num) + # Note: the first token is [CLS], so [low=1] + replace_ids = np.random.randint(1, high=vocab_size, size=total_token_num) + pre_sent_len = 0 + prob_index = 0 + for sent_index, sent in enumerate(batch_tokens): + mask_flag = False + prob_index += pre_sent_len + for token_index, token in enumerate(sent): + prob = prob_mask[prob_index + token_index] + if prob > 0.15: + continue + elif 0.03 < prob <= 0.15: + # mask + if token != SEP and token != CLS: + mask_label.append(sent[token_index]) + sent[token_index] = MASK + mask_flag = True + mask_pos.append(sent_index * max_len + token_index) + elif 0.015 < prob <= 0.03: + # random replace + if token != SEP and token != CLS: + mask_label.append(sent[token_index]) + sent[token_index] = replace_ids[prob_index + token_index] + mask_flag = True + mask_pos.append(sent_index * max_len + token_index) + else: + # keep the original token + if token != SEP and token != CLS: + mask_label.append(sent[token_index]) + mask_pos.append(sent_index * max_len + token_index) + pre_sent_len = len(sent) + + # ensure at least mask one word in a sentence + while not mask_flag: + token_index = int(np.random.randint(1, high=len(sent) - 1, size=1)) + if sent[token_index] != SEP and sent[token_index] != CLS: + mask_label.append(sent[token_index]) + sent[token_index] = MASK + mask_flag = True + 
mask_pos.append(sent_index * max_len + token_index) + mask_label = np.array(mask_label).astype("int64").reshape([-1, 1]) + mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1]) + return batch_tokens, mask_label, mask_pos + + +def prepare_batch_data(insts, + total_token_num, + voc_size=0, + pad_id=None, + cls_id=None, + sep_id=None, + mask_id=None, + return_input_mask=True, + return_max_len=True, + return_num_token=False): + """ + 1. generate Tensor of data + 2. generate Tensor of position + 3. generate self attention mask, [shape: batch_size * max_len * max_len] + """ + + batch_src_ids = [inst[0] for inst in insts] + batch_sent_ids = [inst[1] for inst in insts] + batch_pos_ids = [inst[2] for inst in insts] + labels_list = [] + # compatible with squad, whose example includes start/end positions, + # or unique id + + for i in range(3, len(insts[0]), 1): + labels = [inst[i] for inst in insts] + labels = np.array(labels).astype("int64").reshape([-1, 1]) + labels_list.append(labels) + + # First step: do mask without padding + if mask_id >= 0: + out, mask_label, mask_pos = mask( + batch_src_ids, + total_token_num, + vocab_size=voc_size, + CLS=cls_id, + SEP=sep_id, + MASK=mask_id) + else: + out = batch_src_ids + # Second step: padding + src_id, self_input_mask = pad_batch_data( + out, pad_idx=pad_id, return_input_mask=True) + pos_id = pad_batch_data( + batch_pos_ids, + pad_idx=pad_id, + return_pos=False, + return_input_mask=False) + sent_id = pad_batch_data( + batch_sent_ids, + pad_idx=pad_id, + return_pos=False, + return_input_mask=False) + + if mask_id >= 0: + return_list = [ + src_id, pos_id, sent_id, self_input_mask, mask_label, mask_pos + ] + labels_list + else: + return_list = [src_id, pos_id, sent_id, self_input_mask] + labels_list + + return return_list if len(return_list) > 1 else return_list[0] + + +def pad_batch_data(insts, + pad_idx=0, + return_pos=False, + return_input_mask=False, + return_max_len=False, + return_num_token=False): + """ + Pad the instances to the max sequence length in batch, and generate the + corresponding position data and input mask. + """ + return_list = [] + max_len = max(len(inst) for inst in insts) + # Any token included in dict can be used to pad, since the paddings' loss + # will be masked out by weights and make no effect on parameter gradients. + + inst_data = np.array([ + list(inst) + list([pad_idx] * (max_len - len(inst))) for inst in insts + ]) + return_list += [inst_data.astype("int64").reshape([-1, max_len])] + + # position data + if return_pos: + inst_pos = np.array([ + list(range(0, len(inst))) + [pad_idx] * (max_len - len(inst)) + for inst in insts + ]) + + return_list += [inst_pos.astype("int64").reshape([-1, max_len])] + + if return_input_mask: + # This is used to avoid attention on paddings. 
+ input_mask_data = np.array([[1] * len(inst) + [0] * + (max_len - len(inst)) for inst in insts]) + input_mask_data = np.expand_dims(input_mask_data, axis=-1) + return_list += [input_mask_data.astype("float32")] + + if return_max_len: + return_list += [max_len] + + if return_num_token: + num_token = 0 + for inst in insts: + num_token += len(inst) + return_list += [num_token] + + return return_list if len(return_list) > 1 else return_list[0] + + +if __name__ == "__main__": + pass diff --git a/dygraph/bert/model/__init__.py b/dygraph/bert/model/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/dygraph/bert/model/bert.py b/dygraph/bert/model/bert.py new file mode 100644 index 0000000000000000000000000000000000000000..df59391a38e0360e3b19c617c55dba858ce07349 --- /dev/null +++ b/dygraph/bert/model/bert.py @@ -0,0 +1,266 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"dygraph transformer layers" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import six +import json +import numpy as np + +import paddle +import paddle.fluid as fluid +from paddle.fluid.dygraph import Embedding, LayerNorm, Linear, to_variable, Layer, guard + +from model.transformer_encoder import EncoderLayer, PrePostProcessLayer + + +class BertConfig(object): + def __init__(self, config_path): + self._config_dict = self._parse(config_path) + + def _parse(self, config_path): + try: + with open(config_path) as json_file: + config_dict = json.load(json_file) + except Exception: + raise IOError("Error in parsing bert model config file '%s'" % + config_path) + else: + return config_dict + + def __getitem__(self, key): + return self._config_dict[key] + + def print_config(self): + for arg, value in sorted(six.iteritems(self._config_dict)): + print('%s: %s' % (arg, value)) + print('------------------------------------------------') + + +class BertModelLayer(Layer): + """ + bert + """ + + def __init__(self, config, return_pooled_out=True, use_fp16=False): + super(BertModelLayer, self).__init__() + + self._emb_size = config['hidden_size'] + self._n_layer = config['num_hidden_layers'] + self._n_head = config['num_attention_heads'] + self._voc_size = config['vocab_size'] + self._max_position_seq_len = config['max_position_embeddings'] + self._sent_types = config['type_vocab_size'] + self._hidden_act = config['hidden_act'] + self._prepostprocess_dropout = config['hidden_dropout_prob'] + self._attention_dropout = config['attention_probs_dropout_prob'] + self.return_pooled_out = return_pooled_out + + self._word_emb_name = "word_embedding" + self._pos_emb_name = "pos_embedding" + self._sent_emb_name = "sent_embedding" + self._dtype = "float16" if use_fp16 else "float32" + + self._param_initializer = fluid.initializer.TruncatedNormal( + scale=config['initializer_range']) + + self._src_emb = Embedding( + size=[self._voc_size, 
self._emb_size], + param_attr=fluid.ParamAttr( + name=self._word_emb_name, initializer=self._param_initializer), + dtype=self._dtype) + + self._pos_emb = Embedding( + size=[self._max_position_seq_len, self._emb_size], + param_attr=fluid.ParamAttr( + name=self._pos_emb_name, initializer=self._param_initializer), + dtype=self._dtype) + + self._sent_emb = Embedding( + size=[self._sent_types, self._emb_size], + param_attr=fluid.ParamAttr( + name=self._sent_emb_name, initializer=self._param_initializer), + dtype=self._dtype) + + self.pooled_fc = Linear( + input_dim=self._emb_size, + output_dim=self._emb_size, + param_attr=fluid.ParamAttr( + name="pooled_fc.w_0", initializer=self._param_initializer), + bias_attr="pooled_fc.b_0", + act="tanh") + + self.pre_process_layer = PrePostProcessLayer( + "nd", self._emb_size, self._prepostprocess_dropout, "") + + self._encoder = EncoderLayer( + hidden_act=self._hidden_act, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer) + + def forward(self, src_ids, position_ids, sentence_ids, input_mask): + """ + forward + """ + src_emb = self._src_emb(src_ids) + pos_emb = self._pos_emb(position_ids) + sent_emb = self._sent_emb(sentence_ids) + + emb_out = src_emb + pos_emb + emb_out = emb_out + sent_emb + + emb_out = self.pre_process_layer(emb_out) + + self_attn_mask = fluid.layers.matmul( + x=input_mask, y=input_mask, transpose_y=True) + self_attn_mask = fluid.layers.scale( + x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False) + n_head_self_attn_mask = fluid.layers.stack( + x=[self_attn_mask] * self._n_head, axis=1) + n_head_self_attn_mask.stop_gradient = True + + enc_output = self._encoder(emb_out, n_head_self_attn_mask) + + if not self.return_pooled_out: + return enc_output + next_sent_feat = fluid.layers.slice( + input=enc_output, axes=[1], starts=[0], ends=[1]) + next_sent_feat = self.pooled_fc(next_sent_feat) + next_sent_feat = fluid.layers.reshape( + next_sent_feat, shape=[-1, self._emb_size]) + + return enc_output, next_sent_feat + + +class PretrainModelLayer(Layer): + """ + pretrain model + """ + + def __init__(self, + config, + return_pooled_out=True, + weight_sharing=True, + use_fp16=False): + super(PretrainModelLayer, self).__init__() + self.config = config + self._voc_size = config['vocab_size'] + self._emb_size = config['hidden_size'] + self._hidden_act = config['hidden_act'] + self._prepostprocess_dropout = config['hidden_dropout_prob'] + + self._word_emb_name = "word_embedding" + self._param_initializer = fluid.initializer.TruncatedNormal( + scale=config['initializer_range']) + self._weight_sharing = weight_sharing + self.use_fp16 = use_fp16 + self._dtype = "float16" if use_fp16 else "float32" + + self.bert_layer = BertModelLayer( + config=self.config, return_pooled_out=True, use_fp16=self.use_fp16) + + self.pre_process_layer = PrePostProcessLayer( + "n", self._emb_size, self._prepostprocess_dropout, "pre_encoder") + + self.pooled_fc = Linear( + input_dim=self._emb_size, + output_dim=self._emb_size, + param_attr=fluid.ParamAttr( + name="mask_lm_trans_fc.w_0", + initializer=self._param_initializer), + bias_attr="mask_lm_trans_fc.b_0", + act="tanh") + + self.mask_lm_out_bias_attr = 
fluid.ParamAttr( + name="mask_lm_out_fc.b_0", + initializer=fluid.initializer.Constant(value=0.0)) + + if not self._weight_sharing: + self.out_fc = Linear( + input_dim=self._emb_size, + output_dim=self._voc_size, + param_attr=fluid.ParamAttr( + name="mask_lm_out_fc.w_0", + initializer=self._param_initializer), + bias_attr=self.mask_lm_out_bias_attr) + else: + self.fc_create_params = self.create_parameter( + shape=[self._voc_size], + dtype=self._dtype, + attr=self.mask_lm_out_bias_attr, + is_bias=True) + + self.next_sent_fc = Linear( + input_dim=self._emb_size, + output_dim=2, + param_attr=fluid.ParamAttr( + name="next_sent_fc.w_0", initializer=self._param_initializer), + bias_attr="next_sent_fc.b_0") + + def forward(self, src_ids, position_ids, sentence_ids, input_mask, + mask_label, mask_pos, labels): + """ + forward + """ + mask_pos = fluid.layers.cast(x=mask_pos, dtype='int32') + + enc_output, next_sent_feat = self.bert_layer(src_ids, position_ids, + sentence_ids, input_mask) + reshaped_emb_out = fluid.layers.reshape( + x=enc_output, shape=[-1, self._emb_size]) + + mask_feat = fluid.layers.gather(input=reshaped_emb_out, index=mask_pos) + + mask_trans_feat = self.pooled_fc(mask_feat) + mask_trans_feat = self.pre_process_layer(None, mask_trans_feat, "n", + self._prepostprocess_dropout) + + if self._weight_sharing: + fc_out = fluid.layers.matmul( + x=mask_trans_feat, + y=self.bert_layer._src_emb._w, + transpose_y=True) + fc_out += self.fc_create_params + else: + fc_out = self.out_fc(mask_trans_feat) + + mask_lm_loss = fluid.layers.softmax_with_cross_entropy( + logits=fc_out, label=mask_label) + mean_mask_lm_loss = fluid.layers.mean(mask_lm_loss) + + next_sent_fc_out = self.next_sent_fc(next_sent_feat) + + next_sent_loss, next_sent_softmax = fluid.layers.softmax_with_cross_entropy( + logits=next_sent_fc_out, label=labels, return_softmax=True) + + next_sent_acc = fluid.layers.accuracy( + input=next_sent_softmax, label=labels) + + mean_next_sent_loss = fluid.layers.mean(next_sent_loss) + + loss = mean_next_sent_loss + mean_mask_lm_loss + return next_sent_acc, mean_mask_lm_loss, loss diff --git a/dygraph/bert/model/cls.py b/dygraph/bert/model/cls.py new file mode 100644 index 0000000000000000000000000000000000000000..7130f8b9345052774f7133b5b57e713bf683ac19 --- /dev/null +++ b/dygraph/bert/model/cls.py @@ -0,0 +1,92 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"dygraph transformer layers" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import six +import json +import numpy as np + +import paddle +import paddle.fluid as fluid +from paddle.fluid.dygraph import Linear, Layer + +from model.bert import BertModelLayer + + +class ClsModelLayer(Layer): + """ + classify model + """ + + def __init__(self, + args, + config, + num_labels, + is_training=True, + return_pooled_out=True, + use_fp16=False): + super(ClsModelLayer, self).__init__() + self.config = config + self.is_training = is_training + self.use_fp16 = use_fp16 + self.loss_scaling = args.loss_scaling + + self.bert_layer = BertModelLayer( + config=self.config, return_pooled_out=True, use_fp16=self.use_fp16) + + self.cls_fc = Linear( + input_dim=self.config["hidden_size"], + output_dim=num_labels, + param_attr=fluid.ParamAttr( + name="cls_out_w", + initializer=fluid.initializer.TruncatedNormal(scale=0.02)), + bias_attr=fluid.ParamAttr( + name="cls_out_b", initializer=fluid.initializer.Constant(0.))) + + def forward(self, data_ids): + """ + forward + """ + src_ids = data_ids[0] + position_ids = data_ids[1] + sentence_ids = data_ids[2] + input_mask = data_ids[3] + labels = data_ids[4] + + enc_output, next_sent_feat = self.bert_layer(src_ids, position_ids, + sentence_ids, input_mask) + + cls_feats = fluid.layers.dropout( + x=next_sent_feat, + dropout_prob=0.1, + dropout_implementation="upscale_in_train") + + logits = self.cls_fc(cls_feats) + + ce_loss, probs = fluid.layers.softmax_with_cross_entropy( + logits=logits, label=labels, return_softmax=True) + loss = fluid.layers.mean(x=ce_loss) + + if self.use_fp16 and self.loss_scaling > 1.0: + loss *= self.loss_scaling + + num_seqs = fluid.layers.create_tensor(dtype='int64') + accuracy = fluid.layers.accuracy( + input=probs, label=labels, total=num_seqs) + + return loss, accuracy, num_seqs diff --git a/dygraph/bert/model/transformer_encoder.py b/dygraph/bert/model/transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..3fd0daf158307e0aca4d490be1a6f11855743165 --- /dev/null +++ b/dygraph/bert/model/transformer_encoder.py @@ -0,0 +1,395 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"dygraph transformer layers" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np + +import paddle +import paddle.fluid as fluid +from paddle.fluid.dygraph import Embedding, LayerNorm, Linear, Layer + + +class PrePostProcessLayer(Layer): + """ + PrePostProcessLayer + """ + + def __init__(self, process_cmd, d_model, dropout_rate, name): + super(PrePostProcessLayer, self).__init__() + self.process_cmd = process_cmd + self.functors = [] + self.exec_order = "" + + for cmd in self.process_cmd: + if cmd == "a": # add residual connection + self.functors.append(lambda x, y: x + y if y else x) + self.exec_order += "a" + elif cmd == "n": # add layer normalization + self.functors.append( + self.add_sublayer( + "layer_norm_%d" % len( + self.sublayers(include_sublayers=False)), + LayerNorm( + normalized_shape=d_model, + param_attr=fluid.ParamAttr( + name=name + "_layer_norm_scale", + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr( + name=name + "_layer_norm_bias", + initializer=fluid.initializer.Constant(0.))))) + self.exec_order += "n" + elif cmd == "d": # add dropout + if dropout_rate: + self.functors.append(lambda x: fluid.layers.dropout( + x, dropout_prob=dropout_rate, is_test=False)) + self.exec_order += "d" + + def forward(self, x, residual=None): + for i, cmd in enumerate(self.exec_order): + if cmd == "a": + x = self.functors[i](x, residual) + else: + x = self.functors[i](x) + return x + + +class PositionwiseFeedForwardLayer(Layer): + """ + PositionwiseFeedForwardLayer + """ + + def __init__(self, + hidden_act, + d_inner_hid, + d_model, + dropout_rate, + param_initializer=None, + name=""): + super(PositionwiseFeedForwardLayer, self).__init__() + + self._i2h = Linear( + input_dim=d_model, + output_dim=d_inner_hid, + param_attr=fluid.ParamAttr( + name=name + '_fc_0.w_0', initializer=param_initializer), + bias_attr=name + '_fc_0.b_0', + act=hidden_act) + + self._h2o = Linear( + input_dim=d_inner_hid, + output_dim=d_model, + param_attr=fluid.ParamAttr( + name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + + self._dropout_rate = dropout_rate + + def forward(self, x): + """ + forward + :param x: + :return: + """ + hidden = self._i2h(x) + if self._dropout_rate: + hidden = fluid.layers.dropout( + hidden, + dropout_prob=self._dropout_rate, + upscale_in_train="upscale_in_train", + is_test=False) + out = self._h2o(hidden) + return out + + +class MultiHeadAttentionLayer(Layer): + """ + MultiHeadAttentionLayer + """ + + def __init__(self, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + gather_idx=None, + static_kv=False, + param_initializer=None, + name=""): + super(MultiHeadAttentionLayer, self).__init__() + self._n_head = n_head + self._d_key = d_key + self._d_value = d_value + self._d_model = d_model + self._dropout_rate = dropout_rate + + self._q_fc = Linear( + input_dim=d_model, + output_dim=d_key * n_head, + param_attr=fluid.ParamAttr( + name=name + '_query_fc.w_0', initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + + self._k_fc = Linear( + input_dim=d_model, + output_dim=d_key * n_head, + param_attr=fluid.ParamAttr( + name=name + '_key_fc.w_0', initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + + self._v_fc = Linear( + input_dim=d_model, + output_dim=d_value * n_head, + param_attr=fluid.ParamAttr( + name=name + '_value_fc.w_0', initializer=param_initializer), + bias_attr=name + 
'_value_fc.b_0') + + self._proj_fc = Linear( + input_dim=d_value * n_head, + output_dim=d_model, + param_attr=fluid.ParamAttr( + name=name + '_output_fc.w_0', initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + + def forward(self, queries, keys, values, attn_bias): + """ + forward + :param queries: + :param keys: + :param values: + :param attn_bias: + :return: + """ + # compute q ,k ,v + keys = queries if keys is None else keys + values = keys if values is None else values + + q = self._q_fc(queries) + k = self._k_fc(keys) + v = self._v_fc(values) + + # split head + + q_hidden_size = q.shape[-1] + reshaped_q = fluid.layers.reshape( + x=q, + shape=[0, 0, self._n_head, q_hidden_size // self._n_head], + inplace=False) + transpose_q = fluid.layers.transpose(x=reshaped_q, perm=[0, 2, 1, 3]) + + k_hidden_size = k.shape[-1] + reshaped_k = fluid.layers.reshape( + x=k, + shape=[0, 0, self._n_head, k_hidden_size // self._n_head], + inplace=False) + transpose_k = fluid.layers.transpose(x=reshaped_k, perm=[0, 2, 1, 3]) + + v_hidden_size = v.shape[-1] + reshaped_v = fluid.layers.reshape( + x=v, + shape=[0, 0, self._n_head, v_hidden_size // self._n_head], + inplace=False) + transpose_v = fluid.layers.transpose(x=reshaped_v, perm=[0, 2, 1, 3]) + + scaled_q = fluid.layers.scale(x=transpose_q, scale=self._d_key**-0.5) + # scale dot product attention + product = fluid.layers.matmul( + #x=transpose_q, + x=scaled_q, + y=transpose_k, + transpose_y=True) + #alpha=self._d_model**-0.5) + if attn_bias: + product += attn_bias + weights = fluid.layers.softmax(product) + if self._dropout_rate: + weights_droped = fluid.layers.dropout( + weights, + dropout_prob=self._dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = fluid.layers.matmul(weights_droped, transpose_v) + else: + out = fluid.layers.matmul(weights, transpose_v) + + # combine heads + if len(out.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + trans_x = fluid.layers.transpose(out, perm=[0, 2, 1, 3]) + final_out = fluid.layers.reshape( + x=trans_x, + shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], + inplace=False) + + # fc to output + proj_out = self._proj_fc(final_out) + return proj_out + + +class EncoderSubLayer(Layer): + """ + EncoderSubLayer + """ + + def __init__(self, + hidden_act, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=""): + + super(EncoderSubLayer, self).__init__() + self.name = name + self._preprocess_cmd = preprocess_cmd + self._postprocess_cmd = postprocess_cmd + self._prepostprocess_dropout = prepostprocess_dropout + + self._preprocess_layer = PrePostProcessLayer( + self._preprocess_cmd, + d_model, + prepostprocess_dropout, + name=name + "_pre_att") + + self._multihead_attention_layer = MultiHeadAttentionLayer( + d_key, + d_value, + d_model, + n_head, + attention_dropout, + None, + None, + False, + param_initializer, + name=name + "_multi_head_att") + + self._postprocess_layer = PrePostProcessLayer( + self._postprocess_cmd, + d_model, + self._prepostprocess_dropout, + name=name + "_post_att") + self._preprocess_layer2 = PrePostProcessLayer( + self._preprocess_cmd, + d_model, + self._prepostprocess_dropout, + name=name + "_pre_ffn") + + self._positionwise_feed_forward = PositionwiseFeedForwardLayer( + hidden_act, + d_inner_hid, + d_model, + relu_dropout, + param_initializer, + name=name + "_ffn") + + 
self._postprocess_layer2 = PrePostProcessLayer( + self._postprocess_cmd, + d_model, + self._prepostprocess_dropout, + name=name + "_post_ffn") + + def forward(self, enc_input, attn_bias): + """ + forward + :param enc_input: + :param attn_bias: + :return: + """ + pre_process_multihead = self._preprocess_layer(enc_input) + + attn_output = self._multihead_attention_layer(pre_process_multihead, + None, None, attn_bias) + attn_output = self._postprocess_layer(attn_output, enc_input) + + pre_process2_output = self._preprocess_layer2(attn_output) + + ffd_output = self._positionwise_feed_forward(pre_process2_output) + + return self._postprocess_layer2(ffd_output, attn_output) + + +class EncoderLayer(Layer): + """ + encoder + """ + + def __init__(self, + hidden_act, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=""): + + super(EncoderLayer, self).__init__() + self._preprocess_cmd = preprocess_cmd + self._encoder_sublayers = list() + self._prepostprocess_dropout = prepostprocess_dropout + self._n_layer = n_layer + self._hidden_act = hidden_act + self._preprocess_layer = PrePostProcessLayer( + self._preprocess_cmd, 3, self._prepostprocess_dropout, + "post_encoder") + + for i in range(n_layer): + self._encoder_sublayers.append( + self.add_sublayer( + 'esl_%d' % i, + EncoderSubLayer( + hidden_act, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd, + postprocess_cmd, + param_initializer, + name=name + '_layer_' + str(i)))) + + def forward(self, enc_input, attn_bias): + """ + forward + :param enc_input: + :param attn_bias: + :return: + """ + for i in range(self._n_layer): + enc_output = self._encoder_sublayers[i](enc_input, attn_bias) + enc_input = enc_output + + return self._preprocess_layer(enc_output) diff --git a/dygraph/bert/optimization.py b/dygraph/bert/optimization.py new file mode 100755 index 0000000000000000000000000000000000000000..5c4c02b74c69e058e5476f95a41195af492d5dc2 --- /dev/null +++ b/dygraph/bert/optimization.py @@ -0,0 +1,170 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
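+# Schedule note for LinearDecay below: while step < warmup_steps the learning
+# rate grows linearly as base_lr * step / warmup_steps; afterwards it follows
+# polynomial decay,
+#     lr = (base_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr,
+# which is a plain linear decay for the default power = 1.0.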
+"""Optimization and learning rate scheduling.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np +import paddle.fluid as fluid + +from paddle.fluid.dygraph.learning_rate_scheduler import LearningRateDecay + + +class ConstantLR(LearningRateDecay): + def __init__(self, learning_rate, begin=0, step=1, dtype='float32'): + super(ConstantLR, self).__init__(begin, step, dtype) + self.learning_rate = learning_rate + + def step(self): + return self.learning_rate + + +class LinearDecay(LearningRateDecay): + def __init__(self, + learning_rate, + warmup_steps, + decay_steps, + end_learning_rate=0.0001, + power=1.0, + cycle=False, + begin=0, + step=1, + dtype='float32'): + super(LinearDecay, self).__init__(begin, step, dtype) + self.learning_rate = learning_rate + self.warmup_steps = warmup_steps + self.decay_steps = decay_steps + self.end_learning_rate = end_learning_rate + self.power = power + self.cycle = cycle + + def step(self): + if self.step_num < self.warmup_steps: + decayed_lr = self.learning_rate * (self.step_num / + self.warmup_steps) + decayed_lr = self.create_lr_var(decayed_lr) + else: + tmp_step_num = self.step_num + tmp_decay_steps = self.decay_steps + if self.cycle: + div_res = fluid.layers.ceil( + self.create_lr_var(tmp_step_num / float(self.decay_steps))) + if tmp_step_num == 0: + div_res = self.create_lr_var(1.0) + tmp_decay_steps = self.decay_steps * div_res + else: + tmp_step_num = self.create_lr_var( + tmp_step_num + if tmp_step_num < self.decay_steps else self.decay_steps) + decayed_lr = (self.learning_rate - self.end_learning_rate) * \ + ((1 - tmp_step_num / tmp_decay_steps) ** self.power) + self.end_learning_rate + + return decayed_lr + + +class Optimizer(object): + def __init__(self, + warmup_steps, + num_train_steps, + learning_rate, + model_cls, + weight_decay, + scheduler='linear_warmup_decay', + loss_scaling=1.0, + parameter_list=None): + self.warmup_steps = warmup_steps + self.num_train_steps = num_train_steps + self.learning_rate = learning_rate + self.model_cls = model_cls + self.weight_decay = weight_decay + self.scheduler = scheduler + self.loss_scaling = loss_scaling + self.parameter_list = parameter_list + + self.scheduled_lr = 0.0 + self.optimizer = self.lr_schedule() + + def lr_schedule(self): + if self.warmup_steps > 0: + if self.scheduler == 'noam_decay': + self.scheduled_lr = fluid.dygraph.NoamDecay(1 / ( + self.warmup_steps * (self.learning_rate**2)), + self.warmup_steps) + elif self.scheduler == 'linear_warmup_decay': + self.scheduled_lr = LinearDecay(self.learning_rate, + self.warmup_steps, + self.num_train_steps, 0.0) + else: + raise ValueError("Unkown learning rate scheduler, should be " + "'noam_decay' or 'linear_warmup_decay'") + optimizer = fluid.optimizer.Adam( + learning_rate=self.scheduled_lr, + parameter_list=self.parameter_list) + else: + self.scheduled_lr = ConstantLR(self.learning_rate) + optimizer = fluid.optimizer.Adam( + learning_rate=self.scheduled_lr, + parameter_list=self.parameter_list) + + return optimizer + + def exclude_from_weight_decay(self, name): + if name.find("layer_norm") > -1: + return True + bias_suffix = ["_bias", "_b", ".b_0"] + for suffix in bias_suffix: + if name.endswith(suffix): + return True + return False + + def optimization(self, loss, use_data_parallel=False, model=None): + param_list = dict() + + clip_norm_thres = 1.0 + #grad_clip = fluid.dygraph_grad_clip.GradClipByGlobalNorm(clip_norm_thres) + + if use_data_parallel: + loss = 
model.scale_loss(loss) + + loss.backward() + + if self.weight_decay > 0: + for param in self.model_cls.parameters(): + param_list[param.name] = param * 1.0 + param_list[param.name].stop_gradient = True + + if use_data_parallel: + assert model is not None + model.apply_collective_grads() + + #_, param_grads = self.optimizer.minimize(loss, grad_clip=grad_clip) + _, param_grads = self.optimizer.minimize(loss) + + if self.weight_decay > 0: + for param, grad in param_grads: + if self.exclude_from_weight_decay(param.name): + continue + if isinstance(self.scheduled_lr.step(), float): + updated_param = param.numpy() - param_list[ + param.name].numpy( + ) * self.weight_decay * self.scheduled_lr.step() + else: + updated_param = param.numpy() - param_list[ + param.name].numpy( + ) * self.weight_decay * self.scheduled_lr.step().numpy() + updated_param_var = fluid.dygraph.to_variable(updated_param) + param = updated_param_var + #param = fluid.layers.reshape(x=updated_param_var, shape=list(updated_param_var.shape)) diff --git a/dygraph/bert/reader/__init__.py b/dygraph/bert/reader/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/dygraph/bert/reader/cls.py b/dygraph/bert/reader/cls.py new file mode 100644 index 0000000000000000000000000000000000000000..60bd5505066827e457424345ea11b2758680b03d --- /dev/null +++ b/dygraph/bert/reader/cls.py @@ -0,0 +1,552 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
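+# This module implements the GLUE-style classification readers (XNLI, MNLI,
+# MRPC and CoLA): each processor turns TSV rows into InputExample objects and
+# data_generator pads them into batches through batching.prepare_batch_data.
+#
+# Minimal usage sketch (the paths, task and sizes are illustrative assumptions,
+# not values prescribed by this file):
+#
+#     processor = MnliProcessor(
+#         data_dir="data/MNLI", vocab_path="vocab.txt",
+#         max_seq_len=128, do_lower_case=True, in_tokens=False)
+#     train_gen = processor.data_generator(
+#         batch_size=32, phase='train', epoch=1, shuffle=True)
+#     for batch in train_gen():
+#         pass  # padded batch arrays from prepare_batch_data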
+ +import io +import os +import types +import csv +import numpy as np +import tokenization +from batching import prepare_batch_data + + +class DataProcessor(object): + """Base class for data converters for sequence classification data sets.""" + + def __init__(self, + data_dir, + vocab_path, + max_seq_len, + do_lower_case, + in_tokens, + random_seed=None): + self.data_dir = data_dir + self.max_seq_len = max_seq_len + self.tokenizer = tokenization.FullTokenizer( + vocab_file=vocab_path, do_lower_case=do_lower_case) + self.vocab = self.tokenizer.vocab + self.in_tokens = in_tokens + + np.random.seed(random_seed) + + self.current_train_example = -1 + self.num_examples = {'train': -1, 'dev': -1, 'test': -1} + self.current_train_epoch = -1 + + def get_train_examples(self, data_dir): + """Gets a collection of `InputExample`s for the train set.""" + raise NotImplementedError() + + def get_dev_examples(self, data_dir): + """Gets a collection of `InputExample`s for the dev set.""" + raise NotImplementedError() + + def get_test_examples(self, data_dir): + """Gets a collection of `InputExample`s for prediction.""" + raise NotImplementedError() + + def get_labels(self): + """Gets the list of labels for this data set.""" + raise NotImplementedError() + + def convert_example(self, index, example, labels, max_seq_len, tokenizer): + """Converts a single `InputExample` into a single `InputFeatures`.""" + feature = convert_single_example(index, example, labels, max_seq_len, + tokenizer) + return feature + + def generate_instance(self, feature): + """ + generate instance with given feature + + Args: + feature: InputFeatures(object). A single set of features of data. + """ + input_pos = list(range(len(feature.input_ids))) + return [ + feature.input_ids, feature.segment_ids, input_pos, feature.label_id + ] + + def generate_batch_data(self, + batch_data, + total_token_num, + voc_size=-1, + mask_id=-1, + return_input_mask=True, + return_max_len=False, + return_num_token=False): + return prepare_batch_data( + batch_data, + total_token_num, + voc_size=-1, + pad_id=self.vocab["[PAD]"], + cls_id=self.vocab["[CLS]"], + sep_id=self.vocab["[SEP]"], + mask_id=-1, + return_input_mask=True, + return_max_len=False, + return_num_token=False) + + @classmethod + def _read_tsv(cls, input_file, quotechar=None): + """Reads a tab separated value file.""" + with io.open(input_file, "r", encoding="utf8") as f: + reader = csv.reader(f, delimiter="\t", quotechar=quotechar) + lines = [] + for line in reader: + lines.append(line) + return lines + + def get_num_examples(self, phase): + """Get number of examples for train, dev or test.""" + if phase not in ['train', 'dev', 'test']: + raise ValueError( + "Unknown phase, which should be in ['train', 'dev', 'test'].") + return self.num_examples[phase] + + def get_train_progress(self): + """Gets progress for training phase.""" + return self.current_train_example, self.current_train_epoch + + def data_generator(self, + batch_size, + phase='train', + epoch=1, + dev_count=1, + shuffle=True, + shuffle_seed=None): + """ + Generate data for train, dev or test. + + Args: + batch_size: int. The batch size of generated data. + phase: string. The phase for which to generate data. + epoch: int. Total epoches to generate data. + shuffle: bool. Whether to shuffle examples. 
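+                dev_count: int. The number of devices; batches are buffered and
+                    yielded in groups of `dev_count`.
+                shuffle_seed: int. The seed used by numpy when shuffling examples.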
+ """ + if phase == 'train': + examples = self.get_train_examples(self.data_dir) + self.num_examples['train'] = len(examples) + elif phase == 'dev': + examples = self.get_dev_examples(self.data_dir) + self.num_examples['dev'] = len(examples) + elif phase == 'test': + examples = self.get_test_examples(self.data_dir) + self.num_examples['test'] = len(examples) + else: + raise ValueError( + "Unknown phase, which should be in ['train', 'dev', 'test'].") + + def instance_reader(): + for epoch_index in range(epoch): + if shuffle: + if shuffle_seed is not None: + np.random.seed(shuffle_seed) + np.random.shuffle(examples) + if phase == 'train': + self.current_train_epoch = epoch_index + for (index, example) in enumerate(examples): + if phase == 'train': + self.current_train_example = index + 1 + feature = self.convert_example( + index, example, + self.get_labels(), self.max_seq_len, self.tokenizer) + + instance = self.generate_instance(feature) + yield instance + + def batch_reader(reader, batch_size, in_tokens): + batch, total_token_num, max_len = [], 0, 0 + for instance in reader(): + token_ids, sent_ids, pos_ids, label = instance[:4] + max_len = max(max_len, len(token_ids)) + if in_tokens: + to_append = (len(batch) + 1) * max_len <= batch_size + else: + to_append = len(batch) < batch_size + if to_append: + batch.append(instance) + total_token_num += len(token_ids) + else: + yield batch, total_token_num + batch, total_token_num, max_len = [instance], len( + token_ids), len(token_ids) + + if len(batch) > 0: + yield batch, total_token_num + + def wrapper(): + all_dev_batches = [] + for batch_data, total_token_num in batch_reader( + instance_reader, batch_size, self.in_tokens): + batch_data = self.generate_batch_data( + batch_data, + total_token_num, + voc_size=-1, + mask_id=-1, + return_input_mask=True, + return_max_len=False, + return_num_token=False) + if len(all_dev_batches) < dev_count: + all_dev_batches.append(batch_data) + + if len(all_dev_batches) == dev_count: + for batch in all_dev_batches: + yield batch + all_dev_batches = [] + + return wrapper + + +class InputExample(object): + """A single training/test example for simple sequence classification.""" + + def __init__(self, guid, text_a, text_b=None, label=None): + """Constructs a InputExample. + + Args: + guid: Unique id for the example. + text_a: string. The untokenized text of the first sequence. For single + sequence tasks, only this sequence must be specified. + text_b: (Optional) string. The untokenized text of the second sequence. + Only must be specified for sequence pair tasks. + label: (Optional) string. The label of the example. This should be + specified for train and dev examples, but not for test examples. + """ + self.guid = guid + self.text_a = text_a + self.text_b = text_b + self.label = label + + +def _truncate_seq_pair(tokens_a, tokens_b, max_length): + """Truncates a sequence pair in place to the maximum length.""" + + # This is a simple heuristic which will always truncate the longer sequence + # one token at a time. This makes more sense than truncating an equal percent + # of tokens from each, since if one sequence is very short then each token + # that's truncated likely contains more information than a longer sequence. 
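+    # Illustrative walk-through (made-up sizes): with max_length=6,
+    # len(tokens_a)=6 and len(tokens_b)=2, the loop pops two tokens from
+    # tokens_a, stopping once 4 + 2 <= 6.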
+ while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_length: + break + if len(tokens_a) > len(tokens_b): + tokens_a.pop() + else: + tokens_b.pop() + + +class InputFeatures(object): + """A single set of features of data.""" + + def __init__(self, input_ids, input_mask, segment_ids, label_id): + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids + self.label_id = label_id + + +class XnliProcessor(DataProcessor): + """Processor for the XNLI data set.""" + + def get_train_examples(self, data_dir): + """See base class.""" + self.language = "zh" + lines = self._read_tsv( + os.path.join(data_dir, "multinli", "multinli.train.%s.tsv" % + self.language)) + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "train-%d" % (i) + text_a = tokenization.convert_to_unicode(line[0]) + text_b = tokenization.convert_to_unicode(line[1]) + label = tokenization.convert_to_unicode(line[2]) + if label == tokenization.convert_to_unicode("contradictory"): + label = tokenization.convert_to_unicode("contradiction") + examples.append( + InputExample( + guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + def get_dev_examples(self, data_dir): + """See base class.""" + self.language = "zh" + lines = self._read_tsv(os.path.join(data_dir, "xnli.dev.tsv")) + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "dev-%d" % (i) + language = tokenization.convert_to_unicode(line[0]) + if language != tokenization.convert_to_unicode(self.language): + continue + text_a = tokenization.convert_to_unicode(line[6]) + text_b = tokenization.convert_to_unicode(line[7]) + label = tokenization.convert_to_unicode(line[1]) + examples.append( + InputExample( + guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + def get_test_examples(self, data_dir): + """See base class.""" + self.language = "zh" + lines = self._read_tsv(os.path.join(data_dir, "xnli.test.tsv")) + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "test-%d" % (i) + language = tokenization.convert_to_unicode(line[0]) + if language != tokenization.convert_to_unicode(self.language): + continue + text_a = tokenization.convert_to_unicode(line[6]) + text_b = tokenization.convert_to_unicode(line[7]) + label = tokenization.convert_to_unicode(line[1]) + examples.append( + InputExample( + guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + def get_labels(self): + """See base class.""" + return ["contradiction", "entailment", "neutral"] + + +class MnliProcessor(DataProcessor): + """Processor for the MultiNLI data set (GLUE version).""" + + def get_train_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") + + def get_dev_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "dev_matched.tsv")), + "dev_matched") + + def get_test_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "test_matched.tsv")), "test") + + def get_labels(self): + """See base class.""" + return ["contradiction", "entailment", "neutral"] + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "%s-%s" % (set_type, + 
tokenization.convert_to_unicode(line[0])) + text_a = tokenization.convert_to_unicode(line[8]) + text_b = tokenization.convert_to_unicode(line[9]) + if set_type == "test": + label = "contradiction" + else: + label = tokenization.convert_to_unicode(line[-1]) + examples.append( + InputExample( + guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + +class MrpcProcessor(DataProcessor): + """Processor for the MRPC data set (GLUE version).""" + + def get_train_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") + + def get_dev_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") + + def get_test_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "test.tsv")), "test") + + def get_labels(self): + """See base class.""" + return ["0", "1"] + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + if i == 0: + continue + guid = "%s-%s" % (set_type, i) + text_a = tokenization.convert_to_unicode(line[3]) + text_b = tokenization.convert_to_unicode(line[4]) + if set_type == "test": + label = "0" + else: + label = tokenization.convert_to_unicode(line[0]) + examples.append( + InputExample( + guid=guid, text_a=text_a, text_b=text_b, label=label)) + return examples + + +class ColaProcessor(DataProcessor): + """Processor for the CoLA data set (GLUE version).""" + + def get_train_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "train.tsv")), "train") + + def get_dev_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev") + + def get_test_examples(self, data_dir): + """See base class.""" + return self._create_examples( + self._read_tsv(os.path.join(data_dir, "test.tsv")), "test") + + def get_labels(self): + """See base class.""" + return ["0", "1"] + + def _create_examples(self, lines, set_type): + """Creates examples for the training and dev sets.""" + examples = [] + for (i, line) in enumerate(lines): + # Only the test set has a header + if set_type == "test" and i == 0: + continue + guid = "%s-%s" % (set_type, i) + if set_type == "test": + text_a = tokenization.convert_to_unicode(line[1]) + label = "0" + else: + text_a = tokenization.convert_to_unicode(line[3]) + label = tokenization.convert_to_unicode(line[1]) + examples.append( + InputExample( + guid=guid, text_a=text_a, text_b=None, label=label)) + return examples + + +def convert_single_example_to_unicode(guid, single_example): + text_a = tokenization.convert_to_unicode(single_example[0]) + text_b = tokenization.convert_to_unicode(single_example[1]) + label = tokenization.convert_to_unicode(single_example[2]) + return InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label) + + +def convert_single_example(ex_index, example, label_list, max_seq_length, + tokenizer): + """Converts a single `InputExample` into a single `InputFeatures`.""" + label_map = {} + for (i, label) in enumerate(label_list): + label_map[label] = i + + tokens_a = tokenizer.tokenize(example.text_a) + tokens_b = None + if example.text_b: + tokens_b = tokenizer.tokenize(example.text_b) + + if tokens_b: + # Modifies `tokens_a` and `tokens_b` in place 
so that the total + # length is less than the specified length. + # Account for [CLS], [SEP], [SEP] with "- 3" + _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) + else: + # Account for [CLS] and [SEP] with "- 2" + if len(tokens_a) > max_seq_length - 2: + tokens_a = tokens_a[0:(max_seq_length - 2)] + + # The convention in BERT is: + # (a) For sequence pairs: + # tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP] + # type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 + # (b) For single sequences: + # tokens: [CLS] the dog is hairy . [SEP] + # type_ids: 0 0 0 0 0 0 0 + # + # Where "type_ids" are used to indicate whether this is the first + # sequence or the second sequence. The embedding vectors for `type=0` and + # `type=1` were learned during pre-training and are added to the wordpiece + # embedding vector (and position vector). This is not *strictly* necessary + # since the [SEP] token unambiguously separates the sequences, but it makes + # it easier for the model to learn the concept of sequences. + # + # For classification tasks, the first vector (corresponding to [CLS]) is + # used as as the "sentence vector". Note that this only makes sense because + # the entire model is fine-tuned. + tokens = [] + segment_ids = [] + tokens.append("[CLS]") + segment_ids.append(0) + for token in tokens_a: + tokens.append(token) + segment_ids.append(0) + tokens.append("[SEP]") + segment_ids.append(0) + + if tokens_b: + for token in tokens_b: + tokens.append(token) + segment_ids.append(1) + tokens.append("[SEP]") + segment_ids.append(1) + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask = [1] * len(input_ids) + + label_id = label_map[example.label] + + feature = InputFeatures( + input_ids=input_ids, + input_mask=input_mask, + segment_ids=segment_ids, + label_id=label_id) + return feature + + +def convert_examples_to_features(examples, label_list, max_seq_length, + tokenizer): + """Convert a set of `InputExample`s to a list of `InputFeatures`.""" + + features = [] + for (ex_index, example) in enumerate(examples): + if ex_index % 10000 == 0: + print("Writing example %d of %d" % (ex_index, len(examples))) + + feature = convert_single_example(ex_index, example, label_list, + max_seq_length, tokenizer) + + features.append(feature) + return features + + +if __name__ == '__main__': + pass diff --git a/dygraph/bert/reader/pretraining.py b/dygraph/bert/reader/pretraining.py new file mode 100644 index 0000000000000000000000000000000000000000..c21a43d33caedd9a01c02dacbedd01a16e1eec9f --- /dev/null +++ b/dygraph/bert/reader/pretraining.py @@ -0,0 +1,289 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
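+# This reader streams gzipped pre-training shards in which every line has the
+# form "token_ids;sent_ids;pos_ids;label", and can mix randomly re-paired
+# negative next-sentence samples into the stream (generate_neg_sample=True).
+#
+# Minimal usage sketch (the paths and voc_size are illustrative assumptions):
+#
+#     reader = DataReader(
+#         data_dir="./pretrain_data", vocab_path="vocab.txt",
+#         batch_size=4096, in_tokens=True, max_seq_len=512,
+#         voc_size=21128, generate_neg_sample=True)
+#     for batch in reader.data_generator()():
+#         pass  # padded batch arrays from batching.prepare_batch_data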
+ +from __future__ import print_function +from __future__ import division + +import os +import numpy as np +import types +import gzip +import logging +import re +import six +import collections +import tokenization + +import paddle +import paddle.fluid as fluid + +from batching import prepare_batch_data + + +class DataReader(object): + def __init__(self, + data_dir, + vocab_path, + batch_size=4096, + in_tokens=True, + max_seq_len=512, + shuffle_files=True, + epoch=100, + voc_size=0, + is_test=False, + generate_neg_sample=False): + + self.vocab = self.load_vocab(vocab_path) + self.data_dir = data_dir + self.batch_size = batch_size + self.in_tokens = in_tokens + self.shuffle_files = shuffle_files + self.epoch = epoch + self.current_epoch = 0 + self.current_file_index = 0 + self.total_file = 0 + self.current_file = None + self.voc_size = voc_size + self.max_seq_len = max_seq_len + self.pad_id = self.vocab["[PAD]"] + self.cls_id = self.vocab["[CLS]"] + self.sep_id = self.vocab["[SEP]"] + self.mask_id = self.vocab["[MASK]"] + self.is_test = is_test + self.generate_neg_sample = generate_neg_sample + if self.in_tokens: + assert self.batch_size >= self.max_seq_len, "The number of " \ + "tokens in batch should not be smaller than max seq length." + + if self.is_test: + self.epoch = 1 + self.shuffle_files = False + + def get_progress(self): + """return current progress of traning data + """ + return self.current_epoch, self.current_file_index, self.total_file, self.current_file + + def parse_line(self, line, max_seq_len=512): + """ parse one line to token_ids, sentence_ids, pos_ids, label + """ + line = line.strip().decode().split(";") + assert len(line) == 4, "One sample must have 4 fields!" + (token_ids, sent_ids, pos_ids, label) = line + token_ids = [int(token) for token in token_ids.split(" ")] + sent_ids = [int(token) for token in sent_ids.split(" ")] + pos_ids = [int(token) for token in pos_ids.split(" ")] + assert len(token_ids) == len(sent_ids) == len( + pos_ids + ), "[Must be true]len(token_ids) == len(sent_ids) == len(pos_ids)" + label = int(label) + if len(token_ids) > max_seq_len: + return None + return [token_ids, sent_ids, pos_ids, label] + + def read_file(self, file): + assert file.endswith('.gz'), "[ERROR] %s is not a gzip file" % file + file_path = self.data_dir + "/" + file + with gzip.open(file_path, "rb") as f: + for line in f: + parsed_line = self.parse_line( + line, max_seq_len=self.max_seq_len) + if parsed_line is None: + continue + yield parsed_line + + def convert_to_unicode(self, text): + """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" + if six.PY3: + if isinstance(text, str): + return text + elif isinstance(text, bytes): + return text.decode("utf-8", "ignore") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + elif six.PY2: + if isinstance(text, str): + return text.decode("utf-8", "ignore") + elif isinstance(text, unicode): + return text + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + else: + raise ValueError("Not running on Python2 or Python 3?") + + def load_vocab(self, vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = collections.OrderedDict() + fin = open(vocab_file) + for num, line in enumerate(fin): + items = self.convert_to_unicode(line.strip()).split("\t") + if len(items) > 2: + break + token = items[0] + index = items[1] if len(items) == 2 else num + token = token.strip() + vocab[token] = int(index) + return vocab + + def random_pair_neg_samples(self, 
pos_samples): + """ randomly generate negtive samples using pos_samples + + Args: + pos_samples: list of positive samples + + Returns: + neg_samples: list of negtive samples + """ + np.random.shuffle(pos_samples) + num_sample = len(pos_samples) + neg_samples = [] + miss_num = 0 + + for i in range(num_sample): + pair_index = (i + 1) % num_sample + origin_src_ids = pos_samples[i][0] + origin_sep_index = origin_src_ids.index(2) + pair_src_ids = pos_samples[pair_index][0] + pair_sep_index = pair_src_ids.index(2) + + src_ids = origin_src_ids[:origin_sep_index + 1] + pair_src_ids[ + pair_sep_index + 1:] + if len(src_ids) >= self.max_seq_len: + miss_num += 1 + continue + sent_ids = [0] * len(origin_src_ids[:origin_sep_index + 1]) + [ + 1 + ] * len(pair_src_ids[pair_sep_index + 1:]) + pos_ids = list(range(len(src_ids))) + neg_sample = [src_ids, sent_ids, pos_ids, 0] + assert len(src_ids) == len(sent_ids) == len( + pos_ids + ), "[ERROR]len(src_id) == lne(sent_id) == len(pos_id) must be True" + neg_samples.append(neg_sample) + return neg_samples, miss_num + + def mixin_negtive_samples(self, pos_sample_generator, buffer=1000): + """ 1. generate negtive samples by randomly group sentence_1 and sentence_2 of positive samples + 2. combine negtive samples and positive samples + + Args: + pos_sample_generator: a generator producing a parsed positive sample, which is a list: [token_ids, sent_ids, pos_ids, 1] + + Returns: + sample: one sample from shuffled positive samples and negtive samples + """ + pos_samples = [] + num_total_miss = 0 + pos_sample_num = 0 + try: + while True: + while len(pos_samples) < buffer: + pos_sample = next(pos_sample_generator) + label = pos_sample[3] + assert label == 1, "positive sample's label must be 1" + pos_samples.append(pos_sample) + pos_sample_num += 1 + + neg_samples, miss_num = self.random_pair_neg_samples( + pos_samples) + num_total_miss += miss_num + samples = pos_samples + neg_samples + pos_samples = [] + np.random.shuffle(samples) + for sample in samples: + yield sample + except StopIteration: + print("stopiteration: reach end of file") + if len(pos_samples) == 1: + yield pos_samples[0] + elif len(pos_samples) == 0: + yield None + else: + neg_samples, miss_num = self.random_pair_neg_samples( + pos_samples) + num_total_miss += miss_num + samples = pos_samples + neg_samples + pos_samples = [] + np.random.shuffle(samples) + for sample in samples: + yield sample + print("miss_num:%d\tideal_total_sample_num:%d\tmiss_rate:%f" % + (num_total_miss, pos_sample_num * 2, + num_total_miss / (pos_sample_num * 2))) + + def data_generator(self): + """ + data_generator + """ + files = os.listdir(self.data_dir) + self.total_file = len(files) + assert self.total_file > 0, "[Error] data_dir is empty" + + def wrapper(): + def reader(): + for epoch in range(self.epoch): + self.current_epoch = epoch + 1 + if self.shuffle_files: + np.random.shuffle(files) + for index, file in enumerate(files): + self.current_file_index = index + 1 + self.current_file = file + sample_generator = self.read_file(file) + if not self.is_test and self.generate_neg_sample: + sample_generator = self.mixin_negtive_samples( + sample_generator) + for sample in sample_generator: + if sample is None: + continue + yield sample + + def batch_reader(reader, batch_size, in_tokens): + batch, total_token_num, max_len = [], 0, 0 + for parsed_line in reader(): + token_ids, sent_ids, pos_ids, label = parsed_line + max_len = max(max_len, len(token_ids)) + if in_tokens: + to_append = (len(batch) + 1) * max_len <= batch_size + 
else: + to_append = len(batch) < batch_size + if to_append: + batch.append(parsed_line) + total_token_num += len(token_ids) + else: + yield batch, total_token_num + batch, total_token_num, max_len = [parsed_line], len( + token_ids), len(token_ids) + + if len(batch) > 0: + yield batch, total_token_num + + for batch_data, total_token_num in batch_reader( + reader, self.batch_size, self.in_tokens): + yield prepare_batch_data( + batch_data, + total_token_num, + voc_size=self.voc_size, + pad_id=self.pad_id, + cls_id=self.cls_id, + sep_id=self.sep_id, + mask_id=self.mask_id, + return_input_mask=True, + return_max_len=False, + return_num_token=False) + + return wrapper + + +if __name__ == "__main__": + pass diff --git a/dygraph/bert/reader/squad.py b/dygraph/bert/reader/squad.py new file mode 100644 index 0000000000000000000000000000000000000000..79c2ca97def6d2b49b6e1207d3d21150c2dd6771 --- /dev/null +++ b/dygraph/bert/reader/squad.py @@ -0,0 +1,933 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Run BERT on SQuAD 1.1 and SQuAD 2.0.""" + +import six +import math +import json +import random +import collections +import tokenization +from batching import prepare_batch_data + + +class SquadExample(object): + """A single training/test example for simple sequence classification. + + For examples without an answer, the start and end position are -1. 
+ """ + + def __init__(self, + qas_id, + question_text, + doc_tokens, + orig_answer_text=None, + start_position=None, + end_position=None, + is_impossible=False): + self.qas_id = qas_id + self.question_text = question_text + self.doc_tokens = doc_tokens + self.orig_answer_text = orig_answer_text + self.start_position = start_position + self.end_position = end_position + self.is_impossible = is_impossible + + def __str__(self): + return self.__repr__() + + def __repr__(self): + s = "" + s += "qas_id: %s" % (tokenization.printable_text(self.qas_id)) + s += ", question_text: %s" % ( + tokenization.printable_text(self.question_text)) + s += ", doc_tokens: [%s]" % (" ".join(self.doc_tokens)) + if self.start_position: + s += ", start_position: %d" % (self.start_position) + if self.start_position: + s += ", end_position: %d" % (self.end_position) + if self.start_position: + s += ", is_impossible: %r" % (self.is_impossible) + return s + + +class InputFeatures(object): + """A single set of features of data.""" + + def __init__(self, + unique_id, + example_index, + doc_span_index, + tokens, + token_to_orig_map, + token_is_max_context, + input_ids, + input_mask, + segment_ids, + start_position=None, + end_position=None, + is_impossible=None): + self.unique_id = unique_id + self.example_index = example_index + self.doc_span_index = doc_span_index + self.tokens = tokens + self.token_to_orig_map = token_to_orig_map + self.token_is_max_context = token_is_max_context + self.input_ids = input_ids + self.input_mask = input_mask + self.segment_ids = segment_ids + self.start_position = start_position + self.end_position = end_position + self.is_impossible = is_impossible + + +def read_squad_examples(input_file, is_training, version_2_with_negative=False): + """Read a SQuAD json file into a list of SquadExample.""" + with open(input_file, "r") as reader: + input_data = json.load(reader)["data"] + + def is_whitespace(c): + if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F: + return True + return False + + examples = [] + for entry in input_data: + for paragraph in entry["paragraphs"]: + paragraph_text = paragraph["context"] + doc_tokens = [] + char_to_word_offset = [] + prev_is_whitespace = True + for c in paragraph_text: + if is_whitespace(c): + prev_is_whitespace = True + else: + if prev_is_whitespace: + doc_tokens.append(c) + else: + doc_tokens[-1] += c + prev_is_whitespace = False + char_to_word_offset.append(len(doc_tokens) - 1) + + for qa in paragraph["qas"]: + qas_id = qa["id"] + question_text = qa["question"] + start_position = None + end_position = None + orig_answer_text = None + is_impossible = False + if is_training: + + if version_2_with_negative: + is_impossible = qa["is_impossible"] + if (len(qa["answers"]) != 1) and (not is_impossible): + raise ValueError( + "For training, each question should have exactly 1 answer." + ) + if not is_impossible: + answer = qa["answers"][0] + orig_answer_text = answer["text"] + answer_offset = answer["answer_start"] + answer_length = len(orig_answer_text) + start_position = char_to_word_offset[answer_offset] + end_position = char_to_word_offset[answer_offset + + answer_length - 1] + # Only add answers where the text can be exactly recovered from the + # document. If this CAN'T happen it's likely due to weird Unicode + # stuff so we will just skip the example. + # + # Note that this means for training mode, every example is NOT + # guaranteed to be preserved. 
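+                    # Illustrative case: if the annotated answer text cannot be
+                    # found inside the whitespace-reconstructed span (for
+                    # instance because of Unicode normalization differences),
+                    # the example is logged and skipped below.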
+ actual_text = " ".join(doc_tokens[start_position:( + end_position + 1)]) + cleaned_answer_text = " ".join( + tokenization.whitespace_tokenize(orig_answer_text)) + if actual_text.find(cleaned_answer_text) == -1: + print("Could not find answer: '%s' vs. '%s'", + actual_text, cleaned_answer_text) + continue + else: + start_position = -1 + end_position = -1 + orig_answer_text = "" + + example = SquadExample( + qas_id=qas_id, + question_text=question_text, + doc_tokens=doc_tokens, + orig_answer_text=orig_answer_text, + start_position=start_position, + end_position=end_position, + is_impossible=is_impossible) + examples.append(example) + + return examples + + +def convert_examples_to_features( + examples, + tokenizer, + max_seq_length, + doc_stride, + max_query_length, + is_training, + #output_fn +): + """Loads a data file into a list of `InputBatch`s.""" + + unique_id = 1000000000 + + for (example_index, example) in enumerate(examples): + query_tokens = tokenizer.tokenize(example.question_text) + + if len(query_tokens) > max_query_length: + query_tokens = query_tokens[0:max_query_length] + + tok_to_orig_index = [] + orig_to_tok_index = [] + all_doc_tokens = [] + for (i, token) in enumerate(example.doc_tokens): + orig_to_tok_index.append(len(all_doc_tokens)) + sub_tokens = tokenizer.tokenize(token) + for sub_token in sub_tokens: + tok_to_orig_index.append(i) + all_doc_tokens.append(sub_token) + + tok_start_position = None + tok_end_position = None + if is_training and example.is_impossible: + tok_start_position = -1 + tok_end_position = -1 + if is_training and not example.is_impossible: + tok_start_position = orig_to_tok_index[example.start_position] + if example.end_position < len(example.doc_tokens) - 1: + tok_end_position = orig_to_tok_index[example.end_position + + 1] - 1 + else: + tok_end_position = len(all_doc_tokens) - 1 + (tok_start_position, tok_end_position) = _improve_answer_span( + all_doc_tokens, tok_start_position, tok_end_position, tokenizer, + example.orig_answer_text) + + # The -3 accounts for [CLS], [SEP] and [SEP] + max_tokens_for_doc = max_seq_length - len(query_tokens) - 3 + + # We can have documents that are longer than the maximum sequence length. + # To deal with this we do a sliding window approach, where we take chunks + # of the up to our max length with a stride of `doc_stride`. 
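+        # Concrete illustration (hypothetical numbers): with 1,000 document
+        # sub-tokens, max_tokens_for_doc=382 and doc_stride=128, the spans
+        # built below start at offsets 0, 128, 256, 384, 512 and 640, each up
+        # to 382 sub-tokens long, so most tokens land in several overlapping
+        # chunks.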
+ _DocSpan = collections.namedtuple( # pylint: disable=invalid-name + "DocSpan", ["start", "length"]) + doc_spans = [] + start_offset = 0 + while start_offset < len(all_doc_tokens): + length = len(all_doc_tokens) - start_offset + if length > max_tokens_for_doc: + length = max_tokens_for_doc + doc_spans.append(_DocSpan(start=start_offset, length=length)) + if start_offset + length == len(all_doc_tokens): + break + start_offset += min(length, doc_stride) + + for (doc_span_index, doc_span) in enumerate(doc_spans): + tokens = [] + token_to_orig_map = {} + token_is_max_context = {} + segment_ids = [] + tokens.append("[CLS]") + segment_ids.append(0) + for token in query_tokens: + tokens.append(token) + segment_ids.append(0) + tokens.append("[SEP]") + segment_ids.append(0) + + for i in range(doc_span.length): + split_token_index = doc_span.start + i + token_to_orig_map[len(tokens)] = tok_to_orig_index[ + split_token_index] + + is_max_context = _check_is_max_context( + doc_spans, doc_span_index, split_token_index) + token_is_max_context[len(tokens)] = is_max_context + tokens.append(all_doc_tokens[split_token_index]) + segment_ids.append(1) + tokens.append("[SEP]") + segment_ids.append(1) + + input_ids = tokenizer.convert_tokens_to_ids(tokens) + + # The mask has 1 for real tokens and 0 for padding tokens. Only real + # tokens are attended to. + input_mask = [1] * len(input_ids) + + # Zero-pad up to the sequence length. + #while len(input_ids) < max_seq_length: + # input_ids.append(0) + # input_mask.append(0) + # segment_ids.append(0) + + #assert len(input_ids) == max_seq_length + #assert len(input_mask) == max_seq_length + #assert len(segment_ids) == max_seq_length + + start_position = None + end_position = None + if is_training and not example.is_impossible: + # For training, if our document chunk does not contain an annotation + # we throw it out, since there is nothing to predict. 
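+                # (In this implementation the chunk is not literally discarded:
+                # when the answer lies outside the span, start_position and
+                # end_position are both set to 0, i.e. they point at [CLS].)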
+ doc_start = doc_span.start + doc_end = doc_span.start + doc_span.length - 1 + out_of_span = False + if not (tok_start_position >= doc_start and + tok_end_position <= doc_end): + out_of_span = True + if out_of_span: + start_position = 0 + end_position = 0 + else: + doc_offset = len(query_tokens) + 2 + start_position = tok_start_position - doc_start + doc_offset + end_position = tok_end_position - doc_start + doc_offset + + if is_training and example.is_impossible: + start_position = 0 + end_position = 0 + """ + if example_index < 3: + print("*** Example ***") + print("unique_id: %s" % (unique_id)) + print("example_index: %s" % (example_index)) + print("doc_span_index: %s" % (doc_span_index)) + print("tokens: %s" % " ".join( + [tokenization.printable_text(x) for x in tokens])) + print("token_to_orig_map: %s" % " ".join([ + "%d:%d" % (x, y) + for (x, y) in six.iteritems(token_to_orig_map) + ])) + print("token_is_max_context: %s" % " ".join([ + "%d:%s" % (x, y) + for (x, y) in six.iteritems(token_is_max_context) + ])) + print("input_ids: %s" % " ".join([str(x) for x in input_ids])) + print("input_mask: %s" % " ".join([str(x) for x in input_mask])) + print("segment_ids: %s" % + " ".join([str(x) for x in segment_ids])) + if is_training and example.is_impossible: + print("impossible example") + if is_training and not example.is_impossible: + answer_text = " ".join(tokens[start_position:(end_position + + 1)]) + print("start_position: %d" % (start_position)) + print("end_position: %d" % (end_position)) + print("answer: %s" % + (tokenization.printable_text(answer_text))) + """ + + feature = InputFeatures( + unique_id=unique_id, + example_index=example_index, + doc_span_index=doc_span_index, + tokens=tokens, + token_to_orig_map=token_to_orig_map, + token_is_max_context=token_is_max_context, + input_ids=input_ids, + input_mask=input_mask, + segment_ids=segment_ids, + start_position=start_position, + end_position=end_position, + is_impossible=example.is_impossible) + + unique_id += 1 + + yield feature + + +def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer, + orig_answer_text): + """Returns tokenized answer spans that better match the annotated answer.""" + + # The SQuAD annotations are character based. We first project them to + # whitespace-tokenized words. But then after WordPiece tokenization, we can + # often find a "better match". For example: + # + # Question: What year was John Smith born? + # Context: The leader was John Smith (1895-1943). + # Answer: 1895 + # + # The original whitespace-tokenized answer will be "(1895-1943).". However + # after tokenization, our tokens will be "( 1895 - 1943 ) .". So we can match + # the exact answer, 1895. + # + # However, this is not always possible. Consider the following: + # + # Question: What country is the top exporter of electornics? + # Context: The Japanese electronics industry is the lagest in the world. + # Answer: Japan + # + # In this case, the annotator chose "Japan" as a character sub-span of + # the word "Japanese". Since our WordPiece tokenizer does not split + # "Japanese", we just use "Japanese" as the annotation. This is fairly rare + # in SQuAD, but does happen. 
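+    # The nested loops below search every sub-span of [input_start, input_end]
+    # for an exact match with the re-tokenized answer text and fall back to the
+    # original span when no match is found.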
+ tok_answer_text = " ".join(tokenizer.tokenize(orig_answer_text)) + + for new_start in range(input_start, input_end + 1): + for new_end in range(input_end, new_start - 1, -1): + text_span = " ".join(doc_tokens[new_start:(new_end + 1)]) + if text_span == tok_answer_text: + return (new_start, new_end) + + return (input_start, input_end) + + +def _check_is_max_context(doc_spans, cur_span_index, position): + """Check if this is the 'max context' doc span for the token.""" + + # Because of the sliding window approach taken to scoring documents, a single + # token can appear in multiple documents. E.g. + # Doc: the man went to the store and bought a gallon of milk + # Span A: the man went to the + # Span B: to the store and bought + # Span C: and bought a gallon of + # ... + # + # Now the word 'bought' will have two scores from spans B and C. We only + # want to consider the score with "maximum context", which we define as + # the *minimum* of its left and right context (the *sum* of left and + # right context will always be the same, of course). + # + # In the example the maximum context for 'bought' would be span C since + # it has 1 left context and 3 right context, while span B has 4 left context + # and 0 right context. + best_score = None + best_span_index = None + for (span_index, doc_span) in enumerate(doc_spans): + end = doc_span.start + doc_span.length - 1 + if position < doc_span.start: + continue + if position > end: + continue + num_left_context = position - doc_span.start + num_right_context = end - position + score = min(num_left_context, + num_right_context) + 0.01 * doc_span.length + if best_score is None or score > best_score: + best_score = score + best_span_index = span_index + + return cur_span_index == best_span_index + + +class DataProcessor(object): + def __init__(self, vocab_path, do_lower_case, max_seq_length, in_tokens, + doc_stride, max_query_length): + self._tokenizer = tokenization.FullTokenizer( + vocab_file=vocab_path, do_lower_case=do_lower_case) + self._max_seq_length = max_seq_length + self._doc_stride = doc_stride + self._max_query_length = max_query_length + self._in_tokens = in_tokens + + self.vocab = self._tokenizer.vocab + self.vocab_size = len(self.vocab) + self.pad_id = self.vocab["[PAD]"] + self.cls_id = self.vocab["[CLS]"] + self.sep_id = self.vocab["[SEP]"] + self.mask_id = self.vocab["[MASK]"] + + self.current_train_example = -1 + self.num_train_examples = -1 + self.current_train_epoch = -1 + + self.train_examples = None + self.predict_examples = None + self.num_examples = {'train': -1, 'predict': -1} + + def get_train_progress(self): + """Gets progress for training phase.""" + return self.current_train_example, self.current_train_epoch + + def get_examples(self, + data_path, + is_training, + version_2_with_negative=False): + examples = read_squad_examples( + input_file=data_path, + is_training=is_training, + version_2_with_negative=version_2_with_negative) + return examples + + def get_num_examples(self, phase): + if phase not in ['train', 'predict']: + raise ValueError( + "Unknown phase, which should be in ['train', 'predict'].") + return self.num_examples[phase] + + def get_features(self, examples, is_training): + features = convert_examples_to_features( + examples=examples, + tokenizer=self._tokenizer, + max_seq_length=self._max_seq_length, + doc_stride=self._doc_stride, + max_query_length=self._max_query_length, + is_training=is_training) + return features + + def data_generator(self, + data_path, + batch_size, + phase='train', + shuffle=False, 
+ dev_count=1, + version_2_with_negative=False, + epoch=1): + if phase == 'train': + self.train_examples = self.get_examples( + data_path, + is_training=True, + version_2_with_negative=version_2_with_negative) + examples = self.train_examples + self.num_examples['train'] = len(self.train_examples) + elif phase == 'predict': + self.predict_examples = self.get_examples( + data_path, + is_training=False, + version_2_with_negative=version_2_with_negative) + examples = self.predict_examples + self.num_examples['predict'] = len(self.predict_examples) + else: + raise ValueError( + "Unknown phase, which should be in ['train', 'predict'].") + + def batch_reader(features, batch_size, in_tokens): + batch, total_token_num, max_len = [], 0, 0 + for (index, feature) in enumerate(features): + if phase == 'train': + self.current_train_example = index + 1 + seq_len = len(feature.input_ids) + labels = [feature.unique_id + ] if feature.start_position is None else [ + feature.start_position, feature.end_position + ] + example = [ + feature.input_ids, feature.segment_ids, range(seq_len) + ] + labels + max_len = max(max_len, seq_len) + + #max_len = max(max_len, len(token_ids)) + if in_tokens: + to_append = (len(batch) + 1) * max_len <= batch_size + else: + to_append = len(batch) < batch_size + + if to_append: + batch.append(example) + total_token_num += seq_len + else: + yield batch, total_token_num + batch, total_token_num, max_len = [example + ], seq_len, seq_len + if len(batch) > 0: + yield batch, total_token_num + + def wrapper(): + for epoch_index in range(epoch): + if shuffle: + random.shuffle(examples) + if phase == 'train': + self.current_train_epoch = epoch_index + features = self.get_features(examples, is_training=True) + else: + features = self.get_features(examples, is_training=False) + + all_dev_batches = [] + for batch_data, total_token_num in batch_reader( + features, batch_size, self._in_tokens): + batch_data = prepare_batch_data( + batch_data, + total_token_num, + voc_size=-1, + pad_id=self.pad_id, + cls_id=self.cls_id, + sep_id=self.sep_id, + mask_id=-1, + return_input_mask=True, + return_max_len=False, + return_num_token=False) + if len(all_dev_batches) < dev_count: + all_dev_batches.append(batch_data) + + if len(all_dev_batches) == dev_count: + for batch in all_dev_batches: + yield batch + all_dev_batches = [] + + return wrapper + + +def write_predictions(all_examples, all_features, all_results, n_best_size, + max_answer_length, do_lower_case, output_prediction_file, + output_nbest_file, output_null_log_odds_file, + version_2_with_negative, null_score_diff_threshold, + verbose): + """Write final predictions to the json file and log-odds of null if needed.""" + print("Writing predictions to: %s" % (output_prediction_file)) + print("Writing nbest to: %s" % (output_nbest_file)) + + example_index_to_features = collections.defaultdict(list) + for feature in all_features: + example_index_to_features[feature.example_index].append(feature) + + unique_id_to_result = {} + for result in all_results: + unique_id_to_result[result.unique_id] = result + + _PrelimPrediction = collections.namedtuple( # pylint: disable=invalid-name + "PrelimPrediction", [ + "feature_index", "start_index", "end_index", "start_logit", + "end_logit" + ]) + + all_predictions = collections.OrderedDict() + all_nbest_json = collections.OrderedDict() + scores_diff_json = collections.OrderedDict() + + for (example_index, example) in enumerate(all_examples): + features = example_index_to_features[example_index] + + prelim_predictions = 
[] + # keep track of the minimum score of null start+end of position 0 + score_null = 1000000 # large and positive + min_null_feature_index = 0 # the paragraph slice with min mull score + null_start_logit = 0 # the start logit at the slice with min null score + null_end_logit = 0 # the end logit at the slice with min null score + for (feature_index, feature) in enumerate(features): + result = unique_id_to_result[feature.unique_id] + start_indexes = _get_best_indexes(result.start_logits, n_best_size) + end_indexes = _get_best_indexes(result.end_logits, n_best_size) + # if we could have irrelevant answers, get the min score of irrelevant + if version_2_with_negative: + feature_null_score = result.start_logits[0] + result.end_logits[ + 0] + if feature_null_score < score_null: + score_null = feature_null_score + min_null_feature_index = feature_index + null_start_logit = result.start_logits[0] + null_end_logit = result.end_logits[0] + for start_index in start_indexes: + for end_index in end_indexes: + # We could hypothetically create invalid predictions, e.g., predict + # that the start of the span is in the question. We throw out all + # invalid predictions. + if start_index >= len(feature.tokens): + continue + if end_index >= len(feature.tokens): + continue + if start_index not in feature.token_to_orig_map: + continue + if end_index not in feature.token_to_orig_map: + continue + if not feature.token_is_max_context.get(start_index, False): + continue + if end_index < start_index: + continue + length = end_index - start_index + 1 + if length > max_answer_length: + continue + prelim_predictions.append( + _PrelimPrediction( + feature_index=feature_index, + start_index=start_index, + end_index=end_index, + start_logit=result.start_logits[start_index], + end_logit=result.end_logits[end_index])) + + if version_2_with_negative: + prelim_predictions.append( + _PrelimPrediction( + feature_index=min_null_feature_index, + start_index=0, + end_index=0, + start_logit=null_start_logit, + end_logit=null_end_logit)) + prelim_predictions = sorted( + prelim_predictions, + key=lambda x: (x.start_logit + x.end_logit), + reverse=True) + + _NbestPrediction = collections.namedtuple( # pylint: disable=invalid-name + "NbestPrediction", ["text", "start_logit", "end_logit"]) + + seen_predictions = {} + nbest = [] + for pred in prelim_predictions: + if len(nbest) >= n_best_size: + break + feature = features[pred.feature_index] + if pred.start_index > 0: # this is a non-null prediction + tok_tokens = feature.tokens[pred.start_index:(pred.end_index + 1 + )] + orig_doc_start = feature.token_to_orig_map[pred.start_index] + orig_doc_end = feature.token_to_orig_map[pred.end_index] + orig_tokens = example.doc_tokens[orig_doc_start:(orig_doc_end + + 1)] + tok_text = " ".join(tok_tokens) + + # De-tokenize WordPieces that have been split off. 
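+            # e.g. the joined text "john ##son smith" becomes "johnson smith"
+            # after the two replacements below.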
+ tok_text = tok_text.replace(" ##", "") + tok_text = tok_text.replace("##", "") + + # Clean whitespace + tok_text = tok_text.strip() + tok_text = " ".join(tok_text.split()) + orig_text = " ".join(orig_tokens) + + final_text = get_final_text(tok_text, orig_text, do_lower_case, + verbose) + if final_text in seen_predictions: + continue + + seen_predictions[final_text] = True + else: + final_text = "" + seen_predictions[final_text] = True + + nbest.append( + _NbestPrediction( + text=final_text, + start_logit=pred.start_logit, + end_logit=pred.end_logit)) + + # if we didn't inlude the empty option in the n-best, inlcude it + if version_2_with_negative: + if "" not in seen_predictions: + nbest.append( + _NbestPrediction( + text="", + start_logit=null_start_logit, + end_logit=null_end_logit)) + # In very rare edge cases we could have no valid predictions. So we + # just create a nonce prediction in this case to avoid failure. + if not nbest: + nbest.append( + _NbestPrediction( + text="empty", start_logit=0.0, end_logit=0.0)) + + assert len(nbest) >= 1 + + total_scores = [] + best_non_null_entry = None + for entry in nbest: + total_scores.append(entry.start_logit + entry.end_logit) + if not best_non_null_entry: + if entry.text: + best_non_null_entry = entry + # debug + if best_non_null_entry is None: + print("Emmm..., sth wrong") + + probs = _compute_softmax(total_scores) + + nbest_json = [] + for (i, entry) in enumerate(nbest): + output = collections.OrderedDict() + output["text"] = entry.text + output["probability"] = probs[i] + output["start_logit"] = entry.start_logit + output["end_logit"] = entry.end_logit + nbest_json.append(output) + + assert len(nbest_json) >= 1 + + if not version_2_with_negative: + all_predictions[example.qas_id] = nbest_json[0]["text"] + else: + # predict "" iff the null score - the score of best non-null > threshold + score_diff = score_null - best_non_null_entry.start_logit - ( + best_non_null_entry.end_logit) + scores_diff_json[example.qas_id] = score_diff + if score_diff > null_score_diff_threshold: + all_predictions[example.qas_id] = "" + else: + all_predictions[example.qas_id] = best_non_null_entry.text + + all_nbest_json[example.qas_id] = nbest_json + + with open(output_prediction_file, "w") as writer: + writer.write(json.dumps(all_predictions, indent=4) + "\n") + + with open(output_nbest_file, "w") as writer: + writer.write(json.dumps(all_nbest_json, indent=4) + "\n") + + if version_2_with_negative: + with open(output_null_log_odds_file, "w") as writer: + writer.write(json.dumps(scores_diff_json, indent=4) + "\n") + + +def get_final_text(pred_text, orig_text, do_lower_case, verbose): + """Project the tokenized prediction back to the original text.""" + + # When we created the data, we kept track of the alignment between original + # (whitespace tokenized) tokens and our WordPiece tokenized tokens. So + # now `orig_text` contains the span of our original text corresponding to the + # span that we predicted. + # + # However, `orig_text` may contain extra characters that we don't want in + # our prediction. + # + # For example, let's say: + # pred_text = steve smith + # orig_text = Steve Smith's + # + # We don't want to return `orig_text` because it contains the extra "'s". + # + # We don't want to return `pred_text` because it's already been normalized + # (the SQuAD eval script also does punctuation stripping/lower casing but + # our tokenizer does additional normalization like stripping accent + # characters). 
+ # + # What we really want to return is "Steve Smith". + # + # Therefore, we have to apply a semi-complicated alignment heruistic between + # `pred_text` and `orig_text` to get a character-to-charcter alignment. This + # can fail in certain cases in which case we just return `orig_text`. + + def _strip_spaces(text): + ns_chars = [] + ns_to_s_map = collections.OrderedDict() + for (i, c) in enumerate(text): + if c == " ": + continue + ns_to_s_map[len(ns_chars)] = i + ns_chars.append(c) + ns_text = "".join(ns_chars) + return (ns_text, ns_to_s_map) + + # We first tokenize `orig_text`, strip whitespace from the result + # and `pred_text`, and check if they are the same length. If they are + # NOT the same length, the heuristic has failed. If they are the same + # length, we assume the characters are one-to-one aligned. + tokenizer = tokenization.BasicTokenizer(do_lower_case=do_lower_case) + + tok_text = " ".join(tokenizer.tokenize(orig_text)) + + start_position = tok_text.find(pred_text) + if start_position == -1: + if verbose: + print("Unable to find text: '%s' in '%s'" % (pred_text, orig_text)) + return orig_text + end_position = start_position + len(pred_text) - 1 + + (orig_ns_text, orig_ns_to_s_map) = _strip_spaces(orig_text) + (tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text) + + if len(orig_ns_text) != len(tok_ns_text): + if verbose: + print("Length not equal after stripping spaces: '%s' vs '%s'", + orig_ns_text, tok_ns_text) + return orig_text + + # We then project the characters in `pred_text` back to `orig_text` using + # the character-to-character alignment. + tok_s_to_ns_map = {} + for (i, tok_index) in six.iteritems(tok_ns_to_s_map): + tok_s_to_ns_map[tok_index] = i + + orig_start_position = None + if start_position in tok_s_to_ns_map: + ns_start_position = tok_s_to_ns_map[start_position] + if ns_start_position in orig_ns_to_s_map: + orig_start_position = orig_ns_to_s_map[ns_start_position] + + if orig_start_position is None: + if verbose: + print("Couldn't map start position") + return orig_text + + orig_end_position = None + if end_position in tok_s_to_ns_map: + ns_end_position = tok_s_to_ns_map[end_position] + if ns_end_position in orig_ns_to_s_map: + orig_end_position = orig_ns_to_s_map[ns_end_position] + + if orig_end_position is None: + if verbose: + print("Couldn't map end position") + return orig_text + + output_text = orig_text[orig_start_position:(orig_end_position + 1)] + return output_text + + +def _get_best_indexes(logits, n_best_size): + """Get the n-best logits from a list.""" + index_and_score = sorted( + enumerate(logits), key=lambda x: x[1], reverse=True) + + best_indexes = [] + for i in range(len(index_and_score)): + if i >= n_best_size: + break + best_indexes.append(index_and_score[i][0]) + return best_indexes + + +def _compute_softmax(scores): + """Compute softmax probability over raw logits.""" + if not scores: + return [] + + max_score = None + for score in scores: + if max_score is None or score > max_score: + max_score = score + + exp_scores = [] + total_sum = 0.0 + for score in scores: + x = math.exp(score - max_score) + exp_scores.append(x) + total_sum += x + + probs = [] + for score in exp_scores: + probs.append(score / total_sum) + return probs + + +if __name__ == '__main__': + train_file = 'squad/train-v1.1.json' + vocab_file = 'uncased_L-12_H-768_A-12/vocab.txt' + do_lower_case = True + tokenizer = tokenization.FullTokenizer( + vocab_file=vocab_file, do_lower_case=do_lower_case) + train_examples = read_squad_examples( + input_file=train_file, 
is_training=True) + print("begin converting") + for (index, feature) in enumerate( + convert_examples_to_features( + examples=train_examples, + tokenizer=tokenizer, + max_seq_length=384, + doc_stride=128, + max_query_length=64, + is_training=True, + #output_fn=train_writer.process_feature + )): + if index < 10: + print(index, feature.input_ids, feature.input_mask, + feature.segment_ids) + #for (index, example) in enumerate(train_examples): + # if index < 5: + # print(example) diff --git a/dygraph/bert/run_classifier.py b/dygraph/bert/run_classifier.py new file mode 100755 index 0000000000000000000000000000000000000000..737feb4f1bf3d17134b36983642768e078e43ab7 --- /dev/null +++ b/dygraph/bert/run_classifier.py @@ -0,0 +1,298 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""BERT fine-tuning in Paddle Dygraph Mode.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os +import six +import sys +if six.PY2: + reload(sys) + sys.setdefaultencoding('utf8') +import ast +import time +import argparse +import numpy as np +import multiprocessing +import paddle +import paddle.fluid as fluid +from paddle.fluid.dygraph import to_variable +import reader.cls as reader +from model.bert import BertConfig +from model.cls import ClsModelLayer +from optimization import Optimizer +from utils.args import ArgumentGroup, print_arguments, check_cuda +from utils.init import init_from_static_model + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +model_g = ArgumentGroup(parser, "model", "model configuration and paths.") +model_g.add_arg("bert_config_path", str, "./config/bert_config.json", "Path to the json file for bert model config.") +model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.") +model_g.add_arg("init_pretraining_params", str, None, + "Init pre-training params which preforms fine-tuning from. 
If the " + "arg 'init_checkpoint' has been set, this argument wouldn't be valid.") +model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.") + +train_g = ArgumentGroup(parser, "training", "training options.") +train_g.add_arg("epoch", int, 100, "Number of epoches for training.") +train_g.add_arg("learning_rate", float, 0.0001, "Learning rate used to train with warmup.") +train_g.add_arg("lr_scheduler", str, "linear_warmup_decay", + "scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay']) +train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.") +train_g.add_arg("warmup_proportion", float, 0.1, "Proportion of training steps to perform linear learning rate warmup for.") +train_g.add_arg("save_steps", int, 10000, "The steps interval to save checkpoints.") +train_g.add_arg("validation_steps", int, 1000, "The steps interval to evaluate model performance.") +train_g.add_arg("loss_scaling", float, 1.0, + "Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled.") + +log_g = ArgumentGroup(parser, "logging", "logging related.") +log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.") + +data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options") +data_g.add_arg("data_dir", str, None, "Path to training data.") +data_g.add_arg("vocab_path", str, None, "Vocabulary path.") +data_g.add_arg("max_seq_len", int, 512, "Tokens' number of the longest seqence allowed.") +data_g.add_arg("batch_size", int, 32, + "The total number of examples in one batch for training, see also --in_tokens.") +data_g.add_arg("in_tokens", bool, False, + "If set, the batch size will be the maximum number of tokens in one batch. " + "Otherwise, it will be the maximum number of examples in one batch.") +data_g.add_arg("do_lower_case", bool, True, + "Whether to lower case the input text. 
Should be True for uncased models and False for cased models.") +data_g.add_arg("random_seed", int, 5512, "Random seed.") + +run_type_g = ArgumentGroup(parser, "run_type", "running type options.") +run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.") +run_type_g.add_arg("shuffle", bool, True, "") +run_type_g.add_arg("task_name", str, None, + "The name of task to perform fine-tuning, should be in {'xnli', 'mnli', 'cola', 'mrpc'}.") +run_type_g.add_arg("do_train", bool, True, "Whether to perform training.") +run_type_g.add_arg("do_test", bool, False, "Whether to perform evaluation on test data set.") +run_type_g.add_arg("use_data_parallel", bool, False, "The flag indicating whether to shuffle instances in each pass.") + +args = parser.parse_args() + +def create_data(batch): + """ + convert data to variable + """ + src_ids = to_variable(batch[0], "src_ids") + position_ids = to_variable(batch[1], "position_ids") + sentence_ids = to_variable(batch[2], "sentence_ids") + input_mask = to_variable(batch[3], "input_mask") + labels = to_variable(batch[4], "labels") + labels.stop_gradient = True + return src_ids, position_ids, sentence_ids, input_mask, labels + +if args.use_cuda: + place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id) + dev_count = fluid.core.get_cuda_device_count() +else: + place = fluid.CPUPlace() + dev_count = int(os.environ.get('CPU_NUM', multiprocessing.cpu_count())) + + +def train(args): + bert_config = BertConfig(args.bert_config_path) + bert_config.print_config() + + if not (args.do_train or args.do_test): + raise ValueError("For args `do_train`, `do_test`, at " + "least one of them must be True.") + + trainer_count = fluid.dygraph.parallel.Env().nranks + + task_name = args.task_name.lower() + processors = { + 'xnli': reader.XnliProcessor, + 'cola': reader.ColaProcessor, + 'mrpc': reader.MrpcProcessor, + 'mnli': reader.MnliProcessor, + } + + processor = processors[task_name](data_dir=args.data_dir, + vocab_path=args.vocab_path, + max_seq_len=args.max_seq_len, + do_lower_case=args.do_lower_case, + in_tokens=args.in_tokens, + random_seed=args.random_seed) + num_labels = len(processor.get_labels()) + shuffle_seed = 1 if trainer_count > 1 else None + + train_data_generator = processor.data_generator( + batch_size=args.batch_size, + phase='train', + epoch=args.epoch, + dev_count=trainer_count, + shuffle=args.shuffle, + shuffle_seed=shuffle_seed) + num_train_examples = processor.get_num_examples(phase='train') + max_train_steps = args.epoch * num_train_examples // args.batch_size // trainer_count + warmup_steps = int(max_train_steps * args.warmup_proportion) + + print("Device count: %d" % dev_count) + print("Trainer count: %d" % trainer_count) + print("Num train examples: %d" % num_train_examples) + print("Max train steps: %d" % max_train_steps) + print("Num warmup steps: %d" % warmup_steps) + + with fluid.dygraph.guard(place): + + if args.use_data_parallel: + strategy = fluid.dygraph.parallel.prepare_context() + + cls_model = ClsModelLayer( + args, + bert_config, + num_labels, + is_training=True, + return_pooled_out=True) + + optimizer = Optimizer( + warmup_steps=warmup_steps, + num_train_steps=max_train_steps, + learning_rate=args.learning_rate, + model_cls=cls_model, + weight_decay=args.weight_decay, + scheduler=args.lr_scheduler, + loss_scaling=args.loss_scaling, + parameter_list=cls_model.parameters()) + + if args.init_pretraining_params: + print("Load pre-trained model from %s" % args.init_pretraining_params) + 
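As a side note on the step bookkeeping above, here is a small arithmetic sketch of how `max_train_steps` and `warmup_steps` are derived (the dataset size, batch size and epoch count are hypothetical and not tied to any particular GLUE task):

```
# hypothetical sizes, for illustration only
num_train_examples = 100000
epoch, batch_size, trainer_count = 3, 64, 1
warmup_proportion = 0.1

max_train_steps = epoch * num_train_examples // batch_size // trainer_count
warmup_steps = int(max_train_steps * warmup_proportion)
print(max_train_steps, warmup_steps)  # 4687 468
```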
init_from_static_model(args.init_pretraining_params, cls_model, bert_config) + + if args.use_data_parallel: + cls_model = fluid.dygraph.parallel.DataParallel(cls_model, strategy) + train_data_generator = fluid.contrib.reader.distributed_batch_reader(train_data_generator) + + steps = 0 + time_begin = time.time() + + for batch in train_data_generator(): + data_ids = create_data(batch) + loss, accuracy, num_seqs = cls_model(data_ids) + + optimizer.optimization(loss, use_data_parallel = args.use_data_parallel, model = cls_model) + cls_model.clear_gradients() + + if steps != 0 and steps % args.skip_steps == 0: + time_end = time.time() + used_time = time_end - time_begin + current_example, current_epoch = processor.get_train_progress() + localtime = time.asctime(time.localtime(time.time())) + print("%s, epoch: %s, steps: %s, dy_graph loss: %f, acc: %f, speed: %f steps/s" % (localtime, current_epoch, steps, loss.numpy(), accuracy.numpy(), args.skip_steps / used_time)) + time_begin = time.time() + + if steps != 0 and steps % args.save_steps == 0 and fluid.dygraph.parallel.Env().local_rank == 0: + save_path = os.path.join(args.checkpoints, "steps" + "_" + str(steps)) + fluid.save_dygraph( + cls_model.state_dict(), + save_path) + fluid.save_dygraph( + optimizer.optimizer.state_dict(), + save_path) + print("Save model parameters and optimizer status at %s" % save_path) + + steps += 1 + + if fluid.dygraph.parallel.Env().local_rank == 0: + save_path = os.path.join(args.checkpoints, "final") + fluid.save_dygraph( + cls_model.state_dict(), + save_path) + fluid.save_dygraph( + optimizer.optimizer.state_dict(), + save_path) + print("Save model parameters and optimizer status at %s" % save_path) + return cls_model + +def predict(args, cls_model = None): + + bert_config = BertConfig(args.bert_config_path) + bert_config.print_config() + + task_name = args.task_name.lower() + processors = { + 'xnli': reader.XnliProcessor, + 'cola': reader.ColaProcessor, + 'mrpc': reader.MrpcProcessor, + 'mnli': reader.MnliProcessor, + } + + processor = processors[task_name](data_dir=args.data_dir, + vocab_path=args.vocab_path, + max_seq_len=args.max_seq_len, + do_lower_case=args.do_lower_case, + in_tokens=False) + + test_data_generator = processor.data_generator( + batch_size=args.batch_size, + phase='dev', + epoch=1, + shuffle=False) + + num_labels = len(processor.get_labels()) + + with fluid.dygraph.guard(place): + if cls_model is None: + cls_model = ClsModelLayer( + args, + bert_config, + num_labels, + is_training=False, + return_pooled_out=True) + + #restore the model + save_path = os.path.join(args.checkpoints, "final") + print("Load params from %s" % save_path) + model_dict,_ = fluid.load_dygraph(save_path) + cls_model.load_dict(model_dict) + + print('Do predicting ...... 
') + cls_model.eval() + + total_cost, total_acc, total_num_seqs = [], [], [] + + for batch in test_data_generator(): + data_ids = create_data(batch) + np_loss, np_acc, np_num_seqs = cls_model(data_ids) + + np_loss = np_loss.numpy() + np_acc = np_acc.numpy() + np_num_seqs = np_num_seqs.numpy() + + total_cost.extend(np_loss * np_num_seqs) + total_acc.extend(np_acc * np_num_seqs) + total_num_seqs.extend(np_num_seqs) + + print("[evaluation] average acc: %f" % (np.sum(total_acc) / np.sum(total_num_seqs))) + + +if __name__ == '__main__': + + print_arguments(args) + check_cuda(args.use_cuda) + + if args.do_train: + cls_model = train(args) + if args.do_test: + predict(args, cls_model = cls_model) + + elif args.do_test: + predict(args) diff --git a/dygraph/bert/run_classifier_multi_gpu.sh b/dygraph/bert/run_classifier_multi_gpu.sh new file mode 100755 index 0000000000000000000000000000000000000000..041a1091a7ed95ba1a8df475af7b98a9108065e7 --- /dev/null +++ b/dygraph/bert/run_classifier_multi_gpu.sh @@ -0,0 +1,35 @@ +#!/bin/bash + +BERT_BASE_PATH="./data/pretrained_models/uncased_L-12_H-768_A-12/" +TASK_NAME='MNLI' +DATA_PATH="./data/glue_data/MNLI/" +CKPT_PATH="./data/saved_model/mnli_models" +GPU_TO_USE="0,1,2,3" + +export CUDA_VISIBLE_DEVICES=$GPU_TO_USE + +# start fine-tuning +python -m paddle.distributed.launch --selected_gpus=$GPU_TO_USE --log_dir ./cls_log run_classifier.py \ + --task_name ${TASK_NAME} \ + --use_cuda true \ + --use_data_parallel true \ + --do_train true \ + --do_test true \ + --batch_size 64 \ + --in_tokens false \ + --init_pretraining_params ${BERT_BASE_PATH}/dygraph_params/ \ + --data_dir ${DATA_PATH} \ + --vocab_path ${BERT_BASE_PATH}/vocab.txt \ + --checkpoints ${CKPT_PATH} \ + --save_steps 1000 \ + --weight_decay 0.01 \ + --warmup_proportion 0.1 \ + --validation_steps 100 \ + --epoch 3 \ + --max_seq_len 128 \ + --bert_config_path ${BERT_BASE_PATH}/bert_config.json \ + --learning_rate 5e-5 \ + --skip_steps 10 \ + --shuffle true + + diff --git a/dygraph/bert/run_classifier_predict.sh b/dygraph/bert/run_classifier_predict.sh new file mode 100755 index 0000000000000000000000000000000000000000..60c358bb8bd85275e5653b2440caa1229f8735f6 --- /dev/null +++ b/dygraph/bert/run_classifier_predict.sh @@ -0,0 +1,31 @@ +#!/bin/bash + +BERT_BASE_PATH="./data/pretrained_models/uncased_L-12_H-768_A-12/" +TASK_NAME='MNLI' +DATA_PATH="./data/glue_data/MNLI/" +CKPT_PATH="./data/saved_model/mnli_models" + +export CUDA_VISIBLE_DEVICES=0 + +# start testing +python run_classifier.py\ + --task_name ${TASK_NAME} \ + --use_cuda true \ + --do_train false \ + --do_test true \ + --batch_size 64 \ + --in_tokens false \ + --data_dir ${DATA_PATH} \ + --vocab_path ${BERT_BASE_PATH}/vocab.txt \ + --checkpoints ${CKPT_PATH} \ + --save_steps 1000 \ + --weight_decay 0.01 \ + --warmup_proportion 0.1 \ + --validation_steps 100 \ + --epoch 3 \ + --max_seq_len 128 \ + --bert_config_path ${BERT_BASE_PATH}/bert_config.json \ + --learning_rate 5e-5 \ + --skip_steps 10 \ + --shuffle false + diff --git a/dygraph/bert/run_classifier_single_gpu.sh b/dygraph/bert/run_classifier_single_gpu.sh new file mode 100755 index 0000000000000000000000000000000000000000..2fe4d89f0b63e2b9a72d95802795df30e7667036 --- /dev/null +++ b/dygraph/bert/run_classifier_single_gpu.sh @@ -0,0 +1,31 @@ +#!/bin/bash + +BERT_BASE_PATH="./data/pretrained_models/uncased_L-12_H-768_A-12/" +TASK_NAME='MNLI' +DATA_PATH="./data/glue_data/MNLI/" +CKPT_PATH="./data/saved_model/mnli_models" + +export CUDA_VISIBLE_DEVICES=0 + +# start fine-tuning +python 
run_classifier.py\ + --task_name ${TASK_NAME} \ + --use_cuda true \ + --do_train true \ + --do_test true \ + --batch_size 64 \ + --init_pretraining_params ${BERT_BASE_PATH}/dygraph_params/ \ + --data_dir ${DATA_PATH} \ + --vocab_path ${BERT_BASE_PATH}/vocab.txt \ + --checkpoints ${CKPT_PATH} \ + --save_steps 1000 \ + --weight_decay 0.01 \ + --warmup_proportion 0.1 \ + --validation_steps 100 \ + --epoch 3 \ + --max_seq_len 128 \ + --bert_config_path ${BERT_BASE_PATH}/bert_config.json \ + --learning_rate 5e-5 \ + --skip_steps 10 \ + --shuffle true + diff --git a/dygraph/bert/tokenization.py b/dygraph/bert/tokenization.py new file mode 100644 index 0000000000000000000000000000000000000000..08570f30fe9e6a8036a15095e67e6e8dd8686c14 --- /dev/null +++ b/dygraph/bert/tokenization.py @@ -0,0 +1,371 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Tokenization classes.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import collections +import unicodedata +import six +import io + + +def convert_to_unicode(text): + """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" + if six.PY3: + if isinstance(text, str): + return text + elif isinstance(text, bytes): + return text.decode("utf-8", "ignore") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + elif six.PY2: + if isinstance(text, str): + return text.decode("utf-8", "ignore") + elif isinstance(text, unicode): + return text + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + else: + raise ValueError("Not running on Python2 or Python 3?") + + +def printable_text(text): + """Returns text encoded in a way suitable for print or `tf.logging`.""" + + # These functions want `str` for both Python2 and Python3, but in one case + # it's a Unicode string and in the other it's a byte string. 
+ if six.PY3: + if isinstance(text, str): + return text + elif isinstance(text, bytes): + return text.decode("utf-8", "ignore") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + elif six.PY2: + if isinstance(text, str): + return text + elif isinstance(text, unicode): + return text.encode("utf-8") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + else: + raise ValueError("Not running on Python2 or Python 3?") + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = collections.OrderedDict() + fin = io.open(vocab_file, encoding="utf8") + for num, line in enumerate(fin): + items = convert_to_unicode(line.strip()).split("\t") + if len(items) > 2: + break + token = items[0] + index = items[1] if len(items) == 2 else num + token = token.strip() + vocab[token] = int(index) + return vocab + + +def convert_by_vocab(vocab, items): + """Converts a sequence of [tokens|ids] using the vocab.""" + output = [] + for item in items: + output.append(vocab[item]) + return output + + +def convert_tokens_to_ids(vocab, tokens): + return convert_by_vocab(vocab, tokens) + + +def convert_ids_to_tokens(inv_vocab, ids): + return convert_by_vocab(inv_vocab, ids) + + +def whitespace_tokenize(text): + """Runs basic whitespace cleaning and splitting on a peice of text.""" + text = text.strip() + if not text: + return [] + tokens = text.split() + return tokens + + +class FullTokenizer(object): + """Runs end-to-end tokenziation.""" + + def __init__(self, vocab_file, do_lower_case=True): + self.vocab = load_vocab(vocab_file) + self.inv_vocab = {v: k for k, v in self.vocab.items()} + self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case) + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) + + def tokenize(self, text): + split_tokens = [] + for token in self.basic_tokenizer.tokenize(text): + for sub_token in self.wordpiece_tokenizer.tokenize(token): + split_tokens.append(sub_token) + + return split_tokens + + def convert_tokens_to_ids(self, tokens): + return convert_by_vocab(self.vocab, tokens) + + def convert_ids_to_tokens(self, ids): + return convert_by_vocab(self.inv_vocab, ids) + + +class CharTokenizer(object): + """Runs end-to-end tokenziation.""" + + def __init__(self, vocab_file, do_lower_case=True): + self.vocab = load_vocab(vocab_file) + self.inv_vocab = {v: k for k, v in self.vocab.items()} + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) + + def tokenize(self, text): + split_tokens = [] + for token in text.lower().split(" "): + for sub_token in self.wordpiece_tokenizer.tokenize(token): + split_tokens.append(sub_token) + + return split_tokens + + def convert_tokens_to_ids(self, tokens): + return convert_by_vocab(self.vocab, tokens) + + def convert_ids_to_tokens(self, ids): + return convert_by_vocab(self.inv_vocab, ids) + + +class BasicTokenizer(object): + """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" + + def __init__(self, do_lower_case=True): + """Constructs a BasicTokenizer. + + Args: + do_lower_case: Whether to lower case the input. + """ + self.do_lower_case = do_lower_case + + def tokenize(self, text): + """Tokenizes a piece of text.""" + text = convert_to_unicode(text) + text = self._clean_text(text) + + # This was added on November 1st, 2018 for the multilingual and Chinese + # models. 
This is also applied to the English models now, but it doesn't + # matter since the English models were not trained on any Chinese data + # and generally don't have any Chinese data in them (there are Chinese + # characters in the vocabulary because Wikipedia does have some Chinese + # words in the English Wikipedia.). + text = self._tokenize_chinese_chars(text) + + orig_tokens = whitespace_tokenize(text) + split_tokens = [] + for token in orig_tokens: + if self.do_lower_case: + token = token.lower() + token = self._run_strip_accents(token) + split_tokens.extend(self._run_split_on_punc(token)) + + output_tokens = whitespace_tokenize(" ".join(split_tokens)) + return output_tokens + + def _run_strip_accents(self, text): + """Strips accents from a piece of text.""" + text = unicodedata.normalize("NFD", text) + output = [] + for char in text: + cat = unicodedata.category(char) + if cat == "Mn": + continue + output.append(char) + return "".join(output) + + def _run_split_on_punc(self, text): + """Splits punctuation on a piece of text.""" + chars = list(text) + i = 0 + start_new_word = True + output = [] + while i < len(chars): + char = chars[i] + if _is_punctuation(char): + output.append([char]) + start_new_word = True + else: + if start_new_word: + output.append([]) + start_new_word = False + output[-1].append(char) + i += 1 + + return ["".join(x) for x in output] + + def _tokenize_chinese_chars(self, text): + """Adds whitespace around any CJK character.""" + output = [] + for char in text: + cp = ord(char) + if self._is_chinese_char(cp): + output.append(" ") + output.append(char) + output.append(" ") + else: + output.append(char) + return "".join(output) + + def _is_chinese_char(self, cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. + if ((cp >= 0x4E00 and cp <= 0x9FFF) or # + (cp >= 0x3400 and cp <= 0x4DBF) or # + (cp >= 0x20000 and cp <= 0x2A6DF) or # + (cp >= 0x2A700 and cp <= 0x2B73F) or # + (cp >= 0x2B740 and cp <= 0x2B81F) or # + (cp >= 0x2B820 and cp <= 0x2CEAF) or + (cp >= 0xF900 and cp <= 0xFAFF) or # + (cp >= 0x2F800 and cp <= 0x2FA1F)): # + return True + + return False + + def _clean_text(self, text): + """Performs invalid character removal and whitespace cleanup on text.""" + output = [] + for char in text: + cp = ord(char) + if cp == 0 or cp == 0xfffd or _is_control(char): + continue + if _is_whitespace(char): + output.append(" ") + else: + output.append(char) + return "".join(output) + + +class WordpieceTokenizer(object): + """Runs WordPiece tokenziation.""" + + def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100): + self.vocab = vocab + self.unk_token = unk_token + self.max_input_chars_per_word = max_input_chars_per_word + + def tokenize(self, text): + """Tokenizes a piece of text into its word pieces. + + This uses a greedy longest-match-first algorithm to perform tokenization + using the given vocabulary. + + For example: + input = "unaffable" + output = ["un", "##aff", "##able"] + + Args: + text: A single token or whitespace separated tokens. 
This should have + already been passed through `BasicTokenizer. + + Returns: + A list of wordpiece tokens. + """ + + text = convert_to_unicode(text) + + output_tokens = [] + for token in whitespace_tokenize(text): + chars = list(token) + if len(chars) > self.max_input_chars_per_word: + output_tokens.append(self.unk_token) + continue + + is_bad = False + start = 0 + sub_tokens = [] + while start < len(chars): + end = len(chars) + cur_substr = None + while start < end: + substr = "".join(chars[start:end]) + if start > 0: + substr = "##" + substr + if substr in self.vocab: + cur_substr = substr + break + end -= 1 + if cur_substr is None: + is_bad = True + break + sub_tokens.append(cur_substr) + start = end + + if is_bad: + output_tokens.append(self.unk_token) + else: + output_tokens.extend(sub_tokens) + return output_tokens + + +def _is_whitespace(char): + """Checks whether `chars` is a whitespace character.""" + # \t, \n, and \r are technically contorl characters but we treat them + # as whitespace since they are generally considered as such. + if char == " " or char == "\t" or char == "\n" or char == "\r": + return True + cat = unicodedata.category(char) + if cat == "Zs": + return True + return False + + +def _is_control(char): + """Checks whether `chars` is a control character.""" + # These are technically control characters but we count them as whitespace + # characters. + if char == "\t" or char == "\n" or char == "\r": + return False + cat = unicodedata.category(char) + if cat.startswith("C"): + return True + return False + + +def _is_punctuation(char): + """Checks whether `chars` is a punctuation character.""" + cp = ord(char) + # We treat all non-letter/number ASCII as punctuation. + # Characters such as "^", "$", and "`" are not in the Unicode + # Punctuation class but we treat them as punctuation anyways, for + # consistency. + if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or + (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): + return True + cat = unicodedata.category(char) + if cat.startswith("P"): + return True + return False diff --git a/dygraph/bert/utils/__init__.py b/dygraph/bert/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/dygraph/bert/utils/args.py b/dygraph/bert/utils/args.py new file mode 100644 index 0000000000000000000000000000000000000000..66e9bb81a35bb4cc4c8c79cac4631841742bdeb8 --- /dev/null +++ b/dygraph/bert/utils/args.py @@ -0,0 +1,61 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
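To make the greedy longest-match-first behaviour of `WordpieceTokenizer` above concrete, a toy usage sketch follows. The vocabulary here is invented; in practice it is loaded from `vocab.txt` via `load_vocab`, and the module is assumed to be importable as `tokenization`:

```
import tokenization  # the module added above, assumed importable from dygraph/bert

toy_vocab = {"un": 0, "##aff": 1, "##able": 2, "[UNK]": 3}
wp = tokenization.WordpieceTokenizer(vocab=toy_vocab)
print(wp.tokenize("unaffable"))  # ['un', '##aff', '##able']
print(wp.tokenize("xyz"))        # ['[UNK]'] -- nothing in the toy vocab matches
```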
+"""Arguments for configuration.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import six +import argparse + +import paddle.fluid as fluid + + +def str2bool(v): + # because argparse does not support to parse "true, False" as python + # boolean directly + return v.lower() in ("true", "t", "1") + + +class ArgumentGroup(object): + def __init__(self, parser, title, des): + self._group = parser.add_argument_group(title=title, description=des) + + def add_arg(self, name, type, default, help, **kwargs): + type = str2bool if type == bool else type + self._group.add_argument( + "--" + name, + default=default, + type=type, + help=help + ' Default: %(default)s.', + **kwargs) + + +def print_arguments(args): + print('----------- Configuration Arguments -----------') + for arg, value in sorted(six.iteritems(vars(args))): + print('%s: %s' % (arg, value)) + print('------------------------------------------------') + +def check_cuda(use_cuda, err = \ + "\nYou can not set use_cuda = True in the model because you are using paddlepaddle-cpu.\n \ + Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_cuda = False to run models on CPU.\n" + ): + try: + if use_cuda == True and fluid.is_compiled_with_cuda() == False: + print(err) + sys.exit(1) + except Exception as e: + pass diff --git a/dygraph/bert/utils/cards.py b/dygraph/bert/utils/cards.py new file mode 100644 index 0000000000000000000000000000000000000000..70c58ee30da7f68f00d12af0b5dc1025dad42630 --- /dev/null +++ b/dygraph/bert/utils/cards.py @@ -0,0 +1,26 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + + +def get_cards(): + """ + get gpu cards number + """ + num = 0 + cards = os.environ.get('CUDA_VISIBLE_DEVICES', '') + if cards != '': + num = len(cards.split(",")) + return num diff --git a/dygraph/bert/utils/convert_static_to_dygraph.py b/dygraph/bert/utils/convert_static_to_dygraph.py new file mode 100755 index 0000000000000000000000000000000000000000..f590715304af8da152babdec60b7c4cba5291c56 --- /dev/null +++ b/dygraph/bert/utils/convert_static_to_dygraph.py @@ -0,0 +1,222 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import shutil +import sys +import os + + +def usage(): + """ + usage information + """ + print + print("please use command: ") + print( + "python convert_static_to_dygraph.py input_params_dir output_params_dir") + print + + +def convert_static_to_dygraph(static_model_path, dygraph_model_path): + """ + convert paddle static bert model to dygraph model + """ + + def mkdir(path): + if not os.path.isdir(path): + if os.path.split(path)[0]: + mkdir(os.path.split(path)[0]) + else: + return + os.mkdir(path) + + if os.path.exists(dygraph_model_path): + shutil.rmtree(dygraph_model_path) + mkdir(dygraph_model_path) + + if not os.path.exists(static_model_path): + print("paddle static model path doesn't exist.....") + return -1 + + file_list = [] + for root, dirs, files in os.walk(static_model_path): + file_list.extend(files) + + os.makedirs(os.path.join(dygraph_model_path, "PretrainModelLayer_0")) + os.makedirs( + os.path.join(dygraph_model_path, + "PretrainModelLayer_0/BertModelLayer_0")) + os.makedirs( + os.path.join(dygraph_model_path, + "PretrainModelLayer_0/PrePostProcessLayer_0")) + os.makedirs( + os.path.join( + dygraph_model_path, + "PretrainModelLayer_0/BertModelLayer_0/PrePostProcessLayer_0")) + + #os.chdir(static_model_path) + #convert embedding file + embedding_type = ["word", "pos", "sent"] + for i in range(3): + src_name = embedding_type[i] + "_embedding" + trg_name = "Embedding_" + str(i) + "." + src_name + shutil.copyfile( + os.path.join(static_model_path, src_name), + os.path.join(dygraph_model_path, + "PretrainModelLayer_0/BertModelLayer_0/" + trg_name)) + + #convert pre_encoder file + shutil.copyfile( + os.path.join(static_model_path, "pre_encoder_layer_norm_scale"), + os.path.join( + dygraph_model_path, + "PretrainModelLayer_0/BertModelLayer_0/PrePostProcessLayer_0/LayerNorm_0._layer_norm_scale" + )) + shutil.copyfile( + os.path.join(static_model_path, "pre_encoder_layer_norm_bias"), + os.path.join( + dygraph_model_path, + "PretrainModelLayer_0/BertModelLayer_0/PrePostProcessLayer_0/LayerNorm_0._layer_norm_bias" + )) + + #convert mask lm params file + shutil.copyfile( + os.path.join(static_model_path, "mask_lm_out_fc.b_0"), + os.path.join(dygraph_model_path, + "PretrainModelLayer_0/Layer_0.mask_lm_out_fc.b_0")) + shutil.copyfile( + os.path.join(static_model_path, "mask_lm_trans_fc.b_0"), + os.path.join(dygraph_model_path, + "PretrainModelLayer_0/FC_0.mask_lm_trans_fc.b_0")) + shutil.copyfile( + os.path.join(static_model_path, "mask_lm_trans_fc.w_0"), + os.path.join(dygraph_model_path, + "PretrainModelLayer_0/FC_0.mask_lm_trans_fc.w_0")) + shutil.copyfile( + os.path.join(static_model_path, "mask_lm_trans_layer_norm_bias"), + os.path.join( + dygraph_model_path, + "PretrainModelLayer_0/PrePostProcessLayer_0/LayerNorm_0._layer_norm_bias" + )) + shutil.copyfile( + os.path.join(static_model_path, "mask_lm_trans_layer_norm_scale"), + os.path.join( + dygraph_model_path, + "PretrainModelLayer_0/PrePostProcessLayer_0/LayerNorm_0._layer_norm_scale" + )) + shutil.copyfile( + os.path.join(static_model_path, "next_sent_fc.b_0"), + os.path.join(dygraph_model_path, + "PretrainModelLayer_0/FC_1.next_sent_fc.b_0")) + shutil.copyfile( + os.path.join(static_model_path, "next_sent_fc.w_0"), + os.path.join(dygraph_model_path, + "PretrainModelLayer_0/FC_1.next_sent_fc.w_0")) + shutil.copyfile( + os.path.join(static_model_path, "pooled_fc.b_0"), + os.path.join( + dygraph_model_path, + "PretrainModelLayer_0/BertModelLayer_0/FC_0.pooled_fc.b_0")) + shutil.copyfile( + os.path.join(static_model_path, 
"pooled_fc.w_0"), + os.path.join( + dygraph_model_path, + "PretrainModelLayer_0/BertModelLayer_0/FC_0.pooled_fc.w_0")) + + encoder_num = 0 + for f in file_list: + if not f.startswith("encoder_layer"): + continue + layer_num = f.split('_')[2] + if int(layer_num) > encoder_num: + encoder_num = int(layer_num) + + encoder_num += 1 + for i in range(encoder_num): + encoder_dir = "EncoderSubLayer_" + str(i) + os.makedirs( + os.path.join(dygraph_model_path, + "PretrainModelLayer_0/BertModelLayer_0/" + + "EncoderLayer_0/", encoder_dir)) + os.makedirs( + os.path.join(dygraph_model_path, + "PretrainModelLayer_0/BertModelLayer_0/" + + "EncoderLayer_0/", encoder_dir + + "/PositionwiseFeedForwardLayer_0")) + os.makedirs( + os.path.join( + dygraph_model_path, "PretrainModelLayer_0/BertModelLayer_0/" + + "EncoderLayer_0/", encoder_dir + "/MultiHeadAttentionLayer_0")) + os.makedirs( + os.path.join( + dygraph_model_path, "PretrainModelLayer_0/BertModelLayer_0/" + + "EncoderLayer_0/", encoder_dir + "/PrePostProcessLayer_1")) + os.makedirs( + os.path.join( + dygraph_model_path, "PretrainModelLayer_0/BertModelLayer_0/" + + "EncoderLayer_0/", encoder_dir + "/PrePostProcessLayer_3")) + + encoder_map_dict = { + "ffn_fc_0.b_0": ("PositionwiseFeedForwardLayer_0", "FC_0.ffn_fc_0.b_0"), + "ffn_fc_0.w_0": ("PositionwiseFeedForwardLayer_0", "FC_0.ffn_fc_0.w_0"), + "ffn_fc_1.b_0": ("PositionwiseFeedForwardLayer_0", "FC_1.ffn_fc_1.b_0"), + "ffn_fc_1.w_0": ("PositionwiseFeedForwardLayer_0", "FC_1.ffn_fc_1.w_0"), + "multi_head_att_key_fc.b_0": + ("MultiHeadAttentionLayer_0", "FC_1.key_fc.b_0"), + "multi_head_att_key_fc.w_0": + ("MultiHeadAttentionLayer_0", "FC_1.key_fc.w_0"), + "multi_head_att_output_fc.b_0": + ("MultiHeadAttentionLayer_0", "FC_3.output_fc.b_0"), + "multi_head_att_output_fc.w_0": + ("MultiHeadAttentionLayer_0", "FC_3.output_fc.w_0"), + "multi_head_att_query_fc.b_0": + ("MultiHeadAttentionLayer_0", "FC_0.query_fc.b_0"), + "multi_head_att_query_fc.w_0": + ("MultiHeadAttentionLayer_0", "FC_0.query_fc.w_0"), + "multi_head_att_value_fc.b_0": + ("MultiHeadAttentionLayer_0", "FC_2.value_fc.b_0"), + "multi_head_att_value_fc.w_0": + ("MultiHeadAttentionLayer_0", "FC_2.value_fc.w_0"), + "post_att_layer_norm_bias": + ("PrePostProcessLayer_1", "LayerNorm_0.post_att_layer_norm_bias"), + "post_att_layer_norm_scale": + ("PrePostProcessLayer_1", "LayerNorm_0.post_att_layer_norm_scale"), + "post_ffn_layer_norm_bias": + ("PrePostProcessLayer_3", "LayerNorm_0.post_ffn_layer_norm_bias"), + "post_ffn_layer_norm_scale": + ("PrePostProcessLayer_3", "LayerNorm_0.post_ffn_layer_norm_scale") + } + + for f in file_list: + if not f.startswith("encoder_layer"): + continue + layer_num = f.split('_')[2] + suffix_name = "_".join(f.split('_')[3:]) + in_dir = encoder_map_dict[suffix_name][0] + rename = encoder_map_dict[suffix_name][1] + encoder_layer = "EncoderSubLayer_" + layer_num + shutil.copyfile( + os.path.join(static_model_path, f), + os.path.join(dygraph_model_path, + "PretrainModelLayer_0/BertModelLayer_0/EncoderLayer_0/" + + encoder_layer + "/" + in_dir + "/" + rename)) + + +if __name__ == "__main__": + + if len(sys.argv) < 3: + usage() + exit(1) + static_model_path = sys.argv[1] + dygraph_model_path = sys.argv[2] + convert_static_to_dygraph(static_model_path, dygraph_model_path) diff --git a/dygraph/bert/utils/fp16.py b/dygraph/bert/utils/fp16.py new file mode 100644 index 0000000000000000000000000000000000000000..e153c2b9a1029897def264278c5dbe72e1f369f5 --- /dev/null +++ b/dygraph/bert/utils/fp16.py @@ -0,0 +1,97 @@ +# Copyright 
(c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function +import paddle +import paddle.fluid as fluid + + +def cast_fp16_to_fp32(i, o, prog): + prog.global_block().append_op( + type="cast", + inputs={"X": i}, + outputs={"Out": o}, + attrs={ + "in_dtype": fluid.core.VarDesc.VarType.FP16, + "out_dtype": fluid.core.VarDesc.VarType.FP32 + }) + + +def cast_fp32_to_fp16(i, o, prog): + prog.global_block().append_op( + type="cast", + inputs={"X": i}, + outputs={"Out": o}, + attrs={ + "in_dtype": fluid.core.VarDesc.VarType.FP32, + "out_dtype": fluid.core.VarDesc.VarType.FP16 + }) + + +def copy_to_master_param(p, block): + v = block.vars.get(p.name, None) + if v is None: + raise ValueError("no param name %s found!" % p.name) + new_p = fluid.framework.Parameter( + block=block, + shape=v.shape, + dtype=fluid.core.VarDesc.VarType.FP32, + type=v.type, + lod_level=v.lod_level, + stop_gradient=p.stop_gradient, + trainable=p.trainable, + optimize_attr=p.optimize_attr, + regularizer=p.regularizer, + gradient_clip_attr=p.gradient_clip_attr, + error_clip=p.error_clip, + name=v.name + ".master") + return new_p + + +def create_master_params_grads(params_grads, main_prog, startup_prog, + loss_scaling): + master_params_grads = [] + tmp_role = main_prog._current_role + OpRole = fluid.core.op_proto_and_checker_maker.OpRole + main_prog._current_role = OpRole.Backward + for p, g in params_grads: + # create master parameters + master_param = copy_to_master_param(p, main_prog.global_block()) + startup_master_param = startup_prog.global_block()._clone_variable( + master_param) + startup_p = startup_prog.global_block().var(p.name) + cast_fp16_to_fp32(startup_p, startup_master_param, startup_prog) + # cast fp16 gradients to fp32 before apply gradients + if g.name.find("layer_norm") > -1: + if loss_scaling > 1: + scaled_g = g / float(loss_scaling) + else: + scaled_g = g + master_params_grads.append([p, scaled_g]) + continue + master_grad = fluid.layers.cast(g, "float32") + if loss_scaling > 1: + master_grad = master_grad / float(loss_scaling) + master_params_grads.append([master_param, master_grad]) + main_prog._current_role = tmp_role + return master_params_grads + + +def master_param_to_train_param(master_params_grads, params_grads, main_prog): + for idx, m_p_g in enumerate(master_params_grads): + train_p, _ = params_grads[idx] + if train_p.name.find("layer_norm") > -1: + continue + with main_prog._optimized_guard([m_p_g[0], m_p_g[1]]): + cast_fp32_to_fp16(m_p_g[0], train_p, main_prog) diff --git a/dygraph/bert/utils/init.py b/dygraph/bert/utils/init.py new file mode 100644 index 0000000000000000000000000000000000000000..d823473d59f35ff256d97db108c128ca92a7c1fd --- /dev/null +++ b/dygraph/bert/utils/init.py @@ -0,0 +1,239 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import print_function + +import os +import six +import ast +import copy + +import numpy as np +import paddle.fluid as fluid + + +def cast_fp32_to_fp16(exe, main_program): + print("Cast parameters to float16 data format.") + for param in main_program.global_block().all_parameters(): + if not param.name.endswith(".master"): + param_t = fluid.global_scope().find_var(param.name).get_tensor() + data = np.array(param_t) + if param.name.find("layer_norm") == -1: + param_t.set(np.float16(data).view(np.uint16), exe.place) + master_param_var = fluid.global_scope().find_var(param.name + + ".master") + if master_param_var is not None: + master_param_var.get_tensor().set(data, exe.place) + + +def init_checkpoint(exe, init_checkpoint_path, main_program, use_fp16=False): + assert os.path.exists( + init_checkpoint_path), "[%s] cann't be found." % init_checkpoint_path + + def existed_persitables(var): + if not fluid.io.is_persistable(var): + return False + return os.path.exists(os.path.join(init_checkpoint_path, var.name)) + + fluid.io.load_vars( + exe, + init_checkpoint_path, + main_program=main_program, + predicate=existed_persitables) + print("Load model from {}".format(init_checkpoint_path)) + + if use_fp16: + cast_fp32_to_fp16(exe, main_program) + + +def init_pretraining_params(exe, + pretraining_params_path, + main_program, + use_fp16=False): + assert os.path.exists(pretraining_params_path + ), "[%s] cann't be found." 
% pretraining_params_path + + def existed_params(var): + if not isinstance(var, fluid.framework.Parameter): + return False + return os.path.exists(os.path.join(pretraining_params_path, var.name)) + + fluid.io.load_vars( + exe, + pretraining_params_path, + main_program=main_program, + predicate=existed_params) + print("Load pretraining parameters from {}.".format( + pretraining_params_path)) + + if use_fp16: + cast_fp32_to_fp16(exe, main_program) + + +def init_from_static_model(dir_path, cls_model, bert_config): + def load_numpy_weight(file_name): + res = np.load(os.path.join(dir_path, file_name), allow_pickle=True) + assert res is not None + return res + + # load word embedding + _param = load_numpy_weight("word_embedding") + cls_model.bert_layer._src_emb.set_dict({"weight": _param}) + print("INIT word embedding") + + _param = load_numpy_weight("pos_embedding") + cls_model.bert_layer._pos_emb.set_dict({"weight": _param}) + print("INIT pos embedding") + + _param = load_numpy_weight("sent_embedding") + cls_model.bert_layer._sent_emb.set_dict({"weight": _param}) + print("INIT sent embedding") + + _param0 = load_numpy_weight("pooled_fc.w_0") + _param1 = load_numpy_weight("pooled_fc.b_0") + cls_model.bert_layer.pooled_fc.set_dict({ + "weight": _param0, + "bias": _param1 + }) + print("INIT pooled_fc") + + _param0 = load_numpy_weight("pre_encoder_layer_norm_scale") + _param1 = load_numpy_weight("pre_encoder_layer_norm_bias") + cls_model.bert_layer.pre_process_layer._sub_layers["layer_norm_0"].set_dict( + { + "weight": _param0, + "bias": _param1 + }) + print("INIT pre_encoder layer norm") + + for _i in range(bert_config["num_hidden_layers"]): + _param_weight = "encoder_layer_%d_multi_head_att_query_fc.w_0" % _i + _param_bias = "encoder_layer_%d_multi_head_att_query_fc.b_0" % _i + + _param_weight = load_numpy_weight(_param_weight) + _param_bias = load_numpy_weight(_param_bias) + + cls_model.bert_layer._encoder._sub_layers[ + "esl_%d" % _i]._multihead_attention_layer._q_fc.set_dict({ + "weight": _param_weight, + "bias": _param_bias + }) + print("INIT multi_head_att_query_fc %d" % _i) + + _param_weight = "encoder_layer_%d_multi_head_att_key_fc.w_0" % _i + _param_bias = "encoder_layer_%d_multi_head_att_key_fc.b_0" % _i + + _param_weight = load_numpy_weight(_param_weight) + _param_bias = load_numpy_weight(_param_bias) + + cls_model.bert_layer._encoder._sub_layers[ + "esl_%d" % _i]._multihead_attention_layer._k_fc.set_dict({ + "weight": _param_weight, + "bias": _param_bias + }) + print("INIT multi_head_att_key_fc %d" % _i) + + _param_weight = "encoder_layer_%d_multi_head_att_value_fc.w_0" % _i + _param_bias = "encoder_layer_%d_multi_head_att_value_fc.b_0" % _i + + _param_weight = load_numpy_weight(_param_weight) + _param_bias = load_numpy_weight(_param_bias) + + cls_model.bert_layer._encoder._sub_layers[ + "esl_%d" % _i]._multihead_attention_layer._v_fc.set_dict({ + "weight": _param_weight, + "bias": _param_bias + }) + print("INIT multi_head_att_value_fc %d" % _i) + + # init output fc + _param_weight = "encoder_layer_%d_multi_head_att_output_fc.w_0" % _i + _param_bias = "encoder_layer_%d_multi_head_att_output_fc.b_0" % _i + + _param_weight = load_numpy_weight(_param_weight) + _param_bias = load_numpy_weight(_param_bias) + + cls_model.bert_layer._encoder._sub_layers[ + "esl_%d" % _i]._multihead_attention_layer._proj_fc.set_dict({ + "weight": _param_weight, + "bias": _param_bias + }) + print("INIT multi_head_att_output_fc %d" % _i) + + # init layer_norm 1 + _param_weight = 
"encoder_layer_%d_post_att_layer_norm_scale" % _i + _param_bias = "encoder_layer_%d_post_att_layer_norm_bias" % _i + + _param_weight = load_numpy_weight(_param_weight) + _param_bias = load_numpy_weight(_param_bias) + + cls_model.bert_layer._encoder._sub_layers[ + "esl_%d" % _i]._postprocess_layer.layer_norm_0.set_dict({ + "weight": _param_weight, + "bias": _param_bias + }) + print("INIT layer norm in attention at %d layer" % _i) + + # init layer_norm 2 + _param_weight = "encoder_layer_%d_post_ffn_layer_norm_scale" % _i + _param_bias = "encoder_layer_%d_post_ffn_layer_norm_bias" % _i + + _param_weight = load_numpy_weight(_param_weight) + _param_bias = load_numpy_weight(_param_bias) + + cls_model.bert_layer._encoder._sub_layers[ + "esl_%d" % _i]._postprocess_layer2.layer_norm_0.set_dict({ + "weight": _param_weight, + "bias": _param_bias + }) + print("INIT layer norm in FFN at %d layer" % _i) + + # init FFN 1 + _param_weight = "encoder_layer_%d_ffn_fc_0.w_0" % _i + _param_bias = "encoder_layer_%d_ffn_fc_0.b_0" % _i + + _param_weight = load_numpy_weight(_param_weight) + _param_bias = load_numpy_weight(_param_bias) + + cls_model.bert_layer._encoder._sub_layers[ + "esl_%d" % _i]._positionwise_feed_forward._i2h.set_dict({ + "weight": _param_weight, + "bias": _param_bias + }) + print("INIT FFN-1 at %d layer" % _i) + + # init FFN 2 + _param_weight = "encoder_layer_%d_ffn_fc_1.w_0" % _i + _param_bias = "encoder_layer_%d_ffn_fc_1.b_0" % _i + + _param_weight = load_numpy_weight(_param_weight) + _param_bias = load_numpy_weight(_param_bias) + + cls_model.bert_layer._encoder._sub_layers[ + "esl_%d" % _i]._positionwise_feed_forward._h2o.set_dict({ + "weight": _param_weight, + "bias": _param_bias + }) + print("INIT FFN-2 at %d layer" % _i) + + # init cls fc + #_param_weight = "cls_out_w" + #_param_bias = "cls_out_b" + + #_param_weight = load_numpy_weight(_param_weight) + #_param_bias = load_numpy_weight(_param_bias) + + #cls_model.cls_fc.set_dict({"weight":_param_weight, "bias":_param_bias}) + #print("INIT CLS FC layer") + return True diff --git a/dygraph/bmn/README.md b/dygraph/bmn/README.md new file mode 100644 index 0000000000000000000000000000000000000000..715c3cb7d77d014f10e68f3d32aab512aef21156 --- /dev/null +++ b/dygraph/bmn/README.md @@ -0,0 +1,131 @@ +# BMN 视频动作定位模型动态图实现 + +--- +## 内容 + +- [模型简介](#模型简介) +- [代码结构](#代码结构) +- [数据准备](#数据准备) +- [模型训练](#模型训练) +- [模型评估](#模型评估) +- [模型推断](#模型推断) +- [参考论文](#参考论文) + + +## 模型简介 + +BMN模型是百度自研,2019年ActivityNet夺冠方案,为视频动作定位问题中proposal的生成提供高效的解决方案,在PaddlePaddle上首次开源。此模型引入边界匹配(Boundary-Matching, BM)机制来评估proposal的置信度,按照proposal开始边界的位置及其长度将所有可能存在的proposal组合成一个二维的BM置信度图,图中每个点的数值代表其所对应的proposal的置信度分数。网络由三个模块组成,基础模块作为主干网络处理输入的特征序列,TEM模块预测每一个时序位置属于动作开始、动作结束的概率,PEM模块生成BM置信度图。 + +

+<p align="center">
+BMN Overview
+</p>
+ +BMN模型的静态图实现请参考[PaddleVideo](../../PaddleCV/PaddleVideo) + +动态图文档请参考[Dygraph](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/user_guides/howto/dygraph/DyGraph.html) + + +## 代码结构 +``` +├── bmn.yaml # 网络配置文件,用户可方便的配置参数 +├── run.sh # 快速运行脚本,可直接开始多卡训练 +├── train.py # 训练代码,包含网络结构相关代码 +├── eval.py # 评估代码,评估网络性能 +├── predict.py # 预测代码,针对任意输入预测结果 +├── model.py # 网络结构与损失函数定义 +├── reader.py # 数据reader +├── eval_anet_prop.py # 计算精度评估指标 +├── bmn_utils.py # 模型细节相关代码 +├── config_utils.py # 配置细节相关代码 +└── infer.list # 推断文件列表 +``` + + +## 数据准备 + +BMN的训练数据采用ActivityNet1.3提供的数据集,我们提供了处理好的视频特征,请下载[bmn\_feat](https://paddlemodels.bj.bcebos.com/video_detection/bmn_feat.tar.gz)数据后解压,同时相应的修改bmn.yaml中的特征路径feat\_path。 + + +## 模型训练 + +数据准备完成后,可通过如下两种方式启动训练: + +默认使用4卡训练,启动方式如下: + + bash run.sh + +若使用单卡训练,启动方式如下: + + export CUDA_VISIBLE_DEVICES=0 + python train.py + +- 代码运行需要先安装pandas + +- 从头开始训练,使用上述启动命令行或者脚本程序即可启动训练,不需要用到预训练模型 + +**训练策略:** + +* 采用Adam优化器,初始learning\_rate=0.001 +* 权重衰减系数为1e-4 +* 学习率在迭代次数达到4200的时候做一次衰减,衰减系数为0.1 + +- 下面的表格列出了此模型训练的大致时长(单位:分钟),使用的GPU型号为P40,CUDA版本8.0,cudnn版本7.2 + +| | 单卡 | 4卡 | +| :---: | :---: | :---: | +| 静态图 | 79 | 27 | +| 动态图 | 98 | 31 | + +## 模型评估 + +训练完成后,可通过如下方式进行模型评估: + + python eval.py --weights=$PATH_TO_WEIGHTS + +- 进行评估时,可修改命令行中的`weights`参数指定需要评估的权重,如果不设置,将使用默认参数文件checkpoint/bmn\_paddle\_dy\_final.pdparams。 + +- 上述程序会将运行结果保存在output/EVAL/BMN\_results文件夹下,测试结果保存在evaluate\_results/bmn\_results\_validation.json文件中。 + +- 使用CPU进行评估时,请将上面的命令行`use_gpu`设置为False。 + +- 注:评估时可能会出现loss为nan的情况。这是由于评估时用的是单个样本,可能存在没有iou>0.6的样本,所以为nan,对最终的评估结果没有影响。 + + +使用ActivityNet官方提供的测试脚本,即可计算AR@AN和AUC。具体计算过程如下: + +- ActivityNet数据集的具体使用说明可以参考其[官方网站](http://activity-net.org) + +- 下载指标评估代码,请从[ActivityNet Gitub repository](https://github.com/activitynet/ActivityNet.git)下载,将Evaluation文件夹拷贝至models/dygraph/bmn目录下。(注:由于第三方评估代码不支持python3,此处建议使用python2进行评估;若使用python3,print函数需要添加括号,请对Evaluation目录下的.py文件做相应修改。) + +- 请下载[activity\_net\_1\_3\_new.json](https://paddlemodels.bj.bcebos.com/video_detection/activity_net_1_3_new.json)文件,并将其放置在models/dygraph/bmn/Evaluation/data目录下,相较于原始的activity\_net.v1-3.min.json文件,我们过滤了其中一些失效的视频条目。 + +- 计算精度指标 + + ```python eval_anet_prop.py``` + + +在ActivityNet1.3数据集下评估精度如下: + +| AR@1 | AR@5 | AR@10 | AR@100 | AUC | +| :---: | :---: | :---: | :---: | :---: | +| 33.46 | 49.25 | 56.25 | 75.40 | 67.16% | + + +## 模型推断 + +可通过如下方式启动模型推断: + + python predict.py --weights=$PATH_TO_WEIGHTS \ + --filelist=$FILELIST + +- 使用python命令行启动程序时,`--filelist`参数指定待推断的文件列表,如果不设置,默认为./infer.list。`--weights`参数为训练好的权重参数,如果不设置,将使用默认参数文件checkpoint/bmn\_paddle\_dy\_final.pdparams。 + +- 上述程序会将运行结果保存在output/INFER/BMN\_results文件夹下,测试结果保存在predict\_results/bmn\_results\_test.json文件中。 + +- 使用CPU进行推断时,请将命令行中的`use_gpu`设置为False + + +## 参考论文 + +- [BMN: Boundary-Matching Network for Temporal Action Proposal Generation](https://arxiv.org/abs/1907.09702), Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, Shilei Wen. 
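以下为训练策略中分段学习率的示意代码,仅作说明用途:数值取自上文的训练策略(初始学习率 0.001,在第 4200 次迭代衰减为原来的 0.1 倍),并非本模型训练代码的原样实现:

```
# 示意:分段常数学习率(纯 Python,便于理解,不是训练脚本本身)
base_lr, decay_iter, decay_rate = 0.001, 4200, 0.1

def piecewise_lr(step):
    # 迭代次数未达到 decay_iter 时使用初始学习率,此后乘以衰减系数
    return base_lr if step < decay_iter else base_lr * decay_rate

for step in (0, 4199, 4200, 8000):
    print(step, piecewise_lr(step))  # 4200 步之前约为 1e-3,之后约为 1e-4
```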
diff --git a/dygraph/bmn/bmn.yaml b/dygraph/bmn/bmn.yaml new file mode 100644 index 0000000000000000000000000000000000000000..e16fa53bc01c4a94fcfb9781091443fa4e55775c --- /dev/null +++ b/dygraph/bmn/bmn.yaml @@ -0,0 +1,50 @@ +MODEL: + name: "BMN" + tscale: 100 + dscale: 100 + feat_dim: 400 + prop_boundary_ratio: 0.5 + num_sample: 32 + num_sample_perbin: 3 + anno_file: "../../PaddleCV/PaddleVideo/data/dataset/bmn/activitynet_1.3_annotations.json" + feat_path: './fix_feat_100' + +TRAIN: + subset: "train" + epoch: 9 + batch_size: 16 + num_threads: 8 + use_gpu: True + num_gpus: 4 + learning_rate: 0.001 + learning_rate_decay: 0.1 + lr_decay_iter: 4200 + l2_weight_decay: 1e-4 + +VALID: + subset: "validation" + batch_size: 16 + num_threads: 8 + use_gpu: True + num_gpus: 4 + +TEST: + subset: "validation" + batch_size: 1 + num_threads: 1 + snms_alpha: 0.001 + snms_t1: 0.5 + snms_t2: 0.9 + output_path: "output/EVAL/BMN_results" + result_path: "evaluate_results" + +INFER: + subset: "test" + batch_size: 1 + num_threads: 1 + snms_alpha: 0.4 + snms_t1: 0.5 + snms_t2: 0.9 + filelist: './infer.list' + output_path: "output/INFER/BMN_results" + result_path: "predict_results" diff --git a/dygraph/bmn/bmn_utils.py b/dygraph/bmn/bmn_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..02a960841e6c97d3484f870be260e28ca123e566 --- /dev/null +++ b/dygraph/bmn/bmn_utils.py @@ -0,0 +1,217 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +import numpy as np +import pandas as pd +import multiprocessing as mp +import json +import os +import math + + +def iou_with_anchors(anchors_min, anchors_max, box_min, box_max): + """Compute jaccard score between a box and the anchors. + """ + len_anchors = anchors_max - anchors_min + int_xmin = np.maximum(anchors_min, box_min) + int_xmax = np.minimum(anchors_max, box_max) + inter_len = np.maximum(int_xmax - int_xmin, 0.) + union_len = len_anchors - inter_len + box_max - box_min + jaccard = np.divide(inter_len, union_len) + return jaccard + + +def ioa_with_anchors(anchors_min, anchors_max, box_min, box_max): + """Compute intersection between score a box and the anchors. + """ + len_anchors = anchors_max - anchors_min + int_xmin = np.maximum(anchors_min, box_min) + int_xmax = np.minimum(anchors_max, box_max) + inter_len = np.maximum(int_xmax - int_xmin, 0.) 
+ scores = np.divide(inter_len, len_anchors) + return scores + + +def boundary_choose(score_list): + max_score = max(score_list) + mask_high = (score_list > max_score * 0.5) + score_list = list(score_list) + score_middle = np.array([0.0] + score_list + [0.0]) + score_front = np.array([0.0, 0.0] + score_list) + score_back = np.array(score_list + [0.0, 0.0]) + mask_peak = ((score_middle > score_front) & (score_middle > score_back)) + mask_peak = mask_peak[1:-1] + mask = (mask_high | mask_peak).astype('float32') + return mask + + +def soft_nms(df, alpha, t1, t2): + ''' + df: proposals generated by network; + alpha: alpha value of Gaussian decaying function; + t1, t2: threshold for soft nms. + ''' + df = df.sort_values(by="score", ascending=False) + tstart = list(df.xmin.values[:]) + tend = list(df.xmax.values[:]) + tscore = list(df.score.values[:]) + + rstart = [] + rend = [] + rscore = [] + + while len(tscore) > 1 and len(rscore) < 101: + max_index = tscore.index(max(tscore)) + tmp_iou_list = iou_with_anchors( + np.array(tstart), + np.array(tend), tstart[max_index], tend[max_index]) + for idx in range(0, len(tscore)): + if idx != max_index: + tmp_iou = tmp_iou_list[idx] + tmp_width = tend[max_index] - tstart[max_index] + if tmp_iou > t1 + (t2 - t1) * tmp_width: + tscore[idx] = tscore[idx] * np.exp(-np.square(tmp_iou) / + alpha) + + rstart.append(tstart[max_index]) + rend.append(tend[max_index]) + rscore.append(tscore[max_index]) + tstart.pop(max_index) + tend.pop(max_index) + tscore.pop(max_index) + + newDf = pd.DataFrame() + newDf['score'] = rscore + newDf['xmin'] = rstart + newDf['xmax'] = rend + return newDf + + +def video_process(video_list, + video_dict, + output_path, + result_dict, + snms_alpha=0.4, + snms_t1=0.55, + snms_t2=0.9): + + for video_name in video_list: + print("Processing video........" 
+ video_name) + df = pd.read_csv(os.path.join(output_path, video_name + ".csv")) + if len(df) > 1: + df = soft_nms(df, snms_alpha, snms_t1, snms_t2) + + video_duration = video_dict[video_name]["duration_second"] + proposal_list = [] + for idx in range(min(100, len(df))): + tmp_prop={"score":df.score.values[idx], \ + "segment":[max(0,df.xmin.values[idx])*video_duration, \ + min(1,df.xmax.values[idx])*video_duration]} + proposal_list.append(tmp_prop) + result_dict[video_name[2:]] = proposal_list + + +def bmn_post_processing(video_dict, subset, output_path, result_path): + video_list = video_dict.keys() + video_list = list(video_dict.keys()) + global result_dict + result_dict = mp.Manager().dict() + pp_num = 12 + + num_videos = len(video_list) + num_videos_per_thread = int(num_videos / pp_num) + processes = [] + for tid in range(pp_num - 1): + tmp_video_list = video_list[tid * num_videos_per_thread:(tid + 1) * + num_videos_per_thread] + p = mp.Process( + target=video_process, + args=(tmp_video_list, video_dict, output_path, result_dict)) + p.start() + processes.append(p) + tmp_video_list = video_list[(pp_num - 1) * num_videos_per_thread:] + p = mp.Process( + target=video_process, + args=(tmp_video_list, video_dict, output_path, result_dict)) + p.start() + processes.append(p) + for p in processes: + p.join() + + result_dict = dict(result_dict) + output_dict = { + "version": "VERSION 1.3", + "results": result_dict, + "external_data": {} + } + outfile = open( + os.path.join(result_path, "bmn_results_%s.json" % subset), "w") + + json.dump(output_dict, outfile) + outfile.close() + + +def _get_interp1d_bin_mask(seg_xmin, seg_xmax, tscale, num_sample, + num_sample_perbin): + """ generate sample mask for a boundary-matching pair """ + plen = float(seg_xmax - seg_xmin) + plen_sample = plen / (num_sample * num_sample_perbin - 1.0) + total_samples = [ + seg_xmin + plen_sample * ii + for ii in range(num_sample * num_sample_perbin) + ] + p_mask = [] + for idx in range(num_sample): + bin_samples = total_samples[idx * num_sample_perbin:(idx + 1) * + num_sample_perbin] + bin_vector = np.zeros([tscale]) + for sample in bin_samples: + sample_upper = math.ceil(sample) + sample_decimal, sample_down = math.modf(sample) + if int(sample_down) <= (tscale - 1) and int(sample_down) >= 0: + bin_vector[int(sample_down)] += 1 - sample_decimal + if int(sample_upper) <= (tscale - 1) and int(sample_upper) >= 0: + bin_vector[int(sample_upper)] += sample_decimal + bin_vector = 1.0 / num_sample_perbin * bin_vector + p_mask.append(bin_vector) + p_mask = np.stack(p_mask, axis=1) + return p_mask + + +def get_interp1d_mask(tscale, dscale, prop_boundary_ratio, num_sample, + num_sample_perbin): + """ generate sample mask for each point in Boundary-Matching Map """ + mask_mat = [] + for start_index in range(tscale): + mask_mat_vector = [] + for duration_index in range(dscale): + if start_index + duration_index < tscale: + p_xmin = start_index + p_xmax = start_index + duration_index + center_len = float(p_xmax - p_xmin) + 1 + sample_xmin = p_xmin - center_len * prop_boundary_ratio + sample_xmax = p_xmax + center_len * prop_boundary_ratio + p_mask = _get_interp1d_bin_mask(sample_xmin, sample_xmax, + tscale, num_sample, + num_sample_perbin) + else: + p_mask = np.zeros([tscale, num_sample]) + mask_mat_vector.append(p_mask) + mask_mat_vector = np.stack(mask_mat_vector, axis=2) + mask_mat.append(mask_mat_vector) + mask_mat = np.stack(mask_mat, axis=3) + mask_mat = mask_mat.astype(np.float32) + + sample_mask = np.reshape(mask_mat, [tscale, 
-1]) + return sample_mask diff --git a/dygraph/bmn/config_utils.py b/dygraph/bmn/config_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..cb59a25c316d3e435a9f74c453b61e0362f85d38 --- /dev/null +++ b/dygraph/bmn/config_utils.py @@ -0,0 +1,85 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +import yaml +import logging + +logger = logging.getLogger(__name__) + +CONFIG_SECS = [ + 'train', + 'valid', + 'test', + 'infer', +] + + +class AttrDict(dict): + def __getattr__(self, key): + return self[key] + + def __setattr__(self, key, value): + if key in self.__dict__: + self.__dict__[key] = value + else: + self[key] = value + + +def parse_config(cfg_file): + """Load a config file into AttrDict""" + with open(cfg_file, 'r') as fopen: + yaml_config = AttrDict(yaml.load(fopen, Loader=yaml.Loader)) + create_attr_dict(yaml_config) + return yaml_config + + +def create_attr_dict(yaml_config): + from ast import literal_eval + for key, value in yaml_config.items(): + if type(value) is dict: + yaml_config[key] = value = AttrDict(value) + if isinstance(value, str): + try: + value = literal_eval(value) + except BaseException: + pass + if isinstance(value, AttrDict): + create_attr_dict(yaml_config[key]) + else: + yaml_config[key] = value + return + + +def merge_configs(cfg, sec, args_dict): + assert sec in CONFIG_SECS, "invalid config section {}".format(sec) + sec_dict = getattr(cfg, sec.upper()) + for k, v in args_dict.items(): + if v is None: + continue + try: + if hasattr(sec_dict, k): + setattr(sec_dict, k, v) + except: + pass + return cfg + + +def print_configs(cfg, mode): + logger.info("---------------- {:>5} Arguments ----------------".format( + mode)) + for sec, sec_items in cfg.items(): + logger.info("{}:".format(sec)) + for k, v in sec_items.items(): + logger.info(" {}:{}".format(k, v)) + logger.info("-------------------------------------------------") diff --git a/dygraph/bmn/eval.py b/dygraph/bmn/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..2565fa76b07bdc50f842edbdae8de624d89ba00e --- /dev/null +++ b/dygraph/bmn/eval.py @@ -0,0 +1,220 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
+ +import paddle +import paddle.fluid as fluid +import numpy as np +import argparse +import pandas as pd +import os +import sys +import ast +import json +import logging + +from reader import BMNReader +from model import BMN, bmn_loss_func +from bmn_utils import boundary_choose, bmn_post_processing +from config_utils import * + +DATATYPE = 'float32' + +logging.root.handlers = [] +FORMAT = '[%(levelname)s: %(filename)s: %(lineno)4d]: %(message)s' +logging.basicConfig(level=logging.INFO, format=FORMAT, stream=sys.stdout) +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser("BMN test for performance evaluation.") + parser.add_argument( + '--config_file', + type=str, + default='bmn.yaml', + help='path to config file of model') + parser.add_argument( + '--batch_size', type=int, default=1, help='eval batch size.') + parser.add_argument( + '--use_gpu', + type=ast.literal_eval, + default=True, + help='default use gpu.') + parser.add_argument( + '--weights', + type=str, + default="checkpoint/bmn_paddle_dy_final", + help='weight path, None to automatically download weights provided by Paddle.' + ) + parser.add_argument( + '--log_interval', + type=int, + default=1, + help='mini-batch interval to log.') + args = parser.parse_args() + return args + + +def get_dataset_dict(cfg): + anno_file = cfg.MODEL.anno_file + annos = json.load(open(anno_file)) + subset = cfg.TEST.subset + video_dict = {} + for video_name in annos.keys(): + video_subset = annos[video_name]["subset"] + if subset in video_subset: + video_dict[video_name] = annos[video_name] + video_list = list(video_dict.keys()) + video_list.sort() + return video_dict, video_list + + +def gen_props(pred_bm, pred_start, pred_end, fid, video_list, cfg, mode='test'): + if mode == 'infer': + output_path = cfg.INFER.output_path + else: + output_path = cfg.TEST.output_path + tscale = cfg.MODEL.tscale + dscale = cfg.MODEL.dscale + snippet_xmins = [1.0 / tscale * i for i in range(tscale)] + snippet_xmaxs = [1.0 / tscale * i for i in range(1, tscale + 1)] + cols = ["xmin", "xmax", "score"] + + video_name = video_list[fid] + pred_bm = pred_bm[0, 0, :, :] * pred_bm[0, 1, :, :] + start_mask = boundary_choose(pred_start) + start_mask[0] = 1. + end_mask = boundary_choose(pred_end) + end_mask[-1] = 1. 
+ score_vector_list = [] + for idx in range(dscale): + for jdx in range(tscale): + start_index = jdx + end_index = start_index + idx + if end_index < tscale and start_mask[start_index] == 1 and end_mask[ + end_index] == 1: + xmin = snippet_xmins[start_index] + xmax = snippet_xmaxs[end_index] + xmin_score = pred_start[start_index] + xmax_score = pred_end[end_index] + bm_score = pred_bm[idx, jdx] + conf_score = xmin_score * xmax_score * bm_score + score_vector_list.append([xmin, xmax, conf_score]) + + score_vector_list = np.stack(score_vector_list) + video_df = pd.DataFrame(score_vector_list, columns=cols) + video_df.to_csv( + os.path.join(output_path, "%s.csv" % video_name), index=False) + + +# Performance Evaluation +def test_bmn(args): + config = parse_config(args.config_file) + test_config = merge_configs(config, 'test', vars(args)) + print_configs(test_config, "Test") + + if not os.path.isdir(test_config.TEST.output_path): + os.makedirs(test_config.TEST.output_path) + if not os.path.isdir(test_config.TEST.result_path): + os.makedirs(test_config.TEST.result_path) + + if not args.use_gpu: + place = fluid.CPUPlace() + else: + place = fluid.CUDAPlace(0) + + with fluid.dygraph.guard(place): + bmn = BMN(test_config) + + # load checkpoint + if args.weights: + assert os.path.exists(args.weights + '.pdparams' + ), "Given weight dir {} not exist.".format( + args.weights) + + logger.info('load test weights from {}'.format(args.weights)) + model_dict, _ = fluid.load_dygraph(args.weights) + bmn.set_dict(model_dict) + + reader = BMNReader(mode="test", cfg=test_config) + test_reader = reader.create_reader() + + aggr_loss = 0.0 + aggr_tem_loss = 0.0 + aggr_pem_reg_loss = 0.0 + aggr_pem_cls_loss = 0.0 + aggr_batch_size = 0 + video_dict, video_list = get_dataset_dict(test_config) + + bmn.eval() + for batch_id, data in enumerate(test_reader()): + video_feat = np.array([item[0] for item in data]).astype(DATATYPE) + gt_iou_map = np.array([item[1] for item in data]).astype(DATATYPE) + gt_start = np.array([item[2] for item in data]).astype(DATATYPE) + gt_end = np.array([item[3] for item in data]).astype(DATATYPE) + video_idx = [item[4] for item in data][0] #batch_size=1 by default + + x_data = fluid.dygraph.base.to_variable(video_feat) + gt_iou_map = fluid.dygraph.base.to_variable(gt_iou_map) + gt_start = fluid.dygraph.base.to_variable(gt_start) + gt_end = fluid.dygraph.base.to_variable(gt_end) + gt_iou_map.stop_gradient = True + gt_start.stop_gradient = True + gt_end.stop_gradient = True + + pred_bm, pred_start, pred_end = bmn(x_data) + loss, tem_loss, pem_reg_loss, pem_cls_loss = bmn_loss_func( + pred_bm, pred_start, pred_end, gt_iou_map, gt_start, gt_end, + test_config) + + pred_bm = pred_bm.numpy() + pred_start = pred_start[0].numpy() + pred_end = pred_end[0].numpy() + aggr_loss += np.mean(loss.numpy()) + aggr_tem_loss += np.mean(tem_loss.numpy()) + aggr_pem_reg_loss += np.mean(pem_reg_loss.numpy()) + aggr_pem_cls_loss += np.mean(pem_cls_loss.numpy()) + aggr_batch_size += 1 + + if batch_id % args.log_interval == 0: + logger.info("Processing................ 
batch {}".format( + batch_id)) + + gen_props( + pred_bm, + pred_start, + pred_end, + video_idx, + video_list, + test_config, + mode='test') + + avg_loss = aggr_loss / aggr_batch_size + avg_tem_loss = aggr_tem_loss / aggr_batch_size + avg_pem_reg_loss = aggr_pem_reg_loss / aggr_batch_size + avg_pem_cls_loss = aggr_pem_cls_loss / aggr_batch_size + + logger.info('[EVAL] \tAvg_oss = {}, \tAvg_tem_loss = {}, \tAvg_pem_reg_loss = {}, \tAvg_pem_cls_loss = {}'.format( + '%.04f' % avg_loss, '%.04f' % avg_tem_loss, \ + '%.04f' % avg_pem_reg_loss, '%.04f' % avg_pem_cls_loss)) + + logger.info("Post_processing....This may take a while") + bmn_post_processing(video_dict, test_config.TEST.subset, + test_config.TEST.output_path, + test_config.TEST.result_path) + logger.info("[EVAL] eval finished") + + +if __name__ == '__main__': + args = parse_args() + test_bmn(args) diff --git a/dygraph/bmn/eval_anet_prop.py b/dygraph/bmn/eval_anet_prop.py new file mode 100644 index 0000000000000000000000000000000000000000..2a91c6effff8083ad54180c0931585731f2fa6d8 --- /dev/null +++ b/dygraph/bmn/eval_anet_prop.py @@ -0,0 +1,110 @@ +''' +Calculate AR@N and AUC; +Modefied from ActivityNet Gitub repository](https://github.com/activitynet/ActivityNet.git) +''' + +import sys +sys.path.append('./Evaluation') + +from eval_proposal import ANETproposal +import numpy as np +import argparse +import os + +parser = argparse.ArgumentParser("Eval AR vs AN of proposal") +parser.add_argument( + '--eval_file', + type=str, + default='bmn_results_validation.json', + help='name of results file to eval') + + +def run_evaluation(ground_truth_filename, + proposal_filename, + max_avg_nr_proposals=100, + tiou_thresholds=np.linspace(0.5, 0.95, 10), + subset='validation'): + + anet_proposal = ANETproposal( + ground_truth_filename, + proposal_filename, + tiou_thresholds=tiou_thresholds, + max_avg_nr_proposals=max_avg_nr_proposals, + subset=subset, + verbose=True, + check_status=False) + anet_proposal.evaluate() + recall = anet_proposal.recall + average_recall = anet_proposal.avg_recall + average_nr_proposals = anet_proposal.proposals_per_video + + return (average_nr_proposals, average_recall, recall) + + +def plot_metric(average_nr_proposals, + average_recall, + recall, + tiou_thresholds=np.linspace(0.5, 0.95, 10)): + fn_size = 14 + plt.figure(num=None, figsize=(12, 8)) + ax = plt.subplot(1, 1, 1) + + colors = [ + 'k', 'r', 'yellow', 'b', 'c', 'm', 'b', 'pink', 'lawngreen', 'indigo' + ] + area_under_curve = np.zeros_like(tiou_thresholds) + for i in range(recall.shape[0]): + area_under_curve[i] = np.trapz(recall[i], average_nr_proposals) + + for idx, tiou in enumerate(tiou_thresholds[::2]): + ax.plot( + average_nr_proposals, + recall[2 * idx, :], + color=colors[idx + 1], + label="tiou=[" + str(tiou) + "], area=" + str( + int(area_under_curve[2 * idx] * 100) / 100.), + linewidth=4, + linestyle='--', + marker=None) + + # Plots Average Recall vs Average number of proposals. 
+ ax.plot( + average_nr_proposals, + average_recall, + color=colors[0], + label="tiou = 0.5:0.05:0.95," + " area=" + str( + int(np.trapz(average_recall, average_nr_proposals) * 100) / 100.), + linewidth=4, + linestyle='-', + marker=None) + + handles, labels = ax.get_legend_handles_labels() + ax.legend( + [handles[-1]] + handles[:-1], [labels[-1]] + labels[:-1], loc='best') + + plt.ylabel('Average Recall', fontsize=fn_size) + plt.xlabel('Average Number of Proposals per Video', fontsize=fn_size) + plt.grid(b=True, which="both") + plt.ylim([0, 1.0]) + plt.setp(plt.axes().get_xticklabels(), fontsize=fn_size) + plt.setp(plt.axes().get_yticklabels(), fontsize=fn_size) + plt.show() + + +if __name__ == "__main__": + args = parser.parse_args() + eval_file = args.eval_file + eval_file_path = os.path.join("evaluate_results", eval_file) + uniform_average_nr_proposals_valid, uniform_average_recall_valid, uniform_recall_valid = run_evaluation( + "./Evaluation/data/activity_net_1_3_new.json", + eval_file_path, + max_avg_nr_proposals=100, + tiou_thresholds=np.linspace(0.5, 0.95, 10), + subset='validation') + + print("AR@1; AR@5; AR@10; AR@100") + print("%.02f %.02f %.02f %.02f" % + (100 * np.mean(uniform_recall_valid[:, 0]), + 100 * np.mean(uniform_recall_valid[:, 4]), + 100 * np.mean(uniform_recall_valid[:, 9]), + 100 * np.mean(uniform_recall_valid[:, -1]))) diff --git a/dygraph/bmn/infer.list b/dygraph/bmn/infer.list new file mode 100644 index 0000000000000000000000000000000000000000..44768f089e70e40913d9787571ae0a7151232558 --- /dev/null +++ b/dygraph/bmn/infer.list @@ -0,0 +1 @@ +{"v_4Lu8ECLHvK4": {"duration_second": 124.23, "subset": "validation", "duration_frame": 3718, "annotations": [{"segment": [0.01, 124.22675736961452], "label": "Playing kickball"}], "feature_frame": 3712}, "v_5qsXmDi8d74": {"duration_second": 186.59599999999998, "subset": "validation", "duration_frame": 5596, "annotations": [{"segment": [61.402645865834636, 173.44250858034323], "label": "Sumo"}], "feature_frame": 5600}, "v_2D22fVcAcyo": {"duration_second": 215.78400000000002, "subset": "validation", "duration_frame": 6473, "annotations": [{"segment": [10.433652106084244, 25.242706708268333], "label": "Slacklining"}, {"segment": [38.368914196567864, 66.30417628705149], "label": "Slacklining"}, {"segment": [74.71841185647428, 91.2103135725429], "label": "Slacklining"}, {"segment": [103.66338221528862, 126.8866723868955], "label": "Slacklining"}, {"segment": [132.27178315132608, 180.0855070202808], "label": "Slacklining"}], "feature_frame": 6464}, "v_wPYr19iFxhw": {"duration_second": 56.611000000000004, "subset": "validation", "duration_frame": 1693, "annotations": [{"segment": [0.01, 56.541], "label": "Welding"}], "feature_frame": 1696}, "v_K6Tm5xHkJ5c": {"duration_second": 114.64, "subset": "validation", "duration_frame": 2745, "annotations": [{"segment": [25.81087088455538, 50.817943021840875], "label": "Playing accordion"}, {"segment": [52.78278440405616, 110.6562942074883], "label": "Playing accordion"}], "feature_frame": 2736}} \ No newline at end of file diff --git a/dygraph/bmn/model.py b/dygraph/bmn/model.py new file mode 100644 index 0000000000000000000000000000000000000000..f77e8e0e95bc4e0d397ba7247327c5bf7038c8e4 --- /dev/null +++ b/dygraph/bmn/model.py @@ -0,0 +1,339 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. 
+#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +import paddle +import paddle.fluid as fluid +from paddle.fluid import ParamAttr +import numpy as np +import math + +from bmn_utils import get_interp1d_mask + +DATATYPE = 'float32' + + +# Net +class Conv1D(fluid.dygraph.Layer): + def __init__(self, + prefix, + num_channels=256, + num_filters=256, + size_k=3, + padding=1, + groups=1, + act="relu"): + super(Conv1D, self).__init__() + fan_in = num_channels * size_k * 1 + k = 1. / math.sqrt(fan_in) + param_attr = ParamAttr( + name=prefix + "_w", + initializer=fluid.initializer.Uniform( + low=-k, high=k)) + bias_attr = ParamAttr( + name=prefix + "_b", + initializer=fluid.initializer.Uniform( + low=-k, high=k)) + + self._conv2d = fluid.dygraph.Conv2D( + num_channels=num_channels, + num_filters=num_filters, + filter_size=(1, size_k), + stride=1, + padding=(0, padding), + groups=groups, + act=act, + param_attr=param_attr, + bias_attr=bias_attr) + + def forward(self, x): + x = fluid.layers.unsqueeze(input=x, axes=[2]) + x = self._conv2d(x) + x = fluid.layers.squeeze(input=x, axes=[2]) + return x + + +class BMN(fluid.dygraph.Layer): + def __init__(self, cfg): + super(BMN, self).__init__() + + #init config + self.tscale = cfg.MODEL.tscale + self.dscale = cfg.MODEL.dscale + self.prop_boundary_ratio = cfg.MODEL.prop_boundary_ratio + self.num_sample = cfg.MODEL.num_sample + self.num_sample_perbin = cfg.MODEL.num_sample_perbin + + self.hidden_dim_1d = 256 + self.hidden_dim_2d = 128 + self.hidden_dim_3d = 512 + + # Base Module + self.b_conv1 = Conv1D( + prefix="Base_1", + num_channels=400, + num_filters=self.hidden_dim_1d, + size_k=3, + padding=1, + groups=4, + act="relu") + self.b_conv2 = Conv1D( + prefix="Base_2", + num_filters=self.hidden_dim_1d, + size_k=3, + padding=1, + groups=4, + act="relu") + + # Temporal Evaluation Module + self.ts_conv1 = Conv1D( + prefix="TEM_s1", + num_filters=self.hidden_dim_1d, + size_k=3, + padding=1, + groups=4, + act="relu") + self.ts_conv2 = Conv1D( + prefix="TEM_s2", num_filters=1, size_k=1, padding=0, act="sigmoid") + self.te_conv1 = Conv1D( + prefix="TEM_e1", + num_filters=self.hidden_dim_1d, + size_k=3, + padding=1, + groups=4, + act="relu") + self.te_conv2 = Conv1D( + prefix="TEM_e2", num_filters=1, size_k=1, padding=0, act="sigmoid") + + #Proposal Evaluation Module + self.p_conv1 = Conv1D( + prefix="PEM_1d", + num_filters=self.hidden_dim_2d, + size_k=3, + padding=1, + act="relu") + + # init to speed up + sample_mask = get_interp1d_mask(self.tscale, self.dscale, + self.prop_boundary_ratio, + self.num_sample, self.num_sample_perbin) + self.sample_mask = fluid.dygraph.base.to_variable(sample_mask) + self.sample_mask.stop_gradient = True + + self.p_conv3d1 = fluid.dygraph.Conv3D( + num_channels=128, + num_filters=self.hidden_dim_3d, + filter_size=(self.num_sample, 1, 1), + stride=(self.num_sample, 1, 1), + padding=0, + act="relu", + param_attr=ParamAttr(name="PEM_3d1_w"), + bias_attr=ParamAttr(name="PEM_3d1_b")) + + self.p_conv2d1 = fluid.dygraph.Conv2D( + num_channels=512, + num_filters=self.hidden_dim_2d, + filter_size=1, + stride=1, + padding=0, + act="relu", + param_attr=ParamAttr(name="PEM_2d1_w"), + 
bias_attr=ParamAttr(name="PEM_2d1_b")) + self.p_conv2d2 = fluid.dygraph.Conv2D( + num_channels=128, + num_filters=self.hidden_dim_2d, + filter_size=3, + stride=1, + padding=1, + act="relu", + param_attr=ParamAttr(name="PEM_2d2_w"), + bias_attr=ParamAttr(name="PEM_2d2_b")) + self.p_conv2d3 = fluid.dygraph.Conv2D( + num_channels=128, + num_filters=self.hidden_dim_2d, + filter_size=3, + stride=1, + padding=1, + act="relu", + param_attr=ParamAttr(name="PEM_2d3_w"), + bias_attr=ParamAttr(name="PEM_2d3_b")) + self.p_conv2d4 = fluid.dygraph.Conv2D( + num_channels=128, + num_filters=2, + filter_size=1, + stride=1, + padding=0, + act="sigmoid", + param_attr=ParamAttr(name="PEM_2d4_w"), + bias_attr=ParamAttr(name="PEM_2d4_b")) + + def forward(self, x): + #Base Module + x = self.b_conv1(x) + x = self.b_conv2(x) + + #TEM + xs = self.ts_conv1(x) + xs = self.ts_conv2(xs) + xs = fluid.layers.squeeze(xs, axes=[1]) + xe = self.te_conv1(x) + xe = self.te_conv2(xe) + xe = fluid.layers.squeeze(xe, axes=[1]) + + #PEM + xp = self.p_conv1(x) + #BM layer + xp = fluid.layers.matmul(xp, self.sample_mask) + xp = fluid.layers.reshape( + xp, shape=[0, 0, -1, self.dscale, self.tscale]) + + xp = self.p_conv3d1(xp) + xp = fluid.layers.squeeze(xp, axes=[2]) + xp = self.p_conv2d1(xp) + xp = self.p_conv2d2(xp) + xp = self.p_conv2d3(xp) + xp = self.p_conv2d4(xp) + return xp, xs, xe + + +def bmn_loss_func(pred_bm, pred_start, pred_end, gt_iou_map, gt_start, gt_end, + cfg): + def _get_mask(cfg): + dscale = cfg.MODEL.dscale + tscale = cfg.MODEL.tscale + bm_mask = [] + for idx in range(dscale): + mask_vector = [1 for i in range(tscale - idx) + ] + [0 for i in range(idx)] + bm_mask.append(mask_vector) + bm_mask = np.array(bm_mask, dtype=np.float32) + self_bm_mask = fluid.layers.create_global_var( + shape=[dscale, tscale], value=0, dtype=DATATYPE, persistable=True) + fluid.layers.assign(bm_mask, self_bm_mask) + self_bm_mask.stop_gradient = True + return self_bm_mask + + def tem_loss_func(pred_start, pred_end, gt_start, gt_end): + def bi_loss(pred_score, gt_label): + pred_score = fluid.layers.reshape( + x=pred_score, shape=[-1], inplace=False) + gt_label = fluid.layers.reshape( + x=gt_label, shape=[-1], inplace=False) + gt_label.stop_gradient = True + pmask = fluid.layers.cast(x=(gt_label > 0.5), dtype=DATATYPE) + num_entries = fluid.layers.cast( + fluid.layers.shape(pmask), dtype=DATATYPE) + num_positive = fluid.layers.cast( + fluid.layers.reduce_sum(pmask), dtype=DATATYPE) + ratio = num_entries / num_positive + coef_0 = 0.5 * ratio / (ratio - 1) + coef_1 = 0.5 * ratio + epsilon = 0.000001 + temp = fluid.layers.log(pred_score + epsilon) + loss_pos = fluid.layers.elementwise_mul( + fluid.layers.log(pred_score + epsilon), pmask) + loss_pos = coef_1 * fluid.layers.reduce_mean(loss_pos) + loss_neg = fluid.layers.elementwise_mul( + fluid.layers.log(1.0 - pred_score + epsilon), (1.0 - pmask)) + loss_neg = coef_0 * fluid.layers.reduce_mean(loss_neg) + loss = -1 * (loss_pos + loss_neg) + return loss + + loss_start = bi_loss(pred_start, gt_start) + loss_end = bi_loss(pred_end, gt_end) + loss = loss_start + loss_end + return loss + + def pem_reg_loss_func(pred_score, gt_iou_map, mask): + + gt_iou_map = fluid.layers.elementwise_mul(gt_iou_map, mask) + + u_hmask = fluid.layers.cast(x=gt_iou_map > 0.7, dtype=DATATYPE) + u_mmask = fluid.layers.logical_and(gt_iou_map <= 0.7, gt_iou_map > 0.3) + u_mmask = fluid.layers.cast(x=u_mmask, dtype=DATATYPE) + u_lmask = fluid.layers.logical_and(gt_iou_map <= 0.3, gt_iou_map >= 0.) 
+ u_lmask = fluid.layers.cast(x=u_lmask, dtype=DATATYPE) + u_lmask = fluid.layers.elementwise_mul(u_lmask, mask) + + num_h = fluid.layers.cast( + fluid.layers.reduce_sum(u_hmask), dtype=DATATYPE) + num_m = fluid.layers.cast( + fluid.layers.reduce_sum(u_mmask), dtype=DATATYPE) + num_l = fluid.layers.cast( + fluid.layers.reduce_sum(u_lmask), dtype=DATATYPE) + + r_m = num_h / num_m + u_smmask = fluid.layers.uniform_random( + shape=[gt_iou_map.shape[1], gt_iou_map.shape[2]], + dtype=DATATYPE, + min=0.0, + max=1.0) + u_smmask = fluid.layers.elementwise_mul(u_mmask, u_smmask) + u_smmask = fluid.layers.cast(x=(u_smmask > (1. - r_m)), dtype=DATATYPE) + + r_l = num_h / num_l + u_slmask = fluid.layers.uniform_random( + shape=[gt_iou_map.shape[1], gt_iou_map.shape[2]], + dtype=DATATYPE, + min=0.0, + max=1.0) + u_slmask = fluid.layers.elementwise_mul(u_lmask, u_slmask) + u_slmask = fluid.layers.cast(x=(u_slmask > (1. - r_l)), dtype=DATATYPE) + + weights = u_hmask + u_smmask + u_slmask + weights.stop_gradient = True + loss = fluid.layers.square_error_cost(pred_score, gt_iou_map) + loss = fluid.layers.elementwise_mul(loss, weights) + loss = 0.5 * fluid.layers.reduce_sum(loss) / fluid.layers.reduce_sum( + weights) + + return loss + + def pem_cls_loss_func(pred_score, gt_iou_map, mask): + gt_iou_map = fluid.layers.elementwise_mul(gt_iou_map, mask) + gt_iou_map.stop_gradient = True + pmask = fluid.layers.cast(x=(gt_iou_map > 0.9), dtype=DATATYPE) + nmask = fluid.layers.cast(x=(gt_iou_map <= 0.9), dtype=DATATYPE) + nmask = fluid.layers.elementwise_mul(nmask, mask) + + num_positive = fluid.layers.reduce_sum(pmask) + num_entries = num_positive + fluid.layers.reduce_sum(nmask) + ratio = num_entries / num_positive + coef_0 = 0.5 * ratio / (ratio - 1) + coef_1 = 0.5 * ratio + epsilon = 0.000001 + loss_pos = fluid.layers.elementwise_mul( + fluid.layers.log(pred_score + epsilon), pmask) + loss_pos = coef_1 * fluid.layers.reduce_sum(loss_pos) + loss_neg = fluid.layers.elementwise_mul( + fluid.layers.log(1.0 - pred_score + epsilon), nmask) + loss_neg = coef_0 * fluid.layers.reduce_sum(loss_neg) + loss = -1 * (loss_pos + loss_neg) / num_entries + return loss + + pred_bm_reg = fluid.layers.squeeze( + fluid.layers.slice( + pred_bm, axes=[1], starts=[0], ends=[1]), axes=[1]) + pred_bm_cls = fluid.layers.squeeze( + fluid.layers.slice( + pred_bm, axes=[1], starts=[1], ends=[2]), axes=[1]) + + bm_mask = _get_mask(cfg) + + pem_reg_loss = pem_reg_loss_func(pred_bm_reg, gt_iou_map, bm_mask) + pem_cls_loss = pem_cls_loss_func(pred_bm_cls, gt_iou_map, bm_mask) + + tem_loss = tem_loss_func(pred_start, pred_end, gt_start, gt_end) + + loss = tem_loss + 10 * pem_reg_loss + pem_cls_loss + return loss, tem_loss, pem_reg_loss, pem_cls_loss diff --git a/dygraph/bmn/predict.py b/dygraph/bmn/predict.py new file mode 100644 index 0000000000000000000000000000000000000000..363e15b0d36c8065436531432109dbc60296b8ee --- /dev/null +++ b/dygraph/bmn/predict.py @@ -0,0 +1,147 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+#See the License for the specific language governing permissions and +#limitations under the License. + +import paddle +import paddle.fluid as fluid +import numpy as np +import argparse +import sys +import os +import ast +import json + +from model import BMN +from eval import gen_props +from reader import BMNReader +from bmn_utils import bmn_post_processing +from config_utils import * + +DATATYPE = 'float32' + +logging.root.handlers = [] +FORMAT = '[%(levelname)s: %(filename)s: %(lineno)4d]: %(message)s' +logging.basicConfig(level=logging.INFO, format=FORMAT, stream=sys.stdout) +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser("BMN test for performance evaluation.") + parser.add_argument( + '--config_file', + type=str, + default='bmn.yaml', + help='path to config file of model') + parser.add_argument( + '--batch_size', + type=int, + default=None, + help='training batch size. None to use config file setting.') + parser.add_argument( + '--use_gpu', + type=ast.literal_eval, + default=True, + help='default use gpu.') + parser.add_argument( + '--weights', + type=str, + default="checkpoint/bmn_paddle_dy_final", + help='weight path, None to automatically download weights provided by Paddle.' + ) + parser.add_argument( + '--save_dir', + type=str, + default="predict_results/", + help='output dir path, default to use ./predict_results/') + parser.add_argument( + '--log_interval', + type=int, + default=1, + help='mini-batch interval to log.') + args = parser.parse_args() + return args + + +def get_dataset_dict(cfg): + file_list = cfg.INFER.filelist + annos = json.load(open(file_list)) + video_dict = {} + for video_name in annos.keys(): + video_dict[video_name] = annos[video_name] + video_list = list(video_dict.keys()) + video_list.sort() + return video_dict, video_list + + +# Prediction +def infer_bmn(args): + config = parse_config(args.config_file) + infer_config = merge_configs(config, 'infer', vars(args)) + print_configs(infer_config, "Infer") + + if not os.path.isdir(infer_config.INFER.output_path): + os.makedirs(infer_config.INFER.output_path) + if not os.path.isdir(infer_config.INFER.result_path): + os.makedirs(infer_config.INFER.result_path) + place = fluid.CUDAPlace(0) + with fluid.dygraph.guard(place): + bmn = BMN(infer_config) + # load checkpoint + if args.weights: + assert os.path.exists(args.weights + ".pdparams" + ), "Given weight dir {} not exist.".format( + args.weights) + + logger.info('load test weights from {}'.format(args.weights)) + model_dict, _ = fluid.load_dygraph(args.weights) + bmn.set_dict(model_dict) + + reader = BMNReader(mode="infer", cfg=infer_config) + infer_reader = reader.create_reader() + + video_dict, video_list = get_dataset_dict(infer_config) + + bmn.eval() + for batch_id, data in enumerate(infer_reader()): + video_feat = np.array([item[0] for item in data]).astype(DATATYPE) + video_idx = [item[1] for item in data][0] #batch_size=1 by default + + x_data = fluid.dygraph.base.to_variable(video_feat) + + pred_bm, pred_start, pred_end = bmn(x_data) + + pred_bm = pred_bm.numpy() + pred_start = pred_start[0].numpy() + pred_end = pred_end[0].numpy() + + logger.info("Processing................ 
batch {}".format(batch_id)) + gen_props( + pred_bm, + pred_start, + pred_end, + video_idx, + video_list, + infer_config, + mode='infer') + + logger.info("Post_processing....This may take a while") + bmn_post_processing(video_dict, infer_config.INFER.subset, + infer_config.INFER.output_path, + infer_config.INFER.result_path) + logger.info("[INFER] infer finished. Results saved in {}".format( + args.save_dir) + "bmn_results_test.json") + + +if __name__ == '__main__': + args = parse_args() + infer_bmn(args) diff --git a/dygraph/bmn/reader.py b/dygraph/bmn/reader.py new file mode 100644 index 0000000000000000000000000000000000000000..11209fee20c6f7e9fb84c98441eb1ff0796b307a --- /dev/null +++ b/dygraph/bmn/reader.py @@ -0,0 +1,290 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +import paddle +import numpy as np +import random +import json +import multiprocessing +import functools +import logging +import platform +import os + +logger = logging.getLogger(__name__) + +from bmn_utils import iou_with_anchors, ioa_with_anchors + + +class BMNReader(): + def __init__(self, mode, cfg): + self.mode = mode + self.tscale = cfg.MODEL.tscale # 100 + self.dscale = cfg.MODEL.dscale # 100 + self.anno_file = cfg.MODEL.anno_file + self.file_list = cfg.INFER.filelist + self.subset = cfg[mode.upper()]['subset'] + self.tgap = 1. 
/ self.tscale + self.feat_path = cfg.MODEL.feat_path + + self.get_dataset_dict() + self.get_match_map() + + self.batch_size = cfg[mode.upper()]['batch_size'] + self.num_threads = cfg[mode.upper()]['num_threads'] + if (mode == 'test') or (mode == 'infer'): + self.num_threads = 1 # set num_threads as 1 for test and infer + + def get_dataset_dict(self): + assert (os.path.exists(self.feat_path)), "Input feature path not exists" + assert (os.listdir(self.feat_path)), "Input feature file not exists" + self.video_dict = {} + if self.mode == "infer": + annos = json.load(open(self.file_list)) + for video_name in annos.keys(): + self.video_dict[video_name] = annos[video_name] + else: + annos = json.load(open(self.anno_file)) + for video_name in annos.keys(): + video_subset = annos[video_name]["subset"] + if self.subset in video_subset: + self.video_dict[video_name] = annos[video_name] + self.video_list = list(self.video_dict.keys()) + self.video_list.sort() + print("%s subset video numbers: %d" % + (self.subset, len(self.video_list))) + + def get_match_map(self): + match_map = [] + for idx in range(self.tscale): + tmp_match_window = [] + xmin = self.tgap * idx + for jdx in range(1, self.tscale + 1): + xmax = xmin + self.tgap * jdx + tmp_match_window.append([xmin, xmax]) + match_map.append(tmp_match_window) + match_map = np.array(match_map) + match_map = np.transpose(match_map, [1, 0, 2]) + match_map = np.reshape(match_map, [-1, 2]) + self.match_map = match_map + self.anchor_xmin = [self.tgap * i for i in range(self.tscale)] + self.anchor_xmax = [self.tgap * i for i in range(1, self.tscale + 1)] + + def get_video_label(self, video_name): + video_info = self.video_dict[video_name] + video_second = video_info['duration_second'] + video_labels = video_info['annotations'] + + gt_bbox = [] + gt_iou_map = [] + for gt in video_labels: + tmp_start = max(min(1, gt["segment"][0] / video_second), 0) + tmp_end = max(min(1, gt["segment"][1] / video_second), 0) + gt_bbox.append([tmp_start, tmp_end]) + tmp_gt_iou_map = iou_with_anchors( + self.match_map[:, 0], self.match_map[:, 1], tmp_start, tmp_end) + tmp_gt_iou_map = np.reshape(tmp_gt_iou_map, + [self.dscale, self.tscale]) + gt_iou_map.append(tmp_gt_iou_map) + gt_iou_map = np.array(gt_iou_map) + gt_iou_map = np.max(gt_iou_map, axis=0) + + gt_bbox = np.array(gt_bbox) + gt_xmins = gt_bbox[:, 0] + gt_xmaxs = gt_bbox[:, 1] + gt_len_small = 3 * self.tgap + gt_start_bboxs = np.stack( + (gt_xmins - gt_len_small / 2, gt_xmins + gt_len_small / 2), axis=1) + gt_end_bboxs = np.stack( + (gt_xmaxs - gt_len_small / 2, gt_xmaxs + gt_len_small / 2), axis=1) + + match_score_start = [] + for jdx in range(len(self.anchor_xmin)): + match_score_start.append( + np.max( + ioa_with_anchors(self.anchor_xmin[jdx], self.anchor_xmax[ + jdx], gt_start_bboxs[:, 0], gt_start_bboxs[:, 1]))) + match_score_end = [] + for jdx in range(len(self.anchor_xmin)): + match_score_end.append( + np.max( + ioa_with_anchors(self.anchor_xmin[jdx], self.anchor_xmax[ + jdx], gt_end_bboxs[:, 0], gt_end_bboxs[:, 1]))) + + gt_start = np.array(match_score_start) + gt_end = np.array(match_score_end) + return gt_iou_map, gt_start, gt_end + + def load_file(self, video_name): + file_name = video_name + ".npy" + file_path = os.path.join(self.feat_path, file_name) + video_feat = np.load(file_path) + video_feat = video_feat.T + video_feat = video_feat.astype("float32") + return video_feat + + def create_reader(self): + """reader creator for bmn model""" + if self.mode == 'infer': + return self.make_infer_reader() + if 
self.num_threads == 1: + return self.make_reader() + else: + sysstr = platform.system() + if sysstr == 'Windows': + return self.make_multithread_reader() + else: + return self.make_multiprocess_reader() + + def make_infer_reader(self): + """reader for inference""" + + def reader(): + batch_out = [] + for video_name in self.video_list: + video_idx = self.video_list.index(video_name) + video_feat = self.load_file(video_name) + batch_out.append((video_feat, video_idx)) + + if len(batch_out) == self.batch_size: + yield batch_out + batch_out = [] + + return reader + + def make_reader(self): + """single process reader""" + + def reader(): + video_list = self.video_list + if self.mode == 'train': + random.shuffle(video_list) + + batch_out = [] + for video_name in video_list: + video_idx = video_list.index(video_name) + video_feat = self.load_file(video_name) + gt_iou_map, gt_start, gt_end = self.get_video_label(video_name) + + if self.mode == 'train' or self.mode == 'valid': + batch_out.append((video_feat, gt_iou_map, gt_start, gt_end)) + elif self.mode == 'test': + batch_out.append( + (video_feat, gt_iou_map, gt_start, gt_end, video_idx)) + else: + raise NotImplementedError('mode {} not implemented'.format( + self.mode)) + if len(batch_out) == self.batch_size: + yield batch_out + batch_out = [] + + return reader + + def make_multithread_reader(self): + def reader(): + if self.mode == 'train': + random.shuffle(self.video_list) + for video_name in self.video_list: + video_idx = self.video_list.index(video_name) + yield [video_name, video_idx] + + def process_data(sample, mode): + video_name = sample[0] + video_idx = sample[1] + video_feat = self.load_file(video_name) + gt_iou_map, gt_start, gt_end = self.get_video_label(video_name) + if mode == 'train' or mode == 'valid': + return (video_feat, gt_iou_map, gt_start, gt_end) + elif mode == 'test': + return (video_feat, gt_iou_map, gt_start, gt_end, video_idx) + else: + raise NotImplementedError('mode {} not implemented'.format( + mode)) + + mapper = functools.partial(process_data, mode=self.mode) + + def batch_reader(): + xreader = paddle.reader.xmap_readers(mapper, reader, + self.num_threads, 1024) + batch = [] + for item in xreader(): + batch.append(item) + if len(batch) == self.batch_size: + yield batch + batch = [] + + return batch_reader + + def make_multiprocess_reader(self): + """multiprocess reader""" + + def read_into_queue(video_list, queue): + + batch_out = [] + for video_name in video_list: + video_idx = video_list.index(video_name) + video_feat = self.load_file(video_name) + gt_iou_map, gt_start, gt_end = self.get_video_label(video_name) + + if self.mode == 'train' or self.mode == 'valid': + batch_out.append((video_feat, gt_iou_map, gt_start, gt_end)) + elif self.mode == 'test': + batch_out.append( + (video_feat, gt_iou_map, gt_start, gt_end, video_idx)) + else: + raise NotImplementedError('mode {} not implemented'.format( + self.mode)) + + if len(batch_out) == self.batch_size: + queue.put(batch_out) + batch_out = [] + queue.put(None) + + def queue_reader(): + video_list = self.video_list + if self.mode == 'train': + random.shuffle(video_list) + + n = self.num_threads + queue_size = 20 + reader_lists = [None] * n + file_num = int(len(video_list) // n) + for i in range(n): + if i < len(reader_lists) - 1: + tmp_list = video_list[i * file_num:(i + 1) * file_num] + else: + tmp_list = video_list[i * file_num:] + reader_lists[i] = tmp_list + + manager = multiprocessing.Manager() + queue = manager.Queue(queue_size) + p_list = [None] * 
len(reader_lists) + for i in range(len(reader_lists)): + reader_list = reader_lists[i] + p_list[i] = multiprocessing.Process( + target=read_into_queue, args=(reader_list, queue)) + p_list[i].start() + reader_num = len(reader_lists) + finish_num = 0 + while finish_num < reader_num: + sample = queue.get() + if sample is None: + finish_num += 1 + else: + yield sample + for i in range(len(p_list)): + if p_list[i].is_alive(): + p_list[i].join() + + return queue_reader diff --git a/dygraph/bmn/run.sh b/dygraph/bmn/run.sh new file mode 100644 index 0000000000000000000000000000000000000000..b426012056b92f9524271d127595775f281f789a --- /dev/null +++ b/dygraph/bmn/run.sh @@ -0,0 +1,5 @@ +export CUDA_VISIBLE_DEVICES=0,1,2,3 +python -m paddle.distributed.launch \ + --selected_gpus=0,1,2,3 \ + --log_dir ./mylog \ + train.py --use_data_parallel True diff --git a/dygraph/bmn/train.py b/dygraph/bmn/train.py new file mode 100644 index 0000000000000000000000000000000000000000..0171830e2a3b7e02ee3ef3a95a410a803293ec1e --- /dev/null +++ b/dygraph/bmn/train.py @@ -0,0 +1,243 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +import paddle +import paddle.fluid as fluid +import numpy as np +import argparse +import ast +import logging +import sys +import os + +from model import BMN, bmn_loss_func +from reader import BMNReader +from config_utils import * + +DATATYPE = 'float32' + +logging.root.handlers = [] +FORMAT = '[%(levelname)s: %(filename)s: %(lineno)4d]: %(message)s' +logging.basicConfig(level=logging.INFO, format=FORMAT, stream=sys.stdout) +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser("Paddle dynamic graph mode of BMN.") + parser.add_argument( + "--use_data_parallel", + type=ast.literal_eval, + default=False, + help="The flag indicating whether to use data parallel mode to train the model." + ) + parser.add_argument( + '--config_file', + type=str, + default='bmn.yaml', + help='path to config file of model') + parser.add_argument( + '--batch_size', + type=int, + default=None, + help='training batch size. None to use config file setting.') + parser.add_argument( + '--learning_rate', + type=float, + default=0.001, + help='learning rate use for training. None to use config file setting.') + parser.add_argument( + '--resume', + type=str, + default=None, + help='filename to resume training based on previous checkpoints. 
' + 'None for not resuming any checkpoints.') + parser.add_argument( + '--use_gpu', + type=ast.literal_eval, + default=True, + help='default use gpu.') + parser.add_argument( + '--epoch', + type=int, + default=9, + help='epoch number, 0 for read from config file') + parser.add_argument( + '--valid_interval', + type=int, + default=1, + help='validation epoch interval, 0 for no validation.') + parser.add_argument( + '--save_dir', + type=str, + default="checkpoint", + help='path to save train snapshoot') + parser.add_argument( + '--log_interval', + type=int, + default=10, + help='mini-batch interval to log.') + args = parser.parse_args() + return args + + +# Optimizer +def optimizer(cfg, parameter_list): + bd = [cfg.TRAIN.lr_decay_iter] + base_lr = cfg.TRAIN.learning_rate + lr_decay = cfg.TRAIN.learning_rate_decay + l2_weight_decay = cfg.TRAIN.l2_weight_decay + lr = [base_lr, base_lr * lr_decay] + optimizer = fluid.optimizer.Adam( + fluid.layers.piecewise_decay( + boundaries=bd, values=lr), + parameter_list=parameter_list, + regularization=fluid.regularizer.L2DecayRegularizer( + regularization_coeff=l2_weight_decay)) + return optimizer + + +# Validation +def val_bmn(model, config, args): + reader = BMNReader(mode="valid", cfg=config) + val_reader = reader.create_reader() + for batch_id, data in enumerate(val_reader()): + video_feat = np.array([item[0] for item in data]).astype(DATATYPE) + gt_iou_map = np.array([item[1] for item in data]).astype(DATATYPE) + gt_start = np.array([item[2] for item in data]).astype(DATATYPE) + gt_end = np.array([item[3] for item in data]).astype(DATATYPE) + + x_data = fluid.dygraph.base.to_variable(video_feat) + gt_iou_map = fluid.dygraph.base.to_variable(gt_iou_map) + gt_start = fluid.dygraph.base.to_variable(gt_start) + gt_end = fluid.dygraph.base.to_variable(gt_end) + gt_iou_map.stop_gradient = True + gt_start.stop_gradient = True + gt_end.stop_gradient = True + + pred_bm, pred_start, pred_end = model(x_data) + + loss, tem_loss, pem_reg_loss, pem_cls_loss = bmn_loss_func( + pred_bm, pred_start, pred_end, gt_iou_map, gt_start, gt_end, config) + avg_loss = fluid.layers.mean(loss) + + if args.log_interval > 0 and (batch_id % args.log_interval == 0): + logger.info('[VALID] iter {} '.format(batch_id) + + '\tLoss = {}, \ttem_loss = {}, \tpem_reg_loss = {}, \tpem_cls_loss = {}'.format( + '%.04f' % avg_loss.numpy()[0], '%.04f' % tem_loss.numpy()[0], \ + '%.04f' % pem_reg_loss.numpy()[0], '%.04f' % pem_cls_loss.numpy()[0])) + + +# TRAIN +def train_bmn(args): + config = parse_config(args.config_file) + train_config = merge_configs(config, 'train', vars(args)) + valid_config = merge_configs(config, 'valid', vars(args)) + + if not args.use_gpu: + place = fluid.CPUPlace() + elif not args.use_data_parallel: + place = fluid.CUDAPlace(0) + else: + place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id) + + with fluid.dygraph.guard(place): + if args.use_data_parallel: + strategy = fluid.dygraph.parallel.prepare_context() + bmn = BMN(train_config) + adam = optimizer(train_config, parameter_list=bmn.parameters()) + + if args.use_data_parallel: + bmn = fluid.dygraph.parallel.DataParallel(bmn, strategy) + + if args.resume: + # if resume weights is given, load resume weights directly + assert os.path.exists(args.resume + ".pdparams"), \ + "Given resume weight dir {} not exist.".format(args.resume) + + model, _ = fluid.dygraph.load_dygraph(args.resume) + bmn.set_dict(model) + + reader = BMNReader(mode="train", cfg=train_config) + train_reader = reader.create_reader() + if 
args.use_data_parallel: + train_reader = fluid.contrib.reader.distributed_batch_reader( + train_reader) + + for epoch in range(args.epoch): + for batch_id, data in enumerate(train_reader()): + video_feat = np.array( + [item[0] for item in data]).astype(DATATYPE) + gt_iou_map = np.array( + [item[1] for item in data]).astype(DATATYPE) + gt_start = np.array([item[2] for item in data]).astype(DATATYPE) + gt_end = np.array([item[3] for item in data]).astype(DATATYPE) + + x_data = fluid.dygraph.base.to_variable(video_feat) + gt_iou_map = fluid.dygraph.base.to_variable(gt_iou_map) + gt_start = fluid.dygraph.base.to_variable(gt_start) + gt_end = fluid.dygraph.base.to_variable(gt_end) + gt_iou_map.stop_gradient = True + gt_start.stop_gradient = True + gt_end.stop_gradient = True + + pred_bm, pred_start, pred_end = bmn(x_data) + + loss, tem_loss, pem_reg_loss, pem_cls_loss = bmn_loss_func( + pred_bm, pred_start, pred_end, gt_iou_map, gt_start, gt_end, + train_config) + avg_loss = fluid.layers.mean(loss) + + if args.use_data_parallel: + avg_loss = bmn.scale_loss(avg_loss) + avg_loss.backward() + bmn.apply_collective_grads() + else: + avg_loss.backward() + + adam.minimize(avg_loss) + + bmn.clear_gradients() + + if args.log_interval > 0 and ( + batch_id % args.log_interval == 0): + logger.info('[TRAIN] Epoch {}, iter {} '.format(epoch, batch_id) + + '\tLoss = {}, \ttem_loss = {}, \tpem_reg_loss = {}, \tpem_cls_loss = {}'.format( + '%.04f' % avg_loss.numpy()[0], '%.04f' % tem_loss.numpy()[0], \ + '%.04f' % pem_reg_loss.numpy()[0], '%.04f' % pem_cls_loss.numpy()[0])) + + logger.info('[TRAIN] Epoch {} training finished'.format(epoch)) + if not os.path.isdir(args.save_dir): + os.makedirs(args.save_dir) + save_model_name = os.path.join( + args.save_dir, "bmn_paddle_dy" + "_epoch{}".format(epoch)) + fluid.dygraph.save_dygraph(bmn.state_dict(), save_model_name) + + # validation + if args.valid_interval > 0 and (epoch + 1 + ) % args.valid_interval == 0: + bmn.eval() + val_bmn(bmn, valid_config, args) + bmn.train() + + #save final results + if fluid.dygraph.parallel.Env().local_rank == 0: + save_model_name = os.path.join(args.save_dir, + "bmn_paddle_dy" + "_final") + fluid.dygraph.save_dygraph(bmn.state_dict(), save_model_name) + logger.info('[TRAIN] training finished') + + +if __name__ == "__main__": + args = parse_args() + train_bmn(args) diff --git a/dygraph/cycle_gan/README.md b/dygraph/cycle_gan/README.md index 64fea3f958899b1bedebed374bfb1378b474d413..12889d457181245ab286f62f4f43541f3240fdd6 100644 --- a/dygraph/cycle_gan/README.md +++ b/dygraph/cycle_gan/README.md @@ -35,12 +35,24 @@ Cycle GAN 是一种image to image 的图像生成网络,实现了非对称图 ## 数据准备 -本教程使用 cityscapes 数据集 来进行模型的训练测试工作,可以通过指定 `python download.py --dataset cityscapes` 下载得到。 +CycleGAN 支持的数据集可以参考download.py中的`cycle_pix_dataset`,可以通过指定`python download.py --dataset xxx` 下载得到。 -cityscapes 训练集包含2975张街景实拍图片,2975张对应真实街景的语义分割图片。测试集包含499张实拍图片和499张语义分割图片。 +由于版权问题,cityscapes 数据集无法通过脚本直接获得,需要从[官方](https://www.cityscapes-dataset.com/)下载数据, +下载完之后执行`python prepare_cityscapes_dataset.py --gtFine_dir ./gtFine/ --leftImg8bit_dir ./leftImg8bit --output_dir ./data/cityscapes/`处理, +将数据存放在`data/cityscapes`。 -数据下载处理完毕后,并组织为以下路径结构: +数据下载处理完毕后,需要您将数据组织为以下路径结构: +``` +data +|-- cityscapes +| |-- testA +| |-- testB +| |-- trainA +| |-- trainB + +``` +然后运行txt生成脚本:`python generate_txt.py`,最终数据组织如下所示: ``` data |-- cityscapes diff --git a/dygraph/cycle_gan/data_reader.py b/dygraph/cycle_gan/data_reader.py index 
eb162b2304156b96984e43369a671b06226874bd..ba71431672020a589cbd06a125bc7bc7f72cf4a2 100644 --- a/dygraph/cycle_gan/data_reader.py +++ b/dygraph/cycle_gan/data_reader.py @@ -19,11 +19,12 @@ import os from PIL import Image, ImageOps import numpy as np -A_LIST_FILE = "./data/cityscapes/trainA.txt" -B_LIST_FILE = "./data/cityscapes/trainB.txt" -A_TEST_LIST_FILE = "./data/cityscapes/testA.txt" -B_TEST_LIST_FILE = "./data/cityscapes/testB.txt" -IMAGES_ROOT = "./data/cityscapes/" +DATASET = "cityscapes" +A_LIST_FILE = "./data/"+DATASET+"/trainA.txt" +B_LIST_FILE = "./data/"+DATASET+"/trainB.txt" +A_TEST_LIST_FILE = "./data/"+DATASET+"/testA.txt" +B_TEST_LIST_FILE = "./data/"+DATASET+"/testB.txt" +IMAGES_ROOT = "./data/"+DATASET+"/" def image_shape(): return [3, 256, 256] diff --git a/dygraph/cycle_gan/download.py b/dygraph/cycle_gan/download.py index 72337dc93c1892b00407de3c0bb44f6bb9c9bb47..99cd30ad39a622be6ce07a8f539f021481eebb69 100644 --- a/dygraph/cycle_gan/download.py +++ b/dygraph/cycle_gan/download.py @@ -153,7 +153,7 @@ if __name__ == '__main__': args = parser.parse_args() cycle_pix_dataset = [ 'apple2orange', 'summer2winter_yosemite', 'horse2zebra', 'monet2photo', - 'cezanne2photo', 'ukiyoe2photo', 'vangogh2photo', 'maps', 'cityscapes', + 'cezanne2photo', 'ukiyoe2photo', 'vangogh2photo', 'maps', 'facades', 'iphone2dslr_flower', 'ae_photos', 'mini' ] diff --git a/dygraph/cycle_gan/generate_txt.py b/dygraph/cycle_gan/generate_txt.py new file mode 100644 index 0000000000000000000000000000000000000000..5a8d3fd21b960697d4503c86d2de88f9d1d7389e --- /dev/null +++ b/dygraph/cycle_gan/generate_txt.py @@ -0,0 +1,25 @@ +import sys +import os +import argparse + +def gen_txt(dir_path): + dataname = "cityscapes" + ### generator .txt file according to dirs + dirs = os.listdir(os.path.join(dir_path, '{}'.format(dataname))) + for d in dirs: + txt_file = d + '.txt' + txt_dir = os.path.join(dir_path, dataname) + f = open(os.path.join(txt_dir, txt_file), 'w') + for fil in os.listdir(os.path.join(txt_dir, d)): + wl = d + '/' + fil + '\n' + f.write(wl) + f.close() + sys.stderr.write("\n") + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description=__doc__) + # yapf: disable + parser.add_argument('--output_dir', type=str, default="datasets", help='Path to output Cityscapes directory.') + # yapf: enable + args = parser.parse_args() + gen_txt(args.output_dir) \ No newline at end of file diff --git a/dygraph/cycle_gan/layers.py b/dygraph/cycle_gan/layers.py index 06b2094d560c81b13ba3ebb3205c00dc7d3b6a6c..458fb59e0eca812adbb36b5443f3f1a923ee95d3 100644 --- a/dygraph/cycle_gan/layers.py +++ b/dygraph/cycle_gan/layers.py @@ -24,8 +24,8 @@ use_cudnn = False class conv2d(fluid.dygraph.Layer): """docstring for Conv2D""" - def __init__(self, - name_scope, + def __init__(self, + num_channels, num_filters=64, filter_size=7, stride=1, @@ -35,32 +35,29 @@ class conv2d(fluid.dygraph.Layer): relu=True, relufactor=0.0, use_bias=False): - super(conv2d, self).__init__(name_scope) + super(conv2d, self).__init__() if use_bias == False: con_bias_attr = False else: - con_bias_attr = fluid.ParamAttr(name="conv_bias",initializer=fluid.initializer.Constant(0.0)) + con_bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Constant(0.0)) self.conv = Conv2D( - self.full_name(), + num_channels=num_channels, num_filters=num_filters, filter_size=filter_size, stride=stride, padding=padding, use_cudnn=use_cudnn, param_attr=fluid.ParamAttr( - name="conv2d_weights", 
initializer=fluid.initializer.NormalInitializer(loc=0.0,scale=stddev)), bias_attr=con_bias_attr) if norm: - self.bn = BatchNorm(self.full_name(), + self.bn = BatchNorm( num_channels=num_filters, param_attr=fluid.ParamAttr( - name="scale", initializer=fluid.initializer.NormalInitializer(1.0,0.02)), bias_attr=fluid.ParamAttr( - name="bias", initializer=fluid.initializer.Constant(0.0)), trainable_statistics=True ) @@ -82,7 +79,7 @@ class conv2d(fluid.dygraph.Layer): class DeConv2D(fluid.dygraph.Layer): def __init__(self, - name_scope, + num_channels, num_filters=64, filter_size=7, stride=1, @@ -94,32 +91,30 @@ class DeConv2D(fluid.dygraph.Layer): relufactor=0.0, use_bias=False ): - super(DeConv2D,self).__init__(name_scope) + super(DeConv2D,self).__init__() if use_bias == False: de_bias_attr = False else: - de_bias_attr = fluid.ParamAttr(name="de_bias",initializer=fluid.initializer.Constant(0.0)) + de_bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Constant(0.0)) - self._deconv = Conv2DTranspose(self.full_name(), - num_filters, - filter_size=filter_size, - stride=stride, - padding=padding, - param_attr=fluid.ParamAttr( - name="this_is_deconv_weights", - initializer=fluid.initializer.NormalInitializer(loc=0.0, scale=stddev)), - bias_attr=de_bias_attr) + self._deconv = Conv2DTranspose(num_channels, + num_filters, + filter_size=filter_size, + stride=stride, + padding=padding, + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.NormalInitializer(loc=0.0, scale=stddev)), + bias_attr=de_bias_attr) if norm: - self.bn = BatchNorm(self.full_name(), + self.bn = BatchNorm( num_channels=num_filters, param_attr=fluid.ParamAttr( - name="de_wights", initializer=fluid.initializer.NormalInitializer(1.0, 0.02)), - bias_attr=fluid.ParamAttr(name="de_bn_bias",initializer=fluid.initializer.Constant(0.0)), + bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Constant(0.0)), trainable_statistics=True) self.outpadding = outpadding self.relufactor = relufactor diff --git a/dygraph/cycle_gan/model.py b/dygraph/cycle_gan/model.py index b4bae7ff957e43b241d58318bd713fd00699d392..712776bdbd0aea6c083b1b40d42f11caa0a410c0 100644 --- a/dygraph/cycle_gan/model.py +++ b/dygraph/cycle_gan/model.py @@ -18,18 +18,19 @@ import paddle.fluid as fluid class build_resnet_block(fluid.dygraph.Layer): def __init__(self, - name_scope, dim, use_bias=False): - super(build_resnet_block,self).__init__(name_scope) + super(build_resnet_block,self).__init__() - self.conv0 = conv2d(self.full_name(), + self.conv0 = conv2d( + num_channels=dim, num_filters=dim, filter_size=3, stride=1, stddev=0.02, use_bias=False) - self.conv1 = conv2d(self.full_name(), + self.conv1 = conv2d( + num_channels=dim, num_filters=dim, filter_size=3, stride=1, @@ -47,38 +48,41 @@ class build_resnet_block(fluid.dygraph.Layer): out_res = self.conv1(out_res) return out_res + inputs + class build_generator_resnet_9blocks(fluid.dygraph.Layer): - def __init__ (self, - name_scope): - super(build_generator_resnet_9blocks,self).__init__(name_scope) + def __init__ (self, input_channel): + super(build_generator_resnet_9blocks, self).__init__() - self.conv0 = conv2d(self.full_name(), + self.conv0 = conv2d( + num_channels=input_channel, num_filters=32, filter_size=7, stride=1, padding=0, stddev=0.02) - self.conv1 = conv2d(self.full_name(), + self.conv1 = conv2d( + num_channels=32, num_filters=64, filter_size=3, stride=2, padding=1, stddev=0.02) - self.conv2 = conv2d(self.full_name(), + self.conv2 = conv2d( + num_channels=64, num_filters=128, filter_size=3, 
stride=2, padding=1, stddev=0.02) self.build_resnet_block_list=[] - dim = 32*4 + dim = 128 for i in range(9): Build_Resnet_Block = self.add_sublayer( "generator_%d" % (i+1), - build_resnet_block(self.full_name(), - 128)) + build_resnet_block(dim)) self.build_resnet_block_list.append(Build_Resnet_Block) - self.deconv0 = DeConv2D(self.full_name(), + self.deconv0 = DeConv2D( + num_channels=dim, num_filters=32*2, filter_size=3, stride=2, @@ -86,15 +90,17 @@ class build_generator_resnet_9blocks(fluid.dygraph.Layer): padding=[1, 1], outpadding=[0, 1, 0, 1], ) - self.deconv1 = DeConv2D(self.full_name(), + self.deconv1 = DeConv2D( + num_channels=32*2, num_filters=32, filter_size=3, stride=2, stddev=0.02, padding=[1, 1], outpadding=[0, 1, 0, 1]) - self.conv3 = conv2d(self.full_name(), - num_filters=3, + self.conv3 = conv2d( + num_channels=32, + num_filters=input_channel, filter_size=7, stride=1, stddev=0.02, @@ -102,6 +108,7 @@ class build_generator_resnet_9blocks(fluid.dygraph.Layer): relu=False, norm=False, use_bias=True) + def forward(self,inputs): pad_input = fluid.layers.pad2d(inputs, [3, 3, 3, 3], mode="reflect") y = self.conv0(pad_input) @@ -116,11 +123,13 @@ class build_generator_resnet_9blocks(fluid.dygraph.Layer): y = fluid.layers.tanh(y) return y + class build_gen_discriminator(fluid.dygraph.Layer): - def __init__(self,name_scope): - super(build_gen_discriminator,self).__init__(name_scope) + def __init__(self, input_channel): + super(build_gen_discriminator, self).__init__() - self.conv0 = conv2d(self.full_name(), + self.conv0 = conv2d( + num_channels=input_channel, num_filters=64, filter_size=4, stride=2, @@ -129,28 +138,32 @@ class build_gen_discriminator(fluid.dygraph.Layer): norm=False, use_bias=True, relufactor=0.2) - self.conv1 = conv2d(self.full_name(), + self.conv1 = conv2d( + num_channels=64, num_filters=128, filter_size=4, stride=2, stddev=0.02, padding=1, relufactor=0.2) - self.conv2 = conv2d(self.full_name(), + self.conv2 = conv2d( + num_channels=128, num_filters=256, filter_size=4, stride=2, stddev=0.02, padding=1, relufactor=0.2) - self.conv3 = conv2d(self.full_name(), + self.conv3 = conv2d( + num_channels=256, num_filters=512, filter_size=4, stride=1, stddev=0.02, padding=1, relufactor=0.2) - self.conv4 = conv2d(self.full_name(), + self.conv4 = conv2d( + num_channels=512, num_filters=1, filter_size=4, stride=1, @@ -159,6 +172,7 @@ class build_gen_discriminator(fluid.dygraph.Layer): norm=False, relu=False, use_bias=True) + def forward(self,inputs): y = self.conv0(inputs) y = self.conv1(y) diff --git a/dygraph/cycle_gan/prepare_cityscapes_dataset.py b/dygraph/cycle_gan/prepare_cityscapes_dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..6f60d9ae7ab644901f1f7a2ff46ae0eba585469f --- /dev/null +++ b/dygraph/cycle_gan/prepare_cityscapes_dataset.py @@ -0,0 +1,71 @@ +import os +import argparse +import functools +import glob +import sys +from PIL import Image +''' Based on https://github.com/junyanz/CycleGAN''' + + +def load_image(path): + return Image.open(path).convert('RGB').resize((256, 256)) + + +def propress_cityscapes(gtFine_dir, leftImg8bit_dir, output_dir, phase): + save_dir = os.path.join(output_dir, phase) + try: + os.makedirs(save_dir) + except Exception as e: + print("{} makedirs".format(e)) + pass + try: + os.makedirs(save_dir + 'A') + except Exception as e: + print("{} makedirs".format(e)) + try: + os.makedirs(save_dir + 'B') + except Exception as e: + print("{} makedirs".format(e)) + + seg_expr = os.path.join(gtFine_dir, phase, "*", 
"*_color.png") + seg_paths = glob.glob(seg_expr) + seg_paths = sorted(seg_paths) + + photo_expr = os.path.join(leftImg8bit_dir, phase, "*", '*_leftImg8bit.png') + photo_paths = glob.glob(photo_expr) + photo_paths = sorted(photo_paths) + + assert len(seg_paths) == len(photo_paths), \ + "[%d] gtFine images NOT match [%d] leftImg8bit images. Aborting." % (len(seg_paths), len(photo_paths)) + + for i, (seg_path, photo_path) in enumerate(zip(seg_paths, photo_paths)): + seg_image = load_image(seg_path) + photo_image = load_image(photo_path) + # save image + save_path = os.path.join(save_dir+'A', "%d_A.jpg" % i) + photo_image.save(save_path, format='JPEG', subsampling=0, quality=100) + save_path = os.path.join(save_dir+'B', "%d_B.jpg" % i) + seg_image.save(save_path, format='JPEG', subsampling=0, quality=100) + + if i % 10 == 0: + print("proprecess %d ~ %d images." % (i, i + 10)) + + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description=__doc__) + # yapf: disable + parser.add_argument('--gtFine_dir', type=str, default=None, help='Path to Cityscapes gtFine directory.') + parser.add_argument('--leftImg8bit_dir', type=str, default=None, help='Path to Cityscapes leftImg8bit_trainvaltest directory.') + parser.add_argument('--output_dir', type=str, default=None, help='Path to output Cityscapes directory.') + # yapf: enable + args = parser.parse_args() + + print('Preparing Cityscapes Dataset for test phase') + propress_cityscapes(args.gtFine_dir, args.leftImg8bit_dir, args.output_dir, + 'test') + + print('Preparing Cityscapes Dataset for train phase') + propress_cityscapes(args.gtFine_dir, args.leftImg8bit_dir, args.output_dir, + 'train') + + print("DONE!!!") \ No newline at end of file diff --git a/dygraph/cycle_gan/train.py b/dygraph/cycle_gan/train.py index 147b08dbf60b982b6f3e5d447a2a5af5bd526cb6..a1422047b0d02f5e6cd9dfaa97e5840d38a7bf69 100644 --- a/dygraph/cycle_gan/train.py +++ b/dygraph/cycle_gan/train.py @@ -47,7 +47,7 @@ lambda_identity = 0.5 tep_per_epoch = 2974 -def optimizer_setting(): +def optimizer_setting(parameters): lr = 0.0002 optimizer = fluid.optimizer.Adam( learning_rate=fluid.layers.piecewise_decay( @@ -56,6 +56,7 @@ def optimizer_setting(): 140 * step_per_epoch, 160 * step_per_epoch, 180 * step_per_epoch ], values=[lr, lr * 0.8, lr * 0.6, lr * 0.4, lr * 0.2, lr * 0.1]), + parameter_list=parameters, beta1=0.5) return optimizer @@ -84,13 +85,18 @@ def train(args): A_test_reader = data_reader.a_test_reader() B_test_reader = data_reader.b_test_reader() - cycle_gan = Cycle_Gan("cycle_gan", istrain=True) + cycle_gan = Cycle_Gan(input_channel=data_shape[1], istrain=True) losses = [[], []] t_time = 0 - optimizer1 = optimizer_setting() - optimizer2 = optimizer_setting() - optimizer3 = optimizer_setting() + + vars_G = cycle_gan.build_generator_resnet_9blocks_a.parameters() + cycle_gan.build_generator_resnet_9blocks_b.parameters() + vars_da = cycle_gan.build_gen_discriminator_a.parameters() + vars_db = cycle_gan.build_gen_discriminator_b.parameters() + + optimizer1 = optimizer_setting(vars_G) + optimizer2 = optimizer_setting(vars_da) + optimizer3 = optimizer_setting(vars_db) for epoch in range(args.epoch): batch_id = 0 @@ -114,13 +120,8 @@ def train(args): g_loss_out = g_loss.numpy() g_loss.backward() - vars_G = [] - for param in cycle_gan.parameters(): - if param.name[: - 52] == "cycle_gan/Cycle_Gan_0/build_generator_resnet_9blocks": - vars_G.append(param) - optimizer1.minimize(g_loss, parameter_list=vars_G) + optimizer1.minimize(g_loss) cycle_gan.clear_gradients() 
fake_pool_B = B_pool.pool_image(fake_B).numpy() @@ -141,12 +142,7 @@ def train(args): d_loss_A = fluid.layers.reduce_mean(d_loss_A) d_loss_A.backward() - vars_da = [] - for param in cycle_gan.parameters(): - if param.name[: - 47] == "cycle_gan/Cycle_Gan_0/build_gen_discriminator_0": - vars_da.append(param) - optimizer2.minimize(d_loss_A, parameter_list=vars_da) + optimizer2.minimize(d_loss_A) cycle_gan.clear_gradients() # optimize the d_B network @@ -158,12 +154,7 @@ def train(args): d_loss_B = fluid.layers.reduce_mean(d_loss_B) d_loss_B.backward() - vars_db = [] - for param in cycle_gan.parameters(): - if param.name[: - 47] == "cycle_gan/Cycle_Gan_0/build_gen_discriminator_1": - vars_db.append(param) - optimizer3.minimize(d_loss_B, parameter_list=vars_db) + optimizer3.minimize(d_loss_B) cycle_gan.clear_gradients() diff --git a/dygraph/cycle_gan/trainer.py b/dygraph/cycle_gan/trainer.py index 687faa00a318d782a934649e3dbfb48f870a3367..4ad1c19c7ab7a9e2dda0a2c4ba75b29e0751b1eb 100644 --- a/dygraph/cycle_gan/trainer.py +++ b/dygraph/cycle_gan/trainer.py @@ -25,14 +25,14 @@ lambda_identity = 0.5 class Cycle_Gan(fluid.dygraph.Layer): - def __init__(self, name_scope,istrain=True): - super (Cycle_Gan, self).__init__(name_scope) + def __init__(self, input_channel, istrain=True): + super (Cycle_Gan, self).__init__() - self.build_generator_resnet_9blocks_a = build_generator_resnet_9blocks(self.full_name()) - self.build_generator_resnet_9blocks_b = build_generator_resnet_9blocks(self.full_name()) + self.build_generator_resnet_9blocks_a = build_generator_resnet_9blocks(input_channel) + self.build_generator_resnet_9blocks_b = build_generator_resnet_9blocks(input_channel) if istrain: - self.build_gen_discriminator_a = build_gen_discriminator(self.full_name()) - self.build_gen_discriminator_b = build_gen_discriminator(self.full_name()) + self.build_gen_discriminator_a = build_gen_discriminator(input_channel) + self.build_gen_discriminator_b = build_gen_discriminator(input_channel) def forward(self,input_A,input_B,is_G,is_DA,is_DB): diff --git a/dygraph/lac/README.md b/dygraph/lac/README.md new file mode 100644 index 0000000000000000000000000000000000000000..56a21bacea94c37fe8cb64c60d7978354dcd4cbd --- /dev/null +++ b/dygraph/lac/README.md @@ -0,0 +1,129 @@ +# 中文词法分析 + +## 1. 简介 + +Lexical Analysis of Chinese,简称 LAC,是一个联合的词法分析模型,在单个模型中完成中文分词、词性标注、专名识别任务。我们在自建的数据集上对分词、词性标注、专名识别进行整体的评估效果,具体数值见下表;此外,我们在百度开放的 [ERNIE](https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE) 模型上 finetune,并对比基线模型、BERT finetuned 和 ERNIE finetuned 的效果,可以看出会有显著的提升。可通过 [AI开放平台-词法分析](http://ai.baidu.com/tech/nlp/lexical) 线上体验百度的词法分析服务。 +这里的是LAC的动态图实现,相同网络结构的静态图实现可以参照:[LAC静态图实现](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis) + +|模型|Precision|Recall|F1-score| +|:-:|:-:|:-:|:-:| +|Lexical Analysis|89.2%|89.4%|89.3%| + +## 2. 快速开始 + +### 安装说明 + +#### 1.PaddlePaddle 安装 + +本项目依赖 PaddlePaddle 1.7.0 及以上版本和PaddleHub 1.0.0及以上版本 ,PaddlePaddle安装请参考官网 [快速安装](http://www.paddlepaddle.org/paddle#quick-start),PaddleHub安装参考 [PaddleHub](https://github.com/PaddlePaddle/PaddleHub)。 + +> Warning: GPU 和 CPU 版本的 PaddlePaddle 分别是 paddlepaddle-gpu 和 paddlepaddle,请安装时注意区别。 + +#### 2. 克隆代码 +克隆工具集代码库到本地 +```bash + git clone https://github.com/PaddlePaddle/models.git + cd models/dygraph/lac +``` + +#### 3. 
环境依赖 +PaddlePaddle的版本要求是:Python 2 版本是 2.7.15+、Python 3 版本是 3.5.1+/3.6/3.7。LAC的代码可支持Python2/3,无具体版本限制 + +### 数据准备 + +#### 训练数据集 + +下载数据集文件,解压后会生成 `./data/` 文件夹 +```bash +python downloads.py dataset +``` + +### 模型训练 +基于示例的数据集,可通过下面的命令,在训练集 `./data/train.tsv` 上进行训练 + +```bash +bash run.sh +``` + +### 模型评估 + +我们基于自建的数据集训练了一个词法分析的模型,可以直接用这个模型对测试集 `./data/test.tsv` 进行验证, +```bash +# baseline model +sh eval.sh + +``` + +### 模型预测 + +加载已有的模型,对未知的数据进行预测 +```bash +# baseline model +sh predict.sh + +``` + +## 3. 进阶使用 + +### 任务定义与建模 +词法分析任务的输入是一个字符串(我们后面使用『句子』来指代它),而输出是句子中的词边界和词性、实体类别。序列标注是词法分析的经典建模方式。我们使用基于 GRU 的网络结构学习特征,将学习到的特征接入 CRF 解码层完成序列标注。CRF 解码层本质上是将传统 CRF 中的线性模型换成了非线性神经网络,基于句子级别的似然概率,因而能够更好的解决标记偏置问题。模型要点如下,具体细节请参考 `run_sequence_labeling.py` 代码。 +1. 输入采用 one-hot 方式表示,每个字以一个 id 表示 +2. one-hot 序列通过字表,转换为实向量表示的字向量序列; +3. 字向量序列作为双向 GRU 的输入,学习输入序列的特征表示,得到新的特性表示序列,我们堆叠了两层双向GRU以增加学习能力; +4. CRF 以 GRU 学习到的特征为输入,以标记序列为监督信号,实现序列标注。 + +词性和专名类别标签集合如下表,其中词性标签 24 个(小写字母),专名类别标签 4 个(大写字母)。这里需要说明的是,人名、地名、机构名和时间四个类别,在上表中存在两套标签(PER / LOC / ORG / TIME 和 nr / ns / nt / t),被标注为第二套标签的词,是模型判断为低置信度的人名、地名、机构名和时间词。开发者可以基于这两套标签,在四个类别的准确、召回之间做出自己的权衡。 + +| 标签 | 含义 | 标签 | 含义 | 标签 | 含义 | 标签 | 含义 | +| ---- | -------- | ---- | -------- | ---- | -------- | ---- | -------- | +| n | 普通名词 | f | 方位名词 | s | 处所名词 | t | 时间 | +| nr | 人名 | ns | 地名 | nt | 机构名 | nw | 作品名 | +| nz | 其他专名 | v | 普通动词 | vd | 动副词 | vn | 名动词 | +| a | 形容词 | ad | 副形词 | an | 名形词 | d | 副词 | +| m | 数量词 | q | 量词 | r | 代词 | p | 介词 | +| c | 连词 | u | 助词 | xc | 其他虚词 | w | 标点符号 | +| PER | 人名 | LOC | 地名 | ORG | 机构名 | TIME | 时间 | + +### 模型原理介绍 +上面介绍的模型原理如下图所示:
+ + +![GRU-CRF-MODEL](https://github.com/PaddlePaddle/models/blob/develop/PaddleNLP/lexical_analysis/gru-crf-model.png) + +### 数据格式 +训练使用的数据可以由用户根据实际的应用场景,自己组织数据。除了第一行是 `text_a\tlabel` 固定的开头,后面的每行数据都是由两列组成,以制表符分隔,第一列是 utf-8 编码的中文文本,以 `\002` 分割,第二列是对应每个字的标注,以 `\002` 分隔。我们采用 IOB2 标注体系,即以 X-B 作为类型为 X 的词的开始,以 X-I 作为类型为 X 的词的持续,以 O 表示不关注的字(实际上,在词性、专名联合标注中,不存在 O )。示例如下: + +```text +除\002了\002他\002续\002任\002十\002二\002届\002政\002协\002委\002员\002,\002马\002化\002腾\002,\002雷\002军\002,\002李\002彦\002宏\002也\002被\002推\002选\002为\002新\002一\002届\002全\002国\002人\002大\002代\002表\002或\002全\002国\002政\002协\002委\002员 p-B\002p-I\002r-B\002v-B\002v-I\002m-B\002m-I\002m-I\002ORG-B\002ORG-I\002n-B\002n-I\002w-B\002PER-B\002PER-I\002PER-I\002w-B\002PER-B\002PER-I\002w-B\002PER-B\002PER-I\002PER-I\002d-B\002p-B\002v-B\002v-I\002v-B\002a-B\002m-B\002m-I\002ORG-B\002ORG-I\002ORG-I\002ORG-I\002n-B\002n-I\002c-B\002n-B\002n-I\002ORG-B\002ORG-I\002n-B\002n-I +``` + ++ 我们随同代码一并发布了完全版的模型和相关的依赖数据。但是,由于模型的训练数据过于庞大,我们没有发布训练数据,仅在`data`目录下放置少数样本用以示例输入数据格式。 + ++ 模型依赖数据包括: + 1. 输入文本的词典,在`conf`目录下,对应`word.dic` + 2. 对输入文本中特殊字符进行转换的字典,在`conf`目录下,对应`q2b.dic` + 3. 标记标签的词典,在`conf`目录下,对应`tag.dic` + ++ 在训练和预测阶段,我们都需要进行原始数据的预处理,具体处理工作包括: + + 1. 从原始数据文件中抽取出句子和标签,构造句子序列和标签序列 + 2. 将句子序列中的特殊字符进行转换 + 3. 依据词典获取词对应的整数索引 + + +## 4. 其他 +### 在论文中引用 LAC + +如果您的学术工作成果中使用了 LAC,请您增加下述引用。我们非常欣慰 LAC 能够对您的学术工作带来帮助。 + +```text +@article{jiao2018LAC, + title={Chinese Lexical Analysis with Deep Bi-GRU-CRF Network}, + author={Jiao, Zhenyu and Sun, Shuqi and Sun, Ke}, + journal={arXiv preprint arXiv:1807.01882}, + year={2018}, + url={https://arxiv.org/abs/1807.01882} +} +``` +### 如何贡献代码 +如果你可以修复某个 issue 或者增加一个新功能,欢迎给我们提交PR。如果对应的PR被接受了,我们将根据贡献的质量和难度 进行打分(0-5分,越高越好)。如果你累计获得了 10 分,可以联系我们获得面试机会或为你写推荐信。 diff --git a/dygraph/lac/conf/args.yaml b/dygraph/lac/conf/args.yaml new file mode 100755 index 0000000000000000000000000000000000000000..ec56ad4c2b806cc9b1597704dc4141f51e962633 --- /dev/null +++ b/dygraph/lac/conf/args.yaml @@ -0,0 +1,84 @@ +model: + word_emb_dim: + val: 128 + meaning: "The dimension in which a word is embedded." + grnn_hidden_dim: + val: 128 + meaning: "The number of hidden nodes in the GRNN layer." + bigru_num: + val: 2 + meaning: "The number of bi_gru layers in the network." + init_checkpoint: + val: "" + meaning: "Path to init model" + inference_save_dir: + val: "" + meaning: "Path to save inference model" + +train: + random_seed: + val: 0 + meaning: "Random seed for training" + print_steps: + val: 1 + meaning: "Print the result per xxx batch of training" + save_steps: + val: 10 + meaning: "Save the model once per xxxx batch of training" + validation_steps: + val: 10 + meaning: "Do the validation once per xxxx batch of training" + batch_size: + val: 300 + meaning: "The number of sequences contained in a mini-batch" + epoch: + val: 10 + meaning: "Corpus iteration num" + use_cuda: + val: False + meaning: "If set, use GPU for training." + traindata_shuffle_buffer: + val: 20000 + meaning: "The buffer size used in shuffle the training data." + base_learning_rate: + val: 0.001 + meaning: "The basic learning rate that affects the entire network." + emb_learning_rate: + val: 2 + meaning: "The real learning rate of the embedding layer will be (emb_learning_rate * base_learning_rate)." + crf_learning_rate: + val: 0.2 + meaning: "The real learning rate of the embedding layer will be (crf_learning_rate * base_learning_rate)." + enable_ce: + val: false + meaning: 'If set, run the task with continuous evaluation logs.' 
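
To make the training-data format described in the LAC README above concrete (each line is `text_a\tlabel`, with both the characters and their IOB2 tags separated by `\002`), here is a minimal, self-contained sketch that parses one such line; the function name and the sample string are illustrative only and not part of this patch:

```python
# Illustrative sketch, not part of this patch: parse one LAC training line.
def parse_lac_line(line):
    """Split a text<TAB>label line into (character, IOB2-tag) pairs."""
    text, label = line.rstrip("\n").split("\t")
    chars = text.split("\002")   # "\002" is the chr(2) separator used in the data
    tags = label.split("\002")
    assert len(chars) == len(tags), "each character needs exactly one tag"
    return list(zip(chars, tags))

# Example: the two characters of a preposition span tagged p-B / p-I.
sample = "除\002了\tp-B\002p-I"
print(parse_lac_line(sample))  # [('除', 'p-B'), ('了', 'p-I')]
```
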
+ cpu_num: + val: 10 + meaning: "The number of cpu used to train model, this argument wouldn't be valid if use_cuda=true" + use_data_parallel: + val: False + meaning: "The flag indicating whether to use data parallel mode to train the model." + +data: + word_dict_path: + val: "./conf/word.dic" + meaning: "The path of the word dictionary." + label_dict_path: + val: "./conf/tag.dic" + meaning: "The path of the label dictionary." + word_rep_dict_path: + val: "./conf/q2b.dic" + meaning: "The path of the word replacement Dictionary." + train_data: + val: "./data/train.tsv" + meaning: "The folder where the training data is located." + test_data: + val: "./data/test.tsv" + meaning: "The folder where the test data is located." + infer_data: + val: "./data/infer.tsv" + meaning: "The folder where the infer data is located." + model_save_dir: + val: "./models" + meaning: "The model will be saved in this path." + diff --git a/dygraph/lac/conf/customization.dic b/dygraph/lac/conf/customization.dic new file mode 100755 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/dygraph/lac/conf/customization.dic.example b/dygraph/lac/conf/customization.dic.example new file mode 100755 index 0000000000000000000000000000000000000000..22ebf8a4b15ec51870c7b0d683963e1f9e75d673 --- /dev/null +++ b/dygraph/lac/conf/customization.dic.example @@ -0,0 +1,3 @@ +[D:MONTH] +月 +月份 diff --git a/dygraph/lac/conf/ernie_args.yaml b/dygraph/lac/conf/ernie_args.yaml new file mode 100755 index 0000000000000000000000000000000000000000..5d8707366454fc7444022f6ef046fbea601e7364 --- /dev/null +++ b/dygraph/lac/conf/ernie_args.yaml @@ -0,0 +1,77 @@ +model: + ernie_config_path: + val: "../LARK/ERNIE/config/ernie_config.json" + meaning: "Path to the json file for ernie model config." + init_checkpoint: + val: "" + meaning: "Path to init model" + mode: + val: "train" + meaning: "Setting to train or eval or infer" + init_pretraining_params: + val: "pretrained/params/" + meaning: "Init pre-training params which preforms fine-tuning from. If the arg 'init_checkpoint' has been set, this argument wouldn't be valid." + +train: + random_seed: + val: 0 + meaning: "Random seed for training" + batch_size: + val: 10 + meaning: "The number of sequences contained in a mini-batch" + epoch: + val: 10 + meaning: "Corpus iteration num" + use_cuda: + val: True + meaning: "If set, use GPU for training." + base_learning_rate: + val: 0.0002 + meaning: "The basic learning rate that affects the entire network." + init_bound: + val: 0.1 + meaning: "init bound for initialization." + crf_learning_rate: + val: 0.2 + meaning: "The real learning rate of the embedding layer will be (crf_learning_rate * base_learning_rate)." + cpu_num: + val: 10 + meaning: "The number of cpu used to train model, it works when use_cuda=False" + print_steps: + val: 1 + meaning: "Print the result per xxx batch of training" + save_steps: + val: 10 + meaning: "Save the model once per xxxx batch of training" + validation_steps: + val: 5 + meaning: "Do the validation once per xxxx batch of training" + +data: + vocab_path: + val: "../LARK/ERNIE/config/vocab.txt" + meaning: "The path of the vocabulary." + label_map_config: + val: "./conf/label_map.json" + meaning: "The path of the label dictionary." + num_labels: + val: 57 + meaning: "label number" + max_seq_len: + val: 128 + meaning: "Number of words of the longest seqence." + do_lower_case: + val: True + meaning: "Whether to lower case the input text. 
Should be True for uncased models and False for cased models." + train_data: + val: "./data/train.tsv" + meaning: "The folder where the training data is located." + test_data: + val: "./data/test.tsv" + meaning: "The folder where the test data is located." + infer_data: + val: "./data/test.tsv" + meaning: "The folder where the infer data is located." + model_save_dir: + val: "./ernie_models" + meaning: "The model will be saved in this path." diff --git a/dygraph/lac/conf/label_map.json b/dygraph/lac/conf/label_map.json new file mode 100755 index 0000000000000000000000000000000000000000..52011d81f0ac4b70188acdddbb6e4f42208586d1 --- /dev/null +++ b/dygraph/lac/conf/label_map.json @@ -0,0 +1 @@ +{"d-B": 8, "c-I": 7, "PER-I": 49, "nr-B": 16, "u-B": 36, "c-B": 6, "nr-I": 17, "an-I": 5, "ns-B": 18, "vn-I": 43, "w-B": 44, "an-B": 4, "PER-B": 48, "vn-B": 42, "ns-I": 19, "a-I": 1, "r-B": 30, "xc-B": 46, "LOC-B": 50, "ad-I": 3, "nz-B": 24, "u-I": 37, "a-B": 0, "ad-B": 2, "vd-I": 41, "nw-B": 22, "m-I": 13, "d-I": 9, "n-B": 14, "nz-I": 25, "vd-B": 40, "nw-I": 23, "n-I": 15, "nt-B": 20, "ORG-I": 53, "nt-I": 21, "ORG-B": 52, "LOC-I": 51, "t-B": 34, "TIME-I": 55, "O": 56, "s-I": 33, "f-I": 11, "TIME-B": 54, "t-I": 35, "f-B": 10, "s-B": 32, "r-I": 31, "q-B": 28, "v-I": 39, "v-B": 38, "w-I": 45, "q-I": 29, "p-B": 26, "xc-I": 47, "m-B": 12, "p-I": 27} \ No newline at end of file diff --git a/dygraph/lac/conf/q2b.dic b/dygraph/lac/conf/q2b.dic new file mode 100755 index 0000000000000000000000000000000000000000..d1f14691e228be8a5d5d1385ad968bcafab0c07a --- /dev/null +++ b/dygraph/lac/conf/q2b.dic @@ -0,0 +1,172 @@ +  +、 , +。 . +— - +~ ~ +‖ | +… . +‘ ' +’ ' +“ " +” " +〔 ( +〕 ) +〈 < +〉 > +「 ' +」 ' +『 " +』 " +〖 [ +〗 ] +【 [ +】 ] +∶ : +$ $ +! ! +" " +# # +% % +& & +' ' +( ( +) ) +* * ++ + +, , +- - +. . +/ / +0 0 +1 1 +2 2 +3 3 +4 4 +5 5 +6 6 +7 7 +8 8 +9 9 +: : +; ; +< < += = +> > +? ? +@ @ +A a +B b +C c +D d +E e +F f +G g +H h +I i +J j +K k +L l +M m +N n +O o +P p +Q q +R r +S s +T t +U u +V v +W w +X x +Y y +Z z +[ [ +\ \ +] ] +^ ^ +_ _ +` ` +a a +b b +c c +d d +e e +f f +g g +h h +i i +j j +k k +l l +m m +n n +o o +p p +q q +r r +s s +t t +u u +v v +w w +x x +y y +z z +{ { +| | +} } + ̄ ~ +〝 " +〞 " +﹐ , +﹑ , +﹒ . +﹔ ; +﹕ : +﹖ ? +﹗ ! +﹙ ( +﹚ ) +﹛ { +﹜ { +﹝ [ +﹞ ] +﹟ # +﹠ & +﹡ * +﹢ + +﹣ - +﹤ < +﹥ > +﹦ = +﹨ \ +﹩ $ +﹪ % +﹫ @ + , +A a +B b +C c +D d +E e +F f +G g +H h +I i +J j +K k +L l +M m +N n +O o +P p +Q q +R r +S s +T t +U u +V v +W w +X x +Y y +Z z diff --git a/dygraph/lac/conf/strong_punc.dic b/dygraph/lac/conf/strong_punc.dic new file mode 100755 index 0000000000000000000000000000000000000000..595e2f673bf432fd86f8d2310d5a9adc5f824a17 --- /dev/null +++ b/dygraph/lac/conf/strong_punc.dic @@ -0,0 +1,5 @@ +! +。 +! 
+; +; diff --git a/dygraph/lac/conf/tag.dic b/dygraph/lac/conf/tag.dic new file mode 100755 index 0000000000000000000000000000000000000000..753fa9670e92b45a9daa54ddcb1e2f06491a792e --- /dev/null +++ b/dygraph/lac/conf/tag.dic @@ -0,0 +1,57 @@ +0 a-B +1 a-I +2 ad-B +3 ad-I +4 an-B +5 an-I +6 c-B +7 c-I +8 d-B +9 d-I +10 f-B +11 f-I +12 m-B +13 m-I +14 n-B +15 n-I +16 nr-B +17 nr-I +18 ns-B +19 ns-I +20 nt-B +21 nt-I +22 nw-B +23 nw-I +24 nz-B +25 nz-I +26 p-B +27 p-I +28 q-B +29 q-I +30 r-B +31 r-I +32 s-B +33 s-I +34 t-B +35 t-I +36 u-B +37 u-I +38 v-B +39 v-I +40 vd-B +41 vd-I +42 vn-B +43 vn-I +44 w-B +45 w-I +46 xc-B +47 xc-I +48 PER-B +49 PER-I +50 LOC-B +51 LOC-I +52 ORG-B +53 ORG-I +54 TIME-B +55 TIME-I +56 O diff --git a/dygraph/lac/conf/word.dic b/dygraph/lac/conf/word.dic new file mode 100755 index 0000000000000000000000000000000000000000..d0ec32491250c8da85800069e4ca7260c6c3700f --- /dev/null +++ b/dygraph/lac/conf/word.dic @@ -0,0 +1,20940 @@ +0 a +1 e +2 i +3 n +4 o +5 s +6 r +7 t +8 l +9 0 +10 u +11 c +12 1 +13 d +14 m +15 h +16 g +17 2 +18 p +19 b +20 y +21 5 +22 3 +23 8 +24 6 +25 k +26 A +27 4 +28 9 +29 f +30 7 +31 S +32 v +33 E +34 w +35 z +36 C +37 x +38 T +39 I +40 j +41 M +42 R +43 O +44 D +45 L +46 N +47 B +48 P +49 H +50 G +51 李 +52 F +53 K +54 王 +55 张 +56 q +57 U +58 刘 +59 陈 +60 W +61 Y +62 V +63 斯 +64 文 +65 X +66 J +67 Z +68 华 +69 明 +70 尔 +71 林 +72 德 +73 晓 +74 杨 +75 金 +76 Q +77 克 +78 小 +79 志 +80 国 +81 海 +82 丽 +83 平 +84 玉 +85 黄 +86 吴 +87 建 +88 特 +89 拉 +90 子 +91 赵 +92 利 +93 马 +94 军 +95 周 +96 亚 +97 伟 +98 东 +99 红 +100 龙 +101 春 +102 云 +103 生 +104 朱 +105 孙 +106 徐 +107 永 +108 达 +109 美 +110 安 +111 杰 +112 卡 +113 天 +114 新 +115 罗 +116 里 +117 大 +118 光 +119 波 +120 家 +121 成 +122 福 +123 高 +124 胡 +125 荣 +126 英 +127 阿 +128 思 +129 立 +130 瑞 +131 峰 +132 宝 +133 郭 +134 清 +135 兰 +136 西 +137 山 +138 维 +139 爱 +140 宇 +141 佳 +142 辉 +143 俊 +144 雅 +145 庆 +146 尼 +147 梅 +148 格 +149 之 +150 一 +151 君 +152 忠 +153 强 +154 学 +155 世 +156 雪 +157 良 +158 民 +159 芳 +160 郑 +161 敏 +162 秀 +163 迪 +164 元 +165 洪 +166 祥 +167 泽 +168 中 +169 康 +170 科 +171 嘉 +172 正 +173 飞 +174 巴 +175 兴 +176 松 +177 恩 +178 江 +179 乐 +180 宏 +181 振 +182 斌 +183 路 +184 雨 +185 娜 +186 雷 +187 玲 +188 长 +189 多 +190 凯 +191 米 +192 加 +193 奇 +194 吉 +195 青 +196 武 +197 水 +198 布 +199 力 +200 燕 +201 纳 +202 白 +203 慧 +204 宋 +205 万 +206 莱 +207 勇 +208 丹 +209 威 +210 宁 +211 南 +212 士 +213 堂 +214 何 +215 普 +216 洛 +217 秋 +218 胜 +219 仁 +220 韩 +221 奥 +222 富 +223 丁 +224 月 +225 石 +226 方 +227 博 +228 森 +229 艳 +230 鹏 +231 刚 +232 凤 +233 诺 +234 阳 +235 涛 +236 叶 +237 香 +238 比 +239 曹 +240 少 +241 昌 +242 泰 +243 伊 +244 亮 +245 沈 +246 霞 +247 梁 +248 菲 +249 谢 +250 唐 +251 智 +252 梦 +253 希 +254 曼 +255 贝 +256 杜 +257 木 +258 花 +259 苏 +260 星 +261 萍 +262 心 +263 景 +264 超 +265 欣 +266 树 +267 广 +268 许 +269 伯 +270 来 +271 夫 +272 塔 +273 卫 +274 义 +275 冯 +276 可 +277 田 +278 道 +279 圣 +280 汉 +281 三 +282 娟 +283 友 +284 夏 +285 基 +286 宗 +287 人 +288 贵 +289 婷 +290 鲁 +291 根 +292 艾 +293 静 +294 诗 +295 惠 +296 法 +297 蔡 +298 玛 +299 喜 +300 浩 +301 欧 +302 保 +303 潘 +304 风 +305 莉 +306 珍 +307 源 +308 桂 +309 远 +310 孟 +311 沙 +312 继 +313 顺 +314 锦 +315 邓 +316 贤 +317 书 +318 全 +319 得 +320 轩 +321 通 +322 吕 +323 才 +324 妮 +325 董 +326 曾 +327 彭 +328 雄 +329 琴 +330 旭 +331 袁 +332 城 +333 琳 +334 芬 +335 豪 +336 村 +337 卢 +338 剑 +339 蒋 +340 伦 +341 培 +342 魏 +343 瓦 +344 哈 +345 莫 +346 丝 +347 兵 +348 古 +349 银 +350 泉 +351 发 +352 传 +353 群 +354 若 +355 虎 +356 连 +357 如 +358 肖 +359 鑫 +360 盛 +361 先 +362 凡 +363 鸿 +364 图 +365 章 +366 姜 +367 琪 +368 启 +369 柏 +370 耀 +371 开 +372 依 +373 坤 +374 有 +375 萨 +376 怡 +377 崔 +378 川 +379 祖 +380 尚 +381 贾 +382 园 +383 素 +384 托 
+385 淑 +386 健 +387 彦 +388 余 +389 双 +390 信 +391 麦 +392 范 +393 汪 +394 蒂 +395 程 +396 朝 +397 和 +398 然 +399 本 +400 塞 +401 灵 +402 秦 +403 铭 +404 河 +405 进 +406 姆 +407 百 +408 陆 +409 彬 +410 锋 +411 洁 +412 莲 +413 冰 +414 晨 +415 邦 +416 兆 +417 钟 +418 日 +419 绍 +420 铁 +421 怀 +422 赛 +423 善 +424 舒 +425 恒 +426 其 +427 行 +428 旺 +429 修 +430 易 +431 任 +432 莎 +433 . +434 顾 +435 艺 +436 丰 +437 皮 +438 帕 +439 延 +440 隆 +441 门 +442 太 +443 哲 +444 定 +445 蒙 +446 洋 +447 紫 +448 庄 +449 姚 +450 戴 +451 向 +452 顿 +453 礼 +454 权 +455 桥 +456 颖 +457 镇 +458 茂 +459 益 +460 露 +461 齐 +462 仙 +463 儿 +464 勒 +465 地 +466 真 +467 凌 +468 毛 +469 佩 +470 冬 +471 弗 +472 九 +473 润 +474 涵 +475 千 +476 史 +477 碧 +478 自 +479 承 +480 彩 +481 翔 +482 乔 +483 施 +484 治 +485 索 +486 会 +487 运 +488 卓 +489 毅 +490 年 +491 莹 +492 沃 +493 于 +494 孔 +495 薛 +496 业 +497 柳 +498 内 +499 钱 +500 廷 +501 登 +502 仕 +503 熙 +504 守 +505 敬 +506 孝 +507 雯 +508 增 +509 相 +510 时 +511 楠 +512 二 +513 竹 +514 谷 +515 不 +516 牛 +517 好 +518 京 +519 仲 +520 赫 +521 黑 +522 朗 +523 汤 +524 悦 +525 蓝 +526 公 +527 梓 +528 珠 +529 芝 +530 苑 +531 炳 +532 奎 +533 黎 +534 老 +535 佛 +536 谭 +537 鱼 +538 尹 +539 神 +540 温 +541 帝 +542 锡 +543 陶 +544 墨 +545 媛 +546 上 +547 乌 +548 常 +549 言 +550 熊 +551 化 +552 火 +553 升 +554 庭 +555 臣 +556 同 +557 头 +558 晶 +559 磊 +560 楚 +561 提 +562 优 +563 勤 +564 歌 +565 岩 +566 琦 +567 草 +568 韦 +569 库 +570 溪 +571 逸 +572 五 +573 政 +574 冠 +575 果 +576 跃 +577 辰 +578 柯 +579 戈 +580 廖 +581 薇 +582 琼 +583 申 +584 占 +585 湖 +586 辛 +587 代 +588 四 +589 严 +590 扎 +591 倩 +592 邹 +593 乃 +594 宜 +595 捷 +596 理 +597 洲 +598 鸣 +599 邱 +600 栋 +601 翠 +602 睿 +603 满 +604 容 +605 霖 +606 纪 +607 岳 +608 卿 +609 羽 +610 扬 +611 阁 +612 亦 +613 邵 +614 居 +615 久 +616 桑 +617 寿 +618 记 +619 北 +620 哥 +621 瑶 +622 埃 +623 彤 +624 贺 +625 菊 +626 湘 +627 诚 +628 宾 +629 郝 +630 非 +631 珊 +632 存 +633 无 +634 颜 +635 意 +636 盖 +637 é +638 霍 +639 初 +640 派 +641 野 +642 摩 +643 妍 +644 应 +645 口 +646 馨 +647 名 +648 坚 +649 品 +650 能 +651 寒 +652 纯 +653 蓉 +654 声 +655 葛 +656 航 +657 以 +658 坦 +659 童 +660 尤 +661 色 +662 晴 +663 令 +664 重 +665 聪 +666 芙 +667 亭 +668 柱 +669 合 +670 兹 +671 育 +672 音 +673 厚 +674 迈 +675 付 +676 奈 +677 语 +678 情 +679 宫 +680 列 +681 都 +682 钦 +683 炎 +684 必 +685 客 +686 蕾 +687 龚 +688 笑 +689 左 +690 作 +691 楼 +692 切 +693 娇 +694 宪 +695 韵 +696 农 +697 流 +698 密 +699 关 +700 岭 +701 干 +702 为 +703 夜 +704 氏 +705 微 +706 男 +707 显 +708 腾 +709 甘 +710 娅 +711 晋 +712 昊 +713 仪 +714 查 +715 焕 +716 姬 +717 印 +718 台 +719 苗 +720 钰 +721 甲 +722 勋 +723 车 +724 班 +725 锐 +726 原 +727 虹 +728 六 +729 段 +730 曲 +731 崇 +732 七 +733 茹 +734 萌 +735 & +736 巧 +737 州 +738 那 +739 标 +740 俞 +741 堡 +742 劳 +743 联 +744 土 +745 血 +746 起 +747 乡 +748 瑜 +749 岛 +750 池 +751 战 +752 师 +753 茶 +754 鹤 +755 彪 +756 鼎 +757 婉 +758 裕 +759 季 +760 耶 +761 闫 +762 冷 +763 昆 +764 知 +765 绿 +766 麟 +767 朵 +768 默 +769 贞 +770 什 +771 赖 +772 倪 +773 尧 +774 灿 +775 因 +776 官 +777 昭 +778 奕 +779 穆 +780 佐 +781 影 +782 荷 +783 功 +784 撒 +785 照 +786 井 +787 宽 +788 桐 +789 萱 +790 坊 +791 聚 +792 萧 +793 球 +794 璐 +795 晖 +796 鬼 +797 面 +798 字 +799 慕 +800 费 +801 越 +802 约 +803 曦 +804 后 +805 欢 +806 枫 +807 玮 +808 殷 +809 包 +810 念 +811 八 +812 汝 +813 翰 +814 黃 +815 奴 +816 手 +817 望 +818 茜 +819 儒 +820 傅 +821 气 +822 玄 +823 黛 +824 汇 +825 肯 +826 龍 +827 耐 +828 佑 +829 湾 +830 单 +831 岚 +832 舍 +833 热 +834 昂 +835 步 +836 钢 +837 环 +838 御 +839 缘 +840 伍 +841 下 +842 机 +843 乾 +844 魔 +845 前 +846 震 +847 巨 +848 线 +849 皓 +850 盈 +851 庞 +852 谦 +853 宣 +854 女 +855 体 +856 靖 +857 均 +858 劲 +859 济 +860 硕 +861 营 +862 帆 +863 妙 +864 瑟 +865 财 +866 出 +867 在 +868 炜 +869 味 +870 斗 +871 留 +872 深 +873 芸 +874 耿 +875 沛 +876 经 +877 管 +878 菜 +879 献 +880 外 +881 殿 +882 房 +883 焦 +884 骨 +885 点 +886 禹 +887 禄 +888 毕 +889 桃 +890 空 +891 侯 +892 
鹰 +893 岗 +894 津 +895 雁 +896 帅 +897 妃 +898 复 +899 衣 +900 骏 +901 聂 +902 绪 +903 娃 +904 眼 +905 舟 +906 打 +907 分 +908 油 +909 者 +910 度 +911 角 +912 朴 +913 藤 +914 枝 +915 落 +916 亨 +917 游 +918 潮 +919 皇 +920 華 +921 梵 +922 滨 +923 禾 +924 郎 +925 洞 +926 精 +927 烈 +928 翁 +929 允 +930 塘 +931 璇 +932 事 +933 祝 +934 翼 +935 粉 +936 板 +937 赤 +938 盘 +939 昕 +940 蕊 +941 姿 +942 侠 +943 回 +944 á +945 秉 +946 征 +947 圆 +948 考 +949 茨 +950 娘 +951 邢 +952 电 +953 瑾 +954 酒 +955 寺 +956 尊 +957 冉 +958 边 +959 别 +960 刀 +961 工 +962 筱 +963 馬 +964 坡 +965 弘 +966 樊 +967 裴 +968 柔 +969 甫 +970 妹 +971 浦 +972 锁 +973 渊 +974 映 +975 当 +976 鲍 +977 见 +978 麻 +979 婧 +980 选 +981 牙 +982 烟 +983 翟 +984 钧 +985 屋 +986 冲 +987 放 +988 芹 +989 煜 +990 再 +991 尘 +992 司 +993 创 +994 恋 +995 幼 +996 展 +997 镜 +998 实 +999 浪 +1000 珂 +1001 爽 +1002 驰 +1003 鹿 +1004 吾 +1005 简 +1006 虫 +1007 网 +1008 从 +1009 è +1010 紅 +1011 食 +1012 赞 +1013 à +1014 柴 +1015 沟 +1016 魂 +1017 張 +1018 叔 +1019 端 +1020 入 +1021 闻 +1022 耳 +1023 慈 +1024 汀 +1025 集 +1026 郁 +1027 娥 +1028 死 +1029 伏 +1030 观 +1031 鸟 +1032 港 +1033 仓 +1034 芭 +1035 羊 +1036 纽 +1037 詹 +1038 唯 +1039 主 +1040 亿 +1041 旗 +1042 朋 +1043 蔚 +1044 商 +1045 斐 +1046 拜 +1047 凝 +1048 十 +1049 酷 +1050 片 +1051 性 +1052 烨 +1053 長 +1054 寨 +1055 蓓 +1056 动 +1057 魁 +1058 猫 +1059 迎 +1060 魚 +1061 敦 +1062 浮 +1063 東 +1064 用 +1065 霜 +1066 咏 +1067 采 +1068 狼 +1069 解 +1070 衡 +1071 录 +1072 府 +1073 琛 +1074 舞 +1075 街 +1076 澜 +1077 致 +1078 则 +1079 努 +1080 愛 +1081 举 +1082 淼 +1083 ì +1084 晟 +1085 肉 +1086 身 +1087 巷 +1088 伽 +1089 畅 +1090 典 +1091 首 +1092 ê +1093 斋 +1094 拿 +1095 沐 +1096 骆 +1097 丙 +1098 狗 +1099 瓜 +1100 內 +1101 细 +1102 í +1103 视 +1104 屯 +1105 臻 +1106 酸 +1107 速 +1108 頭 +1109 养 +1110 傲 +1111 牧 +1112 添 +1113 直 +1114 鸡 +1115 泊 +1116 勃 +1117 昱 +1118 巍 +1119 宸 +1120 式 +1121 茵 +1122 豆 +1123 休 +1124 半 +1125 场 +1126 蛇 +1127 灯 +1128 临 +1129 玺 +1130 煌 +1131 顶 +1132 次 +1133 忆 +1134 壮 +1135 社 +1136 席 +1137 物 +1138 陵 +1139 醉 +1140 毒 +1141 媚 +1142 風 +1143 积 +1144 佰 +1145 車 +1146 庚 +1147 过 +1148 猛 +1149 菁 +1150 母 +1151 两 +1152 龄 +1153 破 +1154 买 +1155 效 +1156 祺 +1157 發 +1158 玥 +1159 我 +1160 藏 +1161 县 +1162 号 +1163 坎 +1164 训 +1165 嘎 +1166 众 +1167 懿 +1168 ò +1169 底 +1170 党 +1171 門 +1172 尾 +1173 予 +1174 達 +1175 转 +1176 变 +1177 盟 +1178 是 +1179 阮 +1180 药 +1181 船 +1182 足 +1183 快 +1184 蘭 +1185 毓 +1186 乙 +1187 讯 +1188 杏 +1189 渡 +1190 陽 +1191 而 +1192 拓 +1193 象 +1194 喻 +1195 汗 +1196 眉 +1197 散 +1198 也 +1199 横 +1200 召 +1201 节 +1202 归 +1203 离 +1204 坪 +1205 位 +1206 制 +1207 暗 +1208 榕 +1209 今 +1210 量 +1211 器 +1212 仔 +1213 脱 +1214 所 +1215 交 +1216 结 +1217 轻 +1218 颂 +1219 现 +1220 又 +1221 界 +1222 病 +1223 封 +1224 祁 +1225 寅 +1226 岸 +1227 樱 +1228 阴 +1229 妖 +1230 澳 +1231 期 +1232 历 +1233 命 +1234 绮 +1235 彼 +1236 夕 +1237 丸 +1238 异 +1239 淳 +1240 苦 +1241 ó +1242 澄 +1243 求 +1244 開 +1245 杀 +1246 途 +1247 ü +1248 珀 +1249 调 +1250 沁 +1251 國 +1252 反 +1253 零 +1254 茗 +1255 0 +1256 族 +1257 蒲 +1258 泓 +1259 棠 +1260 引 +1261 弟 +1262 爾 +1263 牌 +1264 团 +1265 至 +1266 独 +1267 娴 +1268 迷 +1269 倒 +1270 瀚 +1271 铃 +1272 苍 +1273 淇 +1274 轮 +1275 狄 +1276 繁 +1277 樂 +1278 卜 +1279 氣 +1280 校 +1281 婚 +1282 断 +1283 霸 +1284 寶 +1285 固 +1286 豹 +1287 韬 +1288 隐 +1289 教 +1290 姓 +1291 1 +1292 极 +1293 带 +1294 走 +1295 羅 +1296 帮 +1297 亞 +1298 净 +1299 婕 +1300 难 +1301 挺 +1302 糖 +1303 招 +1304 凉 +1305 蜜 +1306 收 +1307 数 +1308 奧 +1309 雲 +1310 述 +1311 逊 +1312 杭 +1313 幽 +1314 脚 +1315 2 +1316 廉 +1317 桦 +1318 灰 +1319 医 +1320 与 +1321 陳 +1322 坝 +1323 芮 +1324 目 +1325 丘 +1326 舜 +1327 覃 +1328 潇 +1329 含 +1330 亲 +1331 铜 +1332 晚 +1333 支 +1334 猪 +1335 画 +1336 玖 +1337 ú +1338 店 +1339 项 +1340 渝 +1341 排 +1342 旋 +1343 笔 +1344 压 +1345 芷 +1346 报 +1347 強 +1348 乳 +1349 融 
+1350 笛 +1351 冈 +1352 的 +1353 棋 +1354 领 +1355 瑛 +1356 屈 +1357 狂 +1358 院 +1359 峻 +1360 孤 +1361 谋 +1362 未 +1363 兔 +1364 鲜 +1365 衍 +1366 术 +1367 吟 +1368 间 +1369 计 +1370 觉 +1371 泥 +1372 乱 +1373 蝶 +1374 倍 +1375 卷 +1376 残 +1377 蓬 +1378 对 +1379 植 +1380 耕 +1381 盾 +1382 迦 +1383 缪 +1384 条 +1385 域 +1386 欲 +1387 杯 +1388 虚 +1389 习 +1390 爷 +1391 早 +1392 麗 +1393 郡 +1394 浅 +1395 退 +1396 纸 +1397 策 +1398 a +1399 活 +1400 窦 +1401 攀 +1402 屏 +1403 刺 +1404 泳 +1405 旦 +1406 补 +1407 防 +1408 姝 +1409 恺 +1410 晔 +1411 肤 +1412 軍 +1413 漫 +1414 失 +1415 滕 +1416 背 +1417 词 +1418 晗 +1419 表 +1420 來 +1421 涂 +1422 坑 +1423 誉 +1424 装 +1425 受 +1426 甜 +1427 機 +1428 邪 +1429 嘴 +1430 雍 +1431 棉 +1432 霄 +1433 针 +1434 荆 +1435 料 +1436 鼠 +1437 革 +1438 炫 +1439 将 +1440 绝 +1441 锅 +1442 取 +1443 電 +1444 宿 +1445 货 +1446 粤 +1447 葵 +1448 姐 +1449 介 +1450 爵 +1451 阔 +1452 涅 +1453 闪 +1454 听 +1455 央 +1456 掌 +1457 近 +1458 贡 +1459 沉 +1460 迟 +1461 改 +1462 配 +1463 庙 +1464 染 +1465 铮 +1466 阎 +1467 芯 +1468 汐 +1469 颐 +1470 蛋 +1471 护 +1472 部 +1473 孚 +1474 伤 +1475 狐 +1476 饭 +1477 鼓 +1478 娄 +1479 戚 +1480 略 +1481 啸 +1482 幸 +1483 滋 +1484 指 +1485 悠 +1486 妻 +1487 脉 +1488 丛 +1489 警 +1490 模 +1491 洗 +1492 奶 +1493 枪 +1494 恶 +1495 宴 +1496 靳 +1497 3 +1498 契 +1499 想 +1500 床 +1501 泪 +1502 随 +1503 市 +1504 探 +1505 焰 +1506 豫 +1507 點 +1508 住 +1509 淮 +1510 炮 +1511 圈 +1512 吐 +1513 楷 +1514 话 +1515 線 +1516 接 +1517 假 +1518 仇 +1519 射 +1520 蜀 +1521 婴 +1522 佟 +1523 追 +1524 奔 +1525 胶 +1526 晏 +1527 兽 +1528 跳 +1529 弦 +1530 质 +1531 體 +1532 扣 +1533 更 +1534 卉 +1535 梨 +1536 形 +1537 息 +1538 壁 +1539 共 +1540 菱 +1541 闵 +1542 渠 +1543 感 +1544 寻 +1545 盐 +1546 惊 +1547 珺 +1548 慎 +1549 去 +1550 狮 +1551 韶 +1552 雙 +1553 瞳 +1554 宅 +1555 座 +1556 总 +1557 趣 +1558 萬 +1559 短 +1560 奉 +1561 滩 +1562 飛 +1563 扶 +1564 折 +1565 筠 +1566 寇 +1567 ā +1568 尖 +1569 暖 +1570 弥 +1571 惜 +1572 涌 +1573 符 +1574 8 +1575 匠 +1576 嫣 +1577 璞 +1578 杉 +1579 让 +1580 雾 +1581 動 +1582 蕴 +1583 处 +1584 宠 +1585 楊 +1586 务 +1587 猴 +1588 翻 +1589 到 +1590 竞 +1591 参 +1592 某 +1593 闽 +1594 送 +1595 匡 +1596 钊 +1597 薄 +1598 磨 +1599 芒 +1600 婵 +1601 厄 +1602 渔 +1603 户 +1604 推 +1605 研 +1606 纲 +1607 恭 +1608 聖 +1609 茅 +1610 资 +1611 宛 +1612 魅 +1613 软 +1614 胎 +1615 鸭 +1616 愚 +1617 喆 +1618 鉴 +1619 荒 +1620 协 +1621 罪 +1622 铎 +1623 迅 +1624 笙 +1625 語 +1626 葆 +1627 匹 +1628 区 +1629 绣 +1630 型 +1631 轶 +1632 额 +1633 消 +1634 靓 +1635 硬 +1636 着 +1637 姣 +1638 偏 +1639 票 +1640 碎 +1641 套 +1642 遥 +1643 冀 +1644 际 +1645 架 +1646 拳 +1647 巫 +1648 6 +1649 妇 +1650 赋 +1651 私 +1652 曙 +1653 站 +1654 载 +1655 抗 +1656 芦 +1657 膜 +1658 尸 +1659 适 +1660 错 +1661 潭 +1662 击 +1663 俭 +1664 巢 +1665 幻 +1666 婆 +1667 麒 +1668 值 +1669 止 +1670 种 +1671 維 +1672 c +1673 岐 +1674 後 +1675 伶 +1676 墙 +1677 刃 +1678 缇 +1679 琰 +1680 殇 +1681 烧 +1682 窝 +1683 砚 +1684 無 +1685 矿 +1686 遗 +1687 争 +1688 怪 +1689 b +1690 末 +1691 逆 +1692 码 +1693 释 +1694 屠 +1695 问 +1696 恬 +1697 腰 +1698 掉 +1699 時 +1700 具 +1701 脸 +1702 璋 +1703 隋 +1704 芽 +1705 控 +1706 壹 +1707 甄 +1708 會 +1709 价 +1710 劫 +1711 菌 +1712 熱 +1713 岁 +1714 痛 +1715 刻 +1716 單 +1717 咸 +1718 書 +1719 兮 +1720 服 +1721 敖 +1722 禁 +1723 差 +1724 沫 +1725 栗 +1726 暮 +1727 倾 +1728 戰 +1729 投 +1730 戏 +1731 币 +1732 要 +1733 造 +1734 冥 +1735 肌 +1736 降 +1737 龟 +1738 低 +1739 o +1740 痕 +1741 學 +1742 弹 +1743 淡 +1744 迹 +1745 箭 +1746 岑 +1747 读 +1748 灭 +1749 萝 +1750 潜 +1751 穗 +1752 俄 +1753 吊 +1754 虞 +1755 斑 +1756 炉 +1757 肥 +1758 说 +1759 稳 +1760 焱 +1761 隽 +1762 急 +1763 橙 +1764 卞 +1765 雀 +1766 停 +1767 槐 +1768 级 +1769 剧 +1770 姑 +1771 岱 +1772 e +1773 弄 +1774 脑 +1775 蔓 +1776 论 +1777 壳 +1778 鼻 +1779 圖 +1780 醒 +1781 犬 +1782 堤 +1783 闲 +1784 坐 +1785 专 +1786 蜂 +1787 饶 +1788 证 +1789 液 +1790 莺 +1791 导 +1792 跑 +1793 砂 
+1794 谈 +1795 虾 +1796 湛 +1797 杂 +1798 看 +1799 父 +1800 埠 +1801 盲 +1802 敌 +1803 泛 +1804 摇 +1805 翎 +1806 霆 +1807 核 +1808 屿 +1809 换 +1810 股 +1811 产 +1812 呈 +1813 漏 +1814 興 +1815 铺 +1816 刑 +1817 省 +1818 裝 +1819 刁 +1820 曰 +1821 劉 +1822 察 +1823 除 +1824 齿 +1825 峥 +1826 牟 +1827 飘 +1828 律 +1829 鞋 +1830 禅 +1831 瞿 +1832 右 +1833 璟 +1834 滑 +1835 煤 +1836 滢 +1837 琨 +1838 逢 +1839 税 +1840 宮 +1841 状 +1842 納 +1843 谨 +1844 寄 +1845 弓 +1846 练 +1847 序 +1848 纱 +1849 恨 +1850 凱 +1851 寧 +1852 帶 +1853 境 +1854 局 +1855 操 +1856 妤 +1857 裂 +1858 猎 +1859 眠 +1860 泡 +1861 辞 +1862 i +1863 势 +1864 戎 +1865 室 +1866 順 +1867 透 +1868 享 +1869 演 +1870 裘 +1871 由 +1872 助 +1873 第 +1874 奋 +1875 储 +1876 伐 +1877 沪 +1878 9 +1879 磁 +1880 拍 +1881 盼 +1882 珈 +1883 贻 +1884 偷 +1885 混 +1886 仰 +1887 队 +1888 場 +1889 胤 +1890 呼 +1891 案 +1892 驹 +1893 还 +1894 铂 +1895 栾 +1896 腿 +1897 响 +1898 禧 +1899 溢 +1900 饼 +1901 4 +1902 馆 +1903 材 +1904 粮 +1905 姗 +1906 缺 +1907 桢 +1908 業 +1909 歆 +1910 惟 +1911 纹 +1912 祯 +1913 崖 +1914 预 +1915 肇 +1916 連 +1917 悲 +1918 唱 +1919 鹭 +1920 胸 +1921 杆 +1922 暴 +1923 園 +1924 准 +1925 汶 +1926 吳 +1927 钻 +1928 纤 +1929 氧 +1930 冶 +1931 脂 +1932 怨 +1933 島 +1934 爆 +1935 尽 +1936 夹 +1937 挂 +1938 肠 +1939 绵 +1940 崎 +1941 銀 +1942 措 +1943 算 +1944 陀 +1945 橋 +1946 执 +1947 职 +1948 徽 +1949 邑 +1950 瑪 +1951 荡 +1952 戒 +1953 旧 +1954 丑 +1955 浓 +1956 便 +1957 仑 +1958 歇 +1959 縣 +1960 围 +1961 纬 +1962 褚 +1963 丞 +1964 胆 +1965 辅 +1966 减 +1967 贯 +1968 圭 +1969 乘 +1970 率 +1971 別 +1972 藍 +1973 扇 +1974 萊 +1975 瘦 +1976 漢 +1977 n +1978 滿 +1979 榆 +1980 屹 +1981 廣 +1982 句 +1983 借 +1984 鞠 +1985 垂 +1986 骥 +1987 鐵 +1988 雞 +1989 號 +1990 胃 +1991 玩 +1992 雕 +1993 罕 +1994 墩 +1995 谊 +1996 贼 +1997 對 +1998 件 +1999 编 +2000 d +2001 嫂 +2002 葉 +2003 栓 +2004 湿 +2005 统 +2006 箱 +2007 庸 +2008 终 +2009 轉 +2010 吹 +2011 噶 +2012 炼 +2013 聯 +2014 谱 +2015 悬 +2016 甸 +2017 兩 +2018 委 +2019 徒 +2020 午 +2021 忘 +2022 藻 +2023 遇 +2024 師 +2025 數 +2026 激 +2027 經 +2028 炯 +2029 怒 +2030 珏 +2031 靈 +2032 熹 +2033 靜 +2034 兒 +2035 報 +2036 調 +2037 圩 +2038 袋 +2039 妆 +2040 各 +2041 祭 +2042 层 +2043 聲 +2044 陌 +2045 幕 +2046 帽 +2047 了 +2048 舌 +2049 碗 +2050 記 +2051 窑 +2052 丕 +2053 貝 +2054 盤 +2055 過 +2056 醇 +2057 紧 +2058 类 +2059 娣 +2060 嵘 +2061 弃 +2062 嵩 +2063 卖 +2064 侨 +2065 p +2066 块 +2067 束 +2068 绳 +2069 橫 +2070 鄂 +2071 窗 +2072 粒 +2073 膏 +2074 灏 +2075 義 +2076 馥 +2077 藥 +2078 卧 +2079 夷 +2080 诸 +2081 侃 +2082 抱 +2083 絲 +2084 故 +2085 厨 +2086 喷 +2087 荔 +2088 俏 +2089 凶 +2090 斜 +2091 忍 +2092 關 +2093 完 +2094 皖 +2095 逃 +2096 榜 +2097 样 +2098 淫 +2099 運 +2100 喀 +2101 互 +2102 浆 +2103 結 +2104 侧 +2105 闯 +2106 抽 +2107 腊 +2108 秘 +2109 请 +2110 写 +2111 续 +2112 组 +2113 此 +2114 烁 +2115 吸 +2116 销 +2117 翊 +2118 漾 +2119 荫 +2120 進 +2121 ù +2122 键 +2123 囚 +2124 等 +2125 疏 +2126 弱 +2127 棒 +2128 渣 +2129 嫁 +2130 夺 +2131 链 +2132 懒 +2133 你 +2134 骁 +2135 励 +2136 胖 +2137 螺 +2138 恰 +2139 珉 +2140 须 +2141 墅 +2142 款 +2143 堆 +2144 轴 +2145 整 +2146 咪 +2147 注 +2148 救 +2149 網 +2150 勾 +2151 播 +2152 称 +2153 裸 +2154 频 +2155 棚 +2156 尿 +2157 珑 +2158 旻 +2159 害 +2160 枣 +2161 阵 +2162 备 +2163 稻 +2164 叫 +2165 就 +2166 攻 +2167 辣 +2168 邻 +2169 俐 +2170 昀 +2171 踏 +2172 肝 +2173 坛 +2174 像 +2175 夢 +2176 愿 +2177 斩 +2178 腹 +2179 苟 +2180 愁 +2181 樹 +2182 錢 +2183 蟹 +2184 傻 +2185 鹅 +2186 态 +2187 苇 +2188 筒 +2189 溫 +2190 諾 +2191 蕙 +2192 穿 +2193 紙 +2194 涧 +2195 奸 +2196 厂 +2197 鸥 +2198 琅 +2199 漆 +2200 昶 +2201 檀 +2202 险 +2203 昇 +2204 補 +2205 译 +2206 枕 +2207 悅 +2208 持 +2209 评 +2210 庵 +2211 黔 +2212 煞 +2213 拾 +2214 熟 +2215 试 +2216 题 +2217 浴 +2218 遠 +2219 摆 +2220 邬 +2221 枯 +2222 鞭 +2223 蔻 +2224 7 +2225 劍 +2226 吃 +2227 勉 +2228 纶 +2229 迁 +2230 伴 +2231 疯 +2232 使 +2233 肃 +2234 审 +2235 梭 +2236 他 +2237 拔 
+2238 悟 +2239 穴 +2240 豐 +2241 勝 +2242 實 +2243 綠 +2244 玻 +2245 彻 +2246 告 +2247 蛮 +2248 抢 +2249 瓷 +2250 枢 +2251 系 +2252 峡 +2253 蘇 +2254 淘 +2255 负 +2256 s +2257 员 +2258 乎 +2259 邊 +2260 賽 +2261 歐 +2262 纵 +2263 哀 +2264 被 +2265 籍 +2266 肩 +2267 尺 +2268 圓 +2269 旅 +2270 漪 +2271 泗 +2272 莊 +2273 臧 +2274 標 +2275 朔 +2276 搜 +2277 塑 +2278 視 +2279 狱 +2280 铸 +2281 筑 +2282 附 +2283 剂 +2284 筋 +2285 柜 +2286 购 +2287 滚 +2288 驴 +2289 腳 +2290 墓 +2291 盆 +2292 骑 +2293 溜 +2294 垒 +2295 陰 +2296 始 +2297 废 +2298 赢 +2299 隔 +2300 粗 +2301 议 +2302 峪 +2303 蒸 +2304 傷 +2305 芊 +2306 砖 +2307 變 +2308 检 +2309 巾 +2310 充 +2311 免 +2312 版 +2313 拼 +2314 笼 +2315 袖 +2316 滔 +2317 鴻 +2318 貨 +2319 置 +2320 疮 +2321 灌 +2322 槽 +2323 厉 +2324 錦 +2325 瓶 +2326 企 +2327 栖 +2328 吧 +2329 睡 +2330 渭 +2331 梯 +2332 胥 +2333 织 +2334 價 +2335 荟 +2336 坏 +2337 唇 +2338 澈 +2339 臭 +2340 怜 +2341 赌 +2342 玫 +2343 柒 +2344 囊 +2345 慢 +2346 樓 +2347 穷 +2348 養 +2349 扫 +2350 僧 +2351 鸽 +2352 凰 +2353 燃 +2354 溶 +2355 绒 +2356 勿 +2357 亡 +2358 贴 +2359 燈 +2360 詞 +2361 宰 +2362 湯 +2363 鲸 +2364 帛 +2365 漠 +2366 饰 +2367 吻 +2368 條 +2369 惑 +2370 詩 +2371 做 +2372 u +2373 財 +2374 阅 +2375 移 +2376 忧 +2377 诱 +2378 麥 +2379 奚 +2380 串 +2381 級 +2382 奖 +2383 寂 +2384 剪 +2385 盗 +2386 偶 +2387 妈 +2388 驿 +2389 突 +2390 滴 +2391 煊 +2392 昔 +2393 往 +2394 限 +2395 帐 +2396 蛟 +2397 败 +2398 輝 +2399 椿 +2400 殺 +2401 酱 +2402 約 +2403 撞 +2404 痴 +2405 庐 +2406 寰 +2407 陪 +2408 苹 +2409 辽 +2410 霓 +2411 擎 +2412 澤 +2413 俗 +2414 嗣 +2415 拥 +2416 t +2417 碟 +2418 待 +2419 菡 +2420 缸 +2421 傳 +2422 阶 +2423 络 +2424 欠 +2425 兄 +2426 殊 +2427 枭 +2428 遂 +2429 難 +2430 環 +2431 课 +2432 危 +2433 巡 +2434 話 +2435 耘 +2436 樟 +2437 逐 +2438 候 +2439 遊 +2440 爪 +2441 钉 +2442 畫 +2443 當 +2444 疆 +2445 插 +2446 糕 +2447 薪 +2448 阻 +2449 缩 +2450 頂 +2451 割 +2452 袭 +2453 弯 +2454 挑 +2455 铨 +2456 見 +2457 葬 +2458 咒 +2459 倚 +2460 祎 +2461 贷 +2462 輪 +2463 筆 +2464 测 +2465 產 +2466 蜡 +2467 每 +2468 脫 +2469 腔 +2470 仟 +2471 叙 +2472 h +2473 肾 +2474 領 +2475 误 +2476 熠 +2477 邮 +2478 荃 +2479 ē +2480 稅 +2481 径 +2482 扁 +2483 臨 +2484 g +2485 绯 +2486 蓮 +2487 缝 +2488 伪 +2489 悉 +2490 碳 +2491 丫 +2492 魯 +2493 援 +2494 宙 +2495 蚁 +2496 換 +2497 費 +2498 莘 +2499 刊 +2500 區 +2501 疾 +2502 炬 +2503 己 +2504 巩 +2505 祈 +2506 伞 +2507 妥 +2508 孜 +2509 襄 +2510 拖 +2511 呆 +2512 汁 +2513 猿 +2514 疑 +2515 赟 +2516 及 +2517 叉 +2518 缠 +2519 裤 +2520 硫 +2521 翘 +2522 丧 +2523 识 +2524 赐 +2525 頓 +2526 椰 +2527 戶 +2528 x +2529 浙 +2530 笃 +2531 壶 +2532 哉 +2533 饮 +2534 俪 +2535 碑 +2536 倫 +2537 潤 +2538 截 +2539 棍 +2540 规 +2541 餐 +2542 岙 +2543 稿 +2544 绘 +2545 骐 +2546 牢 +2547 累 +2548 葱 +2549 裙 +2550 衫 +2551 侍 +2552 哨 +2553 離 +2554 叹 +2555 祸 +2556 避 +2557 萃 +2558 蒿 +2559 哭 +2560 將 +2561 几 +2562 渐 +2563 决 +2564 供 +2565 斷 +2566 困 +2567 租 +2568 闷 +2569 灼 +2570 氯 +2571 扑 +2572 例 +2573 膠 +2574 間 +2575 橘 +2576 虛 +2577 飯 +2578 尉 +2579 蟲 +2580 赣 +2581 涼 +2582 灾 +2583 質 +2584 犯 +2585 % +2586 導 +2587 節 +2588 轨 +2589 拐 +2590 瀛 +2591 骞 +2592 沅 +2593 妾 +2594 骅 +2595 旁 +2596 觅 +2597 且 +2598 示 +2599 似 +2600 赏 +2601 粟 +2602 復 +2603 哑 +2604 觀 +2605 敢 +2606 只 +2607 烏 +2608 親 +2609 姨 +2610 豬 +2611 著 +2612 選 +2613 浚 +2614 兜 +2615 监 +2616 驾 +2617 并 +2618 蚕 +2619 針 +2620 磷 +2621 扩 +2622 烂 +2623 履 +2624 泼 +2625 闹 +2626 泾 +2627 办 +2628 吞 +2629 蛙 +2630 焊 +2631 坟 +2632 盒 +2633 愈 +2634 y +2635 焚 +2636 抓 +2637 偉 +2638 垚 +2639 烤 +2640 羚 +2641 淋 +2642 披 +2643 阙 +2644 m +2645 罡 +2646 慰 +2647 洼 +2648 髮 +2649 柄 +2650 燒 +2651 荻 +2652 弈 +2653 番 +2654 參 +2655 技 +2656 碱 +2657 捕 +2658 夸 +2659 逼 +2660 漂 +2661 鳞 +2662 慶 +2663 鸾 +2664 裳 +2665 樵 +2666 隊 +2667 懋 +2668 稀 +2669 預 +2670 验 +2671 缓 +2672 旱 +2673 函 +2674 稚 +2675 鲨 +2676 幅 +2677 佘 +2678 資 +2679 返 +2680 划 +2681 專 
+2682 沖 +2683 忌 +2684 藩 +2685 璃 +2686 奏 +2687 陇 +2688 腸 +2689 鎮 +2690 廊 +2691 批 +2692 绫 +2693 签 +2694 幺 +2695 忻 +2696 璧 +2697 肽 +2698 涉 +2699 桶 +2700 苔 +2701 搭 +2702 替 +2703 種 +2704 把 +2705 鳳 +2706 減 +2707 苓 +2708 锤 +2709 優 +2710 煙 +2711 即 +2712 舰 +2713 颈 +2714 贱 +2715 钩 +2716 冻 +2717 獨 +2718 銅 +2719 卯 +2720 妞 +2721 碰 +2722 袍 +2723 赶 +2724 填 +2725 霁 +2726 债 +2727 闸 +2728 择 +2729 趙 +2730 胺 +2731 阜 +2732 絕 +2733 刮 +2734 罐 +2735 虐 +2736 扭 +2737 铝 +2738 钙 +2739 聘 +2740 汽 +2741 铅 +2742 牵 +2743 烽 +2744 棣 +2745 葯 +2746 恕 +2747 藝 +2748 售 +2749 極 +2750 壓 +2751 喉 +2752 皂 +2753 触 +2754 異 +2755 彈 +2756 菇 +2757 翅 +2758 垫 +2759 腦 +2760 寸 +2761 珩 +2762 锌 +2763 昏 +2764 膳 +2765 逝 +2766 绅 +2767 损 +2768 現 +2769 l +2770 肺 +2771 畏 +2772 伙 +2773 煦 +2774 挽 +2775 韓 +2776 涤 +2777 v +2778 霏 +2779 恐 +2780 炸 +2781 貓 +2782 鳥 +2783 芋 +2784 笠 +2785 冢 +2786 坂 +2787 叠 +2788 皋 +2789 腐 +2790 桓 +2791 噴 +2792 皆 +2793 蝉 +2794 崩 +2795 鋼 +2796 忙 +2797 疗 +2798 篇 +2799 鄉 +2800 跨 +2801 答 +2802 衛 +2803 涩 +2804 庫 +2805 處 +2806 驼 +2807 硝 +2808 堃 +2809 試 +2810 務 +2811 棕 +2812 孕 +2813 杖 +2814 爹 +2815 劇 +2816 椒 +2817 拙 +2818 兼 +2819 诡 +2820 册 +2821 應 +2822 栏 +2823 仿 +2824 抛 +2825 卒 +2826 访 +2827 枚 +2828 鲤 +2829 f +2830 卵 +2831 孽 +2832 蚀 +2833 认 +2834 歪 +2835 厦 +2836 钛 +2837 挖 +2838 哇 +2839 熏 +2840 涯 +2841 悍 +2842 咬 +2843 曉 +2844 竺 +2845 厝 +2846 說 +2847 鲲 +2848 遮 +2849 榮 +2850 弋 +2851 跟 +2852 臂 +2853 貴 +2854 禮 +2855 創 +2856 骄 +2857 讲 +2858 距 +2859 硅 +2860 灣 +2861 恆 +2862 權 +2863 臺 +2864 览 +2865 贫 +2866 圃 +2867 孑 +2868 磐 +2869 澎 +2870 醫 +2871 陸 +2872 刷 +2873 笋 +2874 属 +2875 贪 +2876 町 +2877 堰 +2878 闭 +2879 彰 +2880 账 +2881 已 +2882 評 +2883 侬 +2884 農 +2885 覆 +2886 拨 +2887 炒 +2888 洙 +2889 臉 +2890 媒 +2891 爬 +2892 捞 +2893 嫩 +2894 肚 +2895 鏡 +2896 驱 +2897 伸 +2898 甚 +2899 掛 +2900 垣 +2901 况 +2902 滞 +2903 匯 +2904 催 +2905 傑 +2906 ū +2907 總 +2908 桔 +2909 猜 +2910 炽 +2911 職 +2912 冒 +2913 莽 +2914 聽 +2915 骚 +2916 洒 +2917 曜 +2918 衰 +2919 绕 +2920 暄 +2921 诉 +2922 授 +2923 奢 +2924 題 +2925 晃 +2926 眸 +2927 踢 +2928 妄 +2929 護 +2930 簡 +2931 丈 +2932 灶 +2933 诊 +2934 罩 +2935 醋 +2936 桩 +2937 崗 +2938 绞 +2939 沧 +2940 裁 +2941 拆 +2942 镁 +2943 犁 +2944 判 +2945 尕 +2946 氢 +2947 鸠 +2948 劝 +2949 竖 +2950 飚 +2951 最 +2952 蹄 +2953 羡 +2954 陷 +2955 缨 +2956 旷 +2957 页 +2958 翌 +2959 烛 +2960 筝 +2961 毁 +2962 戀 +2963 荀 +2964 陂 +2965 貼 +2966 鶴 +2967 讀 +2968 輕 +2969 档 +2970 抚 +2971 副 +2972 订 +2973 槍 +2974 凹 +2975 編 +2976 稼 +2977 拱 +2978 雏 +2979 碼 +2980 桌 +2981 霉 +2982 睦 +2983 骊 +2984 摸 +2985 證 +2986 茄 +2987 絮 +2988 匪 +2989 豚 +2990 酥 +2991 團 +2992 厅 +2993 获 +2994 鸦 +2995 押 +2996 沿 +2997 逗 +2998 愉 +2999 椅 +3000 卦 +3001 鞍 +3002 笨 +3003 寫 +3004 純 +3005 緣 +3006 竟 +3007 組 +3008 抄 +3009 滇 +3010 粪 +3011 鍋 +3012 淦 +3013 佬 +3014 泣 +3015 弼 +3016 俠 +3017 旸 +3018 浑 +3019 绥 +3020 设 +3021 薯 +3022 梧 +3023 亢 +3024 幹 +3025 症 +3026 舫 +3027 煮 +3028 咔 +3029 軟 +3030 賢 +3031 賣 +3032 狀 +3033 癌 +3034 氨 +3035 靠 +3036 細 +3037 揭 +3038 构 +3039 彧 +3040 帘 +3041 卤 +3042 秒 +3043 镭 +3044 潼 +3045 k +3046 韧 +3047 栩 +3048 熔 +3049 坞 +3050 污 +3051 遵 +3052 製 +3053 孫 +3054 羲 +3055 忽 +3056 勐 +3057 營 +3058 纷 +3059 殘 +3060 脊 +3061 寡 +3062 洵 +3063 仆 +3064 劈 +3065 辩 +3066 鐘 +3067 缤 +3068 禽 +3069 甬 +3070 勺 +3071 佃 +3072 茸 +3073 蛾 +3074 谁 +3075 虽 +3076 痰 +3077 凸 +3078 酮 +3079 腕 +3080 宵 +3081 穹 +3082 惡 +3083 計 +3084 r +3085 钓 +3086 抵 +3087 给 +3088 晕 +3089 課 +3090 許 +3091 員 +3092 综 +3093 茉 +3094 亂 +3095 啟 +3096 問 +3097 捐 +3098 烦 +3099 脆 +3100 備 +3101 棱 +3102 埋 +3103 泷 +3104 洽 +3105 珞 +3106 婦 +3107 羞 +3108 确 +3109 隨 +3110 犀 +3111 蚊 +3112 毫 +3113 謝 +3114 糊 +3115 颠 +3116 喵 +3117 胞 +3118 邸 +3119 軒 +3120 測 +3121 份 +3122 斧 +3123 弧 +3124 矛 +3125 冕 
+3126 琉 +3127 狸 +3128 扒 +3129 甩 +3130 肆 +3131 柚 +3132 屎 +3133 庶 +3134 蓋 +3135 額 +3136 否 +3137 擊 +3138 鴨 +3139 旨 +3140 峙 +3141 騰 +3142 購 +3143 歸 +3144 遁 +3145 檢 +3146 缔 +3147 矮 +3148 煎 +3149 紋 +3150 浸 +3151 梗 +3152 瑰 +3153 闺 +3154 挡 +3155 砍 +3156 筹 +3157 涟 +3158 宥 +3159 纺 +3160 贸 +3161 聊 +3162 缅 +3163 沣 +3164 芃 +3165 銷 +3166 潞 +3167 溥 +3168 虱 +3169 矢 +3170 梳 +3171 输 +3172 晁 +3173 穎 +3174 獸 +3175 呂 +3176 飒 +3177 頻 +3178 析 +3179 帖 +3180 懷 +3181 旬 +3182 裡 +3183 焉 +3184 漁 +3185 層 +3186 个 +3187 跌 +3188 粘 +3189 役 +3190 揚 +3191 鵬 +3192 鳌 +3193 驻 +3194 罚 +3195 晞 +3196 乖 +3197 搏 +3198 岔 +3199 氮 +3200 琢 +3201 粹 +3202 碘 +3203 抹 +3204 骗 +3205 湄 +3206 玟 +3207 鸢 +3208 沸 +3209 誓 +3210 歡 +3211 削 +3212 臀 +3213 铠 +3214 滾 +3215 憨 +3216 框 +3217 耗 +3218 摘 +3219 责 +3220 障 +3221 赠 +3222 遺 +3223 瑄 +3224 搖 +3225 鷹 +3226 踪 +3227 歷 +3228 嶺 +3229 葳 +3230 瑤 +3231 倉 +3232 潔 +3233 拒 +3234 統 +3235 据 +3236 衬 +3237 麓 +3238 啦 +3239 怕 +3240 魄 +3241 窃 +3242 侵 +3243 為 +3244 薩 +3245 璨 +3246 署 +3247 蒼 +3248 叁 +3249 炭 +3250 類 +3251 炀 +3252 讨 +3253 聆 +3254 蝇 +3255 冤 +3256 轰 +3257 裔 +3258 粥 +3259 涨 +3260 沂 +3261 沼 +3262 決 +3263 悔 +3264 壽 +3265 夙 +3266 荼 +3267 ī +3268 按 +3269 担 +3270 堪 +3271 卑 +3272 尋 +3273 苯 +3274 垢 +3275 忱 +3276 濠 +3277 貌 +3278 骂 +3279 澍 +3280 靡 +3281 谜 +3282 館 +3283 璜 +3284 隱 +3285 拴 +3286 瞬 +3287 扰 +3288 违 +3289 铿 +3290 聿 +3291 瞻 +3292 犹 +3293 箫 +3294 酉 +3295 很 +3296 勞 +3297 岡 +3298 燮 +3299 蔺 +3300 薰 +3301 缚 +3302 锭 +3303 楓 +3304 绩 +3305 督 +3306 芥 +3307 茧 +3308 緊 +3309 坠 +3310 辜 +3311 辈 +3312 惨 +3313 搬 +3314 翀 +3315 幣 +3316 镐 +3317 涓 +3318 敛 +3319 锚 +3320 錯 +3321 凭 +3322 埔 +3323 劣 +3324 吏 +3325 糜 +3326 浊 +3327 術 +3328 積 +3329 却 +3330 刹 +3331 蒜 +3332 溯 +3333 餅 +3334 瞎 +3335 锴 +3336 钜 +3337 籽 +3338 掩 +3339 孩 +3340 簽 +3341 驚 +3342 肿 +3343 邝 +3344 谟 +3345 ě +3346 億 +3347 患 +3348 終 +3349 襟 +3350 跪 +3351 獅 +3352 没 +3353 浣 +3354 渚 +3355 痞 +3356 脾 +3357 滤 +3358 凄 +3359 歧 +3360 鎖 +3361 柠 +3362 態 +3363 擒 +3364 泄 +3365 皙 +3366 晒 +3367 陕 +3368 柿 +3369 锟 +3370 膝 +3371 握 +3372 濕 +3373 循 +3374 淹 +3375 敷 +3376 樣 +3377 規 +3378 挚 +3379 址 +3380 論 +3381 株 +3382 仗 +3383 稱 +3384 還 +3385 氟 +3386 辟 +3387 谛 +3388 谌 +3389 譜 +3390 锥 +3391 亏 +3392 阀 +3393 锯 +3394 蛊 +3395 撤 +3396 扯 +3397 钞 +3398 獎 +3399 錄 +3400 銘 +3401 茫 +3402 崧 +3403 侣 +3404 乞 +3405 欺 +3406 瘤 +3407 篮 +3408 泠 +3409 阚 +3410 濑 +3411 钳 +3412 荊 +3413 咲 +3414 蝎 +3415 卸 +3416 耍 +3417 摄 +3418 惹 +3419 壬 +3420 辱 +3421 柑 +3422 顽 +3423 铉 +3424 祚 +3425 複 +3426 挥 +3427 蛤 +3428 沾 +3429 脏 +3430 找 +3431 圍 +3432 促 +3433 賓 +3434 朮 +3435 挤 +3436 郊 +3437 既 +3438 舅 +3439 給 +3440 咕 +3441 骋 +3442 夾 +3443 鄭 +3444 鈴 +3445 浒 +3446 酶 +3447 屁 +3448 茲 +3449 迫 +3450 焯 +3451 晰 +3452 戲 +3453 驗 +3454 舸 +3455 驭 +3456 肢 +3457 罢 +3458 嫡 +3459 栈 +3460 箐 +3461 这 +3462 銮 +3463 認 +3464 鬥 +3465 縮 +3466 愤 +3467 郜 +3468 仝 +3469 递 +3470 勢 +3471 ō +3472 贰 +3473 粵 +3474 痘 +3475 姦 +3476 缴 +3477 揽 +3478 恪 +3479 舵 +3480 艷 +3481 葡 +3482 鋒 +3483 叛 +3484 産 +3485 窩 +3486 嵌 +3487 敲 +3488 蓄 +3489 泻 +3490 畜 +3491 抒 +3492 韻 +3493 項 +3494 摊 +3495 疃 +3496 の +3497 烯 +3498 吓 +3499 戊 +3500 腺 +3501 褲 +3502 監 +3503 谣 +3504 廠 +3505 迭 +3506 鄢 +3507 谏 +3508 載 +3509 拂 +3510 茎 +3511 俱 +3512 斤 +3513 紀 +3514 颤 +3515 尝 +3516 沥 +3517 習 +3518 淞 +3519 昧 +3520 逍 +3521 嗨 +3522 榴 +3523 臥 +3524 嬌 +3525 側 +3526 券 +3527 渗 +3528 雜 +3529 閃 +3530 盜 +3531 艇 +3532 喬 +3533 详 +3534 秃 +3535 採 +3536 汛 +3537 呀 +3538 厌 +3539 喊 +3540 訂 +3541 訊 +3542 燊 +3543 栅 +3544 誠 +3545 夭 +3546 皱 +3547 蛛 +3548 矣 +3549 鳴 +3550 攸 +3551 麵 +3552 冼 +3553 儀 +3554 晉 +3555 濤 +3556 莓 +3557 齊 +3558 晦 +3559 竣 +3560 抖 +3561 w +3562 キ +3563 墻 +3564 媽 +3565 敗 +3566 淺 +3567 礁 +3568 荐 +3569 估 
+3570 驳 +3571 舱 +3572 绰 +3573 宦 +3574 泵 +3575 寮 +3576 雌 +3577 脐 +3578 舊 +3579 續 +3580 弩 +3581 羌 +3582 拌 +3583 瓣 +3584 戟 +3585 髓 +3586 暑 +3587 婶 +3588 撕 +3589 豁 +3590 竿 +3591 隙 +3592 谓 +3593 铖 +3594 旌 +3595 蝦 +3596 秧 +3597 或 +3598 颢 +3599 兑 +3600 厥 +3601 鳄 +3602 暂 +3603 汾 +3604 钝 +3605 杠 +3606 買 +3607 苒 +3608 牆 +3609 炊 +3610 糠 +3611 矾 +3612 懂 +3613 侗 +3614 剛 +3615 壇 +3616 帳 +3617 櫃 +3618 毀 +3619 湧 +3620 捉 +3621 練 +3622 窖 +3623 緑 +3624 沽 +3625 馋 +3626 斥 +3627 郵 +3628 喇 +3629 垛 +3630 概 +3631 们 +3632 岂 +3633 腎 +3634 銳 +3635 岷 +3636 烙 +3637 掠 +3638 浜 +3639 泸 +3640 醬 +3641 沱 +3642 蔷 +3643 皎 +3644 榛 +3645 檐 +3646 閣 +3647 抬 +3648 顏 +3649 橡 +3650 镛 +3651 塊 +3652 盡 +3653 壯 +3654 靴 +3655 亥 +3656 酚 +3657 窄 +3658 肛 +3659 亘 +3660 糟 +3661 烘 +3662 貂 +3663 講 +3664 狠 +3665 窥 +3666 賭 +3667 賀 +3668 莞 +3669 箕 +3670 爺 +3671 喘 +3672 但 +3673 咖 +3674 織 +3675 い +3676 彿 +3677 唤 +3678 蕉 +3679 僵 +3680 熬 +3681 妓 +3682 踩 +3683 铲 +3684 匙 +3685 撑 +3686 弛 +3687 耻 +3688 丢 +3689 堵 +3690 膽 +3691 厘 +3692 辨 +3693 瓢 +3694 崴 +3695 篱 +3696 碾 +3697 畔 +3698 涝 +3699 膚 +3700 绛 +3701 黏 +3702 屑 +3703 衝 +3704 簧 +3705 杞 +3706 轲 +3707 贲 +3708 溝 +3709 烷 +3710 霧 +3711 塵 +3712 瘾 +3713 颉 +3714 凿 +3715 彝 +3716 诛 +3717 訪 +3718 鮮 +3719 覺 +3720 歲 +3721 窟 +3722 週 +3723 苞 +3724 濟 +3725 叟 +3726 爭 +3727 椎 +3728 療 +3729 眾 +3730 審 +3731 拋 +3732 棘 +3733 诀 +3734 鹃 +3735 倦 +3736 擦 +3737 暢 +3738 酬 +3739 蠢 +3740 聞 +3741 囧 +3742 從 +3743 脈 +3744 缆 +3745 陋 +3746 哪 +3747 酿 +3748 娆 +3749 屍 +3750 檬 +3751 捧 +3752 凛 +3753 靶 +3754 疣 +3755 餘 +3756 鹊 +3757 陣 +3758 昙 +3759 栎 +3760 鳖 +3761 镶 +3762 飄 +3763 烫 +3764 芜 +3765 垦 +3766 癣 +3767 蟾 +3768 萤 +3769 寓 +3770 診 +3771 蚌 +3772 霈 +3773 诈 +3774 負 +3775 吼 +3776 疹 +3777 縫 +3778 則 +3779 鹽 +3780 啊 +3781 捣 +3782 勘 +3783 俯 +3784 陡 +3785 叮 +3786 $ +3787 饱 +3788 寬 +3789 帥 +3790 漿 +3791 掘 +3792 棺 +3793 汞 +3794 钵 +3795 こ +3796 绸 +3797 括 +3798 濂 +3799 壞 +3800 躲 +3801 拦 +3802 錫 +3803 拟 +3804 钠 +3805 嘛 +3806 趋 +3807 遣 +3808 谐 +3809 墟 +3810 喧 +3811 榭 +3812 閉 +3813 筛 +3814 j +3815 渴 +3816 峨 +3817 嬰 +3818 巳 +3819 梢 +3820 漱 +3821 疤 +3822 祉 +3823 矽 +3824 痒 +3825 咽 +3826 邀 +3827 缀 +3828 庇 +3829 虔 +3830 盏 +3831 羿 +3832 抑 +3833 叨 +3834 弑 +3835 唛 +3836 侑 +3837 賊 +3838 稽 +3839 黨 +3840 妝 +3841 谍 +3842 蓁 +3843 ま +3844 蕃 +3845 藜 +3846 赘 +3847 诞 +3848 眷 +3849 够 +3850 岫 +3851 釣 +3852 喃 +3853 樑 +3854 钮 +3855 鋪 +3856 牡 +3857 溴 +3858 缕 +3859 溺 +3860 溟 +3861 描 +3862 渺 +3863 藕 +3864 胚 +3865 刨 +3866 獵 +3867 琬 +3868 寝 +3869 稷 +3870 缎 +3871 锈 +3872 需 +3873 遍 +3874 醛 +3875 戬 +3876 噬 +3877 闰 +3878 蔣 +3879 協 +3880 響 +3881 顯 +3882 飾 +3883 厢 +3884 钗 +3885 毯 +3886 询 +3887 簪 +3888 堅 +3889 鼬 +3890 貢 +3891 遭 +3892 肘 +3893 燥 +3894 砸 +3895 趾 +3896 豔 +3897 蟒 +3898 淨 +3899 廟 +3900 唑 +3901 z +3902 诠 +3903 垭 +3904 龜 +3905 剥 +3906 辦 +3907 翱 +3908 挨 +3909 峽 +3910 紗 +3911 拘 +3912 绢 +3913 畴 +3914 蔼 +3915 隶 +3916 溃 +3917 濃 +3918 碌 +3919 宓 +3920 趴 +3921 浔 +3922 搞 +3923 挪 +3924 楞 +3925 邈 +3926 虑 +3927 捌 +3928 舉 +3929 嫔 +3930 漓 +3931 捻 +3932 逵 +3933 呢 +3934 砾 +3935 谬 +3936 琥 +3937 撮 +3938 準 +3939 嗜 +3940 它 +3941 議 +3942 於 +3943 執 +3944 顔 +3945 匣 +3946 焘 +3947 狭 +3948 涡 +3949 衔 +3950 靚 +3951 祠 +3952 雉 +3953 疼 +3954 镖 +3955 嚣 +3956 骸 +3957 ん +3958 証 +3959 恢 +3960 凑 +3961 丐 +3962 貞 +3963 蛹 +3964 呵 +3965 昼 +3966 蛉 +3967 翳 +3968 匀 +3969 侦 +3970 設 +3971 轧 +3972 損 +3973 盧 +3974 叩 +3975 這 +3976 跡 +3977 谕 +3978 迴 +3979 鳗 +3980 炕 +3981 珮 +3982 カ +3983 咀 +3984 搅 +3985 矫 +3986 矩 +3987 箍 +3988 渤 +3989 狩 +3990 苛 +3991 劼 +3992 濡 +3993 慌 +3994 勁 +3995 腫 +3996 般 +3997 酌 +3998 徕 +3999 廓 +4000 燎 +4001 颇 +4002 樽 +4003 槎 +4004 鑽 +4005 摔 +4006 诵 +4007 槿 +4008 琐 +4009 塌 +4010 锻 +4011 願 +4012 顧 +4013 萎 
+4014 は +4015 膛 +4016 祛 +4017 檔 +4018 蠡 +4019 觸 +4020 虬 +4021 談 +4022 喝 +4023 娱 +4024 噪 +4025 胀 +4026 褐 +4027 疫 +4028 札 +4029 昉 +4030 呱 +4031 禪 +4032 債 +4033 屬 +4034 佶 +4035 垠 +4036 貿 +4037 葭 +4038 齡 +4039 萦 +4040 蕤 +4041 燚 +4042 # +4043 劑 +4044 彥 +4045 棗 +4046 紐 +4047 浇 +4048 汲 +4049 臼 +4050 咎 +4051 絨 +4052 裹 +4053 茬 +4054 厕 +4055 傾 +4056 釋 +4057 秽 +4058 颅 +4059 蹦 +4060 么 +4061 嘟 +4062 锣 +4063 腻 +4064 寐 +4065 妲 +4066 湃 +4067 醜 +4068 另 +4069 泮 +4070 幂 +4071 獄 +4072 滅 +4073 玳 +4074 氰 +4075 鞘 +4076 峭 +4077 鹂 +4078 嗅 +4079 ら +4080 瑙 +4081 咳 +4082 蝗 +4083 瓯 +4084 猷 +4085 樾 +4086 赎 +4087 她 +4088 朕 +4089 淀 +4090 頁 +4091 飙 +4092 羁 +4093 镒 +4094 喂 +4095 袜 +4096 钺 +4097 扉 +4098 曆 +4099 櫻 +4100 曳 +4101 辕 +4102 帧 +4103 誤 +4104 哄 +4105 漳 +4106 亓 +4107 隅 +4108 訴 +4109 螨 +4110 艮 +4111 識 +4112 適 +4113 诏 +4114 饵 +4115 俨 +4116 郦 +4117 坳 +4118 鵝 +4119 礦 +4120 褒 +4121 犇 +4122 隘 +4123 咯 +4124 赴 +4125 競 +4126 個 +4127 劃 +4128 殼 +4129 睛 +4130 究 +4131 兢 +4132 緩 +4133 纠 +4134 惧 +4135 践 +4136 躬 +4137 惯 +4138 稠 +4139 惩 +4140 秤 +4141 嚴 +4142 茁 +4143 濮 +4144 亩 +4145 憬 +4146 撩 +4147 赔 +4148 渎 +4149 镀 +4150 汴 +4151 婢 +4152 菩 +4153 鍾 +4154 锰 +4155 挠 +4156 泱 +4157 毗 +4158 丅 +4159 琮 +4160 痧 +4161 痣 +4162 堕 +4163 鄙 +4164 搓 +4165 な +4166 蕭 +4167 赦 +4168 耆 +4169 稍 +4170 險 +4171 胭 +4172 沢 +4173 婬 +4174 畈 +4175 炖 +4176 毋 +4177 蜗 +4178 煲 +4179 铧 +4180 並 +4181 廚 +4182 佈 +4183 衙 +4184 荧 +4185 钥 +4186 黯 +4187 雳 +4188 吨 +4189 铬 +4190 請 +4191 鎏 +4192 釉 +4193 栽 +4194 騎 +4195 磚 +4196 廢 +4197 郢 +4198 偃 +4199 賞 +4200 奪 +4201 鬓 +4202 鳍 +4203 乏 +4204 蹲 +4205 盯 +4206 ー +4207 く +4208 し +4209 ア +4210 寵 +4211 悶 +4212 構 +4213 煉 +4214 粿 +4215 絶 +4216 诫 +4217 狙 +4218 钾 +4219 敵 +4220 偿 +4221 锄 +4222 姫 +4223 幡 +4224 戳 +4225 澹 +4226 坯 +4227 濯 +4228 骈 +4229 嬉 +4230 砌 +4231 囡 +4232 峦 +4233 漕 +4234 闾 +4235 镍 +4236 罰 +4237 肋 +4238 遐 +4239 荤 +4240 窍 +4241 绾 +4242 怯 +4243 携 +4244 鹄 +4245 戌 +4246 凳 +4247 蕩 +4248 揉 +4249 柘 +4250 冗 +4251 須 +4252 蔽 +4253 焜 +4254 驯 +4255 騙 +4256 騷 +4257 恳 +4258 凈 +4259 籁 +4260 註 +4261 傣 +4262 凍 +4263 霭 +4264 爸 +4265 謀 +4266 酯 +4267 渍 +4268 駿 +4269 绎 +4270 粲 +4271 衷 +4272 葫 +4273 鬆 +4274 況 +4275 掃 +4276 撸 +4277 呗 +4278 碩 +4279 诘 +4280 贊 +4281 坨 +4282 芩 +4283 垌 +4284 茱 +4285 塚 +4286 洱 +4287 齒 +4288 嫚 +4289 篆 +4290 瑯 +4291 贩 +4292 き +4293 啓 +4294 墊 +4295 潛 +4296 瀾 +4297 饥 +4298 笺 +4299 轿 +4300 糞 +4301 範 +4302 嘲 +4303 啶 +4304 繼 +4305 捆 +4306 拢 +4307 脓 +4308 渥 +4309 谅 +4310 迩 +4311 烹 +4312 瀑 +4313 姥 +4314 缦 +4315 蛆 +4316 毙 +4317 腥 +4318 痨 +4319 喪 +4320 に +4321 壤 +4322 饲 +4323 胄 +4324 淚 +4325 濱 +4326 矶 +4327 汰 +4328 ノ +4329 飲 +4330 媳 +4331 磬 +4332 砺 +4333 啼 +4334 瘟 +4335 扈 +4336 祀 +4337 頸 +4338 蘆 +4339 钨 +4340 馳 +4341 佣 +4342 鬧 +4343 舂 +4344 翩 +4345 蝠 +4346 挣 +4347 誘 +4348 蛰 +4349 佚 +4350 辙 +4351 邁 +4352 塗 +4353 賬 +4354 塬 +4355 埭 +4356 诰 +4357 圻 +4358 拗 +4359 耽 +4360 祿 +4361 璠 +4362 瓊 +4363 珣 +4364 た +4365 儲 +4366 棄 +4367 辑 +4368 灸 +4369 狡 +4370 綿 +4371 歼 +4372 糧 +4373 癸 +4374 撫 +4375 帷 +4376 镰 +4377 俩 +4378 垄 +4379 募 +4380 嗔 +4381 滥 +4382 鏈 +4383 僻 +4384 馍 +4385 娼 +4386 撇 +4387 崽 +4388 蚂 +4389 酪 +4390 怿 +4391 愫 +4392 廈 +4393 琏 +4394 械 +4395 些 +4396 恤 +4397 疝 +4398 榄 +4399 琚 +4400 り +4401 リ +4402 妒 +4403 杲 +4404 楣 +4405 槌 +4406 槟 +4407 孺 +4408 桧 +4409 桀 +4410 牲 +4411 戍 +4412 幫 +4413 旎 +4414 铣 +4415 躺 +4416 剃 +4417 锵 +4418 呜 +4419 嫌 +4420 剔 +4421 駕 +4422 谎 +4423 绚 +4424 眩 +4425 阉 +4426 駐 +4427 討 +4428 驅 +4429 腋 +4430 痹 +4431 冊 +4432 饿 +4433 磅 +4434 乍 +4435 毡 +4436 盔 +4437 簇 +4438 殖 +4439 説 +4440 篁 +4441 襲 +4442 攒 +4443 鮑 +4444 哆 +4445 遲 +4446 遷 +4447 禀 +4448 賴 +4449 邰 +4450 軌 +4451 奂 +4452 倌 +4453 荞 +4454 苡 +4455 苷 +4456 圳 +4457 莜 
+4458 荪 +4459 菀 +4460 軸 +4461 羹 +4462 爐 +4463 確 +4464 讓 +4465 癬 +4466 獲 +4467 籃 +4468 垟 +4469 奮 +4470 擺 +4471 暈 +4472 瀬 +4473 蓟 +4474 溅 +4475 疥 +4476 届 +4477 綱 +4478 烬 +4479 嵐 +4480 雇 +4481 蹭 +4482 俺 +4483 敞 +4484 砲 +4485 涣 +4486 阑 +4487 聶 +4488 蹇 +4489 糯 +4490 災 +4491 淬 +4492 骡 +4493 吗 +4494 疲 +4495 錶 +4496 狎 +4497 漩 +4498 泫 +4499 泯 +4500 擂 +4501 鹫 +4502 枳 +4503 剩 +4504 韫 +4505 攘 +4506 怂 +4507 镕 +4508 讼 +4509 牝 +4510 譯 +4511 膘 +4512 惶 +4513 铵 +4514 钿 +4515 頔 +4516 硐 +4517 涎 +4518 驮 +4519 裆 +4520 褶 +4521 捍 +4522 绑 +4523 痈 +4524 訓 +4525 膀 +4526 懸 +4527 鴿 +4528 兀 +4529 貪 +4530 壕 +4531 隼 +4532 澡 +4533 躁 +4534 秩 +4535 蚝 +4536 哼 +4537 淤 +4538 盂 +4539 叽 +4540 違 +4541 遙 +4542 欄 +4543 诃 +4544 郗 +4545 劭 +4546 偌 +4547 倬 +4548 阡 +4549 苕 +4550 谒 +4551 莒 +4552 埕 +4553 輸 +4554 葩 +4555 蕨 +4556 爛 +4557 爲 +4558 燦 +4559 拽 +4560 讚 +4561 悼 +4562 籠 +4563 サ +4564 佔 +4565 搶 +4566 曌 +4567 紡 +4568 拷 +4569 緹 +4570 嚼 +4571 藉 +4572 韭 +4573 饺 +4574 綫 +4575 哺 +4576 脖 +4577 吵 +4578 め +4579 ち +4580 痢 +4581 嗟 +4582 馈 +4583 庾 +4584 獾 +4585 獐 +4586 鈺 +4587 蹬 +4588 磕 +4589 愣 +4590 脹 +4591 僚 +4592 噜 +4593 匿 +4594 婊 +4595 啤 +4596 尻 +4597 驷 +4598 骧 +4599 繪 +4600 嗪 +4601 赓 +4602 滟 +4603 鋁 +4604 扮 +4605 纾 +4606 撬 +4607 馃 +4608 朽 +4609 瘘 +4610 嗓 +4611 瑕 +4612 啡 +4613 と +4614 麝 +4615 删 +4616 汕 +4617 胧 +4618 際 +4619 轼 +4620 掰 +4621 讽 +4622 頌 +4623 瘫 +4624 镝 +4625 颓 +4626 涕 +4627 舷 +4628 慾 +4629 憂 +4630 癖 +4631 酣 +4632 鸳 +4633 歹 +4634 翡 +4635 帜 +4636 箴 +4637 箬 +4638 骤 +4639 痔 +4640 姻 +4641 舆 +4642 赃 +4643 嘿 +4644 觞 +4645 遼 +4646 唔 +4647 唧 +4648 桿 +4649 孃 +4650 倭 +4651 偕 +4652 芪 +4653 躍 +4654 縱 +4655 癡 +4656 萘 +4657 堇 +4658 輔 +4659 攝 +4660 據 +4661 忿 +4662 蓼 +4663 辭 +4664 碍 +4665 慷 +4666 か +4667 あ +4668 弊 +4669 啞 +4670 彎 +4671 灘 +4672 煩 +4673 缉 +4674 徑 +4675 綺 +4676 荚 +4677 竭 +4678 簿 +4679 倡 +4680 趁 +4681 釜 +4682 绷 +4683 む +4684 鄧 +4685 モ +4686 垮 +4687 宕 +4688 澧 +4689 撲 +4690 鋆 +4691 洄 +4692 蘑 +4693 樸 +4694 惘 +4695 该 +4696 戮 +4697 榔 +4698 滦 +4699 ゆ +4700 滄 +4701 娑 +4702 闳 +4703 嫖 +4704 篷 +4705 捏 +4706 湟 +4707 恼 +4708 阖 +4709 螟 +4710 膺 +4711 沦 +4712 泌 +4713 帼 +4714 玑 +4715 啃 +4716 鹦 +4717 鹞 +4718 婿 +4719 搁 +4720 惰 +4721 瑗 +4722 筷 +4723 ナ +4724 る +4725 嘶 +4726 枧 +4727 杵 +4728 肴 +4729 芍 +4730 暧 +4731 朦 +4732 绊 +4733 枉 +4734 挫 +4735 奠 +4736 桅 +4737 潍 +4738 辖 +4739 暇 +4740 戾 +4741 龛 +4742 锷 +4743 嘻 +4744 q +4745 矜 +4746 焙 +4747 瑚 +4748 夯 +4749 ン +4750 蟠 +4751 覽 +4752 凋 +4753 酰 +4754 斬 +4755 貫 +4756 胰 +4757 陨 +4758 炙 +4759 謎 +4760 誌 +4761 鯨 +4762 鲈 +4763 匾 +4764 鳅 +4765 拯 +4766 僑 +4767 哒 +4768 恥 +4769 璘 +4770 谧 +4771 讷 +4772 佼 +4773 佗 +4774 畸 +4775 篡 +4776 窜 +4777 涇 +4778 芘 +4779 弁 +4780 壑 +4781 谯 +4782 茭 +4783 冽 +4784 賈 +4785 菽 +4786 燙 +4787 础 +4788 揣 +4789 鬃 +4790 赚 +4791 怠 +4792 筏 +4793 犊 +4794 畢 +4795 タ +4796 弢 +4797 彌 +4798 沒 +4799 瀨 +4800 綏 +4801 窘 +4802 悸 +4803 綾 +4804 枷 +4805 捡 +4806 颊 +4807 疽 +4808 沮 +4809 辊 +4810 箔 +4811 コ +4812 幔 +4813 チ +4814 粱 +4815 鄰 +4816 愧 +4817 扳 +4818 も +4819 鈣 +4820 靛 +4821 鍍 +4822 柵 +4823 艦 +4824 讳 +4825 涞 +4826 浏 +4827 恽 +4828 棵 +4829 峤 +4830 啪 +4831 虏 +4832 嗒 +4833 徵 +4834 硼 +4835 湫 +4836 怅 +4837 嫒 +4838 畦 +4839 鍵 +4840 蔑 +4841 翹 +4842 逯 +4843 渲 +4844 繳 +4845 鈞 +4846 眀 +4847 绶 +4848 钎 +4849 缙 +4850 琊 +4851 呛 +4852 禿 +4853 廳 +4854 懶 +4855 楔 +4856 疳 +4857 蠻 +4858 ラ +4859 咨 +4860 璎 +4861 擅 +4862 鑑 +4863 炅 +4864 腌 +4865 祟 +4866 薑 +4867 轸 +4868 暾 +4869 腮 +4870 玦 +4871 獻 +4872 ろ +4873 ロ +4874 傢 +4875 憩 +4876 吠 +4877 睢 +4878 偽 +4879 憋 +4880 蠟 +4881 钼 +4882 捂 +4883 倘 +4884 韋 +4885 掏 +4886 瓮 +4887 镯 +4888 睇 +4889 烃 +4890 慘 +4891 癞 +4892 癫 +4893 殉 +4894 谚 +4895 骇 +4896 颌 +4897 颍 +4898 饕 +4899 耙 +4900 ひ +4901 酩 
+4902 榨 +4903 辐 +4904 刈 +4905 責 +4906 逾 +4907 绽 +4908 蒯 +4909 蚤 +4910 鲫 +4911 麸 +4912 迂 +4913 鲷 +4914 臆 +4915 贮 +4916 佞 +4917 瑀 +4918 痳 +4919 係 +4920 吡 +4921 咩 +4922 呷 +4923 啉 +4924 擴 +4925 擔 +4926 衮 +4927 僖 +4928 嬴 +4929 趕 +4930 踫 +4931 鹵 +4932 邺 +4933 癢 +4934 輩 +4935 莳 +4936 萼 +4937 蘅 +4938 鳝 +4939 鳐 +4940 撰 +4941 瑩 +4942 瘋 +4943 慨 +4944 績 +4945 珅 +4946 哗 +4947 え +4948 シ +4949 墜 +4950 幾 +4951 憶 +4952 擾 +4953 煥 +4954 紛 +4955 桨 +4956 絡 +4957 仅 +4958 ス +4959 褂 +4960 阐 +4961 洺 +4962 橱 +4963 洩 +4964 贬 +4965 釘 +4966 呕 +4967 疟 +4968 や +4969 洮 +4970 っ +4971 氓 +4972 殴 +4973 迤 +4974 ユ +4975 て +4976 偲 +4977 掐 +4978 繩 +4979 臟 +4980 膨 +4981 漉 +4982 暹 +4983 鉻 +4984 妩 +4985 鉛 +4986 珥 +4987 邕 +4988 胁 +4989 楸 +4990 瓒 +4991 叭 +4992 戛 +4993 驶 +4994 炔 +4995 階 +4996 鑒 +4997 缮 +4998 腓 +4999 耸 +5000 腚 +5001 閘 +5002 桉 +5003 恃 +5004 楹 +5005 橹 +5006 蓑 +5007 栀 +5008 侶 +5009 籌 +5010 ね +5011 斓 +5012 畲 +5013 顫 +5014 铳 +5015 砥 +5016 蜕 +5017 锶 +5018 祜 +5019 铛 +5020 唾 +5021 嵇 +5022 袂 +5023 佯 +5024 殃 +5025 婳 +5026 扼 +5027 昨 +5028 赭 +5029 詠 +5030 侄 +5031 踝 +5032 傍 +5033 禺 +5034 貧 +5035 缶 +5036 霾 +5037 邯 +5038 蜚 +5039 翥 +5040 掷 +5041 罔 +5042 蝽 +5043 襪 +5044 怎 +5045 諸 +5046 斛 +5047 誼 +5048 鲛 +5049 媞 +5050 漲 +5051 吖 +5052 叱 +5053 譚 +5054 譽 +5055 漸 +5056 鸮 +5057 郅 +5058 芗 +5059 贏 +5060 貸 +5061 亵 +5062 俎 +5063 剎 +5064 俘 +5065 篙 +5066 気 +5067 荭 +5068 莪 +5069 萸 +5070 蒽 +5071 マ +5072 夼 +5073 藓 +5074 牽 +5075 鱗 +5076 繆 +5077 钒 +5078 珐 +5079 穩 +5080 脯 +5081 珪 +5082 さ +5083 じ +5084 け +5085 エ +5086 ク +5087 彊 +5088 挌 +5089 暉 +5090 棟 +5091 踞 +5092 艰 +5093 缄 +5094 酵 +5095 较 +5096 糾 +5097 糙 +5098 お +5099 メ +5100 釀 +5101 喔 +5102 啾 +5103 篓 +5104 掳 +5105 拧 +5106 哦 +5107 氫 +5108 つ +5109 摹 +5110 悖 +5111 嗝 +5112 沔 +5113 與 +5114 眯 +5115 衢 +5116 娉 +5117 剖 +5118 嫦 +5119 嬷 +5120 湮 +5121 繫 +5122 舖 +5123 鈔 +5124 醚 +5125 庖 +5126 馒 +5127 潋 +5128 逻 +5129 聋 +5130 纖 +5131 潺 +5132 遛 +5133 滲 +5134 绉 +5135 绀 +5136 磺 +5137 菓 +5138 顷 +5139 玠 +5140 淒 +5141 挟 +5142 痫 +5143 鹬 +5144 鹳 +5145 閱 +5146 偵 +5147 胯 +5148 璀 +5149 娶 +5150 甑 +5151 辘 +5152 魇 +5153 ル +5154 嶋 +5155 榻 +5156 杈 +5157 昵 +5158 黍 +5159 塍 +5160 丟 +5161 恣 +5162 れ +5163 袒 +5164 挞 +5165 锂 +5166 旖 +5167 铄 +5168 掀 +5169 砦 +5170 舔 +5171 燧 +5172 稔 +5173 漬 +5174 蜒 +5175 裾 +5176 瀘 +5177 暫 +5178 嚎 +5179 蚧 +5180 匆 +5181 掖 +5182 铱 +5183 詢 +5184 擋 +5185 燉 +5186 壺 +5187 販 +5188 爻 +5189 蜥 +5190 翦 +5191 仄 +5192 螂 +5193 砧 +5194 厮 +5195 粑 +5196 匝 +5197 吁 +5198 豎 +5199 蝴 +5200 蛀 +5201 剌 +5202 歳 +5203 遜 +5204 咚 +5205 渦 +5206 讴 +5207 谤 +5208 抠 +5209 僮 +5210 俑 +5211 廂 +5212 撥 +5213 芨 +5214 诩 +5215 芫 +5216 巽 +5217 苣 +5218 茴 +5219 荏 +5220 苴 +5221 賤 +5222 鹹 +5223 祕 +5224 逮 +5225 薏 +5226 矗 +5227 ǐ +5228 禍 +5229 瘡 +5230 緻 +5231 涪 +5232 唬 +5233 イ +5234 钡 +5235 雹 +5236 們 +5237 兇 +5238 兌 +5239 勛 +5240 剝 +5241 揮 +5242 擼 +5243 敘 +5244 殤 +5245 灑 +5246 烜 +5247 揪 +5248 綜 +5249 拣 +5250 絞 +5251 柬 +5252 秸 +5253 緒 +5254 埂 +5255 逛 +5256 逞 +5257 滁 +5258 麽 +5259 揍 +5260 岘 +5261 袄 +5262 坷 +5263 繞 +5264 瞒 +5265 聰 +5266 髋 +5267 屌 +5268 颁 +5269 啄 +5270 傘 +5271 疵 +5272 嬅 +5273 崂 +5274 徙 +5275 呐 +5276 噻 +5277 彗 +5278 闱 +5279 寥 +5280 嚓 +5281 潢 +5282 瞄 +5283 婺 +5284 骜 +5285 骠 +5286 纨 +5287 鈎 +5288 嵬 +5289 阆 +5290 庠 +5291 悯 +5292 剁 +5293 瞧 +5294 缜 +5295 酋 +5296 癲 +5297 叼 +5298 バ +5299 疸 +5300 楝 +5301 闊 +5302 搔 +5303 瑷 +5304 ト +5305 戗 +5306 陝 +5307 娛 +5308 柺 +5309 蔥 +5310 爰 +5311 獒 +5312 蠕 +5313 杳 +5314 脲 +5315 閑 +5316 孰 +5317 薊 +5318 橄 +5319 褥 +5320 胪 +5321 腱 +5322 仍 +5323 膈 +5324 赊 +5325 竑 +5326 刪 +5327 孖 +5328 擁 +5329 坍 +5330 壩 +5331 捨 +5332 锉 +5333 跋 +5334 ハ +5335 熄 +5336 沓 +5337 湍 +5338 惕 +5339 焖 +5340 钏 +5341 钴 +5342 馅 +5343 発 +5344 凪 +5345 曬 
+5346 癜 +5347 耦 +5348 窈 +5349 奄 +5350 簾 +5351 蠓 +5352 螭 +5353 臾 +5354 吱 +5355 鯊 +5356 氛 +5357 咋 +5358 徹 +5359 噩 +5360 乜 +5361 孬 +5362 揖 +5363 鼐 +5364 醪 +5365 撼 +5366 蚰 +5367 蛎 +5368 鲟 +5369 帚 +5370 蔗 +5371 厍 +5372 鬱 +5373 诣 +5374 羯 +5375 蜓 +5376 盅 +5377 誕 +5378 蜻 +5379 剡 +5380 簌 +5381 筵 +5382 酊 +5383 怔 +5384 贿 +5385 み +5386 忒 +5387 叻 +5388 吒 +5389 撷 +5390 遞 +5391 廁 +5392 俚 +5393 贇 +5394 勖 +5395 夔 +5396 苋 +5397 诤 +5398 塾 +5399 賠 +5400 谲 +5401 淵 +5402 鼾 +5403 莼 +5404 輯 +5405 菰 +5406 滯 +5407 薮 +5408 揆 +5409 辯 +5410 髯 +5411 瑠 +5412 皑 +5413 盎 +5414 哎 +5415 祷 +5416 ウ +5417 償 +5418 厭 +5419 嘆 +5420 嚇 +5421 嬿 +5422 嶽 +5423 憑 +5424 憲 +5425 攤 +5426 桜 +5427 檯 +5428 渾 +5429 湉 +5430 澀 +5431 綉 +5432 綸 +5433 緯 +5434 疚 +5435 倔 +5436 笹 +5437 硃 +5438 瀉 +5439 妨 +5440 ム +5441 栢 +5442 猥 +5443 膩 +5444 悌 +5445 鉆 +5446 悚 +5447 屆 +5448 铆 +5449 崮 +5450 嗦 +5451 箩 +5452 屡 +5453 饷 +5454 涿 +5455 娲 +5456 娓 +5457 娈 +5458 姊 +5459 撈 +5460 拈 +5461 鎂 +5462 讫 +5463 録 +5464 嵊 +5465 猶 +5466 吝 +5467 霹 +5468 溱 +5469 羨 +5470 琵 +5471 恂 +5472 琤 +5473 疊 +5474 凜 +5475 堑 +5476 珲 +5477 甦 +5478 梆 +5479 筐 +5480 穰 +5481 瓠 +5482 饒 +5483 鸪 +5484 疱 +5485 鹉 +5486 猩 +5487 痂 +5488 嘘 +5489 瘀 +5490 閨 +5491 閩 +5492 惦 +5493 侩 +5494 敕 +5495 桠 +5496 赉 +5497 伺 +5498 殓 +5499 犟 +5500 唆 +5501 雛 +5502 淄 +5503 勍 +5504 レ +5505 飕 +5506 獭 +5507 蘿 +5508 讹 +5509 ワ +5510 飨 +5511 頑 +5512 趟 +5513 侮 +5514 蝕 +5515 惋 +5516 碛 +5517 熵 +5518 钤 +5519 硒 +5520 飏 +5521 蟬 +5522 睑 +5523 稞 +5524 盞 +5525 擬 +5526 勸 +5527 擇 +5528 駝 +5529 窠 +5530 耒 +5531 裱 +5532 ず +5533 憾 +5534 曈 +5535 蜃 +5536 ヒ +5537 簸 +5538 憎 +5539 鰲 +5540 敝 +5541 謂 +5542 柞 +5543 醴 +5544 蠹 +5545 蚶 +5546 翕 +5547 雎 +5548 雒 +5549 跖 +5550 啬 +5551 誦 +5552 铀 +5553 蜷 +5554 蹊 +5555 蹼 +5556 誇 +5557 蜢 +5558 跷 +5559 謙 +5560 咱 +5561 伫 +5562 ミ +5563 呓 +5564 诒 +5565 倏 +5566 鄱 +5567 倜 +5568 芾 +5569 茆 +5570 阪 +5571 谄 +5572 谙 +5573 芡 +5574 隗 +5575 芎 +5576 茯 +5577 荇 +5578 濾 +5579 龐 +5580 菘 +5581 菟 +5582 齋 +5583 蕲 +5584 掬 +5585 扪 +5586 轟 +5587 燭 +5588 捶 +5589 幢 +5590 ǎ +5591 鳕 +5592 皺 +5593 縛 +5594 扛 +5595 穂 +5596 ゴ +5597 セ +5598 ギ +5599 噹 +5600 墳 +5601 奬 +5602 姍 +5603 嫄 +5604 慮 +5605 様 +5606 灝 +5607 槛 +5608 伎 +5609 綁 +5610 澗 +5611 痉 +5612 剿 +5613 撅 +5614 緋 +5615 睫 +5616 筍 +5617 舶 +5618 菠 +5619 矇 +5620 怖 +5621 猖 +5622 ǔ +5623 郴 +5624 椽 +5625 オ +5626 暘 +5627 獣 +5628 羔 +5629 庒 +5630 掂 +5631 鉀 +5632 灞 +5633 鍛 +5634 颗 +5635 麂 +5636 浯 +5637 鋅 +5638 鋸 +5639 寞 +5640 併 +5641 銜 +5642 峒 +5643 喙 +5644 嗯 +5645 忏 +5646 滏 +5647 繡 +5648 沌 +5649 臘 +5650 沭 +5651 阈 +5652 姒 +5653 苺 +5654 滂 +5655 淙 +5656 汩 +5657 媾 +5658 艶 +5659 嫱 +5660 莆 +5661 曝 +5662 錐 +5663 撂 +5664 逄 +5665 逑 +5666 馏 +5667 囿 +5668 嘀 +5669 弭 +5670 啮 +5671 皿 +5672 泺 +5673 纏 +5674 噗 +5675 歉 +5676 玎 +5677 悄 +5678 珙 +5679 缬 +5680 缭 +5681 擠 +5682 愷 +5683 恍 +5684 鸩 +5685 餌 +5686 鹑 +5687 蠶 +5688 疖 +5689 瘕 +5690 榈 +5691 椤 +5692 闇 +5693 辫 +5694 瑭 +5695 氪 +5696 榫 +5697 昴 +5698 昝 +5699 拭 +5700 殒 +5701 腈 +5702 枞 +5703 枋 +5704 隧 +5705 腩 +5706 妊 +5707 蓆 +5708 楮 +5709 枸 +5710 辇 +5711 臊 +5712 窮 +5713 琯 +5714 禛 +5715 恙 +5716 ネ +5717 捅 +5718 飓 +5719 眺 +5720 虧 +5721 勵 +5722 顛 +5723 螞 +5724 飽 +5725 幌 +5726 蟻 +5727 搪 +5728 砣 +5729 镫 +5730 晤 +5731 蘊 +5732 萄 +5733 蘋 +5734 碣 +5735 頤 +5736 诬 +5737 镗 +5738 梟 +5739 瘿 +5740 蚜 +5741 衲 +5742 聃 +5743 馮 +5744 駒 +5745 颀 +5746 蟆 +5747 螽 +5748 螈 +5749 哟 +5750 堯 +5751 滘 +5752 颞 +5753 颚 +5754 颛 +5755 衄 +5756 徳 +5757 炘 +5758 該 +5759 詳 +5760 囍 +5761 孵 +5762 鯉 +5763 諜 +5764 亟 +5765 蛄 +5766 蚺 +5767 袅 +5768 衾 +5769 踵 +5770 斟 +5771 孛 +5772 箧 +5773 羟 +5774 笏 +5775 蛏 +5776 跛 +5777 鴉 +5778 蛭 +5779 鱿 +5780 蹴 +5781 仵 +5782 暨 +5783 蜈 +5784 酐 +5785 鲑 +5786 髒 +5787 篩 +5788 觚 +5789 鯛 
+5790 瀝 +5791 摺 +5792 哝 +5793 呦 +5794 喏 +5795 哌 +5796 咻 +5797 瀟 +5798 髻 +5799 俣 +5800 賺 +5801 贈 +5802 滬 +5803 郄 +5804 蹤 +5805 墉 +5806 俟 +5807 傩 +5808 偎 +5809 凼 +5810 荜 +5811 陟 +5812 贛 +5813 隍 +5814 邛 +5815 垡 +5816 荠 +5817 摧 +5818 萁 +5819 莨 +5820 蒌 +5821 嶼 +5822 稗 +5823 掇 +5824 蕈 +5825 鳢 +5826 鞣 +5827 鞅 +5828 瑋 +5829 竊 +5830 籤 +5831 蛔 +5832 猾 +5833 粄 +5834 が +5835 ジ +5836 * +5837 伕 +5838 厠 +5839 嘯 +5840 姮 +5841 廬 +5842 搾 +5843 潑 +5844 讥 +5845 絳 +5846 喚 +5847 铰 +5848 硷 +5849 絢 +5850 す +5851 搀 +5852 掺 +5853 硯 +5854 毆 +5855 濁 +5856 峄 +5857 幛 +5858 哩 +5859 喋 +5860 啵 +5861 婪 +5862 烩 +5863 猝 +5864 迸 +5865 ヤ +5866 洹 +5867 鋭 +5868 撃 +5869 拇 +5870 膿 +5871 臍 +5872 鉤 +5873 悻 +5874 嗑 +5875 嗖 +5876 喑 +5877 饬 +5878 琶 +5879 懵 +5880 噫 +5881 忡 +5882 怵 +5883 孀 +5884 姘 +5885 潦 +5886 怆 +5887 砰 +5888 蔫 +5889 藐 +5890 乒 +5891 嫫 +5892 骓 +5893 孢 +5894 纡 +5895 孪 +5896 沆 +5897 泔 +5898 錘 +5899 怏 +5900 庹 +5901 抡 +5902 銹 +5903 巅 +5904 恸 +5905 遒 +5906 遨 +5907 狞 +5908 淏 +5909 癱 +5910 绡 +5911 纭 +5912 扦 +5913 玢 +5914 缢 +5915 缥 +5916 珧 +5917 躯 +5918 畿 +5919 鸫 +5920 鸱 +5921 鸨 +5922 樞 +5923 懈 +5924 衅 +5925 鹗 +5926 惺 +5927 餓 +5928 蠱 +5929 痱 +5930 匈 +5931 榉 +5932 楦 +5933 ど +5934 よ +5935 龋 +5936 戢 +5937 笆 +5938 雫 +5939 隴 +5940 擘 +5941 杓 +5942 牍 +5943 嚷 +5944 樯 +5945 砷 +5946 轭 +5947 栉 +5948 觑 +5949 闖 +5950 柩 +5951 腭 +5952 捎 +5953 樨 +5954 枰 +5955 鑄 +5956 閥 +5957 滷 +5958 焗 +5959 嘔 +5960 蛻 +5961 胳 +5962 勳 +5963 歙 +5964 蘚 +5965 瞰 +5966 螢 +5967 わ +5968 蠔 +5969 斫 +5970 砭 +5971 旃 +5972 钇 +5973 褪 +5974 烊 +5975 淌 +5976 铍 +5977 铐 +5978 鸵 +5979 熨 +5980 铤 +5981 铢 +5982 镔 +5983 顆 +5984 癒 +5985 僱 +5986 媗 +5987 琇 +5988 嘗 +5989 竦 +5990 癀 +5991 秆 +5992 衿 +5993 竜 +5994 螃 +5995 蟮 +5996 罂 +5997 螳 +5998 傭 +5999 夠 +6000 蝼 +6001 驕 +6002 噎 +6003 ぴ +6004 侖 +6005 訾 +6006 嬛 +6007 謠 +6008 蜘 +6009 酢 +6010 趸 +6011 醍 +6012 フ +6013 汎 +6014 匕 +6015 氐 +6016 蚓 +6017 蚬 +6018 鲢 +6019 諧 +6020 蚴 +6021 訣 +6022 綦 +6023 謊 +6024 鳩 +6025 驢 +6026 蛳 +6027 窒 +6028 瘴 +6029 笳 +6030 鲵 +6031 嘱 +6032 貘 +6033 睾 +6034 佤 +6035 詐 +6036 篾 +6037 蛸 +6038 貔 +6039 簋 +6040 窺 +6041 卻 +6042 唏 +6043 咧 +6044 慣 +6045 歎 +6046 烔 +6047 鷺 +6048 べ +6049 贅 +6050 刍 +6051 蹟 +6052 黒 +6053 艽 +6054 堀 +6055 鷄 +6056 垅 +6057 勰 +6058 坭 +6059 谔 +6060 凫 +6061 賜 +6062 谠 +6063 俸 +6064 垓 +6065 黴 +6066 邴 +6067 圪 +6068 賦 +6069 荥 +6070 剋 +6071 僕 +6072 陛 +6073 較 +6074 莅 +6075 荨 +6076 茛 +6077 菖 +6078 轄 +6079 薹 +6080 捺 +6081 骰 +6082 掸 +6083 禎 +6084 々 +6085 腑 +6086 竅 +6087 玙 +6088 玕 +6089 ご +6090 う +6091 せ +6092 ぎ +6093 グ +6094 倖 +6095 厲 +6096 唸 +6097 姪 +6098 姉 +6099 寢 +6100 崟 +6101 悽 +6102 柊 +6103 棧 +6104 殯 +6105 湊 +6106 湜 +6107 潰 +6108 骷 +6109 紳 +6110 ソ +6111 粳 +6112 紹 +6113 綢 +6114 綴 +6115 叢 +6116 洸 +6117 膊 +6118 惭 +6119 豺 +6120 姵 +6121 躇 +6122 癮 +6123 溼 +6124 岬 +6125 釆 +6126 鬣 +6127 啜 +6128 喱 +6129 喽 +6130 だ +6131 ダ +6132 麾 +6133 猗 +6134 搂 +6135 艙 +6136 悒 +6137 愕 +6138 懊 +6139 睹 +6140 脣 +6141 慵 +6142 悴 +6143 懦 +6144 涑 +6145 晾 +6146 噌 +6147 噤 +6148 忖 +6149 饴 +6150 馀 +6151 饽 +6152 遽 +6153 邃 +6154 迥 +6155 淅 +6156 闩 +6157 肅 +6158 嘈 +6159 鎬 +6160 苾 +6161 嗳 +6162 迳 +6163 汨 +6164 闼 +6165 媪 +6166 粕 +6167 骝 +6168 嬲 +6169 孳 +6170 辔 +6171 挛 +6172 狹 +6173 逖 +6174 阊 +6175 嶂 +6176 帏 +6177 釵 +6178 罵 +6179 鄒 +6180 嘹 +6181 恻 +6182 阗 +6183 醃 +6184 沚 +6185 诅 +6186 佺 +6187 曠 +6188 绗 +6189 绂 +6190 谴 +6191 菈 +6192 缌 +6193 缗 +6194 缑 +6195 缟 +6196 鏢 +6197 荳 +6198 嬸 +6199 衹 +6200 衆 +6201 衎 +6202 鸷 +6203 痿 +6204 戦 +6205 椋 +6206 瞪 +6207 ド +6208 蒨 +6209 煽 +6210 苫 +6211 啥 +6212 鑲 +6213 吶 +6214 瓤 +6215 榷 +6216 戡 +6217 闌 +6218 隸 +6219 氙 +6220 ニ +6221 吮 +6222 藴 +6223 榧 +6224 虢 +6225 椴 +6226 瘸 +6227 慑 +6228 栲 +6229 肓 +6230 隻 +6231 蔆 +6232 殡 +6233 槲 
+6234 晌 +6235 轱 +6236 桁 +6237 杼 +6238 蔵 +6239 觐 +6240 扔 +6241 桷 +6242 牯 +6243 牒 +6244 胗 +6245 艘 +6246 蔬 +6247 鳃 +6248 棂 +6249 闢 +6250 棹 +6251 贽 +6252 膻 +6253 瀏 +6254 僅 +6255 昳 +6256 漣 +6257 婭 +6258 愆 +6259 恚 +6260 虓 +6261 黝 +6262 铟 +6263 蝸 +6264 黠 +6265 秣 +6266 飼 +6267 餃 +6268 罹 +6269 磴 +6270 砻 +6271 锑 +6272 頰 +6273 锢 +6274 礴 +6275 頒 +6276 煨 +6277 绦 +6278 焓 +6279 頜 +6280 砗 +6281 碓 +6282 眦 +6283 碇 +6284 迢 +6285 镉 +6286 秭 +6287 镞 +6288 誊 +6289 钯 +6290 睨 +6291 欽 +6292 鱧 +6293 礙 +6294 玪 +6295 瘪 +6296 餮 +6297 衽 +6298 唁 +6299 衩 +6300 袢 +6301 耜 +6302 鸯 +6303 疡 +6304 馭 +6305 峯 +6306 ズ +6307 氦 +6308 び +6309 踊 +6310 虻 +6311 颦 +6312 颏 +6313 颔 +6314 滓 +6315 遏 +6316 濒 +6317 ピ +6318 鲎 +6319 龈 +6320 霎 +6321 醌 +6322 誡 +6323 伧 +6324 馗 +6325 廿 +6326 蚣 +6327 蹙 +6328 虺 +6329 笈 +6330 蜞 +6331 裟 +6332 剽 +6333 蚱 +6334 築 +6335 褻 +6336 蛐 +6337 鲳 +6338 鲂 +6339 菏 +6340 糸 +6341 羧 +6342 仉 +6343 笪 +6344 繇 +6345 靥 +6346 赳 +6347 鲅 +6348 粼 +6349 糁 +6350 粽 +6351 鲶 +6352 稣 +6353 伢 +6354 踹 +6355 鰐 +6356 蝙 +6357 螯 +6358 糍 +6359 佧 +6360 鰻 +6361 淩 +6362 濺 +6363 弒 +6364 楽 +6365 嬤 +6366 呔 +6367 卟 +6368 擢 +6369 哏 +6370 哧 +6371 呤 +6372 咄 +6373 咛 +6374 璽 +6375 盪 +6376 囤 +6377 讪 +6378 诳 +6379 诜 +6380 岿 +6381 鵡 +6382 俦 +6383 鹼 +6384 麩 +6385 踐 +6386 郇 +6387 埙 +6388 郯 +6389 ボ +6390 茏 +6391 艿 +6392 俳 +6393 阱 +6394 侏 +6395 俾 +6396 茚 +6397 茕 +6398 偈 +6399 苌 +6400 荑 +6401 貶 +6402 脔 +6403 壅 +6404 鷓 +6405 谶 +6406 鷗 +6407 邳 +6408 羸 +6409 垸 +6410 苜 +6411 鸚 +6412 佥 +6413 荦 +6414 苻 +6415 搗 +6416 鱔 +6417 穀 +6418 龑 +6419 荽 +6420 輿 +6421 葶 +6422 蓍 +6423 蓦 +6424 菅 +6425 蓥 +6426 憐 +6427 婠 +6428 蘖 +6429 轎 +6430 抻 +6431 掾 +6432 捋 +6433 辻 +6434 鱷 +6435 鱘 +6436 谆 +6437 瑢 +6438 蹈 +6439 祤 +6440 瘉 +6441 縷 +6442 繃 +6443 窅 +6444 竇 +6445 玘 +6446 玗 +6447 粧 +6448 秾 +6449 ┳ +6450 畝 +6451 ざ +6452 ザ +6453 ケ +6454 摈 +6455 伝 +6456 嚒 +6457 墮 +6458 妺 +6459 幀 +6460 廯 +6461 彙 +6462 摯 +6463 枊 +6464 櫥 +6465 檸 +6466 汙 +6467 潯 +6468 煒 +6469 煖 +6470 そ +6471 骶 +6472 緝 +6473 締 +6474 緬 +6475 紺 +6476 厩 +6477 経 +6478 鞑 +6479 げ +6480 澆 +6481 谗 +6482 掣 +6483 踌 +6484 侈 +6485 烝 +6486 尓 +6487 曖 +6488 翛 +6489 圜 +6490 嘧 +6491 喟 +6492 喹 +6493 嗷 +6494 唰 +6495 垃 +6496 纜 +6497 诲 +6498 樺 +6499 甯 +6500 泞 +6501 т +6502 婁 +6503 浍 +6504 浠 +6505 浈 +6506 淖 +6507 с +6508 愦 +6509 惆 +6510 憧 +6511 惴 +6512 悭 +6513 銑 +6514 髅 +6515 聾 +6516 沤 +6517 憔 +6518 涔 +6519 洳 +6520 溧 +6521 汜 +6522 汊 +6523 崆 +6524 溏 +6525 譬 +6526 彘 +6527 逅 +6528 氖 +6529 渌 +6530 驸 +6531 驽 +6532 沏 +6533 骀 +6534 骟 +6535 嬗 +6536 苪 +6537 尜 +6538 纣 +6539 鉬 +6540 鈍 +6541 犸 +6542 嗤 +6543 囔 +6544 囝 +6545 馔 +6546 逋 +6547 屐 +6548 孱 +6549 銲 +6550 涮 +6551 鉑 +6552 溲 +6553 狍 +6554 嘤 +6555 庥 +6556 砒 +6557 潴 +6558 湔 +6559 抿 +6560 阏 +6561 罷 +6562 狷 +6563 帑 +6564 恹 +6565 妁 +6566 潸 +6567 澌 +6568 馁 +6569 錠 +6570 鄖 +6571 鋤 +6572 徂 +6573 窿 +6574 廪 +6575 妣 +6576 奀 +6577 岀 +6578 绺 +6579 髑 +6580 で +6581 缃 +6582 顼 +6583 莖 +6584 泅 +6585 荘 +6586 莙 +6587 鸶 +6588 皈 +6589 鸲 +6590 喰 +6591 疴 +6592 鹈 +6593 痤 +6594 鹧 +6595 瘊 +6596 汹 +6597 疔 +6598 饃 +6599 濫 +6600 榀 +6601 楂 +6602 楫 +6603 閲 +6604 閻 +6605 磋 +6606 杌 +6607 璁 +6608 瑁 +6609 冴 +6610 掙 +6611 氷 +6612 辍 +6613 闿 +6614 蕪 +6615 纂 +6616 毽 +6617 氅 +6618 氘 +6619 槁 +6620 枘 +6621 薬 +6622 肼 +6623 桄 +6624 鐲 +6625 薈 +6626 栊 +6627 墒 +6628 觇 +6629 闕 +6630 觎 +6631 薔 +6632 橐 +6633 暝 +6634 胍 +6635 嗽 +6636 胫 +6637 辂 +6638 犍 +6639 挈 +6640 膣 +6641 檩 +6642 噁 +6643 濬 +6644 唄 +6645 琲 +6646 痠 +6647 歩 +6648 黜 +6649 悫 +6650 碜 +6651 矸 +6652 欖 +6653 殳 +6654 旄 +6655 欹 +6656 毂 +6657 诽 +6658 兲 +6659 虜 +6660 咁 +6661 瞽 +6662 睽 +6663 撐 +6664 澪 +6665 碉 +6666 囷 +6667 飢 +6668 锘 +6669 蠅 +6670 蠍 +6671 磲 +6672 顱 +6673 屉 +6674 紊 +6675 韌 +6676 眄 +6677 盹 
+6678 镬 +6679 镂 +6680 颙 +6681 煅 +6682 斡 +6683 钫 +6684 秕 +6685 秫 +6686 哮 +6687 睐 +6688 钲 +6689 睚 +6690 瀕 +6691 駛 +6692 ぱ +6693 駁 +6694 駄 +6695 嘜 +6696 満 +6697 蟥 +6698 簟 +6699 吭 +6700 吩 +6701 雩 +6702 霰 +6703 鰱 +6704 ぶ +6705 ブ +6706 謹 +6707 戸 +6708 醮 +6709 醅 +6710 蹩 +6711 ふ +6712 澱 +6713 铡 +6714 諒 +6715 卅 +6716 囟 +6717 貍 +6718 鼋 +6719 鼍 +6720 罄 +6721 舐 +6722 蝈 +6723 鲣 +6724 鬲 +6725 乩 +6726 笄 +6727 蜱 +6728 翮 +6729 郧 +6730 笕 +6731 蜩 +6732 蛩 +6733 鲩 +6734 錾 +6735 蹶 +6736 騁 +6737 箜 +6738 鲮 +6739 跆 +6740 仨 +6741 赝 +6742 豊 +6743 匮 +6744 涸 +6745 笥 +6746 粢 +6747 赧 +6748 瞩 +6749 跤 +6750 睁 +6751 伉 +6752 襯 +6753 诌 +6754 筚 +6755 筌 +6756 騏 +6757 豉 +6758 糗 +6759 剀 +6760 瀞 +6761 嘢 +6762 呋 +6763 邏 +6764 吲 +6765 咂 +6766 唠 +6767 吆 +6768 哔 +6769 啕 +6770 甌 +6771 讦 +6772 诟 +6773 殆 +6774 鼯 +6775 侪 +6776 ほ +6777 郓 +6778 诶 +6779 谀 +6780 倮 +6781 黉 +6782 黙 +6783 坻 +6784 兖 +6785 莛 +6786 苄 +6787 貳 +6788 贖 +6789 陬 +6790 谖 +6791 偻 +6792 兕 +6793 傥 +6794 畚 +6795 鶯 +6796 隈 +6797 谥 +6798 谪 +6799 鵲 +6800 僭 +6801 赑 +6802 谮 +6803 嘩 +6804 畑 +6805 攜 +6806 癥 +6807 価 +6808 莠 +6809 荩 +6810 萜 +6811 軼 +6812 媄 +6813 薨 +6814 薤 +6815 蕖 +6816 藁 +6817 迖 +6818 藿 +6819 蘼 +6820 奁 +6821 揄 +6822 尬 +6823 拶 +6824 燴 +6825 狛 +6826 磡 +6827 磧 +6828 磯 +6829 ǒ +6830 鳔 +6831 鳟 +6832 鳏 +6833 鳎 +6834 鲀 +6835 鲹 +6836 瑣 +6837 唉 +6838 皜 +6839 皞 +6840 惮 +6841 郸 +6842 祘 +6843 揩 +6844 繚 +6845 袱 +6846 珝 +6847 珰 +6848 禱 +6849 畬 +6850 ぐ +6851 睏 +6852 乚 +6853 倓 +6854 倞 +6855 僞 +6856 儷 +6857 儘 +6858 啫 +6859 嚮 +6860 噯 +6861 埇 +6862 埗 +6863 垵 +6864 塢 +6865 奭 +6866 妠 +6867 帰 +6868 恵 +6869 憤 +6870 挾 +6871 摳 +6872 攪 +6873 暦 +6874 暐 +6875 柈 +6876 枂 +6877 棲 +6878 棨 +6879 樁 +6880 槓 +6881 檳 +6882 毐 +6883 洑 +6884 湲 +6885 潁 +6886 瀆 +6887 讣 +6888 ゼ +6889 嫉 +6890 絵 +6891 饯 +6892 ゾ +6893 壘 +6894 絃 +6895 抉 +6896 絆 +6897 嫻 +6898 箏 +6899 箓 +6900 剐 +6901 徊 +6902 槻 +6903 碴 +6904 搽 +6905 诧 +6906 伋 +6907 矯 +6908 矻 +6909 瞅 +6910 堿 +6911 啰 +6912 瘁 +6913 岽 +6914 嶙 +6915 岌 +6916 岜 +6917 釗 +6918 啻 +6919 嗫 +6920 啷 +6921 斃 +6922 猬 +6923 夥 +6924 舛 +6925 猊 +6926 鈦 +6927 髀 +6928 跺 +6929 涘 +6930 遄 +6931 澶 +6932 浥 +6933 炁 +6934 浞 +6935 銨 +6936 洧 +6937 髂 +6938 ツ +6939 臑 +6940 泖 +6941 慊 +6942 ッ +6943 娒 +6944 辆 +6945 嗉 +6946 谰 +6947 峁 +6948 夤 +6949 岵 +6950 獠 +6951 獬 +6952 愠 +6953 崃 +6954 拎 +6955 崤 +6956 忉 +6957 怃 +6958 遴 +6959 迓 +6960 娩 +6961 羈 +6962 嗲 +6963 馇 +6964 囵 +6965 庑 +6966 漯 +6967 麈 +6968 挎 +6969 屾 +6970 涙 +6971 妫 +6972 髌 +6973 テ +6974 徠 +6975 芣 +6976 茀 +6977 鄀 +6978 嚅 +6979 忤 +6980 潆 +6981 瞥 +6982 艱 +6983 嫘 +6984 骘 +6985 苨 +6986 翯 +6987 犴 +6988 馓 +6989 怙 +6990 逦 +6991 沩 +6992 囫 +6993 怦 +6994 羼 +6995 嵋 +6996 嵴 +6997 繭 +6998 囹 +6999 圉 +7000 鈕 +7001 怛 +7002 乓 +7003 咆 +7004 阒 +7005 鉗 +7006 徇 +7007 鉚 +7008 嶷 +7009 豳 +7010 咙 +7011 您 +7012 彷 +7013 妪 +7014 漭 +7015 噢 +7016 戕 +7017 鉅 +7018 鰾 +7019 旵 +7020 麋 +7021 绱 +7022 纰 +7023 鐐 +7024 莀 +7025 菂 +7026 橇 +7027 锹 +7028 缡 +7029 鏘 +7030 鏜 +7031 鏽 +7032 甾 +7033 哾 +7034 昫 +7035 饑 +7036 ば +7037 鸺 +7038 鹁 +7039 鹌 +7040 鹩 +7041 鹨 +7042 餡 +7043 疠 +7044 橢 +7045 鏖 +7046 撻 +7047 閪 +7048 榘 +7049 椹 +7050 魃 +7051 囪 +7052 鑵 +7053 鑼 +7054 锜 +7055 溉 +7056 痊 +7057 颧 +7058 葷 +7059 応 +7060 旮 +7061 辚 +7062 陞 +7063 蕗 +7064 忞 +7065 膑 +7066 胱 +7067 氚 +7068 氲 +7069 牖 +7070 霂 +7071 霑 +7072 魉 +7073 瓴 +7074 殁 +7075 赡 +7076 桎 +7077 赈 +7078 肱 +7079 脘 +7080 槠 +7081 肫 +7082 閏 +7083 菴 +7084 桤 +7085 枨 +7086 槭 +7087 樗 +7088 桕 +7089 觌 +7090 腴 +7091 樘 +7092 雑 +7093 闘 +7094 隠 +7095 雖 +7096 萵 +7097 蕁 +7098 橛 +7099 轵 +7100 栌 +7101 纫 +7102 桴 +7103 桫 +7104 柝 +7105 朐 +7106 薙 +7107 橼 +7108 甥 +7109 辄 +7110 脍 +7111 蕎 +7112 甪 +7113 単 +7114 実 +7115 昰 +7116 窯 +7117 旼 +7118 沨 +7119 岺 +7120 濰 +7121 塱 
+7122 汭 +7123 疙 +7124 婍 +7125 戆 +7126 怼 +7127 砜 +7128 砀 +7129 頃 +7130 魍 +7131 懲 +7132 戩 +7133 撿 +7134 齑 +7135 熳 +7136 鞏 +7137 囂 +7138 虤 +7139 滎 +7140 瞑 +7141 钭 +7142 畎 +7143 畋 +7144 顎 +7145 魑 +7146 戯 +7147 铯 +7148 铫 +7149 檄 +7150 蠄 +7151 旒 +7152 锛 +7153 砟 +7154 酞 +7155 炝 +7156 炻 +7157 钹 +7158 罾 +7159 盥 +7160 铼 +7161 锪 +7162 戽 +7163 嚏 +7164 硇 +7165 黻 +7166 黼 +7167 铊 +7168 铌 +7169 镪 +7170 锸 +7171 頗 +7172 盱 +7173 铑 +7174 钕 +7175 镱 +7176 飆 +7177 腆 +7178 祢 +7179 祧 +7180 詈 +7181 铗 +7182 镏 +7183 颯 +7184 蝨 +7185 禳 +7186 钶 +7187 淆 +7188 牺 +7189 恓 +7190 玨 +7191 鱈 +7192 攏 +7193 嘚 +7194 黢 +7195 я +7196 褡 +7197 窨 +7198 窕 +7199 駱 +7200 囱 +7201 襦 +7202 裥 +7203 讶 +7204 耋 +7205 耵 +7206 裨 +7207 聒 +7208 褙 +7209 褓 +7210 馴 +7211 慜 +7212 浐 +7213 蟀 +7214 髙 +7215 ビ +7216 売 +7217 を +7218 勻 +7219 蚩 +7220 蚨 +7221 驍 +7222 舀 +7223 覓 +7224 黥 +7225 籀 +7226 臬 +7227 魟 +7228 詭 +7229 岞 +7230 霪 +7231 隹 +7232 龇 +7233 髦 +7234 е +7235 愔 +7236 懼 +7237 謡 +7238 貅 +7239 醺 +7240 酴 +7241 髡 +7242 崭 +7243 鴣 +7244 鴦 +7245 へ +7246 ヘ +7247 踅 +7248 蠊 +7249 龉 +7250 醭 +7251 驛 +7252 筲 +7253 襞 +7254 鯽 +7255 踽 +7256 锗 +7257 跄 +7258 蹉 +7259 芈 +7260 蹑 +7261 跗 +7262 跚 +7263 麴 +7264 鮀 +7265 誅 +7266 仞 +7267 驟 +7268 鬍 +7269 鲭 +7270 蹰 +7271 跎 +7272 仃 +7273 蝾 +7274 酝 +7275 読 +7276 跸 +7277 叵 +7278 驤 +7279 髄 +7280 貉 +7281 跹 +7282 蠛 +7283 篝 +7284 篪 +7285 筘 +7286 蝮 +7287 蛴 +7288 蝤 +7289 蜇 +7290 龊 +7291 鲋 +7292 鲽 +7293 鮫 +7294 蘸 +7295 跻 +7296 豸 +7297 踉 +7298 踟 +7299 狰 +7300 赜 +7301 刎 +7302 驪 +7303 鬚 +7304 鲞 +7305 骉 +7306 喳 +7307 篤 +7308 訶 +7309 髖 +7310 蜉 +7311 蹀 +7312 佝 +7313 乸 +7314 沺 +7315 琍 +7316 琎 +7317 巖 +7318 禦 +7319 ╯ +7320 淪 +7321 咦 +7322 擀 +7323 甙 +7324 呖 +7325 咝 +7326 哞 +7327 哽 +7328 哓 +7329 呲 +7330 哕 +7331 咿 +7332 ╰ +7333 漷 +7334 禩 +7335 檜 +7336 鷲 +7337 髭 +7338 囑 +7339 诂 +7340 凇 +7341 诨 +7342 侉 +7343 佻 +7344 伲 +7345 鸞 +7346 鄞 +7347 郫 +7348 鄯 +7349 鼩 +7350 ぼ +7351 軀 +7352 墀 +7353 転 +7354 酃 +7355 籴 +7356 倥 +7357 坩 +7358 坼 +7359 垆 +7360 茑 +7361 鹀 +7362 矍 +7363 坌 +7364 谑 +7365 陔 +7366 匍 +7367 茈 +7368 陲 +7369 傧 +7370 茼 +7371 芟 +7372 鵰 +7373 谘 +7374 亳 +7375 垩 +7376 隰 +7377 谡 +7378 邙 +7379 袤 +7380 儆 +7381 酆 +7382 鹮 +7383 賁 +7384 賃 +7385 儋 +7386 圮 +7387 苘 +7388 賄 +7389 鸛 +7390 埝 +7391 坜 +7392 鱒 +7393 乂 +7394 朧 +7395 沄 +7396 痺 +7397 穢 +7398 譞 +7399 擷 +7400 ポ +7401 嬢 +7402 葑 +7403 莴 +7404 莩 +7405 菝 +7406 蒺 +7407 蓐 +7408 菔 +7409 輓 +7410 蒹 +7411 蒴 +7412 輛 +7413 軻 +7414 齲 +7415 傀 +7416 拮 +7417 薜 +7418 蕻 +7419 轅 +7420 蓿 +7421 捩 +7422 摒 +7423 奘 +7424 匏 +7425 揿 +7426 尴 +7427 抟 +7428 摁 +7429 辮 +7430 挹 +7431 搦 +7432 辺 +7433 谞 +7434 睜 +7435 搐 +7436 鳜 +7437 鱸 +7438 骺 +7439 鞲 +7440 鳙 +7441 鲉 +7442 鲘 +7443 鳉 +7444 鳑 +7445 讐 +7446 瑝 +7447 瑨 +7448 镑 +7449 皚 +7450 皦 +7451 盁 +7452 祙 +7453 痋 +7454 瘈 +7455 瘍 +7456 瘓 +7457 璣 +7458 瓏 +7459 瓘 +7460 笒 +7461 珖 +7462 豢 +7463 籟 +7464 粬 +7465 ё +7466 禵 +7467 ┛ +7468 ┃ +7469 甡 +7470 ぷ +7471 ぁ +7472 ぇ +7473 ガ +7474 + +7475 狈 +7476 盶 +7477 眬 +7478 睒 +7479 侘 +7480 亅 +7481 仮 +7482 亶 +7483 仏 +7484 偓 +7485 値 +7486 倻 +7487 儉 +7488 叡 +7489 厳 +7490 啱 +7491 噠 +7492 嚕 +7493 嘰 +7494 噭 +7495 噸 +7496 垈 +7497 垕 +7498 墾 +7499 墎 +7500 墘 +7501 姸 +7502 姈 +7503 嫲 +7504 嫆 +7505 嫊 +7506 嫋 +7507 崁 +7508 崈 +7509 崚 +7510 崢 +7511 幍 +7512 帯 +7513 巻 +7514 彫 +7515 弶 +7516 悪 +7517 慬 +7518 懇 +7519 憙 +7520 憫 +7521 挻 +7522 拝 +7523 斂 +7524 攢 +7525 攬 +7526 昞 +7527 暠 +7528 晝 +7529 晧 +7530 晸 +7531 杺 +7532 梶 +7533 梼 +7534 槺 +7535 槃 +7536 槑 +7537 櫞 +7538 櫚 +7539 殭 +7540 殲 +7541 洣 +7542 浛 +7543 洨 +7544 湰 +7545 湴 +7546 湳 +7547 淸 +7548 渼 +7549 漖 +7550 瀧 +7551 瀮 +7552 煢 +7553 焼 +7554 ぜ +7555 兗 +7556 惣 +7557 紓 +7558 紘 +7559 紜 +7560 紮 +7561 鹘 +7562 絹 +7563 綑 +7564 ň +7565 肪 
+7566 ぞ +7567 溇 +7568 綻 +7569 継 +7570 緞 +7571 糰 +7572 侥 +7573 続 +7574 綽 +7575 綝 +7576 攫 +7577 絜 +7578 緈 +7579 絪 +7580 綰 +7581 紈 +7582 絎 +7583 紉 +7584 惫 +7585 児 +7586 姽 +7587 暪 +7588 筧 +7589 恫 +7590 ゲ +7591 瞋 +7592 睬 +7593 瞞 +7594 睺 +7595 硚 +7596 碁 +7597 砳 +7598 砕 +7599 珹 +7600 癆 +7601 嚜 +7602 惇 +7603 潽 +7604 嗎 +7605 幄 +7606 釁 +7607 釐 +7608 髁 +7609 劻 +7610 屃 +7611 浟 +7612 羱 +7613 郷 +7614 喁 +7615 唷 +7616 鄘 +7617 喈 +7618 鄲 +7619 骼 +7620 咐 +7621 惱 +7622 繽 +7623 纻 +7624 缷 +7625 罈 +7626 佲 +7627 団 +7628 夆 +7629 猕 +7630 飧 +7631 鈿 +7632 鉄 +7633 屄 +7634 嵚 +7635 聳 +7636 ゅ +7637 ュ +7638 嬪 +7639 濞 +7640 寤 +7641 瀹 +7642 鍊 +7643 鍙 +7644 冧 +7645 変 +7646 庝 +7647 鋇 +7648 洇 +7649 浃 +7650 洌 +7651 鋰 +7652 镊 +7653 臏 +7654 ャ +7655 悝 +7656 惬 +7657 銖 +7658 溍 +7659 谩 +7660 脅 +7661 峋 +7662 浼 +7663 徉 +7664 徨 +7665 郃 +7666 噱 +7667 釧 +7668 邂 +7669 洫 +7670 溽 +7671 悛 +7672 忝 +7673 徭 +7674 怄 +7675 馄 +7676 鈉 +7677 徘 +7678 鈊 +7679 銬 +7680 鎌 +7681 ㄆ +7682 芵 +7683 苼 +7684 苽 +7685 茋 +7686 崛 +7687 嗌 +7688 馐 +7689 馑 +7690 彖 +7691 咫 +7692 涫 +7693 婀 +7694 驺 +7695 芻 +7696 媲 +7697 骛 +7698 苃 +7699 骖 +7700 骢 +7701 苧 +7702 苭 +7703 鎧 +7704 罠 +7705 舎 +7706 镣 +7707 犰 +7708 犷 +7709 嗵 +7710 忪 +7711 錳 +7712 泐 +7713 阄 +7714 銶 +7715 狁 +7716 腘 +7717 狒 +7718 狨 +7719 狲 +7720 嘭 +7721 怍 +7722 怩 +7723 溆 +7724 湓 +7725 郟 +7726 翾 +7727 猄 +7728 噙 +7729 氵 +7730 狴 +7731 狳 +7732 帔 +7733 噘 +7734 儡 +7735 滠 +7736 猇 +7737 狺 +7738 癇 +7739 礳 +7740 昪 +7741 渃 +7742 瀋 +7743 – +7744 廋 +7745 х +7746 ょ +7747 ョ +7748 冪 +7749 绔 +7750 缁 +7751 绁 +7752 暻 +7753 菉 +7754 菫 +7755 玷 +7756 缛 +7757 鏗 +7758 荌 +7759 缣 +7760 莔 +7761 缰 +7762 缯 +7763 鏟 +7764 缵 +7765 莢 +7766 鐮 +7767 痙 +7768 俤 +7769 揹 +7770 梔 +7771 疍 +7772 黟 +7773 婐 +7774 氿 +7775 鸬 +7776 稹 +7777 皤 +7778 穑 +7779 饋 +7780 饹 +7781 餍 +7782 π +7783 袆 +7784 鹆 +7785 鹇 +7786 鹋 +7787 疰 +7788 痃 +7789 鹕 +7790 鹚 +7791 痦 +7792 痼 +7793 鹪 +7794 瘌 +7795 餚 +7796 餛 +7797 疬 +7798 饅 +7799 瘙 +7800 杄 +7801 珽 +7802 夐 +7803 涢 +7804 楱 +7805 椁 +7806 棰 +7807 閬 +7808 熒 +7809 嗚 +7810 欒 +7811 韪 +7812 鑰 +7813 铚 +7814 葇 +7815 涥 +7816 滉 +7817 戋 +7818 検 +7819 浭 +7820 澥 +7821 蔪 +7822 囯 +7823 氹 +7824 氆 +7825 毵 +7826 耄 +7827 毳 +7828 氡 +7829 査 +7830 薦 +7831 藠 +7832 氩 +7833 氤 +7834 槊 +7835 杪 +7836 桡 +7837 蔭 +7838 曷 +7839 脬 +7840 閎 +7841 萩 +7842 晷 +7843 殚 +7844 殛 +7845 呻 +7846 菶 +7847 杷 +7848 隕 +7849 蒄 +7850 镚 +7851 閔 +7852 雋 +7853 轫 +7854 娠 +7855 鐸 +7856 柰 +7857 腠 +7858 胛 +7859 腼 +7860 薺 +7861 轳 +7862 蔔 +7863 胙 +7864 蒾 +7865 轹 +7866 檎 +7867 牦 +7868 媵 +7869 膂 +7870 胝 +7871 胴 +7872 檫 +7873 雝 +7874 蓇 +7875 曩 +7876 犒 +7877 臌 +7878 関 +7879 蒟 +7880 挲 +7881 胼 +7882 辋 +7883 檗 +7884 巉 +7885 昮 +7886 碏 +7887 嶧 +7888 済 +7889 滸 +7890 籬 +7891 琀 +7892 獺 +7893 ╥ +7894 璱 +7895 峣 +7896 抜 +7897 渋 +7898 璉 +7899 ы +7900 図 +7901 栱 +7902 眵 +7903 韡 +7904 懑 +7905 韮 +7906 韾 +7907 蓖 +7908 呁 +7909 浲 +7910 滌 +7911 彀 +7912 欤 +7913 鞕 +7914 ヌ +7915 斕 +7916 蹋 +7917 ь +7918 嗩 +7919 畹 +7920 罘 +7921 顒 +7922 顓 +7923 顗 +7924 顥 +7925 埼 +7926 蝲 +7927 氾 +7928 颼 +7929 锃 +7930 罴 +7931 锝 +7932 砉 +7933 锕 +7934 砝 +7935 礤 +7936 礓 +7937 藺 +7938 钽 +7939 蠲 +7940 锱 +7941 镦 +7942 锲 +7943 锨 +7944 钚 +7945 顳 +7946 焐 +7947 嗡 +7948 颋 +7949 蟞 +7950 蘄 +7951 挝 +7952 镲 +7953 頼 +7954 硌 +7955 頽 +7956 锺 +7957 眙 +7958 椭 +7959 眭 +7960 碚 +7961 碡 +7962 祗 +7963 铙 +7964 锖 +7965 镌 +7966 镓 +7967 頡 +7968 碲 +7969 禊 +7970 颱 +7971 蟯 +7972 蟄 +7973 韜 +7974 頦 +7975 蝀 +7976 蟈 +7977 磔 +7978 忑 +7979 颶 +7980 磙 +7981 忐 +7982 燹 +7983 蠣 +7984 玧 +7985 獼 +7986 甁 +7987 禟 +7988 桲 +7989 譴 +7990 祃 +7991 窵 +7992 琺 +7993 俶 +7994 昺 +7995 渕 +7996 ∕ +7997 鱉 +7998 廙 +7999 琻 +8000 玭 +8001 ﹏ +8002 縉 +8003 烴 +8004 聍 +8005 癔 +8006 瘛 +8007 瘵 +8008 瘠 +8009 駙 
+8010 駟 +8011 喲 +8012 癃 +8013 皴 +8014 裢 +8015 耧 +8016 裊 +8017 褛 +8018 聩 +8019 褊 +8020 褫 +8021 颃 +8022 媜 +8023 昽 +8024 梠 +8025 ㎡ +8026 嚭 +8027 埡 +8028 簕 +8029 簫 +8030 黧 +8031 篦 +8032 笞 +8033 蟋 +8034 蟑 +8035 螬 +8036 髪 +8037 в +8038 虼 +8039 颥 +8040 蚍 +8041 蚋 +8042 驊 +8043 驎 +8044 円 +8045 捯 +8046 曇 +8047 眶 +8048 滛 +8049 烎 +8050 魘 +8051 艋 +8052 舢 +8053 魷 +8054 詮 +8055 婗 +8056 滝 +8057 龃 +8058 鲼 +8059 觥 +8060 龌 +8061 鰭 +8062 謬 +8063 鮜 +8064 酽 +8065 醢 +8066 醯 +8067 酡 +8068 鯇 +8069 鯖 +8070 辗 +8071 眨 +8072 圾 +8073 髫 +8074 卮 +8075 丨 +8076 艟 +8077 黾 +8078 艄 +8079 虿 +8080 龀 +8081 罅 +8082 箦 +8083 蜮 +8084 鲠 +8085 鲥 +8086 雠 +8087 誥 +8088 趵 +8089 趼 +8090 蹂 +8091 趺 +8092 嘏 +8093 蜴 +8094 鲦 +8095 襜 +8096 諦 +8097 箸 +8098 笮 +8099 襠 +8100 笊 +8101 箅 +8102 蜿 +8103 鍪 +8104 鏊 +8105 亻 +8106 豨 +8107 鯤 +8108 箪 +8109 筇 +8110 箢 +8111 蛲 +8112 蝻 +8113 籼 +8114 諭 +8115 鲱 +8116 躅 +8117 仂 +8118 諮 +8119 簁 +8120 鯧 +8121 謐 +8122 誰 +8123 鳯 +8124 訫 +8125 豈 +8126 蝰 +8127 粞 +8128 鯪 +8129 鲴 +8130 鮪 +8131 笤 +8132 笾 +8133 蝌 +8134 螋 +8135 蝓 +8136 趄 +8137 糌 +8138 鲇 +8139 鲆 +8140 鲻 +8141 鲺 +8142 鲐 +8143 躞 +8144 貊 +8145 伥 +8146 魆 +8147 鰍 +8148 鮭 +8149 鮍 +8150 髎 +8151 諷 +8152 鳶 +8153 筮 +8154 騮 +8155 詔 +8156 鯰 +8157 鮰 +8158 筅 +8159 篼 +8160 蝥 +8161 蜊 +8162 糅 +8163 酎 +8164 踮 +8165 刿 +8166 諺 +8167 鬢 +8168 骕 +8169 鴛 +8170 糨 +8171 鳊 +8172 巔 +8173 噐 +8174 攔 +8175 丷 +8176 烺 +8177 眘 +8178 譙 +8179 疭 +8180 丼 +8181 奡 +8182 н +8183 ┻ +8184 邨 +8185 哚 +8186 呃 +8187 咤 +8188 呙 +8189 逨 +8190 哳 +8191 呶 +8192 唢 +8193 哂 +8194 啁 +8195 咣 +8196 唿 +8197 玹 +8198 眛 +8199 匱 +8200 噓 +8201 嶶 +8202 鼹 +8203 掹 +8204 鷥 +8205 蹿 +8206 ぺ +8207 広 +8208 讧 +8209 趨 +8210 鶉 +8211 鶗 +8212 ベ +8213 佴 +8214 佾 +8215 鼷 +8216 堺 +8217 燐 +8218 麀 +8219 鸰 +8220 ホ +8221 鄣 +8222 郛 +8223 郏 +8224 鼽 +8225 慄 +8226 嬡 +8227 屲 +8228 堞 +8229 堠 +8230 劬 +8231 芄 +8232 鄄 +8233 艹 +8234 谂 +8235 诿 +8236 貯 +8237 劾 +8238 茔 +8239 鹍 +8240 陉 +8241 訇 +8242 鬯 +8243 荛 +8244 苁 +8245 鵪 +8246 鸊 +8247 偬 +8248 厶 +8249 賚 +8250 垴 +8251 贠 +8252 谝 +8253 邗 +8254 儇 +8255 苤 +8256 冫 +8257 圹 +8258 埸 +8259 黿 +8260 邶 +8261 埤 +8262 茌 +8263 谵 +8264 贍 +8265 瑅 +8266 眞 +8267 亀 +8268 坵 +8269 檞 +8270 玏 +8271 沇 +8272 縴 +8273 凖 +8274 淉 +8275 齷 +8276 龕 +8277 傒 +8278 栞 +8279 蓣 +8280 荸 +8281 荬 +8282 轂 +8283 萋 +8284 萏 +8285 菹 +8286 蓠 +8287 蒡 +8288 葜 +8289 甍 +8290 軽 +8291 軾 +8292 瓃 +8293 倆 +8294 巣 +8295 玓 +8296 淶 +8297 慆 +8298 兀 +8299 м +8300 掁 +8301 栟 +8302 迍 +8303 蘧 +8304 轆 +8305 蕺 +8306 蘩 +8307 掴 +8308 捭 +8309 耷 +8310 掼 +8311 拊 +8312 拚 +8313 捃 +8314 込 +8315 盃 +8316 疇 +8317 偁 +8318 燻 +8319 牤 +8320 牘 +8321 犽 +8322 犢 +8323 磥 +8324 磦 +8325 磻 +8326 磾 +8327 ▕ +8328 ▽ +8329 鼢 +8330 鳓 +8331 靼 +8332 鞯 +8333 鲃 +8334 鲊 +8335 鲏 +8336 鲖 +8337 鲙 +8338 鲯 +8339 鲾 +8340 鳀 +8341 鳠 +8342 讃 +8343 譆 +8344 讎 +8345 讖 +8346 瑒 +8347 瑧 +8348 瑫 +8349 瑮 +8350 瑱 +8351 瑸 +8352 癭 +8353 皛 +8354 癩 +8355 皰 +8356 皸 +8357 礬 +8358 祼 +8359 禇 +8360 痎 +8361 瘄 +8362 瘆 +8363 瘣 +8364 瘧 +8365 瘨 +8366 瘺 +8367 瓔 +8368 瓚 +8369 瓛 +8370 瓟 +8371 縹 +8372 繄 +8373 繅 +8374 繕 +8375 穇 +8376 穫 +8377 窊 +8378 窓 +8379 窣 +8380 笉 +8381 笣 +8382 玚 +8383 玔 +8384 珄 +8385 珌 +8386 珎 +8387 珛 +8388 珵 +8389 粙 +8390 粨 +8391 粩 +8392 Т +8393 О +8394 Л +8395 Э +8396 Б +8397 秬 +8398 稈 +8399 ┗ +8400 ━ +8401 畤 +8402 畯 +8403 ぉ +8404 ぃ +8405 プ +8406 ィ +8407 ゥ +8408 眧 +8409 盷 +8410 眴 +8411 亊 +8412 伒 +8413 亇 +8414 仱 +8415 亜 +8416 亹 +8417 仐 +8418 仚 +8419 偞 +8420 偍 +8421 倕 +8422 倗 +8423 倧 +8424 倴 +8425 倵 +8426 倶 +8427 儈 +8428 僝 +8429 僜 +8430 儔 +8431 儚 +8432 劏 +8433 劵 +8434 劊 +8435 劌 +8436 剺 +8437 叇 +8438 叆 +8439 厔 +8440 厙 +8441 厷 +8442 喛 +8443 啣 +8444 啈 +8445 圇 +8446 嚢 +8447 嚨 +8448 嚟 +8449 噲 +8450 噵 +8451 噽 +8452 嚀 +8453 嚐 
+8454 堊 +8455 垿 +8456 埈 +8457 埆 +8458 垻 +8459 垾 +8460 垏 +8461 垝 +8462 垞 +8463 垱 +8464 垳 +8465 夨 +8466 塤 +8467 壆 +8468 墐 +8469 娮 +8470 娀 +8471 妧 +8472 妶 +8473 姀 +8474 嫤 +8475 嫰 +8476 嫽 +8477 媭 +8478 媱 +8479 嫀 +8480 嫃 +8481 嫏 +8482 嫑 +8483 嫕 +8484 嫙 +8485 尨 +8486 宍 +8487 尅 +8488 尪 +8489 孿 +8490 寔 +8491 寗 +8492 寭 +8493 寯 +8494 寳 +8495 対 +8496 専 +8497 峘 +8498 崀 +8499 嶗 +8500 嵂 +8501 崼 +8502 嵒 +8503 崍 +8504 崐 +8505 巂 +8506 巸 +8507 巹 +8508 巿 +8509 帀 +8510 帡 +8511 帩 +8512 彯 +8513 惙 +8514 惢 +8515 悧 +8516 悩 +8517 悰 +8518 悵 +8519 慇 +8520 憚 +8521 憣 +8522 憭 +8523 掲 +8524 抃 +8525 拏 +8526 拠 +8527 拤 +8528 拫 +8529 拵 +8530 拸 +8531 挏 +8532 挐 +8533 挓 +8534 摽 +8535 摜 +8536 摫 +8537 搥 +8538 摑 +8539 敻 +8540 攣 +8541 攱 +8542 攲 +8543 攽 +8544 敐 +8545 暱 +8546 晙 +8547 晛 +8548 晫 +8549 晳 +8550 暁 +8551 暅 +8552 枴 +8553 柅 +8554 枹 +8555 枺 +8556 朶 +8557 枌 +8558 枒 +8559 枓 +8560 枙 +8561 枟 +8562 枦 +8563 枱 +8564 棸 +8565 椏 +8566 梋 +8567 棫 +8568 棻 +8569 梾 +8570 棅 +8571 樋 +8572 槱 +8573 榎 +8574 榿 +8575 槀 +8576 槇 +8577 槈 +8578 槉 +8579 槏 +8580 槖 +8581 槜 +8582 槝 +8583 橿 +8584 檁 +8585 櫌 +8586 檵 +8587 檻 +8588 櫆 +8589 櫈 +8590 毎 +8591 歔 +8592 殢 +8593 洦 +8594 泘 +8595 泑 +8596 洶 +8597 洴 +8598 浉 +8599 洰 +8600 洢 +8601 泚 +8602 洈 +8603 湋 +8604 湙 +8605 湚 +8606 湞 +8607 潿 +8608 澂 +8609 澔 +8610 潄 +8611 潏 +8612 潟 +8613 灤 +8614 灕 +8615 瀅 +8616 瀰 +8617 瀼 +8618 灃 +8619 灄 +8620 煬 +8621 煾 +8622 煡 +8623 煓 +8624 嚰 +8625 糬 +8626 ń +8627 幗 +8628 摻 +8629 絔 +8630 綋 +8631 綎 +8632 痪 +8633 樉 +8634 緙 +8635 緱 +8636 緲 +8637 糀 +8638 紸 +8639 綣 +8640 紂 +8641 綬 +8642 鞴 +8643 肮 +8644 柟 +8645 箋 +8646 箖 +8647 箠 +8648 筯 +8649 筶 +8650 筼 +8651 刽 +8652 筜 +8653 嚥 +8654 嵅 +8655 毑 +8656 瞼 +8657 瞾 +8658 矅 +8659 矖 +8660 矚 +8661 瞐 +8662 睞 +8663 瞕 +8664 瞤 +8665 С +8666 嫵 +8667 槼 +8668 砢 +8669 硞 +8670 硤 +8671 硨 +8672 硵 +8673 矰 +8674 盿 +8675 巃 +8676 焌 +8677 礎 +8678 禔 +8679 朅 +8680 楢 +8681 И +8682 嫪 +8683 敧 +8684 獦 +8685 п +8686 屺 +8687 岣 +8688 岖 +8689 幞 +8690 醽 +8691 醾 +8692 醿 +8693 釄 +8694 釈 +8695 Я +8696 伭 +8697 偭 +8698 熈 +8699 羶 +8700 羾 +8701 翃 +8702 翚 +8703 о +8704 傕 +8705 愢 +8706 曕 +8707 涏 +8708 鄚 +8709 唳 +8710 喾 +8711 啖 +8712 鄴 +8713 尷 +8714 椑 +8715 熇 +8716 繾 +8717 繹 +8718 纮 +8719 缊 +8720 罌 +8721 罍 +8722 呪 +8723 炩 +8724 熲 +8725 鈄 +8726 猞 +8727 獍 +8728 猸 +8729 狻 +8730 饪 +8731 饣 +8732 鈡 +8733 猢 +8734 猡 +8735 獗 +8736 鈷 +8737 β +8738 吢 +8739 埪 +8740 徛 +8741 毬 +8742 浡 +8743 灺 +8744 赂 +8745 聤 +8746 聴 +8747 麇 +8748 у +8749 傚 +8750 冨 +8751 堝 +8752 撳 +8753 欏 +8754 氬 +8755 滃 +8756 濆 +8757 錨 +8758 瀣 +8759 錩 +8760 濉 +8761 鍎 +8762 鍔 +8763 鍖 +8764 鍘 +8765 鍢 +8766 鍥 +8767 づ +8768 偱 +8769 壟 +8770 惻 +8771 斉 +8772 舺 +8773 艅 +8774 艎 +8775 嶄 +8776 曚 +8777 澉 +8778 涠 +8779 洎 +8780 洚 +8781 鋯 +8782 鋶 +8783 鋹 +8784 偰 +8785 縻 +8786 娿 +8787 阕 +8788 悱 +8789 愀 +8790 悃 +8791 惝 +8792 惚 +8793 銆 +8794 銍 +8795 銓 +8796 銚 +8797 銥 +8798 γ +8799 吤 +8800 幟 +8801 捗 +8802 脇 +8803 脛 +8804 脩 +8805 脳 +8806 脷 +8807 饨 +8808 褰 +8809 愎 +8810 岢 +8811 宄 +8812 徜 +8813 崦 +8814 噔 +8815 嗄 +8816 嚆 +8817 辶 +8818 謇 +8819 邋 +8820 迮 +8821 迕 +8822 渑 +8823 淠 +8824 溷 +8825 胊 +8826 憷 +8827 隳 +8828 崞 +8829 嗥 +8830 鉨 +8831 纁 +8832 淝 +8833 鉉 +8834 罝 +8835 狃 +8836 嵝 +8837 錮 +8838 腄 +8839 傛 +8840 忔 +8841 鎊 +8842 妯 +8843 妗 +8844 鎭 +8845 鏃 +8846 伷 +8847 吰 +8848 嬈 +8849 澠 +8850 芧 +8851 芠 +8852 苳 +8853 茖 +8854 茤 +8855 茪 +8856 徼 +8857 彡 +8858 犭 +8859 噼 +8860 嚯 +8861 翬 +8862 忾 +8863 迨 +8864 抨 +8865 渖 +8866 娌 +8867 芶 +8868 胬 +8869 鍮 +8870 媸 +8871 嫠 +8872 骒 +8873 鍺 +8874 骣 +8875 鍼 +8876 苢 +8877 鎡 +8878 芓 +8879 鎢 +8880 迄 +8881 纥 +8882 鎣 +8883 芛 +8884 纩 +8885 孓 +8886 臚 +8887 鄃 +8888 釭 +8889 臜 +8890 嵛 +8891 嵯 +8892 鄆 +8893 囗 +8894 忭 +8895 忸 +8896 馕 +8897 庀 
+8898 屣 +8899 渫 +8900 撵 +8901 湎 +8902 阃 +8903 醞 +8904 郕 +8905 羕 +8906 鈑 +8907 臠 +8908 胠 +8909 猰 +8910 狽 +8911 嵫 +8912 郚 +8913 嘌 +8914 臢 +8915 漶 +8916 狯 +8917 嶝 +8918 鄌 +8919 圄 +8920 嗾 +8921 怫 +8922 搴 +8923 逡 +8924 逶 +8925 遑 +8926 鉶 +8927 阌 +8928 阋 +8929 錚 +8930 鉷 +8931 罳 +8932 鉸 +8933 釹 +8934 舠 +8935 銼 +8936 酺 +8937 羣 +8938 纓 +8939 羥 +8940 猁 +8941 彳 +8942 帙 +8943 嘬 +8944 傈 +8945 廑 +8946 舥 +8947 遘 +8948 潲 +8949 溘 +8950 膙 +8951 錡 +8952 醁 +8953 鉞 +8954 醂 +8955 猃 +8956 帻 +8957 廨 +8958 遢 +8959 爿 +8960 繸 +8961 鈀 +8962 舲 +8963 脄 +8964 肸 +8965 赁 +8966 呸 +8967 譩 +8968 璪 +8969 卋 +8970 媌 +8971 揵 +8972 渂 +8973 獴 +8974 濨 +8975 縞 +8976 玞 +8977 塩 +8978 礐 +8979 瑿 +8980 滳 +8981 濩 +8982 嶇 +8983 旂 +8984 栫 +8985 熺 +8986 绋 +8987 缱 +8988 绲 +8989 缈 +8990 绌 +8991 鐓 +8992 鐜 +8993 鐢 +8994 鐫 +8995 デ +8996 喦 +8997 柷 +8998 溓 +8999 荾 +9000 莧 +9001 菋 +9002 菍 +9003 缂 +9004 绻 +9005 缒 +9006 荄 +9007 缧 +9008 莐 +9009 鏻 +9010 莑 +9011 荖 +9012 鏞 +9013 荗 +9014 缫 +9015 荙 +9016 鏤 +9017 缳 +9018 莦 +9019 譫 +9020 礵 +9021 穌 +9022 卍 +9023 坉 +9024 奷 +9025 檇 +9026 沝 +9027 鱀 +9028 窪 +9029 籇 +9030 丏 +9031 嘍 +9032 塂 +9033 廌 +9034 涴 +9035 滒 +9036 燄 +9037 瘅 +9038 瓞 +9039 鸹 +9040 饈 +9041 饗 +9042 饞 +9043 饤 +9044 饸 +9045 饾 +9046 佇 +9047 傂 +9048 椥 +9049 衸 +9050 鸸 +9051 痄 +9052 蠧 +9053 鹎 +9054 痖 +9055 痍 +9056 衒 +9057 餒 +9058 鹜 +9059 鹛 +9060 鹣 +9061 酗 +9062 餸 +9063 瘐 +9064 蠷 +9065 餾 +9066 餞 +9067 磂 +9068 縠 +9069 乪 +9070 僥 +9071 漞 +9072 瀍 +9073 礒 +9074 僂 +9075 楨 +9076 冮 +9077 嗛 +9078 忛 +9079 旈 +9080 氶 +9081 滈 +9082 轺 +9083 楗 +9084 閭 +9085 閶 +9086 閹 +9087 闀 +9088 闆 +9089 娚 +9090 樕 +9091 炆 +9092 蓕 +9093 蓢 +9094 蓪 +9095 蓯 +9096 ㈣ +9097 庤 +9098 氳 +9099 滆 +9100 鑷 +9101 鑾 +9102 钑 +9103 铏 +9104 铻 +9105 壢 +9106 嬋 +9107 斎 +9108 毴 +9109 萚 +9110 萛 +9111 葎 +9112 葖 +9113 葦 +9114 夑 +9115 婈 +9116 甏 +9117 犄 +9118 軎 +9119 戥 +9120 辎 +9121 辏 +9122 陘 +9123 険 +9124 蔃 +9125 蕣 +9126 蕫 +9127 蕶 +9128 侂 +9129 撾 +9130 涬 +9131 熾 +9132 毹 +9133 氍 +9134 雱 +9135 霙 +9136 吽 +9137 喫 +9138 毸 +9139 藘 +9140 藟 +9141 藦 +9142 藨 +9143 昃 +9144 刖 +9145 旯 +9146 璩 +9147 瓿 +9148 锳 +9149 槔 +9150 殄 +9151 殂 +9152 栳 +9153 枇 +9154 枥 +9155 赅 +9156 赍 +9157 氇 +9158 朊 +9159 赕 +9160 菳 +9161 镃 +9162 榍 +9163 赙 +9164 肭 +9165 鐳 +9166 菵 +9167 腽 +9168 殍 +9169 栝 +9170 梃 +9171 觊 +9172 觋 +9173 閒 +9174 雊 +9175 薌 +9176 蔴 +9177 薍 +9178 橥 +9179 菼 +9180 觏 +9181 薸 +9182 镴 +9183 鐺 +9184 蒻 +9185 萂 +9186 檠 +9187 梏 +9188 枵 +9189 蕂 +9190 旰 +9191 暌 +9192 曛 +9193 牾 +9194 擞 +9195 腧 +9196 蓀 +9197 薖 +9198 蓂 +9199 閟 +9200 鑣 +9201 柽 +9202 脒 +9203 閡 +9204 轾 +9205 檑 +9206 柃 +9207 柢 +9208 闡 +9209 蕌 +9210 犏 +9211 贶 +9212 僳 +9213 蔞 +9214 薠 +9215 蓎 +9216 椠 +9217 閆 +9218 闋 +9219 蔂 +9220 薁 +9221 琭 +9222 奻 +9223 渇 +9224 鱂 +9225 禙 +9226 丗 +9227 朏 +9228 桭 +9229 汧 +9230 磄 +9231 縢 +9232 眊 +9233 俫 +9234 峠 +9235 璆 +9236 咷 +9237 圙 +9238 楪 +9239 欸 +9240 甴 +9241 奾 +9242 媓 +9243 榟 +9244 渉 +9245 疕 +9246 璈 +9247 奌 +9248 桯 +9249 礽 +9250 唅 +9251 沬 +9252 両 +9253 朓 +9254 愴 +9255 韃 +9256 恝 +9257 恁 +9258 頊 +9259 囃 +9260 埻 +9261 娡 +9262 樛 +9263 蚡 +9264 蛯 +9265 蛺 +9266 蜆 +9267 嶌 +9268 靆 +9269 歃 +9270 臁 +9271 靰 +9272 飑 +9273 霡 +9274 欷 +9275 膦 +9276 靄 +9277 靺 +9278 魈 +9279 嬏 +9280 藹 +9281 虁 +9282 虒 +9283 栴 +9284 燁 +9285 瞀 +9286 畀 +9287 瞌 +9288 瞟 +9289 瞍 +9290 睥 +9291 頫 +9292 囄 +9293 嬑 +9294 屛 +9295 嵨 +9296 櫸 +9297 蝃 +9298 螄 +9299 螆 +9300 庯 +9301 橈 +9302 濓 +9303 锇 +9304 锆 +9305 锔 +9306 锒 +9307 飩 +9308 飪 +9309 屜 +9310 徬 +9311 愊 +9312 撓 +9313 浵 +9314 蟶 +9315 蟷 +9316 蠂 +9317 蠑 +9318 蠙 +9319 砑 +9320 罱 +9321 锞 +9322 罟 +9323 觳 +9324 畛 +9325 锍 +9326 礅 +9327 砼 +9328 豌 +9329 靉 +9330 烀 +9331 瞠 +9332 盍 +9333 钅 +9334 顰 +9335 镡 +9336 锫 +9337 镢 +9338 礞 +9339 炱 +9340 鞡 +9341 砬 
+9342 蘡 +9343 扃 +9344 镧 +9345 蘢 +9346 祓 +9347 镳 +9348 砩 +9349 硎 +9350 硭 +9351 礻 +9352 铋 +9353 钍 +9354 顴 +9355 螮 +9356 镩 +9357 镨 +9358 蟜 +9359 颎 +9360 蟝 +9361 鞨 +9362 頷 +9363 螲 +9364 蚷 +9365 硗 +9366 硖 +9367 祆 +9368 钐 +9369 铒 +9370 颕 +9371 蚃 +9372 頹 +9373 韐 +9374 蟢 +9375 鞮 +9376 蝜 +9377 眈 +9378 霳 +9379 硪 +9380 煸 +9381 熘 +9382 蝟 +9383 钣 +9384 铘 +9385 铞 +9386 飋 +9387 蟧 +9388 矬 +9389 矧 +9390 镆 +9391 鞶 +9392 頠 +9393 磉 +9394 爝 +9395 靦 +9396 飔 +9397 碹 +9398 蘵 +9399 燠 +9400 燔 +9401 铥 +9402 飖 +9403 稂 +9404 镘 +9405 靨 +9406 飗 +9407 蛍 +9408 颳 +9409 靬 +9410 蘗 +9411 蟳 +9412 镙 +9413 靂 +9414 蘘 +9415 蝂 +9416 籮 +9417 媕 +9418 巎 +9419 杍 +9420 榡 +9421 匤 +9422 旿 +9423 汮 +9424 媖 +9425 宬 +9426 昸 +9427 鱇 +9428 圞 +9429 痩 +9430 禠 +9431 甃 +9432 凩 +9433 奓 +9434 昄 +9435 玬 +9436 弇 +9437 杕 +9438 禡 +9439 僊 +9440 慚 +9441 磏 +9442 祅 +9443 籲 +9444 俷 +9445 卬 +9446 坣 +9447 礜 +9448 玁 +9449 瘩 +9450 侎 +9451 傫 +9452 烋 +9453 瘭 +9454 癍 +9455 窀 +9456 瘰 +9457 穸 +9458 瘼 +9459 瘢 +9460 駡 +9461 駭 +9462 а +9463 毖 +9464 炑 +9465 熝 +9466 褆 +9467 褕 +9468 褢 +9469 褣 +9470 褦 +9471 褯 +9472 窬 +9473 癯 +9474 襻 +9475 窭 +9476 疋 +9477 皲 +9478 馼 +9479 馚 +9480 馜 +9481 耩 +9482 袷 +9483 袼 +9484 裎 +9485 裣 +9486 耨 +9487 耱 +9488 裰 +9489 襁 +9490 裈 +9491 聱 +9492 顸 +9493 褴 +9494 裋 +9495 馱 +9496 裏 +9497 唎 +9498 弌 +9499 恛 +9500 渙 +9501 窸 +9502 籓 +9503 労 +9504 椇 +9505 樅 +9506 熀 +9507 簞 +9508 簠 +9509 簷 +9510 Ⅲ +9511 嶓 +9512 蠖 +9513 螅 +9514 螗 +9515 蟊 +9516 髣 +9517 髴 +9518 鬄 +9519 饔 +9520 営 +9521 捰 +9522 臃 +9523 訏 +9524 訑 +9525 訕 +9526 訚 +9527 黩 +9528 岒 +9529 嶒 +9530 栻 +9531 颟 +9532 虮 +9533 騾 +9534 驀 +9535 驁 +9536 驃 +9537 呉 +9538 喴 +9539 襖 +9540 覚 +9541 傯 +9542 掫 +9543 汈 +9544 濘 +9545 舁 +9546 舻 +9547 舣 +9548 艨 +9549 舴 +9550 舾 +9551 舳 +9552 嬙 +9553 懺 +9554 捲 +9555 斣 +9556 詛 +9557 詝 +9558 詡 +9559 詣 +9560 詰 +9561 詼 +9562 匂 +9563 庼 +9564 楒 +9565 欥 +9566 汌 +9567 龅 +9568 鯻 +9569 鯷 +9570 龆 +9571 觫 +9572 謦 +9573 鰟 +9574 鰣 +9575 鰤 +9576 鰥 +9577 鰩 +9578 鰶 +9579 喼 +9580 壷 +9581 娭 +9582 撝 +9583 曋 +9584 氈 +9585 謄 +9586 諤 +9587 謨 +9588 謫 +9589 謳 +9590 侕 +9591 凊 +9592 婖 +9593 曱 +9594 濙 +9595 醑 +9596 醐 +9597 醣 +9598 鯔 +9599 栒 +9600 椪 +9601 誣 +9602 諏 +9603 咗 +9604 岠 +9605 戻 +9606 鴃 +9607 鴂 +9608 鴯 +9609 鴳 +9610 鴷 +9611 ω +9612 堌 +9613 愗 +9614 懾 +9615 椮 +9616 炟 +9617 亍 +9618 鼗 +9619 貐 +9620 蠼 +9621 蚪 +9622 艏 +9623 艚 +9624 跫 +9625 豕 +9626 蟛 +9627 舨 +9628 蟪 +9629 骯 +9630 螵 +9631 筢 +9632 恿 +9633 筻 +9634 襛 +9635 颡 +9636 蚵 +9637 蜾 +9638 詀 +9639 訟 +9640 袈 +9641 鲡 +9642 趿 +9643 踱 +9644 蹁 +9645 剜 +9646 劁 +9647 劂 +9648 詁 +9649 魾 +9650 蚯 +9651 鮟 +9652 蹒 +9653 冂 +9654 覦 +9655 騞 +9656 訢 +9657 粜 +9658 諨 +9659 誨 +9660 跣 +9661 鴇 +9662 鳧 +9663 谼 +9664 覧 +9665 箝 +9666 箨 +9667 笫 +9668 羰 +9669 鯡 +9670 跞 +9671 跏 +9672 詆 +9673 訥 +9674 谿 +9675 笸 +9676 蛱 +9677 綮 +9678 粝 +9679 鲰 +9680 鮥 +9681 跬 +9682 躏 +9683 仡 +9684 匚 +9685 仫 +9686 簀 +9687 覬 +9688 鮦 +9689 鰈 +9690 鮨 +9691 鮈 +9692 覯 +9693 鰉 +9694 諱 +9695 鳰 +9696 鬘 +9697 躐 +9698 伛 +9699 阂 +9700 髈 +9701 篌 +9702 篥 +9703 蛞 +9704 蝣 +9705 蛑 +9706 蛘 +9707 趔 +9708 趑 +9709 趱 +9710 豇 +9711 拄 +9712 鮋 +9713 踔 +9714 跽 +9715 鴒 +9716 豋 +9717 仳 +9718 卣 +9719 覲 +9720 驫 +9721 観 +9722 騫 +9723 謖 +9724 謗 +9725 鴕 +9726 骃 +9727 魋 +9728 髏 +9729 鮐 +9730 鳷 +9731 篣 +9732 螓 +9733 蜍 +9734 魎 +9735 訳 +9736 謚 +9737 鲒 +9738 鳆 +9739 鲔 +9740 鳇 +9741 鲕 +9742 誹 +9743 踬 +9744 觖 +9745 踯 +9746 刭 +9747 觱 +9748 鮒 +9749 髕 +9750 觴 +9751 諼 +9752 豖 +9753 騳 +9754 験 +9755 襶 +9756 酏 +9757 鲚 +9758 踺 +9759 骦 +9760 鴝 +9761 鳽 +9762 蜣 +9763 肄 +9764 剞 +9765 訝 +9766 鱬 +9767 卲 +9768 梡 +9769 廝 +9770 ╭ +9771 琿 +9772 弎 +9773 梣 +9774 禥 +9775 慟 +9776 峳 +9777 璕 +9778 擱 +9779 祍 +9780 峴 +9781 泂 +9782 渟 +9783 漵 +9784 禨 +9785 婼 
+9786 擲 +9787 昐 +9788 鬟 +9789 忂 +9790 攮 +9791 攉 +9792 撙 +9793 撺 +9794 邅 +9795 邩 +9796 呒 +9797 哙 +9798 唣 +9799 哐 +9800 咭 +9801 啭 +9802 遅 +9803 遹 +9804 唼 +9805 痶 +9806 籺 +9807 籘 +9808 盩 +9809 俆 +9810 嘥 +9811 昑 +9812 桾 +9813 漈 +9814 畊 +9815 僽 +9816 坲 +9817 塽 +9818 妘 +9819 奤 +9820 孶 +9821 朥 +9822 Ⅹ +9823 夲 +9824 燏 +9825 鷟 +9826 鷂 +9827 鷞 +9828 鷨 +9829 鷯 +9830 鷸 +9831 鷿 +9832 и +9833 ペ +9834 佢 +9835 嗂 +9836 氌 +9837 诖 +9838 诎 +9839 诙 +9840 诋 +9841 冖 +9842 跂 +9843 跅 +9844 跐 +9845 咘 +9846 圌 +9847 滪 +9848 鵠 +9849 鵟 +9850 鶒 +9851 鶘 +9852 ① +9853 佡 +9854 嬞 +9855 俅 +9856 俜 +9857 贄 +9858 凔 +9859 曽 +9860 鸤 +9861 髹 +9862 娵 +9863 撣 +9864 鄹 +9865 邾 +9866 郾 +9867 蹐 +9868 蹓 +9869 蹠 +9870 蹣 +9871 圏 +9872 婞 +9873 孅 +9874 欬 +9875 黐 +9876 鬈 +9877 冘 +9878 嗆 +9879 嵻 +9880 愜 +9881 栜 +9882 炣 +9883 塄 +9884 墁 +9885 芑 +9886 芏 +9887 鼙 +9888 軋 +9889 軏 +9890 軛 +9891 诮 +9892 劢 +9893 诓 +9894 诔 +9895 讵 +9896 卺 +9897 诹 +9898 诼 +9899 哿 +9900 麬 +9901 苈 +9902 苠 +9903 倨 +9904 貰 +9905 阽 +9906 偾 +9907 鷇 +9908 賒 +9909 鷈 +9910 阼 +9911 匐 +9912 踖 +9913 鷉 +9914 鹔 +9915 陧 +9916 垤 +9917 埏 +9918 巯 +9919 芴 +9920 躪 +9921 鹠 +9922 鵑 +9923 傺 +9924 鹡 +9925 鸑 +9926 苎 +9927 鶲 +9928 貽 +9929 僬 +9930 埚 +9931 埘 +9932 埒 +9933 圬 +9934 躉 +9935 芤 +9936 茇 +9937 趐 +9938 躊 +9939 踧 +9940 跶 +9941 鹯 +9942 跼 +9943 賡 +9944 龠 +9945 賂 +9946 鹴 +9947 邡 +9948 蠃 +9949 埴 +9950 茺 +9951 鶺 +9952 赪 +9953 黈 +9954 鶻 +9955 踴 +9956 谳 +9957 鵼 +9958 圯 +9959 鸝 +9960 鸂 +9961 鼱 +9962 鱲 +9963 璿 +9964 憊 +9965 泇 +9966 玍 +9967 嶸 +9968 烿 +9969 皐 +9970 竪 +9971 玾 +9972 杦 +9973 泈 +9974 鱓 +9975 媁 +9976 榃 +9977 淲 +9978 玿 +9979 亁 +9980 碭 +9981 癤 +9982 疿 +9983 揦 +9984 榅 +9985 鱵 +9986 瑈 +9987 儁 +9988 漼 +9989 玒 +9990 甕 +9991 塝 +9992 榊 +9993 圐 +9994 岧 +9995 汖 +9996 齶 +9997 齙 +9998 龁 +9999 龔 +10000 龘 +10001 龢 +10002 冚 +10003 嗇 +10004 堓 +10005 壿 +10006 澼 +10007 炤 +10008 莸 +10009 輻 +10010 輾 +10011 轁 +10012 蒎 +10013 鼴 +10014 菪 +10015 萑 +10016 輋 +10017 軫 +10018 蓊 +10019 菸 +10020 齦 +10021 蒗 +10022 葚 +10023 齪 +10024 輗 +10025 葸 +10026 蔸 +10027 蔹 +10028 輜 +10029 輞 +10030 葺 +10031 輟 +10032 縵 +10033 儂 +10034 唞 +10035 妟 +10036 媧 +10037 瀦 +10038 鱖 +10039 礪 +10040 俍 +10041 嘮 +10042 怹 +10043 堽 +10044 嶠 +10045 橚 +10046 汘 +10047 勣 +10048 堔 +10049 搠 +10050 迀 +10051 薷 +10052 辿 +10053 薅 +10054 蕞 +10055 轡 +10056 迏 +10057 迵 +10058 蕹 +10059 廾 +10060 辀 +10061 掮 +10062 揲 +10063 揸 +10064 揠 +10065 辡 +10066 轍 +10067 尥 +10068 摅 +10069 搛 +10070 搋 +10071 轘 +10072 搡 +10073 摞 +10074 摭 +10075 揶 +10076 鳡 +10077 笭 +10078 厾 +10079 棤 +10080 湢 +10081 犠 +10082 牠 +10083 牁 +10084 牂 +10085 燼 +10086 犼 +10087 燜 +10088 爊 +10089 爍 +10090 爕 +10091 犧 +10092 碻 +10093 碸 +10094 磣 +10095 碷 +10096 磠 +10097 磢 +10098 碄 +10099 碶 +10100 磩 +10101 磪 +10102 磭 +10103 磰 +10104 磱 +10105 磳 +10106 磵 +10107 磶 +10108 磹 +10109 磼 +10110 磿 +10111 礀 +10112 礂 +10113 礃 +10114 礄 +10115 礆 +10116 礇 +10117 礈 +10118 礉 +10119 礊 +10120 礋 +10121 ` +10122 ^ +10123 ╜ +10124 ╚ +10125 ▇ +10126 ㄗ +10127 ▅ +10128 ǖ +10129 ╛ +10130 ǚ +10131 ǜ +10132 ɑ +10133 ╗ +10134 ╙ +10135 ▄ +10136 ▆ +10137 ǘ +10138 ˊ +10139 ╘ +10140 █ +10141 ▉ +10142 ▊ +10143 ▋ +10144 ▌ +10145 ▍ +10146 ▎ +10147 ▏ +10148 ▓ +10149 ▔ +10150 ▼ +10151 ◢ +10152 ◣ +10153 ◤ +10154 ◥ +10155 ☉ +10156 ⊕ +10157 〒 +10158 〝 +10159 〞 +10160 ~ +10161 鞔 +10162 鱜 +10163 鱚 +10164 鱺 +10165 鱛 +10166 鱙 +10167 鱹 +10168 鞫 +10169 鰼 +10170 鱻 +10171 鱽 +10172 鱾 +10173 鲄 +10174 鲌 +10175 鲓 +10176 鲗 +10177 鲝 +10178 鲪 +10179 鲬 +10180 鲿 +10181 鳁 +10182 鳂 +10183 鳈 +10184 鳒 +10185 鳚 +10186 鳛 +10187 譤 +10188 讄 +10189 譥 +10190 譡 +10191 譢 +10192 讉 +10193 讋 +10194 讌 +10195 讑 +10196 讒 +10197 讔 +10198 讕 +10199 讙 +10200 讛 +10201 讜 +10202 讞 +10203 讟 +10204 
讬 +10205 讱 +10206 讻 +10207 诇 +10208 诐 +10209 诪 +10210 谉 +10211 < +10212 = +10213 > +10214 琡 +10215 琟 +10216 瑍 +10217 琠 +10218 琜 +10219 琞 +10220 瑊 +10221 瑌 +10222 珸 +10223 琝 +10224 瑎 +10225 瑏 +10226 瑐 +10227 瑑 +10228 瑓 +10229 瑔 +10230 瑖 +10231 瑘 +10232 瑡 +10233 瑥 +10234 瑦 +10235 瑬 +10236 瑲 +10237 瑳 +10238 瑴 +10239 瑵 +10240 瑹 +10241 | +10242 Υ +10243 Θ +10244 ψ +10245 Μ +10246 Ζ +10247 Π +10248 Φ +10249 Ο +10250 Ν +10251 Α +10252 Ψ +10253 Ω +10254 Λ +10255 Η +10256 Χ +10257 Ι +10258 Ξ +10259 Δ +10260 Β +10261 Γ +10262 Ε +10263 Ρ +10264 癪 +10265 皘 +10266 癧 +10267 癅 +10268 癨 +10269 皝 +10270 皟 +10271 皠 +10272 皡 +10273 皢 +10274 皣 +10275 皥 +10276 皧 +10277 皨 +10278 皪 +10279 皫 +10280 皬 +10281 皭 +10282 皯 +10283 皳 +10284 皵 +10285 皶 +10286 皷 +10287 皹 +10288 皻 +10289 皼 +10290 皽 +10291 皾 +10292 盀 +10293 礰 +10294 礮 +10295 祣 +10296 礫 +10297 礭 +10298 祡 +10299 礍 +10300 祦 +10301 祩 +10302 祪 +10303 祫 +10304 祬 +10305 祮 +10306 祰 +10307 祱 +10308 祲 +10309 祳 +10310 祴 +10311 祵 +10312 祶 +10313 祹 +10314 祻 +10315 祽 +10316 祾 +10317 禂 +10318 禃 +10319 禆 +10320 禈 +10321 禉 +10322 禋 +10323 禌 +10324 禐 +10325 禑 +10326 _ +10327 痐 +10328 瘇 +10329 痏 +10330 痆 +10331 痌 +10332 瘂 +10333 疈 +10334 瘎 +10335 瘏 +10336 瘑 +10337 瘒 +10338 瘔 +10339 瘖 +10340 瘚 +10341 瘜 +10342 瘝 +10343 瘞 +10344 瘬 +10345 瘮 +10346 瘯 +10347 瘲 +10348 瘶 +10349 瘷 +10350 瘹 +10351 瘻 +10352 瘽 +10353 癁 +10354 璥 +10355 瓇 +10356 瓅 +10357 璤 +10358 璢 +10359 瑻 +10360 璡 +10361 瓈 +10362 瓉 +10363 瓌 +10364 瓍 +10365 瓎 +10366 瓐 +10367 瓑 +10368 瓓 +10369 瓕 +10370 瓖 +10371 瓙 +10372 瓝 +10373 瓡 +10374 瓥 +10375 瓧 +10376 瓨 +10377 瓩 +10378 瓪 +10379 瓫 +10380 瓬 +10381 瓭 +10382 瓱 +10383 - +10384 , +10385 ; +10386 : +10387 ! +10388 〈 +10389 〃 +10390 △ +10391 ∽ +10392 ‖ +10393 ˇ +10394 “ +10395 〉 +10396 ’ +10397 … +10398   +10399 】 +10400 》 +10401 「 +10402 ~ +10403 』 +10404 ¨ +10405 《 +10406 ‘ +10407 · +10408 、 +10409 。 +10410 ˉ +10411 ” +10412 ? 
+10413 縙 +10414 縚 +10415 縘 +10416 縶 +10417 縸 +10418 縗 +10419 縺 +10420 縼 +10421 縿 +10422 繀 +10423 繂 +10424 繈 +10425 繉 +10426 繊 +10427 繋 +10428 繌 +10429 繍 +10430 繎 +10431 繏 +10432 繐 +10433 繑 +10434 繒 +10435 繓 +10436 繖 +10437 繗 +10438 繘 +10439 繙 +10440 繛 +10441 繜 +10442 / +10443 穈 +10444 穅 +10445 穨 +10446 穦 +10447 穄 +10448 穧 +10449 稝 +10450 穃 +10451 穪 +10452 穬 +10453 穭 +10454 穮 +10455 穱 +10456 穲 +10457 穵 +10458 穻 +10459 穼 +10460 穽 +10461 穾 +10462 窂 +10463 窇 +10464 窉 +10465 窋 +10466 窌 +10467 窎 +10468 窏 +10469 窐 +10470 窔 +10471 窙 +10472 窚 +10473 窛 +10474 窞 +10475 窡 +10476 竈 +10477 竳 +10478 竉 +10479 竲 +10480 竆 +10481 竴 +10482 竵 +10483 竷 +10484 竸 +10485 竻 +10486 竼 +10487 竾 +10488 笀 +10489 笁 +10490 笂 +10491 笅 +10492 笇 +10493 笌 +10494 笍 +10495 笎 +10496 笐 +10497 笓 +10498 笖 +10499 笗 +10500 笘 +10501 笚 +10502 笜 +10503 笝 +10504 笟 +10505 笡 +10506 笢 +10507 笧 +10508 笩 +10509 ' +10510 " +10511 珇 +10512 珆 +10513 珋 +10514 珒 +10515 珓 +10516 珔 +10517 珕 +10518 珗 +10519 珘 +10520 珚 +10521 珜 +10522 珟 +10523 珡 +10524 珢 +10525 珤 +10526 珦 +10527 珨 +10528 珫 +10529 珬 +10530 珯 +10531 珳 +10532 珴 +10533 珶 +10534 粇 +10535 粅 +10536 籣 +10537 粆 +10538 粈 +10539 粊 +10540 粋 +10541 粌 +10542 粍 +10543 粎 +10544 粏 +10545 粐 +10546 粓 +10547 粔 +10548 粖 +10549 粚 +10550 粛 +10551 粠 +10552 粡 +10553 粣 +10554 粦 +10555 粫 +10556 粭 +10557 粯 +10558 粰 +10559 粴 +10560 粶 +10561 粷 +10562 粸 +10563 粺 +10564 ( +10565 ) +10566 [ +10567 ] +10568 { +10569 } +10570 Ж +10571 К +10572 Е +10573 У +10574 Н +10575 А +10576 Х +10577 Ц +10578 Й +10579 Щ +10580 Ё +10581 Ф +10582 З +10583 М +10584 Г +10585 В +10586 Д +10587 П +10588 禴 +10589 秪 +10590 秥 +10591 禰 +10592 秢 +10593 秨 +10594 禓 +10595 秮 +10596 秱 +10597 秲 +10598 秳 +10599 秴 +10600 秵 +10601 秶 +10602 秷 +10603 秹 +10604 秺 +10605 秼 +10606 秿 +10607 稁 +10608 稄 +10609 稉 +10610 稊 +10611 稌 +10612 稏 +10613 稐 +10614 稑 +10615 稒 +10616 稓 +10617 稕 +10618 稖 +10619 稘 +10620 稙 +10621 稛 +10622 ┐ +10623 ┄ +10624 ﹡ +10625 ┈ +10626 ﹟ +10627 │ +10628 ┌ +10629 ┑ +10630 ┋ +10631 ┉ +10632 ┓ +10633 └ +10634 ┇ +10635 ﹞ +10636 ﹠ +10637 ┒ +10638 ┅ +10639 ┊ +10640 〡 +10641 ─ +10642 ‐ +10643 ┍ +10644 ﹢ +10645 ﹣ +10646 ﹤ +10647 ﹥ +10648 ﹦ +10649 ﹨ +10650 ﹩ +10651 ﹪ +10652 ﹫ +10653 〇 +10654 甞 +10655 畘 +10656 畖 +10657 甠 +10658 甗 +10659 甝 +10660 畕 +10661 畗 +10662 甛 +10663 畞 +10664 畟 +10665 畠 +10666 畡 +10667 畣 +10668 畧 +10669 畨 +10670 畩 +10671 畮 +10672 畱 +10673 畳 +10674 畵 +10675 畷 +10676 畺 +10677 畻 +10678 畼 +10679 畽 +10680 畾 +10681 疀 +10682 疁 +10683 疂 +10684 疄 +10685 @ +10686 ぅ +10687 ⒋ +10688 ⅷ +10689 Ⅶ +10690 ⒆ +10691 ⅵ +10692 ⒌ +10693 ⅰ +10694 ⒖ +10695 ⒎ +10696 ⒏ +10697 ⒒ +10698 ⅶ +10699 ⒍ +10700 ⅸ +10701 ⅳ +10702 ⅱ +10703 ⅲ +10704 ⅴ +10705 ⒈ +10706 ( +10707 W +10708 , +10709 & +10710 5 +10711 / +10712 - +10713 ! +10714 ? +10715 + +10716 ; +10717 ' +10718 ) +10719 . 
+10720 ¥ +10721 " +10722 # +10723 % +10724 ァ +10725 ェ +10726 ォ +10727 \ +10728 盽 +10729 盺 +10730 眫 +10731 盻 +10732 盵 +10733 眥 +10734 眪 +10735 盄 +10736 眮 +10737 眰 +10738 眱 +10739 眲 +10740 眳 +10741 眹 +10742 眻 +10743 眽 +10744 眿 +10745 睂 +10746 睄 +10747 睅 +10748 睆 +10749 睈 +10750 睉 +10751 睊 +10752 睋 +10753 睌 +10754 睍 +10755 睎 +10756 睓 +10757 睔 +10758 睕 +10759 睖 +10760 睗 +10761 睘 +10762 睙 +10763  +10764  +10765  +10766  +10767  +10768  +10769  +10770 +10771 +10772  +10773  +10774  +10775  +10776  +10777  +10778  +10779  +10780  +10781  +10782  +10783  +10784  +10785  +10786  +10787  +10788  +10789  +10790  +10791 伌 +10792 乣 +10793 乛 +10794 仺 +10795 伂 +10796 仸 +10797 伆 +10798 乢 +10799 伅 +10800 伃 +10801 仭 +10802 伩 +10803 伔 +10804 伀 +10805 乕 +10806 亄 +10807 仹 +10808 伓 +10809 仼 +10810 伄 +10811 丂 +10812 仯 +10813 仴 +10814 乗 +10815 伇 +10816 亐 +10817 亖 +10818 亗 +10819 亙 +10820 亝 +10821 亣 +10822 亪 +10823 亯 +10824 亰 +10825 亱 +10826 亴 +10827 亷 +10828 亸 +10829 亼 +10830 亽 +10831 亾 +10832 仈 +10833 仌 +10834 仒 +10835 仛 +10836 仜 +10837 仠 +10838 仢 +10839 仦 +10840 仧 +10841 俙 +10842 俕 +10843 傋 +10844 倈 +10845 偊 +10846 偘 +10847 偟 +10848 俖 +10849 偗 +10850 偔 +10851 偂 +10852 偪 +10853 偡 +10854 偢 +10855 偒 +10856 偦 +10857 俒 +10858 俔 +10859 倇 +10860 偋 +10861 偠 +10862 偐 +10863 偖 +10864 侤 +10865 偆 +10866 偄 +10867 偅 +10868 俓 +10869 偙 +10870 倎 +10871 倐 +10872 倛 +10873 倝 +10874 倠 +10875 倢 +10876 倣 +10877 倯 +10878 倰 +10879 倱 +10880 倲 +10881 倳 +10882 倷 +10883 倸 +10884 倹 +10885 倽 +10886 倿 +10887 偀 +10888 僠 +10889 儴 +10890 凎 +10891 冏 +10892 儸 +10893 儼 +10894 兊 +10895 僟 +10896 儻 +10897 儹 +10898 儭 +10899 兛 +10900 兎 +10901 兏 +10902 兓 +10903 僛 +10904 儃 +10905 儅 +10906 儳 +10907 儵 +10908 儺 +10909 傽 +10910 儰 +10911 儮 +10912 儯 +10913 儱 +10914 儽 +10915 儊 +10916 儌 +10917 儍 +10918 儎 +10919 儏 +10920 儐 +10921 儑 +10922 儓 +10923 儕 +10924 儖 +10925 儗 +10926 儙 +10927 儛 +10928 儜 +10929 儞 +10930 儠 +10931 儢 +10932 儣 +10933 儤 +10934 儥 +10935 儦 +10936 儧 +10937 儨 +10938 儩 +10939 儫 +10940 劥 +10941 刞 +10942 刕 +10943 剘 +10944 匃 +10945 劕 +10946 剕 +10947 劙 +10948 劦 +10949 刜 +10950 劘 +10951 劖 +10952 効 +10953 劮 +10954 劯 +10955 劔 +10956 刐 +10957 刔 +10958 剓 +10959 剗 +10960 劎 +10961 劧 +10962 劗 +10963 凘 +10964 劋 +10965 刓 +10966 劚 +10967 剙 +10968 剚 +10969 剟 +10970 剠 +10971 剢 +10972 剣 +10973 剤 +10974 剦 +10975 剨 +10976 剫 +10977 剬 +10978 剭 +10979 剮 +10980 剰 +10981 剱 +10982 剳 +10983 剴 +10984 剶 +10985 剷 +10986 剸 +10987 剹 +10988 剻 +10989 剼 +10990 剾 +10991 劀 +10992 劄 +10993 劅 +10994 叴 +10995 卄 +10996 叏 +10997 厏 +10998 咓 +10999 呑 +11000 叕 +11001 厊 +11002 叞 +11003 叺 +11004 卂 +11005 叝 +11006 叚 +11007 叀 +11008 吙 +11009 叿 +11010 吀 +11011 叓 +11012 吇 +11013 匸 +11014 匽 +11015 厈 +11016 厎 +11017 収 +11018 叾 +11019 叐 +11020 叜 +11021 匑 +11022 叅 +11023 叄 +11024 匼 +11025 厐 +11026 厑 +11027 厒 +11028 厓 +11029 厖 +11030 厗 +11031 厛 +11032 厜 +11033 厞 +11034 厡 +11035 厤 +11036 厧 +11037 厪 +11038 厫 +11039 厬 +11040 厯 +11041 厰 +11042 厱 +11043 厵 +11044 厸 +11045 厹 +11046 厺 +11047 厼 +11048 厽 +11049 哷 +11050 哵 +11051 唦 +11052 嗺 +11053 喿 +11054 啲 +11055 啨 +11056 啺 +11057 喌 +11058 哶 +11059 啹 +11060 啳 +11061 喎 +11062 喐 +11063 喕 +11064 哰 +11065 哴 +11066 唟 +11067 唥 +11068 啩 +11069 喍 +11070 啯 +11071 啴 +11072 咢 +11073 啢 +11074 啠 +11075 哱 +11076 啽 +11077 唨 +11078 唩 +11079 唫 +11080 唭 +11081 唲 +11082 唴 +11083 唵 +11084 唶 +11085 唹 +11086 唺 +11087 唻 +11088 唽 +11089 啀 +11090 啂 +11091 啅 +11092 啇 +11093 啋 +11094 啌 +11095 啍 +11096 啎 +11097 啑 +11098 啒 +11099 啔 +11100 啗 +11101 啘 +11102 啙 +11103 啚 +11104 啛 +11105 嚧 +11106 嘸 +11107 嘵 +11108 嚚 +11109 噡 +11110 囎 +11111 嚞 +11112 噟 +11113 嚘 +11114 嘷 +11115 嚡 +11116 嚳 +11117 嚪 
+11118 嚫 +11119 嚝 +11120 嚙 +11121 嚩 +11122 嚛 +11123 嚠 +11124 嚖 +11125 嚔 +11126 嚗 +11127 嚤 +11128 噣 +11129 噥 +11130 噦 +11131 噧 +11132 噮 +11133 噰 +11134 噳 +11135 噷 +11136 噺 +11137 噾 +11138 噿 +11139 嚁 +11140 嚂 +11141 嚃 +11142 嚄 +11143 嚈 +11144 嚉 +11145 嚊 +11146 嚌 +11147 埓 +11148 坄 +11149 坁 +11150 埁 +11151 垀 +11152 堶 +11153 坾 +11154 埌 +11155 埖 +11156 坃 +11157 埊 +11158 垺 +11159 埧 +11160 埛 +11161 埜 +11162 埢 +11163 圼 +11164 圿 +11165 坽 +11166 坿 +11167 埀 +11168 埄 +11169 埉 +11170 垽 +11171 垼 +11172 圽 +11173 埍 +11174 垁 +11175 垇 +11176 垉 +11177 垊 +11178 垍 +11179 垎 +11180 垐 +11181 垑 +11182 垔 +11183 垖 +11184 垗 +11185 垘 +11186 垙 +11187 垜 +11188 垥 +11189 垨 +11190 垪 +11191 垬 +11192 垯 +11193 垰 +11194 垶 +11195 垷 +11196 壌 +11197 塦 +11198 塣 +11199 墌 +11200 壸 +11201 壃 +11202 壈 +11203 壍 +11204 壄 +11205 墶 +11206 壙 +11207 壐 +11208 壂 +11209 壔 +11210 塠 +11211 墈 +11212 墋 +11213 墽 +11214 壎 +11215 墿 +11216 堾 +11217 墹 +11218 墷 +11219 墸 +11220 墺 +11221 塡 +11222 壉 +11223 墍 +11224 墏 +11225 墑 +11226 墔 +11227 墕 +11228 墖 +11229 増 +11230 墛 +11231 墝 +11232 墠 +11233 墡 +11234 墢 +11235 墣 +11236 墤 +11237 墥 +11238 墦 +11239 墧 +11240 墪 +11241 墫 +11242 墬 +11243 墭 +11244 墯 +11245 墰 +11246 墱 +11247 墲 +11248 墴 +11249 姶 +11250 奰 +11251 姩 +11252 妦 +11253 婘 +11254 妡 +11255 姲 +11256 姷 +11257 奯 +11258 姱 +11259 姯 +11260 姟 +11261 娍 +11262 姺 +11263 姼 +11264 姭 +11265 奫 +11266 妢 +11267 姧 +11268 姰 +11269 夽 +11270 姢 +11271 姠 +11272 姡 +11273 姤 +11274 姳 +11275 妬 +11276 妭 +11277 妰 +11278 妱 +11279 妳 +11280 妴 +11281 妵 +11282 妷 +11283 妸 +11284 妼 +11285 妽 +11286 妿 +11287 姁 +11288 姂 +11289 姃 +11290 姄 +11291 姅 +11292 姇 +11293 姌 +11294 姎 +11295 姏 +11296 姕 +11297 姖 +11298 姙 +11299 姛 +11300 嫶 +11301 媊 +11302 媈 +11303 嫧 +11304 媬 +11305 嬜 +11306 嫭 +11307 媩 +11308 嫷 +11309 媉 +11310 嫮 +11311 嫛 +11312 嬁 +11313 嫹 +11314 嫺 +11315 嫬 +11316 媅 +11317 媇 +11318 媫 +11319 嫥 +11320 嫸 +11321 嫨 +11322 嫯 +11323 婡 +11324 嫟 +11325 嫝 +11326 嫞 +11327 嫢 +11328 媆 +11329 嫳 +11330 媮 +11331 媯 +11332 媰 +11333 媴 +11334 媶 +11335 媷 +11336 媹 +11337 媺 +11338 媻 +11339 媼 +11340 媿 +11341 嫅 +11342 嫇 +11343 嫈 +11344 嫍 +11345 嫎 +11346 嫐 +11347 嫓 +11348 嫗 +11349 宆 +11350 尐 +11351 寏 +11352 岟 +11353 屪 +11354 尙 +11355 寍 +11356 尠 +11357 尩 +11358 宊 +11359 尟 +11360 尛 +11361 尶 +11362 尫 +11363 尭 +11364 尗 +11365 尰 +11366 宂 +11367 寋 +11368 寎 +11369 尒 +11370 尞 +11371 孈 +11372 尌 +11373 尡 +11374 寑 +11375 寕 +11376 寖 +11377 寘 +11378 寙 +11379 寚 +11380 寛 +11381 寜 +11382 寠 +11383 寣 +11384 寪 +11385 寱 +11386 寲 +11387 寴 +11388 寷 +11389 寽 +11390 尀 +11391 嵈 +11392 峖 +11393 崹 +11394 嵶 +11395 崿 +11396 峾 +11397 崷 +11398 嵃 +11399 嵉 +11400 峗 +11401 嵀 +11402 崱 +11403 嵖 +11404 嵎 +11405 嵏 +11406 峓 +11407 峕 +11408 峿 +11409 崸 +11410 嵍 +11411 崺 +11412 嵁 +11413 岪 +11414 崵 +11415 崲 +11416 崳 +11417 崶 +11418 峔 +11419 嵄 +11420 崄 +11421 崅 +11422 崉 +11423 崊 +11424 崋 +11425 崌 +11426 崏 +11427 崑 +11428 崒 +11429 崓 +11430 崕 +11431 崘 +11432 崙 +11433 崜 +11434 崝 +11435 崠 +11436 崡 +11437 崣 +11438 崥 +11439 崨 +11440 崪 +11441 崫 +11442 崬 +11443 崯 +11444 幋 +11445 帹 +11446 巭 +11447 庽 +11448 巪 +11449 帵 +11450 幇 +11451 幆 +11452 幁 +11453 幙 +11454 幏 +11455 幐 +11456 帿 +11457 幓 +11458 嶿 +11459 巤 +11460 巬 +11461 幎 +11462 帺 +11463 幃 +11464 嶡 +11465 帲 +11466 帴 +11467 嶾 +11468 幈 +11469 巰 +11470 巵 +11471 巶 +11472 巺 +11473 巼 +11474 帄 +11475 帇 +11476 帉 +11477 帊 +11478 帋 +11479 帍 +11480 帎 +11481 帒 +11482 帓 +11483 帗 +11484 帞 +11485 帟 +11486 帠 +11487 帢 +11488 帣 +11489 帤 +11490 帨 +11491 帪 +11492 彺 +11493 彣 +11494 弤 +11495 忳 +11496 徸 +11497 彟 +11498 彴 +11499 彽 +11500 廮 +11501 彲 +11502 彮 +11503 徔 +11504 徃 +11505 彨 +11506 徎 +11507 廩 +11508 弡 +11509 弣 +11510 彠 +11511 彾 +11512 廆 
+11513 彜 +11514 彚 +11515 彛 +11516 彞 +11517 廫 +11518 彵 +11519 弨 +11520 弫 +11521 弬 +11522 弮 +11523 弰 +11524 弲 +11525 弳 +11526 弴 +11527 弸 +11528 弻 +11529 弽 +11530 弾 +11531 弿 +11532 彁 +11533 彂 +11534 彃 +11535 彄 +11536 彅 +11537 彆 +11538 彇 +11539 彉 +11540 彋 +11541 彍 +11542 彑 +11543 惔 +11544 恅 +11545 恀 +11546 惃 +11547 悀 +11548 愾 +11549 愖 +11550 惉 +11551 恷 +11552 惁 +11553 惏 +11554 惖 +11555 恄 +11556 惎 +11557 惌 +11558 悺 +11559 惪 +11560 惛 +11561 惈 +11562 怺 +11563 怾 +11564 恾 +11565 惂 +11566 惗 +11567 惄 +11568 惍 +11569 怈 +11570 悿 +11571 悾 +11572 惀 +11573 怽 +11574 惐 +11575 悁 +11576 悂 +11577 悆 +11578 悇 +11579 悈 +11580 悊 +11581 悋 +11582 悎 +11583 悏 +11584 悐 +11585 悑 +11586 悓 +11587 悕 +11588 悗 +11589 悘 +11590 悙 +11591 悜 +11592 悞 +11593 悡 +11594 悢 +11595 悤 +11596 悥 +11597 悮 +11598 悳 +11599 悷 +11600 懘 +11601 慲 +11602 慯 +11603 懆 +11604 憕 +11605 戺 +11606 懽 +11607 懍 +11608 憒 +11609 懄 +11610 懓 +11611 懙 +11612 慱 +11613 懐 +11614 懎 +11615 憽 +11616 懣 +11617 懛 +11618 懜 +11619 懌 +11620 懟 +11621 憓 +11622 懅 +11623 懚 +11624 懏 +11625 懁 +11626 憿 +11627 懀 +11628 懃 +11629 慭 +11630 懕 +11631 憖 +11632 憗 +11633 憘 +11634 憛 +11635 憜 +11636 憞 +11637 憟 +11638 憠 +11639 憡 +11640 憢 +11641 憥 +11642 憦 +11643 憪 +11644 憮 +11645 憯 +11646 憰 +11647 憱 +11648 憳 +11649 憴 +11650 憵 +11651 憸 +11652 憹 +11653 憺 +11654 憻 +11655 抈 +11656 抆 +11657 挩 +11658 拁 +11659 捵 +11660 挰 +11661 抾 +11662 挦 +11663 挵 +11664 挼 +11665 抇 +11666 挴 +11667 挱 +11668 挕 +11669 捒 +11670 挿 +11671 捀 +11672 挮 +11673 捇 +11674 抂 +11675 抅 +11676 抺 +11677 拀 +11678 挧 +11679 挬 +11680 挳 +11681 扏 +11682 挙 +11683 挗 +11684 挘 +11685 挜 +11686 挶 +11687 拃 +11688 拑 +11689 拕 +11690 拞 +11691 拡 +11692 拪 +11693 拰 +11694 拲 +11695 拹 +11696 拺 +11697 拻 +11698 挀 +11699 挃 +11700 挄 +11701 挅 +11702 挆 +11703 挊 +11704 挋 +11705 挍 +11706 挒 +11707 揯 +11708 摠 +11709 搤 +11710 擏 +11711 撟 +11712 摤 +11713 搢 +11714 摝 +11715 摪 +11716 摰 +11717 揰 +11718 摨 +11719 摥 +11720 摗 +11721 摲 +11722 摣 +11723 摶 +11724 揫 +11725 搟 +11726 搣 +11727 摟 +11728 摱 +11729 摡 +11730 摦 +11731 揁 +11732 摛 +11733 摙 +11734 摚 +11735 揬 +11736 搧 +11737 搨 +11738 搩 +11739 搫 +11740 搮 +11741 搯 +11742 搰 +11743 搱 +11744 搲 +11745 搳 +11746 搵 +11747 搷 +11748 搸 +11749 搹 +11750 搻 +11751 搼 +11752 摀 +11753 摂 +11754 摃 +11755 摉 +11756 摋 +11757 摌 +11758 摍 +11759 摎 +11760 摏 +11761 摐 +11762 摓 +11763 摕 +11764 敶 +11765 擿 +11766 擽 +11767 敤 +11768 旝 +11769 斪 +11770 敩 +11771 攟 +11772 敠 +11773 敯 +11774 敮 +11775 敪 +11776 敺 +11777 敨 +11778 敾 +11779 攞 +11780 攠 +11781 敡 +11782 敹 +11783 敥 +11784 敭 +11785 擛 +11786 敜 +11787 敚 +11788 敟 +11789 擻 +11790 敱 +11791 攦 +11792 攧 +11793 攨 +11794 攩 +11795 攭 +11796 攰 +11797 攷 +11798 攺 +11799 攼 +11800 敀 +11801 敁 +11802 敂 +11803 敃 +11804 敄 +11805 敆 +11806 敇 +11807 敊 +11808 敋 +11809 敍 +11810 敎 +11811 敒 +11812 敓 +11813 暣 +11814 昤 +11815 昢 +11816 暔 +11817 晘 +11818 曶 +11819 暚 +11820 晐 +11821 暒 +11822 暟 +11823 暤 +11824 昣 +11825 暞 +11826 暛 +11827 暋 +11828 暩 +11829 暙 +11830 暬 +11831 昜 +11832 昡 +11833 晎 +11834 晑 +11835 暓 +11836 暥 +11837 暕 +11838 暜 +11839 旲 +11840 暏 +11841 暍 +11842 暎 +11843 晜 +11844 晠 +11845 晢 +11846 晣 +11847 晥 +11848 晩 +11849 晪 +11850 晬 +11851 晱 +11852 晲 +11853 晵 +11854 晹 +11855 晻 +11856 晼 +11857 晽 +11858 晿 +11859 暀 +11860 暃 +11861 暆 +11862 柎 +11863 朻 +11864 朸 +11865 枿 +11866 杶 +11867 桏 +11868 栕 +11869 柆 +11870 枽 +11871 柕 +11872 朹 +11873 柉 +11874 柇 +11875 柨 +11876 柗 +11877 柛 +11878 柣 +11879 朳 +11880 朷 +11881 杮 +11882 杴 +11883 枾 +11884 柖 +11885 柀 +11886 朄 +11887 枻 +11888 枼 +11889 柋 +11890 杸 +11891 杹 +11892 杻 +11893 杽 +11894 枀 +11895 枃 +11896 枅 +11897 枆 +11898 枈 +11899 枍 +11900 枎 +11901 枏 +11902 枑 +11903 枔 +11904 枖 +11905 枛 +11906 枠 +11907 枡 
+11908 枤 +11909 枩 +11910 枬 +11911 枮 +11912 棿 +11913 梎 +11914 梌 +11915 棬 +11916 梸 +11917 椬 +11918 棳 +11919 棪 +11920 椀 +11921 梍 +11922 棷 +11923 棴 +11924 棥 +11925 椃 +11926 椄 +11927 椈 +11928 梉 +11929 梴 +11930 梷 +11931 椂 +11932 棭 +11933 棶 +11934 棦 +11935 棩 +11936 梊 +11937 梹 +11938 梺 +11939 梻 +11940 梽 +11941 梿 +11942 棁 +11943 棃 +11944 棆 +11945 棇 +11946 棈 +11947 棊 +11948 棌 +11949 棎 +11950 棏 +11951 棐 +11952 棑 +11953 棓 +11954 棔 +11955 棖 +11956 棙 +11957 棛 +11958 棜 +11959 棝 +11960 棞 +11961 棡 +11962 棢 +11963 槾 +11964 榒 +11965 榐 +11966 槰 +11967 榽 +11968 橑 +11969 樧 +11970 槵 +11971 榺 +11972 槮 +11973 槹 +11974 樀 +11975 榑 +11976 槸 +11977 槶 +11978 槨 +11979 樃 +11980 槴 +11981 樆 +11982 榌 +11983 榏 +11984 榹 +11985 榼 +11986 槯 +11987 槷 +11988 楡 +11989 槫 +11990 槩 +11991 槪 +11992 槬 +11993 榾 +11994 槂 +11995 槄 +11996 槅 +11997 槆 +11998 槒 +11999 槕 +12000 槗 +12001 槙 +12002 槚 +12003 槞 +12004 槡 +12005 槢 +12006 槣 +12007 槤 +12008 槥 +12009 槦 +12010 櫒 +12011 欦 +12012 櫖 +12013 檤 +12014 櫐 +12015 櫟 +12016 櫙 +12017 櫗 +12018 櫋 +12019 櫩 +12020 櫡 +12021 櫢 +12022 櫕 +12023 橾 +12024 檣 +12025 檥 +12026 櫑 +12027 櫠 +12028 櫓 +12029 櫘 +12030 橜 +12031 櫎 +12032 櫍 +12033 櫏 +12034 橽 +12035 櫛 +12036 檧 +12037 檨 +12038 檪 +12039 檭 +12040 檮 +12041 檰 +12042 檱 +12043 檲 +12044 檴 +12045 檶 +12046 檷 +12047 檹 +12048 檺 +12049 檼 +12050 檽 +12051 檾 +12052 檿 +12053 櫀 +12054 櫁 +12055 櫂 +12056 櫄 +12057 櫅 +12058 櫉 +12059 毚 +12060 歗 +12061 毃 +12062 殈 +12063 汍 +12064 毈 +12065 殀 +12066 殾 +12067 毜 +12068 歘 +12069 毌 +12070 毉 +12071 殹 +12072 毧 +12073 毞 +12074 毟 +12075 毇 +12076 毣 +12077 歖 +12078 歿 +12079 殅 +12080 毝 +12081 毊 +12082 欯 +12083 殻 +12084 殽 +12085 歕 +12086 殌 +12087 殎 +12088 殏 +12089 殐 +12090 殑 +12091 殕 +12092 殗 +12093 殙 +12094 殜 +12095 殝 +12096 殞 +12097 殟 +12098 殠 +12099 殣 +12100 殥 +12101 殦 +12102 殧 +12103 殨 +12104 殩 +12105 殫 +12106 殬 +12107 殮 +12108 殰 +12109 殱 +12110 殶 +12111 洿 +12112 沗 +12113 沕 +12114 涽 +12115 涀 +12116 洭 +12117 浀 +12118 洯 +12119 洝 +12120 浄 +12121 洬 +12122 浕 +12123 沎 +12124 泏 +12125 泒 +12126 洤 +12127 浂 +12128 洡 +12129 洟 +12130 洠 +12131 沑 +12132 洷 +12133 泙 +12134 泜 +12135 泝 +12136 泟 +12137 泤 +12138 泦 +12139 泧 +12140 泩 +12141 泬 +12142 泭 +12143 泲 +12144 泴 +12145 泹 +12146 泿 +12147 洀 +12148 洂 +12149 洃 +12150 洅 +12151 洆 +12152 洉 +12153 洊 +12154 洍 +12155 洏 +12156 洐 +12157 洓 +12158 洔 +12159 洕 +12160 洖 +12161 洘 +12162 湸 +12163 渀 +12164 淾 +12165 湪 +12166 渵 +12167 滣 +12168 溩 +12169 渱 +12170 湨 +12171 湹 +12172 淿 +12173 湱 +12174 湣 +12175 溈 +12176 湻 +12177 湼 +12178 溁 +12179 淽 +12180 渰 +12181 渳 +12182 湩 +12183 湬 +12184 淍 +12185 湦 +12186 湤 +12187 湥 +12188 湵 +12189 渶 +12190 渷 +12191 渹 +12192 渻 +12193 渽 +12194 渿 +12195 湀 +12196 湁 +12197 湂 +12198 湅 +12199 湆 +12200 湇 +12201 湈 +12202 湌 +12203 湏 +12204 湐 +12205 湑 +12206 湒 +12207 湕 +12208 湗 +12209 湝 +12210 湠 +12211 湡 +12212 澊 +12213 漙 +12214 潹 +12215 濛 +12216 澴 +12217 潀 +12218 潶 +12219 澃 +12220 澋 +12221 漘 +12222 澘 +12223 澐 +12224 澑 +12225 潾 +12226 漑 +12227 潷 +12228 澏 +12229 潻 +12230 澁 +12231 滰 +12232 潳 +12233 潱 +12234 潵 +12235 漒 +12236 澅 +12237 潃 +12238 潅 +12239 潈 +12240 潉 +12241 潊 +12242 潌 +12243 潎 +12244 潐 +12245 潒 +12246 潓 +12247 潕 +12248 潖 +12249 潗 +12250 潙 +12251 潚 +12252 潝 +12253 潠 +12254 潡 +12255 潣 +12256 潥 +12257 潧 +12258 潨 +12259 潩 +12260 潪 +12261 潫 +12262 瀪 +12263 烑 +12264 灛 +12265 灠 +12266 灥 +12267 瀇 +12268 灟 +12269 灜 +12270 灐 +12271 灴 +12272 灧 +12273 灨 +12274 灚 +12275 灮 +12276 灖 +12277 灦 +12278 濦 +12279 灒 +12280 灔 +12281 瀄 +12282 灡 +12283 瀫 +12284 瀭 +12285 瀯 +12286 瀱 +12287 瀲 +12288 瀳 +12289 瀴 +12290 瀶 +12291 瀷 +12292 瀸 +12293 瀺 +12294 瀻 +12295 瀽 +12296 瀿 +12297 灀 +12298 灁 +12299 灂 +12300 灅 +12301 灆 +12302 灇 
+12303 灈 +12304 灉 +12305 灊 +12306 灋 +12307 灍 +12308 煷 +12309 焋 +12310 焇 +12311 焴 +12312 燋 +12313 熥 +12314 焲 +12315 煱 +12316 煹 +12317 煰 +12318 煭 +12319 煛 +12320 熆 +12321 煼 +12322 煫 +12323 焄 +12324 焆 +12325 焮 +12326 焳 +12327 煣 +12328 煻 +12329 煯 +12330 煠 +12331 煝 +12332 煟 +12333 焅 +12334 煴 +12335 焵 +12336 焷 +12337 焸 +12338 焹 +12339 焺 +12340 焻 +12341 焽 +12342 焾 +12343 焿 +12344 煀 +12345 煁 +12346 煂 +12347 煃 +12348 煄 +12349 煆 +12350 煇 +12351 煈 +12352 煋 +12353 煍 +12354 煏 +12355 煐 +12356 煑 +12357 煔 +12358 煕 +12359 煗 +12360 煘 +12361 〖 +12362 Ъ +12363 ┘ +12364 ⒓ +12365 < +12366 伡 +12367 偧 +12368 劶 +12369 吋 +12370 喖 +12371 埣 +12372 壖 +12373 娂 +12374 嫾 +12375 尲 +12376 嵓 +12377 幖 +12378 徏 +12379 懠 +12380 捈 +12381 摷 +12382 敿 +12383 暭 +12384 柤 +12385 椉 +12386 樇 +12387 櫦 +12388 毤 +12389 浖 +12390 溂 +12391 澕 +12392 灱 +12393 熂 +12394 紎 +12395 糭 +12396 紏 +12397 糪 +12398 紑 +12399 紒 +12400 紕 +12401 紖 +12402 紝 +12403 紞 +12404 紟 +12405 紣 +12406 紤 +12407 紥 +12408 紦 +12409 紨 +12410 紩 +12411 紪 +12412 紬 +12413 紭 +12414 紱 +12415 紲 +12416 紴 +12417 紵 +12418 〗 +12419 Ы +12420 ┙ +12421 ⒔ +12422 = +12423 伣 +12424 偨 +12425 兘 +12426 劷 +12427 吔 +12428 喗 +12429 嚱 +12430 埥 +12431 壗 +12432 娊 +12433 嫿 +12434 尳 +12435 嵔 +12436 惤 +12437 懡 +12438 捊 +12439 斀 +12440 暯 +12441 柦 +12442 椊 +12443 樈 +12444 櫧 +12445 浗 +12446 溄 +12447 澖 +12448 灲 +12449 絗 +12450 絴 +12451 絖 +12452 絒 +12453 紷 +12454 絓 +12455 絸 +12456 絺 +12457 絻 +12458 絼 +12459 絽 +12460 絾 +12461 絿 +12462 綀 +12463 綂 +12464 綃 +12465 綄 +12466 綅 +12467 綆 +12468 綇 +12469 綈 +12470 綊 +12471 綌 +12472 綍 +12473 綐 +12474 綒 +12475 綔 +12476 綕 +12477 綖 +12478 綗 +12479 【 +12480 Ь +12481 ┚ +12482 ⒕ +12483 > +12484 伨 +12485 偩 +12486 兙 +12487 劸 +12488 吘 +12489 嚲 +12490 埦 +12491 娋 +12492 嬀 +12493 尵 +12494 嵕 +12495 幘 +12496 従 +12497 惥 +12498 懢 +12499 捑 +12500 摼 +12501 斁 +12502 暰 +12503 柧 +12504 椌 +12505 櫨 +12506 毦 +12507 浘 +12508 灳 +12509 熅 +12510 綹 +12511 緗 +12512 綶 +12513 緖 +12514 緘 +12515 綷 +12516 緛 +12517 緜 +12518 緟 +12519 緡 +12520 緢 +12521 緤 +12522 緥 +12523 緦 +12524 緧 +12525 緪 +12526 緫 +12527 緭 +12528 緮 +12529 緰 +12530 緳 +12531 緵 +12532 緶 +12533 緷 +12534 緸 +12535 絘 +12536 綼 +12537 糱 +12538 糂 +12539 絙 +12540 綛 +12541 糲 +12542 糃 +12543 絚 +12544 糳 +12545 糄 +12546 絛 +12547 紻 +12548 糴 +12549 糆 +12550 紼 +12551 緀 +12552 綞 +12553 糵 +12554 糉 +12555 絝 +12556 紽 +12557 緁 +12558 綟 +12559 糶 +12560 紾 +12561 緂 +12562 糷 +12563 糎 +12564 絟 +12565 紿 +12566 緃 +12567 綡 +12568 糹 +12569 絠 +12570 絀 +12571 緄 +12572 糺 +12573 糐 +12574 絁 +12575 緅 +12576 糼 +12577 糑 +12578 緆 +12579 綤 +12580 糽 +12581 糒 +12582 絣 +12583 緇 +12584 糓 +12585 絤 +12586 糿 +12587 糔 +12588 絅 +12589 緉 +12590 綨 +12591 糘 +12592 綩 +12593 紁 +12594 糚 +12595 絧 +12596 絇 +12597 綪 +12598 糛 +12599 絈 +12600 緌 +12601 紃 +12602 糝 +12603 絩 +12604 絉 +12605 緍 +12606 絊 +12607 緎 +12608 糡 +12609 絫 +12610 総 +12611 綯 +12612 紆 +12613 糢 +12614 絬 +12615 緐 +12616 紇 +12617 糣 +12618 絭 +12619 絍 +12620 糤 +12621 絯 +12622 糥 +12623 絰 +12624 絏 +12625 緓 +12626 綳 +12627 糦 +12628 緔 +12629 紌 +12630 絑 +12631 綵 +12632 紶 +12633 綘 +12634 緺 +12635 」 +12636 Ч +12637 ┕ +12638 ⒐ +12639 伖 +12640 偣 +12641 劰 +12642 吂 +12643 喒 +12644 嚬 +12645 埞 +12646 壒 +12647 尮 +12648 幑 +12649 徆 +12650 惞 +12651 懝 +12652 捁 +12653 摴 +12654 敼 +12655 椆 +12656 樄 +12657 櫣 +12658 毠 +12659 浌 +12660 湽 +12661 澒 +12662 灩 +12663 煿 +12664 筦 +12665 筤 +12666 箌 +12667 筥 +12668 筟 +12669 筣 +12670 箎 +12671 笯 +12672 筡 +12673 箑 +12674 箒 +12675 箘 +12676 箙 +12677 箚 +12678 箛 +12679 箞 +12680 箟 +12681 箣 +12682 箤 +12683 箥 +12684 箮 +12685 箯 +12686 箰 +12687 箲 +12688 箳 +12689 箵 +12690 箶 +12691 箷 +12692 箹 +12693 箺 +12694 箻 +12695 箼 +12696 箽 +12697 箾 
+12698 箿 +12699 篂 +12700 篃 +12701 笰 +12702 筨 +12703 笲 +12704 筩 +12705 笴 +12706 筪 +12707 笵 +12708 筫 +12709 笶 +12710 筬 +12711 笷 +12712 筭 +12713 笻 +12714 筰 +12715 笽 +12716 筳 +12717 笿 +12718 筴 +12719 筀 +12720 筁 +12721 筸 +12722 筂 +12723 筺 +12724 筃 +12725 筄 +12726 筽 +12727 筿 +12728 筈 +12729 箁 +12730 筊 +12731 箂 +12732 箃 +12733 筎 +12734 箄 +12735 筓 +12736 箆 +12737 筕 +12738 箇 +12739 筗 +12740 箈 +12741 筙 +12742 箉 +12743 箊 +12744 筞 +12745 Σ +12746 〔 +12747 Р +12748 ┎ +12749 ⒉ +12750 伈 +12751 偛 +12752 儾 +12753 劜 +12754 啿 +12755 埐 +12756 壊 +12757 姴 +12758 嫴 +12759 尣 +12760 幉 +12761 彶 +12762 惒 +12763 懖 +12764 挷 +12765 摬 +12766 敳 +12767 暡 +12768 柌 +12769 棽 +12770 櫜 +12771 湶 +12772 灢 +12773 煵 +12774 瞏 +12775 瞊 +12776 瞇 +12777 瞉 +12778 瞷 +12779 瞹 +12780 瞈 +12781 矀 +12782 矁 +12783 矂 +12784 矃 +12785 矄 +12786 矆 +12787 矉 +12788 矊 +12789 矋 +12790 矌 +12791 矐 +12792 矑 +12793 矒 +12794 矓 +12795 矔 +12796 矕 +12797 矘 +12798 矙 +12799 矝 +12800 矞 +12801 矟 +12802 矠 +12803 矡 +12804 瞓 +12805 睟 +12806 睠 +12807 睤 +12808 瞖 +12809 睧 +12810 瞗 +12811 睩 +12812 睪 +12813 瞙 +12814 睭 +12815 瞚 +12816 睮 +12817 瞛 +12818 睯 +12819 瞜 +12820 睰 +12821 瞝 +12822 睱 +12823 睲 +12824 瞡 +12825 睳 +12826 睴 +12827 睵 +12828 瞦 +12829 睶 +12830 瞨 +12831 睷 +12832 瞫 +12833 睸 +12834 瞭 +12835 瞮 +12836 睻 +12837 瞯 +12838 睼 +12839 瞱 +12840 瞲 +12841 瞂 +12842 瞴 +12843 瞃 +12844 瞶 +12845 瞆 +12846 矤 +12847 鞒 +12848 Τ +12849 〕 +12850 ┏ +12851 ⒊ +12852 偝 +12853 兂 +12854 劤 +12855 叧 +12856 喅 +12857 嚦 +12858 埑 +12859 壋 +12860 尦 +12861 嵆 +12862 幊 +12863 彸 +12864 惓 +12865 懗 +12866 挸 +12867 摮 +12868 柍 +12869 棾 +12870 櫝 +12871 毘 +12872 湷 +12873 澇 +12874 煶 +12875 砠 +12876 硘 +12877 砡 +12878 砙 +12879 砞 +12880 硔 +12881 硙 +12882 矦 +12883 砛 +12884 硛 +12885 硜 +12886 硠 +12887 硡 +12888 硢 +12889 硣 +12890 硥 +12891 硦 +12892 硧 +12893 硩 +12894 硰 +12895 硲 +12896 硳 +12897 硴 +12898 硶 +12899 硸 +12900 硹 +12901 硺 +12902 硽 +12903 硾 +12904 硿 +12905 碀 +12906 碂 +12907 砤 +12908 矨 +12909 砨 +12910 砪 +12911 砫 +12912 砮 +12913 矱 +12914 砯 +12915 矲 +12916 砱 +12917 矴 +12918 矵 +12919 矷 +12920 砵 +12921 矹 +12922 砶 +12923 矺 +12924 砽 +12925 砿 +12926 矼 +12927 硁 +12928 砃 +12929 硂 +12930 砄 +12931 砅 +12932 硄 +12933 砆 +12934 硆 +12935 砇 +12936 硈 +12937 砈 +12938 硉 +12939 砊 +12940 硊 +12941 砋 +12942 硋 +12943 砎 +12944 硍 +12945 砏 +12946 硏 +12947 砐 +12948 硑 +12949 砓 +12950 硓 +12951 碃 +12952 ╝ +12953 鱝 +12954 譨 +12955 琣 +12956 礱 +12957 痑 +12958 璦 +12959 穉 +12960 竌 +12961 玜 +12962 籥 +12963 禷 +12964 ゛ +12965 乤 +12966 俛 +12967 僡 +12968 刟 +12969 卆 +12970 哸 +12971 坅 +12972 塧 +12973 奱 +12974 媋 +12975 宎 +12976 峚 +12977 廰 +12978 慳 +12979 抋 +12980 揳 +12981 攁 +12982 昦 +12983 朼 +12984 梐 +12985 榓 +12986 檃 +12987 歛 +12988 沘 +12989 渁 +12990 漚 +12991 ˋ +12992 鰽 +12993 譇 +12994 疉 +12995 緼 +12996 稟 +12997 獳 +12998 籄 +12999 〢 +13000 瓵 +13001 盇 +13002 丄 +13003 侫 +13004 凙 +13005 咥 +13006 嘇 +13007 婣 +13008 孉 +13009 岮 +13010 嶢 +13011 廇 +13012 怉 +13013 慉 +13014 扐 +13015 揂 +13016 旳 +13017 桝 +13018 橝 +13019 欰 +13020 汚 +13021 淎 +13022 滱 +13023 濧 +13024 鳘 +13025 Κ +13026 — +13027 ┆ +13028 ⅹ +13029 * +13030 仾 +13031 偑 +13032 儶 +13033 劒 +13034 叒 +13035 埅 +13036 壀 +13037 崻 +13038 帾 +13039 挭 +13040 摢 +13041 柂 +13042 棯 +13043 槳 +13044 櫔 +13045 湭 +13046 灙 +13047 煪 +13048 猔 +13049 猑 +13050 獈 +13051 獆 +13052 猒 +13053 猍 +13054 狜 +13055 猏 +13056 獉 +13057 獊 +13058 獋 +13059 獌 +13060 獏 +13061 獓 +13062 獔 +13063 獕 +13064 獖 +13065 獘 +13066 獙 +13067 獚 +13068 獛 +13069 獜 +13070 獝 +13071 獞 +13072 獟 +13073 獡 +13074 獤 +13075 獥 +13076 獧 +13077 獩 +13078 獪 +13079 獫 +13080 獮 +13081 獰 +13082 ㄡ +13083 ︶ +13084 ♂ +13085 ┽ +13086 ⑨ +13087 佱 +13088 傖 +13089 冡 +13090 勧 +13091 呩 +13092 囜 
+13093 堘 +13094 夅 +13095 娽 +13096 嬦 +13097 屷 +13098 嶀 +13099 忈 +13100 愥 +13101 戓 +13102 掅 +13103 斸 +13104 栣 +13105 樶 +13106 欋 +13107 氠 +13108 涐 +13109 炨 +13110 醏 +13111 醊 +13112 醻 +13113 岈 +13114 醸 +13115 醎 +13116 醄 +13117 醈 +13118 醹 +13119 岍 +13120 醆 +13121 醼 +13122 釂 +13123 釃 +13124 釅 +13125 釒 +13126 釓 +13127 釔 +13128 釕 +13129 釖 +13130 釙 +13131 釚 +13132 釛 +13133 釟 +13134 釠 +13135 釡 +13136 釢 +13137 釤 +13138 α +13139 × +13140 ┝ +13141 ⒘ +13142 A +13143 吜 +13144 喠 +13145 嚵 +13146 埩 +13147 壛 +13148 娏 +13149 嬃 +13150 嵙 +13151 幜 +13152 徚 +13153 惲 +13154 懥 +13155 捔 +13156 摿 +13157 斄 +13158 柫 +13159 椓 +13160 樍 +13161 櫫 +13162 毩 +13163 溋 +13164 澚 +13165 灹 +13166 羆 +13167 羄 +13168 羭 +13169 羀 +13170 羃 +13171 羮 +13172 罖 +13173 羂 +13174 羴 +13175 羵 +13176 羺 +13177 羻 +13178 翂 +13179 翄 +13180 翆 +13181 翈 +13182 翉 +13183 翋 +13184 翍 +13185 翏 +13186 翐 +13187 翑 +13188 翓 +13189 翖 +13190 翗 +13191 翙 +13192 翜 +13193 翝 +13194 翞 +13195 翢 +13196 ㄠ +13197 ︵ +13198 ∴ +13199 ┼ +13200 ⑧ +13201 ` +13202 佮 +13203 冟 +13204 勦 +13205 呧 +13206 嗋 +13207 囙 +13208 堗 +13209 夃 +13210 娻 +13211 嬥 +13212 屶 +13213 嵿 +13214 庎 +13215 忇 +13216 戉 +13217 掄 +13218 撪 +13219 椸 +13220 樴 +13221 氞 +13222 溹 +13223 澿 +13224 炧 +13225 熰 +13226 郶 +13227 鄜 +13228 郲 +13229 鄛 +13230 郂 +13231 郳 +13232 鄝 +13233 鄠 +13234 鄨 +13235 鄩 +13236 鄪 +13237 鄫 +13238 鄮 +13239 鄳 +13240 鄵 +13241 鄶 +13242 鄷 +13243 鄸 +13244 鄺 +13245 鄼 +13246 鄽 +13247 鄾 +13248 鄿 +13249 酂 +13250 ɡ +13251 ± +13252 Ю +13253 ├ +13254 ⒗ +13255 @ +13256 伬 +13257 偫 +13258 劺 +13259 吚 +13260 喞 +13261 埨 +13262 壚 +13263 娎 +13264 嬂 +13265 嵗 +13266 幚 +13267 徖 +13268 捓 +13269 摾 +13270 暲 +13271 柪 +13272 樌 +13273 櫪 +13274 毨 +13275 浝 +13276 溊 +13277 澙 +13278 灷 +13279 纞 +13280 繻 +13281 纚 +13282 繺 +13283 纴 +13284 纼 +13285 绖 +13286 绤 +13287 绬 +13288 绹 +13289 缐 +13290 缞 +13291 缹 +13292 缻 +13293 缼 +13294 缽 +13295 缾 +13296 缿 +13297 罀 +13298 罁 +13299 罃 +13300 罆 +13301 罇 +13302 罉 +13303 罊 +13304 罋 +13305 罎 +13306 罏 +13307 罒 +13308 ㄢ +13309 ︹ +13310 ♀ +13311 р +13312 ┾ +13313 ⑩ +13314 傗 +13315 冣 +13316 勨 +13317 嗏 +13318 堚 +13319 娾 +13320 嬧 +13321 屸 +13322 庘 +13323 忊 +13324 愨 +13325 戔 +13326 掆 +13327 撯 +13328 斺 +13329 曗 +13330 栤 +13331 椻 +13332 樷 +13333 欌 +13334 涒 +13335 溾 +13336 獯 +13337 鈤 +13338 怊 +13339 鈢 +13340 鈅 +13341 鈁 +13342 鈃 +13343 猱 +13344 釦 +13345 猓 +13346 鈂 +13347 鈥 +13348 鈧 +13349 鈨 +13350 鈩 +13351 鈪 +13352 鈫 +13353 鈭 +13354 鈮 +13355 鈯 +13356 鈰 +13357 鈱 +13358 鈲 +13359 鈳 +13360 鈵 +13361 鈶 +13362 鈸 +13363 鈹 +13364 鈻 +13365 鈼 +13366 鈽 +13367 鈾 +13368 鉁 +13369 鉂 +13370 鉃 +13371 ÷ +13372 ┞ +13373 ぢ +13374 ⒙ +13375 B +13376 ヂ +13377 甭 +13378 伮 +13379 偮 +13380 兟 +13381 喡 +13382 嚶 +13383 壜 +13384 娐 +13385 嬄 +13386 幝 +13387 惵 +13388 懧 +13389 捖 +13390 撀 +13391 斅 +13392 暵 +13393 柭 +13394 椔 +13395 樎 +13396 櫬 +13397 溌 +13398 澛 +13399 熉 +13400 耟 +13401 耝 +13402 聕 +13403 耞 +13404 耓 +13405 耛 +13406 翤 +13407 聙 +13408 聛 +13409 聜 +13410 聝 +13411 聟 +13412 聠 +13413 聡 +13414 聢 +13415 聣 +13416 聥 +13417 聦 +13418 聧 +13419 聨 +13420 聫 +13421 聬 +13422 聭 +13423 聮 +13424 聵 +13425 聸 +13426 聹 +13427 聺 +13428 聻 +13429 聼 +13430 ㄥ +13431 ﹀ +13432 ″ +13433 ╁ +13434 ㈠ +13435 佸 +13436 勫 +13437 呭 +13438 嗗 +13439 夊 +13440 婂 +13441 屽 +13442 嶅 +13443 庡 +13444 忓 +13445 愬 +13446 戝 +13447 掑 +13448 斿 +13449 曞 +13450 栧 +13451 楀 +13452 炲 +13453 熷 +13454 錪 +13455 鍉 +13456 鬻 +13457 鍇 +13458 瀵 +13459 錧 +13460 鍆 +13461 鍈 +13462 錊 +13463 鍌 +13464 鍏 +13465 鍐 +13466 鍑 +13467 鍒 +13468 鍓 +13469 鍕 +13470 鍗 +13471 鍚 +13472 鍜 +13473 鍝 +13474 鍞 +13475 鍟 +13476 鍠 +13477 鍡 +13478 鍣 +13479 鍤 +13480 鍦 +13481 鍧 +13482 鍨 +13483 鍩 +13484 ㄅ +13485 ε +13486 ∨ +13487 ┡ 
+13488 ⑴ +13489 E +13490 ヅ +13491 伵 +13492 吪 +13493 喤 +13494 埮 +13495 娕 +13496 嵟 +13497 徟 +13498 懪 +13499 捙 +13500 撆 +13501 暸 +13502 椗 +13503 櫯 +13504 毰 +13505 溑 +13506 澟 +13507 熍 +13508 臽 +13509 臹 +13510 舼 +13511 臸 +13512 臷 +13513 艀 +13514 艁 +13515 艃 +13516 艈 +13517 艌 +13518 艐 +13519 艑 +13520 艓 +13521 艔 +13522 艕 +13523 艗 +13524 艛 +13525 艝 +13526 艞 +13527 艤 +13528 艧 +13529 ㄤ +13530 ︿ +13531 ′ +13532 ╀ +13533 佷 +13534 勪 +13535 呬 +13536 嗕 +13537 囦 +13538 堜 +13539 嬩 +13540 屼 +13541 忎 +13542 愪 +13543 戜 +13544 掍 +13545 斾 +13546 栦 +13547 椾 +13548 欎 +13549 涗 +13550 滀 +13551 濅 +13552 炰 +13553 熶 +13554 鋊 +13555 鋨 +13556 鋦 +13557 鋉 +13558 鋄 +13559 鋥 +13560 鋩 +13561 鋫 +13562 鋬 +13563 鋮 +13564 鋱 +13565 鋲 +13566 鋳 +13567 鋷 +13568 鋺 +13569 鋻 +13570 鋽 +13571 鋾 +13572 鋿 +13573 錀 +13574 錁 +13575 錂 +13576 錅 +13577 錆 +13578 錇 +13579 錈 +13580 δ +13581 ∧ +13582 ┠ +13583 ⒛ +13584 D +13585 伳 +13586 勀 +13587 吥 +13588 喣 +13589 嚹 +13590 埬 +13591 娔 +13592 嬆 +13593 屇 +13594 嵞 +13595 幠 +13596 惸 +13597 懩 +13598 捘 +13599 斈 +13600 暷 +13601 柲 +13602 椖 +13603 毮 +13604 浤 +13605 溎 +13606 澞 +13607 熌 +13608 腵 +13609 腲 +13610 膥 +13611 膢 +13612 腯 +13613 膡 +13614 膤 +13615 腬 +13616 膧 +13617 膫 +13618 膬 +13619 膮 +13620 膯 +13621 膰 +13622 膲 +13623 膴 +13624 膵 +13625 膶 +13626 膷 +13627 膸 +13628 膹 +13629 膼 +13630 膾 +13631 臄 +13632 臅 +13633 臈 +13634 臋 +13635 臐 +13636 臒 +13637 ㄣ +13638 ︺ +13639 ° +13640 ┿ +13641 ゃ +13642 冦 +13643 勩 +13644 呫 +13645 嗐 +13646 囥 +13647 堛 +13648 夈 +13649 嬨 +13650 屻 +13651 嶃 +13652 庛 +13653 忋 +13654 愩 +13655 戙 +13656 掋 +13657 撱 +13658 斻 +13659 曘 +13660 栥 +13661 椼 +13662 欍 +13663 氥 +13664 涖 +13665 溿 +13666 濄 +13667 炪 +13668 熴 +13669 鉦 +13670 銃 +13671 鉥 +13672 鉡 +13673 銄 +13674 鉢 +13675 銇 +13676 銈 +13677 銉 +13678 銊 +13679 銋 +13680 銏 +13681 銕 +13682 銗 +13683 銙 +13684 銛 +13685 銝 +13686 銞 +13687 銟 +13688 銠 +13689 銡 +13690 銣 +13691 銤 +13692 銦 +13693 ∶ +13694 ┟ +13695 ⒚ +13696 C +13697 伱 +13698 偯 +13699 兠 +13700 劽 +13701 喢 +13702 嚸 +13703 埫 +13704 壝 +13705 嵜 +13706 徝 +13707 惷 +13708 懨 +13709 撁 +13710 斆 +13711 暶 +13712 柮 +13713 椕 +13714 樏 +13715 櫭 +13716 毭 +13717 浢 +13718 澝 +13719 灻 +13720 熋 +13721 胉 +13722 胇 +13723 脋 +13724 胈 +13725 肹 +13726 胅 +13727 肻 +13728 脌 +13729 脕 +13730 脗 +13731 脙 +13732 脜 +13733 脝 +13734 脟 +13735 脠 +13736 脡 +13737 脢 +13738 脤 +13739 脥 +13740 脦 +13741 脧 +13742 脨 +13743 脪 +13744 脭 +13745 脮 +13746 脰 +13747 脴 +13748 脵 +13749 脺 +13750 脻 +13751 脼 +13752 脽 +13753 饧 +13754 饩 +13755 宀 +13756 猘 +13757 狝 +13758 醓 +13759 酇 +13760 羇 +13761 罙 +13762 郺 +13763 繟 +13764 嗬 +13765 鈇 +13766 耡 +13767 翧 +13768 猹 +13769 忄 +13770 饫 +13771 忮 +13772 錋 +13773 臿 +13774 鋋 +13775 銩 +13776 腶 +13777 腁 +13778 溻 +13779 滗 +13780 鉧 +13781 肁 +13782 懔 +13783 汔 +13784 彐 +13785 猙 +13786 狟 +13787 醔 +13788 酈 +13789 罛 +13790 郆 +13791 纀 +13792 繠 +13793 鈈 +13794 釨 +13795 翨 +13796 錬 +13797 舃 +13798 臖 +13799 鋌 +13800 銪 +13801 腷 +13802 腂 +13803 鉈 +13804 胋 +13805 肂 +13806 猚 +13807 狢 +13808 醕 +13809 酑 +13810 羉 +13811 罜 +13812 郼 +13813 郈 +13814 釩 +13815 耤 +13816 錭 +13817 錍 +13818 臗 +13819 鋍 +13820 銫 +13821 腃 +13822 鉩 +13823 胏 +13824 猟 +13825 狣 +13826 醖 +13827 酓 +13828 羋 +13829 郿 +13830 郉 +13831 纃 +13832 繢 +13833 釪 +13834 翫 +13835 錎 +13836 屦 +13837 膁 +13838 鉊 +13839 胐 +13840 肈 +13841 猠 +13842 狤 +13843 ㄦ +13844 ︽ +13845 ℃ +13846 ф +13847 ╂ +13848 ㈡ +13849 佹 +13850 冩 +13851 勬 +13852 呮 +13853 嗘 +13854 囨 +13855 堟 +13856 夋 +13857 婃 +13858 嬫 +13859 嶆 +13860 庢 +13861 愭 +13862 戞 +13863 掓 +13864 撴 +13865 旀 +13866 曟 +13867 栨 +13868 楁 +13869 樻 +13870 欐 +13871 氭 +13872 濇 +13873 炴 +13874 熸 +13875 鎩 +13876 鎋 +13877 鎨 +13878 鍬 +13879 鎈 +13880 姹 +13881 鎯 +13882 鎰 
+13883 鎱 +13884 鎲 +13885 鎳 +13886 鎴 +13887 鎵 +13888 鎶 +13889 鎷 +13890 鎸 +13891 鎹 +13892 鎺 +13893 鎻 +13894 鎼 +13895 鎽 +13896 鎾 +13897 鎿 +13898 鏀 +13899 鏁 +13900 鏂 +13901 鏄 +13902 鏅 +13903 鏆 +13904 鏇 +13905 鏉 +13906 鏋 +13907 鏌 +13908 ζ +13909 ∑ +13910 ┢ +13911 ⑵ +13912 F +13913 兤 +13914 勂 +13915 喥 +13916 嚻 +13917 埰 +13918 壠 +13919 娖 +13920 嵠 +13921 幤 +13922 惼 +13923 捚 +13924 斊 +13925 柶 +13926 椘 +13927 樒 +13928 櫰 +13929 毱 +13930 浧 +13931 溒 +13932 炂 +13933 熎 +13934 苸 +13935 苵 +13936 芲 +13937 苶 +13938 芢 +13939 苿 +13940 茊 +13941 茍 +13942 茐 +13943 茒 +13944 茓 +13945 茘 +13946 茙 +13947 茝 +13948 茞 +13949 茟 +13950 茠 +13951 茡 +13952 茢 +13953 茣 +13954 茥 +13955 茦 +13956 茩 +13957 茮 +13958 茰 +13959 茷 +13960 茻 +13961 醗 +13962 酔 +13963 羍 +13964 罞 +13965 崾 +13966 郋 +13967 纄 +13968 繣 +13969 嗍 +13970 釫 +13971 耬 +13972 馊 +13973 錏 +13974 臙 +13975 鋏 +13976 銭 +13977 腅 +13978 鉫 +13979 鉋 +13980 胑 +13981 肊 +13982 闶 +13983 鎍 +13984 鍭 +13985 艫 +13986 驵 +13987 芺 +13988 鎐 +13989 艭 +13990 鎑 +13991 鍰 +13992 芼 +13993 鎒 +13994 鍱 +13995 芿 +13996 艵 +13997 鎓 +13998 苀 +13999 鎔 +14000 鍳 +14001 苂 +14002 鎕 +14003 鍴 +14004 艸 +14005 苅 +14006 艻 +14007 鎗 +14008 鍶 +14009 苆 +14010 艼 +14011 鎘 +14012 鍷 +14013 苉 +14014 芀 +14015 鎙 +14016 鍸 +14017 苐 +14018 芁 +14019 鎚 +14020 鍹 +14021 苖 +14022 鎛 +14023 苙 +14024 芅 +14025 嫜 +14026 鎜 +14027 鍻 +14028 苚 +14029 芆 +14030 嬖 +14031 鎝 +14032 苝 +14033 芇 +14034 鎞 +14035 鍽 +14036 芉 +14037 鎟 +14038 芌 +14039 鎠 +14040 鍿 +14041 芐 +14042 鎀 +14043 苩 +14044 芔 +14045 纟 +14046 孥 +14047 苬 +14048 芕 +14049 鎃 +14050 芖 +14051 鎄 +14052 苮 +14053 芚 +14054 鎦 +14055 鎅 +14056 苰 +14057 苲 +14058 芞 +14059 鏍 +14060 茽 +14061 猣 +14062 狥 +14063 醘 +14064 酕 +14065 羏 +14066 鄁 +14067 纅 +14068 繤 +14069 鈌 +14070 釬 +14071 耭 +14072 翭 +14073 鋐 +14074 銯 +14075 膄 +14076 腇 +14077 鉌 +14078 胒 +14079 肍 +14080 猤 +14081 狦 +14082 醙 +14083 酖 +14084 羐 +14085 罣 +14086 郍 +14087 纆 +14088 耮 +14089 錱 +14090 錑 +14091 舋 +14092 臛 +14093 銰 +14094 膅 +14095 腉 +14096 鉭 +14097 鉍 +14098 胓 +14099 肎 +14100 猦 +14101 狧 +14102 酘 +14103 羑 +14104 鄅 +14105 纇 +14106 繦 +14107 釮 +14108 耯 +14109 翲 +14110 錒 +14111 銱 +14112 膆 +14113 腍 +14114 鉮 +14115 鉎 +14116 胔 +14117 肏 +14118 猧 +14119 狪 +14120 醝 +14121 酙 +14122 羒 +14123 罥 +14124 郔 +14125 纈 +14126 繧 +14127 嘞 +14128 鈏 +14129 耰 +14130 翴 +14131 錓 +14132 舏 +14133 臝 +14134 屙 +14135 鋓 +14136 膇 +14137 漤 +14138 滹 +14139 鉯 +14140 鉏 +14141 胕 +14142 肐 +14143 猨 +14144 狫 +14145 酛 +14146 羓 +14147 罦 +14148 鄇 +14149 繨 +14150 鈐 +14151 釰 +14152 耲 +14153 翵 +14154 舑 +14155 臞 +14156 膉 +14157 腏 +14158 鉰 +14159 鉐 +14160 胘 +14161 肑 +14162 猭 +14163 狵 +14164 醟 +14165 酜 +14166 罧 +14167 鄈 +14168 郖 +14169 纊 +14170 釱 +14171 耴 +14172 翶 +14173 錵 +14174 錕 +14175 舓 +14176 鋕 +14177 銴 +14178 膋 +14179 腒 +14180 鉱 +14181 胟 +14182 肒 +14183 猯 +14184 狶 +14185 醠 +14186 酟 +14187 羖 +14188 罫 +14189 郘 +14190 纋 +14191 鈒 +14192 釲 +14193 耹 +14194 翷 +14195 錖 +14196 舕 +14197 鋖 +14198 膌 +14199 腖 +14200 鉲 +14201 鉒 +14202 肔 +14203 醡 +14204 酠 +14205 羗 +14206 罬 +14207 鄊 +14208 郙 +14209 鈓 +14210 耺 +14211 翸 +14212 錷 +14213 臡 +14214 鋗 +14215 膍 +14216 腗 +14217 鉳 +14218 鉓 +14219 胢 +14220 肕 +14221 猲 +14222 醤 +14223 酦 +14224 羘 +14225 罭 +14226 纍 +14227 繬 +14228 釴 +14229 耼 +14230 庋 +14231 錸 +14232 舗 +14233 鋘 +14234 膎 +14235 鉵 +14236 鉔 +14237 胣 +14238 肗 +14239 猳 +14240 狾 +14241 醥 +14242 酧 +14243 羙 +14244 罯 +14245 郞 +14246 纎 +14247 嘁 +14248 嘣 +14249 圊 +14250 耾 +14251 翺 +14252 夂 +14253 庳 +14254 錙 +14255 舘 +14256 臤 +14257 弪 +14258 艴 +14259 逭 +14260 耪 +14261 屮 +14262 鋙 +14263 膐 +14264 腛 +14265 胦 +14266 肙 +14267 阍 +14268 沲 +14269 猵 +14270 狿 +14271 醦 +14272 酨 +14273 羛 +14274 鄍 +14275 繮 +14276 鈖 +14277 聀 
+14278 翽 +14279 舙 +14280 鋚 +14281 膒 +14282 腜 +14283 鉖 +14284 胮 +14285 肞 +14286 猀 +14287 醧 +14288 酫 +14289 羜 +14290 鄎 +14291 郠 +14292 繯 +14293 鈗 +14294 釷 +14295 聁 +14296 錻 +14297 錛 +14298 舚 +14299 臦 +14300 鋛 +14301 銺 +14302 膓 +14303 腝 +14304 胵 +14305 肣 +14306 猺 +14307 猂 +14308 醨 +14309 酭 +14310 羠 +14311 鄏 +14312 郣 +14313 纑 +14314 繰 +14315 釸 +14316 聄 +14317 翿 +14318 錼 +14319 錜 +14320 舝 +14321 鋜 +14322 銻 +14323 膔 +14324 腞 +14325 鉹 +14326 胷 +14327 肦 +14328 猻 +14329 醩 +14330 酳 +14331 羢 +14332 罶 +14333 鄐 +14334 郤 +14335 纒 +14336 鈙 +14337 聅 +14338 耂 +14339 錽 +14340 錝 +14341 臩 +14342 鋝 +14343 膕 +14344 腟 +14345 鉺 +14346 鉙 +14347 胹 +14348 肧 +14349 猼 +14350 猅 +14351 鄑 +14352 郥 +14353 繲 +14354 鈚 +14355 釺 +14356 聇 +14357 耇 +14358 錿 +14359 錞 +14360 舤 +14361 臫 +14362 鋞 +14363 膖 +14364 腡 +14365 胻 +14366 肨 +14367 猽 +14368 猆 +14369 酻 +14370 罸 +14371 郩 +14372 纔 +14373 帱 +14374 鈛 +14375 聈 +14376 耈 +14377 廒 +14378 廛 +14379 鍀 +14380 錟 +14381 臮 +14382 鋟 +14383 銾 +14384 膗 +14385 腢 +14386 鉼 +14387 胾 +14388 肬 +14389 丬 +14390 獀 +14391 醰 +14392 酼 +14393 羦 +14394 罺 +14395 郪 +14396 纕 +14397 繴 +14398 鈜 +14399 釼 +14400 耉 +14401 舦 +14402 臯 +14403 鋠 +14404 銿 +14405 腣 +14406 鉽 +14407 胿 +14408 肰 +14409 獁 +14410 猈 +14411 醱 +14412 醀 +14413 罻 +14414 鄔 +14415 郬 +14416 繵 +14417 耊 +14418 鍂 +14419 舧 +14420 臰 +14421 鋡 +14422 鋀 +14423 腤 +14424 鉾 +14425 鉝 +14426 脀 +14427 肳 +14428 獂 +14429 猉 +14430 醲 +14431 罼 +14432 鄕 +14433 郮 +14434 纗 +14435 繶 +14436 釾 +14437 聏 +14438 耎 +14439 鍃 +14440 舩 +14441 臱 +14442 鋢 +14443 膞 +14444 鉿 +14445 脁 +14446 肵 +14447 獃 +14448 猋 +14449 醳 +14450 羪 +14451 罽 +14452 郰 +14453 纘 +14454 噍 +14455 鈟 +14456 釿 +14457 聐 +14458 耏 +14459 鍄 +14460 錣 +14461 舮 +14462 臲 +14463 鋣 +14464 鋂 +14465 膟 +14466 腨 +14467 鉟 +14468 脃 +14469 肶 +14470 猌 +14471 羫 +14472 罿 +14473 鄗 +14474 郱 +14475 纙 +14476 聑 +14477 耑 +14478 鍅 +14479 錤 +14480 臵 +14481 鋃 +14482 銁 +14483 鉠 +14484 翣 +14485 酄 +14486 罓 +14487 鍫 +14488 錉 +14489 臓 +14490 銧 +14491 脿 +14492 碽 +14493 ╞ +14494 癰 +14495 礲 +14496 痓 +14497 縝 +14498 穊 +14499 竍 +14500 玝 +14501 籦 +14502 禸 +14503 ゜ +14504 乥 +14505 僢 +14506 刡 +14507 哹 +14508 嘼 +14509 坆 +14510 塨 +14511 奲 +14512 宐 +14513 峛 +14514 巄 +14515 廱 +14516 恇 +14517 慴 +14518 抌 +14519 攂 +14520 昩 +14521 朾 +14522 梑 +14523 榖 +14524 檅 +14525 歜 +14526 漛 +14527 瀊 +14528 焍 +14529 碆 +14530 ˙ +14531 譈 +14532 礏 +14533 瑽 +14534 稡 +14535 窧 +14536 禕 +14537 〣 +14538 瓸 +14539 盉 +14540 侭 +14541 傿 +14542 凚 +14543 匓 +14544 咮 +14545 嘊 +14546 圔 +14547 塀 +14548 夿 +14549 婤 +14550 孊 +14551 嶣 +14552 怋 +14553 払 +14554 揃 +14555 擝 +14556 旴 +14557 朆 +14558 桞 +14559 楤 +14560 橞 +14561 欱 +14562 汢 +14563 烞 +14564 碿 +14565 ╟ +14566 鱟 +14567 譪 +14568 琧 +14569 痗 +14570 璫 +14571 穋 +14572 竎 +14573 籧 +14574 禼 +14575 ヽ +14576 甤 +14577 眂 +14578 乧 +14579 俢 +14580 僣 +14581 刢 +14582 卌 +14583 哻 +14584 嘽 +14585 坈 +14586 奵 +14587 媍 +14588 宑 +14589 峜 +14590 巆 +14591 廲 +14592 恈 +14593 抍 +14594 揷 +14595 攃 +14596 朿 +14597 梒 +14598 榗 +14599 檆 +14600 沜 +14601 漜 +14602 焎 +14603 碈 +14604 珻 +14605 癈 +14606 疌 +14607 緾 +14608 稢 +14609 禖 +14610 〤 +14611 盋 +14612 丆 +14613 侰 +14614 匔 +14615 咰 +14616 嘋 +14617 圕 +14618 塁 +14619 婥 +14620 孋 +14621 岰 +14622 嶤 +14623 怌 +14624 慍 +14625 扖 +14626 揅 +14627 朇 +14628 桟 +14629 楥 +14630 欳 +14631 汣 +14632 淐 +14633 烠 +14634 ㄧ +14635 ︾ +14636 $ +14637 ╃ +14638 ㈢ +14639 傜 +14640 勭 +14641 呯 +14642 嗙 +14643 囩 +14644 堢 +14645 夌 +14646 婄 +14647 嬬 +14648 庣 +14649 忕 +14650 愮 +14651 戠 +14652 掔 +14653 撶 +14654 楃 +14655 樼 +14656 欑 +14657 氱 +14658 涚 +14659 濈 +14660 炵 +14661 鏯 +14662 鐍 +14663 鐋 +14664 绨 +14665 鏮 +14666 鏪 +14667 鐊 +14668 鐌 +14669 缍 +14670 绠 +14671 鏫 +14672 鐎 
+14673 鐏 +14674 鐑 +14675 鐒 +14676 鐔 +14677 鐕 +14678 鐖 +14679 鐗 +14680 鐙 +14681 鐚 +14682 鐛 +14683 鐝 +14684 鐞 +14685 鐟 +14686 鐠 +14687 鐡 +14688 鐣 +14689 鐤 +14690 鐥 +14691 鐦 +14692 鐧 +14693 鐨 +14694 鐩 +14695 鐪 +14696 鐬 +14697 鐭 +14698 ㄇ +14699 η +14700 ∏ +14701 ┣ +14702 ⑶ +14703 G +14704 伹 +14705 偳 +14706 兦 +14707 勄 +14708 埱 +14709 壡 +14710 娗 +14711 嬊 +14712 屒 +14713 嵡 +14714 幥 +14715 徢 +14716 惽 +14717 捛 +14718 撉 +14719 椙 +14720 櫱 +14721 浨 +14722 澢 +14723 炃 +14724 熐 +14725 莁 +14726 荿 +14727 莮 +14728 莬 +14729 荹 +14730 莭 +14731 茾 +14732 荺 +14733 莯 +14734 莵 +14735 莻 +14736 莾 +14737 莿 +14738 菃 +14739 菄 +14740 菆 +14741 菎 +14742 菐 +14743 菑 +14744 菒 +14745 菕 +14746 菗 +14747 菙 +14748 菚 +14749 菛 +14750 菞 +14751 菢 +14752 菣 +14753 菤 +14754 菦 +14755 菧 +14756 菨 +14757 菬 +14758 鏰 +14759 莂 +14760 茿 +14761 绐 +14762 缋 +14763 缏 +14764 鏱 +14765 鏐 +14766 莃 +14767 荁 +14768 鏑 +14769 莄 +14770 荂 +14771 鏳 +14772 鏒 +14773 莇 +14774 鏴 +14775 莈 +14776 荅 +14777 鏵 +14778 荈 +14779 鏕 +14780 莋 +14781 鏷 +14782 莌 +14783 荋 +14784 鏸 +14785 莍 +14786 鏹 +14787 鏙 +14788 莏 +14789 荍 +14790 鏺 +14791 鏚 +14792 荎 +14793 鏛 +14794 荓 +14795 鏼 +14796 荕 +14797 鏝 +14798 莕 +14799 鏾 +14800 缲 +14801 莗 +14802 鐀 +14803 鏠 +14804 鐁 +14805 莚 +14806 荝 +14807 莝 +14808 荢 +14809 鐃 +14810 鏣 +14811 莟 +14812 荰 +14813 鐄 +14814 莡 +14815 荱 +14816 鐅 +14817 荲 +14818 鐆 +14819 鏦 +14820 莣 +14821 鏧 +14822 莤 +14823 荴 +14824 鏨 +14825 莥 +14826 荵 +14827 巛 +14828 鐉 +14829 鏩 +14830 荶 +14831 菭 +14832 磀 +14833 ╠ +14834 鱠 +14835 琩 +14836 璬 +14837 縟 +14838 竏 +14839 ヾ +14840 眃 +14841 乨 +14842 僤 +14843 刣 +14844 嘾 +14845 塪 +14846 媎 +14847 宒 +14848 峝 +14849 巇 +14850 恉 +14851 慸 +14852 抎 +14853 攄 +14854 杁 +14855 榙 +14856 歞 +14857 渄 +14858 漝 +14859 瀌 +14860 焏 +14861 ― +14862 譊 +14863 珼 +14864 癉 +14865 礑 +14866 璂 +14867 緿 +14868 稤 +14869 獶 +14870 禗 +14871 〥 +14872 盌 +14873 侱 +14874 僁 +14875 凞 +14876 匘 +14877 奃 +14878 孌 +14879 岲 +14880 嶥 +14881 怐 +14882 慏 +14883 扗 +14884 揇 +14885 朌 +14886 桪 +14887 楧 +14888 橠 +14889 欴 +14890 汥 +14891 滵 +14892 濪 +14893 烡 +14894 ︷ +14895 ○ +14896 ю +14897 ゐ +14898 ヰ +14899 侌 +14900 傪 +14901 凁 +14902 勷 +14903 咅 +14904 嗮 +14905 囸 +14906 堭 +14907 夝 +14908 岎 +14909 嶐 +14910 庰 +14911 忦 +14912 掟 +14913 擆 +14914 旔 +14915 曫 +14916 栶 +14917 楌 +14918 橉 +14919 欚 +14920 濔 +14921 烉 +14922 餪 +14923 饉 +14924 鹱 +14925 餧 +14926 饆 +14927 餈 +14928 餦 +14929 饊 +14930 饌 +14931 饍 +14932 饎 +14933 饏 +14934 饐 +14935 饘 +14936 饙 +14937 饚 +14938 饜 +14939 饟 +14940 饠 +14941 饡 +14942 饢 +14943 饦 +14944 饳 +14945 饻 +14946 馂 +14947 ㄐ +14948 ⌒ +14949 ┬ +14950 ⑿ +14951 P +14952 冃 +14953 勑 +14954 呅 +14955 埿 +14956 壭 +14957 娦 +14958 嬓 +14959 屝 +14960 嵭 +14961 幮 +14962 徯 +14963 愋 +14964 捫 +14965 撔 +14966 斝 +14967 曅 +14968 栃 +14969 櫺 +14970 毿 +14971 浶 +14972 溞 +14973 澬 +14974 炐 +14975 熜 +14976 衊 +14977 衈 +14978 衺 +14979 衉 +14980 衃 +14981 衇 +14982 衶 +14983 蠤 +14984 衻 +14985 衼 +14986 袀 +14987 袃 +14988 袇 +14989 袉 +14990 袊 +14991 袌 +14992 袎 +14993 袏 +14994 袐 +14995 袑 +14996 袓 +14997 袔 +14998 袕 +14999 袗 +15000 袘 +15001 袙 +15002 袚 +15003 袛 +15004 袝 +15005 袞 +15006 袟 +15007 袠 +15008 袡 +15009 袥 +15010 袦 +15011 袧 +15012 袨 +15013 袩 +15014 餉 +15015 衋 +15016 蠥 +15017 餬 +15018 蠦 +15019 餋 +15020 衏 +15021 衐 +15022 蠨 +15023 餎 +15024 衑 +15025 餱 +15026 蠪 +15027 餲 +15028 餑 +15029 蠫 +15030 餳 +15031 衕 +15032 衖 +15033 蠭 +15034 餵 +15035 餔 +15036 衘 +15037 蠮 +15038 餶 +15039 餕 +15040 衚 +15041 蠯 +15042 餷 +15043 餖 +15044 蠰 +15045 餗 +15046 衜 +15047 餹 +15048 蠳 +15049 瘃 +15050 餺 +15051 餙 +15052 衞 +15053 餻 +15054 衟 +15055 蠵 +15056 衠 +15057 餽 +15058 餜 +15059 衦 +15060 衧 +15061 蠸 +15062 餿 +15063 衪 +15064 蠺 +15065 饀 +15066 餟 +15067 衭 
+15068 疒 +15069 瘗 +15070 瘥 +15071 饁 +15072 餠 +15073 衯 +15074 蠽 +15075 衱 +15076 蠾 +15077 餢 +15078 衳 +15079 蠿 +15080 衴 +15081 衁 +15082 餤 +15083 衵 +15084 衂 +15085 馉 +15086 袪 +15087 ╡ +15088 鱡 +15089 琫 +15090 癳 +15091 礶 +15092 痚 +15093 璭 +15094 竐 +15095 玡 +15096 籩 +15097 秂 +15098 〆 +15099 甧 +15100 眅 +15101 俥 +15102 卐 +15103 唀 +15104 噀 +15105 坋 +15106 塭 +15107 奺 +15108 媏 +15109 宔 +15110 峞 +15111 巈 +15112 廵 +15113 恊 +15114 慹 +15115 抏 +15116 揺 +15117 攅 +15118 昬 +15119 梕 +15120 榚 +15121 檈 +15122 歟 +15123 沞 +15124 渆 +15125 焑 +15126 碋 +15127 ‥ +15128 鱁 +15129 癊 +15130 疎 +15131 璄 +15132 縀 +15133 稥 +15134 窫 +15135 獷 +15136 籈 +15137 禘 +15138 瓻 +15139 盓 +15140 丒 +15141 侲 +15142 凟 +15143 匛 +15144 咵 +15145 嘐 +15146 圗 +15147 塃 +15148 奅 +15149 婨 +15150 孍 +15151 岴 +15152 嶦 +15153 廍 +15154 怑 +15155 慐 +15156 扙 +15157 揈 +15158 擡 +15159 旹 +15160 朎 +15161 桬 +15162 欵 +15163 汦 +15164 淓 +15165 滶 +15166 烢 +15167 ㄩ +15168 ﹂ +15169 ¢ +15170 ч +15171 ╅ +15172 ㈤ +15173 侀 +15174 傞 +15175 勯 +15176 呴 +15177 囬 +15178 堥 +15179 婇 +15180 嬮 +15181 岄 +15182 嶉 +15183 庨 +15184 愰 +15185 掗 +15186 曢 +15187 栭 +15188 楅 +15189 橀 +15190 欓 +15191 濋 +15192 熼 +15193 榇 +15194 閌 +15195 閊 +15196 閇 +15197 閧 +15198 椐 +15199 锧 +15200 閈 +15201 閫 +15202 閮 +15203 閯 +15204 閰 +15205 閳 +15206 閴 +15207 閵 +15208 閷 +15209 閸 +15210 閺 +15211 閼 +15212 閽 +15213 閾 +15214 閿 +15215 闁 +15216 闂 +15217 闃 +15218 闄 +15219 闅 +15220 闈 +15221 闉 +15222 ㄉ +15223 ι +15224 ∩ +15225 ┥ +15226 ⑸ +15227 I +15228 伾 +15229 勆 +15230 吷 +15231 喩 +15232 嚿 +15233 埳 +15234 壣 +15235 屔 +15236 嵣 +15237 徤 +15238 惿 +15239 懮 +15240 捝 +15241 撋 +15242 斏 +15243 暽 +15244 柹 +15245 椛 +15246 櫳 +15247 毶 +15248 浬 +15249 溕 +15250 蒦 +15251 蓗 +15252 蓔 +15253 蒧 +15254 蒣 +15255 蒥 +15256 蓒 +15257 葽 +15258 蒤 +15259 蓘 +15260 蓙 +15261 蓚 +15262 蓛 +15263 蓜 +15264 蓞 +15265 蓡 +15266 蓤 +15267 蓧 +15268 蓨 +15269 蓫 +15270 蓭 +15271 蓱 +15272 蓲 +15273 蓳 +15274 蓴 +15275 蓵 +15276 蓶 +15277 蓷 +15278 蓸 +15279 蓹 +15280 蓺 +15281 蓻 +15282 蓽 +15283 蓾 +15284 蔀 +15285 蔁 +15286 ㄨ +15287 ﹁ +15288 ¤ +15289 ц +15290 ╄ +15291 ヨ +15292 佽 +15293 傝 +15294 冭 +15295 勮 +15296 呰 +15297 堣 +15298 夎 +15299 婅 +15300 嬭 +15301 岃 +15302 嶈 +15303 忚 +15304 愯 +15305 戣 +15306 掕 +15307 撹 +15308 旇 +15309 曡 +15310 栬 +15311 楄 +15312 樿 +15313 涜 +15314 濊 +15315 炶 +15316 熻 +15317 鑐 +15318 鑯 +15319 鑭 +15320 鑏 +15321 杩 +15322 璺 +15323 鑋 +15324 鑍 +15325 鑮 +15326 鑌 +15327 鑱 +15328 鑳 +15329 鑴 +15330 鑶 +15331 鑸 +15332 鑹 +15333 鑺 +15334 鑻 +15335 鑿 +15336 钀 +15337 钁 +15338 钂 +15339 钃 +15340 钄 +15341 钖 +15342 钘 +15343 铇 +15344 铓 +15345 铔 +15346 铦 +15347 ㄈ +15348 θ +15349 ∪ +15350 ┤ +15351 ⑷ +15352 H +15353 伻 +15354 勅 +15355 喨 +15356 嚾 +15357 埲 +15358 娙 +15359 屓 +15360 幦 +15361 徣 +15362 惾 +15363 懭 +15364 捜 +15365 暼 +15366 柸 +15367 椚 +15368 樔 +15369 櫲 +15370 浫 +15371 溔 +15372 澣 +15373 炄 +15374 熑 +15375 萡 +15376 萟 +15377 萠 +15378 萞 +15379 葅 +15380 葈 +15381 菮 +15382 葊 +15383 葋 +15384 葌 +15385 葍 +15386 葏 +15387 葐 +15388 葒 +15389 葓 +15390 葔 +15391 葕 +15392 葘 +15393 葝 +15394 葟 +15395 葠 +15396 葢 +15397 葤 +15398 葥 +15399 葧 +15400 葨 +15401 葪 +15402 葮 +15403 葰 +15404 葲 +15405 葴 +15406 葹 +15407 葻 +15408 ﹃ +15409 £ +15410 ш +15411 ╆ +15412 ㈥ +15413 侁 +15414 勱 +15415 呹 +15416 嗞 +15417 囮 +15418 堦 +15419 岅 +15420 嶊 +15421 庩 +15422 愱 +15423 戧 +15424 撽 +15425 旉 +15426 曣 +15427 栮 +15428 楆 +15429 橁 +15430 欔 +15431 濌 +15432 炾 +15433 阘 +15434 阇 +15435 陗 +15436 陓 +15437 阓 +15438 攴 +15439 闧 +15440 陒 +15441 陖 +15442 甓 +15443 戤 +15444 闬 +15445 陙 +15446 陚 +15447 陜 +15448 陠 +15449 陥 +15450 陦 +15451 陫 +15452 陭 +15453 陮 +15454 陱 +15455 陹 +15456 陻 +15457 陼 +15458 陾 +15459 陿 +15460 隀 +15461 隁 +15462 隂 
+15463 隃 +15464 隄 +15465 隇 +15466 隉 +15467 ㄊ +15468 κ +15469 ∈ +15470 ┦ +15471 ⑹ +15472 J +15473 伿 +15474 偸 +15475 兪 +15476 勈 +15477 吺 +15478 囀 +15479 埵 +15480 嬍 +15481 屖 +15482 嵤 +15483 幨 +15484 徥 +15485 愂 +15486 懯 +15487 捠 +15488 斒 +15489 暿 +15490 樖 +15491 櫴 +15492 毷 +15493 炇 +15494 蔨 +15495 蕕 +15496 蕓 +15497 蔩 +15498 蔧 +15499 蕒 +15500 蕔 +15501 蔦 +15502 蕘 +15503 蕚 +15504 蕛 +15505 蕜 +15506 蕝 +15507 蕟 +15508 蕠 +15509 蕡 +15510 蕢 +15511 蕥 +15512 蕦 +15513 蕧 +15514 蕬 +15515 蕮 +15516 蕯 +15517 蕰 +15518 蕱 +15519 蕳 +15520 蕷 +15521 蕸 +15522 蕼 +15523 蕽 +15524 蕿 +15525 薀 +15526 ﹄ +15527 ‰ +15528 щ +15529 ╇ +15530 ㈦ +15531 傠 +15532 冸 +15533 勲 +15534 呺 +15535 嗠 +15536 堧 +15537 夒 +15538 婋 +15539 岆 +15540 庪 +15541 愲 +15542 戨 +15543 旊 +15544 曤 +15545 栯 +15546 楇 +15547 橂 +15548 滊 +15549 濍 +15550 炿 +15551 雦 +15552 隷 +15553 氕 +15554 搿 +15555 肟 +15556 敫 +15557 雥 +15558 雧 +15559 攵 +15560 隌 +15561 毪 +15562 隲 +15563 雬 +15564 雭 +15565 雮 +15566 雰 +15567 雴 +15568 雵 +15569 雸 +15570 雺 +15571 雼 +15572 雽 +15573 雿 +15574 霃 +15575 霅 +15576 霊 +15577 霋 +15578 霌 +15579 霐 +15580 霒 +15581 霔 +15582 霗 +15583 霚 +15584 霛 +15585 霝 +15586 霟 +15587 ㄋ +15588 λ +15589 ∷ +15590 ┧ +15591 ⑺ +15592 K +15593 佀 +15594 偹 +15595 兯 +15596 勊 +15597 囁 +15598 埶 +15599 壦 +15600 娝 +15601 嬎 +15602 屗 +15603 嵥 +15604 幩 +15605 徦 +15606 愃 +15607 懰 +15608 捤 +15609 撍 +15610 斔 +15611 曀 +15612 椝 +15613 櫵 +15614 浰 +15615 溗 +15616 澦 +15617 炈 +15618 熕 +15619 薫 +15620 薧 +15621 藒 +15622 藎 +15623 薣 +15624 藑 +15625 薂 +15626 薥 +15627 藔 +15628 藗 +15629 藙 +15630 藚 +15631 藞 +15632 藡 +15633 藢 +15634 藣 +15635 藧 +15636 藪 +15637 藫 +15638 藭 +15639 藮 +15640 藯 +15641 藰 +15642 藱 +15643 藲 +15644 藳 +15645 藵 +15646 藶 +15647 藷 +15648 蒩 +15649 葾 +15650 榱 +15651 萢 +15652 阛 +15653 闍 +15654 蔄 +15655 赆 +15656 赇 +15657 隺 +15658 薃 +15659 脶 +15660 肜 +15661 脞 +15662 锽 +15663 蒪 +15664 葿 +15665 萣 +15666 阞 +15667 蔮 +15668 蔅 +15669 隑 +15670 薭 +15671 薆 +15672 蒫 +15673 蒀 +15674 鑓 +15675 阠 +15676 蔯 +15677 隿 +15678 隒 +15679 薱 +15680 閐 +15681 镈 +15682 蒬 +15683 蒁 +15684 鑔 +15685 萪 +15686 桊 +15687 阣 +15688 闐 +15689 蔰 +15690 蔇 +15691 牮 +15692 雂 +15693 隓 +15694 薲 +15695 薉 +15696 镋 +15697 蒭 +15698 蒃 +15699 萫 +15700 阤 +15701 闑 +15702 蔱 +15703 蔈 +15704 雃 +15705 薳 +15706 肷 +15707 腙 +15708 胨 +15709 蒮 +15710 鑖 +15711 菷 +15712 阥 +15713 闒 +15714 蔲 +15715 蔉 +15716 雈 +15717 隖 +15718 薴 +15719 薋 +15720 蒅 +15721 鑗 +15722 鐶 +15723 萭 +15724 菺 +15725 阦 +15726 闓 +15727 蔳 +15728 蔊 +15729 薵 +15730 镠 +15731 蒱 +15732 蒆 +15733 鐷 +15734 萮 +15735 菻 +15736 阧 +15737 闔 +15738 蔋 +15739 薶 +15740 镮 +15741 蒳 +15742 蒊 +15743 殪 +15744 萯 +15745 阨 +15746 蔍 +15747 晡 +15748 雐 +15749 隝 +15750 胩 +15751 胂 +15752 閖 +15753 蒵 +15754 蒍 +15755 鑚 +15756 鐹 +15757 萰 +15758 菾 +15759 阩 +15760 蔶 +15761 蔎 +15762 隞 +15763 薐 +15764 閗 +15765 镵 +15766 蒶 +15767 蒏 +15768 鑛 +15769 萲 +15770 菿 +15771 阫 +15772 闗 +15773 蔾 +15774 蔏 +15775 隟 +15776 薻 +15777 蒐 +15778 鑜 +15779 鐻 +15780 萳 +15781 萀 +15782 阬 +15783 蔿 +15784 雔 +15785 薼 +15786 薒 +15787 閙 +15788 镸 +15789 蒑 +15790 鐼 +15791 萴 +15792 阭 +15793 蕀 +15794 蔒 +15795 隡 +15796 薽 +15797 薓 +15798 镹 +15799 蒒 +15800 鑞 +15801 鐽 +15802 萅 +15803 阯 +15804 闚 +15805 隢 +15806 薾 +15807 镺 +15808 蒓 +15809 轷 +15810 鐿 +15811 萶 +15812 萇 +15813 柙 +15814 阰 +15815 蔕 +15816 牿 +15817 犋 +15818 雘 +15819 隣 +15820 薿 +15821 薕 +15822 閜 +15823 镻 +15824 蒔 +15825 鑠 +15826 鑀 +15827 萷 +15828 阷 +15829 蕄 +15830 蔖 +15831 隤 +15832 藀 +15833 閝 +15834 镼 +15835 鑡 +15836 鑁 +15837 萹 +15838 萉 +15839 阸 +15840 蕅 +15841 蔘 +15842 雚 +15843 隥 +15844 藂 +15845 薗 +15846 閞 +15847 蓃 +15848 蒖 +15849 鑢 +15850 萺 +15851 阹 +15852 闞 +15853 蕆 +15854 蔙 +15855 隦 +15856 藃 +15857 薘 
+15858 镾 +15859 蓅 +15860 蒘 +15861 鑃 +15862 萻 +15863 萐 +15864 阺 +15865 闟 +15866 蕇 +15867 蔛 +15868 藄 +15869 赀 +15870 閠 +15871 蒚 +15872 鑤 +15873 萾 +15874 萒 +15875 阾 +15876 闠 +15877 蕋 +15878 蔜 +15879 隩 +15880 藅 +15881 薚 +15882 閁 +15883 蒛 +15884 辁 +15885 鑅 +15886 萿 +15887 萓 +15888 棼 +15889 椟 +15890 陁 +15891 蔝 +15892 贳 +15893 藆 +15894 薝 +15895 膪 +15896 脎 +15897 胲 +15898 閂 +15899 蓈 +15900 蒝 +15901 葀 +15902 萔 +15903 陃 +15904 蕍 +15905 雟 +15906 藇 +15907 薞 +15908 蒞 +15909 葁 +15910 萕 +15911 陊 +15912 蔠 +15913 雡 +15914 隬 +15915 藈 +15916 薟 +15917 閤 +15918 閄 +15919 蓌 +15920 鑨 +15921 鑈 +15922 葂 +15923 萖 +15924 陎 +15925 闤 +15926 蕏 +15927 蔢 +15928 隭 +15929 藊 +15930 閅 +15931 蒠 +15932 鑩 +15933 鑉 +15934 葃 +15935 萗 +15936 陏 +15937 闥 +15938 蕐 +15939 隮 +15940 藋 +15941 薡 +15942 閦 +15943 蓏 +15944 蒢 +15945 鑪 +15946 鑊 +15947 葄 +15948 萙 +15949 陑 +15950 闦 +15951 蕑 +15952 蔤 +15953 雤 +15954 藌 +15955 薢 +15956 柁 +15957 锠 +15958 葼 +15959 霠 +15960 藸 +15961 磃 +15962 ╢ +15963 鱢 +15964 譮 +15965 癴 +15966 礷 +15967 痜 +15968 璮 +15969 縡 +15970 玣 +15971 籪 +15972 秄 +15973 ゝ +15974 眆 +15975 乫 +15976 俧 +15977 僨 +15978 刦 +15979 唂 +15980 坒 +15981 塮 +15982 媐 +15983 宖 +15984 峟 +15985 廸 +15986 恌 +15987 慺 +15988 抐 +15989 揻 +15990 攆 +15991 杅 +15992 梖 +15993 榝 +15994 檉 +15995 歠 +15996 沠 +15997 漟 +15998 瀎 +15999 焒 +16000 ‵ +16001 譌 +16002 癋 +16003 礔 +16004 疐 +16005 璅 +16006 縁 +16007 稦 +16008 籉 +16009 侳 +16010 凢 +16011 匜 +16012 咶 +16013 嘑 +16014 塅 +16015 奆 +16016 婩 +16017 孎 +16018 岶 +16019 廎 +16020 怓 +16021 慒 +16022 扚 +16023 揊 +16024 擣 +16025 楩 +16026 橣 +16027 欶 +16028 淔 +16029 烣 +16030 ╣ +16031 鱣 +16032 癵 +16033 礸 +16034 痝 +16035 穏 +16036 竒 +16037 玤 +16038 籫 +16039 秅 +16040 ゞ +16041 甮 +16042 乬 +16043 僩 +16044 刧 +16045 唃 +16046 噂 +16047 坓 +16048 塯 +16049 奼 +16050 媑 +16051 廹 +16052 恎 +16053 慻 +16054 抔 +16055 揼 +16056 杇 +16057 梘 +16058 榞 +16059 檊 +16060 漡 +16061 焔 +16062 碐 +16063 ℅ +16064 鱃 +16065 譍 +16066 珿 +16067 癎 +16068 礕 +16069 疓 +16070 縂 +16071 稧 +16072 獹 +16073 籊 +16074 盙 +16075 侴 +16076 僄 +16077 凣 +16078 匞 +16079 嘒 +16080 塆 +16081 奊 +16082 婫 +16083 孏 +16084 岹 +16085 嶨 +16086 廏 +16087 怗 +16088 慓 +16089 扜 +16090 揋 +16091 擥 +16092 桮 +16093 橤 +16094 汫 +16095 淕 +16096 濭 +16097 烥 +16098 磆 +16099 ╤ +16100 鱤 +16101 琱 +16102 癶 +16103 礹 +16104 痟 +16105 穐 +16106 竓 +16107 秇 +16108 ﹉ +16109 県 +16110 乭 +16111 俬 +16112 僪 +16113 卙 +16114 噃 +16115 坔 +16116 塰 +16117 宧 +16118 峢 +16119 巋 +16120 廻 +16121 恏 +16122 慼 +16123 抙 +16124 揾 +16125 攈 +16126 昲 +16127 杊 +16128 梙 +16129 檋 +16130 歨 +16131 瀐 +16132 碒 +16133 ℉ +16134 鱄 +16135 譎 +16136 縃 +16137 稨 +16138 窰 +16139 籋 +16140 禜 +16141 瓾 +16142 盚 +16143 丠 +16144 凥 +16145 匟 +16146 咹 +16147 嘓 +16148 圚 +16149 塇 +16150 孒 +16151 廐 +16152 怘 +16153 慔 +16154 扝 +16155 揌 +16156 擧 +16157 旽 +16158 朒 +16159 楬 +16160 橦 +16161 欻 +16162 汬 +16163 淗 +16164 滺 +16165 磇 +16166 鱥 +16167 譱 +16168 癷 +16169 縤 +16170 竔 +16171 籭 +16172 秈 +16173 ﹊ +16174 甶 +16175 眎 +16176 乮 +16177 俰 +16178 僫 +16179 刬 +16180 卛 +16181 噄 +16182 坕 +16183 奿 +16184 媔 +16185 宨 +16186 巌 +16187 廼 +16188 恑 +16189 慽 +16190 搃 +16191 杋 +16192 梚 +16193 榠 +16194 檌 +16195 瀒 +16196 焛 +16197 碔 +16198 ↖ +16199 鱅 +16200 譏 +16201 琁 +16202 癐 +16203 礗 +16204 疘 +16205 縄 +16206 窱 +16207 禝 +16208 ㊣ +16209 甀 +16210 侷 +16211 僆 +16212 処 +16213 匢 +16214 咺 +16215 圛 +16216 塈 +16217 奍 +16218 岻 +16219 嶪 +16220 廔 +16221 怚 +16222 扞 +16223 揑 +16224 旾 +16225 桰 +16226 橧 +16227 欼 +16228 滻 +16229 烮 +16230 № +16231 ╉ +16232 ㈨ +16233 冺 +16234 勴 +16235 呿 +16236 堩 +16237 夗 +16238 嬳 +16239 岉 +16240 嶍 +16241 庬 +16242 忢 +16243 戫 +16244 掜 +16245 旐 +16246 曧 +16247 楉 +16248 橅 +16249 欗 +16250 氻 +16251 涰 +16252 滍 
+16253 濏 +16254 烅 +16255 燀 +16256 泶 +16257 韣 +16258 憝 +16259 慝 +16260 砘 +16261 韂 +16262 韠 +16263 韢 +16264 鞞 +16265 恧 +16266 韁 +16267 肀 +16268 韤 +16269 韥 +16270 韨 +16271 韯 +16272 韰 +16273 韱 +16274 韲 +16275 韷 +16276 韸 +16277 韹 +16278 韺 +16279 韽 +16280 頀 +16281 頄 +16282 頇 +16283 頉 +16284 頋 +16285 頍 +16286 ㄍ +16287 ν +16288 ⊥ +16289 ┩ +16290 ⑼ +16291 M +16292 佂 +16293 偼 +16294 兺 +16295 喭 +16296 壨 +16297 屚 +16298 嵧 +16299 愅 +16300 捦 +16301 撏 +16302 斖 +16303 曂 +16304 柾 +16305 椡 +16306 櫷 +16307 毻 +16308 溚 +16309 澩 +16310 炌 +16311 熗 +16312 蚟 +16313 蛜 +16314 蛗 +16315 蚠 +16316 蚚 +16317 蚞 +16318 蛖 +16319 蛚 +16320 虭 +16321 蚛 +16322 蛡 +16323 蛢 +16324 蛣 +16325 蛥 +16326 蛦 +16327 蛧 +16328 蛨 +16329 蛪 +16330 蛫 +16331 蛬 +16332 蛶 +16333 蛷 +16334 蛼 +16335 蛽 +16336 蛿 +16337 蜁 +16338 蜄 +16339 蜅 +16340 蜋 +16341 蜌 +16342 蜎 +16343 蜏 +16344 蜑 +16345 蜔 +16346 § +16347 ъ +16348 ╈ +16349 ㈧ +16350 侅 +16351 傡 +16352 冹 +16353 呾 +16354 嗢 +16355 囲 +16356 堨 +16357 夓 +16358 婌 +16359 嬱 +16360 岇 +16361 忟 +16362 愳 +16363 旍 +16364 曥 +16365 栰 +16366 楈 +16367 橃 +16368 氺 +16369 涭 +16370 濎 +16371 烄 +16372 熿 +16373 靱 +16374 靯 +16375 靇 +16376 旆 +16377 靃 +16378 靅 +16379 靮 +16380 靲 +16381 靵 +16382 靷 +16383 靸 +16384 靹 +16385 靻 +16386 靽 +16387 靾 +16388 靿 +16389 鞀 +16390 鞁 +16391 鞃 +16392 鞄 +16393 鞆 +16394 鞈 +16395 鞉 +16396 鞊 +16397 鞌 +16398 鞎 +16399 鞐 +16400 鞓 +16401 鞖 +16402 鞗 +16403 鞙 +16404 鞚 +16405 鞛 +16406 鞜 +16407 ㄌ +16408 μ +16409 √ +16410 ┨ +16411 ぬ +16412 ⑻ +16413 L +16414 佁 +16415 偺 +16416 勌 +16417 吿 +16418 壧 +16419 娞 +16420 屘 +16421 嵦 +16422 幪 +16423 徧 +16424 愄 +16425 懱 +16426 捥 +16427 撎 +16428 曁 +16429 柼 +16430 椞 +16431 樚 +16432 櫶 +16433 浱 +16434 溙 +16435 澨 +16436 炋 +16437 熖 +16438 蘞 +16439 蘜 +16440 虀 +16441 蘾 +16442 蘝 +16443 蘙 +16444 蘽 +16445 虂 +16446 虃 +16447 虅 +16448 虆 +16449 虇 +16450 虈 +16451 虉 +16452 虊 +16453 虋 +16454 虌 +16455 虖 +16456 虗 +16457 虘 +16458 虙 +16459 虝 +16460 虠 +16461 虡 +16462 虣 +16463 虥 +16464 虦 +16465 虨 +16466 虩 +16467 ︻ +16468 ☆ +16469 ╊ +16470 ゎ +16471 ㈩ +16472 ヮ +16473 侇 +16474 傤 +16475 冾 +16476 囶 +16477 堫 +16478 夘 +16479 婎 +16480 嬵 +16481 岊 +16482 嶎 +16483 庮 +16484 忣 +16485 愵 +16486 戭 +16487 掝 +16488 擃 +16489 旑 +16490 曨 +16491 橆 +16492 氼 +16493 涱 +16494 濐 +16495 烆 +16496 頯 +16497 瞵 +16498 顋 +16499 頮 +16500 罨 +16501 頪 +16502 頬 +16503 頏 +16504 顐 +16505 顑 +16506 顕 +16507 顖 +16508 顙 +16509 顚 +16510 顜 +16511 顝 +16512 顟 +16513 顠 +16514 顡 +16515 顢 +16516 顣 +16517 顦 +16518 顩 +16519 顬 +16520 顭 +16521 ㄎ +16522 ξ +16523 ∥ +16524 ┪ +16525 ⑽ +16526 N +16527 佄 +16528 兾 +16529 勎 +16530 娢 +16531 幬 +16532 徫 +16533 愇 +16534 懳 +16535 斘 +16536 曃 +16537 栁 +16538 椢 +16539 樜 +16540 毼 +16541 浳 +16542 溛 +16543 炍 +16544 熚 +16545 蝋 +16546 蝆 +16547 蝵 +16548 蝊 +16549 蝅 +16550 蝱 +16551 蝳 +16552 蜙 +16553 蝄 +16554 蝹 +16555 蝺 +16556 蝿 +16557 螀 +16558 螁 +16559 螇 +16560 螉 +16561 螊 +16562 螌 +16563 螎 +16564 螑 +16565 螒 +16566 螔 +16567 螕 +16568 螖 +16569 螘 +16570 螙 +16571 螚 +16572 螛 +16573 螝 +16574 螠 +16575 螡 +16576 螣 +16577 ︼ +16578 ★ +16579 э +16580 ╋ +16581 侊 +16582 傦 +16583 冿 +16584 勶 +16585 咃 +16586 嗭 +16587 堬 +16588 夛 +16589 婏 +16590 嬶 +16591 岋 +16592 嶏 +16593 忥 +16594 愶 +16595 掞 +16596 擄 +16597 旓 +16598 曪 +16599 栵 +16600 楋 +16601 欙 +16602 涳 +16603 滐 +16604 烇 +16605 燂 +16606 锎 +16607 颺 +16608 铷 +16609 飤 +16610 铴 +16611 锏 +16612 颻 +16613 铩 +16614 锓 +16615 铽 +16616 颷 +16617 飡 +16618 飣 +16619 铹 +16620 颸 +16621 飥 +16622 飫 +16623 飬 +16624 飭 +16625 飮 +16626 飱 +16627 飳 +16628 飴 +16629 飵 +16630 飶 +16631 飸 +16632 飺 +16633 餀 +16634 餁 +16635 餂 +16636 餄 +16637 ㄏ +16638 ο +16639 ∠ +16640 ┫ +16641 ⑾ +16642 O +16643 佅 +16644 傁 +16645 兿 +16646 勏 +16647 呄 
+16648 喯 +16649 囅 +16650 埾 +16651 壪 +16652 娤 +16653 嬒 +16654 嵪 +16655 幭 +16656 懴 +16657 捪 +16658 斚 +16659 曄 +16660 栂 +16661 椣 +16662 櫹 +16663 毾 +16664 澫 +16665 炏 +16666 熛 +16667 蟕 +16668 蟐 +16669 蟸 +16670 蟔 +16671 蟏 +16672 蟵 +16673 螥 +16674 蟎 +16675 蟺 +16676 蟼 +16677 蟽 +16678 蟿 +16679 蠀 +16680 蠁 +16681 蠆 +16682 蠇 +16683 蠈 +16684 蠉 +16685 蠋 +16686 蠌 +16687 蠎 +16688 蠏 +16689 蠐 +16690 蠒 +16691 蠗 +16692 蠘 +16693 蠚 +16694 蠜 +16695 蠠 +16696 锊 +16697 韆 +16698 鞟 +16699 蚢 +16700 虯 +16701 愍 +16702 砹 +16703 霢 +16704 蘟 +16705 灬 +16706 爨 +16707 炷 +16708 蝍 +16709 蜛 +16710 钆 +16711 颽 +16712 蟖 +16713 螦 +16714 镟 +16715 镥 +16716 镤 +16717 锬 +16718 锩 +16719 铈 +16720 韇 +16721 蚥 +16722 虰 +16723 靊 +16724 霣 +16725 蘠 +16726 藼 +16727 蝏 +16728 蜝 +16729 颾 +16730 蟗 +16731 韈 +16732 鞢 +16733 蚦 +16734 虲 +16735 靋 +16736 霤 +16737 藽 +16738 頲 +16739 蝐 +16740 蜟 +16741 钋 +16742 颿 +16743 顲 +16744 蟘 +16745 螩 +16746 韉 +16747 鞤 +16748 蚫 +16749 虳 +16750 眇 +16751 靌 +16752 霥 +16753 藾 +16754 頳 +16755 蝑 +16756 蜠 +16757 铕 +16758 飀 +16759 螪 +16760 镄 +16761 韊 +16762 鞥 +16763 蚭 +16764 虴 +16765 黹 +16766 靍 +16767 霦 +16768 蘣 +16769 蘀 +16770 頴 +16771 蝒 +16772 蜤 +16773 钌 +16774 飁 +16775 蟚 +16776 锼 +16777 鞦 +16778 蚮 +16779 虵 +16780 靎 +16781 蘤 +16782 蘁 +16783 頵 +16784 蝔 +16785 飂 +16786 螰 +16787 鞧 +16788 蚲 +16789 虶 +16790 靏 +16791 霨 +16792 蘥 +16793 蘂 +16794 頖 +16795 蜧 +16796 飃 +16797 螱 +16798 韍 +16799 蚳 +16800 虷 +16801 靐 +16802 霩 +16803 蘦 +16804 蘃 +16805 蝖 +16806 蜨 +16807 颒 +16808 韎 +16809 鞩 +16810 虸 +16811 眍 +16812 靑 +16813 霫 +16814 蘨 +16815 煳 +16816 蝘 +16817 蜪 +16818 钔 +16819 飅 +16820 蟟 +16821 螴 +16822 锿 +16823 锾 +16824 韏 +16825 鞪 +16826 蚸 +16827 靔 +16828 霬 +16829 蘪 +16830 蝚 +16831 蜫 +16832 螶 +16833 鞬 +16834 蚹 +16835 蚄 +16836 靕 +16837 霮 +16838 蘫 +16839 頺 +16840 頚 +16841 蝛 +16842 蜬 +16843 飇 +16844 颣 +16845 螷 +16846 蚻 +16847 蚅 +16848 靗 +16849 霯 +16850 蜭 +16851 飈 +16852 蟣 +16853 韒 +16854 蚼 +16855 蚆 +16856 靘 +16857 霱 +16858 蘉 +16859 蝝 +16860 蜯 +16861 飉 +16862 颩 +16863 蟤 +16864 螹 +16865 鞱 +16866 蚽 +16867 蚇 +16868 蘮 +16869 頝 +16870 蝞 +16871 蜰 +16872 飊 +16873 颪 +16874 蟦 +16875 螻 +16876 镅 +16877 韔 +16878 鞳 +16879 蚾 +16880 眢 +16881 眚 +16882 霴 +16883 蘯 +16884 煺 +16885 頾 +16886 頞 +16887 蜲 +16888 钪 +16889 钬 +16890 螼 +16891 镎 +16892 韕 +16893 鞵 +16894 蚿 +16895 蚉 +16896 靝 +16897 蘰 +16898 頿 +16899 頟 +16900 蝡 +16901 蜳 +16902 飌 +16903 蟨 +16904 螾 +16905 韖 +16906 蛁 +16907 蚎 +16908 靟 +16909 蝢 +16910 蜵 +16911 飍 +16912 颭 +16913 蟩 +16914 螿 +16915 韗 +16916 鞷 +16917 蛂 +16918 蚏 +16919 靣 +16920 霷 +16921 蘲 +16922 顁 +16923 蜶 +16924 蟫 +16925 蟁 +16926 韘 +16927 鞸 +16928 蛃 +16929 蚐 +16930 靤 +16931 霺 +16932 蘳 +16933 顂 +16934 頢 +16935 蝧 +16936 飐 +16937 蟂 +16938 钸 +16939 韙 +16940 鞹 +16941 蛅 +16942 蚑 +16943 霻 +16944 蘴 +16945 蘐 +16946 頣 +16947 蜹 +16948 颰 +16949 蟭 +16950 蟃 +16951 韚 +16952 鞺 +16953 蛈 +16954 蚒 +16955 睃 +16956 碥 +16957 靧 +16958 霼 +16959 禚 +16960 顄 +16961 蝩 +16962 蜺 +16963 稆 +16964 稃 +16965 韛 +16966 鞻 +16967 蛌 +16968 蚔 +16969 霽 +16970 蘶 +16971 蘓 +16972 蝪 +16973 蜼 +16974 颲 +16975 蟰 +16976 蟅 +16977 蚖 +16978 靪 +16979 霿 +16980 蘷 +16981 蘔 +16982 蝫 +16983 蜽 +16984 蟇 +16985 韝 +16986 鞽 +16987 蛒 +16988 蚗 +16989 靫 +16990 靀 +16991 蘹 +16992 蘕 +16993 顇 +16994 飜 +16995 颴 +16996 韞 +16997 鞾 +16998 蛓 +16999 蚘 +17000 靁 +17001 蘺 +17002 顈 +17003 頨 +17004 蝭 +17005 蝁 +17006 飝 +17007 颵 +17008 蟉 +17009 韟 +17010 鞿 +17011 蛕 +17012 蚙 +17013 靭 +17014 蘻 +17015 顉 +17016 頩 +17017 蝯 +17018 飠 +17019 蟴 +17020 蟌 +17021 铪 +17022 钷 +17023 頎 +17024 蜖 +17025 鞝 +17026 虪 +17027 顮 +17028 螤 +17029 餇 +17030 磈 +17031 ╦ +17032 鱦 +17033 譲 +17034 琷 +17035 癹 +17036 礿 +17037 痡 +17038 璲 +17039 縥 +17040 穓 +17041 竕 +17042 秊 
+17043 ﹋ +17044 甹 +17045 眏 +17046 乯 +17047 俲 +17048 僯 +17049 刯 +17050 卝 +17051 唈 +17052 坖 +17053 塲 +17054 妀 +17055 宩 +17056 峧 +17057 廽 +17058 抝 +17059 搄 +17060 攋 +17061 昷 +17062 梛 +17063 檍 +17064 歫 +17065 沯 +17066 渏 +17067 漥 +17068 瀓 +17069 碕 +17070 ↗ +17071 鱆 +17072 譐 +17073 琂 +17074 癑 +17075 礘 +17076 疛 +17077 璊 +17078 稪 +17079 窲 +17080 禞 +17081 ㎎ +17082 盝 +17083 丣 +17084 侸 +17085 僇 +17086 凧 +17087 咼 +17088 嘕 +17089 圝 +17090 塉 +17091 奐 +17092 婮 +17093 孞 +17094 岼 +17095 嶫 +17096 廕 +17097 怞 +17098 慗 +17099 扟 +17100 揓 +17101 擩 +17102 朖 +17103 桱 +17104 楯 +17105 橨 +17106 淛 +17107 滼 +17108 濲 +17109 烰 +17110 磌 +17111 ╧ +17112 譳 +17113 琸 +17114 祂 +17115 痥 +17116 縦 +17117 穔 +17118 竗 +17119 籯 +17120 秌 +17121 ﹌ +17122 甼 +17123 眐 +17124 乲 +17125 俴 +17126 僰 +17127 刱 +17128 卥 +17129 唊 +17130 噆 +17131 坘 +17132 塳 +17133 妅 +17134 峩 +17135 巏 +17136 弅 +17137 恔 +17138 慿 +17139 択 +17140 搆 +17141 杒 +17142 梜 +17143 榢 +17144 檏 +17145 歬 +17146 沰 +17147 渒 +17148 漦 +17149 瀔 +17150 焝 +17151 ↘ +17152 琄 +17153 疜 +17154 璌 +17155 縆 +17156 稫 +17157 窴 +17158 獽 +17159 籏 +17160 ㎏ +17161 甂 +17162 侹 +17163 僈 +17164 凨 +17165 匥 +17166 咾 +17167 嘖 +17168 奒 +17169 婯 +17170 孠 +17171 岾 +17172 嶬 +17173 廗 +17174 怟 +17175 扠 +17176 揔 +17177 擪 +17178 昁 +17179 朘 +17180 楰 +17181 欿 +17182 汯 +17183 淜 +17184 滽 +17185 濳 +17186 烱 +17187 磍 +17188 ╨ +17189 鱨 +17190 琹 +17191 璴 +17192 縧 +17193 竘 +17194 籰 +17195 秎 +17196 ﹍ +17197 甽 +17198 眑 +17199 乴 +17200 俵 +17201 刲 +17202 卨 +17203 唋 +17204 噇 +17205 坙 +17206 塴 +17207 妉 +17208 宭 +17209 峫 +17210 巐 +17211 弆 +17212 恖 +17213 憀 +17214 抣 +17215 搇 +17216 昹 +17217 杔 +17218 榣 +17219 檒 +17220 歭 +17221 沴 +17222 渓 +17223 漧 +17224 焞 +17225 碙 +17226 ↙ +17227 譒 +17228 癓 +17229 礚 +17230 疞 +17231 璍 +17232 稬 +17233 獿 +17234 籐 +17235 ㎜ +17236 盠 +17237 丩 +17238 侺 +17239 僉 +17240 匧 +17241 哃 +17242 圠 +17243 塋 +17244 婰 +17245 孡 +17246 峀 +17247 嶭 +17248 廘 +17249 怢 +17250 慙 +17251 扡 +17252 揕 +17253 擫 +17254 朙 +17255 桳 +17256 楲 +17257 橪 +17258 歀 +17259 汱 +17260 淟 +17261 濴 +17262 烲 +17263 磎 +17264 ╩ +17265 鱩 +17266 譵 +17267 癿 +17268 祄 +17269 痬 +17270 璵 +17271 穖 +17272 竚 +17273 籱 +17274 秏 +17275 ﹎ +17276 甿 +17277 眒 +17278 乵 +17279 僲 +17280 刴 +17281 卪 +17282 唌 +17283 噈 +17284 坢 +17285 妋 +17286 媘 +17287 峬 +17288 巑 +17289 恗 +17290 抦 +17291 搈 +17292 攎 +17293 梞 +17294 榤 +17295 檓 +17296 歮 +17297 沵 +17298 漨 +17299 瀖 +17300 焟 +17301 譓 +17302 琈 +17303 癕 +17304 礛 +17305 疢 +17306 璏 +17307 縈 +17308 稭 +17309 窶 +17310 玀 +17311 籑 +17312 ㎝ +17313 甅 +17314 丮 +17315 侻 +17316 匨 +17317 哅 +17318 嘙 +17319 圡 +17320 塎 +17321 奙 +17322 孧 +17323 峂 +17324 嶮 +17325 怣 +17326 扢 +17327 揗 +17328 昅 +17329 朚 +17330 桵 +17331 楳 +17332 歁 +17333 汳 +17334 淢 +17335 濵 +17336 烳 +17337 ╪ +17338 譶 +17339 皀 +17340 痭 +17341 璶 +17342 縩 +17343 穘 +17344 竛 +17345 秐 +17346 畁 +17347 乶 +17348 僴 +17349 刵 +17350 唍 +17351 噉 +17352 塶 +17353 妌 +17354 媙 +17355 宯 +17356 峮 +17357 巒 +17358 弉 +17359 恘 +17360 抧 +17361 搉 +17362 昻 +17363 杗 +17364 榥 +17365 歯 +17366 沶 +17367 渘 +17368 焠 +17369 碞 +17370 ∟ +17371 鱊 +17372 譔 +17373 琋 +17374 癗 +17375 疦 +17376 璑 +17377 窷 +17378 籒 +17379 禢 +17380 ㎞ +17381 甆 +17382 丯 +17383 侼 +17384 凬 +17385 匩 +17386 哊 +17387 圢 +17388 塏 +17389 奛 +17390 婲 +17391 孨 +17392 峃 +17393 嶯 +17394 怤 +17395 慛 +17396 扤 +17397 揘 +17398 擭 +17399 朜 +17400 桸 +17401 楴 +17402 歂 +17403 汵 +17404 淣 +17405 漀 +17406 濶 +17407 ︸ +17408 ● +17409 ゑ +17410 Ⅰ +17411 ヱ +17412 凂 +17413 咇 +17414 嗰 +17415 囻 +17416 堮 +17417 夞 +17418 婑 +17419 嬹 +17420 岏 +17421 嶑 +17422 庱 +17423 忨 +17424 戱 +17425 旕 +17426 栺 +17427 楍 +17428 橊 +17429 欛 +17430 汃 +17431 涶 +17432 滖 +17433 燅 +17434 馺 +17435 馸 +17436 駘 +17437 瘳 
+17438 駖 +17439 馹 +17440 馵 +17441 馌 +17442 駞 +17443 駢 +17444 駥 +17445 駦 +17446 駨 +17447 駩 +17448 駪 +17449 駬 +17450 駮 +17451 駰 +17452 駴 +17453 駵 +17454 駶 +17455 駸 +17456 ㄑ +17457 ρ +17458 ⊙ +17459 ┭ +17460 ⒀ +17461 Q +17462 パ +17463 傃 +17464 冄 +17465 勓 +17466 呇 +17467 囇 +17468 堁 +17469 娧 +17470 嬔 +17471 屟 +17472 嵮 +17473 幯 +17474 徰 +17475 愌 +17476 捬 +17477 撗 +17478 栄 +17479 椦 +17480 樠 +17481 氀 +17482 浹 +17483 溠 +17484 澭 +17485 裛 +17486 裗 +17487 褈 +17488 裑 +17489 裖 +17490 褅 +17491 袬 +17492 裓 +17493 褉 +17494 褋 +17495 褌 +17496 褍 +17497 褎 +17498 褏 +17499 褑 +17500 褔 +17501 褖 +17502 褗 +17503 褘 +17504 褜 +17505 褝 +17506 褞 +17507 褟 +17508 褤 +17509 褧 +17510 褨 +17511 褩 +17512 褬 +17513 褭 +17514 褱 +17515 褳 +17516 褵 +17517 窆 +17518 馎 +17519 袮 +17520 窳 +17521 衤 +17522 袯 +17523 馽 +17524 馛 +17525 裞 +17526 裠 +17527 袲 +17528 馿 +17529 馝 +17530 袳 +17531 耖 +17532 耔 +17533 耠 +17534 馞 +17535 裦 +17536 袴 +17537 馟 +17538 裧 +17539 袵 +17540 駂 +17541 馠 +17542 裩 +17543 袶 +17544 駃 +17545 馡 +17546 裪 +17547 袸 +17548 耥 +17549 耢 +17550 裉 +17551 馢 +17552 袹 +17553 駅 +17554 馣 +17555 裬 +17556 袺 +17557 駆 +17558 馤 +17559 裭 +17560 袻 +17561 駇 +17562 馦 +17563 裮 +17564 袽 +17565 駈 +17566 馧 +17567 裯 +17568 袾 +17569 駉 +17570 袿 +17571 裼 +17572 裵 +17573 裀 +17574 駋 +17575 馫 +17576 裶 +17577 裃 +17578 駌 +17579 裷 +17580 裄 +17581 裺 +17582 裇 +17583 駎 +17584 裻 +17585 駏 +17586 馯 +17587 裿 +17588 駑 +17589 褀 +17590 裌 +17591 褁 +17592 裍 +17593 駓 +17594 褃 +17595 駔 +17596 褄 +17597 裐 +17598 駹 +17599 褷 +17600 磑 +17601 ╫ +17602 鱫 +17603 琽 +17604 皁 +17605 祇 +17606 痮 +17607 璷 +17608 穙 +17609 玱 +17610 籵 +17611 秓 +17612 ﹐ +17613 畂 +17614 眔 +17615 乷 +17616 俹 +17617 僶 +17618 刼 +17619 卭 +17620 坥 +17621 塷 +17622 妎 +17623 宱 +17624 巓 +17625 憃 +17626 抩 +17627 搊 +17628 攐 +17629 杘 +17630 榦 +17631 歰 +17632 沷 +17633 漮 +17634 碠 +17635 ∣ +17636 譕 +17637 琌 +17638 癘 +17639 礝 +17640 疧 +17641 璒 +17642 縊 +17643 稯 +17644 玂 +17645 禣 +17646 丱 +17647 侽 +17648 凮 +17649 匫 +17650 哋 +17651 圤 +17652 塐 +17653 奜 +17654 峅 +17655 嶰 +17656 廜 +17657 怬 +17658 扥 +17659 揙 +17660 擮 +17661 昈 +17662 朞 +17663 桹 +17664 橭 +17665 歄 +17666 汷 +17667 淥 +17668 濷 +17669 烵 +17670 骱 +17671 『 +17672 Ш +17673 ┖ +17674 ⒑ +17675 : +17676 伜 +17677 偤 +17678 吅 +17679 喓 +17680 姾 +17681 嫼 +17682 尯 +17683 嵑 +17684 幒 +17685 徍 +17686 懞 +17687 捄 +17688 摵 +17689 柡 +17690 櫤 +17691 毢 +17692 澓 +17693 灪 +17694 篳 +17695 篰 +17696 簙 +17697 簗 +17698 篲 +17699 篬 +17700 篯 +17701 簘 +17702 篅 +17703 篭 +17704 簚 +17705 簛 +17706 簜 +17707 簝 +17708 簣 +17709 簤 +17710 簥 +17711 簨 +17712 簩 +17713 簬 +17714 簭 +17715 簮 +17716 簯 +17717 簰 +17718 簱 +17719 簲 +17720 簳 +17721 簴 +17722 簵 +17723 簶 +17724 簹 +17725 簺 +17726 簻 +17727 簼 +17728 ◇ +17729 侒 +17730 傮 +17731 凅 +17732 勼 +17733 咉 +17734 圀 +17735 夡 +17736 婓 +17737 嬻 +17738 岓 +17739 庴 +17740 忬 +17741 愺 +17742 戵 +17743 掦 +17744 擉 +17745 旙 +17746 曮 +17747 栿 +17748 楏 +17749 橌 +17750 欝 +17751 汅 +17752 涹 +17753 滙 +17754 濗 +17755 烍 +17756 燇 +17757 骭 +17758 蟓 +17759 骩 +17760 骫 +17761 髛 +17762 螫 +17763 髝 +17764 髠 +17765 髢 +17766 髤 +17767 髥 +17768 髧 +17769 髨 +17770 髩 +17771 髬 +17772 髱 +17773 髲 +17774 髳 +17775 髵 +17776 髶 +17777 髸 +17778 髺 +17779 髼 +17780 髽 +17781 髾 +17782 髿 +17783 鬀 +17784 鬂 +17785 鬅 +17786 ㄓ +17787 τ +17788 ∮ +17789 ┯ +17790 ⒂ +17791 S +17792 佊 +17793 傆 +17794 冇 +17795 呌 +17796 囉 +17797 堄 +17798 娪 +17799 嵱 +17800 幱 +17801 徲 +17802 懹 +17803 撚 +17804 斢 +17805 栍 +17806 椨 +17807 櫽 +17808 氂 +17809 浻 +17810 溣 +17811 澯 +17812 炗 +17813 熡 +17814 觍 +17815 觺 +17816 觃 +17817 覿 +17818 觷 +17819 觹 +17820 觻 +17821 觽 +17822 觾 +17823 觿 +17824 訁 +17825 訃 +17826 訄 +17827 訆 +17828 訉 +17829 訋 +17830 訌 +17831 訍 +17832 訐 
+17833 訒 +17834 訔 +17835 訖 +17836 託 +17837 訛 +17838 訜 +17839 ︱ +17840 ◎ +17841 Ⅱ +17842 ヲ +17843 侐 +17844 凃 +17845 咈 +17846 嗱 +17847 囼 +17848 婒 +17849 嬺 +17850 庲 +17851 忩 +17852 愹 +17853 掤 +17854 擈 +17855 楎 +17856 欜 +17857 汄 +17858 涷 +17859 濖 +17860 烌 +17861 騸 +17862 騶 +17863 騕 +17864 騗 +17865 騵 +17866 虍 +17867 駺 +17868 騖 +17869 騹 +17870 騺 +17871 騻 +17872 騼 +17873 騿 +17874 驂 +17875 驄 +17876 驆 +17877 驇 +17878 驉 +17879 驌 +17880 驑 +17881 驒 +17882 驓 +17883 驔 +17884 驖 +17885 驘 +17886 ㄒ +17887 σ +17888 ∫ +17889 б +17890 ┮ +17891 ⒁ +17892 R +17893 佉 +17894 勔 +17895 囈 +17896 壱 +17897 娨 +17898 嬕 +17899 屢 +17900 嵰 +17901 幰 +17902 徱 +17903 愐 +17904 撘 +17905 斠 +17906 栆 +17907 椧 +17908 樢 +17909 櫼 +17910 氁 +17911 浺 +17912 溡 +17913 澮 +17914 炓 +17915 襚 +17916 襘 +17917 襼 +17918 襹 +17919 襙 +17920 襕 +17921 襗 +17922 襺 +17923 褸 +17924 襽 +17925 襾 +17926 覀 +17927 覂 +17928 覅 +17929 覇 +17930 覈 +17931 覉 +17932 覊 +17933 覌 +17934 覎 +17935 覐 +17936 覑 +17937 覒 +17938 覔 +17939 覕 +17940 覗 +17941 覘 +17942 覙 +17943 覛 +17944 覜 +17945 覝 +17946 覞 +17947 覟 +17948 覠 +17949 ︳ +17950 ◆ +17951 Ⅳ +17952 ヴ +17953 侓 +17954 勽 +17955 咊 +17956 嗶 +17957 圁 +17958 堲 +17959 婔 +17960 嬼 +17961 岕 +17962 嶔 +17963 庺 +17964 忯 +17965 愻 +17966 桇 +17967 楐 +17968 橍 +17969 欞 +17970 涺 +17971 鬬 +17972 鬪 +17973 糇 +17974 舭 +17975 鬫 +17976 舡 +17977 鬩 +17978 魗 +17979 魙 +17980 鬇 +17981 簦 +17982 鬨 +17983 舯 +17984 魛 +17985 魜 +17986 魝 +17987 魞 +17988 魠 +17989 魡 +17990 魢 +17991 魣 +17992 魤 +17993 魥 +17994 魦 +17995 魧 +17996 魨 +17997 魩 +17998 魪 +17999 魫 +18000 魬 +18001 魭 +18002 魮 +18003 魰 +18004 魲 +18005 魳 +18006 魴 +18007 魶 +18008 魸 +18009 魹 +18010 魺 +18011 ㄔ +18012 髟 +18013 υ +18014 ≡ +18015 г +18016 ┰ +18017 ⒃ +18018 T +18019 佋 +18020 傇 +18021 勗 +18022 呍 +18023 囋 +18024 壴 +18025 娫 +18026 屧 +18027 嵲 +18028 幵 +18029 愒 +18030 栐 +18031 椩 +18032 樤 +18033 櫾 +18034 氃 +18035 浽 +18036 溤 +18037 澰 +18038 熢 +18039 訿 +18040 詜 +18041 訽 +18042 訹 +18043 訞 +18044 詟 +18045 詤 +18046 詥 +18047 詧 +18048 詨 +18049 詪 +18050 詫 +18051 詬 +18052 詯 +18053 詴 +18054 詵 +18055 詶 +18056 詷 +18057 詸 +18058 詺 +18059 詻 +18060 詾 +18061 詿 +18062 ■ +18063 Ⅵ +18064 ヶ +18065 傱 +18066 咑 +18067 嗹 +18068 圅 +18069 夦 +18070 嬾 +18071 忲 +18072 愽 +18073 戹 +18074 掱 +18075 旜 +18076 曵 +18077 桍 +18078 橏 +18079 濚 +18080 烐 +18081 鯺 +18082 鰚 +18083 鯹 +18084 鰗 +18085 鰙 +18086 觯 +18087 鯸 +18088 鰛 +18089 鰜 +18090 鰝 +18091 鰞 +18092 鰠 +18093 鰡 +18094 鰢 +18095 鰦 +18096 鰧 +18097 鰨 +18098 鰪 +18099 鰮 +18100 鰯 +18101 鰰 +18102 鰳 +18103 鰴 +18104 鰵 +18105 鰷 +18106 鰹 +18107 鰺 +18108 ㄖ +18109 χ +18110 ≈ +18111 ┲ +18112 ⒅ +18113 V +18114 佒 +18115 冎 +18116 勚 +18117 呏 +18118 堉 +18119 屩 +18120 嵵 +18121 徶 +18122 捴 +18123 斨 +18124 栔 +18125 椫 +18126 樦 +18127 欀 +18128 浿 +18129 溨 +18130 澲 +18131 炛 +18132 熤 +18133 謃 +18134 謁 +18135 謢 +18136 謤 +18137 謥 +18138 謧 +18139 謩 +18140 謪 +18141 謮 +18142 謯 +18143 謰 +18144 謱 +18145 謵 +18146 謶 +18147 謷 +18148 謸 +18149 謺 +18150 謻 +18151 謼 +18152 謽 +18153 謾 +18154 謿 +18155 譀 +18156 譁 +18157 譂 +18158 譃 +18159 譄 +18160 黪 +18161 ︴ +18162 □ +18163 Ⅴ +18164 ヵ +18165 傰 +18166 匁 +18167 咍 +18168 嗸 +18169 圂 +18170 堳 +18171 夣 +18172 嬽 +18173 岝 +18174 嶕 +18175 庻 +18176 愼 +18177 掯 +18178 旛 +18179 桋 +18180 楑 +18181 橎 +18182 欟 +18183 汋 +18184 涻 +18185 滜 +18186 鮚 +18187 酲 +18188 鮺 +18189 鮸 +18190 鮗 +18191 鮙 +18192 鮷 +18193 鮹 +18194 酾 +18195 醵 +18196 魼 +18197 鮘 +18198 鮻 +18199 鮽 +18200 鮿 +18201 鯀 +18202 鯁 +18203 鯃 +18204 鯄 +18205 鯆 +18206 鯈 +18207 鯋 +18208 鯍 +18209 鯎 +18210 鯏 +18211 鯐 +18212 鯑 +18213 鯒 +18214 鯓 +18215 鯕 +18216 鯗 +18217 鯙 +18218 鯚 +18219 ㄕ +18220 φ +18221 ≌ +18222 д +18223 ┱ +18224 ⒄ +18225 U +18226 佌 +18227 傉 
+18228 冋 +18229 呎 +18230 喺 +18231 囌 +18232 堈 +18233 壵 +18234 娬 +18235 嬚 +18236 屨 +18237 嵳 +18238 幷 +18239 徴 +18240 愓 +18241 懻 +18242 捳 +18243 撜 +18244 斦 +18245 樥 +18246 氄 +18247 浾 +18248 溦 +18249 炚 +18250 熣 +18251 諂 +18252 諀 +18253 誟 +18254 誁 +18255 諃 +18256 諄 +18257 諅 +18258 諆 +18259 諉 +18260 諌 +18261 諍 +18262 諎 +18263 諑 +18264 諓 +18265 諔 +18266 諕 +18267 諗 +18268 諘 +18269 諙 +18270 諛 +18271 諝 +18272 諞 +18273 諟 +18274 諠 +18275 諡 +18276 諢 +18277 ▲ +18278 Ⅷ +18279 侙 +18280 傴 +18281 凐 +18282 匄 +18283 嗻 +18284 堷 +18285 夬 +18286 婙 +18287 孁 +18288 嶘 +18289 庿 +18290 忴 +18291 慀 +18292 掵 +18293 擑 +18294 桒 +18295 楕 +18296 橒 +18297 欨 +18298 涾 +18299 滧 +18300 濜 +18301 烒 +18302 燌 +18303 鴡 +18304 鴟 +18305 鳾 +18306 鴀 +18307 鴞 +18308 鴠 +18309 鳣 +18310 鴢 +18311 鴤 +18312 鴥 +18313 鴫 +18314 鴬 +18315 鴰 +18316 鴱 +18317 鴴 +18318 鴶 +18319 鴸 +18320 鴹 +18321 鴽 +18322 鴾 +18323 鵀 +18324 鵁 +18325 ㄘ +18326 ∝ +18327 ж +18328 ┴ +18329 X +18330 佖 +18331 傌 +18332 冐 +18333 勜 +18334 呚 +18335 嗀 +18336 囏 +18337 娯 +18338 嬝 +18339 嵷 +18340 庁 +18341 捸 +18342 撠 +18343 曍 +18344 栘 +18345 権 +18346 欂 +18347 氊 +18348 涁 +18349 澵 +18350 豟 +18351 豝 +18352 貇 +18353 貄 +18354 豞 +18355 丿 +18356 豙 +18357 豜 +18358 貃 +18359 貆 +18360 谸 +18361 丌 +18362 豛 +18363 乇 +18364 貎 +18365 貏 +18366 貑 +18367 貒 +18368 貕 +18369 貗 +18370 貙 +18371 貚 +18372 貛 +18373 貜 +18374 貟 +18375 貣 +18376 貤 +18377 貥 +18378 丶 +18379 篴 +18380 篈 +18381 觓 +18382 覣 +18383 竽 +18384 騛 +18385 褹 +18386 鬭 +18387 鬉 +18388 舄 +18389 鯝 +18390 謅 +18391 諥 +18392 鮝 +18393 魽 +18394 誂 +18395 酹 +18396 鴄 +18397 鳤 +18398 豠 +18399 谹 +18400 劐 +18401 羝 +18402 銎 +18403 劓 +18404 篵 +18405 骲 +18406 驜 +18407 觔 +18408 覤 +18409 騜 +18410 鬮 +18411 鬊 +18412 鯾 +18413 鮞 +18414 誃 +18415 鴅 +18416 谺 +18417 篶 +18418 篊 +18419 骳 +18420 觕 +18421 覥 +18422 騝 +18423 襝 +18424 鬰 +18425 鬋 +18426 訡 +18427 鯿 +18428 鯟 +18429 謈 +18430 鲧 +18431 魿 +18432 誧 +18433 誄 +18434 鴆 +18435 鳦 +18436 谻 +18437 篸 +18438 篋 +18439 骴 +18440 驞 +18441 觗 +18442 鬌 +18443 詃 +18444 鯠 +18445 謉 +18446 鮠 +18447 躔 +18448 豥 +18449 匦 +18450 篹 +18451 篍 +18452 骵 +18453 觘 +18454 騟 +18455 襡 +18456 褽 +18457 鬳 +18458 詄 +18459 絷 +18460 鰁 +18461 鋈 +18462 鮡 +18463 鮁 +18464 誩 +18465 誆 +18466 鴈 +18467 豦 +18468 谽 +18469 厣 +18470 骹 +18471 觙 +18472 騠 +18473 騀 +18474 襢 +18475 褾 +18476 鬴 +18477 鬎 +18478 詅 +18479 訤 +18480 鰂 +18481 鯢 +18482 謋 +18483 諪 +18484 谾 +18485 篻 +18486 篏 +18487 骻 +18488 驡 +18489 觛 +18490 覩 +18491 騡 +18492 襣 +18493 褿 +18494 鬵 +18495 鬐 +18496 鰃 +18497 謌 +18498 諫 +18499 鮣 +18500 鮃 +18501 鴊 +18502 鳪 +18503 篽 +18504 篐 +18505 骽 +18506 觝 +18507 騂 +18508 襤 +18509 襀 +18510 鬶 +18511 鬑 +18512 詇 +18513 訦 +18514 鰄 +18515 謍 +18516 鮤 +18517 鮄 +18518 誋 +18519 豩 +18520 豀 +18521 篿 +18522 篒 +18523 骾 +18524 驣 +18525 觟 +18526 騣 +18527 騃 +18528 襥 +18529 襂 +18530 鬷 +18531 鬒 +18532 詉 +18533 訧 +18534 敉 +18535 纛 +18536 鰅 +18537 鯥 +18538 鐾 +18539 鮅 +18540 蹯 +18541 鴌 +18542 鳬 +18543 豂 +18544 篔 +18545 骿 +18546 騤 +18547 騄 +18548 襧 +18549 襃 +18550 訨 +18551 鰆 +18552 鯦 +18553 謏 +18554 鮆 +18555 誮 +18556 鴍 +18557 鳭 +18558 豭 +18559 豃 +18560 篕 +18561 髃 +18562 驥 +18563 觡 +18564 覭 +18565 騅 +18566 襅 +18567 鬹 +18568 鬕 +18569 詋 +18570 訩 +18571 鰇 +18572 諯 +18573 鮧 +18574 鮇 +18575 誎 +18576 鴎 +18577 鳮 +18578 豮 +18579 豄 +18580 簂 +18581 篖 +18582 驦 +18583 騦 +18584 騆 +18585 襆 +18586 鬺 +18587 鬖 +18588 詌 +18589 謑 +18590 諰 +18591 誏 +18592 鴏 +18593 豯 +18594 簃 +18595 髆 +18596 驧 +18597 觤 +18598 騧 +18599 鬽 +18600 鬗 +18601 詍 +18602 誱 +18603 誐 +18604 鴐 +18605 豰 +18606 簄 +18607 篘 +18608 髇 +18609 觧 +18610 覰 +18611 篑 +18612 笱 +18613 騨 +18614 騈 +18615 襫 +18616 襈 +18617 鬾 +18618 詎 +18619 訬 +18620 鰊 +18621 謓 +18622 諲 
+18623 鮊 +18624 誑 +18625 鴑 +18626 鳱 +18627 豱 +18628 篛 +18629 驩 +18630 篚 +18631 騩 +18632 騉 +18633 襬 +18634 鬿 +18635 鬙 +18636 詏 +18637 艉 +18638 鰋 +18639 鯫 +18640 謔 +18641 諳 +18642 誳 +18643 誒 +18644 鹾 +18645 躜 +18646 鳲 +18647 豲 +18648 刂 +18649 簆 +18650 篜 +18651 觩 +18652 騪 +18653 騊 +18654 襭 +18655 魀 +18656 訮 +18657 鰌 +18658 鯬 +18659 謕 +18660 諴 +18661 鮌 +18662 誴 +18663 誔 +18664 鴓 +18665 豍 +18666 簈 +18667 篞 +18668 髊 +18669 觪 +18670 騋 +18671 襮 +18672 襋 +18673 鬛 +18674 詑 +18675 訯 +18676 鯭 +18677 諵 +18678 誵 +18679 豵 +18680 簉 +18681 篟 +18682 髍 +18683 驲 +18684 觬 +18685 覴 +18686 騬 +18687 襌 +18688 魊 +18689 詒 +18690 訰 +18691 鯮 +18692 諶 +18693 鮎 +18694 誶 +18695 誖 +18696 鳵 +18697 豶 +18698 簊 +18699 篠 +18700 觭 +18701 覵 +18702 騭 +18703 騍 +18704 襰 +18705 襍 +18706 詓 +18707 鰏 +18708 鮯 +18709 鮏 +18710 誷 +18711 豷 +18712 簍 +18713 篢 +18714 觮 +18715 襎 +18716 魌 +18717 誸 +18718 豻 +18719 簎 +18720 髐 +18721 骍 +18722 覷 +18723 簏 +18724 鬠 +18725 糈 +18726 鯱 +18727 諹 +18728 鮱 +18729 踣 +18730 鴘 +18731 鳸 +18732 豼 +18733 豒 +18734 刳 +18735 簐 +18736 骎 +18737 騐 +18738 襳 +18739 襐 +18740 魐 +18741 鬡 +18742 詖 +18743 鰒 +18744 鮲 +18745 誚 +18746 鴙 +18747 豽 +18748 豓 +18749 簑 +18750 篧 +18751 骔 +18752 觲 +18753 覹 +18754 騱 +18755 騑 +18756 襴 +18757 襑 +18758 魒 +18759 鰓 +18760 謜 +18761 鮳 +18762 鮓 +18763 鴚 +18764 鳺 +18765 豾 +18766 簒 +18767 篨 +18768 騲 +18769 騒 +18770 襵 +18771 襒 +18772 魓 +18773 鬤 +18774 詘 +18775 鰔 +18776 鯴 +18777 鮴 +18778 鮔 +18779 誜 +18780 鳻 +18781 豿 +18782 骙 +18783 觵 +18784 覻 +18785 簖 +18786 魕 +18787 訷 +18788 鰕 +18789 鯵 +18790 諽 +18791 鮕 +18792 誽 +18793 鴜 +18794 鳼 +18795 貀 +18796 豗 +18797 簔 +18798 篫 +18799 髗 +18800 觶 +18801 覼 +18802 騔 +18803 魖 +18804 鬦 +18805 訸 +18806 謟 +18807 鮶 +18808 鮖 +18809 誾 +18810 貁 +18811 豘 +18812 酤 +18813 鳋 +18814 觜 +18815 籂 +18816 覡 +18817 魻 +18818 誀 +18819 譅 +18820 鵂 +18821 貭 +18822 磒 +18823 ╬ +18824 譸 +18825 琾 +18826 皃 +18827 祊 +18828 痯 +18829 璸 +18830 穚 +18831 竝 +18832 玴 +18833 籶 +18834 秔 +18835 ﹑ +18836 畃 +18837 眕 +18838 俻 +18839 僷 +18840 刾 +18841 唒 +18842 噋 +18843 坧 +18844 塸 +18845 妏 +18846 媝 +18847 宲 +18848 峱 +18849 弍 +18850 恜 +18851 憄 +18852 抪 +18853 昿 +18854 杙 +18855 檖 +18856 歱 +18857 漰 +18858 瀙 +18859 焢 +18860 碢 +18861 ≒ +18862 鱌 +18863 譖 +18864 癙 +18865 礟 +18866 疨 +18867 璓 +18868 縋 +18869 稰 +18870 窹 +18871 玃 +18872 籔 +18873 禤 +18874 ㏄ +18875 丳 +18876 侾 +18877 働 +18878 匬 +18879 哖 +18880 嘝 +18881 圥 +18882 塒 +18883 奝 +18884 婸 +18885 孭 +18886 峆 +18887 嶱 +18888 怭 +18889 慞 +18890 扨 +18891 擯 +18892 朠 +18893 桺 +18894 楶 +18895 歅 +18896 汸 +18897 淧 +18898 漃 +18899 濸 +18900 烶 +18901 磓 +18902 譹 +18903 皅 +18904 祋 +18905 痲 +18906 璹 +18907 縬 +18908 穛 +18909 竡 +18910 玵 +18911 籷 +18912 秖 +18913 ﹒ +18914 畄 +18915 眖 +18916 乹 +18917 俼 +18918 僸 +18919 剄 +18920 卶 +18921 唓 +18922 噏 +18923 坬 +18924 塹 +18925 妐 +18926 宷 +18927 峲 +18928 巕 +18929 恞 +18930 憅 +18931 抭 +18932 搎 +18933 攓 +18934 晀 +18935 杚 +18936 榪 +18937 檘 +18938 泀 +18939 渜 +18940 瀜 +18941 焣 +18942 碤 +18943 ≦ +18944 鱍 +18945 癚 +18946 礠 +18947 疩 +18948 璔 +18949 縌 +18950 玅 +18951 籕 +18952 ㏎ +18953 盦 +18954 丵 +18955 俀 +18956 僎 +18957 凲 +18958 匭 +18959 哘 +18960 嘠 +18961 圦 +18962 塓 +18963 奞 +18964 婹 +18965 孮 +18966 峇 +18967 嶲 +18968 廞 +18969 怮 +18970 扱 +18971 擰 +18972 昋 +18973 朡 +18974 桻 +18975 楺 +18976 歈 +18977 漄 +18978 濹 +18979 烸 +18980 磖 +18981 ╮ +18982 鱮 +18983 譺 +18984 皉 +18985 祌 +18986 璻 +18987 縭 +18988 穜 +18989 竢 +18990 玶 +18991 籸 +18992 秗 +18993 ﹔ +18994 畆 +18995 眗 +18996 乺 +18997 俽 +18998 剅 +18999 卹 +19000 唕 +19001 坮 +19002 塺 +19003 妑 +19004 媟 +19005 宺 +19006 弐 +19007 恟 +19008 憆 +19009 抮 +19010 搑 +19011 杛 +19012 梤 +19013 榬 +19014 檙 +19015 渞 +19016 漴 +19017 焤 
+19018 碦 +19019 ≧ +19020 鱎 +19021 琑 +19022 癛 +19023 礡 +19024 疪 +19025 稲 +19026 窻 +19027 玆 +19028 籖 +19029 ㏑ +19030 俁 +19031 僐 +19032 凴 +19033 哛 +19034 嘡 +19035 圧 +19036 塕 +19037 奟 +19038 婻 +19039 峈 +19040 嶳 +19041 怰 +19042 慠 +19043 扲 +19044 揜 +19045 昍 +19046 朢 +19047 桼 +19048 楻 +19049 歊 +19050 汻 +19051 漅 +19052 磗 +19053 鱯 +19054 瑂 +19055 皊 +19056 痵 +19057 穝 +19058 竤 +19059 玸 +19060 籹 +19061 秙 +19062 ﹕ +19063 畇 +19064 乻 +19065 俿 +19066 僺 +19067 剆 +19068 唖 +19069 噑 +19070 坰 +19071 塻 +19072 妔 +19073 媠 +19074 宻 +19075 巗 +19076 恠 +19077 憇 +19078 抯 +19079 搒 +19080 攕 +19081 晄 +19082 杝 +19083 梥 +19084 檚 +19085 歴 +19086 焥 +19087 碨 +19088 ⊿ +19089 鱏 +19090 琒 +19091 癝 +19092 璖 +19093 縎 +19094 稴 +19095 窼 +19096 玈 +19097 籗 +19098 ㏒ +19099 盨 +19100 凷 +19101 匰 +19102 哠 +19103 圫 +19104 塖 +19105 孲 +19106 峉 +19107 嶴 +19108 怱 +19109 慡 +19110 扴 +19111 揝 +19112 朣 +19113 桽 +19114 橲 +19115 歋 +19116 汼 +19117 烻 +19118 ㄟ +19119 ∵ +19120 ⑦ +19121 _ +19122 佭 +19123 傔 +19124 冞 +19125 勥 +19126 呥 +19127 嗊 +19128 囘 +19129 堖 +19130 夁 +19131 娺 +19132 屵 +19133 嵾 +19134 庍 +19135 愡 +19136 戇 +19137 撨 +19138 斶 +19139 曔 +19140 栠 +19141 椷 +19142 樳 +19143 欉 +19144 氝 +19145 涍 +19146 溸 +19147 澾 +19148 炦 +19149 熯 +19150 擗 +19151 攥 +19152 遾 +19153 擐 +19154 擤 +19155 邆 +19156 邇 +19157 邉 +19158 邌 +19159 邍 +19160 邎 +19161 邐 +19162 邒 +19163 邔 +19164 邖 +19165 邘 +19166 邚 +19167 邜 +19168 邞 +19169 邟 +19170 邠 +19171 邤 +19172 邥 +19173 邧 +19174 邫 +19175 邭 +19176 邲 +19177 邷 +19178 邼 +19179 邽 +19180 邿 +19181 遖 +19182 逜 +19183 哜 +19184 吣 +19185 遚 +19186 逤 +19187 逥 +19188 遝 +19189 逧 +19190 遟 +19191 逩 +19192 逪 +19193 遡 +19194 逫 +19195 遤 +19196 逬 +19197 遦 +19198 逰 +19199 遧 +19200 遪 +19201 逳 +19202 遫 +19203 逴 +19204 唪 +19205 咴 +19206 啧 +19207 遬 +19208 逷 +19209 遯 +19210 逹 +19211 遰 +19212 逺 +19213 遱 +19214 逽 +19215 逿 +19216 遳 +19217 遀 +19218 遶 +19219 遆 +19220 遈 +19221 啐 +19222 郀 +19223 磘 +19224 譼 +19225 瑃 +19226 皌 +19227 縯 +19228 穞 +19229 竧 +19230 秚 +19231 ﹖ +19232 畉 +19233 乼 +19234 倀 +19235 僼 +19236 卼 +19237 唗 +19238 噒 +19239 坱 +19240 塼 +19241 妕 +19242 媡 +19243 宼 +19244 峵 +19245 巘 +19246 弔 +19247 恡 +19248 憈 +19249 抰 +19250 搕 +19251 攖 +19252 晅 +19253 杢 +19254 梩 +19255 榯 +19256 檛 +19257 泃 +19258 渢 +19259 焧 +19260 ═ +19261 鱐 +19262 琓 +19263 癟 +19264 礣 +19265 疶 +19266 璗 +19267 縏 +19268 稵 +19269 窽 +19270 玊 +19271 ㏕ +19272 乀 +19273 僒 +19274 凾 +19275 圱 +19276 奣 +19277 婽 +19278 孴 +19279 峊 +19280 嶵 +19281 廡 +19282 怲 +19283 扵 +19284 揟 +19285 朤 +19286 楾 +19287 橳 +19288 歍 +19289 汿 +19290 淭 +19291 濼 +19292 烼 +19293 ╱ +19294 皍 +19295 祏 +19296 痷 +19297 璾 +19298 縰 +19299 穟 +19300 竨 +19301 玼 +19302 籾 +19303 秛 +19304 ﹗ +19305 眜 +19306 乽 +19307 倁 +19308 剈 +19309 卽 +19310 唘 +19311 媢 +19312 寀 +19313 巙 +19314 弖 +19315 憉 +19316 抲 +19317 攗 +19318 晆 +19319 杣 +19320 梪 +19321 榰 +19322 泆 +19323 瀠 +19324 碪 +19325 ║ +19326 鱑 +19327 譛 +19328 琔 +19329 癠 +19330 礥 +19331 疷 +19332 縐 +19333 稶 +19334 窾 +19335 玌 +19336 籙 +19337 ︰ +19338 甎 +19339 乁 +19340 俇 +19341 刄 +19342 匲 +19343 哢 +19344 嘦 +19345 圲 +19346 塙 +19347 婾 +19348 峌 +19349 怳 +19350 慤 +19351 扷 +19352 揢 +19353 昒 +19354 楿 +19355 橴 +19356 沀 +19357 淯 +19358 漊 +19359 濽 +19360 烾 +19361 → +19362 侜 +19363 傶 +19364 凓 +19365 匉 +19366 咜 +19367 嗿 +19368 堹 +19369 婜 +19370 岤 +19371 嶛 +19372 忷 +19373 慂 +19374 扂 +19375 旡 +19376 曻 +19377 桗 +19378 楘 +19379 橔 +19380 欪 +19381 汑 +19382 淂 +19383 滫 +19384 鷃 +19385 鷁 +19386 鷡 +19387 鷀 +19388 鶣 +19389 鶿 +19390 鷢 +19391 鷤 +19392 鷧 +19393 鷩 +19394 鷪 +19395 鷫 +19396 鷬 +19397 鷭 +19398 鷰 +19399 鷳 +19400 鷴 +19401 鷵 +19402 鷷 +19403 鷽 +19404 鷾 +19405 鸀 +19406 鸁 +19407 ㄚ +19408 ≮ +19409 ② +19410 Z +19411 傏 +19412 冓 
+19413 呞 +19414 堏 +19415 壼 +19416 娳 +19417 嬟 +19418 屭 +19419 徻 +19420 愙 +19421 戁 +19422 捼 +19423 撢 +19424 斱 +19425 曏 +19426 栚 +19427 椱 +19428 涄 +19429 溭 +19430 澸 +19431 炡 +19432 赻 +19433 赹 +19434 赱 +19435 赸 +19436 趠 +19437 贎 +19438 讠 +19439 赲 +19440 趢 +19441 趤 +19442 趦 +19443 趧 +19444 趩 +19445 趪 +19446 趫 +19447 趬 +19448 趭 +19449 趮 +19450 趯 +19451 趰 +19452 趲 +19453 趶 +19454 趷 +19455 趹 +19456 趻 +19457 趽 +19458 跀 +19459 跁 +19460 跇 +19461 跈 +19462 跉 +19463 跍 +19464 跒 +19465 跓 +19466 ※ +19467 Ⅸ +19468 侚 +19469 凒 +19470 匇 +19471 嗼 +19472 堸 +19473 夰 +19474 婛 +19475 孂 +19476 嶚 +19477 廀 +19478 忶 +19479 慁 +19480 戼 +19481 擓 +19482 旟 +19483 曺 +19484 桖 +19485 楖 +19486 橓 +19487 欩 +19488 汏 +19489 淁 +19490 濝 +19491 烓 +19492 燍 +19493 鵿 +19494 鵞 +19495 鵾 +19496 鶀 +19497 鵃 +19498 鶂 +19499 鶄 +19500 鶅 +19501 鶆 +19502 鶇 +19503 鶊 +19504 鶋 +19505 鶏 +19506 鶑 +19507 鶓 +19508 鶔 +19509 鶕 +19510 鶖 +19511 鶙 +19512 鶚 +19513 鶜 +19514 鶞 +19515 鶠 +19516 鶡 +19517 ㄙ +19518 ≠ +19519 з +19520 ┵ +19521 Y +19522 傎 +19523 冑 +19524 呝 +19525 嗁 +19526 囐 +19527 堎 +19528 壻 +19529 娰 +19530 嵸 +19531 庂 +19532 徺 +19533 愘 +19534 捹 +19535 撡 +19536 斮 +19537 栙 +19538 椯 +19539 樫 +19540 欃 +19541 涃 +19542 澷 +19543 炠 +19544 熧 +19545 賎 +19546 侔 +19547 賍 +19548 賋 +19549 賩 +19550 賫 +19551 貮 +19552 賮 +19553 賯 +19554 賰 +19555 賱 +19556 賲 +19557 賳 +19558 賵 +19559 賶 +19560 賷 +19561 賸 +19562 賹 +19563 賻 +19564 賾 +19565 賿 +19566 贁 +19567 贋 +19568 ← +19569 Ⅺ +19570 { +19571 侞 +19572 匊 +19573 咞 +19574 嘂 +19575 圎 +19576 夳 +19577 婝 +19578 孄 +19579 岥 +19580 嶜 +19581 忹 +19582 扄 +19583 掻 +19584 擕 +19585 旣 +19586 桘 +19587 楙 +19588 橕 +19589 欫 +19590 汒 +19591 淃 +19592 濢 +19593 烕 +19594 鸴 +19595 鸧 +19596 鸃 +19597 麁 +19598 麃 +19599 麄 +19600 麅 +19601 麆 +19602 麉 +19603 麊 +19604 麌 +19605 麍 +19606 麎 +19607 麏 +19608 麐 +19609 麑 +19610 麔 +19611 麕 +19612 麖 +19613 麘 +19614 麙 +19615 麚 +19616 麛 +19617 麜 +19618 麞 +19619 麠 +19620 麡 +19621 麢 +19622 麣 +19623 麤 +19624 麧 +19625 麨 +19626 ㄛ +19627 ≯ +19628 й +19629 ┷ +19630 ③ +19631 [ +19632 佦 +19633 傐 +19634 冔 +19635 勠 +19636 呟 +19637 嗃 +19638 囒 +19639 堐 +19640 嬠 +19641 屰 +19642 嵺 +19643 庅 +19644 徾 +19645 戂 +19646 捽 +19647 斲 +19648 曐 +19649 栛 +19650 椲 +19651 樭 +19652 欅 +19653 氎 +19654 涆 +19655 溮 +19656 澺 +19657 熪 +19658 踾 +19659 踻 +19660 郐 +19661 踎 +19662 踇 +19663 踋 +19664 踼 +19665 跕 +19666 踈 +19667 踿 +19668 蹃 +19669 蹅 +19670 蹆 +19671 蹌 +19672 蹍 +19673 蹎 +19674 蹏 +19675 蹔 +19676 蹕 +19677 蹖 +19678 蹗 +19679 蹘 +19680 蹚 +19681 蹛 +19682 蹜 +19683 蹝 +19684 蹞 +19685 蹡 +19686 蹢 +19687 蹧 +19688 蹨 +19689 蹫 +19690 ↑ +19691 Ⅻ +19692 | +19693 侟 +19694 傸 +19695 凕 +19696 匋 +19697 咟 +19698 堻 +19699 夵 +19700 岦 +19701 嶞 +19702 廃 +19703 忺 +19704 扅 +19705 掽 +19706 擖 +19707 旤 +19708 朁 +19709 桙 +19710 楛 +19711 橖 +19712 汓 +19713 淈 +19714 滭 +19715 濣 +19716 烖 +19717 燑 +19718 鼅 +19719 鼃 +19720 黖 +19721 黓 +19722 鼂 +19723 鼄 +19724 麫 +19725 鼆 +19726 鼇 +19727 鼈 +19728 鼉 +19729 鼊 +19730 鼌 +19731 鼏 +19732 鼑 +19733 鼒 +19734 鼔 +19735 鼕 +19736 鼖 +19737 鼘 +19738 鼚 +19739 鼛 +19740 鼜 +19741 鼝 +19742 鼟 +19743 鼡 +19744 鼣 +19745 鼥 +19746 鼦 +19747 鼧 +19748 鼪 +19749 鼫 +19750 鼮 +19751 ㄜ +19752 ≤ +19753 к +19754 ┸ +19755 ④ +19756 \ +19757 佨 +19758 呠 +19759 囓 +19760 堒 +19761 娷 +19762 庈 +19763 徿 +19764 戃 +19765 捾 +19766 斳 +19767 曑 +19768 椳 +19769 樮 +19770 欆 +19771 氒 +19772 溰 +19773 澻 +19774 熫 +19775 躟 +19776 躝 +19777 堋 +19778 躿 +19779 堙 +19780 墚 +19781 堍 +19782 埽 +19783 躙 +19784 軃 +19785 軄 +19786 軆 +19787 軉 +19788 軐 +19789 軓 +19790 軔 +19791 軕 +19792 軗 +19793 軘 +19794 軚 +19795 軞 +19796 軡 +19797 軣 +19798 鶤 +19799 赼 +19800 卩 +19801 阝 +19802 阢 +19803 鵄 +19804 賏 +19805 汆 +19806 馘 +19807 鸻 
+19808 踑 +19809 跘 +19810 坫 +19811 躠 +19812 蹵 +19813 塥 +19814 芰 +19815 苊 +19816 冁 +19817 鶥 +19818 赽 +19819 贐 +19820 鵥 +19821 鵅 +19822 鸼 +19823 鸅 +19824 踒 +19825 跙 +19826 黚 +19827 麭 +19828 躡 +19829 蹷 +19830 鷆 +19831 赾 +19832 贑 +19833 谇 +19834 鵆 +19835 賑 +19836 鸆 +19837 踓 +19838 跜 +19839 麮 +19840 蹸 +19841 鶧 +19842 赿 +19843 陴 +19844 鵧 +19845 貲 +19846 踕 +19847 垧 +19848 黡 +19849 麯 +19850 躣 +19851 鶨 +19852 趀 +19853 贓 +19854 鵨 +19855 鵈 +19856 勹 +19857 鹐 +19858 鸈 +19859 坶 +19860 凵 +19861 廴 +19862 黣 +19863 麰 +19864 躤 +19865 蹺 +19866 鶩 +19867 趂 +19868 贔 +19869 鵩 +19870 鵉 +19871 賔 +19872 鹒 +19873 鸉 +19874 踗 +19875 跢 +19876 黤 +19877 躥 +19878 蹻 +19879 鶪 +19880 趃 +19881 鵊 +19882 賕 +19883 貵 +19884 鹓 +19885 踘 +19886 黦 +19887 躦 +19888 蹽 +19889 鷋 +19890 鶫 +19891 趆 +19892 鵋 +19893 賖 +19894 跦 +19895 麳 +19896 躧 +19897 蹾 +19898 鷌 +19899 鶬 +19900 趇 +19901 贗 +19902 賗 +19903 亠 +19904 鹖 +19905 鸌 +19906 跧 +19907 垲 +19908 黫 +19909 躨 +19910 鷍 +19911 鶭 +19912 趈 +19913 贘 +19914 鵭 +19915 賘 +19916 鹙 +19917 鸍 +19918 踛 +19919 跩 +19920 黬 +19921 麶 +19922 躩 +19923 躂 +19924 鷎 +19925 鶮 +19926 趉 +19927 鵎 +19928 賙 +19929 貹 +19930 鹝 +19931 鸎 +19932 踜 +19933 跭 +19934 黭 +19935 麷 +19936 躃 +19937 鷏 +19938 趌 +19939 鵯 +19940 鵏 +19941 貺 +19942 鹟 +19943 踠 +19944 跮 +19945 黮 +19946 麹 +19947 躭 +19948 躄 +19949 鷐 +19950 趍 +19951 鵐 +19952 賛 +19953 鸐 +19954 踡 +19955 跰 +19956 黰 +19957 麺 +19958 鶱 +19959 趎 +19960 贜 +19961 踤 +19962 跱 +19963 黱 +19964 麼 +19965 躰 +19966 趏 +19967 鵒 +19968 賝 +19969 裒 +19970 僦 +19971 鹢 +19972 鸒 +19973 踥 +19974 跲 +19975 墼 +19976 黲 +19977 麿 +19978 躱 +19979 鶳 +19980 鵓 +19981 鹥 +19982 鸓 +19983 踦 +19984 跴 +19985 黳 +19986 黀 +19987 鷔 +19988 趒 +19989 赒 +19990 鵴 +19991 鵔 +19992 賟 +19993 鸔 +19994 黁 +19995 躋 +19996 鷕 +19997 鶵 +19998 趓 +19999 赗 +20000 鵕 +20001 鸕 +20002 踨 +20003 黵 +20004 黂 +20005 躌 +20006 鷖 +20007 鵶 +20008 鵖 +20009 鹲 +20010 鸖 +20011 黶 +20012 躶 +20013 趖 +20014 赥 +20015 鵷 +20016 鸗 +20017 踭 +20018 跿 +20019 黷 +20020 黅 +20021 躷 +20022 躎 +20023 鷘 +20024 鶸 +20025 趗 +20026 赨 +20027 谫 +20028 鵸 +20029 鵘 +20030 氽 +20031 冱 +20032 鸘 +20033 踰 +20034 踀 +20035 埯 +20036 黸 +20037 黆 +20038 躸 +20039 躑 +20040 茳 +20041 鷙 +20042 鶹 +20043 趘 +20044 赩 +20045 鵹 +20046 鵙 +20047 鸙 +20048 踲 +20049 黺 +20050 黇 +20051 躹 +20052 躒 +20053 鷚 +20054 鵺 +20055 鵚 +20056 賥 +20057 賅 +20058 鹷 +20059 踳 +20060 黽 +20061 躻 +20062 躓 +20063 鷛 +20064 趚 +20065 赬 +20066 鵻 +20067 鹸 +20068 踃 +20069 黊 +20070 躼 +20071 鷜 +20072 鶼 +20073 趛 +20074 赮 +20075 鵜 +20076 賧 +20077 鸜 +20078 踶 +20079 踄 +20080 鼀 +20081 黋 +20082 躖 +20083 鷝 +20084 鶽 +20085 趜 +20086 赯 +20087 鵽 +20088 賨 +20089 鹺 +20090 踷 +20091 踆 +20092 鼁 +20093 黌 +20094 躾 +20095 跔 +20096 鶢 +20097 麪 +20098 蹱 +20099 軤 +20100 ╲ +20101 譾 +20102 皏 +20103 祐 +20104 痸 +20105 穠 +20106 玽 +20107 籿 +20108 秜 +20109 ﹙ +20110 畍 +20111 眝 +20112 乿 +20113 倂 +20114 僾 +20115 剉 +20116 卾 +20117 唙 +20118 噕 +20119 坴 +20120 塿 +20121 妚 +20122 媣 +20123 寁 +20124 峷 +20125 巚 +20126 弙 +20127 恦 +20128 抳 +20129 攙 +20130 晇 +20131 杤 +20132 梫 +20133 榲 +20134 檝 +20135 渧 +20136 漹 +20137 瀡 +20138 焩 +20139 碫 +20140 ╒ +20141 琕 +20142 疺 +20143 縑 +20144 稸 +20145 籚 +20146 禫 +20147 ¬ +20148 甐 +20149 盫 +20150 俈 +20151 僔 +20152 刅 +20153 匳 +20154 哣 +20155 嘨 +20156 圴 +20157 奦 +20158 媀 +20159 峍 +20160 怴 +20161 慥 +20162 扸 +20163 揤 +20164 擵 +20165 昖 +20166 梀 +20167 橵 +20168 歏 +20169 淰 +20170 漋 +20171 磜 +20172 ╳ +20173 鱳 +20174 譿 +20175 瑆 +20176 祑 +20177 瓀 +20178 縲 +20179 穡 +20180 粀 +20181 秝 +20182 ﹚ +20183 畐 +20184 倃 +20185 僿 +20186 厀 +20187 唚 +20188 噖 +20189 墂 +20190 妛 +20191 媤 +20192 寃 +20193 峸 +20194 巜 +20195 弚 +20196 恮 +20197 抴 +20198 搘 +20199 晈 +20200 梬 +20201 榳 +20202 渨 
+20203 漺 +20204 焪 +20205 碬 +20206 ╓ +20207 譝 +20208 琖 +20209 礧 +20210 疻 +20211 璚 +20212 稺 +20213 竁 +20214 籛 +20215 禬 +20216 ¦ +20217 甒 +20218 盬 +20219 乄 +20220 俉 +20221 刉 +20222 匴 +20223 哤 +20224 圵 +20225 塛 +20226 峎 +20227 廤 +20228 怶 +20229 慦 +20230 扺 +20231 揥 +20232 昗 +20233 朩 +20234 梂 +20235 橶 +20236 漌 +20237 濿 +20238 焀 +20239 磝 +20240 ▁ +20241 鱴 +20242 瑇 +20243 皒 +20244 祒 +20245 痻 +20246 瓁 +20247 縳 +20248 竫 +20249 粁 +20250 秞 +20251 ﹛ +20252 眡 +20253 倄 +20254 厁 +20255 唜 +20256 噚 +20257 坸 +20258 墄 +20259 妜 +20260 媥 +20261 寈 +20262 峹 +20263 巟 +20264 弜 +20265 恱 +20266 憍 +20267 抶 +20268 搙 +20269 攛 +20270 杧 +20271 梮 +20272 榵 +20273 檟 +20274 歺 +20275 泋 +20276 渪 +20277 漻 +20278 瀤 +20279 焫 +20280 ╔ +20281 琗 +20282 礨 +20283 璛 +20284 縓 +20285 稾 +20286 竂 +20287 玐 +20288 禭 +20289 甔 +20290 盭 +20291 乆 +20292 俋 +20293 僗 +20294 刋 +20295 匵 +20296 哫 +20297 嘪 +20298 圶 +20299 塜 +20300 奨 +20301 媂 +20302 孹 +20303 峏 +20304 廥 +20305 怷 +20306 扻 +20307 昘 +20308 梄 +20309 橷 +20310 歑 +20311 沊 +20312 淴 +20313 漍 +20314 瀀 +20315 焁 +20316 磞 +20317 ▂ +20318 讁 +20319 皔 +20320 祔 +20321 痽 +20322 瓂 +20323 穣 +20324 珁 +20325 粂 +20326 秠 +20327 ﹜ +20328 畒 +20329 眣 +20330 倅 +20331 剏 +20332 厃 +20333 唝 +20334 噛 +20335 坹 +20336 墆 +20337 媦 +20338 寉 +20339 峺 +20340 巠 +20341 弝 +20342 恲 +20343 抷 +20344 搚 +20345 晊 +20346 杫 +20347 梱 +20348 榶 +20349 檡 +20350 歽 +20351 泍 +20352 瀥 +20353 焬 +20354 碮 +20355 ╕ +20356 譟 +20357 琘 +20358 礩 +20359 痀 +20360 璝 +20361 縔 +20362 竃 +20363 籝 +20364 ℡ +20365 盰 +20366 乊 +20367 俌 +20368 僘 +20369 刌 +20370 匶 +20371 哬 +20372 嘫 +20373 圷 +20374 奩 +20375 媃 +20376 孻 +20377 峐 +20378 嶻 +20379 廦 +20380 怸 +20381 慪 +20382 扽 +20383 揧 +20384 擸 +20385 昚 +20386 朰 +20387 梇 +20388 橸 +20389 沋 +20390 漎 +20391 瀁 +20392 焂 +20393 ↓ +20394 } +20395 傹 +20396 匌 +20397 咠 +20398 嘄 +20399 堼 +20400 夶 +20401 婟 +20402 孆 +20403 嶟 +20404 廄 +20405 忼 +20406 慅 +20407 扆 +20408 掿 +20409 擙 +20410 旪 +20411 朂 +20412 桚 +20413 楜 +20414 橗 +20415 欭 +20416 滮 +20417 烗 +20418 齕 +20419 齗 +20420 齵 +20421 鼲 +20422 齖 +20423 齹 +20424 齺 +20425 齻 +20426 齼 +20427 齽 +20428 齾 +20429 龂 +20430 龎 +20431 龏 +20432 龒 +20433 龓 +20434 龖 +20435 龗 +20436 龝 +20437 龞 +20438 龡 +20439 郎 +20440 凉 +20441 裏 +20442 ㄝ +20443 鬏 +20444 ≥ +20445 л +20446 ぽ +20447 ⑤ +20448 ] +20449 佪 +20450 呡 +20451 娸 +20452 屳 +20453 嵼 +20454 庉 +20455 忀 +20456 愝 +20457 戄 +20458 捿 +20459 撦 +20460 斴 +20461 曒 +20462 椵 +20463 樰 +20464 欇 +20465 涊 +20466 溳 +20467 熭 +20468 輅 +20469 莰 +20470 輣 +20471 輀 +20472 輂 +20473 輠 +20474 輢 +20475 荮 +20476 軥 +20477 輁 +20478 輥 +20479 輦 +20480 輧 +20481 輫 +20482 輬 +20483 輭 +20484 輮 +20485 輲 +20486 輳 +20487 輴 +20488 輵 +20489 輶 +20490 輹 +20491 輽 +20492 轀 +20493 轃 +20494 菥 +20495 莶 +20496 齛 +20497 鼳 +20498 齜 +20499 齝 +20500 鼵 +20501 輈 +20502 軨 +20503 鼶 +20504 軩 +20505 齟 +20506 鼸 +20507 輊 +20508 萆 +20509 鼺 +20510 輌 +20511 軬 +20512 齢 +20513 齣 +20514 輎 +20515 軮 +20516 齤 +20517 齁 +20518 輏 +20519 齥 +20520 齂 +20521 輐 +20522 軰 +20523 齃 +20524 輑 +20525 軱 +20526 齧 +20527 齅 +20528 輒 +20529 軲 +20530 齨 +20531 齆 +20532 軳 +20533 齇 +20534 軴 +20535 蔌 +20536 齈 +20537 齫 +20538 齉 +20539 輖 +20540 齬 +20541 軷 +20542 輘 +20543 齮 +20544 齌 +20545 輙 +20546 軹 +20547 齯 +20548 齍 +20549 輚 +20550 軺 +20551 葙 +20552 蓰 +20553 蒇 +20554 蒈 +20555 齰 +20556 齎 +20557 齱 +20558 齏 +20559 蔟 +20560 齴 +20561 軿 +20562 蒉 +20563 隣 +20564 磟 +20565 ▃ +20566 瑉 +20567 皕 +20568 痾 +20569 穤 +20570 竮 +20571 珃 +20572 粃 +20573 秡 +20574 ﹝ +20575 眤 +20576 亃 +20577 剒 +20578 厇 +20579 噝 +20580 坺 +20581 墇 +20582 寊 +20583 峼 +20584 弞 +20585 恴 +20586 抸 +20587 搝 +20588 晍 +20589 杬 +20590 梲 +20591 歾 +20592 泎 +20593 渮 +20594 漽 +20595 焭 +20596 碯 +20597 ╖ 
+20598 譠 +20599 琙 +20600 癦 +20601 痁 +20602 縕 +20603 竄 +20604 籞 +20605 ㈱ +20606 甖 +20607 盳 +20608 乑 +20609 僙 +20610 刏 +20611 哯 +20612 圸 +20613 塟 +20614 孼 +20615 峑 +20616 廧 +20617 慫 +20618 抁 +20619 揨 +20620 昛 +20621 朲 +20622 梈 +20623 橺 +20624 歓 +20625 沍 +20626 瀂 +20627 焃 +20628 齄 +20629 〓 +20630  ̄ +20631 侢 +20632 傼 +20633 凗 +20634 咡 +20635 嘅 +20636 圑 +20637 夻 +20638 孇 +20639 岨 +20640 廅 +20641 怇 +20642 扊 +20643 揀 +20644 旫 +20645 桛 +20646 楟 +20647 欮 +20648 淊 +20649 烚 +20650 燓 +20651 ㄞ +20652 ∞ +20653 ⑥ +20654 ^ +20655 佫 +20656 傓 +20657 冝 +20658 呣 +20659 嗈 +20660 囖 +20661 夀 +20662 娹 +20663 嬣 +20664 屴 +20665 嵽 +20666 庌 +20667 忁 +20668 愞 +20669 戅 +20670 撧 +20671 斵 +20672 曓 +20673 椶 +20674 樲 +20675 欈 +20676 氜 +20677 涋 +20678 溵 +20679 澽 +20680 炥 +20681 熮 +20682 轥 +20683 轣 +20684 迆 +20685 轤 +20686 瞢 +20687 轢 +20688 迃 +20689 迉 +20690 迊 +20691 迋 +20692 迌 +20693 迒 +20694 迗 +20695 迚 +20696 迠 +20697 迡 +20698 迣 +20699 迧 +20700 迬 +20701 迯 +20702 迱 +20703 迲 +20704 迶 +20705 迺 +20706 迻 +20707 迼 +20708 迾 +20709 迿 +20710 逇 +20711 逈 +20712 逌 +20713 逎 +20714 逓 +20715 逕 +20716 嗀 +20717 轪 +20718 掎 +20719 掊 +20720 轇 +20721 﨏 +20722 辌 +20723 﨑 +20724 辒 +20725 扌 +20726 辝 +20727 﨔 +20728 辠 +20729 轋 +20730 礼 +20731 轌 +20732 辢 +20733 辤 +20734 尢 +20735 揞 +20736 揎 +20737 﨡 +20738 辥 +20739 轐 +20740 﨤 +20741 辧 +20742 轑 +20743 﨧 +20744 辪 +20745 轒 +20746 辬 +20747 轓 +20748 轔 +20749 搌 +20750 挢 +20751 轕 +20752 轗 +20753 辳 +20754 捱 +20755 轙 +20756 辵 +20757 轚 +20758 撄 +20759 辷 +20760 辸 +20761 轝 +20762 掭 +20763 撖 +20764 逘 +20765 礌 +20766 瑺 +20767 禒 +20768 癄 +20769 繝 +20770 窢 +20771 珷 +20772 粻 +20773 稜 +20774 仩 +20775 儬 +20776 劆 +20777 啝 +20778 嚑 +20779 垹 +20780 墵 +20781 姞 +20782 尃 +20783 崰 +20784 帬 +20785 彔 +20786 悹 +20787 憼 +20788 挔 +20789 摖 +20790 敔 +20791 暊 +20792 枲 +20793 槧 +20794 櫊 +20795 殸 +20796 洜 +20797 潬 +20798 灎 +20799 煚 +20800 牬 +20801 燸 +20802 牗 +20803 爚 +20804 狑 +20805 牞 +20806 爘 +20807 牔 +20808 牥 +20809 牭 +20810 牎 +20811 牱 +20812 牳 +20813 牜 +20814 牷 +20815 燵 +20816 爗 +20817 爙 +20818 牕 +20819 牰 +20820 牣 +20821 燖 +20822 牑 +20823 牏 +20824 牐 +20825 牓 +20826 燶 +20827 牨 +20828 爜 +20829 爞 +20830 爟 +20831 爠 +20832 爡 +20833 爢 +20834 爣 +20835 爤 +20836 爥 +20837 爧 +20838 爩 +20839 爫 +20840 爮 +20841 爯 +20842 爳 +20843 爴 +20844 爼 +20845 牀 +20846 牃 +20847 牅 +20848 牉 +20849 牊 +20850 牸 +20851 牻 +20852 牼 +20853 牴 +20854 牪 +20855 牫 +20856 燗 +20857 牚 +20858 犪 +20859 犩 +20860 犂 +20861 犫 +20862 犅 +20863 犲 +20864 犱 +20865 犮 +20866 犆 +20867 犳 +20868 犉 +20869 燽 +20870 燘 +20871 燾 +20872 犵 +20873 犌 +20874 燿 +20875 狆 +20876 犘 +20877 爀 +20878 燛 +20879 犻 +20880 犐 +20881 犺 +20882 犎 +20883 爁 +20884 爂 +20885 燝 +20886 爃 +20887 燞 +20888 爄 +20889 犿 +20890 犾 +20891 犖 +20892 狅 +20893 犗 +20894 爅 +20895 燡 +20896 爇 +20897 燢 +20898 爈 +20899 爉 +20900 狇 +20901 犙 +20902 燨 +20903 牶 +20904 狊 +20905 犛 +20906 狉 +20907 犚 +20908 狋 +20909 犜 +20910 狏 +20911 狌 +20912 犝 +20913 狓 +20914 爌 +20915 爎 +20916 燫 +20917 燬 +20918 犨 +20919 燯 +20920 狕 +20921 狔 +20922 狖 +20923 犤 +20924 狘 +20925 犥 +20926 燰 +20927 爓 +20928 燱 +20929 爔 +20930 燳 +20931 狚 +20932 犦 +20933 爖 +20934 牋 +20935 OOV_NUM +20936 OOV_ALPHA +20937 OOV_ALNUM +20938 OOV_HANZ +20940 OOV diff --git a/dygraph/lac/downloads.py b/dygraph/lac/downloads.py new file mode 100644 index 0000000000000000000000000000000000000000..c4ef1423a748b217eabb3f730b0df68235294035 --- /dev/null +++ b/dygraph/lac/downloads.py @@ -0,0 +1,134 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Download script, download dataset and pretrain models. +""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import io +import os +import sys +import time +import hashlib +import tarfile +import requests + +FILE_INFO = { + 'BASE_URL': 'https://baidu-nlp.bj.bcebos.com/', + 'DATA': { + 'name': 'lexical_analysis-dataset-2.0.0.tar.gz', + 'md5': '71e4a9a36d0f0177929a1bccedca7dba' + }, +} + + +def usage(): + desc = ("\nDownload datasets and pretrained models for LAC.\n" + "Usage:\n" + " 1. python download.py dataset\n") + print(desc) + + +def md5file(fname): + hash_md5 = hashlib.md5() + with io.open(fname, "rb") as fin: + for chunk in iter(lambda: fin.read(4096), b""): + hash_md5.update(chunk) + return hash_md5.hexdigest() + + +def extract(fname, dir_path): + """ + Extract tar.gz file + """ + try: + tar = tarfile.open(fname, "r:gz") + file_names = tar.getnames() + for file_name in file_names: + tar.extract(file_name, dir_path) + print(file_name) + tar.close() + except Exception as e: + raise e + + +def _download(url, filename, md5sum): + """ + Download file and check md5 + """ + retry = 0 + retry_limit = 3 + chunk_size = 4096 + while not (os.path.exists(filename) and md5file(filename) == md5sum): + if retry < retry_limit: + retry += 1 + else: + raise RuntimeError( + "Cannot download dataset ({0}) with retry {1} times.".format( + url, retry_limit)) + try: + start = time.time() + size = 0 + res = requests.get(url, stream=True) + filesize = int(res.headers['content-length']) + if res.status_code == 200: + print("[Filesize]: %0.2f MB" % (filesize / 1024 / 1024)) + # save by chunk + with io.open(filename, "wb") as fout: + for chunk in res.iter_content(chunk_size=chunk_size): + if chunk: + fout.write(chunk) + size += len(chunk) + pr = '>' * int(size * 50 / filesize) + print( + '\r[Process ]: %s%.2f%%' % + (pr, float(size / filesize * 100)), + end='') + end = time.time() + print("\n[CostTime]: %.2f s" % (end - start)) + except Exception as e: + print(e) + + +def download(name, dir_path): + url = FILE_INFO['BASE_URL'] + FILE_INFO[name]['name'] + file_path = os.path.join(dir_path, FILE_INFO[name]['name']) + + if not os.path.exists(dir_path): + os.makedirs(dir_path) + + # download data + print("Downloading : %s" % name) + _download(url, file_path, FILE_INFO[name]['md5']) + + # extract data + print("Extracting : %s" % file_path) + extract(file_path, dir_path) + os.remove(file_path) + + +if __name__ == '__main__': + if len(sys.argv) != 2: + usage() + sys.exit(1) + pwd = os.path.join(os.path.dirname(__file__), './') + + if sys.argv[1] == "dataset": + download('DATA', pwd) + + else: + usage() diff --git a/dygraph/lac/downloads.sh b/dygraph/lac/downloads.sh new file mode 100644 index 0000000000000000000000000000000000000000..3708c6f92eaa60a21b66a0afd19eb0fe23ad2299 --- /dev/null +++ b/dygraph/lac/downloads.sh @@ -0,0 +1,12 @@ +#!/bin/bash + +# download dataset file to ./data/ +if [ -d ./data/ ] +then + echo "./data/ directory already existed, ignore download" +else + wget --no-check-certificate 
https://baidu-nlp.bj.bcebos.com/lexical_analysis-dataset-2.0.0.tar.gz + tar xvf lexical_analysis-dataset-2.0.0.tar.gz + /bin/rm lexical_analysis-dataset-2.0.0.tar.gz +fi + diff --git a/dygraph/lac/eval.py b/dygraph/lac/eval.py new file mode 100755 index 0000000000000000000000000000000000000000..03b41effd6eb20081564be689e593aca3baa3c19 --- /dev/null +++ b/dygraph/lac/eval.py @@ -0,0 +1,81 @@ +# -*- coding: UTF-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +import sys + +import paddle.fluid as fluid +import paddle +import utils +import reader +import math +from sequence_labeling import lex_net, Chunk_eval +parser = argparse.ArgumentParser(__doc__) +# 1. model parameters +utils.load_yaml(parser, 'conf/args.yaml') +args = parser.parse_args() +def do_eval(args): + dataset = reader.Dataset(args) + + if args.use_cuda: + place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id) \ + if args.use_data_parallel else fluid.CUDAPlace(0) + else: + place = fluid.CPUPlace() + + with fluid.dygraph.guard(place): + test_loader = reader.create_dataloader( + args, + file_name=args.test_data, + place=place, + model='lac', + reader=dataset, + mode='test') + model = lex_net(args, dataset.vocab_size, dataset.num_labels) + load_path = args.init_checkpoint + state_dict, _ = fluid.dygraph.load_dygraph(load_path) + #import ipdb; ipdb.set_trace() + state_dict["linear_chain_crf.weight"]=state_dict["crf_decoding.weight"] + model.set_dict(state_dict) + model.eval() + chunk_eval = Chunk_eval(int(math.ceil((dataset.num_labels - 1) / 2.0)), "IOB") + chunk_evaluator = fluid.metrics.ChunkEvaluator() + chunk_evaluator.reset() + # test_process(test_loader, chunk_evaluator) + + def test_process(reader, chunk_evaluator): + start_time = time.time() + for batch in reader(): + words, targets, length = batch + crf_decode = model(words, length=length) + (precision, recall, f1_score, num_infer_chunks, num_label_chunks, + num_correct_chunks) = chunk_eval( + input=crf_decode, + label=targets, + seq_length=length) + chunk_evaluator.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + + precision, recall, f1 = chunk_evaluator.eval() + end_time = time.time() + print("[test] P: %.5f, R: %.5f, F1: %.5f, elapsed time: %.3f s" % + (precision, recall, f1, end_time - start_time)) + + test_process(test_loader, chunk_evaluator) + +if __name__ == '__main__': + args = parser.parse_args() + do_eval(args) diff --git a/dygraph/lac/eval.sh b/dygraph/lac/eval.sh new file mode 100644 index 0000000000000000000000000000000000000000..5c108aa5747c8df4614595f9c71831b09da1d817 --- /dev/null +++ b/dygraph/lac/eval.sh @@ -0,0 +1,5 @@ +#!/bin/bash +export CUDA_VISIBLE_DEVICES=7 + +python eval.py --batch_size 200 --word_emb_dim 128 --grnn_hidden_dim 128 --bigru_num 2 --use_cuda False --init_checkpoint ./padding_models/step_120000 --test_data ./data/test.tsv --word_dict_path ./conf/word.dic --label_dict_path 
./conf/tag.dic --word_rep_dict_path ./conf/q2b.dic + diff --git a/dygraph/lac/predict.py b/dygraph/lac/predict.py new file mode 100755 index 0000000000000000000000000000000000000000..6431f76b79a7442effc3180b103d3994399c9259 --- /dev/null +++ b/dygraph/lac/predict.py @@ -0,0 +1,87 @@ +# -*- coding: UTF-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import time +import sys + +import paddle.fluid as fluid +import paddle +import utils +import reader +import math +from sequence_labeling import lex_net, Chunk_eval +parser = argparse.ArgumentParser(__doc__) +# 1. model parameters +utils.load_yaml(parser, 'conf/args.yaml') +args = parser.parse_args() + +def do_infer(args): + dataset = reader.Dataset(args) + + if args.use_cuda: + place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id) \ + if args.use_data_parallel else fluid.CUDAPlace(0) + else: + place = fluid.CPUPlace() + + with fluid.dygraph.guard(place): + infer_loader = reader.create_dataloader( + args, + file_name=args.infer_data, + place=place, + model='lac', + reader=dataset, + mode='infer') + model = lex_net(args, dataset.vocab_size, dataset.num_labels) + load_path = args.init_checkpoint + state_dict, _ = fluid.dygraph.load_dygraph(load_path) + #import ipdb; ipdb.set_trace() + state_dict["linear_chain_crf.weight"]=state_dict["crf_decoding.weight"] + model.set_dict(state_dict) + model.eval() + chunk_eval = Chunk_eval(int(math.ceil((dataset.num_labels - 1) / 2.0)), "IOB") + chunk_evaluator = fluid.metrics.ChunkEvaluator() + chunk_evaluator.reset() + + def input_check(data): + if data.lod()[0][-1] == 0: + return data[0]['words'] + return None + + def infer_process(reader): + results = [] + + for batch in reader(): + # import ipdb; ipdb.set_trace() + words, length = batch + #crf_decode = input_check(words) + #if crf_decode: + # results += utils.parse_result(crf_decode, crf_decode, dataset) + # continue + + crf_decode = model(words, length=length) + results += utils.parse_padding_result(words.numpy(), crf_decode.numpy(), length.numpy(), dataset) + return results + + result = infer_process(infer_loader) + for sent, tags in result: + result_list = ['(%s, %s)' % (ch, tag) for ch, tag in zip(sent, tags)] + print(''.join(result_list)) + +if __name__ == '__main__': + args = parser.parse_args() + do_infer(args) diff --git a/dygraph/lac/predict.sh b/dygraph/lac/predict.sh new file mode 100644 index 0000000000000000000000000000000000000000..36b79ad1172588f2caea6775614259ab50a5d029 --- /dev/null +++ b/dygraph/lac/predict.sh @@ -0,0 +1,4 @@ +#!/bin/bash +export CUDA_VISIBLE_DEVICES=7 + +python predict.py --batch_size 200 --word_emb_dim 128 --grnn_hidden_dim 128 --bigru_num 2 --use_cuda False --init_checkpoint ./padding_models/step_120000 --infer_data ./data/infer.tsv --word_dict_path ./conf/word.dic --label_dict_path ./conf/tag.dic --word_rep_dict_path ./conf/q2b.dic diff --git a/dygraph/lac/reader.py b/dygraph/lac/reader.py new file mode 100755 index 
0000000000000000000000000000000000000000..9fc09af8039d63fccc77f7b877de5041f0569d13 --- /dev/null +++ b/dygraph/lac/reader.py @@ -0,0 +1,219 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +The file_reader converts raw corpus to input. +""" + +import os +import argparse +import __future__ +import io +import glob +import paddle.fluid as fluid +import numpy as np + +def load_kv_dict(dict_path, + reverse=False, + delimiter="\t", + key_func=None, + value_func=None): + """ + Load key-value dict from file + """ + result_dict = {} + for line in io.open(dict_path, "r", encoding='utf8'): + terms = line.strip("\n").split(delimiter) + if len(terms) != 2: + continue + if reverse: + value, key = terms + else: + key, value = terms + if key in result_dict: + raise KeyError("key duplicated with [%s]" % (key)) + if key_func: + key = key_func(key) + if value_func: + value = value_func(value) + result_dict[key] = value + return result_dict + + +class Dataset(object): + """data reader""" + + def __init__(self, args, mode="train"): + # read dict + self.word2id_dict = load_kv_dict( + args.word_dict_path, reverse=True, value_func=np.int64) + self.id2word_dict = load_kv_dict(args.word_dict_path) + self.label2id_dict = load_kv_dict( + args.label_dict_path, reverse=True, value_func=np.int64) + self.id2label_dict = load_kv_dict(args.label_dict_path) + self.word_replace_dict = load_kv_dict(args.word_rep_dict_path) + + @property + def vocab_size(self): + """vocabuary size""" + return max(self.word2id_dict.values()) + 1 + + @property + def num_labels(self): + """num_labels""" + return max(self.label2id_dict.values()) + 1 + + def get_num_examples(self, filename): + """num of line of file""" + return sum(1 for line in io.open(filename, "r", encoding='utf8')) + + def word_to_ids(self, words): + """convert word to word index""" + word_ids = [] + for word in words: + word = self.word_replace_dict.get(word, word) + if word not in self.word2id_dict: + word = "OOV" + word_id = self.word2id_dict[word] + word_ids.append(word_id) + + return word_ids + + def label_to_ids(self, labels): + """convert label to label index""" + label_ids = [] + for label in labels: + if label not in self.label2id_dict: + label = "O" + label_id = self.label2id_dict[label] + label_ids.append(label_id) + return label_ids + + def file_reader(self, filename, batch_size=32, _max_seq_len=64, mode="train"): + """ + yield (word_idx, target_idx) one by one from file, + or yield (word_idx, ) in `infer` mode + """ + def wrapper(): + fread = io.open(filename, "r", encoding="utf-8") + if mode == "infer": + batch, init_lens = [], [] + for line in fread: + words= line.strip() + word_ids = self.word_to_ids(words) + init_lens.append(len(word_ids)) + batch.append(word_ids) + if len(batch) == batch_size: + max_seq_len = min(max(init_lens), _max_seq_len) + new_batch = [] + for words_len, words in zip(init_lens, batch): + word_ids = words[0:max_seq_len] + words_len = len(word_ids) + # expand to max_seq_len 
+ word_ids += [0 for _ in range(max_seq_len-words_len)] + new_batch.append((word_ids,words_len)) + yield new_batch + batch, init_lens = [], [] + if len(batch) > 0: + max_seq_len = min(max(init_lens), max_seq_len) + new_batch = [] + for words_len, words in zip(init_lens, batch): + word_ids = word_ids[0:max_seq_len] + words_len = len(word_ids) + # expand to max_seq_len + word_ids += [0 for _ in range(max_seq_len-words_len)] + new_batch.append((word_ids,words_len)) + yield new_batch + else: + headline = next(fread) + batch, init_lens = [], [] + for line in fread: + words, labels = line.strip("\n").split("\t") + if len(words)<1: + continue + word_ids = self.word_to_ids(words.split("\002")) + label_ids = self.label_to_ids(labels.split("\002")) + init_lens.append(len(word_ids)) + batch.append((word_ids, label_ids)) + if len(batch) == batch_size: + max_seq_len = min(max(init_lens), _max_seq_len) + new_batch = [] + for words_len, (word_ids, label_ids) in zip(init_lens, batch): + word_ids = word_ids[0:max_seq_len] + words_len = np.int64(len(word_ids)) + word_ids += [0 for _ in range(max_seq_len-words_len)] + label_ids = label_ids[0:max_seq_len] + label_ids += [0 for _ in range(max_seq_len-words_len)] + assert len(word_ids) == len(label_ids) + new_batch.append((word_ids, label_ids, words_len)) + yield new_batch + batch, init_lens = [], [] + if len(batch) == batch_size: + max_seq_len = min(max(init_lens), max_seq_len) + new_batch = [] + for words_len, (word_ids, label_ids) in zip(init_lens, batch): + max_seq_len = min(max(init_lens), max_seq_len) + word_ids = words[0:max_seq_len] + words_len = np.int64(len(word_ids)) + word_ids += [0 for _ in range(max_seq_len-words_len)] + label_ids = label_ids[0:max_seq_len] + label_ids += [0 for _ in range(max_seq_len-words_len)] + assert len(word_ids) == len(label_ids) + new_batch.append((word_ids, label_ids, words_len)) + yield new_batch + fread.close() + + return wrapper + +def create_dataloader(args, + file_name, + place, + model='lac', + reader=None, + return_reader=False, + mode='train'): + # init reader + + if model == 'lac': + data_loader = fluid.io.DataLoader.from_generator( + capacity=50, + use_double_buffer=True, + iterable=True) + + if reader == None: + reader = Dataset(args) + + # create lac pyreader + if mode == 'train': + #data_loader.set_sample_list_generator( + # fluid.io.batch( + # fluid.io.shuffle( + # reader.file_reader(file_name), + # buf_size=args.traindata_shuffle_buffer), + # batch_size=args.batch_size), + # places=place) + data_loader.set_sample_list_generator( + reader.file_reader( + file_name, batch_size=args.batch_size, _max_seq_len=64, mode=mode), + places=place) + else: + data_loader.set_sample_list_generator( + reader.file_reader( + file_name, batch_size=args.batch_size, _max_seq_len=64, mode=mode), + places=place) + + if return_reader: + return data_loader, reader + else: + return data_loader + diff --git a/dygraph/lac/run.sh b/dygraph/lac/run.sh new file mode 100755 index 0000000000000000000000000000000000000000..d8ca0f5d18a78fb92b780fff22e34421f718fb74 --- /dev/null +++ b/dygraph/lac/run.sh @@ -0,0 +1,29 @@ +#!/bin/bash +export FLAGS_fraction_of_gpu_memory_to_use=0.02 +export FLAGS_eager_delete_tensor_gb=0.0 +export FLAGS_fast_eager_deletion_mode=1 +export CUDA_VISIBLE_DEVICES=0,1,2,3 + +python -m paddle.distributed.launch --selected_gpus=0,1,2,3 train.py \ + --train_data ./data/train.tsv \ + --test_data ./data/test.tsv \ + --model_save_dir ./padding_models \ + --validation_steps 1000 \ + --save_steps 10000 \ + --print_steps 200 \ 
+ --batch_size 400 \ + --epoch 10 \ + --traindata_shuffle_buffer 20000 \ + --word_emb_dim 128 \ + --grnn_hidden_dim 128 \ + --bigru_num 2 \ + --base_learning_rate 1e-3 \ + --emb_learning_rate 2 \ + --crf_learning_rate 0.2 \ + --word_dict_path ./conf/word.dic \ + --label_dict_path ./conf/tag.dic \ + --word_rep_dict_path ./conf/q2b.dic \ + --enable_ce false \ + --use_cuda true \ + --cpu_num 1 \ + --use_data_paralle True diff --git a/dygraph/lac/sequence_labeling.py b/dygraph/lac/sequence_labeling.py new file mode 100755 index 0000000000000000000000000000000000000000..bb8e4c6a2e53b42abb2c75896e1baf667869ae2e --- /dev/null +++ b/dygraph/lac/sequence_labeling.py @@ -0,0 +1,373 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +The function lex_net(args) define the lexical analysis network structure +""" +import sys +import os +import math +import numpy as np + +import paddle.fluid as fluid +from paddle.fluid.initializer import NormalInitializer +from paddle.fluid.dygraph import to_variable +from paddle.fluid.dygraph.nn import Embedding, Linear, GRUUnit + +class DynamicGRU(fluid.dygraph.Layer): + def __init__(self, + size, + h_0=None, + param_attr=None, + bias_attr=None, + is_reverse=False, + gate_activation='sigmoid', + candidate_activation='tanh', + origin_mode=False, + init_size = None): + super(DynamicGRU, self).__init__() + + self.gru_unit = GRUUnit( + size * 3, + param_attr=param_attr, + bias_attr=bias_attr, + activation=candidate_activation, + gate_activation=gate_activation, + origin_mode=origin_mode) + + self.size = size + self.h_0 = h_0 + self.is_reverse = is_reverse + + + def forward(self, inputs): + hidden = self.h_0 + res = [] + + for i in range(inputs.shape[1]): + if self.is_reverse: + i = inputs.shape[1] - 1 - i + + input_ = inputs[ :, i:i+1, :] + input_ = fluid.layers.reshape(input_, [-1, input_.shape[2]], inplace=False) + hidden, reset, gate = self.gru_unit(input_, hidden) + hidden_ = fluid.layers.reshape(hidden, [-1, 1, hidden.shape[1]], inplace=False) + res.append(hidden_) + + if self.is_reverse: + res = res[::-1] + res = fluid.layers.concat(res, axis=1) + return res + + +class BiGRU(fluid.dygraph.Layer): + def __init__(self, + input_dim, + grnn_hidden_dim, + init_bound, + h_0=None): + super(BiGRU, self).__init__() + + self.pre_gru = Linear(input_dim=input_dim, + output_dim=grnn_hidden_dim * 3, + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Uniform( + low=-init_bound, high=init_bound), + regularizer=fluid.regularizer.L2DecayRegularizer( + regularization_coeff=1e-4)))#, + #num_flatten_dims=2) + + self.gru = DynamicGRU(size=grnn_hidden_dim, + h_0=h_0, + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Uniform( + low=-init_bound, high=init_bound), + regularizer=fluid.regularizer.L2DecayRegularizer( + regularization_coeff=1e-4))) + + self.pre_gru_r = Linear(input_dim=input_dim, + output_dim=grnn_hidden_dim * 3, + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Uniform( + 
low=-init_bound, high=init_bound), + regularizer=fluid.regularizer.L2DecayRegularizer( + regularization_coeff=1e-4)))#, + #num_flatten_dims=2) + + self.gru_r = DynamicGRU(size=grnn_hidden_dim, + is_reverse=True, + h_0=h_0, + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Uniform( + low=-init_bound, high=init_bound), + regularizer=fluid.regularizer.L2DecayRegularizer( + regularization_coeff=1e-4))) + + + def forward(self, input_feature): + res_pre_gru = self.pre_gru(input_feature) + res_gru = self.gru(res_pre_gru) + res_pre_gru_r = self.pre_gru_r(input_feature) + res_gru_r = self.gru_r(res_pre_gru_r) + bi_merge = fluid.layers.concat(input=[res_gru, res_gru_r], axis=-1) + return bi_merge + + +class Linear_chain_crf(fluid.dygraph.Layer): + + def __init__(self, + param_attr, + size=None, + is_test=False, + dtype='float32'): + super(Linear_chain_crf, self).__init__() + + self._param_attr = param_attr + self._dtype = dtype + self._size = size + self._is_test=is_test + self._transition = self.create_parameter( + attr=self._param_attr, + shape=[self._size + 2, self._size], + dtype=self._dtype) + + @property + def weight(self): + return self._transition + + @weight.setter + def weight(self, value): + self._transition = value + + def forward(self, input, label, length=None): + + alpha = self._helper.create_variable_for_type_inference( + dtype=self._dtype) + emission_exps = self._helper.create_variable_for_type_inference( + dtype=self._dtype) + transition_exps = self._helper.create_variable_for_type_inference( + dtype=self._dtype) + log_likelihood = self._helper.create_variable_for_type_inference( + dtype=self._dtype) + this_inputs = { + "Emission": [input], + "Transition": self._transition, + "Label": [label] + } + if length: + this_inputs['Length'] = [length] + self._helper.append_op( + type='linear_chain_crf', + inputs=this_inputs, + outputs={ + "Alpha": [alpha], + "EmissionExps": [emission_exps], + "TransitionExps": transition_exps, + "LogLikelihood": log_likelihood + }, + attrs={ + "is_test": self._is_test, + }) + return log_likelihood + + +class Crf_decoding(fluid.dygraph.Layer): + + def __init__(self, + param_attr, + size=None, + is_test=False, + dtype='float32'): + super(Crf_decoding, self).__init__() + + self._dtype = dtype + self._size = size + self._is_test = is_test + self._param_attr = param_attr + self._transition = self.create_parameter( + attr=self._param_attr, + shape=[self._size + 2, self._size], + dtype=self._dtype) + + @property + def weight(self): + return self._transition + + @weight.setter + def weight(self, value): + self._transition = value + + def forward(self, input, label=None, length=None): + + viterbi_path = self._helper.create_variable_for_type_inference( + dtype=self._dtype) + this_inputs = {"Emission": [input], "Transition": self._transition, "Label": label} + if length: + this_inputs['Length'] = [length] + self._helper.append_op( + type='crf_decoding', + inputs=this_inputs, + outputs={"ViterbiPath": [viterbi_path]}, + attrs={ + "is_test": self._is_test, + }) + return viterbi_path + + +class Chunk_eval(fluid.dygraph.Layer): + + def __init__(self, + num_chunk_types, + chunk_scheme, + excluded_chunk_types=None): + super(Chunk_eval, self).__init__() + self.num_chunk_types = num_chunk_types + self.chunk_scheme = chunk_scheme + self.excluded_chunk_types = excluded_chunk_types + + def forward(self, input, label, seq_length=None): + + precision = self._helper.create_variable_for_type_inference(dtype="float32") + recall = 
self._helper.create_variable_for_type_inference(dtype="float32") + f1_score = self._helper.create_variable_for_type_inference(dtype="float32") + num_infer_chunks = self._helper.create_variable_for_type_inference(dtype="int64") + num_label_chunks = self._helper.create_variable_for_type_inference(dtype="int64") + num_correct_chunks = self._helper.create_variable_for_type_inference(dtype="int64") + + this_input = {"Inference": [input], "Label": [label]} + if seq_length: + this_input["SeqLength"] = [seq_length] + + self._helper.append_op( + type='chunk_eval', + inputs=this_input, + outputs={ + "Precision": [precision], + "Recall": [recall], + "F1-Score": [f1_score], + "NumInferChunks": [num_infer_chunks], + "NumLabelChunks": [num_label_chunks], + "NumCorrectChunks": [num_correct_chunks] + }, + attrs={ + "num_chunk_types": self.num_chunk_types, + "chunk_scheme": self.chunk_scheme, + "excluded_chunk_types": self.excluded_chunk_types or [] + }) + return (precision, recall, f1_score, num_infer_chunks, num_label_chunks, + num_correct_chunks) + + +class lex_net(fluid.dygraph.Layer): + def __init__(self, + args, + vocab_size, + num_labels, + length=None): + super(lex_net, self).__init__() + """ + define the lexical analysis network structure + word: stores the input of the model + for_infer: a boolean value, indicating if the model to be created is for training or predicting. + + return: + for infer: return the prediction + otherwise: return the prediction + """ + self.word_emb_dim = args.word_emb_dim + self.vocab_size = vocab_size + self.num_labels = num_labels + self.grnn_hidden_dim = args.grnn_hidden_dim + self.emb_lr = args.emb_learning_rate if 'emb_learning_rate' in dir(args) else 1.0 + self.crf_lr = args.emb_learning_rate if 'crf_learning_rate' in dir(args) else 1.0 + self.bigru_num = args.bigru_num + self.init_bound = 0.1 + #self.IS_SPARSE = True + + self.word_embedding = Embedding( + size=[self.vocab_size, self.word_emb_dim], + dtype='float32', + #is_sparse=self.IS_SPARSE, + param_attr=fluid.ParamAttr( + learning_rate=self.emb_lr, + name="word_emb", + initializer=fluid.initializer.Uniform( + low=-self.init_bound, high=self.init_bound))) + + h_0 = np.zeros((args.batch_size, self.grnn_hidden_dim), dtype="float32") + h_0 = to_variable(h_0) + self.bigru_units = [] + for i in range(self.bigru_num): + if i == 0: + self.bigru_units.append( + self.add_sublayer("bigru_units%d" % i, + BiGRU(self.grnn_hidden_dim, self.grnn_hidden_dim, self.init_bound, h_0=h_0) + )) + else: + self.bigru_units.append( + self.add_sublayer("bigru_units%d" % i, + BiGRU(self.grnn_hidden_dim * 2, self.grnn_hidden_dim, self.init_bound, h_0=h_0) + )) + + self.fc = Linear(input_dim=self.grnn_hidden_dim * 2, + output_dim=self.num_labels, + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Uniform( + low=-self.init_bound, high=self.init_bound), + regularizer=fluid.regularizer.L2DecayRegularizer( + regularization_coeff=1e-4)))#, + #num_flatten_dims=2) + + self.linear_chain_crf = Linear_chain_crf( + param_attr=fluid.ParamAttr( + name='linear_chain_crfw', learning_rate=self.crf_lr), + size=self.num_labels) + + self.crf_decoding = Crf_decoding( + param_attr=fluid.ParamAttr( + name='crfw', learning_rate=self.crf_lr), + size=self.num_labels) + + def forward(self, word, target=None, length=None): + """ + Configure the network + """ + #word = fluid.layers.unsqueeze(word, [2]) + word_embed = self.word_embedding(word) + input_feature = word_embed + + for i in range(self.bigru_num): + bigru_output = 
self.bigru_units[i](input_feature) + input_feature = bigru_output + + emission = self.fc(bigru_output) + + if target is not None: + crf_cost = self.linear_chain_crf( + input=emission, + label=target, + length=length) + avg_cost = fluid.layers.mean(x=crf_cost) + self.crf_decoding.weight = self.linear_chain_crf.weight + crf_decode = self.crf_decoding( + input=emission, + length=length) + return avg_cost, crf_decode#, word_embed, bigru_output, emission + else: + crf_decode = self.crf_decoding( + input=emission, + length=length) + return crf_decode + + + diff --git a/dygraph/lac/train.py b/dygraph/lac/train.py new file mode 100755 index 0000000000000000000000000000000000000000..00f294eaeb4d4d05f62ef5d44e9b4b30887b29d3 --- /dev/null +++ b/dygraph/lac/train.py @@ -0,0 +1,155 @@ +# -*- coding: UTF-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import sys +import math +import time +import random +import argparse +import multiprocessing + +import numpy as np +import paddle +import paddle.fluid as fluid + +np.set_printoptions(threshold=np.inf) +import reader +import utils +from sequence_labeling import lex_net, Chunk_eval +#from eval import test_process + +# the function to train model +def do_train(args): + + dataset = reader.Dataset(args) + if args.use_cuda: + place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id) \ + if args.use_data_parallel else fluid.CUDAPlace(0) + else: + place = fluid.CPUPlace() + + with fluid.dygraph.guard(place): + if args.use_data_parallel: + strategy = fluid.dygraph.parallel.prepare_context() + #fluid.default_startup_program().random_seed = 102 + #fluid.default_main_program().random_seed = 102 + #np.random.seed(102) + #random.seed(102) + train_loader = reader.create_dataloader( + args, + file_name=args.train_data, + place=place, + model='lac', + reader=dataset) + if args.use_data_parallel: + train_loader = fluid.contrib.reader.distributed_batch_reader( + train_loader) + + test_loader = reader.create_dataloader( + args, + file_name=args.test_data, + place=place, + model='lac', + reader=dataset, + mode='test') + model = lex_net(args, dataset.vocab_size, dataset.num_labels) + if args.use_data_parallel: + model = fluid.dygraph.parallel.DataParallel(model, strategy) + optimizer = fluid.optimizer.AdamOptimizer(learning_rate=args.base_learning_rate, + parameter_list=model.parameters()) + chunk_eval = Chunk_eval(int(math.ceil((dataset.num_labels - 1) / 2.0)), "IOB") + num_train_examples = dataset.get_num_examples(args.train_data) + max_train_steps = args.epoch * num_train_examples // args.batch_size + print("Num train examples: %d" % num_train_examples) + print("Max train steps: %d" % max_train_steps) + + step = 0 + print_start_time = time.time() + chunk_evaluator = fluid.metrics.ChunkEvaluator() + chunk_evaluator.reset() + + def test_process(reader, chunk_evaluator): + model.eval() + chunk_evaluator.reset() + + start_time = time.time() + for batch in reader(): + words, targets, length = 
batch + crf_decode = model(words, length=length) + (precision, recall, f1_score, num_infer_chunks, num_label_chunks, + num_correct_chunks) = chunk_eval( + input=crf_decode, + label=targets, + seq_length=length) + chunk_evaluator.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + + precision, recall, f1 = chunk_evaluator.eval() + end_time = time.time() + print("[test] P: %.5f, R: %.5f, F1: %.5f, elapsed time: %.3f s" % + (precision, recall, f1, end_time - start_time)) + model.train() + + for epoch_id in range(args.epoch): + for batch in train_loader(): + words, targets, length = batch + + start_time = time.time() + avg_cost, crf_decode = model(words, targets, length) + if args.use_data_parallel: + avg_cost = model.scale_loss(avg_cost) + avg_cost.backward() + model.apply_collective_grads() + else: + avg_cost.backward() + optimizer.minimize(avg_cost) + model.clear_gradients() + end_time = time.time() + + if step % args.print_steps == 0: + (precision, recall, f1_score, num_infer_chunks, num_label_chunks, + num_correct_chunks) = chunk_eval( + input=crf_decode, + label=targets, + seq_length=length) + outputs = [avg_cost, precision, recall, f1_score] + avg_cost, precision, recall, f1_score = [np.mean(x.numpy()) for x in outputs] + + print("[train] step = %d, loss = %.5f, P: %.5f, R: %.5f, F1: %.5f, elapsed time %.5f" % ( + step, avg_cost, precision, recall, f1_score, end_time - start_time)) + + if step % args.validation_steps == 0: + test_process(test_loader, chunk_evaluator) + + # save checkpoints + if step % args.save_steps == 0 and step != 0: + save_path = os.path.join(args.model_save_dir, "step_" + str(step)) + paddle.fluid.save_dygraph(model.state_dict(), save_path) + step += 1 + + + +if __name__ == "__main__": + # 参数控制可以根据需求使用argparse,yaml或者json + # 对NLP任务推荐使用PALM下定义的configure,可以统一argparse,yaml或者json格式的配置文件。 + + parser = argparse.ArgumentParser(__doc__) + utils.load_yaml(parser, 'conf/args.yaml') + + args = parser.parse_args() + + print(args) + + do_train(args) diff --git a/dygraph/lac/utils.py b/dygraph/lac/utils.py new file mode 100755 index 0000000000000000000000000000000000000000..faad8ffe3ca1efda67a68999261b71bea6a61a65 --- /dev/null +++ b/dygraph/lac/utils.py @@ -0,0 +1,177 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
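As a point of reference for the metrics printed by the training loop in train.py above: `Chunk_eval` emits three counters per batch (`num_infer_chunks`, `num_label_chunks`, `num_correct_chunks`), and `fluid.metrics.ChunkEvaluator` accumulates them before turning them into precision/recall/F1. The sketch below is a plain-Python illustration of that arithmetic only, not the Paddle implementation itself.

```
# Illustration only: chunk-level precision/recall/F1 from the three counters
# produced by chunk_eval and accumulated by fluid.metrics.ChunkEvaluator.
def chunk_prf(num_infer_chunks, num_label_chunks, num_correct_chunks):
    precision = num_correct_chunks / float(num_infer_chunks) if num_infer_chunks else 0.0
    recall = num_correct_chunks / float(num_label_chunks) if num_label_chunks else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# hypothetical batch: 8 predicted chunks, 10 labeled chunks, 6 exact matches
print(chunk_prf(8, 10, 6))  # (0.75, 0.6, ~0.667)
```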
+""" +util tools +""" +from __future__ import print_function +import os +import sys +import numpy as np +import paddle.fluid as fluid +import yaml +import io + + +def str2bool(v): + """ + argparse does not support True or False in python + """ + return v.lower() in ("true", "t", "1") + + +class ArgumentGroup(object): + """ + Put arguments to one group + """ + + def __init__(self, parser, title, des): + """none""" + self._group = parser.add_argument_group(title=title, description=des) + + def add_arg(self, name, type, default, help, **kwargs): + """ Add argument """ + type = str2bool if type == bool else type + self._group.add_argument( + "--" + name, + default=default, + type=type, + help=help + ' Default: %(default)s.', + **kwargs) + + +def load_yaml(parser, file_name, **kwargs): + with io.open(file_name, 'r', encoding='utf8') as f: + args = yaml.load(f) + for title in args: + group = parser.add_argument_group(title=title, description='') + for name in args[title]: + _type = type(args[title][name]['val']) + _type = str2bool if _type == bool else _type + group.add_argument( + "--" + name, + default=args[title][name]['val'], + type=_type, + help=args[title][name]['meaning'] + + ' Default: %(default)s.', + **kwargs) + + +def print_arguments(args): + """none""" + print('----------- Configuration Arguments -----------') + for arg, value in sorted(vars(args).items()): + print('%s: %s' % (arg, value)) + print('------------------------------------------------') + + +def to_str(string, encoding="utf-8"): + """convert to str for print""" + if sys.version_info.major == 3: + if isinstance(string, bytes): + return string.decode(encoding) + elif sys.version_info.major == 2: + if isinstance(string, unicode): + if os.name == 'nt': + return string + else: + return string.encode(encoding) + return string + +def parse_padding_result(words, crf_decode, seq_lens, dataset): + """ parse padding result """ + # words = np.squeeze(words) + batch_size = len(seq_lens) + + batch_out = [] + for sent_index in range(batch_size): + + sent = [ + dataset.id2word_dict[str(id)] + for id in words[sent_index][1:seq_lens[sent_index] - 1] + ] + tags = [ + dataset.id2label_dict[str(id)] + for id in crf_decode[sent_index][1:seq_lens[sent_index] - 1] + ] + + sent_out = [] + tags_out = [] + parital_word = "" + for ind, tag in enumerate(tags): + # for the first word + if parital_word == "": + parital_word = sent[ind] + tags_out.append(tag.split('-')[0]) + continue + + # for the beginning of word + if tag.endswith("-B") or (tag == "O" and tags[ind - 1] != "O"): + sent_out.append(parital_word) + tags_out.append(tag.split('-')[0]) + parital_word = sent[ind] + continue + + parital_word += sent[ind] + + # append the last word, except for len(tags)=0 + if len(sent_out) < len(tags_out): + sent_out.append(parital_word) + + batch_out.append([sent_out, tags_out]) + return batch_out + + +def init_checkpoint(exe, init_checkpoint_path, main_program): + """ + Init CheckPoint + """ + assert os.path.exists( + init_checkpoint_path), "[%s] cann't be found." 
% init_checkpoint_path + + def existed_persitables(var): + """ + If existed presitabels + """ + if not fluid.io.is_persistable(var): + return False + return os.path.exists(os.path.join(init_checkpoint_path, var.name)) + + fluid.io.load_vars( + exe, + init_checkpoint_path, + main_program=main_program, + predicate=existed_persitables) + print("Load model from {}".format(init_checkpoint_path)) + + +def init_pretraining_params(exe, + pretraining_params_path, + main_program, + use_fp16=False): + """load params of pretrained model, NOT including moment, learning_rate""" + assert os.path.exists(pretraining_params_path + ), "[%s] cann't be found." % pretraining_params_path + + def _existed_params(var): + if not isinstance(var, fluid.framework.Parameter): + return False + return os.path.exists(os.path.join(pretraining_params_path, var.name)) + + fluid.io.load_vars( + exe, + pretraining_params_path, + main_program=main_program, + predicate=_existed_params) + print("Load pretraining parameters from {}.".format( + pretraining_params_path)) diff --git a/dygraph/mnist/train.py b/dygraph/mnist/train.py index b067c94c9e61b40cce2bf2c378a222203fc4e004..f81df8f26458c93c1f658a9bc783d14a3c5b8256 100644 --- a/dygraph/mnist/train.py +++ b/dygraph/mnist/train.py @@ -21,7 +21,7 @@ import os import paddle import paddle.fluid as fluid from paddle.fluid.optimizer import AdamOptimizer -from paddle.fluid.dygraph.nn import Conv2D, Pool2D, FC +from paddle.fluid.dygraph.nn import Conv2D, Pool2D, Linear from paddle.fluid.dygraph.base import to_variable @@ -41,7 +41,6 @@ def parse_args(): class SimpleImgConvPool(fluid.dygraph.Layer): def __init__(self, - name_scope, num_channels, num_filters, filter_size, @@ -58,10 +57,10 @@ class SimpleImgConvPool(fluid.dygraph.Layer): use_cudnn=False, param_attr=None, bias_attr=None): - super(SimpleImgConvPool, self).__init__(name_scope) + super(SimpleImgConvPool, self).__init__() self._conv2d = Conv2D( - self.full_name(), + num_channels=num_channels, num_filters=num_filters, filter_size=filter_size, stride=conv_stride, @@ -74,7 +73,6 @@ class SimpleImgConvPool(fluid.dygraph.Layer): use_cudnn=use_cudnn) self._pool2d = Pool2D( - self.full_name(), pool_size=pool_size, pool_type=pool_type, pool_stride=pool_stride, @@ -89,20 +87,19 @@ class SimpleImgConvPool(fluid.dygraph.Layer): class MNIST(fluid.dygraph.Layer): - def __init__(self, name_scope): - super(MNIST, self).__init__(name_scope) + def __init__(self): + super(MNIST, self).__init__() self._simple_img_conv_pool_1 = SimpleImgConvPool( - self.full_name(), 1, 20, 5, 2, 2, act="relu") + 1, 20, 5, 2, 2, act="relu") self._simple_img_conv_pool_2 = SimpleImgConvPool( - self.full_name(), 20, 50, 5, 2, 2, act="relu") + 20, 50, 5, 2, 2, act="relu") - pool_2_shape = 50 * 4 * 4 + self.pool_2_shape = 50 * 4 * 4 SIZE = 10 - scale = (2.0 / (pool_2_shape**2 * SIZE))**0.5 - self._fc = FC(self.full_name(), - 10, + scale = (2.0 / (self.pool_2_shape**2 * SIZE))**0.5 + self._fc = Linear(self.pool_2_shape, 10, param_attr=fluid.param_attr.ParamAttr( initializer=fluid.initializer.NormalInitializer( loc=0.0, scale=scale)), @@ -111,6 +108,7 @@ class MNIST(fluid.dygraph.Layer): def forward(self, inputs, label=None): x = self._simple_img_conv_pool_1(inputs) x = self._simple_img_conv_pool_2(x) + x = fluid.layers.reshape(x, shape=[-1, self.pool_2_shape]) x = self._fc(x) if label is not None: acc = fluid.layers.accuracy(input=x, label=label) @@ -148,7 +146,7 @@ def inference_mnist(): place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id) \ if 
args.use_data_parallel else fluid.CUDAPlace(0) with fluid.dygraph.guard(place): - mnist_infer = MNIST("mnist") + mnist_infer = MNIST() # load checkpoint model_dict, _ = fluid.load_dygraph("save_temp") mnist_infer.set_dict(model_dict) @@ -188,8 +186,8 @@ def train_mnist(args): if args.use_data_parallel: strategy = fluid.dygraph.parallel.prepare_context() - mnist = MNIST("mnist") - adam = AdamOptimizer(learning_rate=0.001) + mnist = MNIST() + adam = AdamOptimizer(learning_rate=0.001, parameter_list=mnist.parameters()) if args.use_data_parallel: mnist = fluid.dygraph.parallel.DataParallel(mnist, strategy) @@ -246,9 +244,10 @@ def train_mnist(args): fluid.dygraph.parallel.Env().local_rank == 0) if save_parameters: fluid.save_dygraph(mnist.state_dict(), "save_temp") + print("checkpoint saved") - inference_mnist() + inference_mnist() if __name__ == '__main__': diff --git a/dygraph/mobilenet/README.md b/dygraph/mobilenet/README.md new file mode 100644 index 0000000000000000000000000000000000000000..c2e0477e753d5abe9f90ed17682ad0e58f8e52e3 --- /dev/null +++ b/dygraph/mobilenet/README.md @@ -0,0 +1,67 @@ +**模型简介** + +图像分类是计算机视觉的重要领域,它的目标是将图像分类到预定义的标签。CNN模型在图像分类领域取得了突破的成果,同时模型复杂度也在不断增加。MobileNet是一种小巧而高效CNN模型,本文介绍如何使PaddlePaddle的动态图MobileNet进行图像分类。 + +**代码结构** + + ├── run_mul_v1.sh # 多卡训练启动脚本_v1 + ├── run_mul_v1_checkpoint.sh # 加载checkpoint多卡训练启动脚本_v1 + ├── run_mul_v2.sh # 多卡训练启动脚本_v2 + ├── run_mul_v2_checkpoint.sh # 加载checkpoint多卡训练启动脚本_v2 + ├── run_sing_v1.sh # 单卡训练启动脚本_v1 + ├── run_sing_v1_checkpoint.sh # 加载checkpoint单卡训练启动脚本_v1 + ├── run_sing_v2.sh # 单卡训练启动脚本_v2 + ├── run_sing_v2_checkpoint.sh # 加载checkpoint单卡训练启动脚本_v2 + ├── run_cpu_v1.sh # CPU训练启动脚本_v1 + ├── run_cpu_v2.sh # CPU训练启动脚本_v2 + ├── train.py # 训练入口 + ├── mobilenet_v1.py # 网络结构v1 + ├── mobilenet_v2.py # 网络结构v2 + ├── reader.py # 数据reader + ├── utils # 基础工具目录 + +**数据准备** + +请参考:https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/image_classification + +**模型训练** + +若使用4卡训练,启动方式如下: + + bash run_mul_v1.sh + bash run_mul_v2.sh + +若使用单卡训练,启动方式如下: + + bash run_sing_v1.sh + bash run_sing_v2.sh + +若使用CPU训练,启动方式如下: + + bash run_cpu_v1.sh + bash run_cpu_v2.sh + +训练过程中,checkpoint会保存在参数model_save_dir指定的文件夹中,我们支持加载checkpoint继续训练. +加载checkpoint使用4卡训练,启动方式如下: + + bash run_mul_v1_checkpoint.sh + bash run_mul_v2_checkpoint.sh + +加载checkpoint使用单卡训练,启动方式如下: + + bash run_sing_v1_checkpoint.sh + bash run_sing_v2_checkpoint.sh + +**模型性能** + + Model Top-1(单卡/4卡) Top-5(单卡/4卡) 收敛时间(单卡/4卡) + + MobileNetV1 0.707/0.711 0.897/0.899 116小时/30.9小时 + + MobileNetV2 0.708/0.724 0.899/0.906 227.8小时/60.8小时 + +**参考论文** + +MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam + +MobileNetV2: Inverted Residuals and Linear Bottlenecks, Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen diff --git a/dygraph/mobilenet/mobilenet_v1.py b/dygraph/mobilenet/mobilenet_v1.py new file mode 100644 index 0000000000000000000000000000000000000000..e3a5a94eab46477a8fb9676f5a5bf67000783018 --- /dev/null +++ b/dygraph/mobilenet/mobilenet_v1.py @@ -0,0 +1,236 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +#order: standard library, third party, local library +import os +import time +import sys +import math +import numpy as np +import argparse +import paddle +import paddle.fluid as fluid +from paddle.fluid.initializer import MSRA +from paddle.fluid.param_attr import ParamAttr +from paddle.fluid.layer_helper import LayerHelper +from paddle.fluid.dygraph.nn import Conv2D, Pool2D, BatchNorm, Linear +from paddle.fluid.dygraph.base import to_variable +from paddle.fluid import framework + + +class ConvBNLayer(fluid.dygraph.Layer): + def __init__(self, + num_channels, + filter_size, + num_filters, + stride, + padding, + channels=None, + num_groups=1, + act='relu', + use_cudnn=True, + name=None): + super(ConvBNLayer, self).__init__() + + self._conv = Conv2D( + num_channels=num_channels, + num_filters=num_filters, + filter_size=filter_size, + stride=stride, + padding=padding, + groups=num_groups, + act=None, + use_cudnn=use_cudnn, + param_attr=ParamAttr( + initializer=MSRA(), name=self.full_name() + "_weights"), + bias_attr=False) + + self._batch_norm = BatchNorm( + num_filters, + act=act, + param_attr=ParamAttr(name=self.full_name() + "_bn" + "_scale"), + bias_attr=ParamAttr(name=self.full_name() + "_bn" + "_offset"), + moving_mean_name=self.full_name() + "_bn" + '_mean', + moving_variance_name=self.full_name() + "_bn" + '_variance') + + def forward(self, inputs): + y = self._conv(inputs) + y = self._batch_norm(y) + return y + + +class DepthwiseSeparable(fluid.dygraph.Layer): + def __init__(self, + num_channels, + num_filters1, + num_filters2, + num_groups, + stride, + scale, + name=None): + super(DepthwiseSeparable, self).__init__() + + self._depthwise_conv = ConvBNLayer( + num_channels=num_channels, + num_filters=int(num_filters1 * scale), + filter_size=3, + stride=stride, + padding=1, + num_groups=int(num_groups * scale), + use_cudnn=False) + + self._pointwise_conv = ConvBNLayer( + num_channels=int(num_filters1 * scale), + filter_size=1, + num_filters=int(num_filters2 * scale), + stride=1, + padding=0) + + def forward(self, inputs): + y = self._depthwise_conv(inputs) + y = self._pointwise_conv(y) + return y + + +class MobileNetV1(fluid.dygraph.Layer): + def __init__(self, scale=1.0, class_dim=1000): + super(MobileNetV1, self).__init__() + self.scale = scale + self.dwsl = [] + + self.conv1 = ConvBNLayer( + num_channels=3, + filter_size=3, + channels=3, + num_filters=int(32 * scale), + stride=2, + padding=1) + + dws21 = self.add_sublayer( + sublayer=DepthwiseSeparable( + num_channels=int(32 * scale), + num_filters1=32, + num_filters2=64, + num_groups=32, + stride=1, + scale=scale), + name="conv2_1") + self.dwsl.append(dws21) + + dws22 = self.add_sublayer( + sublayer=DepthwiseSeparable( + num_channels=int(64 * scale), + num_filters1=64, + num_filters2=128, + num_groups=64, + stride=2, + scale=scale), + name="conv2_2") + self.dwsl.append(dws22) + + dws31 = self.add_sublayer( + sublayer=DepthwiseSeparable( + num_channels=int(128 * scale), + num_filters1=128, + num_filters2=128, + num_groups=128, + stride=1, + scale=scale), + name="conv3_1") + self.dwsl.append(dws31) + + dws32 = 
self.add_sublayer( + sublayer=DepthwiseSeparable( + num_channels=int(128 * scale), + num_filters1=128, + num_filters2=256, + num_groups=128, + stride=2, + scale=scale), + name="conv3_2") + self.dwsl.append(dws32) + + dws41 = self.add_sublayer( + sublayer=DepthwiseSeparable( + num_channels=int(256 * scale), + num_filters1=256, + num_filters2=256, + num_groups=256, + stride=1, + scale=scale), + name="conv4_1") + self.dwsl.append(dws41) + + dws42 = self.add_sublayer( + sublayer=DepthwiseSeparable( + num_channels=int(256 * scale), + num_filters1=256, + num_filters2=512, + num_groups=256, + stride=2, + scale=scale), + name="conv4_2") + self.dwsl.append(dws42) + + for i in range(5): + tmp = self.add_sublayer( + sublayer=DepthwiseSeparable( + num_channels=int(512 * scale), + num_filters1=512, + num_filters2=512, + num_groups=512, + stride=1, + scale=scale), + name="conv5_" + str(i + 1)) + self.dwsl.append(tmp) + + dws56 = self.add_sublayer( + sublayer=DepthwiseSeparable( + num_channels=int(512 * scale), + num_filters1=512, + num_filters2=1024, + num_groups=512, + stride=2, + scale=scale), + name="conv5_6") + self.dwsl.append(dws56) + + dws6 = self.add_sublayer( + sublayer=DepthwiseSeparable( + num_channels=int(1024 * scale), + num_filters1=1024, + num_filters2=1024, + num_groups=1024, + stride=1, + scale=scale), + name="conv6") + self.dwsl.append(dws6) + + self.pool2d_avg = Pool2D(pool_type='avg', global_pooling=True) + + self.out = Linear( + int(1024 * scale), + class_dim, + param_attr=ParamAttr( + initializer=MSRA(), name=self.full_name() + "fc7_weights"), + bias_attr=ParamAttr(name="fc7_offset")) + + def forward(self, inputs): + y = self.conv1(inputs) + for dws in self.dwsl: + y = dws(y) + y = self.pool2d_avg(y) + y = fluid.layers.reshape(y, shape=[-1, 1024]) + y = self.out(y) + return y diff --git a/dygraph/mobilenet/mobilenet_v2.py b/dygraph/mobilenet/mobilenet_v2.py new file mode 100644 index 0000000000000000000000000000000000000000..6da031f298c1e76c21d6415da4b4fe0dd9715731 --- /dev/null +++ b/dygraph/mobilenet/mobilenet_v2.py @@ -0,0 +1,219 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
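Before the MobileNetV2 definition that follows, a quick sanity check for the MobileNetV1 class defined above. This is a minimal smoke test added for illustration (it assumes the file is importable as `mobilenet_v1` and that PaddlePaddle with dygraph support is installed); it is not part of the original training code.

```
# Minimal dygraph smoke test for MobileNetV1 (illustrative, CPU-only).
import numpy as np
import paddle.fluid as fluid
from paddle.fluid.dygraph.base import to_variable
from mobilenet_v1 import MobileNetV1

with fluid.dygraph.guard(fluid.CPUPlace()):
    net = MobileNetV1(scale=1.0, class_dim=1000)
    net.eval()
    img = to_variable(np.random.random([2, 3, 224, 224]).astype('float32'))
    logits = net(img)
    print(logits.numpy().shape)  # expected: (2, 1000)
```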
+ +#order: standard library, third party, local library +import os +import time +import math +import sys +import numpy as np +import argparse +import paddle +import paddle.fluid as fluid +from paddle.fluid.initializer import MSRA +from paddle.fluid.param_attr import ParamAttr +from paddle.fluid.layer_helper import LayerHelper +from paddle.fluid.dygraph.nn import Conv2D, Pool2D, BatchNorm, Linear +from paddle.fluid.dygraph.base import to_variable +from paddle.fluid import framework + + + +class ConvBNLayer(fluid.dygraph.Layer): + def __init__(self, + num_channels, + filter_size, + num_filters, + stride, + padding, + channels=None, + num_groups=1, + use_cudnn=True): + super(ConvBNLayer, self).__init__() + + tmp_param = ParamAttr(name=self.full_name() + "_weights") + self._conv = Conv2D( + num_channels=num_channels, + num_filters=num_filters, + filter_size=filter_size, + stride=stride, + padding=padding, + groups=num_groups, + act=None, + use_cudnn=use_cudnn, + param_attr=tmp_param, + bias_attr=False) + + self._batch_norm = BatchNorm( + num_filters, + param_attr=ParamAttr(name=self.full_name() + "_bn" + "_scale"), + bias_attr=ParamAttr(name=self.full_name() + "_bn" + "_offset"), + moving_mean_name=self.full_name() + "_bn" + '_mean', + moving_variance_name=self.full_name() + "_bn" + '_variance') + + def forward(self, inputs, if_act=True): + y = self._conv(inputs) + y = self._batch_norm(y) + if if_act: + y = fluid.layers.relu6(y) + return y + + +class InvertedResidualUnit(fluid.dygraph.Layer): + def __init__( + self, + num_channels, + num_in_filter, + num_filters, + stride, + filter_size, + padding, + expansion_factor, ): + super(InvertedResidualUnit, self).__init__() + num_expfilter = int(round(num_in_filter * expansion_factor)) + self._expand_conv = ConvBNLayer( + num_channels=num_channels, + num_filters=num_expfilter, + filter_size=1, + stride=1, + padding=0, + num_groups=1) + + self._bottleneck_conv = ConvBNLayer( + num_channels=num_expfilter, + num_filters=num_expfilter, + filter_size=filter_size, + stride=stride, + padding=padding, + num_groups=num_expfilter, + use_cudnn=False) + + self._linear_conv = ConvBNLayer( + num_channels=num_expfilter, + num_filters=num_filters, + filter_size=1, + stride=1, + padding=0, + num_groups=1) + + def forward(self, inputs, ifshortcut): + y = self._expand_conv(inputs, if_act=True) + y = self._bottleneck_conv(y, if_act=True) + y = self._linear_conv(y, if_act=False) + if ifshortcut: + y = fluid.layers.elementwise_add(inputs, y) + return y + + +class InvresiBlocks(fluid.dygraph.Layer): + def __init__(self, in_c, t, c, n, s): + super(InvresiBlocks, self).__init__() + + self._first_block = InvertedResidualUnit( + num_channels=in_c, + num_in_filter=in_c, + num_filters=c, + stride=s, + filter_size=3, + padding=1, + expansion_factor=t) + + self._inv_blocks = [] + for i in range(1, n): + tmp = self.add_sublayer( + sublayer=InvertedResidualUnit( + num_channels=c, + num_in_filter=c, + num_filters=c, + stride=1, + filter_size=3, + padding=1, + expansion_factor=t), + name=self.full_name() + "_" + str(i + 1)) + self._inv_blocks.append(tmp) + + def forward(self, inputs): + y = self._first_block(inputs, ifshortcut=False) + for inv_block in self._inv_blocks: + y = inv_block(y, ifshortcut=True) + return y + + +class MobileNetV2(fluid.dygraph.Layer): + def __init__(self, class_dim=1000, scale=1.0): + super(MobileNetV2, self).__init__() + self.scale = scale + self.class_dim = class_dim + + bottleneck_params_list = [ + (1, 16, 1, 1), + (6, 24, 2, 2), + (6, 32, 3, 2), + (6, 64, 
4, 2), + (6, 96, 3, 1), + (6, 160, 3, 2), + (6, 320, 1, 1), + ] + + #1. conv1 + self._conv1 = ConvBNLayer( + num_channels=3, + num_filters=int(32 * scale), + filter_size=3, + stride=2, + padding=1) + + #2. bottleneck sequences + self._invl = [] + i = 1 + in_c = int(32 * scale) + for layer_setting in bottleneck_params_list: + t, c, n, s = layer_setting + i += 1 + tmp = self.add_sublayer( + sublayer=InvresiBlocks( + in_c=in_c, t=t, c=int(c * scale), n=n, s=s), + name='conv' + str(i)) + self._invl.append(tmp) + in_c = int(c * scale) + + #3. last_conv + self._out_c = int(1280 * scale) if scale > 1.0 else 1280 + self._conv9 = ConvBNLayer( + num_channels=in_c, + num_filters=self._out_c, + filter_size=1, + stride=1, + padding=0) + + #4. pool + self._pool2d_avg = Pool2D(pool_type='avg', global_pooling=True) + + #5. fc + tmp_param = ParamAttr(name=self.full_name() + "fc10_weights") + self._fc = Linear( + self._out_c, + class_dim, + param_attr=tmp_param, + bias_attr=ParamAttr(name="fc10_offset")) + + def forward(self, inputs): + y = self._conv1(inputs, if_act=True) + for inv in self._invl: + y = inv(y) + y = self._conv9(y, if_act=True) + y = self._pool2d_avg(y) + y = fluid.layers.reshape(y, shape=[-1, self._out_c]) + y = self._fc(y) + return y diff --git a/dygraph/mobilenet/reader.py b/dygraph/mobilenet/reader.py new file mode 100644 index 0000000000000000000000000000000000000000..b96d1366690edc34f62be75546c47c49476d85e1 --- /dev/null +++ b/dygraph/mobilenet/reader.py @@ -0,0 +1,414 @@ +#copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +import sys +import os +import math +import random +import functools +import numpy as np +import cv2 + +import paddle +from paddle import fluid +from utils.autoaugment import ImageNetPolicy +from PIL import Image + +policy = None + +random.seed(0) +np.random.seed(0) + + +def rotate_image(img): + """rotate image + + Args: + img: image data + + Returns: + rotated image data + """ + (h, w) = img.shape[:2] + center = (w / 2, h / 2) + angle = np.random.randint(-10, 11) + M = cv2.getRotationMatrix2D(center, angle, 1.0) + rotated = cv2.warpAffine(img, M, (w, h)) + return rotated + + +def random_crop(img, size, settings, scale=None, ratio=None, + interpolation=None): + """random crop image + + Args: + img: image data + size: crop size + settings: arguments + scale: scale parameter + ratio: ratio parameter + + Returns: + random cropped image data + """ + lower_scale = settings.lower_scale + lower_ratio = settings.lower_ratio + upper_ratio = settings.upper_ratio + scale = [lower_scale, 1.0] if scale is None else scale + ratio = [lower_ratio, upper_ratio] if ratio is None else ratio + + aspect_ratio = math.sqrt(np.random.uniform(*ratio)) + w = 1. * aspect_ratio + h = 1. 
/ aspect_ratio + + bound = min((float(img.shape[0]) / img.shape[1]) / (h**2), + (float(img.shape[1]) / img.shape[0]) / (w**2)) + + scale_max = min(scale[1], bound) + scale_min = min(scale[0], bound) + + target_area = img.shape[0] * img.shape[1] * np.random.uniform(scale_min, + scale_max) + target_size = math.sqrt(target_area) + w = int(target_size * w) + h = int(target_size * h) + + i = np.random.randint(0, img.shape[0] - h + 1) + j = np.random.randint(0, img.shape[1] - w + 1) + img = img[i:i + h, j:j + w, :] + + if interpolation: + resized = cv2.resize(img, (size, size), interpolation=interpolation) + else: + resized = cv2.resize(img, (size, size)) + return resized + + +#NOTE:(2019/08/08) distort color func is not implemented +def distort_color(img): + """distort image color + + Args: + img: image data + + Returns: + distorted color image data + """ + return img + + +def resize_short(img, target_size, interpolation=None): + """resize image + + Args: + img: image data + target_size: resize short target size + interpolation: interpolation mode + + Returns: + resized image data + """ + percent = float(target_size) / min(img.shape[0], img.shape[1]) + resized_width = int(round(img.shape[1] * percent)) + resized_height = int(round(img.shape[0] * percent)) + if interpolation: + resized = cv2.resize( + img, (resized_width, resized_height), interpolation=interpolation) + else: + resized = cv2.resize(img, (resized_width, resized_height)) + return resized + + +def crop_image(img, target_size, center): + """crop image + + Args: + img: images data + target_size: crop target size + center: crop mode + + Returns: + img: cropped image data + """ + height, width = img.shape[:2] + size = target_size + if center == True: + w_start = (width - size) // 2 + h_start = (height - size) // 2 + else: + w_start = np.random.randint(0, width - size + 1) + h_start = np.random.randint(0, height - size + 1) + w_end = w_start + size + h_end = h_start + size + img = img[h_start:h_end, w_start:w_end, :] + return img + + +def create_mixup_reader(settings, rd): + """ + """ + + class context: + tmp_mix = [] + tmp_l1 = [] + tmp_l2 = [] + tmp_lam = [] + + alpha = settings.mixup_alpha + + def fetch_data(): + for item in rd(): + yield item + + def mixup_data(): + for data_list in fetch_data(): + if alpha > 0.: + lam = np.random.beta(alpha, alpha) + else: + lam = 1. 
+ l1 = np.array(data_list) + l2 = np.random.permutation(l1) + mixed_l = [ + l1[i][0] * lam + (1 - lam) * l2[i][0] for i in range(len(l1)) + ] + yield (mixed_l, l1, l2, lam) + + def mixup_reader(): + for context.tmp_mix, context.tmp_l1, context.tmp_l2, context.tmp_lam in mixup_data( + ): + for i in range(len(context.tmp_mix)): + mixed_l = context.tmp_mix[i] + l1 = context.tmp_l1[i] + l2 = context.tmp_l2[i] + lam = context.tmp_lam + yield (mixed_l, int(l1[1]), int(l2[1]), float(lam)) + + return mixup_reader + + +def process_image(sample, settings, mode, color_jitter, rotate): + """ process_image """ + + mean = settings.image_mean + std = settings.image_std + crop_size = settings.crop_size + + img_path = sample[0] + img = cv2.imread(img_path) + + if mode == 'train': + if rotate: + img = rotate_image(img) + if crop_size > 0: + img = random_crop( + img, crop_size, settings, interpolation=settings.interpolation) + if color_jitter: + img = distort_color(img) + if np.random.randint(0, 2) == 1: + img = img[:, ::-1, :] + else: + if crop_size > 0: + target_size = settings.resize_short_size + img = resize_short( + img, target_size, interpolation=settings.interpolation) + img = crop_image(img, target_size=crop_size, center=True) + + img = img[:, :, ::-1] + + if 'use_aa' in settings and settings.use_aa and mode == 'train': + img = np.ascontiguousarray(img) + img = Image.fromarray(img) + img = policy(img) + img = np.asarray(img) + + img = img.astype('float32').transpose((2, 0, 1)) / 255 + img_mean = np.array(mean).reshape((3, 1, 1)) + img_std = np.array(std).reshape((3, 1, 1)) + img -= img_mean + img /= img_std + + if mode == 'train' or mode == 'val': + return (img, sample[1]) + elif mode == 'test': + return (img, ) + + +def process_batch_data(input_data, settings, mode, color_jitter, rotate): + batch_data = [] + for sample in input_data: + if os.path.isfile(sample[0]): + batch_data.append( + process_image(sample, settings, mode, color_jitter, rotate)) + else: + print("File not exist : %s" % sample[0]) + return batch_data + + +class ImageNetReader: + def __init__(self, seed=None): + self.shuffle_seed = seed + + def set_shuffle_seed(self, seed): + assert isinstance(seed, int), "shuffle seed must be int" + self.shuffle_seed = seed + + def _reader_creator(self, + settings, + file_list, + mode, + shuffle=False, + color_jitter=False, + rotate=False, + data_dir=None): + num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) + if mode == 'test': + batch_size = 1 + else: + batch_size = settings.batch_size / paddle.fluid.core.get_cuda_device_count( + ) + + def reader(): + def read_file_list(): + with open(file_list) as flist: + full_lines = [line.strip() for line in flist] + if mode != "test" and len(full_lines) < settings.batch_size: + print( + "Warning: The number of the whole data ({}) is smaller than the batch_size ({}), and drop_last is turnning on, so nothing will feed in program, Terminated now. Please reset batch_size to a smaller number or feed more data!" + .format(len(full_lines), settings.batch_size)) + os._exit(1) + if num_trainers > 1 and mode == "train": + assert self.shuffle_seed is not None, "multiprocess train, shuffle seed must be set!" + np.random.RandomState(self.shuffle_seed).shuffle( + full_lines) + elif shuffle: + assert self.shuffle_seed is not None, "multiprocess train, shuffle seed must be set!" 
+ np.random.RandomState(self.shuffle_seed).shuffle( + full_lines) + + batch_data = [] + for line in full_lines: + img_path, label = line.split() + img_path = os.path.join(data_dir, img_path) + batch_data.append([img_path, int(label)]) + if len(batch_data) == batch_size: + if mode == 'train' or mode == 'val' or mode == 'test': + yield batch_data + + batch_data = [] + + return read_file_list + + data_reader = reader() + if mode == 'train' and num_trainers > 1: + assert self.shuffle_seed is not None, \ + "If num_trainers > 1, the shuffle_seed must be set, because " \ + "the order of batch data generated by reader " \ + "must be the same in the respective processes." + data_reader = paddle.fluid.contrib.reader.distributed_batch_reader( + data_reader) + + mapper = functools.partial( + process_batch_data, + settings=settings, + mode=mode, + color_jitter=color_jitter, + rotate=rotate) + + ret = fluid.io.xmap_readers( + mapper, + data_reader, + settings.reader_thread, + settings.reader_buf_size, + order=False) + + return ret + + def train(self, settings): + """Create a reader for trainning + + Args: + settings: arguments + + Returns: + train reader + """ + file_list = os.path.join(settings.data_dir, 'train_list.txt') + assert os.path.isfile( + file_list), "{} doesn't exist, please check data list path".format( + file_list) + + if 'use_aa' in settings and settings.use_aa: + global policy + policy = ImageNetPolicy() + + reader = self._reader_creator( + settings, + file_list, + 'train', + shuffle=True, + color_jitter=False, + rotate=False, + data_dir=settings.data_dir) + + if settings.use_mixup == True: + reader = create_mixup_reader(settings, reader) + reader = fluid.io.batch( + reader, + batch_size=int(settings.batch_size / + paddle.fluid.core.get_cuda_device_count()), + drop_last=True) + return reader + + def val(self, settings): + """Create a reader for eval + + Args: + settings: arguments + + Returns: + eval reader + """ + file_list = os.path.join(settings.data_dir, 'val_list.txt') + + assert os.path.isfile( + file_list), "{} doesn't exist, please check data list path".format( + file_list) + + return self._reader_creator( + settings, + file_list, + 'val', + shuffle=False, + data_dir=settings.data_dir) + + def test(self, settings): + """Create a reader for testing + + Args: + settings: arguments + + Returns: + test reader + """ + file_list = os.path.join(settings.data_dir, 'val_list.txt') + + assert os.path.isfile( + file_list), "{} doesn't exist, please check data list path".format( + file_list) + return self._reader_creator( + settings, + file_list, + 'test', + shuffle=False, + data_dir=settings.data_dir) diff --git a/dygraph/mobilenet/run_cpu_v1.sh b/dygraph/mobilenet/run_cpu_v1.sh new file mode 100644 index 0000000000000000000000000000000000000000..81de4df343f5e4f45c57931c8ea69858ca802dad --- /dev/null +++ b/dygraph/mobilenet/run_cpu_v1.sh @@ -0,0 +1 @@ +python3 train.py --use_gpu=False --batch_size=64 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --model_save_dir=output/ --lr_strategy=piecewise_decay --lr=0.1 --data_dir=./data/ILSVRC2012 --l2_decay=3e-5 --model=MobileNetV1 diff --git a/dygraph/mobilenet/run_cpu_v2.sh b/dygraph/mobilenet/run_cpu_v2.sh new file mode 100644 index 0000000000000000000000000000000000000000..4c18c006eca3c46a26175ae420066b1a75989ec3 --- /dev/null +++ b/dygraph/mobilenet/run_cpu_v2.sh @@ -0,0 +1 @@ +python3 train.py --use_gpu=False --batch_size=64 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --model_save_dir=output/ 
--lr_strategy=cosine_decay --lr=0.1 --num_epochs=240 --data_dir=/ssd9/chaj//data/ILSVRC2012 --l2_decay=4e-5 --model=MobileNetV2 diff --git a/dygraph/mobilenet/run_mul_v1.sh b/dygraph/mobilenet/run_mul_v1.sh new file mode 100644 index 0000000000000000000000000000000000000000..fa48ef5fe46ebfcf86c84a21bc1ecb7ad8a492df --- /dev/null +++ b/dygraph/mobilenet/run_mul_v1.sh @@ -0,0 +1,2 @@ +export CUDA_VISIBLE_DEVICES=0,1,2,3 +python3 -m paddle.distributed.launch --log_dir ./mylog.v1 train.py --use_data_parallel 1 --batch_size=256 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --lr_strategy=piecewise_decay --lr=0.1 --data_dir=./data/ILSVRC2012 --l2_decay=3e-5 --model=MobileNetV1 --model_save_dir=output.v1.mul/ --num_epochs=120 diff --git a/dygraph/mobilenet/run_mul_v1_checkpoint.sh b/dygraph/mobilenet/run_mul_v1_checkpoint.sh new file mode 100644 index 0000000000000000000000000000000000000000..6b511f19a2859b490006ca01eaa75232ef91cb50 --- /dev/null +++ b/dygraph/mobilenet/run_mul_v1_checkpoint.sh @@ -0,0 +1,2 @@ +export CUDA_VISIBLE_DEVICES=0,1,2,3 +python3 -m paddle.distributed.launch --log_dir ./mylog.v1.checkpoint train.py --use_data_parallel 1 --batch_size=256 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --lr_strategy=piecewise_decay --lr=0.1 --data_dir=./data/ILSVRC2012 --l2_decay=3e-5 --model=MobileNetV1 --model_save_dir=output.v1.mul.checkpoint/ --num_epochs=120 --checkpoint=./output.v1.mul/_mobilenet_v1_epoch50 diff --git a/dygraph/mobilenet/run_mul_v2.sh b/dygraph/mobilenet/run_mul_v2.sh new file mode 100644 index 0000000000000000000000000000000000000000..485cad365c3727710678f7426e3238b94c20f6e9 --- /dev/null +++ b/dygraph/mobilenet/run_mul_v2.sh @@ -0,0 +1,2 @@ +export CUDA_VISIBLE_DEVICES=0,1,2,3 +python3 -m paddle.distributed.launch --log_dir ./mylog.v2 train.py --use_data_parallel 1 --batch_size=500 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --model_save_dir=output.v2.mul/ --lr_strategy=cosine_decay --lr=0.1 --num_epochs=240 --data_dir=./data/ILSVRC2012 --l2_decay=4e-5 --model=MobileNetV2 diff --git a/dygraph/mobilenet/run_mul_v2_checkpoint.sh b/dygraph/mobilenet/run_mul_v2_checkpoint.sh new file mode 100644 index 0000000000000000000000000000000000000000..2b1b5587c01cde27a6eefe458f9b9bbf1d370f67 --- /dev/null +++ b/dygraph/mobilenet/run_mul_v2_checkpoint.sh @@ -0,0 +1,2 @@ +export CUDA_VISIBLE_DEVICES=0,1,2,3 +python3 -m paddle.distributed.launch --log_dir ./mylog.v2.checkpoint train.py --use_data_parallel 1 --batch_size=500 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --model_save_dir=output.v2.mul.checkpoint/ --lr_strategy=cosine_decay --lr=0.1 --num_epochs=240 --data_dir=./data/ILSVRC2012 --l2_decay=4e-5 --model=MobileNetV2 --checkpoint=./output.v2.mul/_mobilenet_v2_epoch50 diff --git a/dygraph/mobilenet/run_sing_v1.sh b/dygraph/mobilenet/run_sing_v1.sh new file mode 100644 index 0000000000000000000000000000000000000000..c4fef2984b06aa98b04e9ab0a481530ec3c22034 --- /dev/null +++ b/dygraph/mobilenet/run_sing_v1.sh @@ -0,0 +1,2 @@ +export CUDA_VISIBLE_DEVICES=0 +python3 train.py --batch_size=256 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --model_save_dir=output.v1.sing/ --lr_strategy=piecewise_decay --lr=0.1 --data_dir=./data/ILSVRC2012 --l2_decay=3e-5 --model=MobileNetV1 diff --git a/dygraph/mobilenet/run_sing_v1_checkpoint.sh b/dygraph/mobilenet/run_sing_v1_checkpoint.sh new file mode 100644 index 0000000000000000000000000000000000000000..47d68d96604fd1eccbcfe72509fb0858aa390804 --- 
/dev/null +++ b/dygraph/mobilenet/run_sing_v1_checkpoint.sh @@ -0,0 +1,2 @@ +export CUDA_VISIBLE_DEVICES=0 +python3 train.py --batch_size=256 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --model_save_dir=output.v1.sing/ --lr_strategy=piecewise_decay --lr=0.1 --data_dir=./data/ILSVRC2012 --l2_decay=3e-5 --model=MobileNetV1 --checkpoint=./output.v1.sing/_mobilenet_v1_epoch50 diff --git a/dygraph/mobilenet/run_sing_v2.sh b/dygraph/mobilenet/run_sing_v2.sh new file mode 100644 index 0000000000000000000000000000000000000000..f747ee5e01ba7d8d5c5eb35fb6e732a381a305b9 --- /dev/null +++ b/dygraph/mobilenet/run_sing_v2.sh @@ -0,0 +1,2 @@ +export CUDA_VISIBLE_DEVICES=0 +python3 train.py --batch_size=500 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --model_save_dir=output.v2.sing/ --lr_strategy=cosine_decay --lr=0.1 --num_epochs=240 --data_dir=./data/ILSVRC2012 --l2_decay=4e-5 --model=MobileNetV2 diff --git a/dygraph/mobilenet/run_sing_v2_checkpoint.sh b/dygraph/mobilenet/run_sing_v2_checkpoint.sh new file mode 100644 index 0000000000000000000000000000000000000000..ed77b221e8cc30dae16ec12d38901c66b56ef72a --- /dev/null +++ b/dygraph/mobilenet/run_sing_v2_checkpoint.sh @@ -0,0 +1,2 @@ +export CUDA_VISIBLE_DEVICES=0 +python3 train.py --batch_size=500 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --model_save_dir=output.v2.sing/ --lr_strategy=cosine_decay --lr=0.1 --num_epochs=240 --data_dir=./data/ILSVRC2012 --l2_decay=4e-5 --model=MobileNetV2 --checkpoint=./output.v2.sing/_mobilenet_v2_epoch50 diff --git a/dygraph/mobilenet/train.py b/dygraph/mobilenet/train.py new file mode 100644 index 0000000000000000000000000000000000000000..254279baedf3879ada6bc5c92ab3f733e5f3d524 --- /dev/null +++ b/dygraph/mobilenet/train.py @@ -0,0 +1,221 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
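One detail of the launch scripts above that is easy to miss: the `--batch_size` they pass is the global batch size. `reader.py` divides it by the number of visible CUDA devices, and `train.py` reshapes labels with the same per-device size. A small illustration with the numbers from `run_mul_v1.sh` (hypothetical output, assuming four visible GPUs):

```
# --batch_size in the scripts is global; each device reads batch_size / #GPUs.
import paddle.fluid as fluid

global_batch_size = 256                               # from run_mul_v1.sh
num_devices = fluid.core.get_cuda_device_count()      # 4 when CUDA_VISIBLE_DEVICES=0,1,2,3
per_device_batch = int(global_batch_size / num_devices)  # 64 samples per GPU per step
print(per_device_batch)
```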
+ +#order: standard library, third party, local library +import os +import time +import sys +import math +import argparse +import numpy as np +import paddle +import paddle.fluid as fluid +from paddle.fluid.initializer import MSRA +from paddle.fluid.param_attr import ParamAttr +from paddle.fluid.layer_helper import LayerHelper +from paddle.fluid.dygraph.base import to_variable +from paddle.fluid import framework +import reader +from utils import * +from mobilenet_v1 import * +from mobilenet_v2 import * + +args = parse_args() +if int(os.getenv("PADDLE_TRAINER_ID", 0)) == 0: + print_arguments(args) + + +def eval(net, test_data_loader, eop): + total_loss = 0.0 + total_acc1 = 0.0 + total_acc5 = 0.0 + total_sample = 0 + t_last = 0 + for img, label in test_data_loader(): + t1 = time.time() + label = to_variable(label.numpy().astype('int64').reshape( + int(args.batch_size // paddle.fluid.core.get_cuda_device_count()), + 1)) + out = net(img) + softmax_out = fluid.layers.softmax(out, use_cudnn=False) + loss = fluid.layers.cross_entropy(input=softmax_out, label=label) + avg_loss = fluid.layers.mean(x=loss) + acc_top1 = fluid.layers.accuracy(input=out, label=label, k=1) + acc_top5 = fluid.layers.accuracy(input=out, label=label, k=5) + t2 = time.time() + print( "test | epoch id: %d, avg_loss %0.5f acc_top1 %0.5f acc_top5 %0.5f %2.4f sec read_t:%2.4f" % \ + (eop, avg_loss.numpy(), acc_top1.numpy(), acc_top5.numpy(), t2 - t1 , t1 - t_last)) + sys.stdout.flush() + total_loss += avg_loss.numpy() + total_acc1 += acc_top1.numpy() + total_acc5 += acc_top5.numpy() + total_sample += 1 + t_last = time.time() + print("final eval loss %0.3f acc1 %0.3f acc5 %0.3f" % \ + (total_loss / total_sample, \ + total_acc1 / total_sample, total_acc5 / total_sample)) + sys.stdout.flush() + + +def train_mobilenet(): + if not args.use_gpu: + place = fluid.CPUPlace() + elif not args.use_data_parallel: + place = fluid.CUDAPlace(0) + else: + place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id) + with fluid.dygraph.guard(place): + # 1. init net and optimizer + if args.ce: + print("ce mode") + seed = 33 + np.random.seed(seed) + fluid.default_startup_program().random_seed = seed + fluid.default_main_program().random_seed = seed + if args.use_data_parallel: + strategy = fluid.dygraph.parallel.prepare_context() + + if args.model == "MobileNetV1": + net = MobileNetV1(class_dim=args.class_dim, scale=1.0) + model_path_pre = 'mobilenet_v1' + elif args.model == "MobileNetV2": + net = MobileNetV2(class_dim=args.class_dim, scale=1.0) + model_path_pre = 'mobilenet_v2' + else: + print( + "wrong model name, please try model = MobileNetV1 or MobileNetV2" + ) + exit() + + optimizer = create_optimizer(args=args, parameter_list=net.parameters()) + if args.use_data_parallel: + net = fluid.dygraph.parallel.DataParallel(net, strategy) + + # 2. load checkpoint + if args.checkpoint: + assert os.path.exists(args.checkpoint + ".pdparams"), \ + "Given dir {}.pdparams not exist.".format(args.checkpoint) + assert os.path.exists(args.checkpoint + ".pdopt"), \ + "Given dir {}.pdopt not exist.".format(args.checkpoint) + para_dict, opti_dict = fluid.dygraph.load_dygraph(args.checkpoint) + net.set_dict(para_dict) + optimizer.set_dict(opti_dict) + + # 3. 
reader + train_data_loader, train_data = utility.create_data_loader( + is_train=True, args=args) + test_data_loader, test_data = utility.create_data_loader( + is_train=False, args=args) + num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) + imagenet_reader = reader.ImageNetReader(0) + train_reader = imagenet_reader.train(settings=args) + test_reader = imagenet_reader.val(settings=args) + train_data_loader.set_sample_list_generator(train_reader, place) + test_data_loader.set_sample_list_generator(test_reader, place) + + # 4. train loop + for eop in range(args.num_epochs): + if num_trainers > 1: + imagenet_reader.set_shuffle_seed(eop + ( + args.random_seed if args.random_seed else 0)) + net.train() + total_loss = 0.0 + total_acc1 = 0.0 + total_acc5 = 0.0 + total_sample = 0 + batch_id = 0 + t_last = 0 + # 4.1 for each batch, call net() , backward(), and minimize() + for img, label in train_data_loader(): + t1 = time.time() + label = to_variable(label.numpy().astype('int64').reshape( + int(args.batch_size // + paddle.fluid.core.get_cuda_device_count()), 1)) + t_start = time.time() + + # 4.1.1 call net() + out = net(img) + + t_end = time.time() + softmax_out = fluid.layers.softmax(out, use_cudnn=False) + loss = fluid.layers.cross_entropy( + input=softmax_out, label=label) + avg_loss = fluid.layers.mean(x=loss) + acc_top1 = fluid.layers.accuracy(input=out, label=label, k=1) + acc_top5 = fluid.layers.accuracy(input=out, label=label, k=5) + t_start_back = time.time() + + # 4.1.2 call backward() + if args.use_data_parallel: + avg_loss = net.scale_loss(avg_loss) + avg_loss.backward() + net.apply_collective_grads() + else: + avg_loss.backward() + + t_end_back = time.time() + + # 4.1.3 call minimize() + optimizer.minimize(avg_loss) + + net.clear_gradients() + t2 = time.time() + train_batch_elapse = t2 - t1 + if batch_id % args.print_step == 0: + print( "epoch id: %d, batch step: %d, avg_loss %0.5f acc_top1 %0.5f acc_top5 %0.5f %2.4f sec net_t:%2.4f back_t:%2.4f read_t:%2.4f" % \ + (eop, batch_id, avg_loss.numpy(), acc_top1.numpy(), acc_top5.numpy(), train_batch_elapse, + t_end - t_start, t_end_back - t_start_back, t1 - t_last)) + sys.stdout.flush() + total_loss += avg_loss.numpy() + total_acc1 += acc_top1.numpy() + total_acc5 += acc_top5.numpy() + total_sample += 1 + batch_id += 1 + t_last = time.time() + if args.ce: + print("kpis\ttrain_acc1\t%0.3f" % (total_acc1 / total_sample)) + print("kpis\ttrain_acc5\t%0.3f" % (total_acc5 / total_sample)) + print("kpis\ttrain_loss\t%0.3f" % (total_loss / total_sample)) + print("epoch %d | batch step %d, loss %0.3f acc1 %0.3f acc5 %0.3f %2.4f sec" % \ + (eop, batch_id, total_loss / total_sample, \ + total_acc1 / total_sample, total_acc5 / total_sample, train_batch_elapse)) + + # 4.2 save checkpoint + save_parameters = (not args.use_data_parallel) or ( + args.use_data_parallel and + fluid.dygraph.parallel.Env().local_rank == 0) + if save_parameters: + if not os.path.isdir(args.model_save_dir): + os.makedirs(args.model_save_dir) + model_path = os.path.join( + args.model_save_dir, "_" + model_path_pre + "_epoch{}".format(eop)) + fluid.dygraph.save_dygraph(net.state_dict(), model_path) + fluid.dygraph.save_dygraph(optimizer.state_dict(), model_path) + + # 4.3 validation + net.eval() + eval(net, test_data_loader, eop) + + # 5. 
save final results + save_parameters = (not args.use_data_parallel) or ( + args.use_data_parallel and + fluid.dygraph.parallel.Env().local_rank == 0) + if save_parameters: + model_path = os.path.join( + args.model_save_dir, "_" + model_path_pre + "_final") + fluid.dygraph.save_dygraph(net.state_dict(), model_path) + + +if __name__ == '__main__': + train_mobilenet() diff --git a/dygraph/mobilenet/utils/__init__.py b/dygraph/mobilenet/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..4677e4535712c9f261bf18ba08ba6446d2db76d8 --- /dev/null +++ b/dygraph/mobilenet/utils/__init__.py @@ -0,0 +1,15 @@ +#copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +from .optimizer import cosine_decay, lr_warmup, cosine_decay_with_warmup, exponential_decay_with_warmup, Optimizer, create_optimizer +from .utility import add_arguments, print_arguments, parse_args, check_gpu, check_args, check_version, init_model, save_model, create_data_loader, print_info, best_strategy_compiled, init_model, save_model, ExponentialMovingAverage diff --git a/dygraph/mobilenet/utils/autoaugment.py b/dygraph/mobilenet/utils/autoaugment.py new file mode 100644 index 0000000000000000000000000000000000000000..c17bf963b38e61086802c2154d485896c715f04e --- /dev/null +++ b/dygraph/mobilenet/utils/autoaugment.py @@ -0,0 +1,245 @@ +""" +This code is based on https://github.com/DeepVoltaire/AutoAugment/blob/master/autoaugment.py +""" +from PIL import Image, ImageEnhance, ImageOps +import numpy as np +import random + + +class ImageNetPolicy(object): + """ Randomly choose one of the best 24 Sub-policies on ImageNet. 
+ + Example: + >>> policy = ImageNetPolicy() + >>> transformed = policy(image) + + Example as a PyTorch Transform: + >>> transform=transforms.Compose([ + >>> transforms.Resize(256), + >>> ImageNetPolicy(), + >>> transforms.ToTensor()]) + """ + + def __init__(self, fillcolor=(128, 128, 128)): + self.policies = [ + SubPolicy(0.4, "posterize", 8, 0.6, "rotate", 9, fillcolor), + SubPolicy(0.6, "solarize", 5, 0.6, "autocontrast", 5, fillcolor), + SubPolicy(0.8, "equalize", 8, 0.6, "equalize", 3, fillcolor), + SubPolicy(0.6, "posterize", 7, 0.6, "posterize", 6, fillcolor), + SubPolicy(0.4, "equalize", 7, 0.2, "solarize", 4, fillcolor), + SubPolicy(0.4, "equalize", 4, 0.8, "rotate", 8, fillcolor), + SubPolicy(0.6, "solarize", 3, 0.6, "equalize", 7, fillcolor), + SubPolicy(0.8, "posterize", 5, 1.0, "equalize", 2, fillcolor), + SubPolicy(0.2, "rotate", 3, 0.6, "solarize", 8, fillcolor), + SubPolicy(0.6, "equalize", 8, 0.4, "posterize", 6, fillcolor), + SubPolicy(0.8, "rotate", 8, 0.4, "color", 0, fillcolor), + SubPolicy(0.4, "rotate", 9, 0.6, "equalize", 2, fillcolor), + SubPolicy(0.0, "equalize", 7, 0.8, "equalize", 8, fillcolor), + SubPolicy(0.6, "invert", 4, 1.0, "equalize", 8, fillcolor), + SubPolicy(0.6, "color", 4, 1.0, "contrast", 8, fillcolor), + SubPolicy(0.8, "rotate", 8, 1.0, "color", 2, fillcolor), + SubPolicy(0.8, "color", 8, 0.8, "solarize", 7, fillcolor), + SubPolicy(0.4, "sharpness", 7, 0.6, "invert", 8, fillcolor), + SubPolicy(0.6, "shearX", 5, 1.0, "equalize", 9, fillcolor), + SubPolicy(0.4, "color", 0, 0.6, "equalize", 3, fillcolor), + SubPolicy(0.4, "equalize", 7, 0.2, "solarize", 4, fillcolor), + SubPolicy(0.6, "solarize", 5, 0.6, "autocontrast", 5, fillcolor), + SubPolicy(0.6, "invert", 4, 1.0, "equalize", 8, fillcolor), + SubPolicy(0.6, "color", 4, 1.0, "contrast", 8, fillcolor), + SubPolicy(0.8, "equalize", 8, 0.6, "equalize", 3, fillcolor) + ] + + def __call__(self, img, policy_idx=None): + if policy_idx is None or not isinstance(policy_idx, int): + policy_idx = random.randint(0, len(self.policies) - 1) + else: + policy_idx = policy_idx % len(self.policies) + return self.policies[policy_idx](img) + + def __repr__(self): + return "AutoAugment ImageNet Policy" + + +class CIFAR10Policy(object): + """ Randomly choose one of the best 25 Sub-policies on CIFAR10. 
+ + Example: + >>> policy = CIFAR10Policy() + >>> transformed = policy(image) + + Example as a PyTorch Transform: + >>> transform=transforms.Compose([ + >>> transforms.Resize(256), + >>> CIFAR10Policy(), + >>> transforms.ToTensor()]) + """ + + def __init__(self, fillcolor=(128, 128, 128)): + self.policies = [ + SubPolicy(0.1, "invert", 7, 0.2, "contrast", 6, fillcolor), + SubPolicy(0.7, "rotate", 2, 0.3, "translateX", 9, fillcolor), + SubPolicy(0.8, "sharpness", 1, 0.9, "sharpness", 3, fillcolor), + SubPolicy(0.5, "shearY", 8, 0.7, "translateY", 9, fillcolor), + SubPolicy(0.5, "autocontrast", 8, 0.9, "equalize", 2, fillcolor), + SubPolicy(0.2, "shearY", 7, 0.3, "posterize", 7, fillcolor), + SubPolicy(0.4, "color", 3, 0.6, "brightness", 7, fillcolor), + SubPolicy(0.3, "sharpness", 9, 0.7, "brightness", 9, fillcolor), + SubPolicy(0.6, "equalize", 5, 0.5, "equalize", 1, fillcolor), + SubPolicy(0.6, "contrast", 7, 0.6, "sharpness", 5, fillcolor), + SubPolicy(0.7, "color", 7, 0.5, "translateX", 8, fillcolor), + SubPolicy(0.3, "equalize", 7, 0.4, "autocontrast", 8, fillcolor), + SubPolicy(0.4, "translateY", 3, 0.2, "sharpness", 6, fillcolor), + SubPolicy(0.9, "brightness", 6, 0.2, "color", 8, fillcolor), + SubPolicy(0.5, "solarize", 2, 0.0, "invert", 3, fillcolor), + SubPolicy(0.2, "equalize", 0, 0.6, "autocontrast", 0, fillcolor), + SubPolicy(0.2, "equalize", 8, 0.8, "equalize", 4, fillcolor), + SubPolicy(0.9, "color", 9, 0.6, "equalize", 6, fillcolor), + SubPolicy(0.8, "autocontrast", 4, 0.2, "solarize", 8, fillcolor), + SubPolicy(0.1, "brightness", 3, 0.7, "color", 0, fillcolor), + SubPolicy(0.4, "solarize", 5, 0.9, "autocontrast", 3, fillcolor), + SubPolicy(0.9, "translateY", 9, 0.7, "translateY", 9, fillcolor), + SubPolicy(0.9, "autocontrast", 2, 0.8, "solarize", 3, fillcolor), + SubPolicy(0.8, "equalize", 8, 0.1, "invert", 3, fillcolor), + SubPolicy(0.7, "translateY", 9, 0.9, "autocontrast", 1, fillcolor) + ] + + def __call__(self, img, policy_idx=None): + if policy_idx is None or not isinstance(policy_idx, int): + policy_idx = random.randint(0, len(self.policies) - 1) + else: + policy_idx = policy_idx % len(self.policies) + return self.policies[policy_idx](img) + + def __repr__(self): + return "AutoAugment CIFAR10 Policy" + + +class SVHNPolicy(object): + """ Randomly choose one of the best 25 Sub-policies on SVHN. 
+ + Example: + >>> policy = SVHNPolicy() + >>> transformed = policy(image) + + Example as a PyTorch Transform: + >>> transform=transforms.Compose([ + >>> transforms.Resize(256), + >>> SVHNPolicy(), + >>> transforms.ToTensor()]) + """ + + def __init__(self, fillcolor=(128, 128, 128)): + self.policies = [ + SubPolicy(0.9, "shearX", 4, 0.2, "invert", 3, fillcolor), + SubPolicy(0.9, "shearY", 8, 0.7, "invert", 5, fillcolor), + SubPolicy(0.6, "equalize", 5, 0.6, "solarize", 6, fillcolor), + SubPolicy(0.9, "invert", 3, 0.6, "equalize", 3, fillcolor), + SubPolicy(0.6, "equalize", 1, 0.9, "rotate", 3, fillcolor), + SubPolicy(0.9, "shearX", 4, 0.8, "autocontrast", 3, fillcolor), + SubPolicy(0.9, "shearY", 8, 0.4, "invert", 5, fillcolor), + SubPolicy(0.9, "shearY", 5, 0.2, "solarize", 6, fillcolor), + SubPolicy(0.9, "invert", 6, 0.8, "autocontrast", 1, fillcolor), + SubPolicy(0.6, "equalize", 3, 0.9, "rotate", 3, fillcolor), + SubPolicy(0.9, "shearX", 4, 0.3, "solarize", 3, fillcolor), + SubPolicy(0.8, "shearY", 8, 0.7, "invert", 4, fillcolor), + SubPolicy(0.9, "equalize", 5, 0.6, "translateY", 6, fillcolor), + SubPolicy(0.9, "invert", 4, 0.6, "equalize", 7, fillcolor), + SubPolicy(0.3, "contrast", 3, 0.8, "rotate", 4, fillcolor), + SubPolicy(0.8, "invert", 5, 0.0, "translateY", 2, fillcolor), + SubPolicy(0.7, "shearY", 6, 0.4, "solarize", 8, fillcolor), + SubPolicy(0.6, "invert", 4, 0.8, "rotate", 4, fillcolor), SubPolicy( + 0.3, "shearY", 7, 0.9, "translateX", 3, fillcolor), SubPolicy( + 0.1, "shearX", 6, 0.6, "invert", 5, fillcolor), SubPolicy( + 0.7, "solarize", 2, 0.6, "translateY", 7, fillcolor), + SubPolicy(0.8, "shearY", 4, 0.8, "invert", 8, fillcolor), SubPolicy( + 0.7, "shearX", 9, 0.8, "translateY", 3, fillcolor), SubPolicy( + 0.8, "shearY", 5, 0.7, "autocontrast", 3, fillcolor), + SubPolicy(0.7, "shearX", 2, 0.1, "invert", 5, fillcolor) + ] + + def __call__(self, img, policy_idx=None): + if policy_idx is None or not isinstance(policy_idx, int): + policy_idx = random.randint(0, len(self.policies) - 1) + else: + policy_idx = policy_idx % len(self.policies) + return self.policies[policy_idx](img) + + def __repr__(self): + return "AutoAugment SVHN Policy" + + +class SubPolicy(object): + def __init__(self, + p1, + operation1, + magnitude_idx1, + p2, + operation2, + magnitude_idx2, + fillcolor=(128, 128, 128)): + ranges = { + "shearX": np.linspace(0, 0.3, 10), + "shearY": np.linspace(0, 0.3, 10), + "translateX": np.linspace(0, 150 / 331, 10), + "translateY": np.linspace(0, 150 / 331, 10), + "rotate": np.linspace(0, 30, 10), + "color": np.linspace(0.0, 0.9, 10), + "posterize": np.round(np.linspace(8, 4, 10), 0).astype(np.int), + "solarize": np.linspace(256, 0, 10), + "contrast": np.linspace(0.0, 0.9, 10), + "sharpness": np.linspace(0.0, 0.9, 10), + "brightness": np.linspace(0.0, 0.9, 10), + "autocontrast": [0] * 10, + "equalize": [0] * 10, + "invert": [0] * 10 + } + + # from https://stackoverflow.com/questions/5252170/specify-image-filling-color-when-rotating-in-python-with-pil-and-setting-expand + def rotate_with_fill(img, magnitude): + rot = img.convert("RGBA").rotate(magnitude) + return Image.composite(rot, + Image.new("RGBA", rot.size, (128, ) * 4), + rot).convert(img.mode) + + func = { + "shearX": lambda img, magnitude: img.transform( + img.size, Image.AFFINE, (1, magnitude * random.choice([-1, 1]), 0, 0, 1, 0), + Image.BICUBIC, fillcolor=fillcolor), + "shearY": lambda img, magnitude: img.transform( + img.size, Image.AFFINE, (1, 0, 0, magnitude * random.choice([-1, 1]), 1, 0), + 
Image.BICUBIC, fillcolor=fillcolor), + "translateX": lambda img, magnitude: img.transform( + img.size, Image.AFFINE, (1, 0, magnitude * img.size[0] * random.choice([-1, 1]), 0, 1, 0), + fillcolor=fillcolor), + "translateY": lambda img, magnitude: img.transform( + img.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude * img.size[1] * random.choice([-1, 1])), + fillcolor=fillcolor), + "rotate": lambda img, magnitude: rotate_with_fill(img, magnitude), + # "rotate": lambda img, magnitude: img.rotate(magnitude * random.choice([-1, 1])), + "color": lambda img, magnitude: ImageEnhance.Color(img).enhance(1 + magnitude * random.choice([-1, 1])), + "posterize": lambda img, magnitude: ImageOps.posterize(img, magnitude), + "solarize": lambda img, magnitude: ImageOps.solarize(img, magnitude), + "contrast": lambda img, magnitude: ImageEnhance.Contrast(img).enhance( + 1 + magnitude * random.choice([-1, 1])), + "sharpness": lambda img, magnitude: ImageEnhance.Sharpness(img).enhance( + 1 + magnitude * random.choice([-1, 1])), + "brightness": lambda img, magnitude: ImageEnhance.Brightness(img).enhance( + 1 + magnitude * random.choice([-1, 1])), + "autocontrast": lambda img, magnitude: ImageOps.autocontrast(img), + "equalize": lambda img, magnitude: ImageOps.equalize(img), + "invert": lambda img, magnitude: ImageOps.invert(img) + } + + self.p1 = p1 + self.operation1 = func[operation1] + self.magnitude1 = ranges[operation1][magnitude_idx1] + self.p2 = p2 + self.operation2 = func[operation2] + self.magnitude2 = ranges[operation2][magnitude_idx2] + + def __call__(self, img): + if random.random() < self.p1: + img = self.operation1(img, self.magnitude1) + if random.random() < self.p2: + img = self.operation2(img, self.magnitude2) + return img diff --git a/dygraph/mobilenet/utils/dist_utils.py b/dygraph/mobilenet/utils/dist_utils.py new file mode 100755 index 0000000000000000000000000000000000000000..29df3d3b110357653bd46723298de1d98d296659 --- /dev/null +++ b/dygraph/mobilenet/utils/dist_utils.py @@ -0,0 +1,93 @@ +#copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
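Note: the snippet below is an editorial usage sketch for the AutoAugment classes added above (`CIFAR10Policy`, `SVHNPolicy`, `SubPolicy`); it is not part of the diff. The module name `autoaugment` and the placeholder image are assumptions.

```python
import numpy as np
from PIL import Image

from autoaugment import CIFAR10Policy  # hypothetical module path for the policy file above

# Each policy object holds 25 fixed sub-policies; calling it applies one of them.
policy = CIFAR10Policy()
img = Image.new("RGB", (32, 32), (128, 128, 128))   # placeholder image

aug = policy(img)                      # a sub-policy is chosen at random
aug_fixed = policy(img, policy_idx=3)  # or a specific sub-policy index is forced
print(np.asarray(aug).shape, np.asarray(aug_fixed).shape)
```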
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import os +import paddle.fluid as fluid + + +def nccl2_prepare(args, startup_prog, main_prog): + config = fluid.DistributeTranspilerConfig() + config.mode = "nccl2" + t = fluid.DistributeTranspiler(config=config) + + envs = args.dist_env + + t.transpile( + envs["trainer_id"], + trainers=','.join(envs["trainer_endpoints"]), + current_endpoint=envs["current_endpoint"], + startup_program=startup_prog, + program=main_prog) + + +def pserver_prepare(args, train_prog, startup_prog): + config = fluid.DistributeTranspilerConfig() + config.slice_var_up = args.split_var + t = fluid.DistributeTranspiler(config=config) + envs = args.dist_env + training_role = envs["training_role"] + + t.transpile( + envs["trainer_id"], + program=train_prog, + pservers=envs["pserver_endpoints"], + trainers=envs["num_trainers"], + sync_mode=not args.async_mode, + startup_program=startup_prog) + if training_role == "PSERVER": + pserver_program = t.get_pserver_program(envs["current_endpoint"]) + pserver_startup_program = t.get_startup_program( + envs["current_endpoint"], + pserver_program, + startup_program=startup_prog) + return pserver_program, pserver_startup_program + elif training_role == "TRAINER": + train_program = t.get_trainer_program() + return train_program, startup_prog + else: + raise ValueError( + 'PADDLE_TRAINING_ROLE environment variable must be either TRAINER or PSERVER' + ) + + +def nccl2_prepare_paddle(trainer_id, startup_prog, main_prog): + config = fluid.DistributeTranspilerConfig() + config.mode = "nccl2" + t = fluid.DistributeTranspiler(config=config) + t.transpile( + trainer_id, + trainers=os.environ.get('PADDLE_TRAINER_ENDPOINTS'), + current_endpoint=os.environ.get('PADDLE_CURRENT_ENDPOINT'), + startup_program=startup_prog, + program=main_prog) + + +def prepare_for_multi_process(exe, build_strategy, train_prog): + # prepare for multi-process + trainer_id = int(os.environ.get('PADDLE_TRAINER_ID', 0)) + num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) + if num_trainers < 2: return + print("PADDLE_TRAINERS_NUM", num_trainers) + print("PADDLE_TRAINER_ID", trainer_id) + build_strategy.num_trainers = num_trainers + build_strategy.trainer_id = trainer_id + # NOTE(zcd): use multi processes to train the model, + # and each process use one GPU card. + startup_prog = fluid.Program() + nccl2_prepare_paddle(trainer_id, startup_prog, train_prog) + # the startup_prog are run two times, but it doesn't matter. + exe.run(startup_prog) diff --git a/dygraph/mobilenet/utils/optimizer.py b/dygraph/mobilenet/utils/optimizer.py new file mode 100644 index 0000000000000000000000000000000000000000..1501bcbe9c2e331c770cdf3d7d20ee2a8ae0a14b --- /dev/null +++ b/dygraph/mobilenet/utils/optimizer.py @@ -0,0 +1,308 @@ +#copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
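Note: a hedged sketch of how the `dist_utils.py` helpers above are typically wired together before compiling the train program; the function name, the `loss` variable, and the assumption that `paddle.distributed.launch` sets `FLAGS_selected_gpus` are illustrative, not part of the diff.

```python
import os

import paddle.fluid as fluid

from utils import dist_utils


def build_multi_process_executor(train_prog, startup_prog, loss):
    # One process per GPU card; the launcher is assumed to set FLAGS_selected_gpus.
    place = fluid.CUDAPlace(int(os.environ.get('FLAGS_selected_gpus', 0)))
    exe = fluid.Executor(place)
    exe.run(startup_prog)

    build_strategy = fluid.compiler.BuildStrategy()
    # No-op for a single trainer; otherwise it appends an NCCL2 startup program and
    # fills build_strategy.trainer_id / num_trainers from the PADDLE_* env variables.
    dist_utils.prepare_for_multi_process(exe, build_strategy, train_prog)

    compiled = fluid.CompiledProgram(train_prog).with_data_parallel(
        loss_name=loss.name, build_strategy=build_strategy)
    return exe, compiled
```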
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import math + +import paddle.fluid as fluid +import paddle.fluid.layers.ops as ops +from paddle.fluid.initializer import init_on_cpu +from paddle.fluid.layers.learning_rate_scheduler import _decay_step_counter + + +def cosine_decay(learning_rate, step_each_epoch, epochs=120): + """Applies cosine decay to the learning rate. + lr = learning_rate * (math.cos(epoch * (math.pi / epochs)) + 1) / 2 + """ + global_step = _decay_step_counter() + + with init_on_cpu(): + epoch = ops.floor(global_step / step_each_epoch) + decayed_lr = learning_rate * \ + (ops.cos(epoch * (math.pi / epochs)) + 1)/2 + return decayed_lr + + +def cosine_decay_with_warmup(learning_rate, step_each_epoch, epochs=120): + """Applies cosine decay to the learning rate with a linear warmup. + lr = learning_rate * (math.cos(epoch * (math.pi / epochs)) + 1) / 2 + The learning rate decreases every mini-batch and starts with a linear warmup. + """ + global_step = _decay_step_counter() + lr = fluid.layers.tensor.create_global_var( + shape=[1], + value=0.0, + dtype='float32', + persistable=True, + name="learning_rate") + + warmup_epoch = fluid.layers.fill_constant( + shape=[1], dtype='float32', value=float(5), force_cpu=True) + + with init_on_cpu(): + epoch = ops.floor(global_step / step_each_epoch) + with fluid.layers.control_flow.Switch() as switch: + with switch.case(epoch < warmup_epoch): + decayed_lr = learning_rate * (global_step / + (step_each_epoch * warmup_epoch)) + fluid.layers.tensor.assign(input=decayed_lr, output=lr) + with switch.default(): + decayed_lr = learning_rate * \ + (ops.cos((global_step - warmup_epoch * step_each_epoch) * (math.pi / (epochs * step_each_epoch))) + 1)/2 + fluid.layers.tensor.assign(input=decayed_lr, output=lr) + return lr + + +def exponential_decay_with_warmup(learning_rate, + step_each_epoch, + decay_epochs, + decay_rate=0.97, + warm_up_epoch=5.0): + """Applies exponential decay to the learning rate.
+ """ + global_step = _decay_step_counter() + lr = fluid.layers.tensor.create_global_var( + shape=[1], + value=0.0, + dtype='float32', + persistable=True, + name="learning_rate") + + warmup_epoch = fluid.layers.fill_constant( + shape=[1], dtype='float32', value=float(warm_up_epoch), force_cpu=True) + + with init_on_cpu(): + epoch = ops.floor(global_step / step_each_epoch) + with fluid.layers.control_flow.Switch() as switch: + with switch.case(epoch < warmup_epoch): + decayed_lr = learning_rate * (global_step / + (step_each_epoch * warmup_epoch)) + fluid.layers.assign(input=decayed_lr, output=lr) + with switch.default(): + div_res = ( + global_step - warmup_epoch * step_each_epoch) / decay_epochs + div_res = ops.floor(div_res) + decayed_lr = learning_rate * (decay_rate**div_res) + fluid.layers.assign(input=decayed_lr, output=lr) + + return lr + + +def lr_warmup(learning_rate, warmup_steps, start_lr, end_lr): + """ Applies linear learning rate warmup for distributed training + Argument learning_rate can be float or a Variable + lr = lr + (warmup_rate * step / warmup_steps) + """ + assert (isinstance(end_lr, float)) + assert (isinstance(start_lr, float)) + linear_step = end_lr - start_lr + with fluid.default_main_program()._lr_schedule_guard(): + lr = fluid.layers.tensor.create_global_var( + shape=[1], + value=0.0, + dtype='float32', + persistable=True, + name="learning_rate_warmup") + + global_step = fluid.layers.learning_rate_scheduler._decay_step_counter() + + with fluid.layers.control_flow.Switch() as switch: + with switch.case(global_step < warmup_steps): + decayed_lr = start_lr + linear_step * (global_step / + warmup_steps) + fluid.layers.tensor.assign(decayed_lr, lr) + with switch.default(): + fluid.layers.tensor.assign(learning_rate, lr) + + return lr + + +class Optimizer(object): + """A class used to represent several optimizer methods + + Attributes: + batch_size: batch size on all devices. + lr: learning rate. + lr_strategy: learning rate decay strategy. + l2_decay: l2_decay parameter. + momentum_rate: momentum rate when using Momentum optimizer. + step_epochs: piecewise decay steps. + num_epochs: number of total epochs. + + total_images: total images. + step: total steps in the an epoch. 
+ + """ + + def __init__(self, args, parameter_list): + self.parameter_list = parameter_list + self.batch_size = args.batch_size + self.lr = args.lr + self.lr_strategy = args.lr_strategy + self.l2_decay = args.l2_decay + self.momentum_rate = args.momentum_rate + self.step_epochs = args.step_epochs + self.num_epochs = args.num_epochs + self.warm_up_epochs = args.warm_up_epochs + self.decay_epochs = args.decay_epochs + self.decay_rate = args.decay_rate + self.total_images = args.total_images + + self.step = int(math.ceil(float(self.total_images) / self.batch_size)) + + def piecewise_decay(self): + """piecewise decay with Momentum optimizer + + Returns: + a piecewise_decay optimizer + """ + bd = [self.step * e for e in self.step_epochs] + lr = [self.lr * (0.1**i) for i in range(len(bd) + 1)] + learning_rate = fluid.layers.piecewise_decay(boundaries=bd, values=lr) + optimizer = fluid.optimizer.Momentum( + learning_rate=learning_rate, + momentum=self.momentum_rate, + regularization=fluid.regularizer.L2Decay(self.l2_decay), + parameter_list=self.parameter_list) + return optimizer + + def cosine_decay(self): + """cosine decay with Momentum optimizer + + Returns: + a cosine_decay optimizer + """ + + learning_rate = fluid.layers.cosine_decay( + learning_rate=self.lr, + step_each_epoch=self.step, + epochs=self.num_epochs) + optimizer = fluid.optimizer.Momentum( + learning_rate=learning_rate, + momentum=self.momentum_rate, + regularization=fluid.regularizer.L2Decay(self.l2_decay), + parameter_list=self.parameter_list) + return optimizer + + def cosine_decay_warmup(self): + """cosine decay with warmup + + Returns: + a cosine_decay_with_warmup optimizer + """ + + learning_rate = cosine_decay_with_warmup( + learning_rate=self.lr, + step_each_epoch=self.step, + epochs=self.num_epochs) + optimizer = fluid.optimizer.Momentum( + learning_rate=learning_rate, + momentum=self.momentum_rate, + regularization=fluid.regularizer.L2Decay(self.l2_decay), + parameter_list=self.parameter_list) + return optimizer + + def exponential_decay_warmup(self): + """exponential decay with warmup + + Returns: + a exponential_decay_with_warmup optimizer + """ + + learning_rate = exponential_decay_with_warmup( + learning_rate=self.lr, + step_each_epoch=self.step, + decay_epochs=self.step * self.decay_epochs, + decay_rate=self.decay_rate, + warm_up_epoch=self.warm_up_epochs) + optimizer = fluid.optimizer.RMSProp( + learning_rate=learning_rate, + regularization=fluid.regularizer.L2Decay(self.l2_decay), + momentum=self.momentum_rate, + rho=0.9, + epsilon=0.001, + parameter_list=self.parameter_list) + return optimizer + + def linear_decay(self): + """linear decay with Momentum optimizer + + Returns: + a linear_decay optimizer + """ + + end_lr = 0 + learning_rate = fluid.layers.polynomial_decay( + self.lr, self.step, end_lr, power=1) + optimizer = fluid.optimizer.Momentum( + learning_rate=learning_rate, + momentum=self.momentum_rate, + regularization=fluid.regularizer.L2Decay(self.l2_decay), + parameter_list=self.parameter_list) + + return optimizer + + def adam_decay(self): + """Adam optimizer + + Returns: + an adam_decay optimizer + """ + + return fluid.optimizer.Adam( + learning_rate=self.lr, parameter_list=self.parameter_list) + + def cosine_decay_RMSProp(self): + """cosine decay with RMSProp optimizer + + Returns: + an cosine_decay_RMSProp optimizer + """ + + learning_rate = fluid.layers.cosine_decay( + learning_rate=self.lr, + step_each_epoch=self.step, + epochs=self.num_epochs) + optimizer = fluid.optimizer.RMSProp( + 
learning_rate=learning_rate, + momentum=self.momentum_rate, + regularization=fluid.regularizer.L2Decay(self.l2_decay), + # Apply epsilon=1 on ImageNet dataset. + epsilon=1, + parameter_list=self.parameter_list) + return optimizer + + def default_decay(self): + """default decay + + Returns: + default decay optimizer + """ + + optimizer = fluid.optimizer.Momentum( + learning_rate=self.lr, + momentum=self.momentum_rate, + regularization=fluid.regularizer.L2Decay(self.l2_decay), + parameter_list=self.parameter_list) + return optimizer + + +def create_optimizer(args, parameter_list): + Opt = Optimizer(args, parameter_list) + optimizer = getattr(Opt, args.lr_strategy)() + + return optimizer diff --git a/dygraph/mobilenet/utils/utility.py b/dygraph/mobilenet/utils/utility.py new file mode 100644 index 0000000000000000000000000000000000000000..53678ebb72010c44e70a43b0b084ad54a72e6ca1 --- /dev/null +++ b/dygraph/mobilenet/utils/utility.py @@ -0,0 +1,576 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import distutils.util +import numpy as np +import six +import argparse +import functools +import logging +import sys +import os +import warnings +import signal + +import paddle +import paddle.fluid as fluid +from paddle.fluid.wrapped_decorator import signature_safe_contextmanager +from paddle.fluid.framework import Program, program_guard, name_scope, default_main_program +from paddle.fluid import unique_name, layers +from utils import dist_utils + + +def print_arguments(args): + """Print argparse's arguments. + + Usage: + + .. code-block:: python + + parser = argparse.ArgumentParser() + parser.add_argument("name", default="Jonh", type=str, help="User name.") + args = parser.parse_args() + print_arguments(args) + + :param args: Input argparse.Namespace for printing. + :type args: argparse.Namespace + """ + print("------------- Configuration Arguments -------------") + for arg, value in sorted(six.iteritems(vars(args))): + print("%25s : %s" % (arg, value)) + print("----------------------------------------------------") + + +def add_arguments(argname, type, default, help, argparser, **kwargs): + """Add argparse's argument. + + Usage: + + .. 
code-block:: python + + parser = argparse.ArgumentParser() + add_argument("name", str, "Jonh", "User name.", parser) + args = parser.parse_args() + """ + type = distutils.util.strtobool if type == bool else type + argparser.add_argument( + "--" + argname, + default=default, + type=type, + help=help + ' Default: %(default)s.', + **kwargs) + + +def parse_args(): + """Add arguments + + Returns: + all training args + """ + parser = argparse.ArgumentParser(description=__doc__) + add_arg = functools.partial(add_arguments, argparser=parser) + # yapf: disable + + add_arg('use_data_parallel', bool, False, "The flag indicating whether to use data parallel mode to train the model.") + add_arg('ce', bool, False, "run ce.") + + # ENV + add_arg('use_gpu', bool, True, "Whether to use GPU.") + add_arg('model_save_dir', str, "./output", "The directory path to save model.") + add_arg('data_dir', str, "../../PaddleCV/image_classification/data/ILSVRC2012/", "The ImageNet dataset root directory.") + #add_arg('data_dir', str, "../../PaddleCV/image_classification/data/", "The ImageNet dataset root directory.") + add_arg('pretrained_model', str, None, "Whether to load pretrained model.") + add_arg('checkpoint', str, None, "Whether to resume checkpoint.") + add_arg('print_step', int, 10, "The steps interval to print logs") + add_arg('save_step', int, 1, "The steps interval to save checkpoints") + + # SOLVER AND HYPERPARAMETERS + add_arg('model', str, "ResNet50", "The name of network.") + add_arg('total_images', int, 1281167, "The number of total training images.") + add_arg('num_epochs', int, 120, "The number of total epochs.") + add_arg('class_dim', int, 1000, "The number of total classes.") + add_arg('image_shape', str, "3,224,224", "The size of Input image, order: [channels, height, weidth] ") + add_arg('batch_size', int, 8, "Minibatch size on a device.") + add_arg('test_batch_size', int, 16, "Test batch size on a deveice.") + add_arg('lr', float, 0.1, "The learning rate.") + add_arg('lr_strategy', str, "piecewise_decay", "The learning rate decay strategy.") + add_arg('l2_decay', float, 1e-4, "The l2_decay parameter.") + add_arg('momentum_rate', float, 0.9, "The value of momentum_rate.") + add_arg('warm_up_epochs', float, 5.0, "The value of warm up epochs") + add_arg('decay_epochs', float, 2.4, "Decay epochs of exponential decay learning rate scheduler") + add_arg('decay_rate', float, 0.97, "Decay rate of exponential decay learning rate scheduler") + add_arg('drop_connect_rate', float, 0.2, "The value of drop connect rate") + parser.add_argument('--step_epochs', nargs='+', type=int, default=[30, 60, 90], help="piecewise decay step") + + # READER AND PREPROCESS + add_arg('lower_scale', float, 0.08, "The value of lower_scale in ramdom_crop") + add_arg('lower_ratio', float, 3./4., "The value of lower_ratio in ramdom_crop") + add_arg('upper_ratio', float, 4./3., "The value of upper_ratio in ramdom_crop") + add_arg('resize_short_size', int, 256, "The value of resize_short_size") + add_arg('crop_size', int, 224, "The value of crop size") + add_arg('use_mixup', bool, False, "Whether to use mixup") + add_arg('mixup_alpha', float, 0.2, "The value of mixup_alpha") + add_arg('reader_thread', int, 8, "The number of multi thread reader") + add_arg('reader_buf_size', int, 16, "The buf size of multi thread reader") + add_arg('interpolation', int, None, "The interpolation mode") + add_arg('use_aa', bool, False, "Whether to use auto augment") + parser.add_argument('--image_mean', nargs='+', type=float, default=[0.485, 0.456, 
0.406], help="The mean of input image data") + parser.add_argument('--image_std', nargs='+', type=float, default=[0.229, 0.224, 0.225], help="The std of input image data") + + # SWITCH + #NOTE: (2019/08/08) FP16 is moving to PaddlePaddle/Fleet now + #add_arg('use_fp16', bool, False, "Whether to enable half precision training with fp16." ) + #add_arg('scale_loss', float, 1.0, "The value of scale_loss for fp16." ) + add_arg('use_label_smoothing', bool, False, "Whether to use label_smoothing") + add_arg('label_smoothing_epsilon', float, 0.1, "The value of label_smoothing_epsilon parameter") + #NOTE: (2019/08/08) temporary disable use_distill + #add_arg('use_distill', bool, False, "Whether to use distill") + add_arg('random_seed', int, None, "random seed") + add_arg('use_ema', bool, False, "Whether to use ExponentialMovingAverage.") + add_arg('ema_decay', float, 0.9999, "The value of ema decay rate") + add_arg('padding_type', str, "SAME", "Padding type of convolution") + add_arg('use_se', bool, True, "Whether to use Squeeze-and-Excitation module for EfficientNet.") + # yapf: enable + + args = parser.parse_args() + + return args + + +def check_gpu(): + """ + Log error and exit when set use_gpu=true in paddlepaddle + cpu ver sion. + """ + logger = logging.getLogger(__name__) + err = "Config use_gpu cannot be set as true while you are " \ + "using paddlepaddle cpu version ! \nPlease try: \n" \ + "\t1. Install paddlepaddle-gpu to run model on GPU \n" \ + "\t2. Set use_gpu as false in config file to run " \ + "model on CPU" + + try: + if args.use_gpu and not fluid.is_compiled_with_cuda(): + print(err) + sys.exit(1) + except Exception as e: + pass + + +def check_version(): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." \ + + try: + fluid.require_version('1.6.0') + except Exception as e: + print(err) + sys.exit(1) + + +def check_args(args): + """check arguments before running + + Args: + all arguments + """ + + # check models name + sys.path.append("..") + import models + model_list = [m for m in dir(models) if "__" not in m] + assert args.model in model_list, "{} is not in lists: {}, please check the model name".format( + args.model, model_list) + + # check learning rate strategy + lr_strategy_list = [ + "piecewise_decay", "cosine_decay", "linear_decay", + "cosine_decay_warmup", "exponential_decay_warmup" + ] + if args.lr_strategy not in lr_strategy_list: + warnings.warn( + "\n{} is not in lists: {}, \nUse default learning strategy now.". + format(args.lr_strategy, lr_strategy_list)) + args.lr_strategy = "default_decay" + # check confict of GoogLeNet and mixup + if args.model == "GoogLeNet": + assert args.use_mixup == False, "Cannot use mixup processing in GoogLeNet, please set use_mixup = False." + + if args.interpolation: + assert args.interpolation in [ + 0, 1, 2, 3, 4 + ], "Wrong interpolation, please set:\n0: cv2.INTER_NEAREST\n1: cv2.INTER_LINEAR\n2: cv2.INTER_CUBIC\n3: cv2.INTER_AREA\n4: cv2.INTER_LANCZOS4" + + if args.padding_type: + assert args.padding_type in [ + "SAME", "VALID", "DYNAMIC" + ], "Wrong padding_type, please set:\nSAME\nVALID\nDYNAMIC" + + assert args.checkpoint is None or args.pretrained_model is None, "Do not init model by checkpoint and pretrained_model both." 
+ + # check pretrained_model path for loading + if args.pretrained_model is not None: + assert isinstance(args.pretrained_model, str) + assert os.path.isdir( + args. + pretrained_model), "please provide a valid pretrained_model path." + + #FIXME: check checkpoint path for saving + if args.checkpoint is not None: + assert isinstance(args.checkpoint, str) + assert os.path.isdir( + args.checkpoint + ), "please provide a valid checkpoint path for initializing the model." + + # check params for loading + """ + if args.save_params: + assert isinstance(args.save_params, str) + assert os.path.isdir( + args.save_params), "please provide a valid save_params path." + """ + + # check gpu: when using gpu, the number of visible cards should divide batch size + if args.use_gpu: + assert args.batch_size % fluid.core.get_cuda_device_count( + ) == 0, "batch_size({}) must be divisible by the number of available GPU cards({}); you can change the number of visible cards by setting: export CUDA_VISIBLE_DEVICES= ".format( + args.batch_size, fluid.core.get_cuda_device_count()) + + # check data directory + assert os.path.isdir( + args.data_dir + ), "Data doesn't exist in {}, please provide the right data path".format(args.data_dir) + + # check gpu and paddle version + + check_gpu() + check_version() + + +def init_model(exe, args, program): + if args.checkpoint: + fluid.io.load_persistables(exe, args.checkpoint, main_program=program) + print("Finished initializing model from %s" % (args.checkpoint)) + + if args.pretrained_model: + + def if_exist(var): + return os.path.exists(os.path.join(args.pretrained_model, var.name)) + + fluid.io.load_vars( + exe, + args.pretrained_model, + main_program=program, + predicate=if_exist) + + +def save_model(args, exe, train_prog, info): + model_path = os.path.join(args.model_save_dir, args.model, str(info)) + if not os.path.isdir(model_path): + os.makedirs(model_path) + fluid.io.save_persistables(exe, model_path, main_program=train_prog) + print("Model saved in %s" % (model_path)) + + +def create_data_loader(is_train, args): + """create data_loader + + Usage: + When mixup is used in training, it returns 5 results: data_loader, image, y_a (label), y_b (label) and lam; otherwise it returns 3 results: data_loader, image and label.
+ + Args: + is_train: mode + args: arguments + + Returns: + data_loader and the input data of net, + """ + image_shape = [int(m) for m in args.image_shape.split(",")] + + feed_image = fluid.data( + name="feed_image", + shape=[None] + image_shape, + dtype="float32", + lod_level=0) + + feed_label = fluid.data( + name="feed_label", shape=[None, 1], dtype="int64", lod_level=0) + feed_y_a = fluid.data( + name="feed_y_a", shape=[None, 1], dtype="int64", lod_level=0) + + if is_train and args.use_mixup: + feed_y_b = fluid.data( + name="feed_y_b", shape=[None, 1], dtype="int64", lod_level=0) + feed_lam = fluid.data( + name="feed_lam", shape=[None, 1], dtype="float32", lod_level=0) + + data_loader = fluid.io.DataLoader.from_generator( + capacity=64, + use_double_buffer=True, + iterable=True, + return_list=True) + + return data_loader, [feed_image, feed_y_a, feed_y_b, feed_lam] + else: + data_loader = fluid.io.DataLoader.from_generator( + capacity=64, + use_double_buffer=True, + iterable=True, + return_list=True) + + return data_loader, [feed_image, feed_label] + + +def print_info(pass_id, batch_id, print_step, metrics, time_info, info_mode): + """print function + + Args: + pass_id: epoch index + batch_id: batch index + print_step: the print_step arguments + metrics: message to print + time_info: time infomation + info_mode: mode + """ + if info_mode == "batch": + if batch_id % print_step == 0: + #if isinstance(metrics,np.ndarray): + # train and mixup output + if len(metrics) == 2: + loss, lr = metrics + print( + "[Pass {0}, train batch {1}] \tloss {2}, lr {3}, elapse {4}". + format(pass_id, batch_id, "%.5f" % loss, "%.5f" % lr, + "%2.4f sec" % time_info)) + # train and no mixup output + elif len(metrics) == 4: + loss, acc1, acc5, lr = metrics + print( + "[Pass {0}, train batch {1}] \tloss {2}, acc1 {3}, acc5 {4}, lr {5}, elapse {6}". + format(pass_id, batch_id, "%.5f" % loss, "%.5f" % acc1, + "%.5f" % acc5, "%.5f" % lr, "%2.4f sec" % time_info)) + # test output + elif len(metrics) == 3: + loss, acc1, acc5 = metrics + print( + "[Pass {0}, test batch {1}] \tloss {2}, acc1 {3}, acc5 {4}, elapse {5}". + format(pass_id, batch_id, "%.5f" % loss, "%.5f" % acc1, + "%.5f" % acc5, "%2.4f sec" % time_info)) + else: + raise Exception( + "length of metrics {} is not implemented, It maybe caused by wrong format of build_program_output". + format(len(metrics))) + sys.stdout.flush() + + elif info_mode == "epoch": + ## TODO add time elapse + #if isinstance(metrics,np.ndarray): + if len(metrics) == 5: + train_loss, _, test_loss, test_acc1, test_acc5 = metrics + print( + "[End pass {0}]\ttrain_loss {1}, test_loss {2}, test_acc1 {3}, test_acc5 {4}". + format(pass_id, "%.5f" % train_loss, "%.5f" % test_loss, "%.5f" + % test_acc1, "%.5f" % test_acc5)) + elif len(metrics) == 7: + train_loss, train_acc1, train_acc5, _, test_loss, test_acc1, test_acc5 = metrics + print( + "[End pass {0}]\ttrain_loss {1}, train_acc1 {2}, train_acc5 {3},test_loss {4}, test_acc1 {5}, test_acc5 {6}". 
+ format(pass_id, "%.5f" % train_loss, "%.5f" % train_acc1, "%.5f" + % train_acc5, "%.5f" % test_loss, "%.5f" % test_acc1, + "%.5f" % test_acc5)) + sys.stdout.flush() + elif info_mode == "ce": + raise Warning("CE code is not ready") + else: + raise Exception("Illegal info_mode") + + +def best_strategy_compiled(args, program, loss, exe): + """make a program which wrapped by a compiled program + """ + + if os.getenv('FLAGS_use_ngraph'): + return program + else: + build_strategy = fluid.compiler.BuildStrategy() + #Feature will be supported in Fluid v1.6 + #build_strategy.enable_inplace = True + + exec_strategy = fluid.ExecutionStrategy() + exec_strategy.num_threads = fluid.core.get_cuda_device_count() + exec_strategy.num_iteration_per_drop_scope = 10 + + num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) + if num_trainers > 1 and args.use_gpu: + dist_utils.prepare_for_multi_process(exe, build_strategy, program) + # NOTE: the process is fast when num_threads is 1 + # for multi-process training. + exec_strategy.num_threads = 1 + + compiled_program = fluid.CompiledProgram(program).with_data_parallel( + loss_name=loss.name, + build_strategy=build_strategy, + exec_strategy=exec_strategy) + + return compiled_program + + +class ExponentialMovingAverage(object): + def __init__(self, + decay=0.999, + thres_steps=None, + zero_debias=False, + name=None): + self._decay = decay + self._thres_steps = thres_steps + self._name = name if name is not None else '' + self._decay_var = self._get_ema_decay() + + self._params_tmps = [] + for param in default_main_program().global_block().all_parameters(): + if param.do_model_average != False: + tmp = param.block.create_var( + name=unique_name.generate(".".join( + [self._name + param.name, 'ema_tmp'])), + dtype=param.dtype, + persistable=False, + stop_gradient=True) + self._params_tmps.append((param, tmp)) + + self._ema_vars = {} + for param, tmp in self._params_tmps: + with param.block.program._optimized_guard( + [param, tmp]), name_scope('moving_average'): + self._ema_vars[param.name] = self._create_ema_vars(param) + + self.apply_program = Program() + block = self.apply_program.global_block() + with program_guard(main_program=self.apply_program): + decay_pow = self._get_decay_pow(block) + for param, tmp in self._params_tmps: + param = block._clone_variable(param) + tmp = block._clone_variable(tmp) + ema = block._clone_variable(self._ema_vars[param.name]) + layers.assign(input=param, output=tmp) + # bias correction + if zero_debias: + ema = ema / (1.0 - decay_pow) + layers.assign(input=ema, output=param) + + self.restore_program = Program() + block = self.restore_program.global_block() + with program_guard(main_program=self.restore_program): + for param, tmp in self._params_tmps: + tmp = block._clone_variable(tmp) + param = block._clone_variable(param) + layers.assign(input=tmp, output=param) + + def _get_ema_decay(self): + with default_main_program()._lr_schedule_guard(): + decay_var = layers.tensor.create_global_var( + shape=[1], + value=self._decay, + dtype='float32', + persistable=True, + name="scheduled_ema_decay_rate") + + if self._thres_steps is not None: + decay_t = (self._thres_steps + 1.0) / (self._thres_steps + 10.0) + with layers.control_flow.Switch() as switch: + with switch.case(decay_t < self._decay): + layers.tensor.assign(decay_t, decay_var) + with switch.default(): + layers.tensor.assign( + np.array( + [self._decay], dtype=np.float32), + decay_var) + return decay_var + + def _get_decay_pow(self, block): + global_steps = 
layers.learning_rate_scheduler._decay_step_counter() + decay_var = block._clone_variable(self._decay_var) + decay_pow_acc = layers.elementwise_pow(decay_var, global_steps + 1) + return decay_pow_acc + + def _create_ema_vars(self, param): + param_ema = layers.create_global_var( + name=unique_name.generate(self._name + param.name + '_ema'), + shape=param.shape, + value=0.0, + dtype=param.dtype, + persistable=True) + + return param_ema + + def update(self): + """ + Update Exponential Moving Average. Should only call this method in + train program. + """ + param_master_emas = [] + for param, tmp in self._params_tmps: + with param.block.program._optimized_guard( + [param, tmp]), name_scope('moving_average'): + param_ema = self._ema_vars[param.name] + if param.name + '.master' in self._ema_vars: + master_ema = self._ema_vars[param.name + '.master'] + param_master_emas.append([param_ema, master_ema]) + else: + ema_t = param_ema * self._decay_var + param * ( + 1 - self._decay_var) + layers.assign(input=ema_t, output=param_ema) + + # for fp16 params + for param_ema, master_ema in param_master_emas: + default_main_program().global_block().append_op( + type="cast", + inputs={"X": master_ema}, + outputs={"Out": param_ema}, + attrs={ + "in_dtype": master_ema.dtype, + "out_dtype": param_ema.dtype + }) + + @signature_safe_contextmanager + def apply(self, executor, need_restore=True): + """ + Apply moving average to parameters for evaluation. + + Args: + executor (Executor): The Executor to execute applying. + need_restore (bool): Whether to restore parameters after applying. + """ + executor.run(self.apply_program) + try: + yield + finally: + if need_restore: + self.restore(executor) + + def restore(self, executor): + """Restore parameters. + + Args: + executor (Executor): The Executor to execute restoring. 
+ """ + executor.run(self.restore_program) diff --git a/dygraph/ocr_recognition/train.py b/dygraph/ocr_recognition/train.py index 455a25482cb82bcc55b85beee9c1d7cc9611b0d1..821c5b9222a88187cae1c40ec37ad6d6d4e092b7 100644 --- a/dygraph/ocr_recognition/train.py +++ b/dygraph/ocr_recognition/train.py @@ -20,7 +20,7 @@ import paddle.fluid.profiler as profiler import paddle.fluid as fluid import paddle.fluid.layers as layers import data_reader -from paddle.fluid.dygraph.nn import Conv2D, Pool2D, FC, BatchNorm, Embedding, GRUUnit +from paddle.fluid.dygraph.nn import Conv2D, Pool2D, Linear, BatchNorm, Embedding, GRUUnit from paddle.fluid.dygraph.base import to_variable import argparse import functools @@ -57,6 +57,8 @@ class Config(object): ''' config for training ''' + # encoder rnn hidden_size + encoder_size = 200 # decoder size for decoder stage decoder_size = 128 # size for word embedding @@ -84,7 +86,6 @@ class Config(object): class ConvBNPool(fluid.dygraph.Layer): def __init__(self, - name_scope, group, out_ch, channels, @@ -92,7 +93,7 @@ class ConvBNPool(fluid.dygraph.Layer): is_test=False, pool=True, use_cudnn=True): - super(ConvBNPool, self).__init__(name_scope) + super(ConvBNPool, self).__init__() self.group = group self.pool = pool @@ -106,7 +107,7 @@ class ConvBNPool(fluid.dygraph.Layer): initializer=fluid.initializer.Normal(0.0, conv_std_1)) self.conv_0_layer = Conv2D( - self.full_name(), + channels[0], out_ch[0], 3, padding=1, @@ -115,9 +116,9 @@ class ConvBNPool(fluid.dygraph.Layer): act=None, use_cudnn=use_cudnn) self.bn_0_layer = BatchNorm( - self.full_name(), out_ch[0], act=act, is_test=is_test) + out_ch[0], act=act, is_test=is_test) self.conv_1_layer = Conv2D( - self.full_name(), + out_ch[0], num_filters=out_ch[1], filter_size=3, padding=1, @@ -126,12 +127,10 @@ class ConvBNPool(fluid.dygraph.Layer): act=None, use_cudnn=use_cudnn) self.bn_1_layer = BatchNorm( - self.full_name(), out_ch[1], act=act, is_test=is_test) + out_ch[1], act=act, is_test=is_test) - print( "pool", self.pool) if self.pool: self.pool_layer = Pool2D( - self.full_name(), pool_size=2, pool_type='max', pool_stride=2, @@ -151,25 +150,21 @@ class ConvBNPool(fluid.dygraph.Layer): class OCRConv(fluid.dygraph.Layer): - def __init__(self, name_scope, is_test=False, use_cudnn=True): - super(OCRConv, self).__init__(name_scope) + def __init__(self, is_test=False, use_cudnn=True): + super(OCRConv, self).__init__() self.conv_bn_pool_1 = ConvBNPool( - self.full_name(), 2, [16, 16], [1, 16], is_test=is_test, use_cudnn=use_cudnn) self.conv_bn_pool_2 = ConvBNPool( - self.full_name(), 2, [32, 32], [16, 32], is_test=is_test, use_cudnn=use_cudnn) self.conv_bn_pool_3 = ConvBNPool( - self.full_name(), 2, [64, 64], [32, 64], is_test=is_test, use_cudnn=use_cudnn) self.conv_bn_pool_4 = ConvBNPool( - self.full_name(), 2, [128, 128], [64, 128], is_test=is_test, pool=False, @@ -181,13 +176,11 @@ class OCRConv(fluid.dygraph.Layer): inputs_3 = self.conv_bn_pool_3(inputs_2) inputs_4 = self.conv_bn_pool_4(inputs_3) - #print( inputs_4.numpy() ) return inputs_4 class DynamicGRU(fluid.dygraph.Layer): def __init__(self, - scope_name, size, param_attr=None, bias_attr=None, @@ -197,10 +190,9 @@ class DynamicGRU(fluid.dygraph.Layer): h_0=None, origin_mode=False, init_size = None): - super(DynamicGRU, self).__init__(scope_name) + super(DynamicGRU, self).__init__() self.gru_unit = GRUUnit( - self.full_name(), size * 3, param_attr=param_attr, bias_attr=bias_attr, @@ -239,11 +231,10 @@ class DynamicGRU(fluid.dygraph.Layer): class 
EncoderNet(fluid.dygraph.Layer): def __init__(self, - scope_name, - rnn_hidden_size=200, + rnn_hidden_size=Config.encoder_size, is_test=False, use_cudnn=True): - super(EncoderNet, self).__init__(scope_name) + super(EncoderNet, self).__init__() self.rnn_hidden_size = rnn_hidden_size para_attr = fluid.ParamAttr(initializer=fluid.initializer.Normal(0.0, 0.02)) @@ -259,27 +250,24 @@ class EncoderNet(fluid.dygraph.Layer): dtype='float32', value=0) self.ocr_convs = OCRConv( - self.full_name(), is_test=is_test, use_cudnn=use_cudnn) + is_test=is_test, use_cudnn=use_cudnn) - self.fc_1_layer = FC(self.full_name(), + self.fc_1_layer = Linear( 768, rnn_hidden_size * 3, param_attr=para_attr, - bias_attr=False, - num_flatten_dims=2) - self.fc_2_layer = FC(self.full_name(), + bias_attr=False ) + print( "weight", self.fc_1_layer.weight.shape ) + self.fc_2_layer = Linear( 768, rnn_hidden_size * 3, param_attr=para_attr, - bias_attr=False, - num_flatten_dims=2) + bias_attr=False ) self.gru_forward_layer = DynamicGRU( - self.full_name(), size=rnn_hidden_size, h_0=h_0, param_attr=para_attr, bias_attr=bias_attr, candidate_activation='relu') self.gru_backward_layer = DynamicGRU( - self.full_name(), size=rnn_hidden_size, h_0=h_0, param_attr=para_attr, @@ -287,10 +275,9 @@ class EncoderNet(fluid.dygraph.Layer): candidate_activation='relu', is_reverse=True) - self.encoded_proj_fc = FC(self.full_name(), + self.encoded_proj_fc = Linear( rnn_hidden_size * 2, Config.decoder_size, - bias_attr=False, - num_flatten_dims=2) + bias_attr=False ) def forward(self, inputs): conv_features = self.ocr_convs(inputs) @@ -316,16 +303,15 @@ class EncoderNet(fluid.dygraph.Layer): class SimpleAttention(fluid.dygraph.Layer): - def __init__(self, scope_name, decoder_size): - super(SimpleAttention, self).__init__(scope_name) + def __init__(self, decoder_size): + super(SimpleAttention, self).__init__() - self.fc_1 = FC(self.full_name(), + self.fc_1 = Linear( decoder_size, decoder_size, act=None, bias_attr=False) - self.fc_2 = FC(self.full_name(), + self.fc_2 = Linear( decoder_size, 1, - num_flatten_dims = 2, act=None, bias_attr=False) @@ -354,23 +340,22 @@ class SimpleAttention(fluid.dygraph.Layer): class GRUDecoderWithAttention(fluid.dygraph.Layer): - def __init__(self, scope_name, decoder_size, num_classes): - super(GRUDecoderWithAttention, self).__init__(scope_name) - self.simple_attention = SimpleAttention(self.full_name(), decoder_size) + def __init__(self, decoder_size, num_classes): + super(GRUDecoderWithAttention, self).__init__() + self.simple_attention = SimpleAttention(decoder_size) - self.fc_1_layer = FC(self.full_name(), - size=decoder_size * 3, + self.fc_1_layer = Linear( input_dim = Config.encoder_size * 2, + output_dim=decoder_size * 3, bias_attr=False) - self.fc_2_layer = FC(self.full_name(), - size=decoder_size * 3, + self.fc_2_layer = Linear( input_dim = decoder_size, + output_dim=decoder_size * 3, bias_attr=False) self.gru_unit = GRUUnit( - self.full_name(), size=decoder_size * 3, param_attr=None, bias_attr=None) - self.out_layer = FC(self.full_name(), - size=num_classes + 2, + self.out_layer = Linear( input_dim = decoder_size, + output_dim =num_classes + 2, bias_attr=None, act='softmax') @@ -410,18 +395,18 @@ class GRUDecoderWithAttention(fluid.dygraph.Layer): class OCRAttention(fluid.dygraph.Layer): - def __init__(self, scope_name): - super(OCRAttention, self).__init__(scope_name) - self.encoder_net = EncoderNet(self.full_name()) - self.fc = FC(self.full_name(), - size=Config.decoder_size, + def __init__(self): + 
super(OCRAttention, self).__init__() + self.encoder_net = EncoderNet() + self.fc = Linear( input_dim = Config.encoder_size, + output_dim =Config.decoder_size, bias_attr=False, act='relu') self.embedding = Embedding( - self.full_name(), [Config.num_classes + 2, Config.word_vector_dim], + [Config.num_classes + 2, Config.word_vector_dim], dtype='float32') self.gru_decoder_with_attention = GRUDecoderWithAttention( - self.full_name(), Config.decoder_size, Config.num_classes) + Config.decoder_size, Config.num_classes) def forward(self, inputs, label_in): @@ -433,7 +418,7 @@ class OCRAttention(fluid.dygraph.Layer): decoder_boot = self.fc(backward_first) - label_in = fluid.layers.reshape(label_in, [-1, 1], inplace=False) + label_in = fluid.layers.reshape(label_in, [-1], inplace=False) trg_embedding = self.embedding(label_in) trg_embedding = fluid.layers.reshape( @@ -451,14 +436,14 @@ def train(args): with fluid.dygraph.guard(): backward_strategy = fluid.dygraph.BackwardStrategy() backward_strategy.sort_sum_gradient = True - ocr_attention = OCRAttention("ocr_attention") + ocr_attention = OCRAttention() if Config.learning_rate_decay == "piecewise_decay": learning_rate = fluid.layers.piecewise_decay( [50000], [Config.LR, Config.LR * 0.01]) else: learning_rate = Config.LR - optimizer = fluid.optimizer.Adam(learning_rate=0.001) + optimizer = fluid.optimizer.Adam(learning_rate=0.001, parameter_list=ocr_attention.parameters()) dy_param_init_value = {} grad_clip = fluid.dygraph_grad_clip.GradClipByGlobalNorm(5.0 ) @@ -486,8 +471,7 @@ def train(args): label_in = to_variable(data_dict["label_in"]) label_out = to_variable(data_dict["label_out"]) - label_out._stop_gradient = True - label_out.trainable = False + label_out.stop_gradient = True img = to_variable(data_dict["pixel"]) @@ -528,8 +512,7 @@ def train(args): label_in = to_variable(data_dict["label_in"]) label_out = to_variable(data_dict["label_out"]) - label_out._stop_gradient = True - label_out.trainable = False + label_out.stop_gradient = True img = to_variable(data_dict["pixel"]) @@ -549,8 +532,6 @@ def train(args): optimizer.minimize(avg_loss, grad_clip=grad_clip) ocr_attention.clear_gradients() - framework._dygraph_tracer()._clear_ops() - if batch_id > 0 and batch_id % 1000 == 0: print("epoch: {}, batch_id: {}, loss {}".format(epoch, batch_id, total_loss / args.batch_size / 1000)) diff --git a/dygraph/ptb_lm/README.md b/dygraph/ptb_lm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..cd3f06d2711409378e130466afbdd30a2e1b2d79 --- /dev/null +++ b/dygraph/ptb_lm/README.md @@ -0,0 +1,132 @@ +# 语言模型 + +# 简介 + +## 1. 任务说明 +本文主要介绍基于lstm的语言的模型的实现,给定一个输入词序列(中文分词、英文tokenize),计算其ppl(语言模型困惑度,用户表示句子的流利程度),基于循环神经网络语言模型的介绍可以[参阅论文](https://arxiv.org/abs/1409.2329)。相对于传统的方法,基于循环神经网络的方法能够更好的解决稀疏词的问题。 + +**目前语言模型要求使用PaddlePaddle 1.7及以上版本或适当的develop版本。** + +同时推荐用户参考[IPython Notebook demo](https://aistudio.baidu.com/aistudio/projectDetail/122290) + +## 2. 
效果说明 +在small meidum large三个不同配置情况的ppl对比: + +单卡V100,CUDA10 cudnn7,Python 3.7,CentOS release 6.3 + +| small config | train | valid | test | 训练速度 | +| :------------- | :---------: | :--------: | :----------: | :------------: | +| paddle静态图 | 40.962 | 118.111 | 112.617 | 41s/epoch | +| paddle动态图 | 40.566 | 119.541 | 115.300 | 93s/epoch | + +| medium config | train | valid | test | 训练速度 | +| :------------- | :---------: | :--------: | :----------: | :------------: | +| paddle静态图 | 45.620 | 87.398 | 83.682 | 53s/epoch | +| paddle动态图 | 45.738 | 87.428 | 83.810 | 104s/epoch | + +| large config | train | valid | test | 训练速度 | +| :------------- | :---------: | :--------: | :----------: | :------------: | +| paddle静态图 | 37.221 | 82.358 | 78.137 | 77s/epoch | +| paddle动态图 | 37.468 | 82.273 | 78.912 | 145s/epoch | + +## 3. 数据集 + +此任务的数据集合是采用ptb dataset,下载地址为: http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz + + +# 快速开始 + +## 1. 安装说明 + +### Paddle安装 +本项目依赖于 Paddle Fluid, 关于PaddlePaddle框架的安装教程,详见[PaddlePaddle官方网站](http://paddlepaddle.org/documentation/docs/zh/1.6/beginners_guide/install/index_cn.html)。 +### 安装代码 +### 环境依赖 + +## 2. 开始第一次模型调用 + +### 数据准备 +为了方便开发者进行测试,我们提供了数据下载脚本。用户也可以自行下载数据,并解压。 + +``` +cd data; sh download_data.sh +``` + +### 训练或fine-tune +任务训练启动命令如下: +``` +sh debug.sh +``` +可以通过`-d`指定数据的目录,`-t`指定模型的大小(默认为small,用户可以选择medium, 或者large)。 + +例如: +``` +sh debug.sh -t medium +``` + +# 进阶使用 +## 1. 任务定义与建模 +此任务目的是给定一个输入的词序列,预测下一个词出现的概率。 + +## 2. 模型原理介绍 +此任务采用了序列任务常用的rnn网络,实现了一个两层的lstm网络,然后lstm的结果去预测下一个词出现的概率。 + +由于数据的特殊性,每一个batch的last hidden和last cell会被作为下一个batch 的init hidden 和 init cell,数据的特殊性下节会介绍。 + + +## 3. 数据格式说明 +此任务的数据格式比较简单,每一行为一个已经分好词(英文的tokenize)的词序列。 + +目前的句子示例如下图所示: +``` +aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter +pierre N years old will join the board as a nonexecutive director nov. N +mr. is chairman of n.v. the dutch publishing group +``` + +特殊说明:ptb的数据比较特殊,ptb的数据来源于一些文章,相邻的句子可能来源于一个段落或者相邻的段落,ptb 数据不能做shuffle + + + +## 4. 目录结构 + +```text +. +├── README.md # 文档 +├── debug.sh # 启动脚本 +├── ptb_dy.py # 训练代码 +├── reader.py # 数据读取 +├── args.py # 参数读取 +└── data # 数据下载 +``` + +## 5. 如何组建自己的模型 ++ **自定义数据:** 关于数据,如果可以把自己的数据先进行分词(或者tokenize),然后放入到data目录下,并修改reader.py中文件的名称,如果句子之间没有关联,用户可以将`ptb_dy.py`中更新的代码注释掉。 + ``` + init_hidden = to_variable(init_hidden_data) + init_cell = to_variable(init_cell_data) + ``` + ++ **网络结构更改:** 网络只实现了基于lstm的语言模型,用户可以自己的需求更换为gru或者self等网络结构,这些实现都是在ptb_dy.py 中定义 + + +# 其他 + +## Copyright and License +Copyright 2019 Baidu.com, Inc. All Rights Reserved + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
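补充示例（示意代码，非仓库中的完整实现）：上文「模型原理介绍」一节提到“每个 batch 的 last hidden / last cell 会作为下一个 batch 的 init hidden / init cell”，在 `ptb_dy.py` 的训练循环中大致对应如下片段（变量名与训练代码一致，细节有省略）：

```python
init_hidden = to_variable(init_hidden_data)
init_cell = to_variable(init_cell_data)
for batch_id, (x_data, y_data) in enumerate(train_data_iter):
    x = to_variable(x_data.reshape((-1, num_steps, 1)))
    y = to_variable(y_data.reshape((-1, num_steps, 1)))
    dy_loss, last_hidden, last_cell = ptb_model(x, y, init_hidden, init_cell)
    # 关键点：ptb 相邻句子可能来自同一段落，状态需要跨 batch 传递，因此数据不能 shuffle
    init_hidden, init_cell = last_hidden, last_cell
    dy_loss.backward()
    sgd.minimize(dy_loss, grad_clip=grad_clip)
    ptb_model.clear_gradients()
```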
+ +## 如何贡献代码 + +如果你可以修复某个issue或者增加一个新功能,欢迎给我们提交PR。如果对应的PR被接受了,我们将根据贡献的质量和难度进行打分(0-5分,越高越好)。如果你累计获得了10分,可以联系我们获得面试机会或者为你写推荐信。 diff --git a/dygraph/ptb_lm/args.py b/dygraph/ptb_lm/args.py index ad33ea1a27155c81678f72ee46e6448e60a6ee45..6449b274542185dbb070fdfce0d14bd8138eeea9 100644 --- a/dygraph/ptb_lm/args.py +++ b/dygraph/ptb_lm/args.py @@ -19,6 +19,13 @@ from __future__ import print_function import argparse import distutils.util +def str2bool(v): + if v.lower() in ('yes', 'true', 't', 'y', '1'): + return True + elif v.lower() in ('no', 'false', 'f', 'n', '0'): + return False + else: + raise argparse.ArgumentTypeError('Unsupported value encountered.') def parse_args(): parser = argparse.ArgumentParser(description=__doc__) @@ -36,7 +43,7 @@ def parse_args(): "--data_path", type=str, help="all the data for train,valid,test") parser.add_argument('--para_init', action='store_true') parser.add_argument( - '--use_gpu', type=bool, default=False, help='whether using gpu') + '--use_gpu', type=str2bool, default=True, help='whether using gpu') parser.add_argument( '--log_path', help='path of the log file. If not set, logs are printed to console') diff --git a/dygraph/ptb_lm/debug.sh b/dygraph/ptb_lm/debug.sh index b31cb22cc4aee00b5291705fac7ab0bcee8bc204..23eb2ab887aaee5a3b08bdbd09ee7562c440ded6 100644 --- a/dygraph/ptb_lm/debug.sh +++ b/dygraph/ptb_lm/debug.sh @@ -1,7 +1,22 @@ export CUDA_VISIBLE_DEVICES=0 -#export FLAGS_fraction_of_gpu_memory_to_use=0.0 -python ptb_dy.py --data_path data/simple-examples/data/ \ - --model_type small +DATA_PATH="data/simple-examples/data/" +MODEL_TYPE="small" +while getopts d:t: opt +do + case $opt in + d) + DATA_PATH="$OPTARG" + ;; + t) + MODEL_TYPE="$OPTARG" + ;; + \?) + exit; + ;; + esac +done +echo "python ptb_dy.py --data_path $DATA_PATH --model_type $MODEL_TYPE" +python ptb_dy.py --data_path $DATA_PATH --model_type $MODEL_TYPE diff --git a/dygraph/ptb_lm/ptb_dy.py b/dygraph/ptb_lm/ptb_dy.py index 858fb13b7afab8f2cfffb5633bc5816273e95d39..86411a02acf9a8f4481da0f6b2f6ebf74d10b722 100644 --- a/dygraph/ptb_lm/ptb_dy.py +++ b/dygraph/ptb_lm/ptb_dy.py @@ -42,18 +42,16 @@ if sys.version[0] == '2': class SimpleLSTMRNN(fluid.Layer): def __init__(self, - name_scope, hidden_size, num_steps, num_layers=2, init_scale=0.1, dropout=None): - super(SimpleLSTMRNN, self).__init__(name_scope) + super(SimpleLSTMRNN, self).__init__() self._hidden_size = hidden_size self._num_layers = num_layers self._init_scale = init_scale self._dropout = dropout - self._input = None self._num_steps = num_steps self.cell_array = [] self.hidden_array = [] @@ -83,34 +81,23 @@ class SimpleLSTMRNN(fluid.Layer): self.bias_arr.append(self.add_parameter('b_%d' % i, bias_1)) def forward(self, input_embedding, init_hidden=None, init_cell=None): - self.cell_array = [] - self.hidden_array = [] + cell_array = [] + hidden_array = [] for i in range(self._num_layers): - pre_hidden = fluid.layers.slice( - init_hidden, axes=[0], starts=[i], ends=[i + 1]) - pre_cell = fluid.layers.slice( - init_cell, axes=[0], starts=[i], ends=[i + 1]) - pre_hidden = fluid.layers.reshape( - pre_hidden, shape=[-1, self._hidden_size]) - pre_cell = fluid.layers.reshape( - pre_cell, shape=[-1, self._hidden_size]) - self.hidden_array.append(pre_hidden) - self.cell_array.append(pre_cell) + hidden_array.append(init_hidden[i]) + cell_array.append(init_cell[i]) res = [] for index in range(self._num_steps): - self._input = fluid.layers.slice( - input_embedding, axes=[1], starts=[index], ends=[index + 1]) - self._input = fluid.layers.reshape( 
- self._input, shape=[-1, self._hidden_size]) + step_input = input_embedding[:,index,:] for k in range(self._num_layers): - pre_hidden = self.hidden_array[k] - pre_cell = self.cell_array[k] + pre_hidden = hidden_array[k] + pre_cell = cell_array[k] weight_1 = self.weight_1_arr[k] bias = self.bias_arr[k] - nn = fluid.layers.concat([self._input, pre_hidden], 1) + nn = fluid.layers.concat([step_input, pre_hidden], 1) gate_input = fluid.layers.matmul(x=nn, y=weight_1) gate_input = fluid.layers.elementwise_add(gate_input, bias) @@ -119,25 +106,23 @@ class SimpleLSTMRNN(fluid.Layer): c = pre_cell * fluid.layers.sigmoid(f) + fluid.layers.sigmoid( i) * fluid.layers.tanh(j) m = fluid.layers.tanh(c) * fluid.layers.sigmoid(o) - self.hidden_array[k] = m - self.cell_array[k] = c - self._input = m + hidden_array[k] = m + cell_array[k] = c + step_input = m if self._dropout is not None and self._dropout > 0.0: - self._input = fluid.layers.dropout( - self._input, + step_input = fluid.layers.dropout( + step_input, dropout_prob=self._dropout, dropout_implementation='upscale_in_train') - res.append( - fluid.layers.reshape( - self._input, shape=[1, -1, self._hidden_size])) - real_res = fluid.layers.concat(res, 0) - real_res = fluid.layers.transpose(x=real_res, perm=[1, 0, 2]) - last_hidden = fluid.layers.concat(self.hidden_array, 1) + res.append(step_input) + real_res = fluid.layers.concat(res, 1) + real_res = fluid.layers.reshape(real_res, [ -1, self._num_steps, self._hidden_size]) + last_hidden = fluid.layers.concat(hidden_array, 1) last_hidden = fluid.layers.reshape( last_hidden, shape=[-1, self._num_layers, self._hidden_size]) last_hidden = fluid.layers.transpose(x=last_hidden, perm=[1, 0, 2]) - last_cell = fluid.layers.concat(self.cell_array, 1) + last_cell = fluid.layers.concat(cell_array, 1) last_cell = fluid.layers.reshape( last_cell, shape=[-1, self._num_layers, self._hidden_size]) last_cell = fluid.layers.transpose(x=last_cell, perm=[1, 0, 2]) @@ -146,14 +131,13 @@ class SimpleLSTMRNN(fluid.Layer): class PtbModel(fluid.Layer): def __init__(self, - name_scope, hidden_size, vocab_size, num_layers=2, num_steps=20, init_scale=0.1, dropout=None): - super(PtbModel, self).__init__(name_scope) + super(PtbModel, self).__init__() self.hidden_size = hidden_size self.vocab_size = vocab_size self.init_scale = init_scale @@ -161,14 +145,12 @@ class PtbModel(fluid.Layer): self.num_steps = num_steps self.dropout = dropout self.simple_lstm_rnn = SimpleLSTMRNN( - self.full_name(), hidden_size, num_steps, num_layers=num_layers, init_scale=init_scale, dropout=dropout) self.embedding = Embedding( - self.full_name(), size=[vocab_size, hidden_size], dtype='float32', is_sparse=False, @@ -212,18 +194,14 @@ class PtbModel(fluid.Layer): rnn_out, last_hidden, last_cell = self.simple_lstm_rnn(x_emb, init_h, init_c) - rnn_out = fluid.layers.reshape( - rnn_out, shape=[-1, self.num_steps, self.hidden_size]) projection = fluid.layers.matmul(rnn_out, self.softmax_weight) projection = fluid.layers.elementwise_add(projection, self.softmax_bias) - projection = fluid.layers.reshape( - projection, shape=[-1, self.vocab_size]) + loss = fluid.layers.softmax_with_cross_entropy( logits=projection, label=label, soft_label=False) loss = fluid.layers.reshape(loss, shape=[-1, self.num_steps]) loss = fluid.layers.reduce_mean(loss, dim=[0]) loss = fluid.layers.reduce_sum(loss) - loss.permissions = True return loss, last_hidden, last_cell @@ -237,6 +215,11 @@ def train_ptb_lm(): # check if set use_gpu=True in paddlepaddle cpu version 
model_check.check_cuda(args.use_gpu) + + place = core.CPUPlace() + if args.use_gpu == True: + place = core.CUDAPlace(0) + # check if paddlepaddle version is satisfied model_check.check_version() @@ -295,7 +278,7 @@ def train_ptb_lm(): print("model type not support") return - with fluid.dygraph.guard(core.CUDAPlace(0)): + with fluid.dygraph.guard(place): if args.ce: print("ce mode") seed = 33 @@ -304,7 +287,6 @@ def train_ptb_lm(): fluid.default_main_program().random_seed = seed max_epoch = 1 ptb_model = PtbModel( - "ptb_model", hidden_size=hidden_size, vocab_size=vocab_size, num_layers=num_layers, @@ -335,7 +317,7 @@ def train_ptb_lm(): batch_len = len(train_data) // batch_size total_batch_size = (batch_len - 1) // num_steps - log_interval = total_batch_size // 20 + log_interval = 200 bd = [] lr_arr = [1.0] @@ -346,10 +328,10 @@ def train_ptb_lm(): lr_arr.append(new_lr) sgd = SGDOptimizer(learning_rate=fluid.layers.piecewise_decay( - boundaries=bd, values=lr_arr)) + boundaries=bd, values=lr_arr), parameter_list=ptb_model.parameters()) def eval(model, data): - print("begion to eval") + print("begin to eval") total_loss = 0.0 iters = 0.0 init_hidden_data = np.zeros( @@ -362,7 +344,7 @@ def train_ptb_lm(): for batch_id, batch in enumerate(train_data_iter): x_data, y_data = batch x_data = x_data.reshape((-1, num_steps, 1)) - y_data = y_data.reshape((-1, 1)) + y_data = y_data.reshape((-1, num_steps, 1)) x = to_variable(x_data) y = to_variable(y_data) init_hidden = to_variable(init_hidden_data) @@ -396,23 +378,24 @@ def train_ptb_lm(): train_data_iter = reader.get_data_iter(train_data, batch_size, num_steps) - + init_hidden = to_variable(init_hidden_data) + init_cell = to_variable(init_cell_data) start_time = time.time() for batch_id, batch in enumerate(train_data_iter): x_data, y_data = batch + x_data = x_data.reshape((-1, num_steps, 1)) - y_data = y_data.reshape((-1, 1)) + y_data = y_data.reshape((-1, num_steps, 1)) + x = to_variable(x_data) y = to_variable(y_data) - init_hidden = to_variable(init_hidden_data) - init_cell = to_variable(init_cell_data) + dy_loss, last_hidden, last_cell = ptb_model(x, y, init_hidden, init_cell) - + init_hidden = last_hidden + init_cell = last_cell out_loss = dy_loss.numpy() - init_hidden_data = last_hidden.numpy() - init_cell_data = last_cell.numpy() dy_loss.backward() sgd.minimize(dy_loss, grad_clip=grad_clip) @@ -422,14 +405,22 @@ def train_ptb_lm(): if batch_id > 0 and batch_id % log_interval == 0: ppl = np.exp(total_loss / iters) - print("-- Epoch:[%d]; Batch:[%d]; ppl: %.5f, lr: %.5f" % + print("-- Epoch:[%d]; Batch:[%d]; ppl: %.5f, lr: %.5f, loss: %.5f" % (epoch_id, batch_id, ppl[0], - sgd._global_learning_rate().numpy())) + sgd._global_learning_rate().numpy(), out_loss)) - print("one ecpoh finished", epoch_id) + print("one epoch finished", epoch_id) print("time cost ", time.time() - start_time) ppl = np.exp(total_loss / iters) print("-- Epoch:[%d]; ppl: %.5f" % (epoch_id, ppl[0])) + + if batch_size <= 20 and epoch_id == 0 and ppl[0] > 1000: + # for bad init, after first epoch, the loss is over 1000 + # no more need to continue + print("Parameters are randomly initialized and not good this time because the loss is over 1000 after the first epoch.") + print("Abort this training process and please start again.") + return + if args.ce: print("kpis\ttrain_ppl\t%0.3f" % ppl[0]) save_model_dir = os.path.join(args.save_model_dir, @@ -437,7 +428,8 @@ def train_ptb_lm(): fluid.save_dygraph(ptb_model.state_dict(), save_model_dir) print("Saved model to: %s.\n" % 
save_model_dir) - eval(ptb_model, test_data) + eval(ptb_model, valid_data) + eval(ptb_model, test_data) train_ptb_lm() diff --git a/dygraph/reinforcement_learning/actor_critic.py b/dygraph/reinforcement_learning/actor_critic.py index f68a53f85dbc2e6b9e0743413c359eee76ee9e5f..26ff614a588bc5f8206377c1862993312d335120 100644 --- a/dygraph/reinforcement_learning/actor_critic.py +++ b/dygraph/reinforcement_learning/actor_critic.py @@ -9,7 +9,7 @@ import paddle.fluid as fluid import paddle.fluid.dygraph.nn as nn import paddle.fluid.framework as framework -parser = argparse.ArgumentParser(description='PyTorch REINFORCE example') +parser = argparse.ArgumentParser() parser.add_argument( '--gamma', type=float, @@ -40,12 +40,12 @@ SavedAction = namedtuple('SavedAction', ['log_prob', 'value']) class Policy(fluid.dygraph.Layer): - def __init__(self, name_scope): - super(Policy, self).__init__(name_scope) + def __init__(self): + super(Policy, self).__init__() - self.affine1 = nn.FC(self.full_name(), size=128) - self.action_head = nn.FC(self.full_name(), size=2) - self.value_head = nn.FC(self.full_name(), size=1) + self.affine1 = nn.Linear(4, 128) + self.action_head = nn.Linear(128, 2) + self.value_head = nn.Linear(128, 1) self.saved_actions = [] self.rewards = [] @@ -65,10 +65,10 @@ with fluid.dygraph.guard(): fluid.default_startup_program().random_seed = args.seed fluid.default_main_program().random_seed = args.seed np.random.seed(args.seed) - policy = Policy("PolicyModel") + policy = Policy() eps = np.finfo(np.float32).eps.item() - optimizer = fluid.optimizer.AdamOptimizer(learning_rate=3e-2) + optimizer = fluid.optimizer.AdamOptimizer(learning_rate=3e-2, parameter_list=policy.parameters()) def get_mean_and_std(values=[]): n = 0. diff --git a/dygraph/reinforcement_learning/reinforce.py b/dygraph/reinforcement_learning/reinforce.py index 2a23b3450b65f8f213dc90f6c8ea22049a1c32f4..e7f4d7e56a6608a8d5027459746806639b06eac7 100644 --- a/dygraph/reinforcement_learning/reinforce.py +++ b/dygraph/reinforcement_learning/reinforce.py @@ -8,7 +8,7 @@ import paddle.fluid as fluid import paddle.fluid.dygraph.nn as nn import paddle.fluid.framework as framework -parser = argparse.ArgumentParser(description='PyTorch REINFORCE example') +parser = argparse.ArgumentParser() parser.add_argument( '--gamma', type=float, @@ -37,11 +37,11 @@ env.seed(args.seed) class Policy(fluid.dygraph.Layer): - def __init__(self, name_scope): - super(Policy, self).__init__(name_scope) + def __init__(self): + super(Policy, self).__init__() - self.affine1 = nn.FC(self.full_name(), size=128) - self.affine2 = nn.FC(self.full_name(), size=2) + self.affine1 = nn.Linear(4, 128) + self.affine2 = nn.Linear(128, 2) self.dropout_ratio = 0.6 self.saved_log_probs = [] @@ -64,10 +64,10 @@ with fluid.dygraph.guard(): fluid.default_main_program().random_seed = args.seed np.random.seed(args.seed) - policy = Policy("PolicyModel") + policy = Policy() eps = np.finfo(np.float32).eps.item() - optimizer = fluid.optimizer.AdamOptimizer(learning_rate=1e-2) + optimizer = fluid.optimizer.AdamOptimizer(learning_rate=1e-2, parameter_list=policy.parameters()) def get_mean_and_std(values=[]): n = 0. 
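The reinforcement-learning examples here (and the load-test variants that follow) apply the same two changes: `nn.FC(self.full_name(), size=...)` layers become `nn.Linear(in_features, out_features)`, and the dygraph optimizer is given `parameter_list=policy.parameters()`. A minimal sketch of the resulting construction pattern (the forward pass is illustrative, not the exact REINFORCE network):

```python
import paddle.fluid as fluid
import paddle.fluid.dygraph.nn as nn

class Policy(fluid.dygraph.Layer):
    def __init__(self):
        super(Policy, self).__init__()
        # Linear needs explicit input/output sizes; CartPole observations are 4-dimensional.
        self.affine1 = nn.Linear(4, 128)
        self.affine2 = nn.Linear(128, 2)

    def forward(self, x):
        x = fluid.layers.relu(self.affine1(x))
        return fluid.layers.softmax(self.affine2(x))

with fluid.dygraph.guard():
    policy = Policy()
    # In dygraph mode the optimizer must be told which parameters to update.
    optimizer = fluid.optimizer.AdamOptimizer(
        learning_rate=1e-2, parameter_list=policy.parameters())
```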
diff --git a/dygraph/reinforcement_learning/test_actor_critic_load.py b/dygraph/reinforcement_learning/test_actor_critic_load.py index 2ddbfd8ce017f02aa3f839eaca5ac805c27c24b0..21699bef7669c33389d52c71545836d7e0e82dee 100644 --- a/dygraph/reinforcement_learning/test_actor_critic_load.py +++ b/dygraph/reinforcement_learning/test_actor_critic_load.py @@ -9,7 +9,7 @@ import paddle.fluid as fluid import paddle.fluid.dygraph.nn as nn import paddle.fluid.framework as framework -parser = argparse.ArgumentParser(description='PyTorch REINFORCE example') +parser = argparse.ArgumentParser() parser.add_argument( '--gamma', type=float, @@ -40,12 +40,12 @@ SavedAction = namedtuple('SavedAction', ['log_prob', 'value']) class Policy(fluid.dygraph.Layer): - def __init__(self, name_scope): - super(Policy, self).__init__(name_scope) + def __init__(self): + super(Policy, self).__init__() - self.affine1 = nn.FC(self.full_name(), size=128) - self.action_head = nn.FC(self.full_name(), size=2) - self.value_head = nn.FC(self.full_name(), size=1) + self.affine1 = nn.Linear(4, 128) + self.action_head = nn.Linear(128, 2) + self.value_head = nn.Linear(128, 1) self.saved_actions = [] self.rewards = [] @@ -65,10 +65,10 @@ with fluid.dygraph.guard(): fluid.default_startup_program().random_seed = args.seed fluid.default_main_program().random_seed = args.seed np.random.seed(args.seed) - policy = Policy("PolicyModel") + policy = Policy() eps = np.finfo(np.float32).eps.item() - optimizer = fluid.optimizer.AdamOptimizer(learning_rate=3e-2) + optimizer = fluid.optimizer.AdamOptimizer(learning_rate=3e-2, parameter_list=policy.parameters()) def get_mean_and_std(values=[]): n = 0. diff --git a/dygraph/reinforcement_learning/test_reinforce_load.py b/dygraph/reinforcement_learning/test_reinforce_load.py index db7245d1ee44b49950ae46660a67937e333318b4..31edd66b0f8e4dea8907af71c117fb62dfb6ae42 100644 --- a/dygraph/reinforcement_learning/test_reinforce_load.py +++ b/dygraph/reinforcement_learning/test_reinforce_load.py @@ -8,7 +8,7 @@ import paddle.fluid as fluid import paddle.fluid.dygraph.nn as nn import paddle.fluid.framework as framework -parser = argparse.ArgumentParser(description='PyTorch REINFORCE example') +parser = argparse.ArgumentParser() parser.add_argument( '--gamma', type=float, @@ -37,11 +37,11 @@ env.seed(args.seed) class Policy(fluid.dygraph.Layer): - def __init__(self, name_scope): - super(Policy, self).__init__(name_scope) + def __init__(self): + super(Policy, self).__init__() - self.affine1 = nn.FC(self.full_name(), size=128) - self.affine2 = nn.FC(self.full_name(), size=2) + self.affine1 = nn.Linear(4, 128) + self.affine2 = nn.Linear(128, 2) self.dropout_ratio = 0.6 self.saved_log_probs = [] @@ -64,10 +64,10 @@ with fluid.dygraph.guard(): fluid.default_main_program().random_seed = args.seed np.random.seed(args.seed) - policy = Policy("PolicyModel") + policy = Policy() eps = np.finfo(np.float32).eps.item() - optimizer = fluid.optimizer.AdamOptimizer(learning_rate=1e-2) + optimizer = fluid.optimizer.AdamOptimizer(learning_rate=1e-2, parameter_list=policy.parameters()) def get_mean_and_std(values=[]): n = 0. 
diff --git a/dygraph/resnet/train.py b/dygraph/resnet/train.py index 5ce1d2466fac6c829001aee884c6b14c5a846ac2..d21f650b710c2cbe31415de7b434ccce80f9baf4 100644 --- a/dygraph/resnet/train.py +++ b/dygraph/resnet/train.py @@ -18,7 +18,7 @@ import ast import paddle import paddle.fluid as fluid from paddle.fluid.layer_helper import LayerHelper -from paddle.fluid.dygraph.nn import Conv2D, Pool2D, BatchNorm, FC +from paddle.fluid.dygraph.nn import Conv2D, Pool2D, BatchNorm, Linear from paddle.fluid.dygraph.base import to_variable from paddle.fluid import framework @@ -53,7 +53,7 @@ args = parse_args() batch_size = args.batch_size -def optimizer_setting(): +def optimizer_setting(parameter_list=None): total_images = IMAGENET1000 @@ -64,28 +64,36 @@ def optimizer_setting(): lr = [] lr = [base_lr * (0.1**i) for i in range(len(bd) + 1)] - optimizer = fluid.optimizer.Momentum( - learning_rate=fluid.layers.piecewise_decay( - boundaries=bd, values=lr), - momentum=momentum_rate, - regularization=fluid.regularizer.L2Decay(l2_decay)) + if fluid.in_dygraph_mode(): + optimizer = fluid.optimizer.Momentum( + learning_rate=fluid.layers.piecewise_decay( + boundaries=bd, values=lr), + momentum=momentum_rate, + regularization=fluid.regularizer.L2Decay(l2_decay), + parameter_list=parameter_list) + else: + optimizer = fluid.optimizer.Momentum( + learning_rate=fluid.layers.piecewise_decay( + boundaries=bd, values=lr), + momentum=momentum_rate, + regularization=fluid.regularizer.L2Decay(l2_decay)) + return optimizer class ConvBNLayer(fluid.dygraph.Layer): def __init__(self, - name_scope, num_channels, num_filters, filter_size, stride=1, groups=1, act=None): - super(ConvBNLayer, self).__init__(name_scope) + super(ConvBNLayer, self).__init__() self._conv = Conv2D( - self.full_name(), + num_channels=num_channels, num_filters=num_filters, filter_size=filter_size, stride=stride, @@ -94,7 +102,7 @@ class ConvBNLayer(fluid.dygraph.Layer): act=None, bias_attr=False) - self._batch_norm = BatchNorm(self.full_name(), num_filters, act=act) + self._batch_norm = BatchNorm(num_filters, act=act) def forward(self, inputs): y = self._conv(inputs) @@ -105,28 +113,24 @@ class ConvBNLayer(fluid.dygraph.Layer): class BottleneckBlock(fluid.dygraph.Layer): def __init__(self, - name_scope, num_channels, num_filters, stride, shortcut=True): - super(BottleneckBlock, self).__init__(name_scope) + super(BottleneckBlock, self).__init__() self.conv0 = ConvBNLayer( - self.full_name(), num_channels=num_channels, num_filters=num_filters, filter_size=1, act='relu') self.conv1 = ConvBNLayer( - self.full_name(), num_channels=num_filters, num_filters=num_filters, filter_size=3, stride=stride, act='relu') self.conv2 = ConvBNLayer( - self.full_name(), num_channels=num_filters, num_filters=num_filters * 4, filter_size=1, @@ -134,7 +138,6 @@ class BottleneckBlock(fluid.dygraph.Layer): if not shortcut: self.short = ConvBNLayer( - self.full_name(), num_channels=num_channels, num_filters=num_filters * 4, filter_size=1, @@ -161,8 +164,8 @@ class BottleneckBlock(fluid.dygraph.Layer): class ResNet(fluid.dygraph.Layer): - def __init__(self, name_scope, layers=50, class_dim=102): - super(ResNet, self).__init__(name_scope) + def __init__(self, layers=50, class_dim=102): + super(ResNet, self).__init__() self.layers = layers supported_layers = [50, 101, 152] @@ -175,47 +178,46 @@ class ResNet(fluid.dygraph.Layer): depth = [3, 4, 23, 3] elif layers == 152: depth = [3, 8, 36, 3] + num_channels = [64, 256, 512, 1024] num_filters = [64, 128, 256, 512] self.conv = ConvBNLayer( 
- self.full_name(), num_channels=3, num_filters=64, filter_size=7, stride=2, act='relu') self.pool2d_max = Pool2D( - self.full_name(), pool_size=3, pool_stride=2, pool_padding=1, pool_type='max') self.bottleneck_block_list = [] - num_channels = 64 for block in range(len(depth)): shortcut = False for i in range(depth[block]): bottleneck_block = self.add_sublayer( 'bb_%d_%d' % (block, i), BottleneckBlock( - self.full_name(), - num_channels=num_channels, + num_channels=num_channels[block] + if i == 0 else num_filters[block] * 4, num_filters=num_filters[block], stride=2 if i == 0 and block != 0 else 1, shortcut=shortcut)) - num_channels = bottleneck_block._num_channels_out self.bottleneck_block_list.append(bottleneck_block) shortcut = True self.pool2d_avg = Pool2D( - self.full_name(), pool_size=7, pool_type='avg', global_pooling=True) + pool_size=7, pool_type='avg', global_pooling=True) + + self.pool2d_avg_output = num_filters[len(num_filters) - 1] * 4 * 1 * 1 import math stdv = 1.0 / math.sqrt(2048 * 1.0) - self.out = FC(self.full_name(), - size=class_dim, + self.out = Linear(self.pool2d_avg_output, + class_dim, act='softmax', param_attr=fluid.param_attr.ParamAttr( initializer=fluid.initializer.Uniform(-stdv, stdv))) @@ -226,6 +228,7 @@ class ResNet(fluid.dygraph.Layer): for bottleneck_block in self.bottleneck_block_list: y = bottleneck_block(y) y = self.pool2d_avg(y) + y = fluid.layers.reshape(y, shape=[-1, self.pool2d_avg_output]) y = self.out(y) return y @@ -247,7 +250,7 @@ def eval(model, data): img = to_variable(dy_x_data) label = to_variable(y_data) - label._stop_gradient = True + label.stop_gradient = True out = model(img) #loss = fluid.layers.cross_entropy(input=out, label=label) @@ -265,16 +268,13 @@ def eval(model, data): # print("epoch id: %d, batch step: %d, loss: %f" % (eop, batch_id, dy_out)) if batch_id % 10 == 0: - print("test | batch step %d, loss %0.3f acc1 %0.3f acc5 %0.3f" % \ - ( batch_id, total_loss / total_sample, \ - total_acc1 / total_sample, total_acc5 / total_sample)) + print("test | batch step %d, acc1 %0.3f acc5 %0.3f" % \ + ( batch_id, total_acc1 / total_sample, total_acc5 / total_sample)) if args.ce: print("kpis\ttest_acc1\t%0.3f" % (total_acc1 / total_sample)) print("kpis\ttest_acc5\t%0.3f" % (total_acc5 / total_sample)) - print("kpis\ttest_loss\t%0.3f" % (total_loss / total_sample)) - print("final eval loss %0.3f acc1 %0.3f acc5 %0.3f" % \ - (total_loss / total_sample, \ - total_acc1 / total_sample, total_acc5 / total_sample)) + print("final eval acc1 %0.3f acc5 %0.3f" % \ + (total_acc1 / total_sample, total_acc5 / total_sample)) def train_resnet(): @@ -292,8 +292,8 @@ def train_resnet(): if args.use_data_parallel: strategy = fluid.dygraph.parallel.prepare_context() - resnet = ResNet("resnet") - optimizer = optimizer_setting() + resnet = ResNet() + optimizer = optimizer_setting(parameter_list=resnet.parameters()) if args.use_data_parallel: resnet = fluid.dygraph.parallel.DataParallel(resnet, strategy) @@ -335,7 +335,7 @@ def train_resnet(): img = to_variable(dy_x_data) label = to_variable(y_data) - label._stop_gradient = True + label.stop_gradient = True out = resnet(img) loss = fluid.layers.cross_entropy(input=out, label=label) diff --git a/dygraph/se_resnext/.run_ce.sh b/dygraph/se_resnet/.run_ce.sh similarity index 100% rename from dygraph/se_resnext/.run_ce.sh rename to dygraph/se_resnet/.run_ce.sh diff --git a/dygraph/se_resnext/README.md b/dygraph/se_resnet/README.md similarity index 100% rename from dygraph/se_resnext/README.md rename to 
dygraph/se_resnet/README.md diff --git a/dygraph/se_resnext/_ce.py b/dygraph/se_resnet/_ce.py similarity index 100% rename from dygraph/se_resnext/_ce.py rename to dygraph/se_resnet/_ce.py diff --git a/dygraph/se_resnext/train.py b/dygraph/se_resnet/train.py similarity index 87% rename from dygraph/se_resnext/train.py rename to dygraph/se_resnet/train.py index 56ff2c00b80f367bc28180885bf56b48101e15dc..67b9dacf2e07e19e07d466683769641830a6fd36 100644 --- a/dygraph/se_resnext/train.py +++ b/dygraph/se_resnet/train.py @@ -21,7 +21,7 @@ import paddle import paddle.fluid as fluid from paddle.fluid import core from paddle.fluid.layer_helper import LayerHelper -from paddle.fluid.dygraph.nn import Conv2D, Pool2D, BatchNorm, FC +from paddle.fluid.dygraph.nn import Conv2D, Pool2D, BatchNorm, Linear from paddle.fluid.dygraph.base import to_variable import sys import math @@ -59,7 +59,7 @@ momentum_rate = 0.9 l2_decay = 1.2e-4 -def optimizer_setting(params): +def optimizer_setting(params, parameter_list): ls = params["learning_strategy"] if "total_images" not in params: total_images = 6149 @@ -75,33 +75,33 @@ def optimizer_setting(params): learning_rate=fluid.layers.cosine_decay( learning_rate=lr, step_each_epoch=step, epochs=num_epochs), momentum=momentum_rate, - regularization=fluid.regularizer.L2Decay(l2_decay)) + regularization=fluid.regularizer.L2Decay(l2_decay), + parameter_list=parameter_list) return optimizer class ConvBNLayer(fluid.dygraph.Layer): def __init__(self, - name_scope, + num_channels, num_filters, filter_size, stride=1, groups=1, act=None): - super(ConvBNLayer, self).__init__(name_scope) + super(ConvBNLayer, self).__init__() self._conv = Conv2D( - "conv2d", + num_channels=num_channels, num_filters=num_filters, filter_size=filter_size, stride=stride, padding=(filter_size - 1) // 2, groups=groups, act=None, - bias_attr=False, - param_attr=fluid.ParamAttr(name="weights")) + bias_attr=False) - self._batch_norm = BatchNorm(self.full_name(), num_filters, act=act) + self._batch_norm = BatchNorm(num_filters, act=act) def forward(self, inputs): y = self._conv(inputs) @@ -111,29 +111,30 @@ class ConvBNLayer(fluid.dygraph.Layer): class SqueezeExcitation(fluid.dygraph.Layer): - def __init__(self, name_scope, num_channels, reduction_ratio): + def __init__(self, num_channels, reduction_ratio): - super(SqueezeExcitation, self).__init__(name_scope) - self._pool = Pool2D( - self.full_name(), pool_size=0, pool_type='avg', global_pooling=True) + super(SqueezeExcitation, self).__init__() + self._num_channels = num_channels + self._pool = Pool2D(pool_size=0, pool_type='avg', global_pooling=True) stdv = 1.0 / math.sqrt(num_channels * 1.0) - self._squeeze = FC( - self.full_name(), - size=num_channels // reduction_ratio, + self._fc = Linear( + num_channels, + num_channels // reduction_ratio, param_attr=fluid.ParamAttr( initializer=fluid.initializer.Uniform(-stdv, stdv)), act='relu') stdv = 1.0 / math.sqrt(num_channels / 16.0 * 1.0) - self._excitation = FC( - self.full_name(), - size=num_channels, + self._excitation = Linear( + num_channels // reduction_ratio, + num_channels, param_attr=fluid.ParamAttr( initializer=fluid.initializer.Uniform(-stdv, stdv)), act='sigmoid') def forward(self, input): y = self._pool(input) - y = self._squeeze(y) + y = fluid.layers.reshape(y, shape=[-1, self._num_channels]) + y = self._fc(y) y = self._excitation(y) y = fluid.layers.elementwise_mul(x=input, y=y, axis=0) return y @@ -141,41 +142,39 @@ class SqueezeExcitation(fluid.dygraph.Layer): class 
BottleneckBlock(fluid.dygraph.Layer): def __init__(self, - name_scope, num_channels, num_filters, stride, cardinality, reduction_ratio, shortcut=True): - super(BottleneckBlock, self).__init__(name_scope) + super(BottleneckBlock, self).__init__() self.conv0 = ConvBNLayer( - self.full_name(), + num_channels=num_channels, num_filters=num_filters, filter_size=1, act="relu") self.conv1 = ConvBNLayer( - self.full_name(), + num_channels=num_filters, num_filters=num_filters, filter_size=3, stride=stride, groups=cardinality, act="relu") self.conv2 = ConvBNLayer( - self.full_name(), + num_channels=num_filters, num_filters=num_filters * 2, filter_size=1, act=None) self.scale = SqueezeExcitation( - self.full_name(), num_channels=num_filters * 2, reduction_ratio=reduction_ratio) if not shortcut: self.short = ConvBNLayer( - self.full_name(), + num_channels=num_channels, num_filters=num_filters * 2, filter_size=1, stride=stride) @@ -200,8 +199,8 @@ class BottleneckBlock(fluid.dygraph.Layer): class SeResNeXt(fluid.dygraph.Layer): - def __init__(self, name_scope, layers=50, class_dim=102): - super(SeResNeXt, self).__init__(name_scope) + def __init__(self, layers=50, class_dim=102): + super(SeResNeXt, self).__init__() self.layers = layers supported_layers = [50, 101, 152] @@ -214,13 +213,12 @@ class SeResNeXt(fluid.dygraph.Layer): depth = [3, 4, 6, 3] num_filters = [128, 256, 512, 1024] self.conv0 = ConvBNLayer( - self.full_name(), + num_channels=3, num_filters=64, filter_size=7, stride=2, act='relu') self.pool = Pool2D( - self.full_name(), pool_size=3, pool_stride=2, pool_padding=1, @@ -231,13 +229,12 @@ class SeResNeXt(fluid.dygraph.Layer): depth = [3, 4, 23, 3] num_filters = [128, 256, 512, 1024] self.conv0 = ConvBNLayer( - self.full_name(), + num_channels=3, num_filters=64, filter_size=7, stride=2, act='relu') self.pool = Pool2D( - self.full_name(), pool_size=3, pool_stride=2, pool_padding=1, @@ -248,25 +245,24 @@ class SeResNeXt(fluid.dygraph.Layer): depth = [3, 8, 36, 3] num_filters = [128, 256, 512, 1024] self.conv0 = ConvBNLayer( - self.full_name(), + num_channels=3, num_filters=64, filter_size=3, stride=2, act='relu') self.conv1 = ConvBNLayer( - self.full_name(), + num_channels=64, num_filters=64, filter_size=3, stride=1, act='relu') self.conv2 = ConvBNLayer( - self.full_name(), + num_channels=64, num_filters=128, filter_size=3, stride=1, act='relu') self.pool = Pool2D( - self.full_name(), pool_size=3, pool_stride=2, pool_padding=1, @@ -274,13 +270,14 @@ class SeResNeXt(fluid.dygraph.Layer): self.bottleneck_block_list = [] num_channels = 64 + if layers == 152: + num_channels = 128 for block in range(len(depth)): shortcut = False for i in range(depth[block]): bottleneck_block = self.add_sublayer( 'bb_%d_%d' % (block, i), BottleneckBlock( - self.full_name(), num_channels=num_channels, num_filters=num_filters[block], stride=2 if i == 0 and block != 0 else 1, @@ -292,11 +289,13 @@ class SeResNeXt(fluid.dygraph.Layer): shortcut = True self.pool2d_avg = Pool2D( - self.full_name(), pool_size=7, pool_type='avg', global_pooling=True) + pool_size=7, pool_type='avg', global_pooling=True) stdv = 1.0 / math.sqrt(2048 * 1.0) - self.out = FC(self.full_name(), - size=class_dim, + self.pool2d_avg_output = num_filters[len(num_filters) - 1] * 2 * 1 * 1 + + self.out = Linear(self.pool2d_avg_output, + class_dim, param_attr=fluid.param_attr.ParamAttr( initializer=fluid.initializer.Uniform(-stdv, stdv))) @@ -306,14 +305,15 @@ class SeResNeXt(fluid.dygraph.Layer): y = self.pool(y) elif self.layers == 152: y = 
self.conv0(inputs) - y = self.conv1(inputs) - y = self.conv2(inputs) + y = self.conv1(y) + y = self.conv2(y) y = self.pool(y) for bottleneck_block in self.bottleneck_block_list: y = bottleneck_block(y) y = self.pool2d_avg(y) y = fluid.layers.dropout(y, dropout_prob=0.5, seed=100) + y = fluid.layers.reshape(y, shape=[-1, self.pool2d_avg_output]) y = self.out(y) return y @@ -336,7 +336,7 @@ def eval(model, data): img = to_variable(dy_x_data) label = to_variable(y_data) - label._stop_gradient = True + label.stop_gradient = True out = model(img) softmax_out = fluid.layers.softmax(out, use_cudnn=False) @@ -383,8 +383,8 @@ def train(): fluid.default_main_program().random_seed = seed if args.use_data_parallel: strategy = fluid.dygraph.parallel.prepare_context() - se_resnext = SeResNeXt("se_resnext") - optimizer = optimizer_setting(train_parameters) + se_resnext = SeResNeXt() + optimizer = optimizer_setting(train_parameters, se_resnext.parameters()) if args.use_data_parallel: se_resnext = fluid.dygraph.parallel.DataParallel(se_resnext, strategy) diff --git a/dygraph/sentiment/README.md b/dygraph/sentiment/README.md index 99d9b1dbf8203f01120bcb37ea8c312027bcd373..ef6dd466bcbfbc00453976fe5ecc8a74bee50504 100644 --- a/dygraph/sentiment/README.md +++ b/dygraph/sentiment/README.md @@ -3,22 +3,27 @@ 情感是人类的一种高级智能行为,为了识别文本的情感倾向,需要深入的语义建模。另外,不同领域(如餐饮、体育)在情感的表达各不相同,因而需要有大规模覆盖各个领域的数据进行模型训练。为此,我们通过基于深度学习的语义模型和大规模数据挖掘解决上述两个问题。效果上,我们基于开源情感倾向分类数据集ChnSentiCorp进行评测。具体数据如下所示: -| 模型 | dev | -| :------| :------ | -| CNN | 90.6% | +| 模型 | dev | test | +| :------| :------ | :------ | +| CNN | 90.6% | 89.7% | +| BOW | 90.1% | 90.3% | +| GRU | 90.0% | 91.1% | +| BIGRU | 89.7% | 89.6% | 动态图文档请见[Dygraph](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/user_guides/howto/dygraph/DyGraph.html) ## 快速开始 -本项目依赖于 Paddlepaddle 1.5.0 及以上版本,请参考 [安装指南](http://www.paddlepaddle.org/#quick-start) 进行安装 +本项目依赖于 Paddlepaddle 1.7.0 及以上版本,请参考 [安装指南](http://www.paddlepaddle.org/#quick-start) 进行安装。 + +python版本依赖python 2.7或python 3.5及以上版本。 -python版本依赖python 2.7或python 3.5及以上版本 #### 安装代码 -克隆数据集代码库到本地 +克隆数据集代码库到本地。 + ```shell git clone https://github.com/PaddlePaddle/models.git cd models/dygraph/sentiment @@ -27,6 +32,7 @@ cd models/dygraph/sentiment #### 数据准备 下载经过预处理的数据,文件解压之后,senta_data目录下会存在训练数据(train.tsv)、开发集数据(dev.tsv)、测试集数据(test.tsv)以及对应的词典(word_dict.txt) + ```shell wget https://baidu-nlp.bj.bcebos.com/sentiment_classification-dataset-1.0.0.tar.gz tar -zxvf sentiment_classification-dataset-1.0.0.tar.gz @@ -34,28 +40,43 @@ tar -zxvf sentiment_classification-dataset-1.0.0.tar.gz #### 模型训练 -基于示例的数据集,可以运行下面的命令,在训练集(train.tsv)上进行模型训练,并在开发集(dev.tsv)验证 +基于示例的数据集,可以运行下面的命令,在训练集(train.tsv)上进行模型训练,并在开发集(dev.tsv)验证。训练阶段需手动创建模型需要保存的文件夹,并且通过checkpoints设置保存文件路径。 +model_type从bow_net,cnn_net,gru_net,bigru_net中选择。 + ```shell -python main.py +python main.py --model_type=bow_net --do_train=True --do_infer=True --epoch=50 --batch_size=256 ``` #### 模型预测 -利用已有模型,可以运行下面命令,对未知label的数据(test.tsv)进行预测 +利用已有模型,可以运行下面命令,对未知label的数据(test.tsv)进行预测。 + ```shell -python main.py --do_train false --do_infer true --checkpoints ./path_to_save_models +python main.py --model_type=bow_net --do_train=False --do_infer=True --epoch=1 --checkpoints=./path_to_save_models ``` +#### 模型参数 + +1. batch_size, 根据模型情况和GPU占用率选择batch_size, 建议cnn/bow选择batch_size=256, gru/bigru选择batch_size=16。 +2. padding_size默认为150。 +3. epoch, training时默认设置为50,infer默认为1。 +4. 
learning_rate默认为0.002。 + + ## 进阶使用 #### 任务定义 传统的情感分类主要基于词典或者特征工程的方式进行分类,这种方法需要繁琐的人工特征设计和先验知识,理解停留于浅层并且扩展泛化能力差。为了避免传统方法的局限,我们采用近年来飞速发展的深度学习技术。基于深度学习的情感分类不依赖于人工特征,它能够端到端的对输入文本进行语义理解,并基于语义表示进行情感倾向的判断。 + #### 模型原理介绍 本项目针对情感倾向性分类问题,: + CNN(Convolutional Neural Networks),是一个基础的序列模型,能处理变长序列输入,提取局部区域之内的特征; ++ BOW(Bag Of Words)模型,是一个非序列模型,使用基本的全连接结构; ++ GRU(Gated Recurrent Unit),序列模型,能够较好地解决序列文本中长距离依赖的问题; ++ BI-GRU(Bidirectional Gated Recurrent Unit),序列模型,采用双向双层GRU结构,更好地捕获句子中的语义特征; #### 数据格式说明 diff --git a/dygraph/sentiment/main.py b/dygraph/sentiment/main.py old mode 100644 new mode 100755 index b22f7ee7e71922db978332eedaba48145a79f28d..1e752b9ecc654c67cff3318215f7284fc6f4abab --- a/dygraph/sentiment/main.py +++ b/dygraph/sentiment/main.py @@ -28,8 +28,8 @@ model_g = ArgumentGroup(parser, "model", "model configuration and paths.") model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints") train_g = ArgumentGroup(parser, "training", "training options.") -train_g.add_arg("epoch", int, 10, "Number of epoches for training.") -train_g.add_arg("save_steps", int, 1000, +train_g.add_arg("epoch", int, 50, "Number of epoches for training.") +train_g.add_arg("save_steps", int, 200, "The steps interval to save checkpoints.") train_g.add_arg("validation_steps", int, 200, "The steps interval to evaluate model performance.") @@ -47,7 +47,7 @@ data_g.add_arg("data_dir", str, "./senta_data/", "Path to training data.") data_g.add_arg("vocab_path", str, "./senta_data/word_dict.txt", "Vocabulary path.") data_g.add_arg("vocab_size", int, 33256, "Vocabulary path.") -data_g.add_arg("batch_size", int, 16, +data_g.add_arg("batch_size", int, 256, "Total examples' number in batch for training.") data_g.add_arg("random_seed", int, 0, "Random seed.") @@ -56,8 +56,9 @@ run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.") run_type_g.add_arg("do_train", bool, True, "Whether to perform training.") run_type_g.add_arg("do_val", bool, True, "Whether to perform evaluation.") run_type_g.add_arg("do_infer", bool, False, "Whether to perform inference.") -run_type_g.add_arg("profile_steps", int, 15000, +run_type_g.add_arg("profile_steps", int, 60000, "The steps interval to record the performance.") +train_g.add_arg("model_type", str, "bow_net", "Model type of training.") parser.add_argument("--ce", action="store_true", help="run ce") args = parser.parse_args() @@ -81,13 +82,13 @@ def profile_context(profile=True): else: yield - if args.ce: print("ce mode") seed = 90 np.random.seed(seed) fluid.default_startup_program().random_seed = seed - fluid.default_main_program().random_seed = seed + fluid.default_main_program().random_seed = seed + def train(): with fluid.dygraph.guard(place): @@ -96,7 +97,7 @@ def train(): seed = 90 np.random.seed(seed) fluid.default_startup_program().random_seed = seed - fluid.default_main_program().random_seed = seed + fluid.default_main_program().random_seed = seed processor = reader.SentaProcessor( data_dir=args.data_dir, vocab_path=args.vocab_path, @@ -106,7 +107,7 @@ def train(): num_train_examples = processor.get_num_examples(phase="train") max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count - + if not args.ce: train_data_generator = processor.data_generator( batch_size=args.batch_size, @@ -131,20 +132,27 @@ def train(): phase='dev', epoch=args.epoch, shuffle=False) - cnn_net = nets.CNN("cnn_net", args.vocab_size, args.batch_size, - args.padding_size) - - sgd_optimizer = fluid.optimizer.Adagrad(learning_rate=args.lr) + if 
args.model_type == 'cnn_net': + model = nets.CNN( args.vocab_size, args.batch_size, + args.padding_size) + elif args.model_type == 'bow_net': + model = nets.BOW( args.vocab_size, args.batch_size, + args.padding_size) + elif args.model_type == 'gru_net': + model = nets.GRU( args.vocab_size, args.batch_size, + args.padding_size) + elif args.model_type == 'bigru_net': + model = nets.BiGRU( args.vocab_size, args.batch_size, + args.padding_size) + sgd_optimizer = fluid.optimizer.Adagrad(learning_rate=args.lr,parameter_list=model.parameters()) steps = 0 total_cost, total_acc, total_num_seqs = [], [], [] - + gru_hidden_data = np.zeros((args.batch_size, 128), dtype='float32') for eop in range(args.epoch): time_begin = time.time() for batch_id, data in enumerate(train_data_generator()): enable_profile = steps > args.profile_steps - with profile_context(enable_profile): - steps += 1 doc = to_variable( np.array([ @@ -154,23 +162,21 @@ def train(): 'constant', constant_values=(args.vocab_size)) for x in data - ]).astype('int64').reshape(-1, 1)) - + ]).astype('int64').reshape(-1)) label = to_variable( np.array([x[1] for x in data]).astype('int64').reshape( args.batch_size, 1)) - - cnn_net.train() - avg_cost, prediction, acc = cnn_net(doc, label) + model.train() + avg_cost, prediction, acc = model(doc, label) avg_cost.backward() np_mask = (doc.numpy() != args.vocab_size).astype('int32') word_num = np.sum(np_mask) sgd_optimizer.minimize(avg_cost) - cnn_net.clear_gradients() + model.clear_gradients() total_cost.append(avg_cost.numpy() * word_num) total_acc.append(acc.numpy() * word_num) total_num_seqs.append(word_num) - + if steps % args.skip_steps == 0: time_end = time.time() used_time = time_end - time_begin @@ -185,8 +191,9 @@ def train(): if steps % args.validation_steps == 0: total_eval_cost, total_eval_acc, total_eval_num_seqs = [], [], [] - cnn_net.eval() + model.eval() eval_steps = 0 + gru_hidden_data = np.zeros((args.batch_size, 128), dtype='float32') for eval_batch_id, eval_data in enumerate( eval_data_generator()): eval_np_doc = np.array([ @@ -196,14 +203,13 @@ def train(): 'constant', constant_values=(args.vocab_size)) for x in eval_data - ]).astype('int64').reshape(1, -1) + ]).astype('int64').reshape(-1) eval_label = to_variable( np.array([x[1] for x in eval_data]).astype( 'int64').reshape(args.batch_size, 1)) - eval_doc = to_variable(eval_np_doc.reshape(-1, 1)) - eval_avg_cost, eval_prediction, eval_acc = cnn_net( + eval_doc = to_variable(eval_np_doc) + eval_avg_cost, eval_prediction, eval_acc = model( eval_doc, eval_label) - eval_np_mask = ( eval_np_doc != args.vocab_size).astype('int32') eval_word_num = np.sum(eval_np_mask) @@ -226,17 +232,21 @@ def train(): eval_steps / used_time)) time_begin = time.time() if args.ce: - print("kpis\ttrain_loss\t%0.3f" % (np.sum(total_eval_cost) / np.sum(total_eval_num_seqs))) - print("kpis\ttrain_acc\t%0.3f" % (np.sum(total_eval_acc) / np.sum(total_eval_num_seqs))) + print("kpis\ttrain_loss\t%0.3f" % + (np.sum(total_eval_cost) / + np.sum(total_eval_num_seqs))) + print("kpis\ttrain_acc\t%0.3f" % + (np.sum(total_eval_acc) / + np.sum(total_eval_num_seqs))) if steps % args.save_steps == 0: - save_path = "save_dir_" + str(steps) + save_path = args.checkpoints+"/"+"save_dir_" + str(steps) print('save model to: ' + save_path) - fluid.save_dygraph(cnn_net.state_dict(), - save_path) + fluid.dygraph.save_dygraph(model.state_dict(), + save_path) if enable_profile: - print('save profile result into /tmp/profile_file') - return + print('save profile result into 
/tmp/profile_file') + return def infer(): @@ -251,42 +261,45 @@ def infer(): phase='infer', epoch=args.epoch, shuffle=False) - - cnn_net_infer = nets.CNN("cnn_net", args.vocab_size, args.batch_size, - args.padding_size) - + if args.model_type == 'cnn_net': + model_infer = nets.CNN( args.vocab_size, args.batch_size, + args.padding_size) + elif args.model_type == 'bow_net': + model_infer = nets.BOW( args.vocab_size, args.batch_size, + args.padding_size) + elif args.model_type == 'gru_net': + model_infer = nets.GRU( args.vocab_size, args.batch_size, + args.padding_size) + elif args.model_type == 'bigru_net': + model_infer = nets.BiGRU( args.vocab_size, args.batch_size, + args.padding_size) print('Do inferring ...... ') - total_acc, total_num_seqs = [], [] - restore, _ = fluid.load_dygraph(args.checkpoints) - cnn_net_infer.set_dict(restore) - cnn_net_infer.eval() - + model_infer.set_dict(restore) + model_infer.eval() + total_acc, total_num_seqs = [], [] steps = 0 time_begin = time.time() for batch_id, data in enumerate(infer_data_generator()): steps += 1 - np_doc = np.array([ - np.pad(x[0][0:args.padding_size], - (0, args.padding_size - len(x[0][0:args.padding_size])), - 'constant', - constant_values=(args.vocab_size)) for x in data - ]).astype('int64').reshape(-1, 1) + np_doc = np.array([np.pad(x[0][0:args.padding_size], + (0, args.padding_size - + len(x[0][0:args.padding_size])), + 'constant', + constant_values=(args.vocab_size)) + for x in data + ]).astype('int64').reshape(-1) doc = to_variable(np_doc) label = to_variable( np.array([x[1] for x in data]).astype('int64').reshape( args.batch_size, 1)) - - _, _, acc = cnn_net_infer(doc, label) - + _, _, acc = model_infer(doc, label) mask = (np_doc != args.vocab_size).astype('int32') word_num = np.sum(mask) total_acc.append(acc.numpy() * word_num) total_num_seqs.append(word_num) - time_end = time.time() - used_time = time_end - time_begin - + used_time = time_end - time_begin print("Final infer result: ave acc: %f, speed: %f steps/s" % (np.sum(total_acc) / np.sum(total_num_seqs), steps / used_time)) diff --git a/dygraph/sentiment/nets.py b/dygraph/sentiment/nets.py old mode 100644 new mode 100755 index 4c64e3545ea6906d256560bd214dc70dbf5b7dbb..4f6e94fc494a26e8b95554dcd8336371ada6bd0e --- a/dygraph/sentiment/nets.py +++ b/dygraph/sentiment/nets.py @@ -12,24 +12,63 @@ # See the License for the specific language governing permissions and # limitations under the License. 
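`infer()` in main.py now mirrors `train()`: it picks one of the four networks from `--model_type`, restores the checkpoint with `fluid.load_dygraph`, and switches the model to eval mode before scoring. A compact sketch of that select-and-restore flow (a dict-based dispatch is used here for brevity instead of the script's if/elif chain):

```python
import paddle.fluid as fluid
import nets  # CNN / BOW / GRU / BiGRU, defined in the nets.py diff below

MODEL_CLASSES = {
    'cnn_net': nets.CNN,
    'bow_net': nets.BOW,
    'gru_net': nets.GRU,
    'bigru_net': nets.BiGRU,
}

def restore_for_infer(args):
    with fluid.dygraph.guard():
        # All four constructors share the (vocab_size, batch_size, padding_size) signature.
        model = MODEL_CLASSES[args.model_type](args.vocab_size, args.batch_size,
                                               args.padding_size)
        state_dict, _ = fluid.load_dygraph(args.checkpoints)
        model.set_dict(state_dict)
        model.eval()
        return model
```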
import paddle.fluid as fluid -from paddle.fluid.dygraph.nn import Conv2D, Pool2D, FC, Embedding +from paddle.fluid.dygraph.nn import Conv2D, Pool2D, Linear, Embedding +from paddle.fluid.dygraph import GRUUnit from paddle.fluid.dygraph.base import to_variable +import numpy as np + +class DynamicGRU(fluid.dygraph.Layer): + def __init__(self, + size, + param_attr=None, + bias_attr=None, + is_reverse=False, + gate_activation='sigmoid', + candidate_activation='tanh', + h_0=None, + origin_mode=False, + init_size = None): + super(DynamicGRU, self).__init__() + self.gru_unit = GRUUnit( + size * 3, + param_attr=param_attr, + bias_attr=bias_attr, + activation=candidate_activation, + gate_activation=gate_activation, + origin_mode=origin_mode) + self.size = size + self.h_0 = h_0 + self.is_reverse = is_reverse + def forward(self, inputs): + hidden = self.h_0 + res = [] + for i in range(inputs.shape[1]): + if self.is_reverse: + i = inputs.shape[1] - 1 - i + input_ = inputs[ :, i:i+1, :] + input_ = fluid.layers.reshape(input_, [-1, input_.shape[2]], inplace=False) + hidden, reset, gate = self.gru_unit(input_, hidden) + hidden_ = fluid.layers.reshape(hidden, [-1, 1, hidden.shape[1]], inplace=False) + res.append(hidden_) + if self.is_reverse: + res = res[::-1] + res = fluid.layers.concat(res, axis=1) + return res class SimpleConvPool(fluid.dygraph.Layer): def __init__(self, - name_scope, + num_channels, num_filters, filter_size, use_cudnn=False, batch_size=None): - super(SimpleConvPool, self).__init__(name_scope) + super(SimpleConvPool, self).__init__() self.batch_size = batch_size - self._conv2d = Conv2D( - self.full_name(), + self._conv2d = Conv2D(num_channels = num_channels, num_filters=num_filters, filter_size=filter_size, - padding=[1, 1], + padding=[1, 1], use_cudnn=use_cudnn, act='tanh') @@ -41,49 +80,182 @@ class SimpleConvPool(fluid.dygraph.Layer): class CNN(fluid.dygraph.Layer): - def __init__(self, name_scope, dict_dim, batch_size, seq_len): - super(CNN, self).__init__(name_scope) + def __init__(self, dict_dim, batch_size, seq_len): + super(CNN, self).__init__() self.dict_dim = dict_dim self.emb_dim = 128 self.hid_dim = 128 self.fc_hid_dim = 96 self.class_dim = 2 + self.channels = 1 self.win_size = [3, self.hid_dim] self.batch_size = batch_size self.seq_len = seq_len self.embedding = Embedding( - self.full_name(), size=[self.dict_dim + 1, self.emb_dim], dtype='float32', is_sparse=False) - self._simple_conv_pool_1 = SimpleConvPool( - self.full_name(), + self.channels, self.hid_dim, self.win_size, batch_size=self.batch_size) - self._fc1 = FC(self.full_name(), size=self.fc_hid_dim, act="softmax") - self._fc_prediction = FC(self.full_name(), - size=self.class_dim, + self._fc1 = Linear(input_dim = self.hid_dim*self.seq_len, output_dim=self.fc_hid_dim, act="softmax") + self._fc_prediction = Linear(input_dim = self.fc_hid_dim, + output_dim = self.class_dim, act="softmax") - def forward(self, inputs, label=None): emb = self.embedding(inputs) - o_np_mask = (inputs.numpy() != self.dict_dim).astype('float32') + o_np_mask = (inputs.numpy().reshape(-1,1) != self.dict_dim).astype('float32') mask_emb = fluid.layers.expand( to_variable(o_np_mask), [1, self.hid_dim]) emb = emb * mask_emb emb = fluid.layers.reshape( - emb, shape=[-1, 1, self.seq_len, self.hid_dim]) + emb, shape=[-1, self.channels , self.seq_len, self.hid_dim]) conv_3 = self._simple_conv_pool_1(emb) fc_1 = self._fc1(conv_3) prediction = self._fc_prediction(fc_1) + if label: + cost = fluid.layers.cross_entropy(input=prediction, label=label) + 
avg_cost = fluid.layers.mean(x=cost) + acc = fluid.layers.accuracy(input=prediction, label=label) + return avg_cost, prediction, acc + else: + return prediction + +class BOW(fluid.dygraph.Layer): + def __init__(self, dict_dim, batch_size, seq_len): + super(BOW, self).__init__() + self.dict_dim = dict_dim + self.emb_dim = 128 + self.hid_dim = 128 + self.fc_hid_dim = 96 + self.class_dim = 2 + self.batch_size = batch_size + self.seq_len = seq_len + self.embedding = Embedding( + size=[self.dict_dim + 1, self.emb_dim], + dtype='float32', + is_sparse=False) + self._fc1 = Linear(input_dim = self.hid_dim, output_dim=self.hid_dim, act="tanh") + self._fc2 = Linear(input_dim = self.hid_dim, output_dim=self.fc_hid_dim, act="tanh") + self._fc_prediction = Linear(input_dim = self.fc_hid_dim, + output_dim = self.class_dim, + act="softmax") + def forward(self, inputs, label=None): + emb = self.embedding(inputs) + o_np_mask = (inputs.numpy().reshape(-1,1) != self.dict_dim).astype('float32') + mask_emb = fluid.layers.expand( + to_variable(o_np_mask), [1, self.hid_dim]) + emb = emb * mask_emb + emb = fluid.layers.reshape( + emb, shape=[-1, self.seq_len, self.hid_dim]) + bow_1 = fluid.layers.reduce_sum(emb, dim=1) + bow_1 = fluid.layers.tanh(bow_1) + fc_1 = self._fc1(bow_1) + fc_2 = self._fc2(fc_1) + prediction = self._fc_prediction(fc_2) if label: cost = fluid.layers.cross_entropy(input=prediction, label=label) avg_cost = fluid.layers.mean(x=cost) acc = fluid.layers.accuracy(input=prediction, label=label) + return avg_cost, prediction, acc + else: + return prediction + +class GRU(fluid.dygraph.Layer): + def __init__(self, dict_dim, batch_size, seq_len): + super(GRU, self).__init__() + self.dict_dim = dict_dim + self.emb_dim = 128 + self.hid_dim = 128 + self.fc_hid_dim = 96 + self.class_dim = 2 + self.batch_size = batch_size + self.seq_len = seq_len + self.embedding = Embedding( + size=[self.dict_dim + 1, self.emb_dim], + dtype='float32', + param_attr=fluid.ParamAttr(learning_rate=30), + is_sparse=False) + h_0 = np.zeros((self.batch_size, self.hid_dim), dtype="float32") + h_0 = to_variable(h_0) + self._fc1 = Linear(input_dim = self.hid_dim, output_dim=self.hid_dim*3) + self._fc2 = Linear(input_dim=self.hid_dim, output_dim=self.fc_hid_dim, act="tanh") + self._fc_prediction = Linear(input_dim=self.fc_hid_dim, + output_dim=self.class_dim, + act="softmax") + self._gru = DynamicGRU( size= self.hid_dim, h_0=h_0) + def forward(self, inputs, label=None): + emb = self.embedding(inputs) + o_np_mask =to_variable(inputs.numpy().reshape(-1,1) != self.dict_dim).astype('float32') + mask_emb = fluid.layers.expand( + to_variable(o_np_mask), [1, self.hid_dim]) + emb = emb * mask_emb + emb = fluid.layers.reshape(emb, shape=[self.batch_size, -1, self.hid_dim]) + fc_1 = self._fc1(emb) + gru_hidden = self._gru(fc_1) + gru_hidden = fluid.layers.reduce_max(gru_hidden, dim=1) + tanh_1 = fluid.layers.tanh(gru_hidden) + fc_2 = self._fc2(tanh_1) + prediction = self._fc_prediction(fc_2) + if label: + cost = fluid.layers.cross_entropy(input=prediction, label=label) + avg_cost = fluid.layers.mean(x=cost) + acc = fluid.layers.accuracy(input=prediction, label=label) return avg_cost, prediction, acc else: return prediction + + +class BiGRU(fluid.dygraph.Layer): + def __init__(self, dict_dim, batch_size, seq_len): + super(BiGRU, self).__init__() + self.dict_dim = dict_dim + self.emb_dim = 128 + self.hid_dim = 128 + self.fc_hid_dim = 96 + self.class_dim = 2 + self.batch_size = batch_size + self.seq_len = seq_len + self.embedding = Embedding( + 
size=[self.dict_dim + 1, self.emb_dim], + dtype='float32', + param_attr=fluid.ParamAttr(learning_rate=30), + is_sparse=False) + h_0 = np.zeros((self.batch_size, self.hid_dim), dtype="float32") + h_0 = to_variable(h_0) + self._fc1 = Linear(input_dim = self.hid_dim, output_dim=self.hid_dim*3) + self._fc2 = Linear(input_dim = self.hid_dim*2, output_dim=self.fc_hid_dim, act="tanh") + self._fc_prediction = Linear(input_dim=self.fc_hid_dim, + output_dim=self.class_dim, + act="softmax") + self._gru_forward = DynamicGRU( size= self.hid_dim, h_0=h_0,is_reverse=False) + self._gru_backward = DynamicGRU( size= self.hid_dim, h_0=h_0,is_reverse=True) + + def forward(self, inputs, label=None): + emb = self.embedding(inputs) + o_np_mask =to_variable(inputs.numpy() .reshape(-1,1)!= self.dict_dim).astype('float32') + mask_emb = fluid.layers.expand( + to_variable(o_np_mask), [1, self.hid_dim]) + emb = emb * mask_emb + emb = fluid.layers.reshape(emb, shape=[self.batch_size, -1, self.hid_dim]) + fc_1 = self._fc1(emb) + gru_forward = self._gru_forward(fc_1) + gru_backward = self._gru_backward(fc_1) + gru_forward_tanh = fluid.layers.tanh(gru_forward) + gru_backward_tanh = fluid.layers.tanh(gru_backward) + encoded_vector = fluid.layers.concat( + input=[gru_forward_tanh, gru_backward_tanh], axis=2) + encoded_vector = fluid.layers.reduce_max(encoded_vector, dim=1) + fc_2 = self._fc2(encoded_vector) + prediction = self._fc_prediction(fc_2) + if label: + cost = fluid.layers.cross_entropy(input=prediction, label=label) + avg_cost = fluid.layers.mean(x=cost) + acc = fluid.layers.accuracy(input=prediction, label=label) + return avg_cost, prediction, acc + else: + return prediction \ No newline at end of file diff --git a/dygraph/seq2seq/README.md b/dygraph/seq2seq/README.md new file mode 100755 index 0000000000000000000000000000000000000000..94072c7fe5142038b348f26ba9564e6d5c8d445c --- /dev/null +++ b/dygraph/seq2seq/README.md @@ -0,0 +1,127 @@ +运行本目录下的范例模型需要安装PaddlePaddle Fluid 1.7版。如果您的 PaddlePaddle 安装版本低于此要求,请按照[安装文档](https://www.paddlepaddle.org.cn/#quick-start)中的说明更新 PaddlePaddle 安装版本。 + +# Sequence to Sequence (Seq2Seq) + +以下是本范例模型的简要目录结构及说明: + +``` +. 
+├── README.md # 文档,本文件 +├── args.py # 训练、预测以及模型参数配置程序 +├── reader.py # 数据读入程序 +├── download.py # 数据下载程序 +├── train.py # 训练主程序 +├── infer.py # 预测主程序 +├── run.sh # 默认配置的启动脚本 +├── infer.sh # 默认配置的解码脚本 +├── attention_model.py # 带注意力机制的翻译模型程序 +└── base_model.py # 无注意力机制的翻译模型程序 +``` + +## 简介 + +Sequence to Sequence (Seq2Seq),使用编码器-解码器(Encoder-Decoder)结构,用编码器将源序列编码成vector,再用解码器将该vector解码为目标序列。Seq2Seq 广泛应用于机器翻译,自动对话机器人,文档摘要自动生成,图片描述自动生成等任务中。 + +本目录包含Seq2Seq的一个经典样例:机器翻译,实现了一个base model(不带attention机制),一个带attention机制的翻译模型。Seq2Seq翻译模型,模拟了人类在进行翻译类任务时的行为:先解析源语言,理解其含义,再根据该含义来写出目标语言的语句。更多关于机器翻译的具体原理和数学表达式,我们推荐参考[深度学习101](http://paddlepaddle.org/documentation/docs/zh/1.2/beginners_guide/basics/machine_translation/index.html)。 + +**本目录旨在展示如何用Paddle Fluid 1.7的动态图接口实现一个标准的Seq2Seq模型** ,相同网络结构的静态图实现可以参照 [Seq2Seq](https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/PaddleTextGEN/seq2seq)。 + +## 模型概览 + +本模型中,在编码器方面,我们采用了基于LSTM的多层的RNN encoder;在解码器方面,我们使用了带注意力(Attention)机制的RNN decoder,并同时提供了一个不带注意力机制的解码器实现作为对比。在预测时我们使用柱搜索(beam search)算法来生成翻译的目标语句。以下将分别介绍用到的这些方法。 + +## 数据介绍 + +本教程使用[IWSLT'15 English-Vietnamese data ](https://nlp.stanford.edu/projects/nmt/)数据集中的英语到越南语的数据作为训练语料,tst2012的数据作为开发集,tst2013的数据作为测试集 + +### 数据获取 + +``` +python download.py +``` + +## 模型训练 + +`run.sh`包含训练程序的主函数,要使用默认参数开始训练,只需要简单地执行: + +``` +sh run.sh +``` + +默认使用带有注意力机制的RNN模型,可以通过修改 --attention 为False来训练不带注意力机制的RNN模型。 + +``` +python train.py \ + --src_lang en --tar_lang vi \ + --attention True \ + --num_layers 2 \ + --hidden_size 512 \ + --src_vocab_size 17191 \ + --tar_vocab_size 7709 \ + --batch_size 128 \ + --dropout 0.2 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --train_data_prefix data/en-vi/train \ + --eval_data_prefix data/en-vi/tst2012 \ + --test_data_prefix data/en-vi/tst2013 \ + --vocab_prefix data/en-vi/vocab \ + --use_gpu True \ + --model_path ./attention_models +``` + +训练程序会在每个epoch训练结束之后,save一次模型。 + +## 模型预测 + +当模型训练完成之后, 可以利用infer.sh的脚本进行预测,默认使用beam search的方法进行预测,加载第10个epoch的模型进行预测,对test的数据集进行解码 + +``` +sh infer.sh +``` + +如果想预测别的数据文件,只需要将 --infer_file参数进行修改。 + +``` +python infer.py \ + --attention True \ + --src_lang en --tar_lang vi \ + --num_layers 2 \ + --hidden_size 512 \ + --src_vocab_size 17191 \ + --tar_vocab_size 7709 \ + --batch_size 128 \ + --dropout 0.2 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --vocab_prefix data/en-vi/vocab \ + --infer_file data/en-vi/tst2013.en \ + --reload_model attention_models/epoch_10 \ + --infer_output_file attention_infer_output/infer_output.txt \ + --beam_size 10 \ + --use_gpu True +``` + +## 效果评价 + +使用 [*multi-bleu.perl*](https://github.com/moses-smt/mosesdecoder.git) 工具来评价模型预测的翻译质量,使用方法如下: + +```sh +mosesdecoder/scripts/generic/multi-bleu.perl tst2013.vi < infer_output.txt +``` + +每个模型分别训练了10次,单次取第10个epoch保存的模型进行预测,取beam_size=10。效果如下(为了便于观察,对10次结果按照升序进行了排序): + +``` +> no attention +tst2012 BLEU: +[10.75 10.85 10.9 10.94 10.97 11.01 11.01 11.04 11.13 11.4] +tst2013 BLEU: +[10.71 10.71 10.74 10.76 10.91 10.94 11.02 11.16 11.21 11.44] + +> with attention +tst2012 BLEU: +[21.14 22.34 22.54 22.65 22.71 22.71 23.08 23.15 23.3 23.4] +tst2013 BLEU: +[23.41 24.79 25.11 25.12 25.19 25.24 25.39 25.61 25.61 25.63] +``` diff --git a/dygraph/seq2seq/__init__.py b/dygraph/seq2seq/__init__.py new file mode 100755 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/dygraph/seq2seq/args.py b/dygraph/seq2seq/args.py new file mode 100755 index 0000000000000000000000000000000000000000..99f21b0800d9a2696e245fc807b393308a98e09a --- 
/dev/null +++ b/dygraph/seq2seq/args.py @@ -0,0 +1,132 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import argparse +import distutils.util + + +def parse_args(): + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument( + "--train_data_prefix", type=str, help="file prefix for train data") + parser.add_argument( + "--eval_data_prefix", type=str, help="file prefix for eval data") + parser.add_argument( + "--test_data_prefix", type=str, help="file prefix for test data") + parser.add_argument( + "--vocab_prefix", type=str, help="file prefix for vocab") + parser.add_argument("--src_lang", type=str, help="source language suffix") + parser.add_argument("--tar_lang", type=str, help="target language suffix") + + parser.add_argument( + "--attention", + type=eval, + default=False, + help="Whether use attention model") + + parser.add_argument( + "--optimizer", + type=str, + default='adam', + help="optimizer to use, only supprt[sgd|adam]") + + parser.add_argument( + "--learning_rate", + type=float, + default=0.001, + help="learning rate for optimizer") + + parser.add_argument( + "--num_layers", + type=int, + default=1, + help="layers number of encoder and decoder") + parser.add_argument( + "--hidden_size", + type=int, + default=100, + help="hidden size of encoder and decoder") + parser.add_argument("--src_vocab_size", type=int, help="source vocab size") + parser.add_argument("--tar_vocab_size", type=int, help="target vocab size") + + parser.add_argument( + "--batch_size", type=int, help="batch size of each step") + + parser.add_argument( + "--max_epoch", type=int, default=12, help="max epoch for the training") + + parser.add_argument( + "--max_len", + type=int, + default=50, + help="max length for source and target sentence") + parser.add_argument( + "--dropout", type=float, default=0.0, help="drop probability") + parser.add_argument( + "--init_scale", + type=float, + default=0.0, + help="init scale for parameter") + parser.add_argument( + "--max_grad_norm", + type=float, + default=5.0, + help="max grad norm for global norm clip") + + parser.add_argument( + "--model_path", + type=str, + default='model', + help="model path for model to save") + + parser.add_argument( + "--reload_model", type=str, help="reload model to inference") + + parser.add_argument( + "--infer_file", type=str, help="file name for inference") + parser.add_argument( + "--infer_output_file", + type=str, + default='infer_output', + help="file name for inference output") + parser.add_argument( + "--beam_size", type=int, default=10, help="file name for inference") + + parser.add_argument( + '--use_gpu', + type=eval, + default=False, + help='Whether using gpu [True|False]') + + parser.add_argument( + "--enable_ce", + action='store_true', + help="The flag indicating whether to run the task " + "for continuous 
evaluation.") + + parser.add_argument( + "--profile", action='store_true', help="Whether enable the profile.") + # NOTE: profiler args, used for benchmark + parser.add_argument( + "--profiler_path", + type=str, + default='./seq2seq.profile', + help="the profiler output file path. (used for benchmark)") + args = parser.parse_args() + return args diff --git a/dygraph/seq2seq/attention_model.py b/dygraph/seq2seq/attention_model.py new file mode 100755 index 0000000000000000000000000000000000000000..4ab8d6e8cbf83740c10c0d9449ea4f5e201d8a06 --- /dev/null +++ b/dygraph/seq2seq/attention_model.py @@ -0,0 +1,361 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import paddle.fluid as fluid +import numpy as np +from paddle.fluid import ParamAttr +from paddle.fluid.dygraph.base import to_variable +from paddle.fluid.dygraph.nn import Embedding +from rnn import BasicLSTMUnit +import numpy as np + +INF = 1. * 1e5 +alpha = 0.6 +uniform_initializer = lambda x: fluid.initializer.UniformInitializer(low=-x, high=x) +zero_constant = fluid.initializer.Constant(0.0) + +class AttentionModel(fluid.dygraph.Layer): + def __init__(self, + hidden_size, + src_vocab_size, + tar_vocab_size, + batch_size, + num_layers=1, + init_scale=0.1, + dropout=None, + beam_size=1, + beam_start_token=1, + beam_end_token=2, + beam_max_step_num=100, + mode='train'): + super(AttentionModel, self).__init__() + self.hidden_size = hidden_size + self.src_vocab_size = src_vocab_size + self.tar_vocab_size = tar_vocab_size + self.batch_size = batch_size + self.num_layers = num_layers + self.init_scale = init_scale + self.dropout = dropout + self.beam_size = beam_size + self.beam_start_token = beam_start_token + self.beam_end_token = beam_end_token + self.beam_max_step_num = beam_max_step_num + self.mode = mode + self.kinf = 1e9 + + param_attr = ParamAttr(initializer=uniform_initializer(self.init_scale)) + bias_attr = ParamAttr(initializer=zero_constant) + forget_bias = 1.0 + + self.src_embeder = Embedding( + size=[self.src_vocab_size, self.hidden_size], + param_attr=fluid.ParamAttr( + name='source_embedding', + initializer=uniform_initializer(init_scale))) + + self.tar_embeder = Embedding( + size=[self.tar_vocab_size, self.hidden_size], + is_sparse=False, + param_attr=fluid.ParamAttr( + name='target_embedding', + initializer=uniform_initializer(init_scale))) + + self.enc_units = [] + for i in range(num_layers): + self.enc_units.append( + self.add_sublayer("enc_units_%d" % i, + BasicLSTMUnit( + hidden_size=self.hidden_size, + input_size=self.hidden_size, + param_attr=param_attr, + bias_attr=bias_attr, + forget_bias=forget_bias))) + + self.dec_units = [] + for i in range(num_layers): + if i == 0: + self.dec_units.append( + self.add_sublayer("dec_units_%d" % i, + BasicLSTMUnit( + hidden_size=self.hidden_size, + input_size=self.hidden_size * 2, + 
param_attr=param_attr, + bias_attr=bias_attr, + forget_bias=forget_bias))) + else: + self.dec_units.append( + self.add_sublayer("dec_units_%d" % i, + BasicLSTMUnit( + hidden_size=self.hidden_size, + input_size=self.hidden_size, + param_attr=param_attr, + bias_attr=bias_attr, + forget_bias=forget_bias))) + + self.fc = fluid.dygraph.nn.Linear(self.hidden_size, + self.tar_vocab_size, + param_attr=param_attr, + bias_attr=False) + + self.attn_fc = fluid.dygraph.nn.Linear(self.hidden_size, + self.hidden_size, + param_attr=param_attr, + bias_attr=False) + + self.concat_fc = fluid.dygraph.nn.Linear(2 * self.hidden_size, + self.hidden_size, + param_attr=param_attr, + bias_attr=False) + + def _transpose_batch_time(self, x): + return fluid.layers.transpose(x, [1, 0] + list(range(2, len(x.shape)))) + + def _merge_batch_beams(self, x): + return fluid.layers.reshape(x, shape=(-1,x.shape[2])) + + def tile_beam_merge_with_batch(self, x): + x = fluid.layers.unsqueeze(x, [1]) # [batch_size, 1, ...] + expand_times = [1] * len(x.shape) + expand_times[1] = self.beam_size + x = fluid.layers.expand(x, expand_times) # [batch_size, beam_size, ...] + x = fluid.layers.transpose(x, list(range(2, len(x.shape))) + + [0, 1]) # [..., batch_size, beam_size] + # use 0 to copy to avoid wrong shape + x = fluid.layers.reshape( + x, shape=[0] * + (len(x.shape) - 2) + [-1]) # [..., batch_size * beam_size] + x = fluid.layers.transpose( + x, [len(x.shape) - 1] + + list(range(0, len(x.shape) - 1))) # [batch_size * beam_size, ...] + return x + + def _split_batch_beams(self, x): + return fluid.layers.reshape(x, shape=(-1, self.beam_size, x.shape[1])) + + def _expand_to_beam_size(self, x): + x = fluid.layers.unsqueeze(x, [1]) + expand_times = [1] * len(x.shape) + expand_times[1] = self.beam_size + x = fluid.layers.expand(x, expand_times) + return x + + def _real_state(self, state, new_state, step_mask): + new_state = fluid.layers.elementwise_mul(new_state, step_mask, axis=0) - \ + fluid.layers.elementwise_mul(state, (step_mask - 1), axis=0) + return new_state + + def _gather(self, x, indices, batch_pos): + topk_coordinates = fluid.layers.stack([batch_pos, indices], axis=2) + return fluid.layers.gather_nd(x, topk_coordinates) + + def attention(self, query, enc_output, mask=None): + query = fluid.layers.unsqueeze(query, [1]) + memory = self.attn_fc(enc_output) + attn = fluid.layers.matmul(query, memory, transpose_y=True) + + if mask: + attn = fluid.layers.transpose(attn, [1, 0, 2]) + attn = fluid.layers.elementwise_add(attn, mask * 1000000000, -1) + attn = fluid.layers.transpose(attn, [1, 0, 2]) + weight = fluid.layers.softmax(attn) + weight_memory = fluid.layers.matmul(weight, memory) + + return weight_memory + + def forward(self, inputs): + inputs = [fluid.dygraph.to_variable(np_inp) for np_inp in inputs] + src, tar, label, src_sequence_length, tar_sequence_length = inputs + if src.shape[0] < self.batch_size: + self.batch_size = src.shape[0] + src_emb = self.src_embeder(self._transpose_batch_time(src)) + + enc_hidden = to_variable(np.zeros((self.num_layers, self.batch_size, self.hidden_size), dtype='float32')) + enc_cell = to_variable(np.zeros((self.num_layers, self.batch_size, self.hidden_size), dtype='float32')) + + max_seq_len = src_emb.shape[0] + enc_len_mask = fluid.layers.sequence_mask(src_sequence_length, maxlen=max_seq_len, dtype="float32") + enc_padding_mask = (enc_len_mask - 1.0) + enc_len_mask = fluid.layers.transpose(enc_len_mask, [1, 0]) + enc_states = [[enc_hidden, enc_cell]] + enc_outputs = [] + for l in 
range(max_seq_len): + step_input = src_emb[l] + step_mask = enc_len_mask[l] + enc_hidden, enc_cell = enc_states[l] + new_enc_hidden, new_enc_cell = [], [] + for i in range(self.num_layers): + new_hidden, new_cell = self.enc_units[i](step_input, enc_hidden[i], enc_cell[i]) + new_enc_hidden.append(new_hidden) + new_enc_cell.append(new_cell) + if self.dropout != None and self.dropout > 0.0: + step_input = fluid.layers.dropout( + new_hidden, + dropout_prob=self.dropout, + dropout_implementation='upscale_in_train') + else: + step_input = new_hidden + new_enc_hidden = [self._real_state(enc_hidden[i], new_enc_hidden[i], step_mask) for i in range(self.num_layers)] + new_enc_cell = [self._real_state(enc_cell[i], new_enc_cell[i], step_mask) for i in range(self.num_layers)] + enc_states.append([new_enc_hidden, new_enc_cell]) + enc_outputs.append(step_input) + enc_outputs = fluid.layers.stack(enc_outputs) + enc_outputs = self._transpose_batch_time(enc_outputs) + + if self.mode in ['train', 'eval']: + # calculation with input_feed derives from paper: https://arxiv.org/pdf/1508.04025.pdf + input_feed = to_variable(np.zeros((self.batch_size, self.hidden_size), dtype='float32')) + + dec_hidden, dec_cell = enc_states[-1] + tar_emb = self.tar_embeder(self._transpose_batch_time(tar)) + max_seq_len = tar_emb.shape[0] + dec_output = [] + + for step_idx in range(max_seq_len): + step_input = tar_emb[step_idx] + step_input = fluid.layers.concat([step_input, input_feed], 1) + new_dec_hidden, new_dec_cell = [], [] + for i in range(self.num_layers): + new_hidden, new_cell = self.dec_units[i](step_input, dec_hidden[i], dec_cell[i]) + + new_dec_hidden.append(new_hidden) + new_dec_cell.append(new_cell) + if self.dropout != None and self.dropout > 0.0: + step_input = fluid.layers.dropout( + new_hidden, + dropout_prob=self.dropout, + dropout_implementation='upscale_in_train') + else: + step_input = new_hidden + + dec_att = self.attention(step_input, enc_outputs, enc_padding_mask) + dec_att = fluid.layers.squeeze(dec_att, [1]) + concat_att_out = fluid.layers.concat([dec_att, step_input], 1) + out = self.concat_fc(concat_att_out) + input_feed = out + dec_output.append(out) + dec_hidden, dec_cell = new_dec_hidden, new_dec_cell + + dec_output = fluid.layers.stack(dec_output) + dec_output = self.fc(self._transpose_batch_time(dec_output)) + + loss = fluid.layers.softmax_with_cross_entropy( + logits=dec_output, label=label, soft_label=False) + loss = fluid.layers.squeeze(loss, axes=[2]) + max_tar_seq_len = fluid.layers.shape(tar)[1] + tar_mask = fluid.layers.sequence_mask( + tar_sequence_length, maxlen=max_tar_seq_len, dtype='float32') + loss = loss * tar_mask + loss = fluid.layers.reduce_mean(loss, dim=[0]) + loss = fluid.layers.reduce_sum(loss) + return loss + + elif self.mode in ['beam_search']: + enc_outputs = self.tile_beam_merge_with_batch(enc_outputs) + enc_padding_mask = self.tile_beam_merge_with_batch(enc_padding_mask) + batch_beam_shape = (self.batch_size, self.beam_size) + vocab_size_tensor = to_variable(np.full((1), self.tar_vocab_size)) + start_token_tensor = to_variable(np.full(batch_beam_shape, self.beam_start_token, dtype='int64')) + end_token_tensor = to_variable(np.full(batch_beam_shape, self.beam_end_token, dtype='int64')) + step_input = self.tar_embeder(start_token_tensor) + input_feed = to_variable(np.zeros((self.batch_size, self.hidden_size), dtype='float32')) + input_feed = self._expand_to_beam_size(input_feed) + input_feed = self._merge_batch_beams(input_feed) + beam_finished = 
to_variable(np.full(batch_beam_shape, 0, dtype='float32')) + beam_state_log_probs = to_variable(np.array([[0.] + [-self.kinf] * (self.beam_size - 1)], dtype="float32")) + beam_state_log_probs = fluid.layers.expand(beam_state_log_probs, [self.batch_size, 1]) + + dec_hidden, dec_cell = enc_states[-1] + dec_hidden = [self._expand_to_beam_size(state) for state in dec_hidden] + dec_cell = [self._expand_to_beam_size(state) for state in dec_cell] + + batch_pos = fluid.layers.expand( + fluid.layers.unsqueeze(to_variable(np.arange(0, self.batch_size, 1, dtype="int64")), [1]), + [1, self.beam_size]) + predicted_ids = [] + parent_ids = [] + + for step_idx in range(self.beam_max_step_num): + if fluid.layers.reduce_sum(1 - beam_finished).numpy()[0] == 0: + break + step_input = self._merge_batch_beams(step_input) + step_input = fluid.layers.concat([step_input, input_feed], 1) + new_dec_hidden, new_dec_cell = [], [] + dec_hidden = [self._merge_batch_beams(state) for state in dec_hidden] + dec_cell = [self._merge_batch_beams(state) for state in dec_cell] + + for i in range(self.num_layers): + new_hidden, new_cell = self.dec_units[i](step_input, dec_hidden[i], dec_cell[i]) + new_dec_hidden.append(new_hidden) + new_dec_cell.append(new_cell) + if self.dropout != None and self.dropout > 0.0: + step_input = fluid.layers.dropout( + new_hidden, + dropout_prob=self.dropout, + dropout_implementation='upscale_in_train') + else: + step_input = new_hidden + dec_att = self.attention(step_input, enc_outputs, enc_padding_mask) + dec_att = fluid.layers.squeeze(dec_att, [1]) + concat_att_out = fluid.layers.concat([dec_att, step_input], 1) + out = self.concat_fc(concat_att_out) + input_feed = out + cell_outputs = self._split_batch_beams(out) + cell_outputs = self.fc(cell_outputs) + step_log_probs = fluid.layers.log(fluid.layers.softmax(cell_outputs)) + noend_array = [-self.kinf] * self.tar_vocab_size + noend_array[self.beam_end_token] = 0 # [-kinf, -kinf, ..., 0, -kinf, ...] 
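+                # NOTE (supplementary comment): noend_mask_tensor, built from noend_array on the next
+                # line, assigns log prob 0 only to <eos> (beam_end_token) and -kinf elsewhere; combined
+                # with beam_finished below, it keeps already-finished beams emitting <eos> with an
+                # unchanged accumulated score during the remaining beam search steps.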
+ noend_mask_tensor = to_variable(np.array(noend_array,dtype='float32')) + # set finished position to one-hot probability of + step_log_probs = fluid.layers.elementwise_mul( + fluid.layers.expand(fluid.layers.unsqueeze(beam_finished, [2]), [1, 1, self.tar_vocab_size]), + noend_mask_tensor, axis=-1) - \ + fluid.layers.elementwise_mul(step_log_probs, (beam_finished - 1), axis=0) + log_probs = fluid.layers.elementwise_add( + x=step_log_probs, y=beam_state_log_probs, axis=0) + scores = fluid.layers.reshape(log_probs, [-1, self.beam_size * self.tar_vocab_size]) + topk_scores, topk_indices = fluid.layers.topk(input=scores, k=self.beam_size) + beam_indices = fluid.layers.elementwise_floordiv(topk_indices, vocab_size_tensor) # in which beam + token_indices = fluid.layers.elementwise_mod(topk_indices, vocab_size_tensor) # position in beam + next_log_probs = self._gather(scores, topk_indices, batch_pos) # + + new_dec_hidden = [self._split_batch_beams(state) for state in new_dec_hidden] + new_dec_cell = [self._split_batch_beams(state) for state in new_dec_cell] + new_dec_hidden = [self._gather(x, beam_indices, batch_pos) for x in new_dec_hidden] + new_dec_cell = [self._gather(x, beam_indices, batch_pos) for x in new_dec_cell] + + next_finished = self._gather(beam_finished, beam_indices, batch_pos) + next_finished = fluid.layers.cast(next_finished, "bool") + next_finished = fluid.layers.logical_or(next_finished, fluid.layers.equal(token_indices, end_token_tensor)) + next_finished = fluid.layers.cast(next_finished, "float32") + # prepare for next step + dec_hidden, dec_cell = new_dec_hidden, new_dec_cell + beam_finished = next_finished + beam_state_log_probs = next_log_probs + step_input = self.tar_embeder(token_indices) + predicted_ids.append(token_indices) + parent_ids.append(beam_indices) + + predicted_ids = fluid.layers.stack(predicted_ids) + parent_ids = fluid.layers.stack(parent_ids) + predicted_ids = fluid.layers.gather_tree(predicted_ids, parent_ids) + predicted_ids = self._transpose_batch_time(predicted_ids) + return predicted_ids + else: + print("not support mode ", self.mode) + raise Exception("not support mode: " + self.mode) diff --git a/dygraph/seq2seq/base_model.py b/dygraph/seq2seq/base_model.py new file mode 100755 index 0000000000000000000000000000000000000000..6319223502cebd32369b65b8ae60d6569b9ad48a --- /dev/null +++ b/dygraph/seq2seq/base_model.py @@ -0,0 +1,285 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import paddle.fluid as fluid +import numpy as np +from paddle.fluid import ParamAttr +from paddle.fluid.dygraph import to_variable +from paddle.fluid.dygraph.nn import Embedding, Linear +from rnn import BasicLSTMUnit +import numpy as np + +INF = 1. 
* 1e5 +alpha = 0.6 +uniform_initializer = lambda x: fluid.initializer.UniformInitializer(low=-x, high=x) +zero_constant = fluid.initializer.Constant(0.0) + +class BaseModel(fluid.dygraph.Layer): + def __init__(self, + hidden_size, + src_vocab_size, + tar_vocab_size, + batch_size, + num_layers=1, + init_scale=0.1, + dropout=None, + beam_size=1, + beam_start_token=1, + beam_end_token=2, + beam_max_step_num=100, + mode='train'): + super(BaseModel, self).__init__() + self.hidden_size = hidden_size + self.src_vocab_size = src_vocab_size + self.tar_vocab_size = tar_vocab_size + self.batch_size = batch_size + self.num_layers = num_layers + self.init_scale = init_scale + self.dropout = dropout + self.beam_size = beam_size + self.beam_start_token = beam_start_token + self.beam_end_token = beam_end_token + self.beam_max_step_num = beam_max_step_num + self.mode = mode + self.kinf = 1e9 + + param_attr = ParamAttr(initializer=uniform_initializer(self.init_scale)) + bias_attr = ParamAttr(initializer=zero_constant) + forget_bias = 1.0 + + self.src_embeder = Embedding( + size=[self.src_vocab_size, self.hidden_size], + param_attr=fluid.ParamAttr( + initializer=uniform_initializer(init_scale))) + + self.tar_embeder = Embedding( + size=[self.tar_vocab_size, self.hidden_size], + is_sparse=False, + param_attr=fluid.ParamAttr( + initializer=uniform_initializer(init_scale))) + + self.enc_units = [] + for i in range(num_layers): + self.enc_units.append( + self.add_sublayer("enc_units_%d" % i, + BasicLSTMUnit( + hidden_size=self.hidden_size, + input_size=self.hidden_size, + param_attr=param_attr, + bias_attr=bias_attr, + forget_bias=forget_bias))) + + self.dec_units = [] + for i in range(num_layers): + self.dec_units.append( + self.add_sublayer("dec_units_%d" % i, + BasicLSTMUnit( + hidden_size=self.hidden_size, + input_size=self.hidden_size, + param_attr=param_attr, + bias_attr=bias_attr, + forget_bias=forget_bias))) + + self.fc = fluid.dygraph.nn.Linear(self.hidden_size, + self.tar_vocab_size, + param_attr=param_attr, + bias_attr=False) + + def _transpose_batch_time(self, x): + return fluid.layers.transpose(x, [1, 0] + list(range(2, len(x.shape)))) + + def _merge_batch_beams(self, x): + return fluid.layers.reshape(x, shape=(-1,x.shape[2])) + + def _split_batch_beams(self, x): + return fluid.layers.reshape(x, shape=(-1, self.beam_size, x.shape[1])) + + def _expand_to_beam_size(self, x): + x = fluid.layers.unsqueeze(x, [1]) + expand_times = [1] * len(x.shape) + expand_times[1] = self.beam_size + x = fluid.layers.expand(x, expand_times) + return x + + def _real_state(self, state, new_state, step_mask): + new_state = fluid.layers.elementwise_mul(new_state, step_mask, axis=0) - \ + fluid.layers.elementwise_mul(state, (step_mask - 1), axis=0) + return new_state + + def _gather(self, x, indices, batch_pos): + topk_coordinates = fluid.layers.stack([batch_pos, indices], axis=2) + return fluid.layers.gather_nd(x, topk_coordinates) + + def forward(self, inputs): + #inputs[0] = np.expand_dims(inputs[0], axis=-1) + #inputs[1] = np.expand_dims(inputs[1], axis=-1) + inputs = [fluid.dygraph.to_variable(np_inp) for np_inp in inputs] + src, tar, label, src_sequence_length, tar_sequence_length = inputs + if src.shape[0] < self.batch_size: + self.batch_size = src.shape[0] + src_emb = self.src_embeder(self._transpose_batch_time(src)) + + enc_hidden = to_variable(np.zeros((self.num_layers, self.batch_size, self.hidden_size), dtype='float32')) + enc_cell = to_variable(np.zeros((self.num_layers, self.batch_size, self.hidden_size), 
dtype='float32')) + + max_seq_len = src_emb.shape[0] + enc_len_mask = fluid.layers.sequence_mask(src_sequence_length, maxlen=max_seq_len, dtype="float32") + enc_len_mask = fluid.layers.transpose(enc_len_mask, [1, 0]) + enc_states = [[enc_hidden, enc_cell]] + for l in range(max_seq_len): + step_input = src_emb[l] + step_mask = enc_len_mask[l] + enc_hidden, enc_cell = enc_states[l] + new_enc_hidden, new_enc_cell = [], [] + for i in range(self.num_layers): + new_hidden, new_cell = self.enc_units[i](step_input, enc_hidden[i], enc_cell[i]) + new_enc_hidden.append(new_hidden) + new_enc_cell.append(new_cell) + if self.dropout != None and self.dropout > 0.0: + step_input = fluid.layers.dropout( + new_hidden, + dropout_prob=self.dropout, + dropout_implementation='upscale_in_train') + else: + step_input = new_hidden + new_enc_hidden = [self._real_state(enc_hidden[i], new_enc_hidden[i], step_mask) for i in range(self.num_layers)] + new_enc_cell = [self._real_state(enc_cell[i], new_enc_cell[i], step_mask) for i in range(self.num_layers)] + enc_states.append([new_enc_hidden, new_enc_cell]) + + if self.mode in ['train', 'eval']: + dec_hidden, dec_cell = enc_states[-1] + tar_emb = self.tar_embeder(self._transpose_batch_time(tar)) + max_seq_len = tar_emb.shape[0] + dec_output = [] + + for step_idx in range(max_seq_len): + step_input = tar_emb[step_idx] + new_dec_hidden, new_dec_cell = [], [] + for i in range(self.num_layers): + new_hidden, new_cell = self.dec_units[i](step_input, dec_hidden[i], dec_cell[i]) + new_dec_hidden.append(new_hidden) + new_dec_cell.append(new_cell) + if self.dropout != None and self.dropout > 0.0: + step_input = fluid.layers.dropout( + new_hidden, + dropout_prob=self.dropout, + dropout_implementation='upscale_in_train') + else: + step_input = new_hidden + dec_output.append(step_input) + dec_hidden, dec_cell = new_dec_hidden, new_dec_cell + + dec_output = fluid.layers.stack(dec_output) + dec_output = self.fc(self._transpose_batch_time(dec_output)) + + loss = fluid.layers.softmax_with_cross_entropy( + logits=dec_output, label=label, soft_label=False) + loss = fluid.layers.squeeze(loss, axes=[2]) + max_tar_seq_len = fluid.layers.shape(tar)[1] + tar_mask = fluid.layers.sequence_mask( + tar_sequence_length, maxlen=max_tar_seq_len, dtype='float32') + loss = loss * tar_mask + loss = fluid.layers.reduce_mean(loss, dim=[0]) + loss = fluid.layers.reduce_sum(loss) + return loss + elif self.mode in ['beam_search']: + batch_beam_shape = (self.batch_size, self.beam_size) + #batch_beam_shape_1 = (self.batch_size, self.beam_size, 1) + vocab_size_tensor = to_variable(np.full((1), self.tar_vocab_size)) + start_token_tensor = to_variable(np.full(batch_beam_shape, self.beam_start_token, dtype='int64')) + end_token_tensor = to_variable(np.full(batch_beam_shape, self.beam_end_token, dtype='int64')) + step_input = self.tar_embeder(start_token_tensor) + beam_finished = to_variable(np.full(batch_beam_shape, 0, dtype='float32')) + beam_state_log_probs = to_variable(np.array([[0.] 
+ [-self.kinf] * (self.beam_size - 1)], dtype="float32")) + beam_state_log_probs = fluid.layers.expand(beam_state_log_probs, [self.batch_size, 1]) + + dec_hidden, dec_cell = enc_states[-1] + dec_hidden = [self._expand_to_beam_size(state) for state in dec_hidden] + dec_cell = [self._expand_to_beam_size(state) for state in dec_cell] + + batch_pos = fluid.layers.expand( + fluid.layers.unsqueeze(to_variable(np.arange(0, self.batch_size, 1, dtype="int64")), [1]), + [1, self.beam_size]) + predicted_ids = [] + parent_ids = [] + + for step_idx in range(self.beam_max_step_num): + if fluid.layers.reduce_sum(1 - beam_finished).numpy()[0] == 0: + break + step_input = self._merge_batch_beams(step_input) + new_dec_hidden, new_dec_cell = [], [] + dec_hidden = [self._merge_batch_beams(state) for state in dec_hidden] + dec_cell = [self._merge_batch_beams(state) for state in dec_cell] + + for i in range(self.num_layers): + new_hidden, new_cell = self.dec_units[i](step_input, dec_hidden[i], dec_cell[i]) + new_dec_hidden.append(new_hidden) + new_dec_cell.append(new_cell) + if self.dropout != None and self.dropout > 0.0: + step_input = fluid.layers.dropout( + new_hidden, + dropout_prob=self.dropout, + dropout_implementation='upscale_in_train') + else: + step_input = new_hidden + cell_outputs = self._split_batch_beams(step_input) + cell_outputs = self.fc(cell_outputs) + # Beam_search_step: + step_log_probs = fluid.layers.log(fluid.layers.softmax(cell_outputs)) + noend_array = [-self.kinf] * self.tar_vocab_size + noend_array[self.beam_end_token] = 0 # [-kinf, -kinf, ..., 0, -kinf, ...] + noend_mask_tensor = to_variable(np.array(noend_array,dtype='float32')) + # set finished position to one-hot probability of + step_log_probs = fluid.layers.elementwise_mul( + fluid.layers.expand(fluid.layers.unsqueeze(beam_finished, [2]), [1, 1, self.tar_vocab_size]), + noend_mask_tensor, axis=-1) - \ + fluid.layers.elementwise_mul(step_log_probs, (beam_finished - 1), axis=0) + log_probs = fluid.layers.elementwise_add( + x=step_log_probs, y=beam_state_log_probs, axis=0) + scores = fluid.layers.reshape(log_probs, [-1, self.beam_size * self.tar_vocab_size]) + topk_scores, topk_indices = fluid.layers.topk(input=scores, k=self.beam_size) + beam_indices = fluid.layers.elementwise_floordiv(topk_indices, vocab_size_tensor) # in which beam + token_indices = fluid.layers.elementwise_mod(topk_indices, vocab_size_tensor) # position in beam + next_log_probs = self._gather(scores, topk_indices, batch_pos) # + + new_dec_hidden = [self._split_batch_beams(state) for state in new_dec_hidden] + new_dec_cell = [self._split_batch_beams(state) for state in new_dec_cell] + new_dec_hidden = [self._gather(x, beam_indices, batch_pos) for x in new_dec_hidden] + new_dec_cell = [self._gather(x, beam_indices, batch_pos) for x in new_dec_cell] + + next_finished = self._gather(beam_finished, beam_indices, batch_pos) + next_finished = fluid.layers.cast(next_finished, "bool") + next_finished = fluid.layers.logical_or(next_finished, fluid.layers.equal(token_indices, end_token_tensor)) + next_finished = fluid.layers.cast(next_finished, "float32") + # prepare for next step + dec_hidden, dec_cell = new_dec_hidden, new_dec_cell + beam_finished = next_finished + beam_state_log_probs = next_log_probs + step_input = self.tar_embeder(token_indices) # remove unsqueeze in v1.7 + predicted_ids.append(token_indices) + parent_ids.append(beam_indices) + + predicted_ids = fluid.layers.stack(predicted_ids) + parent_ids = fluid.layers.stack(parent_ids) + predicted_ids = 
fluid.layers.gather_tree(predicted_ids, parent_ids) + predicted_ids = self._transpose_batch_time(predicted_ids) + return predicted_ids + else: + print("not support mode ", self.mode) + raise Exception("not support mode: " + self.mode) diff --git a/dygraph/seq2seq/download.py b/dygraph/seq2seq/download.py new file mode 100755 index 0000000000000000000000000000000000000000..4dd1466d25bf16d7b4fc9bc3819fff8fe12f7adf --- /dev/null +++ b/dygraph/seq2seq/download.py @@ -0,0 +1,55 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the 'License'); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an 'AS IS' BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +''' +Script for downloading training data. +''' +import os +import urllib +import sys + +if sys.version_info >= (3, 0): + import urllib.request +import zipfile + +URLLIB = urllib +if sys.version_info >= (3, 0): + URLLIB = urllib.request + +remote_path = 'https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi' +base_path = 'data' +tar_path = os.path.join(base_path, 'en-vi') +filenames = [ + 'train.en', 'train.vi', 'tst2012.en', 'tst2012.vi', 'tst2013.en', + 'tst2013.vi', 'vocab.en', 'vocab.vi' +] + + +def main(arguments): + print("Downloading data......") + + if not os.path.exists(tar_path): + if not os.path.exists(base_path): + os.mkdir(base_path) + os.mkdir(tar_path) + + for filename in filenames: + url = remote_path + '/' + filename + tar_file = os.path.join(tar_path, filename) + URLLIB.urlretrieve(url, tar_file) + print("Downloaded sucess......") + + +if __name__ == '__main__': + sys.exit(main(sys.argv[1:])) diff --git a/dygraph/seq2seq/infer.py b/dygraph/seq2seq/infer.py new file mode 100755 index 0000000000000000000000000000000000000000..43fbf965ce949e54fa0f0c3e69470533b3a0ab83 --- /dev/null +++ b/dygraph/seq2seq/infer.py @@ -0,0 +1,168 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np +import time +import os +import random +import logging +import math +import io +import paddle +import paddle.fluid as fluid + +import reader + +import sys +line_tok = '\n' +space_tok = ' ' +if sys.version[0] == '2': + reload(sys) + sys.setdefaultencoding("utf-8") + line_tok = u'\n' + space_tok = u' ' + +logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s') +logger = logging.getLogger("fluid") +logger.setLevel(logging.INFO) + +from args import * +import logging +import pickle + +from attention_model import AttentionModel +from base_model import BaseModel + + +def infer(): + args = parse_args() + + num_layers = args.num_layers + src_vocab_size = args.src_vocab_size + tar_vocab_size = args.tar_vocab_size + batch_size = args.batch_size + dropout = args.dropout + init_scale = args.init_scale + max_grad_norm = args.max_grad_norm + hidden_size = args.hidden_size + # inference process + + print("src", src_vocab_size) + place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace() + with fluid.dygraph.guard(place): + # dropout type using upscale_in_train, dropout can be remove in inferecen + # So we can set dropout to 0 + if args.attention: + model = AttentionModel( + hidden_size, + src_vocab_size, + tar_vocab_size, + batch_size, + beam_size = args.beam_size, + num_layers=num_layers, + init_scale=init_scale, + dropout=0.0, + mode='beam_search') + else: + model = BaseModel( + hidden_size, + src_vocab_size, + tar_vocab_size, + batch_size, + beam_size = args.beam_size, + num_layers=num_layers, + init_scale=init_scale, + dropout=0.0, + mode='beam_search') + + source_vocab_file = args.vocab_prefix + "." + args.src_lang + infer_file = args.infer_file + + infer_data = reader.raw_mono_data(source_vocab_file, infer_file) + + def prepare_input(batch, epoch_id=0): + src_ids, src_mask, tar_ids, tar_mask = batch + res = {} + src_ids = src_ids.reshape((src_ids.shape[0], src_ids.shape[1])) + in_tar = tar_ids[:, :-1] + label_tar = tar_ids[:, 1:] + + in_tar = in_tar.reshape((in_tar.shape[0], in_tar.shape[1])) + label_tar = label_tar.reshape( + (label_tar.shape[0], label_tar.shape[1], 1)) + inputs = [src_ids, in_tar, label_tar, src_mask, tar_mask] + return inputs, np.sum(tar_mask) + + dir_name = args.reload_model + print("dir name", dir_name) + state_dict, _ = fluid.dygraph.load_dygraph(dir_name) + model.set_dict(state_dict) + model.eval() + + train_data_iter = reader.get_data_iter(infer_data, batch_size, mode='infer') + + tar_id2vocab = [] + tar_vocab_file = args.vocab_prefix + "." + args.tar_lang + with io.open(tar_vocab_file, "r", encoding='utf-8') as f: + for line in f.readlines(): + tar_id2vocab.append(line.strip()) + + infer_output_file = args.infer_output_file + infer_output_dir = infer_output_file.split('/')[0] + if not os.path.exists(infer_output_dir): + os.mkdir(infer_output_dir) + + with io.open(infer_output_file, 'w', encoding='utf-8') as out_file: + + for batch_id, batch in enumerate(train_data_iter): + input_data_feed, word_num = prepare_input(batch, epoch_id=0) + # import ipdb; ipdb.set_trace() + outputs = model(input_data_feed) + for i in range(outputs.shape[0]): + ins = outputs[i].numpy() + res = [tar_id2vocab[int(e)] for e in ins[:, 0].reshape(-1)] + new_res = [] + for ele in res: + if ele == "
": + break + new_res.append(ele) + + out_file.write(space_tok.join(new_res)) + out_file.write(line_tok) + + +def check_version(): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." \ + + try: + fluid.require_version('1.6.0') + except Exception as e: + logger.error(err) + sys.exit(1) + + +if __name__ == '__main__': + check_version() + infer() diff --git a/dygraph/seq2seq/infer.sh b/dygraph/seq2seq/infer.sh new file mode 100755 index 0000000000000000000000000000000000000000..5c79b9ee838dc28bd1eec261eb5301ebadc208d0 --- /dev/null +++ b/dygraph/seq2seq/infer.sh @@ -0,0 +1,22 @@ +#!/bin/bash +export CUDA_VISIBLE_DEVICES=7 + +python infer.py \ + --attention True \ + --src_lang en --tar_lang vi \ + --num_layers 2 \ + --hidden_size 512 \ + --src_vocab_size 17191 \ + --tar_vocab_size 7709 \ + --batch_size 1 \ + --dropout 0.2 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --vocab_prefix data/en-vi/vocab \ + --infer_file data/en-vi/tst2013.en \ + --reload_model attention_models/epoch_10 \ + --infer_output_file attention_infer_output/infer_output.txt \ + --beam_size 10 \ + --use_gpu True + + diff --git a/dygraph/seq2seq/reader.py b/dygraph/seq2seq/reader.py new file mode 100755 index 0000000000000000000000000000000000000000..4f27560722a839c6bc51b3302fe1cc6b36064b65 --- /dev/null +++ b/dygraph/seq2seq/reader.py @@ -0,0 +1,220 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Utilities for parsing PTB text files.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import collections +import os +import io +import sys +import numpy as np + +Py3 = sys.version_info[0] == 3 + +UNK_ID = 0 + + +def _read_words(filename): + data = [] + with io.open(filename, "r", encoding='utf-8') as f: + if Py3: + return f.read().replace("\n", "").split() + else: + return f.read().decode("utf-8").replace(u"\n", u"").split() + + +def read_all_line(filenam): + data = [] + with io.open(filename, "r", encoding='utf-8') as f: + for line in f.readlines(): + data.append(line.strip()) + + +def _build_vocab(filename): + + vocab_dict = {} + ids = 0 + with io.open(filename, "r", encoding='utf-8') as f: + for line in f.readlines(): + vocab_dict[line.strip()] = ids + ids += 1 + + print("vocab word num", ids) + + return vocab_dict + + +def _para_file_to_ids(src_file, tar_file, src_vocab, tar_vocab): + + src_data = [] + with io.open(src_file, "r", encoding='utf-8') as f_src: + for line in f_src.readlines(): + arra = line.strip().split() + ids = [src_vocab[w] if w in src_vocab else UNK_ID for w in arra] + ids = ids + + src_data.append(ids) + + tar_data = [] + with io.open(tar_file, "r", encoding='utf-8') as f_tar: + for line in f_tar.readlines(): + arra = line.strip().split() + ids = [tar_vocab[w] if w in tar_vocab else UNK_ID for w in arra] + + ids = [1] + ids + [2] + + tar_data.append(ids) + + return src_data, tar_data + + +def filter_len(src, tar, max_sequence_len=50): + new_src = [] + new_tar = [] + + for id1, id2 in zip(src, tar): + if len(id1) > max_sequence_len: + id1 = id1[:max_sequence_len] + if len(id2) > max_sequence_len + 2: + id2 = id2[:max_sequence_len + 2] + + new_src.append(id1) + new_tar.append(id2) + + return new_src, new_tar + + +def raw_data(src_lang, + tar_lang, + vocab_prefix, + train_prefix, + eval_prefix, + test_prefix, + max_sequence_len=50): + + src_vocab_file = vocab_prefix + "." + src_lang + tar_vocab_file = vocab_prefix + "." + tar_lang + + src_train_file = train_prefix + "." + src_lang + tar_train_file = train_prefix + "." + tar_lang + + src_eval_file = eval_prefix + "." + src_lang + tar_eval_file = eval_prefix + "." + tar_lang + + src_test_file = test_prefix + "." + src_lang + tar_test_file = test_prefix + "." 
+ tar_lang + + src_vocab = _build_vocab(src_vocab_file) + tar_vocab = _build_vocab(tar_vocab_file) + + train_src, train_tar = _para_file_to_ids( src_train_file, tar_train_file, \ + src_vocab, tar_vocab ) + train_src, train_tar = filter_len( + train_src, train_tar, max_sequence_len=max_sequence_len) + eval_src, eval_tar = _para_file_to_ids( src_eval_file, tar_eval_file, \ + src_vocab, tar_vocab ) + + test_src, test_tar = _para_file_to_ids( src_test_file, tar_test_file, \ + src_vocab, tar_vocab ) + + return ( train_src, train_tar), (eval_src, eval_tar), (test_src, test_tar),\ + (src_vocab, tar_vocab) + + +def raw_mono_data(vocab_file, file_path): + + src_vocab = _build_vocab(vocab_file) + + test_src, test_tar = _para_file_to_ids( file_path, file_path, \ + src_vocab, src_vocab ) + + return (test_src, test_tar) + + +def get_data_iter(raw_data, + batch_size, + mode='train', + enable_ce=False, + cache_num=20): + + src_data, tar_data = raw_data + + data_len = len(src_data) + + index = np.arange(data_len) + if mode == "train" and not enable_ce: + np.random.shuffle(index) + + def to_pad_np(data, source=False): + max_len = 0 + bs = min(batch_size, len(data)) + for ele in data: + if len(ele) > max_len: + max_len = len(ele) + + ids = np.ones((bs, max_len), dtype='int64') * 2 + mask = np.zeros((bs), dtype='int32') + + for i, ele in enumerate(data): + ids[i, :len(ele)] = ele + if not source: + mask[i] = len(ele) - 1 + else: + mask[i] = len(ele) + + return ids, mask + + b_src = [] + + if mode != "train": + cache_num = 1 + for j in range(data_len): + if len(b_src) == batch_size * cache_num: + # build batch size + + # sort + if mode == 'infer': + new_cache = b_src + else: + new_cache = sorted(b_src, key=lambda k: len(k[0])) + + + for i in range(cache_num): + batch_data = new_cache[i * batch_size:(i + 1) * batch_size] + src_cache = [w[0] for w in batch_data] + tar_cache = [w[1] for w in batch_data] + src_ids, src_mask = to_pad_np(src_cache, source=True) + tar_ids, tar_mask = to_pad_np(tar_cache) + yield (src_ids, src_mask, tar_ids, tar_mask) + + b_src = [] + + b_src.append((src_data[index[j]], tar_data[index[j]])) + if len(b_src) == batch_size * cache_num or mode == 'infer': + if mode == 'infer': + new_cache = b_src + else: + new_cache = sorted(b_src, key=lambda k: len(k[0])) + + for i in range(cache_num): + batch_end = min(len(new_cache), (i + 1) * batch_size) + batch_data = new_cache[i * batch_size: batch_end] + src_cache = [w[0] for w in batch_data] + tar_cache = [w[1] for w in batch_data] + src_ids, src_mask = to_pad_np(src_cache, source=True) + tar_ids, tar_mask = to_pad_np(tar_cache) + yield (src_ids, src_mask, tar_ids, tar_mask) diff --git a/dygraph/seq2seq/rnn.py b/dygraph/seq2seq/rnn.py new file mode 100644 index 0000000000000000000000000000000000000000..9d841530ebd865e3c6e21911ef8b04da498ded15 --- /dev/null +++ b/dygraph/seq2seq/rnn.py @@ -0,0 +1,94 @@ +from paddle.fluid import layers +from paddle.fluid.dygraph import Layer + +class BasicLSTMUnit(Layer): + """ + **** + BasicLSTMUnit class, Using basic operator to build LSTM + The algorithm can be described as the code below. + .. math:: + i_t &= \sigma(W_{ix}x_{t} + W_{ih}h_{t-1} + b_i) + f_t &= \sigma(W_{fx}x_{t} + W_{fh}h_{t-1} + b_f + forget_bias ) + o_t &= \sigma(W_{ox}x_{t} + W_{oh}h_{t-1} + b_o) + \\tilde{c_t} &= tanh(W_{cx}x_t + W_{ch}h_{t-1} + b_c) + c_t &= f_t \odot c_{t-1} + i_t \odot \\tilde{c_t} + h_t &= o_t \odot tanh(c_t) + - $W$ terms denote weight matrices (e.g. 
$W_{ix}$ is the matrix + of weights from the input gate to the input) + - The b terms denote bias vectors ($bx_i$ and $bh_i$ are the input gate bias vector). + - sigmoid is the logistic sigmoid function. + - $i, f, o$ and $c$ are the input gate, forget gate, output gate, + and cell activation vectors, respectively, all of which have the same size as + the cell output activation vector $h$. + - The :math:`\odot` is the element-wise product of the vectors. + - :math:`tanh` is the activation functions. + - :math:`\\tilde{c_t}` is also called candidate hidden state, + which is computed based on the current input and the previous hidden state. + Args: + name_scope(string) : The name scope used to identify parameter and bias name + hidden_size (integer): The hidden size used in the Unit. + param_attr(ParamAttr|None): The parameter attribute for the learnable + weight matrix. Note: + If it is set to None or one attribute of ParamAttr, lstm_unit will + create ParamAttr as param_attr. If the Initializer of the param_attr + is not set, the parameter is initialized with Xavier. Default: None. + bias_attr (ParamAttr|None): The parameter attribute for the bias + of LSTM unit. + If it is set to None or one attribute of ParamAttr, lstm_unit will + create ParamAttr as bias_attr. If the Initializer of the bias_attr + is not set, the bias is initialized as zero. Default: None. + gate_activation (function|None): The activation function for gates (actGate). + Default: 'fluid.layers.sigmoid' + activation (function|None): The activation function for cells (actNode). + Default: 'fluid.layers.tanh' + forget_bias(float|1.0): forget bias used when computing forget gate + dtype(string): data type used in this unit + """ + + def __init__(self, + hidden_size, + input_size, + param_attr=None, + bias_attr=None, + gate_activation=None, + activation=None, + forget_bias=1.0, + dtype='float32'): + super(BasicLSTMUnit, self).__init__(dtype) + + self._hiden_size = hidden_size + self._param_attr = param_attr + self._bias_attr = bias_attr + self._gate_activation = gate_activation or layers.sigmoid + self._activation = activation or layers.tanh + self._forget_bias = layers.fill_constant( + [1], dtype=dtype, value=forget_bias) + self._forget_bias.stop_gradient = False + self._dtype = dtype + self._input_size = input_size + + self._weight = self.create_parameter( + attr=self._param_attr, + shape=[self._input_size + self._hiden_size, 4 * self._hiden_size], + dtype=self._dtype) + + self._bias = self.create_parameter( + attr=self._bias_attr, + shape=[4 * self._hiden_size], + dtype=self._dtype, + is_bias=True) + + def forward(self, input, pre_hidden, pre_cell): + concat_input_hidden = layers.concat([input, pre_hidden], 1) + gate_input = layers.matmul(x=concat_input_hidden, y=self._weight) + + gate_input = layers.elementwise_add(gate_input, self._bias) + i, j, f, o = layers.split(gate_input, num_or_sections=4, dim=-1) + new_cell = layers.elementwise_add( + layers.elementwise_mul( + pre_cell, + layers.sigmoid(layers.elementwise_add(f, self._forget_bias))), + layers.elementwise_mul(layers.sigmoid(i), layers.tanh(j))) + new_hidden = layers.tanh(new_cell) * layers.sigmoid(o) + + return new_hidden, new_cell \ No newline at end of file diff --git a/dygraph/seq2seq/run.sh b/dygraph/seq2seq/run.sh new file mode 100755 index 0000000000000000000000000000000000000000..25bc78a3c7f3303324c0f5fb9aac14b467918a16 --- /dev/null +++ b/dygraph/seq2seq/run.sh @@ -0,0 +1,20 @@ +#!/bin/bash +export CUDA_VISIBLE_DEVICES=0 + +python train.py \ + --src_lang en 
--tar_lang vi \ + --attention True \ + --num_layers 2 \ + --hidden_size 512 \ + --src_vocab_size 17191 \ + --tar_vocab_size 7709 \ + --batch_size 128 \ + --dropout 0.2 \ + --init_scale 0.1 \ + --max_grad_norm 5.0 \ + --train_data_prefix data/en-vi/train \ + --eval_data_prefix data/en-vi/tst2012 \ + --test_data_prefix data/en-vi/tst2013 \ + --vocab_prefix data/en-vi/vocab \ + --use_gpu True \ + --model_path attention_models diff --git a/dygraph/seq2seq/train.py b/dygraph/seq2seq/train.py new file mode 100755 index 0000000000000000000000000000000000000000..a25dccc8b316c1cd61672bd47730379f49802ea2 --- /dev/null +++ b/dygraph/seq2seq/train.py @@ -0,0 +1,220 @@ +# -*- coding: utf-8 -*- +# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np +import time +import os +import logging +import random +import math +import contextlib + +import paddle +import paddle.fluid as fluid +from paddle.fluid.dygraph_grad_clip import GradClipByGlobalNorm + +import reader + +import sys +if sys.version[0] == '2': + reload(sys) + sys.setdefaultencoding("utf-8") + +from args import * +from base_model import BaseModel +from attention_model import AttentionModel +import logging +import pickle + + +def main(): + args = parse_args() + print(args) + num_layers = args.num_layers + src_vocab_size = args.src_vocab_size + tar_vocab_size = args.tar_vocab_size + batch_size = args.batch_size + dropout = args.dropout + init_scale = args.init_scale + max_grad_norm = args.max_grad_norm + hidden_size = args.hidden_size + + place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace() + with fluid.dygraph.guard(place): + args.enable_ce = True + if args.enable_ce: + fluid.default_startup_program().random_seed = 102 + fluid.default_main_program().random_seed = 102 + np.random.seed(102) + random.seed(102) + + # Training process + + if args.attention: + model = AttentionModel( + hidden_size, + src_vocab_size, + tar_vocab_size, + batch_size, + num_layers=num_layers, + init_scale=init_scale, + dropout=dropout) + else: + model = BaseModel( + hidden_size, + src_vocab_size, + tar_vocab_size, + batch_size, + num_layers=num_layers, + init_scale=init_scale, + dropout=dropout) + gloabl_norm_clip = GradClipByGlobalNorm(max_grad_norm) + lr = args.learning_rate + opt_type = args.optimizer + if opt_type == "sgd": + optimizer = fluid.optimizer.SGD(lr, parameter_list=model.parameters()) + elif opt_type == "adam": + optimizer = fluid.optimizer.Adam(lr, parameter_list=model.parameters()) + else: + print("only support [sgd|adam]") + raise Exception("opt type not support") + + train_data_prefix = args.train_data_prefix + eval_data_prefix = args.eval_data_prefix + test_data_prefix = args.test_data_prefix + vocab_prefix = args.vocab_prefix + src_lang = args.src_lang + tar_lang = args.tar_lang + print("begin to load data") + raw_data = reader.raw_data(src_lang, tar_lang, 
vocab_prefix, + train_data_prefix, eval_data_prefix, + test_data_prefix, args.max_len) + print("finished load data") + train_data, valid_data, test_data, _ = raw_data + + def prepare_input(batch, epoch_id=0): + src_ids, src_mask, tar_ids, tar_mask = batch + res = {} + src_ids = src_ids.reshape((src_ids.shape[0], src_ids.shape[1])) + in_tar = tar_ids[:, :-1] + label_tar = tar_ids[:, 1:] + + in_tar = in_tar.reshape((in_tar.shape[0], in_tar.shape[1])) + label_tar = label_tar.reshape( + (label_tar.shape[0], label_tar.shape[1], 1)) + inputs = [src_ids, in_tar, label_tar, src_mask, tar_mask] + return inputs, np.sum(tar_mask) + + # get train epoch size + def eval(data, epoch_id=0): + model.eval() + eval_data_iter = reader.get_data_iter(data, batch_size, mode='eval') + total_loss = 0.0 + word_count = 0.0 + for batch_id, batch in enumerate(eval_data_iter): + input_data_feed, word_num = prepare_input( + batch, epoch_id) + loss = model(input_data_feed) + + total_loss += loss * batch_size + word_count += word_num + ppl = np.exp(total_loss.numpy() / word_count) + model.train() + return ppl + + max_epoch = args.max_epoch + for epoch_id in range(max_epoch): + model.train() + start_time = time.time() + if args.enable_ce: + train_data_iter = reader.get_data_iter( + train_data, batch_size, enable_ce=True) + else: + train_data_iter = reader.get_data_iter(train_data, batch_size) + + total_loss = 0 + word_count = 0.0 + batch_times = [] + for batch_id, batch in enumerate(train_data_iter): + batch_start_time = time.time() + input_data_feed, word_num = prepare_input( + batch, epoch_id=epoch_id) + word_count += word_num + loss = model(input_data_feed) + # print(loss.numpy()[0]) + loss.backward() + optimizer.minimize(loss, grad_clip = gloabl_norm_clip) + model.clear_gradients() + total_loss += loss * batch_size + batch_end_time = time.time() + batch_time = batch_end_time - batch_start_time + batch_times.append(batch_time) + + if batch_id > 0 and batch_id % 100 == 0: + print("-- Epoch:[%d]; Batch:[%d]; Time: %.5f s; ppl: %.5f" % + (epoch_id, batch_id, batch_time, + np.exp(total_loss.numpy() / word_count))) + total_loss = 0.0 + word_count = 0.0 + + end_time = time.time() + epoch_time = end_time - start_time + print( + "\nTrain epoch:[%d]; Epoch Time: %.5f; avg_time: %.5f s/step\n" + % (epoch_id, epoch_time, sum(batch_times) / len(batch_times))) + + + dir_name = os.path.join(args.model_path, + "epoch_" + str(epoch_id)) + print("begin to save", dir_name) + paddle.fluid.save_dygraph(model.state_dict(), dir_name) + print("save finished") + dev_ppl = eval(valid_data) + print("dev ppl", dev_ppl) + test_ppl = eval(test_data) + print("test ppl", test_ppl) + + +def get_cards(): + num = 0 + cards = os.environ.get('CUDA_VISIBLE_DEVICES', '') + if cards != '': + num = len(cards.split(",")) + return num + + +def check_version(): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." 
\ + + try: + fluid.require_version('1.6.0') + except Exception as e: + logger.error(err) + sys.exit(1) + + +if __name__ == '__main__': + check_version() + main() diff --git a/dygraph/transformer/README.md b/dygraph/transformer/README.md index 2faef885db47b5c3a9776721b1901f2b93e169db..6776e618a69fe35ed46552faf9512f58a07e7685 100644 --- a/dygraph/transformer/README.md +++ b/dygraph/transformer/README.md @@ -1,194 +1,286 @@ -## 简介 +## Transformer -### 任务说明 +以下是本例的简要目录结构及说明: -机器翻译(machine translation, MT)是利用计算机将一种自然语言(源语言)转换为另一种自然语言(目标语言)的过程,输入为源语言句子,输出为相应的目标语言的句子。本示例是机器翻译主流模型 Transformer 的实现和相关介绍。 +```text +. +├── images # README 文档中的图片 +├── utils # 工具包 +├── gen_data.sh # 数据生成脚本 +├── predict.py # 预测脚本 +├── reader.py # 数据读取接口 +├── README.md # 文档 +├── train.py # 训练脚本 +├── model.py # 模型定义文件 +└── transformer.yaml # 配置文件 +``` -动态图文档请见[Dygraph](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/user_guides/howto/dygraph/DyGraph.html) +## 模型简介 -### 数据集说明 +机器翻译(machine translation, MT)是利用计算机将一种自然语言(源语言)转换为另一种自然语言(目标语言)的过程,输入为源语言句子,输出为相应的目标语言的句子。 -我们使用[WMT-16](http://www.statmt.org/wmt16/)新增的[multimodal task](http://www.statmt.org/wmt16/multimodal-task.html)中的[translation task](http://www.statmt.org/wmt16/multimodal-task.html#task1)的数据集作为示例。该数据集为英德翻译数据,包含29001条训练数据,1000条测试数据。 +本项目是机器翻译领域主流模型 Transformer 的 PaddlePaddle 实现, 包含模型训练,预测以及使用自定义数据等内容。用户可以基于发布的内容搭建自己的翻译模型。 -该数据集内置在了Paddle中,可以通过 `paddle.dataset.wmt16` 使用,执行本项目中的训练代码数据集将自动下载到 `~/.cache/paddle/dataset/wmt16/` 目录下。 + +## 快速开始 ### 安装说明 1. paddle安装 - 本项目依赖于 PaddlePaddle Fluid 1.6.0 及以上版本(1.6.0 待近期正式发版,可先使用 develop),请参考 [安装指南](http://www.paddlepaddle.org/#quick-start) 进行安装 + 本项目依赖于 PaddlePaddle 1.7及以上版本或适当的develop版本,请参考 [安装指南](http://www.paddlepaddle.org/#quick-start) 进行安装 + +2. 下载代码 + + 克隆代码库到本地 + ```shell + git clone https://github.com/PaddlePaddle/models.git + cd models/dygraph/transformer + ``` + +3. 环境依赖 + + 请参考PaddlePaddle[安装说明](https://www.paddlepaddle.org.cn/documentation/docs/zh/1.6/beginners_guide/install/index_cn.html)部分的内容 -2. 环境依赖 - 多卡运行需要 NCCL 2.4.7 版本。 +### 数据准备 -### 执行训练: -如果是使用GPU单卡训练,启动训练的方式: +公开数据集:WMT 翻译大赛是机器翻译领域最具权威的国际评测大赛,其中英德翻译任务提供了一个中等规模的数据集,这个数据集是较多论文中使用的数据集,也是 Transformer 论文中用到的一个数据集。我们也将[WMT'16 EN-DE 数据集](http://www.statmt.org/wmt16/translation-task.html)作为示例提供。运行 `gen_data.sh` 脚本进行 WMT'16 EN-DE 数据集的下载和预处理(时间较长,建议后台运行)。数据处理过程主要包括 Tokenize 和 [BPE 编码(byte-pair encoding)](https://arxiv.org/pdf/1508.07909)。运行成功后,将会生成文件夹 `gen_data`,其目录结构如下: + +```text +. 
+├── wmt16_ende_data # WMT16 英德翻译数据 +├── wmt16_ende_data_bpe # BPE 编码的 WMT16 英德翻译数据 +├── mosesdecoder # Moses 机器翻译工具集,包含了 Tokenize、BLEU 评估等脚本 +└── subword-nmt # BPE 编码的代码 ``` -env CUDA_VISIBLE_DEVICES=0 python train.py + +另外我们也整理提供了一份处理好的 WMT'16 EN-DE 数据以供[下载](https://transformer-res.bj.bcebos.com/wmt16_ende_data_bpe_clean.tar.gz)使用,其中包含词典(`vocab_all.bpe.32000`文件)、训练所需的 BPE 数据(`train.tok.clean.bpe.32000.en-de`文件)、预测所需的 BPE 数据(`newstest2016.tok.bpe.32000.en-de`等文件)和相应的评估预测结果所需的 tokenize 数据(`newstest2016.tok.de`等文件)。 + + +自定义数据:如果需要使用自定义数据,本项目程序中可直接支持的数据格式为制表符 \t 分隔的源语言和目标语言句子对,句子中的 token 之间使用空格分隔。提供以上格式的数据文件(可以分多个part,数据读取支持文件通配符)和相应的词典文件即可直接运行。 + +### 单机训练 + +### 单机单卡 + +以提供的英德翻译数据为例,可以执行以下命令进行模型训练: + +```sh +# setting visible devices for training +export CUDA_VISIBLE_DEVICES=0 + +python -u train.py \ + --epoch 30 \ + --src_vocab_fpath gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000 \ + --trg_vocab_fpath gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000 \ + --special_token '' '' '' \ + --training_file gen_data/wmt16_ende_data_bpe/train.tok.clean.bpe.32000.en-de \ + --batch_size 4096 ``` -这里`CUDA_VISIBLE_DEVICES=0`表示是执行在0号设备卡上,请根据自身情况修改这个参数。如需调整其他模型及训练参数,可在 `config.py` 中修改或使用如下方式传入: +以上命令中传入了训练轮数(`epoch`)和训练数据文件路径(注意请正确设置,支持通配符)等参数,更多参数的使用以及支持的模型超参数可以参见 `transformer.yaml` 配置文件,其中默认提供了 Transformer base model 的配置,如需调整可以在配置文件中更改或通过命令行传入(命令行传入内容将覆盖配置文件中的设置)。可以通过以下命令来训练 Transformer 论文中的 big model: ```sh -python train.py \ - n_head 16 \ - d_model 1024 \ - d_inner_hid 4096 \ - prepostprocess_dropout 0.3 +# setting visible devices for training +export CUDA_VISIBLE_DEVICES=0 + +python -u train.py \ + --epoch 30 \ + --src_vocab_fpath gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000 \ + --trg_vocab_fpath gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000 \ + --special_token '' '' '' \ + --training_file gen_data/wmt16_ende_data_bpe/train.tok.clean.bpe.32000.en-de \ + --batch_size 4096 \ + --n_head 16 \ + --d_model 1024 \ + --d_inner_hid 4096 \ + --prepostprocess_dropout 0.3 ``` -Paddle动态图支持多进程多卡进行模型训练,启动训练的方式: +另外,如果在执行训练时若提供了 `save_model`(默认为 trained_models),则每隔一定 iteration 后(通过参数 `save_step` 设置,默认为10000)将保存当前训练的到相应目录(会保存分别记录了模型参数和优化器状态的 `transformer.pdparams` 和 `transformer.pdopt` 两个文件),每隔一定数目的 iteration (通过参数 `print_step` 设置,默认为100)将打印如下的日志到标准输出: + +```txt +[2019-08-02 15:30:51,656 INFO train.py:262] step_idx: 150100, epoch: 32, batch: 1364, avg loss: 2.880427, normalized loss: 1.504687, ppl: 17.821888, speed: 3.34 step/s +[2019-08-02 15:31:19,824 INFO train.py:262] step_idx: 150200, epoch: 32, batch: 1464, avg loss: 2.955965, normalized loss: 1.580225, ppl: 19.220257, speed: 3.55 step/s +[2019-08-02 15:31:48,151 INFO train.py:262] step_idx: 150300, epoch: 32, batch: 1564, avg loss: 2.951180, normalized loss: 1.575439, ppl: 19.128502, speed: 3.53 step/s +[2019-08-02 15:32:16,401 INFO train.py:262] step_idx: 150400, epoch: 32, batch: 1664, avg loss: 3.027281, normalized loss: 1.651540, ppl: 20.641024, speed: 3.54 step/s +[2019-08-02 15:32:44,764 INFO train.py:262] step_idx: 150500, epoch: 32, batch: 1764, avg loss: 3.069125, normalized loss: 1.693385, ppl: 21.523066, speed: 3.53 step/s +[2019-08-02 15:33:13,199 INFO train.py:262] step_idx: 150600, epoch: 32, batch: 1864, avg loss: 2.869379, normalized loss: 1.493639, ppl: 17.626074, speed: 3.52 step/s +[2019-08-02 15:33:41,601 INFO train.py:262] step_idx: 150700, epoch: 32, batch: 1964, avg loss: 2.980905, normalized loss: 1.605164, ppl: 19.705633, speed: 3.52 step/s +[2019-08-02 15:34:10,079 INFO train.py:262] step_idx: 150800, epoch: 32, batch: 2064, 
avg loss: 3.047716, normalized loss: 1.671976, ppl: 21.067181, speed: 3.51 step/s +[2019-08-02 15:34:38,598 INFO train.py:262] step_idx: 150900, epoch: 32, batch: 2164, avg loss: 2.956475, normalized loss: 1.580735, ppl: 19.230072, speed: 3.51 step/s ``` -python -m paddle.distributed.launch --started_port 9999 --selected_gpus=0,1,2,3 --log_dir ./mylog train.py --use_data_parallel 1 + +也可以使用 CPU 训练(通过参数 `--use_cuda False` 设置),训练速度较慢。 + +#### 单机多卡 + +Paddle动态图支持多进程多卡进行模型训练,启动训练的方式如下: + +```sh +python -m paddle.distributed.launch --started_port 8999 --selected_gpus=0,1,2,3,4,5,6,7 --log_dir ./mylog train.py \ + --epoch 30 \ + --src_vocab_fpath wmt16_ende_data_bpe/vocab_all.bpe.32000 \ + --trg_vocab_fpath wmt16_ende_data_bpe/vocab_all.bpe.32000 \ + --special_token '' '' '' \ + --training_file wmt16_ende_data_bpe/train.tok.clean.bpe.32000.en-de \ + --batch_size 4096 \ + --print_step 100 \ + --use_cuda True \ + --save_step 10000 ``` -此时,程序会将每个进程的输出log导入到`./mylog`路径下: + +此时,程序会将每个进程的输出log导入到`./mylog`路径下,只有第一个工作进程会保存模型。 + ``` . ├── mylog │   ├── workerlog.0 │   ├── workerlog.1 │   ├── workerlog.2 -│   └── workerlog.3 -├── README.md -└── train.py +│   ├── workerlog.3 +│   ├── workerlog.4 +│   ├── workerlog.5 +│   ├── workerlog.6 +│   └── workerlog.7 ``` -### 执行效果 - - W0422 13:25:53.853921 116144 device_context.cc:261] Please NOTE: device: 0, CUDA Capability: 35, Driver API Version: 9.0, Runtime API Version: 8.0 - W0422 13:25:53.861614 116144 device_context.cc:269] device: 0, cuDNN Version: 7.0. +### 模型推断 - pass num : 0, batch_id: 10, dy_graph avg loss: [9.033163] - pass num : 0, batch_id: 20, dy_graph avg loss: [8.869838] - pass num : 0, batch_id: 30, dy_graph avg loss: [8.635877] - pass num : 0, batch_id: 40, dy_graph avg loss: [8.460026] - pass num : 0, batch_id: 50, dy_graph avg loss: [8.293438] - pass num : 0, batch_id: 60, dy_graph avg loss: [8.138791] - pass num : 0, batch_id: 70, dy_graph avg loss: [7.9594088] - pass num : 0, batch_id: 80, dy_graph avg loss: [7.7303553] - pass num : 0, batch_id: 90, dy_graph avg loss: [7.6716228] - pass num : 0, batch_id: 100, dy_graph avg loss: [7.611051] - pass num : 0, batch_id: 110, dy_graph avg loss: [7.4179897] - pass num : 0, batch_id: 120, dy_graph avg loss: [7.318419] +以英德翻译数据为例,模型训练完成后可以执行以下命令对指定文件中的文本进行翻译: - -### 执行预测 - -训练完成后,使用如下命令进行预测: - -``` -env CUDA_VISIBLE_DEVICES=0 python predict.py +```sh +# setting visible devices for prediction +export CUDA_VISIBLE_DEVICES=0 + +python -u predict.py \ + --src_vocab_fpath gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000 \ + --trg_vocab_fpath gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000 \ + --special_token '' '' '' \ + --predict_file gen_data/wmt16_ende_data_bpe/newstest2014.tok.bpe.32000.en-de \ + --batch_size 32 \ + --init_from_params trained_params/step_100000 \ + --beam_size 5 \ + --max_out_len 255 \ + --output_file predict.txt ``` -预测结果将输出到 `predict.txt` 文件中(可在运行时通过 `--output_file` 更改),其他模型与预测参数也可在 `config.py` 中修改或使用如下方式传入: + 由 `predict_file` 指定的文件中文本的翻译结果会输出到 `output_file` 指定的文件。执行预测时需要设置 `init_from_params` 来给出模型所在目录,更多参数的使用可以在 `transformer.yaml` 文件中查阅注释说明并进行更改设置。注意若在执行预测时设置了模型超参数,应与模型训练时的设置一致,如若训练时使用 big model 的参数设置,则预测时对应类似如下命令: ```sh -python predict.py \ - n_head 16 \ - d_model 1024 \ - d_inner_hid 4096 \ - prepostprocess_dropout 0.3 +# setting visible devices for prediction +export CUDA_VISIBLE_DEVICES=0 + +python -u predict.py \ + --src_vocab_fpath gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000 \ + --trg_vocab_fpath gen_data/wmt16_ende_data_bpe/vocab_all.bpe.32000 \ + --special_token '' '' 
'' \ + --predict_file gen_data/wmt16_ende_data_bpe/newstest2014.tok.bpe.32000.en-de \ + --batch_size 32 \ + --init_from_params trained_params/step_100000 \ + --beam_size 5 \ + --max_out_len 255 \ + --output_file predict.txt \ + --n_head 16 \ + --d_model 1024 \ + --d_inner_hid 4096 \ + --prepostprocess_dropout 0.3 ``` -完成预测后,可以借助第三方工具进行 BLEU 指标的评估,可按照如下方式进行: -```sh -# 提取 reference 数据 -tar -zxvf ~/.cache/paddle/dataset/wmt16/wmt16.tar.gz -awk 'BEGIN {FS="\t"}; {print $2}' wmt16/test > ref.de +### 模型评估 -# clone mosesdecoder代码 -git clone https://github.com/moses-smt/mosesdecoder.git +预测结果中每行输出是对应行输入的得分最高的翻译,对于使用 BPE 的数据,预测出的翻译结果也将是 BPE 表示的数据,要还原成原始的数据(这里指 tokenize 后的数据)才能进行正确的评估。评估过程具体如下(BLEU 是翻译任务常用的自动评估方法指标): -# 进行评估 -perl mosesdecoder/scripts/generic/multi-bleu.perl ref.de < predict.txt +```sh +# 还原 predict.txt 中的预测结果为 tokenize 后的数据 +sed -r 's/(@@ )|(@@ ?$)//g' predict.txt > predict.tok.txt +# 若无 BLEU 评估工具,需先进行下载 +# git clone https://github.com/moses-smt/mosesdecoder.git +# 以英德翻译 newstest2014 测试数据为例 +perl gen_data/mosesdecoder/scripts/generic/multi-bleu.perl gen_data/wmt16_ende_data/newstest2014.tok.de < predict.tok.txt ``` - -使用默认配置单卡训练20个 epoch 训练的模型约有如下评估结果: +可以看到类似如下的结果: ``` -BLEU = 32.38, 64.3/39.1/25.9/16.9 (BP=1.000, ratio=1.001, hyp_len=12122, ref_len=12104) +BLEU = 26.35, 57.7/32.1/20.0/13.0 (BP=1.000, ratio=1.013, hyp_len=63903, ref_len=63078) ``` +使用本项目中提供的内容,英德翻译 base model 和 big model 八卡训练 100K 个 iteration 后测试有大约如下的 BLEU 值: -## 进阶使用 - -### 自定义数据 - -- 训练: - - 修改 `train.py` 中的如下代码段 - - ```python - reader = paddle.batch(wmt16.train(ModelHyperParams.src_vocab_size, - ModelHyperParams.trg_vocab_size), - batch_size=TrainTaskConfig.batch_size) - ``` - - - 将其中的 `wmt16.train` 替换为类似如下的 python generator : - - ```python - def reader(file_name, src_dict, trg_dict): - start_id = src_dict[START_MARK] # BOS - end_id = src_dict[END_MARK] # EOS - unk_id = src_dict[UNK_MARK] # UNK +| 测试集 | newstest2014 | newstest2015 | newstest2016 | +|-|-|-|-| +| Base | 26.35 | 29.07 | 33.30 | +| Big | 27.07 | 30.09 | 34.38 | - src_col, trg_col = 0, 1 +### 预训练模型 - for line in open(file_name, "r"): - line = line.strip() - line_split = line.strip().split("\t") - if len(line_split) != 2: - continue - src_words = line_split[src_col].split() - src_ids = [start_id] + [ - src_dict.get(w, unk_id) for w in src_words - ] + [end_id] +我们这里提供了对应有以上 BLEU 值的 [base model](https://transformer-res.bj.bcebos.com/base_model_dygraph.tar.gz) 和 [big model](https://transformer-res.bj.bcebos.com/big_model_dygraph.tar.gz) 的模型参数提供下载使用(注意,模型使用了提供下载的数据进行训练和测试)。 - trg_words = line_split[trg_col].split() - trg_ids = [trg_dict.get(w, unk_id) for w in trg_words] +## 进阶使用 - trg_ids_next = trg_ids + [end_id] - trg_ids = [start_id] + trg_ids +### 背景介绍 - yield src_ids, trg_ids, trg_ids_next - ``` +Transformer 是论文 [Attention Is All You Need](https://arxiv.org/abs/1706.03762) 中提出的用以完成机器翻译(machine translation, MT)等序列到序列(sequence to sequence, Seq2Seq)学习任务的一种全新网络结构,其完全使用注意力(Attention)机制来实现序列到序列的建模[1]。 -该 generator 产生的数据为单个样本,是包含源句(src_ids),目标句(trg_ids)和标签(trg_ids_next)三个 integer list 的 tuple;其中 src_ids 包含 BOS 和 EOS 的 id,trg_ids 包含 BOS 的 id,trg_ids_next 包含 EOS 的 id。 +相较于此前 Seq2Seq 模型中广泛使用的循环神经网络(Recurrent Neural Network, RNN),使用(Self)Attention 进行输入序列到输出序列的变换主要具有以下优势: -- 预测: - 修改 `predict.py` 中的如下代码段 +- 计算复杂度小 + - 特征维度为 d 、长度为 n 的序列,在 RNN 中计算复杂度为 `O(n * d * d)` (n 个时间步,每个时间步计算 d 维的矩阵向量乘法),在 Self-Attention 中计算复杂度为 `O(n * n * d)` (n 个时间步两两计算 d 维的向量点积或其他相关度函数),n 通常要小于 d 。 +- 计算并行度高 + - RNN 中当前时间步的计算要依赖前一个时间步的计算结果;Self-Attention 
中各时间步的计算只依赖输入不依赖之前时间步输出,各时间步可以完全并行。 +- 容易学习长程依赖(long-range dependencies) + - RNN 中相距为 n 的两个位置间的关联需要 n 步才能建立;Self-Attention 中任何两个位置都直接相连;路径越短信号传播越容易。 - ```python - reader = paddle.batch(wmt16.test(ModelHyperParams.src_vocab_size, - ModelHyperParams.trg_vocab_size), - batch_size=InferTaskConfig.batch_size) - id2word = wmt16.get_dict("de", - ModelHyperParams.trg_vocab_size, - reverse=True) - ``` +Transformer 中引入使用的基于 Self-Attention 的序列建模模块结构,已被广泛应用在 Bert [2]等语义表示模型中,取得了显著效果。 - 将其中的 `wmt16.test` 替换为和训练部分类似的 python generator ;另外还需要提供将 id 映射到 word 的 python dict 作为 `id2word` . -### 模型原理介绍 +### 模型概览 -Transformer 是论文 [Attention Is All You Need](https://arxiv.org/abs/1706.03762) 中提出的用以完成机器翻译(machine translation, MT)等序列到序列(sequence to sequence, Seq2Seq)学习任务的一种全新网络结构。其同样使用了 Seq2Seq 任务中典型的编码器-解码器(Encoder-Decoder)的框架结构,但相较于此前广泛使用的循环神经网络(Recurrent Neural Network, RNN),其完全使用注意力(Attention)机制来实现序列到序列的建模,整体网络结构如图1所示。 +Transformer 同样使用了 Seq2Seq 模型中典型的编码器-解码器(Encoder-Decoder)的框架结构,整体网络结构如图1所示。

图 1. Transformer 网络结构图

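图1中,Encoder 和 Decoder 的输入均为词向量与位置编码(Positional Encoding)相加的结果。下面用 NumPy 给出正弦位置编码的一个简化示意(公式来自原论文,变量名为示例假设,仅帮助理解;实际实现请以本目录 `model.py` 中的 `position_encoding_init` 为准):

```python
import numpy as np

def sinusoid_position_encoding(max_len, d_model):
    """正弦位置编码的简化示意(假设 d_model 为偶数)。
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    pos = np.arange(max_len, dtype="float32")[:, None]
    div = np.power(10000.0, np.arange(0, d_model, 2, dtype="float32") / d_model)
    pe = np.zeros((max_len, d_model), dtype="float32")
    pe[:, 0::2] = np.sin(pos / div)
    pe[:, 1::2] = np.cos(pos / div)
    return pe

print(sinusoid_position_encoding(max_len=8, d_model=16).shape)  # (8, 16)
```

位置编码只与位置和维度有关、不含可学习参数,因此模型中将其作为不可训练的 Embedding 参数使用(见 `WrapEncoder`/`WrapDecoder` 中 `pos_encoder` 的初始化)。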
-Encoder 由若干相同的 layer 堆叠组成,每个 layer 主要由多头注意力(Multi-Head Attention)和全连接的前馈(Feed-Forward)网络这两个 sub-layer 构成。 +可以看到,和以往 Seq2Seq 模型不同,Transformer 的 Encoder 和 Decoder 中不再使用 RNN 的结构。 + +### 模型特点 + +Transformer 中的 Encoder 由若干相同的 layer 堆叠组成,每个 layer 主要由多头注意力(Multi-Head Attention)和全连接的前馈(Feed-Forward)网络这两个 sub-layer 构成。 - Multi-Head Attention 在这里用于实现 Self-Attention,相比于简单的 Attention 机制,其将输入进行多路线性变换后分别计算 Attention 的结果,并将所有结果拼接后再次进行线性变换作为输出。参见图2,其中 Attention 使用的是点积(Dot-Product),并在点积后进行了 scale 的处理以避免因点积结果过大进入 softmax 的饱和区域。 - Feed-Forward 网络会对序列中的每个位置进行相同的计算(Position-wise),其采用的是两次线性变换中间加以 ReLU 激活的结构。 -此外,每个 sub-layer 后还施以 [Residual Connection](http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf) 和 [Layer Normalization](https://arxiv.org/pdf/1607.06450.pdf) 来促进梯度传播和模型收敛。 +此外,每个 sub-layer 后还施以 Residual Connection [3]和 Layer Normalization [4]来促进梯度传播和模型收敛。

图 2. Multi-Head Attention

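结合图2,下面用 NumPy 给出 Multi-Head Attention(含缩放点积注意力)的一个简化示意,仅帮助理解计算流程;其中的矩阵维度、缩放因子等均为示例假设,实际实现请以本目录 `model.py` 中的 `MultiHeadAttention` 为准:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(q, k, v, n_head, d_key, w_q, w_k, w_v, w_o):
    # q/k/v: [batch, seq_len, d_model];w_q/w_k/w_v/w_o 为线性变换矩阵
    def split_heads(x):
        b, t, _ = x.shape
        return x.reshape(b, t, n_head, d_key).transpose(0, 2, 1, 3)  # [b, n_head, t, d_key]

    q, k, v = split_heads(q @ w_q), split_heads(k @ w_k), split_heads(v @ w_v)
    # 缩放点积注意力:除以 sqrt(d_key) 以避免点积结果过大进入 softmax 的饱和区域
    scores = q @ k.transpose(0, 1, 3, 2) * d_key ** -0.5
    out = softmax(scores) @ v                                  # [b, n_head, t, d_key]
    b, _, t, d = out.shape
    out = out.transpose(0, 2, 1, 3).reshape(b, t, n_head * d)  # 拼接各头的结果
    return out @ w_o                                           # 再次线性变换作为输出

batch, seq_len, d_model, n_head = 2, 5, 64, 8
d_key = d_model // n_head
rng = np.random.RandomState(0)
x = rng.rand(batch, seq_len, d_model).astype("float32") * 0.1
w = [(rng.rand(d_model, d_model).astype("float32") - 0.5) * 0.1 for _ in range(4)]
out = multi_head_attention(x, x, x, n_head, d_key, *w)  # Self-Attention 即 q = k = v
print(out.shape)  # (2, 5, 64)
```

上述示意未包含每个 sub-layer 的后处理:实际模型中还会按残差连接与 Layer Normalization 的方式进行处理(对应 `model.py` 中 `PrePostProcessLayer` 的 `"da"`/`"n"` 处理命令)。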
Decoder 具有和 Encoder 类似的结构,只是相比于组成 Encoder 的 layer ,在组成 Decoder 的 layer 中还多了一个 Multi-Head Attention 的 sub-layer 来实现对 Encoder 输出的 Attention,这个 Encoder-Decoder Attention 在其他 Seq2Seq 模型中也是存在的。 + +## FAQ + +**Q:** 预测结果中样本数少于输入的样本数是什么原因 +**A:** 若样本中最大长度超过 `transformer.yaml` 中 `max_length` 的默认设置,请注意运行时增大 `--max_length` 的设置,否则超长样本将被过滤。 + +**Q:** 预测时最大长度超过了训练时的最大长度怎么办 +**A:** 由于训练时 `max_length` 的设置决定了保存模型 position encoding 的大小,若预测时长度超过 `max_length`,请调大该值,会重新生成更大的 position encoding 表。 + + +## 参考文献 +1. Vaswani A, Shazeer N, Parmar N, et al. [Attention is all you need](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)[C]//Advances in Neural Information Processing Systems. 2017: 6000-6010. +2. Devlin J, Chang M W, Lee K, et al. [Bert: Pre-training of deep bidirectional transformers for language understanding](https://arxiv.org/abs/1810.04805)[J]. arXiv preprint arXiv:1810.04805, 2018. +3. He K, Zhang X, Ren S, et al. [Deep residual learning for image recognition](http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf)[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778. +4. Ba J L, Kiros J R, Hinton G E. [Layer normalization](https://arxiv.org/pdf/1607.06450.pdf)[J]. arXiv preprint arXiv:1607.06450, 2016. +5. Sennrich R, Haddow B, Birch A. [Neural machine translation of rare words with subword units](https://arxiv.org/pdf/1508.07909)[J]. arXiv preprint arXiv:1508.07909, 2015. + + +## 作者 +- [guochengCS](https://github.com/guoshengCS) + +## 如何贡献代码 + +如果你可以修复某个issue或者增加一个新功能,欢迎给我们提交PR。如果对应的PR被接受了,我们将根据贡献的质量和难度进行打分(0-5分,越高越好)。如果你累计获得了10分,可以联系我们获得面试机会或者为你写推荐信。 diff --git a/dygraph/transformer/data_util.py b/dygraph/transformer/data_util.py deleted file mode 100644 index 4e20c8b1cee723a3dd4ef5d037bec4b8c30da30c..0000000000000000000000000000000000000000 --- a/dygraph/transformer/data_util.py +++ /dev/null @@ -1,75 +0,0 @@ -# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from __future__ import print_function - -import numpy as np -from paddle.fluid.dygraph import to_variable - - -def pad_batch_data(insts, - pad_idx, - n_head, - is_target=False, - is_label=False, - return_attn_bias=True, - return_max_len=True, - return_num_token=False): - """ - Pad the instances to the max sequence length in batch, and generate the - corresponding position data and attention bias. - """ - return_list = [] - max_len = max(len(inst) for inst in insts) - # Any token included in dict can be used to pad, since the paddings' loss - # will be masked out by weights and make no effect on parameter gradients. - inst_data = np.array( - [inst + [pad_idx] * (max_len - len(inst)) for inst in insts]) - return_list += [inst_data.astype("int64").reshape([-1, 1])] - if is_label: # label weight - inst_weight = np.array([[1.] * len(inst) + [0.] 
* (max_len - len(inst)) - for inst in insts]) - return_list += [inst_weight.astype("float32").reshape([-1, 1])] - else: # position data - inst_pos = np.array([ - list(range(0, len(inst))) + [0] * (max_len - len(inst)) - for inst in insts - ]) - return_list += [inst_pos.astype("int64").reshape([-1, 1])] - if return_attn_bias: - if is_target: - # This is used to avoid attention on paddings and subsequent - # words. - slf_attn_bias_data = np.ones((inst_data.shape[0], max_len, max_len)) - slf_attn_bias_data = np.triu(slf_attn_bias_data, - 1).reshape([-1, 1, max_len, max_len]) - slf_attn_bias_data = np.tile(slf_attn_bias_data, - [1, n_head, 1, 1]) * [-1e9] - else: - # This is used to avoid attention on paddings. - slf_attn_bias_data = np.array([[0] * len(inst) + [-1e9] * - (max_len - len(inst)) - for inst in insts]) - slf_attn_bias_data = np.tile( - slf_attn_bias_data.reshape([-1, 1, 1, max_len]), - [1, n_head, max_len, 1]) - return_list += [slf_attn_bias_data.astype("float32")] - if return_max_len: - return_list += [max_len] - if return_num_token: - num_token = 0 - for inst in insts: - num_token += len(inst) - return_list += [num_token] - return return_list if len(return_list) > 1 else return_list[0] diff --git a/dygraph/transformer/images/multi_head_attention.png b/dygraph/transformer/images/multi_head_attention.png new file mode 100644 index 0000000000000000000000000000000000000000..427fb6b32aaeb7013066a167aab4fb97c024c2d6 Binary files /dev/null and b/dygraph/transformer/images/multi_head_attention.png differ diff --git a/dygraph/transformer/images/transformer_network.png b/dygraph/transformer/images/transformer_network.png new file mode 100644 index 0000000000000000000000000000000000000000..34be0e5c7e2b08f858683d86353db5e81049c7ca Binary files /dev/null and b/dygraph/transformer/images/transformer_network.png differ diff --git a/dygraph/transformer/model.py b/dygraph/transformer/model.py index 6cedad5a21c03fa4680f77226e1e5fde61e38870..326f96cc747d2c8074e778808a1f1de1ea81976c 100644 --- a/dygraph/transformer/model.py +++ b/dygraph/transformer/model.py @@ -18,7 +18,8 @@ import numpy as np import paddle.fluid as fluid import paddle.fluid.layers as layers -from paddle.fluid.dygraph import Embedding, LayerNorm, FC, to_variable, Layer, guard +from paddle.fluid.layers.utils import map_structure +from paddle.fluid.dygraph import Embedding, LayerNorm, Linear, Layer, to_variable from paddle.fluid.dygraph.learning_rate_scheduler import LearningRateDecay from config import word_emb_param_names, pos_enc_param_names @@ -71,186 +72,134 @@ class PrePostProcessLayer(Layer): """ PrePostProcessLayer """ - def __init__(self, name_scope, process_cmd, shape_len=None): - super(PrePostProcessLayer, self).__init__(name_scope) - for cmd in process_cmd: - if cmd == "n": - self._layer_norm = LayerNorm( - name_scope=self.full_name(), - begin_norm_axis=shape_len - 1, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.Constant(1.)), - bias_attr=fluid.ParamAttr( - initializer=fluid.initializer.Constant(0.))) - - def forward(self, prev_out, out, process_cmd, dropout_rate=0.): - """ - forward - :param prev_out: - :param out: - :param process_cmd: - :param dropout_rate: - :return: - """ - for cmd in process_cmd: + def __init__(self, process_cmd, d_model, dropout_rate): + super(PrePostProcessLayer, self).__init__() + self.process_cmd = process_cmd + self.functors = [] + for cmd in self.process_cmd: if cmd == "a": # add residual connection - out = out + prev_out if prev_out else out + self.functors.append(lambda x, 
y: x + y if y else x) elif cmd == "n": # add layer normalization - out = self._layer_norm(out) + self.functors.append( + self.add_sublayer( + "layer_norm_%d" % + len(self.sublayers(include_sublayers=False)), + LayerNorm( + normalized_shape=d_model, + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr( + initializer=fluid.initializer.Constant(0.))))) elif cmd == "d": # add dropout if dropout_rate: - out = layers.dropout(out, - dropout_prob=dropout_rate, - is_test=False) - return out - - -class PositionwiseFeedForwardLayer(Layer): - """ - PositionwiseFeedForwardLayer - """ - def __init__(self, name_scope, d_inner_hid, d_hid, dropout_rate): - super(PositionwiseFeedForwardLayer, self).__init__(name_scope) - self._i2h = FC(name_scope=self.full_name(), - size=d_inner_hid, - num_flatten_dims=2, - act="relu") - self._h2o = FC(name_scope=self.full_name(), - size=d_hid, - num_flatten_dims=2) - self._dropout_rate = dropout_rate + self.functors.append(lambda x: layers.dropout( + x, dropout_prob=dropout_rate, is_test=False)) - def forward(self, x): - """ - forward - :param x: - :return: - """ - hidden = self._i2h(x) - if self._dropout_rate: - hidden = layers.dropout(hidden, - dropout_prob=self._dropout_rate, - is_test=False) - out = self._h2o(hidden) - return out + def forward(self, x, residual=None): + for i, cmd in enumerate(self.process_cmd): + if cmd == "a": + x = self.functors[i](x, residual) + else: + x = self.functors[i](x) + return x -class MultiHeadAttentionLayer(Layer): +class MultiHeadAttention(Layer): """ - MultiHeadAttentionLayer + Multi-Head Attention """ - def __init__(self, - name_scope, - d_key, - d_value, - d_model, - n_head=1, - dropout_rate=0., - cache=None, - gather_idx=None, - static_kv=False): - super(MultiHeadAttentionLayer, self).__init__(name_scope) - self._n_head = n_head - self._d_key = d_key - self._d_value = d_value - self._d_model = d_model - self._dropout_rate = dropout_rate - self._q_fc = FC(name_scope=self.full_name(), - size=d_key * n_head, - bias_attr=False, - num_flatten_dims=2) - self._k_fc = FC(name_scope=self.full_name(), - size=d_key * n_head, - bias_attr=False, - num_flatten_dims=2) - self._v_fc = FC(name_scope=self.full_name(), - size=d_value * n_head, - bias_attr=False, - num_flatten_dims=2) - self._proj_fc = FC(name_scope=self.full_name(), - size=self._d_model, - bias_attr=False, - num_flatten_dims=2) - - def forward(self, - queries, - keys, - values, - attn_bias, - cache=None, - gather_idx=None): - """ - forward - :param queries: - :param keys: - :param values: - :param attn_bias: - :return: - """ + def __init__(self, d_key, d_value, d_model, n_head=1, dropout_rate=0.): + super(MultiHeadAttention, self).__init__() + self.n_head = n_head + self.d_key = d_key + self.d_value = d_value + self.d_model = d_model + self.dropout_rate = dropout_rate + self.q_fc = Linear(input_dim=d_model, + output_dim=d_key * n_head, + bias_attr=False) + self.k_fc = Linear(input_dim=d_model, + output_dim=d_key * n_head, + bias_attr=False) + self.v_fc = Linear(input_dim=d_model, + output_dim=d_value * n_head, + bias_attr=False) + self.proj_fc = Linear(input_dim=d_value * n_head, + output_dim=d_model, + bias_attr=False) + + def forward(self, queries, keys, values, attn_bias, cache=None): # compute q ,k ,v keys = queries if keys is None else keys values = keys if values is None else values - q = self._q_fc(queries) - k = self._k_fc(keys) - v = self._v_fc(values) + q = self.q_fc(queries) + k = self.k_fc(keys) + v = self.v_fc(values) # 
split head - reshaped_q = layers.reshape(x=q, - shape=[0, 0, self._n_head, self._d_key], - inplace=False) - transpose_q = layers.transpose(x=reshaped_q, perm=[0, 2, 1, 3]) - reshaped_k = layers.reshape(x=k, - shape=[0, 0, self._n_head, self._d_key], - inplace=False) - transpose_k = layers.transpose(x=reshaped_k, perm=[0, 2, 1, 3]) - reshaped_v = layers.reshape(x=v, - shape=[0, 0, self._n_head, self._d_value], - inplace=False) - transpose_v = layers.transpose(x=reshaped_v, perm=[0, 2, 1, 3]) + q = layers.reshape(x=q, shape=[0, 0, self.n_head, self.d_key]) + q = layers.transpose(x=q, perm=[0, 2, 1, 3]) + k = layers.reshape(x=k, shape=[0, 0, self.n_head, self.d_key]) + k = layers.transpose(x=k, perm=[0, 2, 1, 3]) + v = layers.reshape(x=v, shape=[0, 0, self.n_head, self.d_value]) + v = layers.transpose(x=v, perm=[0, 2, 1, 3]) if cache is not None: cache_k, cache_v = cache["k"], cache["v"] - transpose_k = layers.concat([cache_k, transpose_k], axis=2) - transpose_v = layers.concat([cache_v, transpose_v], axis=2) - cache["k"], cache["v"] = transpose_k, transpose_v + k = layers.concat([cache_k, k], axis=2) + v = layers.concat([cache_v, v], axis=2) + cache["k"], cache["v"] = k, v # scale dot product attention - product = layers.matmul(x=transpose_q, - y=transpose_k, + product = layers.matmul(x=q, + y=k, transpose_y=True, - alpha=self._d_model**-0.5) + alpha=self.d_model**-0.5) if attn_bias: product += attn_bias weights = layers.softmax(product) - if self._dropout_rate: - weights_droped = layers.dropout(weights, - dropout_prob=self._dropout_rate, - is_test=False) - out = layers.matmul(weights_droped, transpose_v) - else: - out = layers.matmul(weights, transpose_v) + if self.dropout_rate: + weights = layers.dropout(weights, + dropout_prob=self.dropout_rate, + is_test=False) + + out = layers.matmul(weights, v) # combine heads - if len(out.shape) != 4: - raise ValueError("Input(x) should be a 4-D Tensor.") - trans_x = layers.transpose(out, perm=[0, 2, 1, 3]) - final_out = layers.reshape( - x=trans_x, - shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], - inplace=False) + out = layers.transpose(out, perm=[0, 2, 1, 3]) + out = layers.reshape(x=out, shape=[0, 0, out.shape[2] * out.shape[3]]) - # fc to output - proj_out = self._proj_fc(final_out) - return proj_out + # project to output + out = self.proj_fc(out) + return out -class EncoderSubLayer(Layer): +class FFN(Layer): + """ + Feed-Forward Network + """ + def __init__(self, d_inner_hid, d_model, dropout_rate): + super(FFN, self).__init__() + self.dropout_rate = dropout_rate + self.fc1 = Linear(input_dim=d_model, output_dim=d_inner_hid, act="relu") + self.fc2 = Linear(input_dim=d_inner_hid, output_dim=d_model) + + def forward(self, x): + hidden = self.fc1(x) + if self.dropout_rate: + hidden = layers.dropout(hidden, + dropout_prob=self.dropout_rate, + is_test=False) + out = self.fc2(hidden) + return out + + +class EncoderLayer(Layer): """ - EncoderSubLayer + EncoderLayer """ def __init__(self, - name_scope, n_head, d_key, d_value, @@ -262,56 +211,36 @@ class EncoderSubLayer(Layer): preprocess_cmd="n", postprocess_cmd="da"): - super(EncoderSubLayer, self).__init__(name_scope) - self._preprocess_cmd = preprocess_cmd - self._postprocess_cmd = postprocess_cmd - self._prepostprocess_dropout = prepostprocess_dropout - - self._preprocess_layer = PrePostProcessLayer(self.full_name(), - self._preprocess_cmd, 3) - self._multihead_attention_layer = MultiHeadAttentionLayer( - self.full_name(), d_key, d_value, d_model, n_head, - attention_dropout) - 
self._postprocess_layer = PrePostProcessLayer(self.full_name(), - self._postprocess_cmd, - None) - self._preprocess_layer2 = PrePostProcessLayer(self.full_name(), - self._preprocess_cmd, 3) - self._positionwise_feed_forward = PositionwiseFeedForwardLayer( - self.full_name(), d_inner_hid, d_model, relu_dropout) - self._postprocess_layer2 = PrePostProcessLayer(self.full_name(), - self._postprocess_cmd, - None) + super(EncoderLayer, self).__init__() + + self.preprocesser1 = PrePostProcessLayer(preprocess_cmd, d_model, + prepostprocess_dropout) + self.self_attn = MultiHeadAttention(d_key, d_value, d_model, n_head, + attention_dropout) + self.postprocesser1 = PrePostProcessLayer(postprocess_cmd, d_model, + prepostprocess_dropout) + + self.preprocesser2 = PrePostProcessLayer(preprocess_cmd, d_model, + prepostprocess_dropout) + self.ffn = FFN(d_inner_hid, d_model, relu_dropout) + self.postprocesser2 = PrePostProcessLayer(postprocess_cmd, d_model, + prepostprocess_dropout) def forward(self, enc_input, attn_bias): - """ - forward - :param enc_input: - :param attn_bias: - :return: - """ - pre_process_multihead = self._preprocess_layer( - None, enc_input, self._preprocess_cmd, self._prepostprocess_dropout) - attn_output = self._multihead_attention_layer(pre_process_multihead, - None, None, attn_bias) - attn_output = self._postprocess_layer(enc_input, attn_output, - self._postprocess_cmd, - self._prepostprocess_dropout) - pre_process2_output = self._preprocess_layer2( - None, attn_output, self._preprocess_cmd, - self._prepostprocess_dropout) - ffd_output = self._positionwise_feed_forward(pre_process2_output) - return self._postprocess_layer2(attn_output, ffd_output, - self._postprocess_cmd, - self._prepostprocess_dropout) + attn_output = self.self_attn(self.preprocesser1(enc_input), None, None, + attn_bias) + attn_output = self.postprocesser1(attn_output, enc_input) + ffn_output = self.ffn(self.preprocesser2(attn_output)) + ffn_output = self.postprocesser2(ffn_output, attn_output) + return ffn_output -class EncoderLayer(Layer): + +class Encoder(Layer): """ encoder """ def __init__(self, - name_scope, n_layer, n_head, d_key, @@ -324,340 +253,267 @@ class EncoderLayer(Layer): preprocess_cmd="n", postprocess_cmd="da"): - super(EncoderLayer, self).__init__(name_scope) - self._preprocess_cmd = preprocess_cmd - self._encoder_sublayers = list() - self._prepostprocess_dropout = prepostprocess_dropout - self._n_layer = n_layer - self._preprocess_layer = PrePostProcessLayer(self.full_name(), - self._preprocess_cmd, 3) + super(Encoder, self).__init__() + + self.encoder_layers = list() for i in range(n_layer): - self._encoder_sublayers.append( + self.encoder_layers.append( self.add_sublayer( - 'esl_%d' % i, - EncoderSubLayer(self.full_name(), n_head, d_key, d_value, - d_model, d_inner_hid, - prepostprocess_dropout, attention_dropout, - relu_dropout, preprocess_cmd, - postprocess_cmd))) + "layer_%d" % i, + EncoderLayer(n_head, d_key, d_value, d_model, d_inner_hid, + prepostprocess_dropout, attention_dropout, + relu_dropout, preprocess_cmd, + postprocess_cmd))) + self.processer = PrePostProcessLayer(preprocess_cmd, d_model, + prepostprocess_dropout) def forward(self, enc_input, attn_bias): - """ - forward - :param enc_input: - :param attn_bias: - :return: - """ - for i in range(self._n_layer): - enc_output = self._encoder_sublayers[i](enc_input, attn_bias) + for encoder_layer in self.encoder_layers: + enc_output = encoder_layer(enc_input, attn_bias) enc_input = enc_output - return self._preprocess_layer(None, 
enc_output, self._preprocess_cmd, - self._prepostprocess_dropout) + return self.processer(enc_output) -class PrepareEncoderDecoderLayer(Layer): +class Embedder(Layer): """ - PrepareEncoderDecoderLayer + Word Embedding + Position Encoding """ - def __init__(self, - name_scope, - src_vocab_size, - src_emb_dim, - src_max_len, - dropout_rate, - word_emb_param_name=None, - pos_enc_param_name=None): - super(PrepareEncoderDecoderLayer, self).__init__(name_scope) - self._src_max_len = src_max_len - self._src_emb_dim = src_emb_dim - self._src_vocab_size = src_vocab_size - self._dropout_rate = dropout_rate - self._input_emb = Embedding(name_scope=self.full_name(), - size=[src_vocab_size, src_emb_dim], - padding_idx=0, - param_attr=fluid.ParamAttr( - name=word_emb_param_name, - initializer=fluid.initializer.Normal( - 0., src_emb_dim**-0.5))) - - pos_inp = position_encoding_init(src_max_len, src_emb_dim) - self._pos_emb = Embedding( - name_scope=self.full_name(), - size=[self._src_max_len, src_emb_dim], + def __init__(self, vocab_size, emb_dim, bos_idx=0): + super(Embedder, self).__init__() + + self.word_embedder = Embedding( + size=[vocab_size, emb_dim], + padding_idx=bos_idx, param_attr=fluid.ParamAttr( - name=pos_enc_param_name, - initializer=fluid.initializer.NumpyArrayInitializer(pos_inp), - trainable=False)) + initializer=fluid.initializer.Normal(0., emb_dim**-0.5))) - # use in dygraph_mode to fit different length batch - # self._pos_emb._w = to_variable( - # position_encoding_init(self._src_max_len, self._src_emb_dim)) + def forward(self, word): + word_emb = self.word_embedder(word) + return word_emb - def forward(self, src_word, src_pos): - """ - forward - :param src_word: - :param src_pos: - :return: - """ - # print("here") - # print(self._input_emb._w._numpy().shape) - src_word_emb = self._input_emb(src_word) - - src_word_emb = layers.scale(x=src_word_emb, - scale=self._src_emb_dim**0.5) - # # TODO change this to fit dynamic length input - src_pos_emb = self._pos_emb(src_pos) - src_pos_emb.stop_gradient = True - enc_input = src_word_emb + src_pos_emb - return layers.dropout( - enc_input, dropout_prob=self._dropout_rate, - is_test=False) if self._dropout_rate else enc_input - - -class WrapEncoderLayer(Layer): + +class WrapEncoder(Layer): """ - encoderlayer + embedder + encoder """ - def __init__(self, name_cope, src_vocab_size, max_length, n_layer, n_head, - d_key, d_value, d_model, d_inner_hid, prepostprocess_dropout, + def __init__(self, src_vocab_size, max_length, n_layer, n_head, d_key, + d_value, d_model, d_inner_hid, prepostprocess_dropout, attention_dropout, relu_dropout, preprocess_cmd, - postprocess_cmd, weight_sharing): - """ - The wrapper assembles together all needed layers for the encoder. 
- """ - super(WrapEncoderLayer, self).__init__(name_cope) - - self._prepare_encoder_layer = PrepareEncoderDecoderLayer( - self.full_name(), - src_vocab_size, - d_model, - max_length, - prepostprocess_dropout, - word_emb_param_name=word_emb_param_names[0], - pos_enc_param_name=pos_enc_param_names[0]) - self._encoder = EncoderLayer(self.full_name(), n_layer, n_head, d_key, - d_value, d_model, d_inner_hid, - prepostprocess_dropout, attention_dropout, - relu_dropout, preprocess_cmd, - postprocess_cmd) - - def forward(self, enc_inputs): - """forward""" - src_word, src_pos, src_slf_attn_bias = enc_inputs - enc_input = self._prepare_encoder_layer(src_word, src_pos) - enc_output = self._encoder(enc_input, src_slf_attn_bias) + postprocess_cmd, word_embedder): + super(WrapEncoder, self).__init__() + + self.emb_dropout = prepostprocess_dropout + self.emb_dim = d_model + self.word_embedder = word_embedder + self.pos_encoder = Embedding( + size=[max_length, self.emb_dim], + param_attr=fluid.ParamAttr( + initializer=fluid.initializer.NumpyArrayInitializer( + position_encoding_init(max_length, self.emb_dim)), + trainable=False)) + + self.encoder = Encoder(n_layer, n_head, d_key, d_value, d_model, + d_inner_hid, prepostprocess_dropout, + attention_dropout, relu_dropout, preprocess_cmd, + postprocess_cmd) + + def forward(self, src_word, src_pos, src_slf_attn_bias): + word_emb = self.word_embedder(src_word) + word_emb = layers.scale(x=word_emb, scale=self.emb_dim**0.5) + pos_enc = self.pos_encoder(src_pos) + pos_enc.stop_gradient = True + emb = word_emb + pos_enc + enc_input = layers.dropout(emb, + dropout_prob=self.emb_dropout, + is_test=False) if self.emb_dropout else emb + + enc_output = self.encoder(enc_input, src_slf_attn_bias) return enc_output -class DecoderSubLayer(Layer): +class DecoderLayer(Layer): """ decoder """ - def __init__(self, name_scope, n_head, d_key, d_value, d_model, d_inner_hid, - prepostprocess_dropout, attention_dropout, relu_dropout, - preprocess_cmd, postprocess_cmd): - super(DecoderSubLayer, self).__init__(name_scope) - self._postprocess_cmd = postprocess_cmd - self._preprocess_cmd = preprocess_cmd - self._prepostprcess_dropout = prepostprocess_dropout - self._pre_process_layer = PrePostProcessLayer(self.full_name(), - preprocess_cmd, 3) - self._multihead_attention_layer = MultiHeadAttentionLayer( - self.full_name(), d_key, d_value, d_model, n_head, - attention_dropout) - self._post_process_layer = PrePostProcessLayer(self.full_name(), - postprocess_cmd, None) - self._pre_process_layer2 = PrePostProcessLayer(self.full_name(), - preprocess_cmd, 3) - self._multihead_attention_layer2 = MultiHeadAttentionLayer( - self.full_name(), d_key, d_value, d_model, n_head, - attention_dropout) - self._post_process_layer2 = PrePostProcessLayer(self.full_name(), - postprocess_cmd, None) - self._pre_process_layer3 = PrePostProcessLayer(self.full_name(), - preprocess_cmd, 3) - self._positionwise_feed_forward_layer = PositionwiseFeedForwardLayer( - self.full_name(), d_inner_hid, d_model, relu_dropout) - self._post_process_layer3 = PrePostProcessLayer(self.full_name(), - postprocess_cmd, None) + def __init__(self, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + preprocess_cmd="n", + postprocess_cmd="da"): + super(DecoderLayer, self).__init__() + + self.preprocesser1 = PrePostProcessLayer(preprocess_cmd, d_model, + prepostprocess_dropout) + self.self_attn = MultiHeadAttention(d_key, d_value, d_model, n_head, + attention_dropout) 
+ self.postprocesser1 = PrePostProcessLayer(postprocess_cmd, d_model, + prepostprocess_dropout) + + self.preprocesser2 = PrePostProcessLayer(preprocess_cmd, d_model, + prepostprocess_dropout) + self.cross_attn = MultiHeadAttention(d_key, d_value, d_model, n_head, + attention_dropout) + self.postprocesser2 = PrePostProcessLayer(postprocess_cmd, d_model, + prepostprocess_dropout) + + self.preprocesser3 = PrePostProcessLayer(preprocess_cmd, d_model, + prepostprocess_dropout) + self.ffn = FFN(d_inner_hid, d_model, relu_dropout) + self.postprocesser3 = PrePostProcessLayer(postprocess_cmd, d_model, + prepostprocess_dropout) def forward(self, dec_input, enc_output, - slf_attn_bias, - dec_enc_attn_bias, - cache=None, - gather_idx=None): - """ - forward - :param dec_input: - :param enc_output: - :param slf_attn_bias: - :param dec_enc_attn_bias: - :return: - """ - pre_process_rlt = self._pre_process_layer(None, dec_input, - self._preprocess_cmd, - self._prepostprcess_dropout) - slf_attn_output = self._multihead_attention_layer( - pre_process_rlt, None, None, slf_attn_bias, cache, gather_idx) - slf_attn_output_pp = self._post_process_layer( - dec_input, slf_attn_output, self._postprocess_cmd, - self._prepostprcess_dropout) - pre_process_rlt2 = self._pre_process_layer2(None, slf_attn_output_pp, - self._preprocess_cmd, - self._prepostprcess_dropout) - enc_attn_output_pp = self._multihead_attention_layer2( - pre_process_rlt2, enc_output, enc_output, dec_enc_attn_bias) - enc_attn_output = self._post_process_layer2(slf_attn_output_pp, - enc_attn_output_pp, - self._postprocess_cmd, - self._prepostprcess_dropout) - pre_process_rlt3 = self._pre_process_layer3(None, enc_attn_output, - self._preprocess_cmd, - self._prepostprcess_dropout) - ffd_output = self._positionwise_feed_forward_layer(pre_process_rlt3) - dec_output = self._post_process_layer3(enc_attn_output, ffd_output, - self._postprocess_cmd, - self._prepostprcess_dropout) - return dec_output + self_attn_bias, + cross_attn_bias, + cache=None): + self_attn_output = self.self_attn(self.preprocesser1(dec_input), None, + None, self_attn_bias, cache) + self_attn_output = self.postprocesser1(self_attn_output, dec_input) + cross_attn_output = self.cross_attn( + self.preprocesser2(self_attn_output), enc_output, enc_output, + cross_attn_bias) + cross_attn_output = self.postprocesser2(cross_attn_output, + self_attn_output) -class DecoderLayer(Layer): + ffn_output = self.ffn(self.preprocesser3(cross_attn_output)) + ffn_output = self.postprocesser3(ffn_output, cross_attn_output) + + return ffn_output + + +class Decoder(Layer): """ decoder """ - def __init__(self, name_scope, n_layer, n_head, d_key, d_value, d_model, - d_inner_hid, prepostprocess_dropout, attention_dropout, - relu_dropout, preprocess_cmd, postprocess_cmd): - super(DecoderLayer, self).__init__(name_scope) - self._pre_process_layer = PrePostProcessLayer(self.full_name(), - preprocess_cmd, 3) - self._decoder_sub_layers = list() - self._n_layer = n_layer - self._preprocess_cmd = preprocess_cmd - self._prepostprocess_dropout = prepostprocess_dropout + def __init__(self, n_layer, n_head, d_key, d_value, d_model, d_inner_hid, + prepostprocess_dropout, attention_dropout, relu_dropout, + preprocess_cmd, postprocess_cmd): + super(Decoder, self).__init__() + + self.decoder_layers = list() for i in range(n_layer): - self._decoder_sub_layers.append( + self.decoder_layers.append( self.add_sublayer( - 'dsl_%d' % i, - DecoderSubLayer(self.full_name(), n_head, d_key, d_value, - d_model, d_inner_hid, - 
prepostprocess_dropout, attention_dropout, - relu_dropout, preprocess_cmd, - postprocess_cmd))) + "layer_%d" % i, + DecoderLayer(n_head, d_key, d_value, d_model, d_inner_hid, + prepostprocess_dropout, attention_dropout, + relu_dropout, preprocess_cmd, + postprocess_cmd))) + self.processer = PrePostProcessLayer(preprocess_cmd, d_model, + prepostprocess_dropout) def forward(self, dec_input, enc_output, - dec_slf_attn_bias, - dec_enc_attn_bias, - caches=None, - gather_idx=None): - """ - forward - :param dec_input: - :param enc_output: - :param dec_slf_attn_bias: - :param dec_enc_attn_bias: - :return: - """ - for i in range(self._n_layer): - tmp_dec_output = self._decoder_sub_layers[i]( - dec_input, enc_output, dec_slf_attn_bias, dec_enc_attn_bias, - None if caches is None else caches[i], gather_idx) - dec_input = tmp_dec_output + self_attn_bias, + cross_attn_bias, + caches=None): + for i, decoder_layer in enumerate(self.decoder_layers): + dec_output = decoder_layer(dec_input, enc_output, self_attn_bias, + cross_attn_bias, + None if caches is None else caches[i]) + dec_input = dec_output - dec_output = self._pre_process_layer(None, tmp_dec_output, - self._preprocess_cmd, - self._prepostprocess_dropout) - return dec_output + return self.processer(dec_output) -class WrapDecoderLayer(Layer): +class WrapDecoder(Layer): """ - decoder + embedder + decoder """ - def __init__(self, - name_scope, - trg_vocab_size, - max_length, - n_layer, - n_head, - d_key, - d_value, - d_model, - d_inner_hid, - prepostprocess_dropout, - attention_dropout, - relu_dropout, - preprocess_cmd, - postprocess_cmd, - weight_sharing, - gather_idx=None): - """ - The wrapper assembles together all needed layers for the encoder. - """ - super(WrapDecoderLayer, self).__init__(name_scope) - - self._prepare_decoder_layer = PrepareEncoderDecoderLayer( - self.full_name(), - trg_vocab_size, - d_model, - max_length, - prepostprocess_dropout, - word_emb_param_name=word_emb_param_names[1], - pos_enc_param_name=pos_enc_param_names[1]) - self._decoder_layer = DecoderLayer(self.full_name(), n_layer, n_head, - d_key, d_value, d_model, d_inner_hid, - prepostprocess_dropout, - attention_dropout, relu_dropout, - preprocess_cmd, postprocess_cmd) - self._weight_sharing = weight_sharing - if not weight_sharing: - self._fc = FC(self.full_name(), - size=trg_vocab_size, - bias_attr=False) - - def forward(self, dec_inputs, enc_output, caches=None, gather_idx=None): - """ - forward - :param dec_inputs: - :param enc_output: - :return: - """ - trg_word, trg_pos, trg_slf_attn_bias, trg_src_attn_bias = dec_inputs - dec_input = self._prepare_decoder_layer(trg_word, trg_pos) - dec_output = self._decoder_layer(dec_input, enc_output, - trg_slf_attn_bias, trg_src_attn_bias, - caches, gather_idx) - - dec_output_reshape = layers.reshape(dec_output, - shape=[-1, dec_output.shape[-1]], - inplace=False) - - if self._weight_sharing: - predict = layers.matmul(x=dec_output_reshape, - y=self._prepare_decoder_layer._input_emb._w, - transpose_y=True) + def __init__(self, trg_vocab_size, max_length, n_layer, n_head, d_key, + d_value, d_model, d_inner_hid, prepostprocess_dropout, + attention_dropout, relu_dropout, preprocess_cmd, + postprocess_cmd, share_input_output_embed, word_embedder): + super(WrapDecoder, self).__init__() + + self.emb_dropout = prepostprocess_dropout + self.emb_dim = d_model + self.word_embedder = word_embedder + self.pos_encoder = Embedding( + size=[max_length, self.emb_dim], + param_attr=fluid.ParamAttr( + 
initializer=fluid.initializer.NumpyArrayInitializer( + position_encoding_init(max_length, self.emb_dim)), + trainable=False)) + + self.decoder = Decoder(n_layer, n_head, d_key, d_value, d_model, + d_inner_hid, prepostprocess_dropout, + attention_dropout, relu_dropout, preprocess_cmd, + postprocess_cmd) + + if share_input_output_embed: + self.linear = lambda x: layers.matmul(x=x, + y=self.word_embedder. + word_embedder.weight, + transpose_y=True) else: - predict = self._fc(dec_output_reshape) + self.linear = Linear(input_dim=d_model, + output_dim=trg_vocab_size, + bias_attr=False) - if dec_inputs is None: - # Return probs for independent decoder program. - predict_out = layers.softmax(predict) - return predict_out - return predict + def forward(self, + trg_word, + trg_pos, + trg_slf_attn_bias, + trg_src_attn_bias, + enc_output, + caches=None): + word_emb = self.word_embedder(trg_word) + word_emb = layers.scale(x=word_emb, scale=self.emb_dim**0.5) + pos_enc = self.pos_encoder(trg_pos) + pos_enc.stop_gradient = True + emb = word_emb + pos_enc + dec_input = layers.dropout(emb, + dropout_prob=self.emb_dropout, + is_test=False) if self.emb_dropout else emb + dec_output = self.decoder(dec_input, enc_output, trg_slf_attn_bias, + trg_src_attn_bias, caches) + dec_output = layers.reshape( + dec_output, + shape=[-1, dec_output.shape[-1]], + ) + logits = self.linear(dec_output) + return logits + + +class CrossEntropyCriterion(object): + def __init__(self, label_smooth_eps): + self.label_smooth_eps = label_smooth_eps + + def __call__(self, predict, label, weights): + if self.label_smooth_eps: + label_out = layers.label_smooth(label=layers.one_hot( + input=label, depth=predict.shape[-1]), + epsilon=self.label_smooth_eps) + + cost = layers.softmax_with_cross_entropy( + logits=predict, + label=label_out, + soft_label=True if self.label_smooth_eps else False) + weighted_cost = cost * weights + sum_cost = layers.reduce_sum(weighted_cost) + token_num = layers.reduce_sum(weights) + token_num.stop_gradient = True + avg_cost = sum_cost / token_num + return sum_cost, avg_cost, token_num -class TransFormer(Layer): +class Transformer(Layer): """ model """ def __init__(self, - name_scope, src_vocab_size, trg_vocab_size, max_length, @@ -673,73 +529,62 @@ class TransFormer(Layer): preprocess_cmd, postprocess_cmd, weight_sharing, - label_smooth_eps=0.0): - super(TransFormer, self).__init__(name_scope) - self._label_smooth_eps = label_smooth_eps - self._trg_vocab_size = trg_vocab_size + bos_id=0, + eos_id=1): + super(Transformer, self).__init__() + src_word_embedder = Embedder(vocab_size=src_vocab_size, + emb_dim=d_model, + bos_idx=bos_id) + self.encoder = WrapEncoder(src_vocab_size, max_length, n_layer, n_head, + d_key, d_value, d_model, d_inner_hid, + prepostprocess_dropout, attention_dropout, + relu_dropout, preprocess_cmd, + postprocess_cmd, src_word_embedder) if weight_sharing: assert src_vocab_size == trg_vocab_size, ( "Vocabularies in source and target should be same for weight sharing." 
) - self._wrap_encoder_layer = WrapEncoderLayer( - self.full_name(), src_vocab_size, max_length, n_layer, n_head, - d_key, d_value, d_model, d_inner_hid, prepostprocess_dropout, - attention_dropout, relu_dropout, preprocess_cmd, postprocess_cmd, - weight_sharing) - self._wrap_decoder_layer = WrapDecoderLayer( - self.full_name(), trg_vocab_size, max_length, n_layer, n_head, - d_key, d_value, d_model, d_inner_hid, prepostprocess_dropout, - attention_dropout, relu_dropout, preprocess_cmd, postprocess_cmd, - weight_sharing) - - if weight_sharing: - self._wrap_decoder_layer._prepare_decoder_layer._input_emb._w = self._wrap_encoder_layer._prepare_encoder_layer._input_emb._w - + trg_word_embedder = src_word_embedder + else: + trg_word_embedder = Embedder(vocab_size=trg_vocab_size, + emb_dim=d_model, + bos_idx=bos_id) + self.decoder = WrapDecoder(trg_vocab_size, max_length, n_layer, n_head, + d_key, d_value, d_model, d_inner_hid, + prepostprocess_dropout, attention_dropout, + relu_dropout, preprocess_cmd, + postprocess_cmd, weight_sharing, + trg_word_embedder) + + self.trg_vocab_size = trg_vocab_size self.n_layer = n_layer self.n_head = n_head self.d_key = d_key self.d_value = d_value - def forward(self, enc_inputs, dec_inputs, label, weights): - """ - forward - :param enc_inputs: - :param dec_inputs: - :param label: - :param weights: - :return: - """ - enc_output = self._wrap_encoder_layer(enc_inputs) - predict = self._wrap_decoder_layer(dec_inputs, enc_output) - if self._label_smooth_eps: - label_out = layers.label_smooth(label=layers.one_hot( - input=label, depth=self._trg_vocab_size), - epsilon=self._label_smooth_eps) - - cost = layers.softmax_with_cross_entropy( - logits=predict, - label=label_out, - soft_label=True if self._label_smooth_eps else False) - weighted_cost = cost * weights - sum_cost = layers.reduce_sum(weighted_cost) - token_num = layers.reduce_sum(weights) - token_num.stop_gradient = True - avg_cost = sum_cost / token_num - return sum_cost, avg_cost, predict, token_num + def forward(self, src_word, src_pos, src_slf_attn_bias, trg_word, trg_pos, + trg_slf_attn_bias, trg_src_attn_bias): + enc_output = self.encoder(src_word, src_pos, src_slf_attn_bias) + predict = self.decoder(trg_word, trg_pos, trg_slf_attn_bias, + trg_src_attn_bias, enc_output) + return predict - def beam_search(self, - enc_inputs, - dec_inputs, - bos_id=0, - eos_id=1, - beam_size=4, - max_len=30, - alpha=0.6): + def beam_search_v2(self, + src_word, + src_pos, + src_slf_attn_bias, + trg_word, + trg_src_attn_bias, + bos_id=0, + eos_id=1, + beam_size=4, + max_len=None, + alpha=0.6): """ Beam search with the alive and finished two queues, both have a beam size capicity separately. It includes `grow_topk` `grow_alive` `grow_finish` as - steps. - + steps. + 1. `grow_topk` selects the top `2*beam_size` candidates to avoid all getting EOS. @@ -761,15 +606,15 @@ class TransFormer(Layer): return layers.reshape(tensor, [-1] + tensor.shape[2:]) # run encoder - enc_output = self._wrap_encoder_layer(enc_inputs) + enc_output = self.encoder(src_word, src_pos, src_slf_attn_bias) # constant number inf = float(1. * 1e7) batch_size = enc_output.shape[0] + max_len = (enc_output.shape[1] + 20) if max_len is None else max_len ### initialize states of beam search ### ## init for the alive ## - initial_ids, trg_src_attn_bias = dec_inputs # (batch_size, 1) initial_log_probs = to_variable( np.array([[0.] 
+ [-inf] * (beam_size - 1)], dtype="float32")) alive_log_probs = layers.expand(initial_log_probs, [batch_size, 1]) @@ -789,8 +634,7 @@ class TransFormer(Layer): ### initialize inputs and states of transformer decoder ### ## init inputs for decoder, shaped `[batch_size*beam_size, ...]` trg_word = layers.reshape(alive_seq[:, :, -1], - [batch_size * beam_size, 1, 1]) - trg_pos = layers.zeros_like(trg_word) + [batch_size * beam_size, 1]) trg_src_attn_bias = merge_beam_dim( expand_to_beam_size(trg_src_attn_bias, beam_size)) enc_output = merge_beam_dim(expand_to_beam_size(enc_output, beam_size)) @@ -872,8 +716,8 @@ class TransFormer(Layer): topk_log_probs = topk_scores * length_penalty - topk_beam_index = topk_ids // self._trg_vocab_size - topk_ids = topk_ids % self._trg_vocab_size + topk_beam_index = topk_ids // self.trg_vocab_size + topk_ids = topk_ids % self.trg_vocab_size # use gather as gather_nd, TODO: use gather_nd topk_seq = gather_2d_by_gather(alive_seq, topk_beam_index, @@ -935,9 +779,11 @@ class TransFormer(Layer): return finished_seq, finished_scores, finished_flags for i in range(max_len): - logits = self._wrap_decoder_layer( - (trg_word, trg_pos, None, trg_src_attn_bias), enc_output, - caches) + trg_pos = layers.fill_constant(shape=trg_word.shape, + dtype="int64", + value=i) + logits = self.decoder(trg_word, trg_pos, None, trg_src_attn_bias, + enc_output, caches) topk_seq, topk_log_probs, topk_scores, topk_finished, states = grow_topk( i, logits, alive_seq, alive_log_probs, caches) alive_seq, alive_log_probs, states = grow_alive( @@ -946,12 +792,148 @@ class TransFormer(Layer): finished_seq, finished_scores, finished_flags, topk_seq, topk_scores, topk_finished) trg_word = layers.reshape(alive_seq[:, :, -1], - [batch_size * beam_size, 1, 1]) - trg_pos = layers.fill_constant(shape=trg_word.shape, - dtype="int64", - value=i) + [batch_size * beam_size, 1]) + if early_finish(alive_log_probs, finished_scores, finished_flags).numpy(): break return finished_seq, finished_scores + + def beam_search(self, + src_word, + src_pos, + src_slf_attn_bias, + trg_word, + trg_src_attn_bias, + bos_id=0, + eos_id=1, + beam_size=4, + max_len=256): + def expand_to_beam_size(tensor, beam_size): + tensor = layers.reshape(tensor, + [tensor.shape[0], 1] + tensor.shape[1:]) + tile_dims = [1] * len(tensor.shape) + tile_dims[1] = beam_size + return layers.expand(tensor, tile_dims) + + def merge_batch_beams(tensor): + return layers.reshape(tensor, [tensor.shape[0] * tensor.shape[1]] + + tensor.shape[2:]) + + def split_batch_beams(tensor): + return fluid.layers.reshape(tensor, + shape=[-1, beam_size] + + list(tensor.shape[1:])) + + def mask_probs(probs, finished, noend_mask_tensor): + # TODO: use where_op + finished = layers.cast(finished, dtype=probs.dtype) + probs = layers.elementwise_mul( + layers.expand(layers.unsqueeze(finished, [2]), [1, 1, self.trg_vocab_size]), + noend_mask_tensor, axis=-1) - layers.elementwise_mul(probs, (finished - 1), axis=0) + return probs + + def gather(x, indices, batch_pos): + topk_coordinates = fluid.layers.stack([batch_pos, indices], axis=2) + return layers.gather_nd(x, topk_coordinates) + + # run encoder + enc_output = self.encoder(src_word, src_pos, src_slf_attn_bias) + + # constant number + inf = float(1. 
* 1e7) + batch_size = enc_output.shape[0] + max_len = (enc_output.shape[1] + 20) if max_len is None else max_len + vocab_size_tensor = layers.fill_constant(shape=[1], + dtype="int64", + value=self.trg_vocab_size) + end_token_tensor = to_variable( + np.full([batch_size, beam_size], eos_id, dtype="int64")) + noend_array = [-inf] * self.trg_vocab_size + noend_array[eos_id] = 0 + noend_mask_tensor = to_variable(np.array(noend_array,dtype="float32")) + batch_pos = layers.expand( + layers.unsqueeze( + to_variable(np.arange(0, batch_size, 1, dtype="int64")), [1]), + [1, beam_size]) + + predict_ids = [] + parent_ids = [] + ### initialize states of beam search ### + log_probs = to_variable( + np.array([[0.] + [-inf] * (beam_size - 1)] * batch_size, + dtype="float32")) + finished = to_variable(np.full([batch_size, beam_size], 0, + dtype="bool")) + ### initialize inputs and states of transformer decoder ### + ## init inputs for decoder, shaped `[batch_size*beam_size, ...]` + trg_word = layers.fill_constant(shape=[batch_size * beam_size, 1], + dtype="int64", + value=bos_id) + trg_pos = layers.zeros_like(trg_word) + trg_src_attn_bias = merge_batch_beams( + expand_to_beam_size(trg_src_attn_bias, beam_size)) + enc_output = merge_batch_beams(expand_to_beam_size(enc_output, beam_size)) + ## init states (caches) for transformer, need to be updated according to selected beam + caches = [{ + "k": + layers.fill_constant( + shape=[batch_size * beam_size, self.n_head, 0, self.d_key], + dtype=enc_output.dtype, + value=0), + "v": + layers.fill_constant( + shape=[batch_size * beam_size, self.n_head, 0, self.d_value], + dtype=enc_output.dtype, + value=0), + } for i in range(self.n_layer)] + + for i in range(max_len): + trg_pos = layers.fill_constant(shape=trg_word.shape, + dtype="int64", + value=i) + caches = map_structure( # can not be reshaped since the 0 size + lambda x: x if i == 0 else merge_batch_beams(x), caches) + logits = self.decoder(trg_word, trg_pos, None, trg_src_attn_bias, + enc_output, caches) + caches = map_structure(split_batch_beams, caches) + step_log_probs = split_batch_beams( + fluid.layers.log(fluid.layers.softmax(logits))) + step_log_probs = mask_probs(step_log_probs, finished, + noend_mask_tensor) + log_probs = layers.elementwise_add(x=step_log_probs, + y=log_probs, + axis=0) + log_probs = layers.reshape(log_probs, + [-1, beam_size * self.trg_vocab_size]) + scores = log_probs + topk_scores, topk_indices = fluid.layers.topk(input=scores, + k=beam_size) + beam_indices = fluid.layers.elementwise_floordiv( + topk_indices, vocab_size_tensor) + token_indices = fluid.layers.elementwise_mod( + topk_indices, vocab_size_tensor) + + # update states + caches = map_structure(lambda x: gather(x, beam_indices, batch_pos), + caches) + log_probs = gather(log_probs, topk_indices, batch_pos) + finished = gather(finished, beam_indices, batch_pos) + finished = layers.logical_or( + finished, layers.equal(token_indices, end_token_tensor)) + trg_word = layers.reshape(token_indices, [-1, 1]) + + predict_ids.append(token_indices) + parent_ids.append(beam_indices) + + if layers.reduce_all(finished).numpy(): + break + + predict_ids = layers.stack(predict_ids, axis=0) + parent_ids = layers.stack(parent_ids, axis=0) + finished_seq = layers.transpose( + layers.gather_tree(predict_ids, parent_ids), [1, 2, 0]) + finished_scores = topk_scores + + return finished_seq, finished_scores \ No newline at end of file diff --git a/dygraph/transformer/predict.py b/dygraph/transformer/predict.py index 
c4da56ee194b6ee9f900a357bee7fc4457a0462d..d33d4e5c909d9565dccbddaf3181e5a0b56c0d88 100644 --- a/dygraph/transformer/predict.py +++ b/dygraph/transformer/predict.py @@ -12,72 +12,22 @@ # See the License for the specific language governing permissions and # limitations under the License. -from __future__ import print_function -import argparse -import ast +import logging +import os +import six +import sys +import time import numpy as np import paddle import paddle.fluid as fluid -import paddle.dataset.wmt16 as wmt16 - -from model import TransFormer -from config import * -from data_util import * - - -def parse_args(): - parser = argparse.ArgumentParser("Arguments for Inference") - parser.add_argument( - "--use_data_parallel", - type=ast.literal_eval, - default=False, - help="The flag indicating whether to shuffle instances in each pass.") - parser.add_argument( - "--model_file", - type=str, - default="transformer_params", - help="Load model from the file named `model_file.pdparams`.") - parser.add_argument( - "--output_file", - type=str, - default="predict.txt", - help="The file to output the translation results of predict_file to.") - parser.add_argument('opts', - help='See config.py for all options', - default=None, - nargs=argparse.REMAINDER) - args = parser.parse_args() - merge_cfg_from_list(args.opts, [InferTaskConfig, ModelHyperParams]) - return args - - -def prepare_infer_input(insts, src_pad_idx, bos_idx, n_head): - """ - inputs for inferencs - """ - src_word, src_pos, src_slf_attn_bias, src_max_len = pad_batch_data( - [inst[0] for inst in insts], src_pad_idx, n_head, is_target=False) - # start tokens - trg_word = np.asarray([[bos_idx]] * len(insts), dtype="int64") - trg_src_attn_bias = np.tile(src_slf_attn_bias[:, :, ::src_max_len, :], - [1, 1, 1, 1]).astype("float32") - trg_word = trg_word.reshape(-1, 1, 1) - src_word = src_word.reshape(-1, src_max_len, 1) - src_pos = src_pos.reshape(-1, src_max_len, 1) - - data_inputs = [ - src_word, src_pos, src_slf_attn_bias, trg_word, trg_src_attn_bias - ] - var_inputs = [] - for i, field in enumerate(encoder_data_input_fields + - fast_decoder_data_input_fields): - var_inputs.append(to_variable(data_inputs[i], name=field)) +from utils.configure import PDConfig +from utils.check import check_gpu, check_version - enc_inputs = var_inputs[0:len(encoder_data_input_fields)] - dec_inputs = var_inputs[len(encoder_data_input_fields):] - return enc_inputs, dec_inputs +# include task-specific libs +import reader +from model import Transformer, position_encoding_init def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): @@ -96,59 +46,98 @@ def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False): return seq -def infer(args): - place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id) \ - if args.use_data_parallel else fluid.CUDAPlace(0) +def do_predict(args): + if args.use_cuda: + place = fluid.CUDAPlace(0) + else: + place = fluid.CPUPlace() + + # define the data generator + processor = reader.DataProcessor(fpattern=args.predict_file, + src_vocab_fpath=args.src_vocab_fpath, + trg_vocab_fpath=args.trg_vocab_fpath, + token_delimiter=args.token_delimiter, + use_token_batch=False, + batch_size=args.batch_size, + device_count=1, + pool_size=args.pool_size, + sort_type=reader.SortType.NONE, + shuffle=False, + shuffle_batch=False, + start_mark=args.special_token[0], + end_mark=args.special_token[1], + unk_mark=args.special_token[2], + max_length=args.max_length, + n_head=args.n_head) + batch_generator = 
processor.data_generator(phase="predict", place=place) + args.src_vocab_size, args.trg_vocab_size, args.bos_idx, args.eos_idx, \ + args.unk_idx = processor.get_vocab_summary() + trg_idx2word = reader.DataProcessor.load_dict( + dict_path=args.trg_vocab_fpath, reverse=True) + + args.src_vocab_size, args.trg_vocab_size, args.bos_idx, args.eos_idx, \ + args.unk_idx = processor.get_vocab_summary() + with fluid.dygraph.guard(place): - transformer = TransFormer( - 'transformer', ModelHyperParams.src_vocab_size, - ModelHyperParams.trg_vocab_size, ModelHyperParams.max_length + 1, - ModelHyperParams.n_layer, ModelHyperParams.n_head, - ModelHyperParams.d_key, ModelHyperParams.d_value, - ModelHyperParams.d_model, ModelHyperParams.d_inner_hid, - ModelHyperParams.prepostprocess_dropout, - ModelHyperParams.attention_dropout, ModelHyperParams.relu_dropout, - ModelHyperParams.preprocess_cmd, ModelHyperParams.postprocess_cmd, - ModelHyperParams.weight_sharing) - # load checkpoint - model_dict, _ = fluid.load_dygraph(args.model_file) + # define data loader + test_loader = fluid.io.DataLoader.from_generator(capacity=10) + test_loader.set_batch_generator(batch_generator, places=place) + + # define model + transformer = Transformer( + args.src_vocab_size, args.trg_vocab_size, args.max_length + 1, + args.n_layer, args.n_head, args.d_key, args.d_value, args.d_model, + args.d_inner_hid, args.prepostprocess_dropout, + args.attention_dropout, args.relu_dropout, args.preprocess_cmd, + args.postprocess_cmd, args.weight_sharing, args.bos_idx, + args.eos_idx) + + # load the trained model + assert args.init_from_params, ( + "Please set init_from_params to load the infer model.") + model_dict, _ = fluid.load_dygraph( + os.path.join(args.init_from_params, "transformer")) + # to avoid a longer length than training, reset the size of position + # encoding to max_length + model_dict["encoder.pos_encoder.weight"] = position_encoding_init( + args.max_length + 1, args.d_model) + model_dict["decoder.pos_encoder.weight"] = position_encoding_init( + args.max_length + 1, args.d_model) transformer.load_dict(model_dict) - print("checkpoint loaded") - # start evaluate mode - transformer.eval() - reader = paddle.batch(wmt16.test(ModelHyperParams.src_vocab_size, - ModelHyperParams.trg_vocab_size), - batch_size=InferTaskConfig.batch_size) - id2word = wmt16.get_dict("de", - ModelHyperParams.trg_vocab_size, - reverse=True) + # set evaluate mode + transformer.eval() f = open(args.output_file, "wb") - for batch in reader(): - enc_inputs, dec_inputs = prepare_infer_input( - batch, ModelHyperParams.eos_idx, ModelHyperParams.bos_idx, - ModelHyperParams.n_head) - + for input_data in test_loader(): + (src_word, src_pos, src_slf_attn_bias, trg_word, + trg_src_attn_bias) = input_data finished_seq, finished_scores = transformer.beam_search( - enc_inputs, - dec_inputs, - bos_id=ModelHyperParams.bos_idx, - eos_id=ModelHyperParams.eos_idx, - max_len=InferTaskConfig.max_out_len, - alpha=InferTaskConfig.alpha) + src_word, + src_pos, + src_slf_attn_bias, + trg_word, + trg_src_attn_bias, + bos_id=args.bos_idx, + eos_id=args.eos_idx, + beam_size=args.beam_size, + max_len=args.max_out_len) finished_seq = finished_seq.numpy() finished_scores = finished_scores.numpy() for ins in finished_seq: - for beam in ins: - id_list = post_process_seq(beam, ModelHyperParams.bos_idx, - ModelHyperParams.eos_idx) - word_list = [id2word[id] for id in id_list] - sequence = " ".join(word_list) + "\n" - f.write(sequence.encode("utf8")) - break # only print the best - - -if 
__name__ == '__main__': - args = parse_args() - infer(args) + for beam_idx, beam in enumerate(ins): + if beam_idx >= args.n_best: break + id_list = post_process_seq(beam, args.bos_idx, args.eos_idx) + word_list = [trg_idx2word[id] for id in id_list] + sequence = b" ".join(word_list) + b"\n" + f.write(sequence) + + +if __name__ == "__main__": + args = PDConfig(yaml_file="./transformer.yaml") + args.build() + args.Print() + check_gpu(args.use_cuda) + check_version() + + do_predict(args) diff --git a/dygraph/transformer/reader.py b/dygraph/transformer/reader.py new file mode 100644 index 0000000000000000000000000000000000000000..ef23c5e1e32fa4cee1ba5a42bb970a1a135879a0 --- /dev/null +++ b/dygraph/transformer/reader.py @@ -0,0 +1,550 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import glob +import six +import os +import tarfile + +import numpy as np +import paddle.fluid as fluid + + +def pad_batch_data(insts, + pad_idx, + n_head, + is_target=False, + is_label=False, + return_attn_bias=True, + return_max_len=True, + return_num_token=False): + """ + Pad the instances to the max sequence length in batch, and generate the + corresponding position data and attention bias. + """ + return_list = [] + max_len = max(len(inst) for inst in insts) + # Any token included in dict can be used to pad, since the paddings' loss + # will be masked out by weights and make no effect on parameter gradients. + inst_data = np.array( + [inst + [pad_idx] * (max_len - len(inst)) for inst in insts]) + return_list += [inst_data.astype("int64").reshape([-1, 1])] + if is_label: # label weight + inst_weight = np.array([[1.] * len(inst) + [0.] * (max_len - len(inst)) + for inst in insts]) + return_list += [inst_weight.astype("float32").reshape([-1, 1])] + else: # position data + inst_pos = np.array([ + list(range(0, len(inst))) + [0] * (max_len - len(inst)) + for inst in insts + ]) + return_list += [inst_pos.astype("int64").reshape([-1, 1])] + if return_attn_bias: + if is_target: + # This is used to avoid attention on paddings and subsequent + # words. + slf_attn_bias_data = np.ones((inst_data.shape[0], max_len, max_len)) + slf_attn_bias_data = np.triu(slf_attn_bias_data, + 1).reshape([-1, 1, max_len, max_len]) + slf_attn_bias_data = np.tile(slf_attn_bias_data, + [1, n_head, 1, 1]) * [-1e9] + else: + # This is used to avoid attention on paddings. 
+ slf_attn_bias_data = np.array([[0] * len(inst) + [-1e9] * + (max_len - len(inst)) + for inst in insts]) + slf_attn_bias_data = np.tile( + slf_attn_bias_data.reshape([-1, 1, 1, max_len]), + [1, n_head, max_len, 1]) + return_list += [slf_attn_bias_data.astype("float32")] + if return_max_len: + return_list += [max_len] + if return_num_token: + num_token = 0 + for inst in insts: + num_token += len(inst) + return_list += [num_token] + return return_list if len(return_list) > 1 else return_list[0] + + +def prepare_train_input(insts, src_pad_idx, trg_pad_idx, n_head): + """ + Put all padded data needed by training into a list. + """ + src_word, src_pos, src_slf_attn_bias, src_max_len = pad_batch_data( + [inst[0] for inst in insts], src_pad_idx, n_head, is_target=False) + src_word = src_word.reshape(-1, src_max_len) + src_pos = src_pos.reshape(-1, src_max_len) + trg_word, trg_pos, trg_slf_attn_bias, trg_max_len = pad_batch_data( + [inst[1] for inst in insts], trg_pad_idx, n_head, is_target=True) + trg_word = trg_word.reshape(-1, trg_max_len) + trg_pos = trg_pos.reshape(-1, trg_max_len) + + trg_src_attn_bias = np.tile(src_slf_attn_bias[:, :, ::src_max_len, :], + [1, 1, trg_max_len, 1]).astype("float32") + + lbl_word, lbl_weight, num_token = pad_batch_data( + [inst[2] for inst in insts], + trg_pad_idx, + n_head, + is_target=False, + is_label=True, + return_attn_bias=False, + return_max_len=False, + return_num_token=True) + lbl_word = lbl_word.reshape(-1, 1) + lbl_weight = lbl_weight.reshape(-1, 1) + + data_inputs = [ + src_word, src_pos, src_slf_attn_bias, trg_word, trg_pos, + trg_slf_attn_bias, trg_src_attn_bias, lbl_word, lbl_weight + ] + + return data_inputs + + +def prepare_infer_input(insts, src_pad_idx, bos_idx, n_head, place): + """ + Put all padded data needed by beam search decoder into a list. 
+ """ + src_word, src_pos, src_slf_attn_bias, src_max_len = pad_batch_data( + [inst[0] for inst in insts], src_pad_idx, n_head, is_target=False) + # start tokens + trg_word = np.asarray([[bos_idx]] * len(insts), dtype="int64") + trg_src_attn_bias = np.tile(src_slf_attn_bias[:, :, ::src_max_len, :], + [1, 1, 1, 1]).astype("float32") + trg_word = trg_word.reshape(-1, 1) + src_word = src_word.reshape(-1, src_max_len) + src_pos = src_pos.reshape(-1, src_max_len) + + data_inputs = [ + src_word, src_pos, src_slf_attn_bias, trg_word, trg_src_attn_bias + ] + return data_inputs + + +class SortType(object): + GLOBAL = 'global' + POOL = 'pool' + NONE = "none" + + +class Converter(object): + def __init__(self, vocab, beg, end, unk, delimiter, add_beg): + self._vocab = vocab + self._beg = beg + self._end = end + self._unk = unk + self._delimiter = delimiter + self._add_beg = add_beg + + def __call__(self, sentence): + return ([self._beg] if self._add_beg else []) + [ + self._vocab.get(w, self._unk) + for w in sentence.split(self._delimiter) + ] + [self._end] + + +class ComposedConverter(object): + def __init__(self, converters): + self._converters = converters + + def __call__(self, parallel_sentence): + return [ + self._converters[i](parallel_sentence[i]) + for i in range(len(self._converters)) + ] + + +class SentenceBatchCreator(object): + def __init__(self, batch_size): + self.batch = [] + self._batch_size = batch_size + + def append(self, info): + self.batch.append(info) + if len(self.batch) == self._batch_size: + tmp = self.batch + self.batch = [] + return tmp + + +class TokenBatchCreator(object): + def __init__(self, batch_size): + self.batch = [] + self.max_len = -1 + self._batch_size = batch_size + + def append(self, info): + cur_len = info.max_len + max_len = max(self.max_len, cur_len) + if max_len * (len(self.batch) + 1) > self._batch_size: + result = self.batch + self.batch = [info] + self.max_len = cur_len + return result + else: + self.max_len = max_len + self.batch.append(info) + + +class SampleInfo(object): + def __init__(self, i, max_len, min_len): + self.i = i + self.min_len = min_len + self.max_len = max_len + + +class MinMaxFilter(object): + def __init__(self, max_len, min_len, underlying_creator): + self._min_len = min_len + self._max_len = max_len + self._creator = underlying_creator + + def append(self, info): + if info.max_len > self._max_len or info.min_len < self._min_len: + return + else: + return self._creator.append(info) + + @property + def batch(self): + return self._creator.batch + + +class DataProcessor(object): + """ + The data reader loads all data from files and produces batches of data + in the way corresponding to settings. + + An example of returning a generator producing data batches whose data + is shuffled in each pass and sorted in each pool: + + ``` + train_data = DataProcessor( + src_vocab_fpath='data/src_vocab_file', + trg_vocab_fpath='data/trg_vocab_file', + fpattern='data/part-*', + use_token_batch=True, + batch_size=2000, + device_count=8, + n_head=8, + pool_size=10000, + sort_type=SortType.POOL, + shuffle=True, + shuffle_batch=True, + start_mark='', + end_mark='', + unk_mark='', + clip_last_batch=False).data_generator(phase='train') + ``` + + :param src_vocab_fpath: The path of vocabulary file of source language. + :type src_vocab_fpath: basestring + :param trg_vocab_fpath: The path of vocabulary file of target language. + :type trg_vocab_fpath: basestring + :param fpattern: The pattern to match data files. 
+ :type fpattern: basestring + :param batch_size: The number of sequences contained in a mini-batch. + or the maximum number of tokens (include paddings) contained in a + mini-batch. + :type batch_size: int + :param pool_size: The size of pool buffer. + :type device_count: int + :param device_count: The number of devices. The actual batch size is + determined by both batch_size and device_count. + :type n_head: int + :param n_head: The number of head used in multi-head attention. Actually, + this is not a reader related argument, but is used for input data. + :type pool_size: int + :param sort_type: The grain to sort by length: 'global' for all + instances; 'pool' for instances in pool; 'none' for no sort. + :type sort_type: basestring + :param clip_last_batch: Whether to clip the last uncompleted batch. + :type clip_last_batch: bool + :param tar_fname: The data file in tar if fpattern matches a tar file. + :type tar_fname: basestring + :param min_length: The minimum length used to filt sequences. + :type min_length: int + :param max_length: The maximum length used to filt sequences. + :type max_length: int + :param shuffle: Whether to shuffle all instances. + :type shuffle: bool + :param shuffle_batch: Whether to shuffle the generated batches. + :type shuffle_batch: bool + :param use_token_batch: Whether to produce batch data according to + token number. + :type use_token_batch: bool + :param field_delimiter: The delimiter used to split source and target in + each line of data file. + :type field_delimiter: basestring + :param token_delimiter: The delimiter used to split tokens in source or + target sentences. + :type token_delimiter: basestring + :param start_mark: The token representing for the beginning of + sentences in dictionary. + :type start_mark: basestring + :param end_mark: The token representing for the end of sentences + in dictionary. + :type end_mark: basestring + :param unk_mark: The token representing for unknown word in dictionary. + :type unk_mark: basestring + :param only_src: Whether each line is a source and target sentence + pair or only has the source sentence. + :type only_src: bool + :param seed: The seed for random. 
+ :type seed: int + """ + def __init__(self, + src_vocab_fpath, + trg_vocab_fpath, + fpattern, + batch_size, + device_count, + n_head, + pool_size, + sort_type=SortType.GLOBAL, + clip_last_batch=False, + tar_fname=None, + min_length=0, + max_length=100, + shuffle=True, + shuffle_batch=False, + use_token_batch=False, + field_delimiter="\t", + token_delimiter=" ", + start_mark="", + end_mark="", + unk_mark="", + only_src=False, + seed=0): + # convert str to bytes, and use byte data + field_delimiter = field_delimiter.encode("utf8") + token_delimiter = token_delimiter.encode("utf8") + start_mark = start_mark.encode("utf8") + end_mark = end_mark.encode("utf8") + unk_mark = unk_mark.encode("utf8") + self._src_vocab = self.load_dict(src_vocab_fpath) + self._trg_vocab = self.load_dict(trg_vocab_fpath) + self._bos_idx = self._src_vocab[start_mark] + self._eos_idx = self._src_vocab[end_mark] + self._unk_idx = self._src_vocab[unk_mark] + self._only_src = only_src + self._pool_size = pool_size + self._batch_size = batch_size + self._device_count = device_count + self._n_head = n_head + self._use_token_batch = use_token_batch + self._sort_type = sort_type + self._clip_last_batch = clip_last_batch + self._shuffle = shuffle + self._shuffle_batch = shuffle_batch + self._min_length = min_length + self._max_length = max_length + self._field_delimiter = field_delimiter + self._token_delimiter = token_delimiter + self.load_src_trg_ids(fpattern, tar_fname) + self._random = np.random + self._random.seed(seed) + + def load_src_trg_ids(self, fpattern, tar_fname): + converters = [ + Converter(vocab=self._src_vocab, + beg=self._bos_idx, + end=self._eos_idx, + unk=self._unk_idx, + delimiter=self._token_delimiter, + add_beg=False) + ] + if not self._only_src: + converters.append( + Converter(vocab=self._trg_vocab, + beg=self._bos_idx, + end=self._eos_idx, + unk=self._unk_idx, + delimiter=self._token_delimiter, + add_beg=True)) + + converters = ComposedConverter(converters) + + self._src_seq_ids = [] + self._trg_seq_ids = None if self._only_src else [] + self._sample_infos = [] + + for i, line in enumerate(self._load_lines(fpattern, tar_fname)): + src_trg_ids = converters(line) + self._src_seq_ids.append(src_trg_ids[0]) + lens = [len(src_trg_ids[0])] + if not self._only_src: + self._trg_seq_ids.append(src_trg_ids[1]) + lens.append(len(src_trg_ids[1])) + self._sample_infos.append(SampleInfo(i, max(lens), min(lens))) + + def _load_lines(self, fpattern, tar_fname): + fpaths = glob.glob(fpattern) + assert len(fpaths) > 0, "no matching file to the provided data path" + + if len(fpaths) == 1 and tarfile.is_tarfile(fpaths[0]): + if tar_fname is None: + raise Exception("If tar file provided, please set tar_fname.") + + f = tarfile.open(fpaths[0], "rb") + for line in f.extractfile(tar_fname): + fields = line.strip(b"\n").split(self._field_delimiter) + if (not self._only_src + and len(fields) == 2) or (self._only_src + and len(fields) == 1): + yield fields + else: + for fpath in fpaths: + if not os.path.isfile(fpath): + raise IOError("Invalid file: %s" % fpath) + + with open(fpath, "rb") as f: + for line in f: + fields = line.strip(b"\n").split(self._field_delimiter) + if (not self._only_src + and len(fields) == 2) or (self._only_src + and len(fields) == 1): + yield fields + + @staticmethod + def load_dict(dict_path, reverse=False): + word_dict = {} + with open(dict_path, "rb") as fdict: + for idx, line in enumerate(fdict): + if reverse: + word_dict[idx] = line.strip(b"\n") + else: + word_dict[line.strip(b"\n")] = idx + return 
word_dict + + def batch_generator(self, batch_size, use_token_batch): + def __impl__(): + # global sort or global shuffle + if self._sort_type == SortType.GLOBAL: + infos = sorted(self._sample_infos, key=lambda x: x.max_len) + else: + if self._shuffle: + infos = self._sample_infos + self._random.shuffle(infos) + else: + infos = self._sample_infos + + if self._sort_type == SortType.POOL: + reverse = True + for i in range(0, len(infos), self._pool_size): + # to avoid placing short next to long sentences + reverse = not reverse + infos[i:i + self._pool_size] = sorted( + infos[i:i + self._pool_size], + key=lambda x: x.max_len, + reverse=reverse) + + # concat batch + batches = [] + batch_creator = TokenBatchCreator( + batch_size) if use_token_batch else SentenceBatchCreator( + batch_size) + batch_creator = MinMaxFilter(self._max_length, self._min_length, + batch_creator) + + for info in infos: + batch = batch_creator.append(info) + if batch is not None: + batches.append(batch) + + if not self._clip_last_batch and len(batch_creator.batch) != 0: + batches.append(batch_creator.batch) + + if self._shuffle_batch: + self._random.shuffle(batches) + + for batch in batches: + batch_ids = [info.i for info in batch] + + if self._only_src: + yield [[self._src_seq_ids[idx]] for idx in batch_ids] + else: + yield [(self._src_seq_ids[idx], self._trg_seq_ids[idx][:-1], + self._trg_seq_ids[idx][1:]) for idx in batch_ids] + + return __impl__ + + @staticmethod + def stack(data_reader, count, clip_last=True): + def __impl__(): + res = [] + for item in data_reader(): + res.append(item) + if len(res) == count: + yield res + res = [] + if len(res) == count: + yield res + elif not clip_last: + data = [] + for item in res: + data += item + if len(data) > count: + inst_num_per_part = len(data) // count + yield [ + data[inst_num_per_part * i:inst_num_per_part * (i + 1)] + for i in range(count) + ] + + return __impl__ + + @staticmethod + def split(data_reader, count): + def __impl__(): + for item in data_reader(): + inst_num_per_part = len(item) // count + for i in range(count): + yield item[inst_num_per_part * i:inst_num_per_part * + (i + 1)] + + return __impl__ + + def data_generator(self, phase, place=None): + # Any token included in dict can be used to pad, since the paddings' loss + # will be masked out by weights and make no effect on parameter gradients. 
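+        # The returned generator yields one list of numpy arrays per batch, in
+        # the order the model expects: for phase "train" the nine arrays built
+        # by prepare_train_input (src_word, src_pos, src_slf_attn_bias,
+        # trg_word, trg_pos, trg_slf_attn_bias, trg_src_attn_bias, lbl_word,
+        # lbl_weight); for prediction, the shorter list from prepare_infer_input.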
+ src_pad_idx = trg_pad_idx = self._eos_idx + bos_idx = self._bos_idx + n_head = self._n_head + data_reader = self.batch_generator( + self._batch_size * + (1 if self._use_token_batch else self._device_count), + self._use_token_batch) + if not self._use_token_batch: + # to make data on each device have similar token number + data_reader = self.split(data_reader, self._device_count) + + def __for_train__(): + for data in data_reader(): + data_inputs = prepare_train_input(data, src_pad_idx, + trg_pad_idx, n_head) + yield data_inputs + + def __for_predict__(): + for data in data_reader(): + data_inputs = prepare_infer_input(data, src_pad_idx, bos_idx, + n_head, place) + yield data_inputs + + return __for_train__ if phase == "train" else __for_predict__ + + def get_vocab_summary(self): + return len(self._src_vocab), len( + self._trg_vocab), self._bos_idx, self._eos_idx, self._unk_idx diff --git a/dygraph/transformer/train.py b/dygraph/transformer/train.py index f97196d68848071b0338807778cec0a60cfb3195..bbfb2c12f58a4a4b4a2ac39baeabaa08f670e242 100644 --- a/dygraph/transformer/train.py +++ b/dygraph/transformer/train.py @@ -12,184 +12,195 @@ # See the License for the specific language governing permissions and # limitations under the License. -from __future__ import print_function -import argparse -import ast +import logging +import os +import six +import sys +import time import numpy as np import paddle import paddle.fluid as fluid -import paddle.dataset.wmt16 as wmt16 - -from model import TransFormer, NoamDecay -from config import * -from data_util import * - - -def parse_args(): - parser = argparse.ArgumentParser("Arguments for Training") - parser.add_argument( - "--use_data_parallel", - type=ast.literal_eval, - default=False, - help="The flag indicating whether to use multi-GPU.") - parser.add_argument( - "--model_file", - type=str, - default="transformer_params", - help="Save the model as a file named `model_file.pdparams`.") - parser.add_argument( - 'opts', - help='See config.py for all options', - default=None, - nargs=argparse.REMAINDER) - args = parser.parse_args() - merge_cfg_from_list(args.opts, [TrainTaskConfig, ModelHyperParams]) - return args - - -def prepare_train_input(insts, src_pad_idx, trg_pad_idx, n_head): - """ - inputs for training - """ - src_word, src_pos, src_slf_attn_bias, src_max_len = pad_batch_data( - [inst[0] for inst in insts], src_pad_idx, n_head, is_target=False) - src_word = src_word.reshape(-1, src_max_len, 1) - src_pos = src_pos.reshape(-1, src_max_len, 1) - trg_word, trg_pos, trg_slf_attn_bias, trg_max_len = pad_batch_data( - [inst[1] for inst in insts], trg_pad_idx, n_head, is_target=True) - trg_word = trg_word.reshape(-1, trg_max_len, 1) - trg_pos = trg_pos.reshape(-1, trg_max_len, 1) - - trg_src_attn_bias = np.tile(src_slf_attn_bias[:, :, ::src_max_len, :], - [1, 1, trg_max_len, 1]).astype("float32") - - lbl_word, lbl_weight, num_token = pad_batch_data( - [inst[2] for inst in insts], - trg_pad_idx, - n_head, - is_target=False, - is_label=True, - return_attn_bias=False, - return_max_len=False, - return_num_token=True) - - data_inputs = [ - src_word, src_pos, src_slf_attn_bias, trg_word, trg_pos, - trg_slf_attn_bias, trg_src_attn_bias, lbl_word, lbl_weight - ] - - var_inputs = [] - for i, field in enumerate(encoder_data_input_fields + - decoder_data_input_fields[:-1] + - label_data_input_fields): - var_inputs.append(to_variable(data_inputs[i], name=field)) - - enc_inputs = var_inputs[0:len(encoder_data_input_fields)] - dec_inputs = 
var_inputs[len(encoder_data_input_fields - ):len(encoder_data_input_fields) + - len(decoder_data_input_fields[:-1])] - label = var_inputs[-2] - weights = var_inputs[-1] - - return enc_inputs, dec_inputs, label, weights - - -def train(args): - """ - train models - :return: - """ - - trainer_count = fluid.dygraph.parallel.Env().nranks - place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id) \ - if args.use_data_parallel else fluid.CUDAPlace(0) + +from utils.configure import PDConfig +from utils.check import check_gpu, check_version + +# include task-specific libs +import reader +from model import Transformer, CrossEntropyCriterion, NoamDecay + + +def do_train(args): + if args.use_cuda: + trainer_count = fluid.dygraph.parallel.Env().nranks + place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id + ) if trainer_count > 1 else fluid.CUDAPlace(0) + else: + trainer_count = 1 + place = fluid.CPUPlace() + + # define the data generator + processor = reader.DataProcessor(fpattern=args.training_file, + src_vocab_fpath=args.src_vocab_fpath, + trg_vocab_fpath=args.trg_vocab_fpath, + token_delimiter=args.token_delimiter, + use_token_batch=args.use_token_batch, + batch_size=args.batch_size, + device_count=trainer_count, + pool_size=args.pool_size, + sort_type=args.sort_type, + shuffle=args.shuffle, + shuffle_batch=args.shuffle_batch, + start_mark=args.special_token[0], + end_mark=args.special_token[1], + unk_mark=args.special_token[2], + max_length=args.max_length, + n_head=args.n_head) + batch_generator = processor.data_generator(phase="train") + if trainer_count > 1: # for multi-process gpu training + batch_generator = fluid.contrib.reader.distributed_batch_reader( + batch_generator) + args.src_vocab_size, args.trg_vocab_size, args.bos_idx, args.eos_idx, \ + args.unk_idx = processor.get_vocab_summary() + with fluid.dygraph.guard(place): - if args.use_data_parallel: - strategy = fluid.dygraph.parallel.prepare_context() + # set seed for CE + random_seed = eval(str(args.random_seed)) + if random_seed is not None: + fluid.default_main_program().random_seed = random_seed + fluid.default_startup_program().random_seed = random_seed + + # define data loader + train_loader = fluid.io.DataLoader.from_generator(capacity=10) + train_loader.set_batch_generator(batch_generator, places=place) # define model - transformer = TransFormer( - 'transformer', ModelHyperParams.src_vocab_size, - ModelHyperParams.trg_vocab_size, ModelHyperParams.max_length + 1, - ModelHyperParams.n_layer, ModelHyperParams.n_head, - ModelHyperParams.d_key, ModelHyperParams.d_value, - ModelHyperParams.d_model, ModelHyperParams.d_inner_hid, - ModelHyperParams.prepostprocess_dropout, - ModelHyperParams.attention_dropout, ModelHyperParams.relu_dropout, - ModelHyperParams.preprocess_cmd, ModelHyperParams.postprocess_cmd, - ModelHyperParams.weight_sharing, TrainTaskConfig.label_smooth_eps) + transformer = Transformer( + args.src_vocab_size, args.trg_vocab_size, args.max_length + 1, + args.n_layer, args.n_head, args.d_key, args.d_value, args.d_model, + args.d_inner_hid, args.prepostprocess_dropout, + args.attention_dropout, args.relu_dropout, args.preprocess_cmd, + args.postprocess_cmd, args.weight_sharing, args.bos_idx, + args.eos_idx) + + # define loss + criterion = CrossEntropyCriterion(args.label_smooth_eps) + # define optimizer - optimizer = fluid.optimizer.Adam(learning_rate=NoamDecay( - ModelHyperParams.d_model, TrainTaskConfig.warmup_steps, - TrainTaskConfig.learning_rate), - beta1=TrainTaskConfig.beta1, - 
beta2=TrainTaskConfig.beta2, - epsilon=TrainTaskConfig.eps) - # - if args.use_data_parallel: + optimizer = fluid.optimizer.Adam( + learning_rate=NoamDecay(args.d_model, args.warmup_steps, + args.learning_rate), + beta1=args.beta1, + beta2=args.beta2, + epsilon=float(args.eps), + parameter_list=transformer.parameters()) + + ## init from some checkpoint, to resume the previous training + if args.init_from_checkpoint: + model_dict, opt_dict = fluid.load_dygraph( + os.path.join(args.init_from_checkpoint, "transformer")) + transformer.load_dict(model_dict) + optimizer.set_dict(opt_dict) + ## init from some pretrain models, to better solve the current task + if args.init_from_pretrain_model: + model_dict, _ = fluid.load_dygraph( + os.path.join(args.init_from_pretrain_model, "transformer")) + transformer.load_dict(model_dict) + + if trainer_count > 1: + strategy = fluid.dygraph.parallel.prepare_context() transformer = fluid.dygraph.parallel.DataParallel( transformer, strategy) - # define data generator for training and validation - train_reader = paddle.batch(wmt16.train( - ModelHyperParams.src_vocab_size, ModelHyperParams.trg_vocab_size), - batch_size=TrainTaskConfig.batch_size) - if args.use_data_parallel: - train_reader = fluid.contrib.reader.distributed_batch_reader( - train_reader) - val_reader = paddle.batch(wmt16.test(ModelHyperParams.src_vocab_size, - ModelHyperParams.trg_vocab_size), - batch_size=TrainTaskConfig.batch_size) - - # loop for training iterations - for i in range(TrainTaskConfig.pass_num): - dy_step = 0 - sum_cost = 0 - transformer.train() - for batch in train_reader(): - enc_inputs, dec_inputs, label, weights = prepare_train_input( - batch, ModelHyperParams.eos_idx, ModelHyperParams.eos_idx, - ModelHyperParams.n_head) - - dy_sum_cost, dy_avg_cost, dy_predict, dy_token_num = transformer( - enc_inputs, dec_inputs, label, weights) - - if args.use_data_parallel: - dy_avg_cost = transformer.scale_loss(dy_avg_cost) - dy_avg_cost.backward() + # the best cross-entropy value with label smoothing + loss_normalizer = -( + (1. - args.label_smooth_eps) * np.log( + (1. - args.label_smooth_eps)) + + args.label_smooth_eps * np.log(args.label_smooth_eps / + (args.trg_vocab_size - 1) + 1e-20)) + + step_idx = 0 + # train loop + for pass_id in range(args.epoch): + pass_start_time = time.time() + batch_id = 0 + for input_data in train_loader(): + (src_word, src_pos, src_slf_attn_bias, trg_word, trg_pos, + trg_slf_attn_bias, trg_src_attn_bias, lbl_word, + lbl_weight) = input_data + logits = transformer(src_word, src_pos, src_slf_attn_bias, + trg_word, trg_pos, trg_slf_attn_bias, + trg_src_attn_bias) + + sum_cost, avg_cost, token_num = criterion( + logits, lbl_word, lbl_weight) + + if trainer_count > 1: + avg_cost = transformer.scale_loss(avg_cost) + avg_cost.backward() transformer.apply_collective_grads() else: - dy_avg_cost.backward() - optimizer.minimize(dy_avg_cost) + avg_cost.backward() + + optimizer.minimize(avg_cost) transformer.clear_gradients() - dy_step = dy_step + 1 - if dy_step % 10 == 0: - print("pass num : {}, batch_id: {}, dy_graph avg loss: {}". 
- format(i, dy_step, - dy_avg_cost.numpy() * trainer_count)) - - # switch to evaluation mode - transformer.eval() - sum_cost = 0 - token_num = 0 - for batch in val_reader(): - enc_inputs, dec_inputs, label, weights = prepare_train_input( - batch, ModelHyperParams.eos_idx, ModelHyperParams.eos_idx, - ModelHyperParams.n_head) - - dy_sum_cost, dy_avg_cost, dy_predict, dy_token_num = transformer( - enc_inputs, dec_inputs, label, weights) - sum_cost += dy_sum_cost.numpy() - token_num += dy_token_num.numpy() - print("pass : {} finished, validation avg loss: {}".format( - i, sum_cost / token_num)) - - if fluid.dygraph.parallel.Env().dev_id == 0: - fluid.save_dygraph(transformer.state_dict(), args.model_file) - - -if __name__ == '__main__': - args = parse_args() - train(args) + if step_idx % args.print_step == 0: + total_avg_cost = avg_cost.numpy() * trainer_count + + if step_idx == 0: + logging.info( + "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f" % + (step_idx, pass_id, batch_id, total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]))) + avg_batch_time = time.time() + else: + logging.info( + "step_idx: %d, epoch: %d, batch: %d, avg loss: %f, " + "normalized loss: %f, ppl: %f, speed: %.2f step/s" % + (step_idx, pass_id, batch_id, total_avg_cost, + total_avg_cost - loss_normalizer, + np.exp([min(total_avg_cost, 100)]), + args.print_step / (time.time() - avg_batch_time))) + avg_batch_time = time.time() + + if step_idx % args.save_step == 0 and step_idx != 0 and ( + trainer_count == 1 + or fluid.dygraph.parallel.Env().dev_id == 0): + if args.save_model: + model_dir = os.path.join(args.save_model, + "step_" + str(step_idx)) + if not os.path.exists(model_dir): + os.makedirs(model_dir) + fluid.save_dygraph( + transformer.state_dict(), + os.path.join(model_dir, "transformer")) + fluid.save_dygraph( + optimizer.state_dict(), + os.path.join(model_dir, "transformer")) + + batch_id += 1 + step_idx += 1 + + time_consumed = time.time() - pass_start_time + + if args.save_model: + model_dir = os.path.join(args.save_model, "step_final") + if not os.path.exists(model_dir): + os.makedirs(model_dir) + fluid.save_dygraph(transformer.state_dict(), + os.path.join(model_dir, "transformer")) + fluid.save_dygraph(optimizer.state_dict(), + os.path.join(model_dir, "transformer")) + + +if __name__ == "__main__": + args = PDConfig(yaml_file="./transformer.yaml") + args.build() + args.Print() + check_gpu(args.use_cuda) + check_version() + + do_train(args) diff --git a/dygraph/transformer/transformer.yaml b/dygraph/transformer/transformer.yaml new file mode 100644 index 0000000000000000000000000000000000000000..76151f207ca4b6b9a907779642f9f0f48cddc78b --- /dev/null +++ b/dygraph/transformer/transformer.yaml @@ -0,0 +1,108 @@ +# used for continuous evaluation +enable_ce: False + +# The frequency to save trained models when training. +save_step: 10000 +# The frequency to fetch and print output when training. +print_step: 100 +# path of the checkpoint, to resume the previous training +init_from_checkpoint: "" +# path of the pretrain model, to better solve the current task +init_from_pretrain_model: "" +# path of trained parameter, to make prediction +init_from_params: "trained_params/step_100000/" +# the directory for saving model +save_model: "trained_models" +# the directory for saving inference model. +inference_model_dir: "infer_model" +# Set seed for CE or debug +random_seed: None +# The pattern to match training data files. 
+training_file: "wmt16_ende_data_bpe/train.tok.clean.bpe.32000.en-de" +# The pattern to match test data files. +predict_file: "wmt16_ende_data_bpe/newstest2016.tok.bpe.32000.en-de" +# The file to output the translation results of predict_file to. +output_file: "predict.txt" +# The path of vocabulary file of source language. +src_vocab_fpath: "wmt16_ende_data_bpe/vocab_all.bpe.32000" +# The path of vocabulary file of target language. +trg_vocab_fpath: "wmt16_ende_data_bpe/vocab_all.bpe.32000" +# The , and tokens in the dictionary. +special_token: ["", "", ""] + +# whether to use cuda +use_cuda: True + +# args for reader, see reader.py for details +token_delimiter: " " +use_token_batch: True +pool_size: 200000 +sort_type: "pool" +shuffle: True +shuffle_batch: True +batch_size: 4096 + +# Hyparams for training: +# the number of epoches for training +epoch: 30 +# the hyper parameters for Adam optimizer. +# This static learning_rate will be multiplied to the LearningRateScheduler +# derived learning rate the to get the final learning rate. +learning_rate: 2.0 +beta1: 0.9 +beta2: 0.997 +eps: 1e-9 +# the parameters for learning rate scheduling. +warmup_steps: 8000 +# the weight used to mix up the ground-truth distribution and the fixed +# uniform distribution in label smoothing when training. +# Set this as zero if label smoothing is not wanted. +label_smooth_eps: 0.1 + +# Hyparams for generation: +# the parameters for beam search. +beam_size: 5 +max_out_len: 256 +# the number of decoded sentences to output. +n_best: 1 + +# Hyparams for model: +# These following five vocabularies related configurations will be set +# automatically according to the passed vocabulary path and special tokens. +# size of source word dictionary. +src_vocab_size: 10000 +# size of target word dictionay +trg_vocab_size: 10000 +# index for token +bos_idx: 0 +# index for token +eos_idx: 1 +# index for token +unk_idx: 2 +# max length of sequences deciding the size of position encoding table. +max_length: 256 +# the dimension for word embeddings, which is also the last dimension of +# the input and output of multi-head attention, position-wise feed-forward +# networks, encoder and decoder. +d_model: 512 +# size of the hidden layer in position-wise feed-forward networks. +d_inner_hid: 2048 +# the dimension that keys are projected to for dot-product attention. +d_key: 64 +# the dimension that values are projected to for dot-product attention. +d_value: 64 +# number of head used in multi-head attention. +n_head: 8 +# number of sub-layers to be stacked in the encoder and decoder. +n_layer: 6 +# dropout rates of different modules. +prepostprocess_dropout: 0.1 +attention_dropout: 0.1 +relu_dropout: 0.1 +# to process before each sub-layer +preprocess_cmd: "n" # layer normalization +# to process after each sub-layer +postprocess_cmd: "da" # dropout + residual connection +# the flag indicating whether to share embedding and softmax weights. +# vocabularies in source and target should be same for weight sharing. +weight_sharing: True diff --git a/dygraph/transformer/utils/__init__.py b/dygraph/transformer/utils/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/dygraph/transformer/utils/check.py b/dygraph/transformer/utils/check.py new file mode 100644 index 0000000000000000000000000000000000000000..305fa3705f5c313569986cbdb15c8afeda5a79c1 --- /dev/null +++ b/dygraph/transformer/utils/check.py @@ -0,0 +1,61 @@ +# Copyright (c) 2019 PaddlePaddle Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import sys + +import paddle.fluid as fluid + +import logging +logger = logging.getLogger(__name__) + +__all__ = ['check_gpu', 'check_version'] + + +def check_gpu(use_gpu): + """ + Log error and exit when set use_gpu=true in paddlepaddle + cpu version. + """ + err = "Config use_gpu cannot be set as true while you are " \ + "using paddlepaddle cpu version ! \nPlease try: \n" \ + "\t1. Install paddlepaddle-gpu to run model on GPU \n" \ + "\t2. Set use_gpu as false in config file to run " \ + "model on CPU" + + try: + if use_gpu and not fluid.is_compiled_with_cuda(): + logger.error(err) + sys.exit(1) + except Exception as e: + pass + + +def check_version(): + """ + Log error and exit when the installed version of paddlepaddle is + not satisfied. + """ + err = "PaddlePaddle version 1.6 or higher is required, " \ + "or a suitable develop version is satisfied as well. \n" \ + "Please make sure the version is good with your code." \ + + try: + fluid.require_version('1.6.0') + except Exception as e: + logger.error(err) + sys.exit(1) diff --git a/dygraph/transformer/utils/configure.py b/dygraph/transformer/utils/configure.py new file mode 100644 index 0000000000000000000000000000000000000000..67e601282fee572518435eaed38a4ed8e26fc5f9 --- /dev/null +++ b/dygraph/transformer/utils/configure.py @@ -0,0 +1,350 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os +import sys +import argparse +import json +import yaml +import six +import logging + +logging_only_message = "%(message)s" +logging_details = "%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s" + + +class JsonConfig(object): + """ + A high-level api for handling json configure file. 
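+
+    A minimal usage sketch (the config path and key below are assumptions):
+        config = JsonConfig("./test/bert_config.json")
+        hidden_size = config["hidden_size"]
+        config.print_config()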
+ """ + + def __init__(self, config_path): + self._config_dict = self._parse(config_path) + + def _parse(self, config_path): + try: + with open(config_path) as json_file: + config_dict = json.load(json_file) + except: + raise IOError("Error in parsing bert model config file '%s'" % + config_path) + else: + return config_dict + + def __getitem__(self, key): + return self._config_dict[key] + + def print_config(self): + for arg, value in sorted(six.iteritems(self._config_dict)): + print('%s: %s' % (arg, value)) + print('------------------------------------------------') + + +class ArgumentGroup(object): + def __init__(self, parser, title, des): + self._group = parser.add_argument_group(title=title, description=des) + + def add_arg(self, name, type, default, help, **kwargs): + type = str2bool if type == bool else type + self._group.add_argument( + "--" + name, + default=default, + type=type, + help=help + ' Default: %(default)s.', + **kwargs) + + +class ArgConfig(object): + """ + A high-level api for handling argument configs. + """ + + def __init__(self): + parser = argparse.ArgumentParser() + + train_g = ArgumentGroup(parser, "training", "training options.") + train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.") + train_g.add_arg("learning_rate", float, 5e-5, + "Learning rate used to train with warmup.") + train_g.add_arg( + "lr_scheduler", + str, + "linear_warmup_decay", + "scheduler of learning rate.", + choices=['linear_warmup_decay', 'noam_decay']) + train_g.add_arg("weight_decay", float, 0.01, + "Weight decay rate for L2 regularizer.") + train_g.add_arg( + "warmup_proportion", float, 0.1, + "Proportion of training steps to perform linear learning rate warmup for." + ) + train_g.add_arg("save_steps", int, 1000, + "The steps interval to save checkpoints.") + train_g.add_arg("use_fp16", bool, False, + "Whether to use fp16 mixed precision training.") + train_g.add_arg( + "loss_scaling", float, 1.0, + "Loss scaling factor for mixed precision training, only valid when use_fp16 is enabled." 
+ ) + train_g.add_arg("pred_dir", str, None, + "Path to save the prediction results") + + log_g = ArgumentGroup(parser, "logging", "logging related.") + log_g.add_arg("skip_steps", int, 10, + "The steps interval to print loss.") + log_g.add_arg("verbose", bool, False, "Whether to output verbose log.") + + run_type_g = ArgumentGroup(parser, "run_type", "running type options.") + run_type_g.add_arg("use_cuda", bool, True, + "If set, use GPU for training.") + run_type_g.add_arg( + "use_fast_executor", bool, False, + "If set, use fast parallel executor (in experiment).") + run_type_g.add_arg( + "num_iteration_per_drop_scope", int, 1, + "Ihe iteration intervals to clean up temporary variables.") + run_type_g.add_arg("do_train", bool, True, + "Whether to perform training.") + run_type_g.add_arg("do_predict", bool, True, + "Whether to perform prediction.") + + custom_g = ArgumentGroup(parser, "customize", "customized options.") + + self.custom_g = custom_g + + self.parser = parser + + def add_arg(self, name, dtype, default, descrip): + self.custom_g.add_arg(name, dtype, default, descrip) + + def build_conf(self): + return self.parser.parse_args() + + +def str2bool(v): + # because argparse does not support to parse "true, False" as python + # boolean directly + return v.lower() in ("true", "t", "1") + + +def print_arguments(args, log=None): + if not log: + print('----------- Configuration Arguments -----------') + for arg, value in sorted(six.iteritems(vars(args))): + print('%s: %s' % (arg, value)) + print('------------------------------------------------') + else: + log.info('----------- Configuration Arguments -----------') + for arg, value in sorted(six.iteritems(vars(args))): + log.info('%s: %s' % (arg, value)) + log.info('------------------------------------------------') + + +class PDConfig(object): + """ + A high-level API for managing configuration files in PaddlePaddle. + Can jointly work with command-line-arugment, json files and yaml files. + """ + + def __init__(self, json_file="", yaml_file="", fuse_args=True): + """ + Init funciton for PDConfig. + json_file: the path to the json configure file. + yaml_file: the path to the yaml configure file. + fuse_args: if fuse the json/yaml configs with argparse. + """ + assert isinstance(json_file, str) + assert isinstance(yaml_file, str) + + if json_file != "" and yaml_file != "": + raise Warning( + "json_file and yaml_file can not co-exist for now. please only use one configure file type." + ) + return + + self.args = None + self.arg_config = {} + self.json_config = {} + self.yaml_config = {} + + parser = argparse.ArgumentParser() + + self.default_g = ArgumentGroup(parser, "default", "default options.") + self.yaml_g = ArgumentGroup(parser, "yaml", "options from yaml.") + self.json_g = ArgumentGroup(parser, "json", "options from json.") + self.com_g = ArgumentGroup(parser, "custom", "customized options.") + + self.default_g.add_arg("do_train", bool, False, + "Whether to perform training.") + self.default_g.add_arg("do_predict", bool, False, + "Whether to perform predicting.") + self.default_g.add_arg("do_eval", bool, False, + "Whether to perform evaluating.") + self.default_g.add_arg("do_save_inference_model", bool, False, + "Whether to perform model saving for inference.") + + # NOTE: args for profiler + self.default_g.add_arg("is_profiler", int, 0, "the switch of profiler tools. (used for benchmark)") + self.default_g.add_arg("profiler_path", str, './', "the profiler output file path. 
(used for benchmark)") + self.default_g.add_arg("max_iter", int, 0, "the max train batch num.(used for benchmark)") + + self.parser = parser + + if json_file != "": + self.load_json(json_file, fuse_args=fuse_args) + + if yaml_file: + self.load_yaml(yaml_file, fuse_args=fuse_args) + + def load_json(self, file_path, fuse_args=True): + + if not os.path.exists(file_path): + raise Warning("the json file %s does not exist." % file_path) + return + + with open(file_path, "r") as fin: + self.json_config = json.loads(fin.read()) + fin.close() + + if fuse_args: + for name in self.json_config: + if isinstance(self.json_config[name], list): + self.json_g.add_arg( + name, + type(self.json_config[name][0]), + self.json_config[name], + "This is from %s" % file_path, + nargs=len(self.json_config[name])) + continue + if not isinstance(self.json_config[name], int) \ + and not isinstance(self.json_config[name], float) \ + and not isinstance(self.json_config[name], str) \ + and not isinstance(self.json_config[name], bool): + + continue + + self.json_g.add_arg(name, + type(self.json_config[name]), + self.json_config[name], + "This is from %s" % file_path) + + def load_yaml(self, file_path, fuse_args=True): + + if not os.path.exists(file_path): + raise Warning("the yaml file %s does not exist." % file_path) + return + + with open(file_path, "r") as fin: + self.yaml_config = yaml.load(fin, Loader=yaml.SafeLoader) + fin.close() + + if fuse_args: + for name in self.yaml_config: + if isinstance(self.yaml_config[name], list): + self.yaml_g.add_arg( + name, + type(self.yaml_config[name][0]), + self.yaml_config[name], + "This is from %s" % file_path, + nargs=len(self.yaml_config[name])) + continue + + if not isinstance(self.yaml_config[name], int) \ + and not isinstance(self.yaml_config[name], float) \ + and not isinstance(self.yaml_config[name], str) \ + and not isinstance(self.yaml_config[name], bool): + + continue + + self.yaml_g.add_arg(name, + type(self.yaml_config[name]), + self.yaml_config[name], + "This is from %s" % file_path) + + def build(self): + self.args = self.parser.parse_args() + self.arg_config = vars(self.args) + + def __add__(self, new_arg): + assert isinstance(new_arg, list) or isinstance(new_arg, tuple) + assert len(new_arg) >= 3 + assert self.args is None + + name = new_arg[0] + dtype = new_arg[1] + dvalue = new_arg[2] + desc = new_arg[3] if len( + new_arg) == 4 else "Description is not provided." + + self.com_g.add_arg(name, dtype, dvalue, desc) + + return self + + def __getattr__(self, name): + if name in self.arg_config: + return self.arg_config[name] + + if name in self.json_config: + return self.json_config[name] + + if name in self.yaml_config: + return self.yaml_config[name] + + raise Warning("The argument %s is not defined." 
% name) + + def Print(self): + + print("-" * 70) + for name in self.arg_config: + print("%s:\t\t\t\t%s" % (str(name), str(self.arg_config[name]))) + + for name in self.json_config: + if name not in self.arg_config: + print("%s:\t\t\t\t%s" % + (str(name), str(self.json_config[name]))) + + for name in self.yaml_config: + if name not in self.arg_config: + print("%s:\t\t\t\t%s" % + (str(name), str(self.yaml_config[name]))) + + print("-" * 70) + + +if __name__ == "__main__": + """ + pd_config = PDConfig(json_file = "./test/bert_config.json") + pd_config.build() + + print(pd_config.do_train) + print(pd_config.hidden_size) + + pd_config = PDConfig(yaml_file = "./test/bert_config.yaml") + pd_config.build() + + print(pd_config.do_train) + print(pd_config.hidden_size) + """ + + pd_config = PDConfig(yaml_file="./test/bert_config.yaml") + pd_config += ("my_age", int, 18, "I am forever 18.") + pd_config.build() + + print(pd_config.do_train) + print(pd_config.hidden_size) + print(pd_config.my_age) diff --git a/dygraph/tsm/README.md b/dygraph/tsm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d7cfa859b7d36cf2c4ab5dbe3bbaa826f65e06a1 --- /dev/null +++ b/dygraph/tsm/README.md @@ -0,0 +1,49 @@ +# TSM 视频分类模型 + +本目录下为基于PaddlePaddle 动态图实现的 TSM视频分类模型,静态图实现请参考[TSM 视频分类模型](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/PaddleVideo/models/tsm) + +--- +## 内容 + +- [模型简介](#模型简介) +- [数据准备](#数据准备) +- [模型训练](#模型训练) +- [模型评估](#模型评估) + + +## 模型简介 + +Temporal Shift Module是由MIT和IBM Watson AI Lab的Ji Lin,Chuang Gan和Song Han等人提出的通过时间位移来提高网络视频理解能力的模块, 详细内容请参考论文[Temporal Shift Module for Efficient Video Understanding](https://arxiv.org/abs/1811.08383v1) + +## 数据准备 + +TSM的训练数据采用由DeepMind公布的Kinetics-400动作识别数据集。数据下载及准备请参考[数据说明](data/dataset/README.md) + +### 小数据集验证 + +为了便于快速迭代,我们采用了较小的数据集进行动态图训练验证,分别进行了两组实验验证: + +1. 其中包括8k大小的训练数据和2k大小的测试数据。 +2. 其中包括了十类大小的训练数据和测试数据。 + +## 模型训练 + +数据准备完毕后,可以通过如下方式启动训练: + + bash run.sh train + +## 模型评估 + +数据准备完毕后,可以通过如下方式启动训练: + + bash run.sh eval + +在从Kinetics400选取的十类的数据集下: + +|Top-1|Top-5| +|:-:|:-:| +|76.56%|98.1%| + +全量数据集精度 +Top-1 0.70 +请参考:[静态图](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/PaddleVideo) diff --git a/dygraph/tsm/config_utils.py b/dygraph/tsm/config_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..4fedd1b246b27f6e3ddfd8d12dfcec51e7737e5b --- /dev/null +++ b/dygraph/tsm/config_utils.py @@ -0,0 +1,85 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
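+
+# config_utils.py parses a yaml file into nested AttrDict objects and merges
+# non-None command line arguments into one of the train/valid/test/infer
+# sections; eval.py below uses it as:
+#     config = parse_config(args.config)
+#     test_config = merge_configs(config, 'test', vars(args))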
+ +import yaml +import logging +logger = logging.getLogger(__name__) + +CONFIG_SECS = [ + 'train', + 'valid', + 'test', + 'infer', +] + + +class AttrDict(dict): + def __getattr__(self, key): + return self[key] + + def __setattr__(self, key, value): + if key in self.__dict__: + self.__dict__[key] = value + else: + self[key] = value + + +def parse_config(cfg_file): + """Load a config file into AttrDict""" + import yaml + with open(cfg_file, 'r') as fopen: + yaml_config = AttrDict(yaml.load(fopen, Loader=yaml.Loader)) + create_attr_dict(yaml_config) + return yaml_config + + +def create_attr_dict(yaml_config): + from ast import literal_eval + for key, value in yaml_config.items(): + if type(value) is dict: + yaml_config[key] = value = AttrDict(value) + if isinstance(value, str): + try: + value = literal_eval(value) + except BaseException: + pass + if isinstance(value, AttrDict): + create_attr_dict(yaml_config[key]) + else: + yaml_config[key] = value + return + + +def merge_configs(cfg, sec, args_dict): + assert sec in CONFIG_SECS, "invalid config section {}".format(sec) + sec_dict = getattr(cfg, sec.upper()) + for k, v in args_dict.items(): + if v is None: + continue + try: + if hasattr(sec_dict, k): + setattr(sec_dict, k, v) + except: + pass + return cfg + + +def print_configs(cfg, mode): + logger.info("---------------- {:>5} Arguments ----------------".format( + mode)) + for sec, sec_items in cfg.items(): + logger.info("{}:".format(sec)) + for k, v in sec_items.items(): + logger.info(" {}:{}".format(k, v)) + logger.info("-------------------------------------------------") diff --git a/dygraph/tsm/data/dataset/README.md b/dygraph/tsm/data/dataset/README.md new file mode 100644 index 0000000000000000000000000000000000000000..55613ef3cbaf9715ba89c231b85614a29d57a136 --- /dev/null +++ b/dygraph/tsm/data/dataset/README.md @@ -0,0 +1,78 @@ +# 数据使用说明 + +## Kinetics数据集 + +Kinetics数据集是DeepMind公开的大规模视频动作识别数据集,有Kinetics400与Kinetics600两个版本。这里使用Kinetics400数据集,具体的数据预处理过程如下。 + +### mp4视频下载 +在Code\_Root目录下创建文件夹 + + cd $Code_Root/data/dataset && mkdir kinetics + + cd kinetics && mkdir data_k400 && cd data_k400 + + mkdir train_mp4 && mkdir val_mp4 + +ActivityNet官方提供了Kinetics的下载工具,具体参考其[官方repo ](https://github.com/activitynet/ActivityNet/tree/master/Crawler/Kinetics)即可下载Kinetics400的mp4视频集合。将kinetics400的训练与验证集合分别下载到data/dataset/kinetics/data\_k400/train\_mp4与data/dataset/kinetics/data\_k400/val\_mp4。 + +### mp4文件预处理 + +为提高数据读取速度,提前将mp4文件解帧并打pickle包,dataloader从视频的pkl文件中读取数据(该方法耗费更多存储空间)。pkl文件里打包的内容为(video-id, label, [frame1, frame2,...,frameN])。 + +在 data/dataset/kinetics/data\_k400目录下创建目录train\_pkl和val\_pkl + + cd $Code_Root/data/dataset/kinetics/data_k400 + + mkdir train_pkl && mkdir val_pkl + +进入$Code\_Root/data/dataset/kinetics目录,使用video2pkl.py脚本进行数据转化。首先需要下载[train](https://github.com/activitynet/ActivityNet/tree/master/Crawler/Kinetics/data/kinetics-400_train.csv)和[validation](https://github.com/activitynet/ActivityNet/tree/master/Crawler/Kinetics/data/kinetics-400_val.csv)数据集的文件列表。 + +首先生成预处理需要的数据集标签文件 + + python generate_label.py kinetics-400_train.csv kinetics400_label.txt + +然后执行如下程序: + + python video2pkl.py kinetics-400_train.csv $Source_dir $Target_dir 8 #以8个进程为例 + +- 该脚本依赖`ffmpeg`库,请预先安装`ffmpeg` + +对于train数据, + + Source_dir = $Code_Root/data/dataset/kinetics/data_k400/train_mp4 + + Target_dir = $Code_Root/data/dataset/kinetics/data_k400/train_pkl + +对于val数据, + + Source_dir = $Code_Root/data/dataset/kinetics/data_k400/val_mp4 + + Target_dir = $Code_Root/data/dataset/kinetics/data_k400/val_pkl + 
+这样即可将mp4文件解码并保存为pkl文件。 + +### 生成训练和验证集list +·· + cd $Code_Root/data/dataset/kinetics + + ls $Code_Root/data/dataset/kinetics/data_k400/train_pkl/* > train.list + + ls $Code_Root/data/dataset/kinetics/data_k400/val_pkl/* > val.list + + ls $Code_Root/data/dataset/kinetics/data_k400/val_pkl/* > test.list + + ls $Code_Root/data/dataset/kinetics/data_k400/val_pkl/* > infer.list + +即可生成相应的文件列表,train.list和val.list的每一行表示一个pkl文件的绝对路径,示例如下: + + /ssd1/user/models/PaddleCV/PaddleVideo/data/dataset/kinetics/data_k400/train_pkl/data_batch_100-097 + /ssd1/user/models/PaddleCV/PaddleVideo/data/dataset/kinetics/data_k400/train_pkl/data_batch_100-114 + /ssd1/user/models/PaddleCV/PaddleVideo/data/dataset/kinetics/data_k400/train_pkl/data_batch_100-118 + ... + +或者 + + /ssd1/user/models/PaddleCV/PaddleVideo/data/dataset/kinetics/data_k400/val_pkl/data_batch_102-085 + /ssd1/user/models/PaddleCV/PaddleVideo/data/dataset/kinetics/data_k400/val_pkl/data_batch_102-086 + /ssd1/user/models/PaddleCV/PaddleVideo/data/dataset/kinetics/data_k400/val_pkl/data_batch_102-090 + ... diff --git a/dygraph/tsm/data/dataset/kinetics/generate_label.py b/dygraph/tsm/data/dataset/kinetics/generate_label.py new file mode 100644 index 0000000000000000000000000000000000000000..d7608e86244c305bc31aa341d34320b71034c2e2 --- /dev/null +++ b/dygraph/tsm/data/dataset/kinetics/generate_label.py @@ -0,0 +1,44 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import sys + +# kinetics-400_train.csv should be down loaded first and set as sys.argv[1] +# sys.argv[2] can be set as kinetics400_label.txt +# python generate_label.py kinetics-400_train.csv kinetics400_label.txt + +num_classes = 400 + +fname = sys.argv[1] +outname = sys.argv[2] +fl = open(fname).readlines() +fl = fl[1:] +outf = open(outname, 'w') + +label_list = [] +for line in fl: + label = line.strip().split(',')[0].strip('"') + if label in label_list: + continue + else: + label_list.append(label) + +assert len(label_list + ) == num_classes, "there should be {} labels in list, but ".format( + num_classes, len(label_list)) + +label_list.sort() +for i in range(num_classes): + outf.write('{} {}'.format(label_list[i], i) + '\n') + +outf.close() diff --git a/dygraph/tsm/data/dataset/kinetics/video2pkl.py b/dygraph/tsm/data/dataset/kinetics/video2pkl.py new file mode 100644 index 0000000000000000000000000000000000000000..78d1b09b7bf6efb7f96535fa66bee2762bbccc5d --- /dev/null +++ b/dygraph/tsm/data/dataset/kinetics/video2pkl.py @@ -0,0 +1,87 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. 
+#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +import os +import sys +import glob +try: + import cPickle as pickle +except: + import pickle +from multiprocessing import Pool + +# example command line: python generate_k400_pkl.py kinetics-400_train.csv 8 +# +# kinetics-400_train.csv is the training set file of K400 official release +# each line contains laebl,youtube_id,time_start,time_end,split,is_cc + +assert (len(sys.argv) == 5) + +f = open(sys.argv[1]) +source_dir = sys.argv[2] +target_dir = sys.argv[3] +num_threads = sys.argv[4] +all_video_entries = [x.strip().split(',') for x in f.readlines()] +all_video_entries = all_video_entries[1:] +f.close() + +category_label_map = {} +f = open('kinetics400_label.txt') +for line in f: + ens = line.strip().split(' ') + category = " ".join(ens[0:-1]) + label = int(ens[-1]) + category_label_map[category] = label +f.close() + + +def generate_pkl(entry): + mode = entry[4] + category = entry[0].strip('"') + category_dir = category + video_path = os.path.join( + './', + entry[1] + "_%06d" % int(entry[2]) + "_%06d" % int(entry[3]) + ".mp4") + video_path = os.path.join(source_dir, category_dir, video_path) + label = category_label_map[category] + + vid = './' + video_path.split('/')[-1].split('.')[0] + if os.path.exists(video_path): + if not os.path.exists(vid): + os.makedirs(vid) + os.system('ffmpeg -i ' + video_path + ' -q 0 ' + vid + '/%06d.jpg') + else: + print("File not exists {}".format(video_path)) + return + + images = sorted(glob.glob(vid + '/*.jpg')) + ims = [] + for img in images: + f = open(img, 'rb') + ims.append(f.read()) + f.close() + + output_pkl = vid + ".pkl" + output_pkl = os.path.join(target_dir, output_pkl) + f = open(output_pkl, 'wb') + pickle.dump((vid, label, ims), f, protocol=2) + f.close() + + os.system('rm -rf %s' % vid) + + +pool = Pool(processes=int(sys.argv[4])) +pool.map(generate_pkl, all_video_entries) +pool.close() +pool.join() diff --git a/dygraph/tsm/eval.py b/dygraph/tsm/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..6328edfb3020349327522a730729ebc734b75afb --- /dev/null +++ b/dygraph/tsm/eval.py @@ -0,0 +1,109 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
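+
+# eval.py restores trained TSM_ResNet weights with fluid.load_dygraph and
+# reports per-batch and average loss, top-1 and top-5 accuracy on the test
+# split, e.g. (arguments follow the defaults of parse_args below):
+#     python eval.py --config=tsm.yaml --weights=./final --use_gpu=True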
+ +import os +import sys +import time +import argparse +import ast +import logging +import numpy as np +import paddle.fluid as fluid +from paddle.fluid.dygraph.base import to_variable + +from model import TSM_ResNet +from config_utils import * +from reader import KineticsReader + +logging.root.handlers = [] +FORMAT = '[%(levelname)s: %(filename)s: %(lineno)4d]: %(message)s' +logging.basicConfig(level=logging.INFO, format=FORMAT, stream=sys.stdout) +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser("Paddle Video test script") + parser.add_argument( + '--config', + type=str, + default='tsm.yaml', + help='path to config file of model') + parser.add_argument( + '--batch_size', + type=int, + default=None, + help='test batch size. None to use config file setting.') + parser.add_argument( + '--use_gpu', + type=ast.literal_eval, + default=True, + help='default use gpu.') + parser.add_argument( + '--weights', type=str, default="./final", help="weight path") + args = parser.parse_args() + return args + + +def test(args): + # parse config + config = parse_config(args.config) + test_config = merge_configs(config, 'test', vars(args)) + print_configs(test_config, 'Test') + place = fluid.CUDAPlace(0) + + with fluid.dygraph.guard(place): + video_model = TSM_ResNet("TSM", test_config) + + model_dict, _ = fluid.load_dygraph(args.weights) + video_model.set_dict(model_dict) + + test_reader = KineticsReader(mode='test', cfg=test_config) + test_reader = test_reader.create_reader() + + video_model.eval() + total_loss = 0.0 + total_acc1 = 0.0 + total_acc5 = 0.0 + total_sample = 0 + + for batch_id, data in enumerate(test_reader()): + x_data = np.array([item[0] for item in data]) + y_data = np.array([item[1] for item in data]).reshape([-1, 1]) + + imgs = to_variable(x_data) + labels = to_variable(y_data) + labels.stop_gradient = True + outputs = video_model(imgs) + loss = fluid.layers.cross_entropy( + input=outputs, label=labels, ignore_index=-1) + + avg_loss = fluid.layers.mean(loss) + + acc_top1 = fluid.layers.accuracy(input=outputs, label=labels, k=1) + acc_top5 = fluid.layers.accuracy(input=outputs, label=labels, k=5) + total_loss += avg_loss.numpy() + total_acc1 += acc_top1.numpy() + total_acc5 += acc_top5.numpy() + total_sample += 1 + print('TEST iter {}, loss = {}, acc1 {}, acc5 {}'.format( + batch_id, avg_loss.numpy(), acc_top1.numpy(), acc_top5.numpy())) + print('Finish loss {}, acc1 {}, acc5 {}'.format( + total_loss / total_sample, total_acc1 / total_sample, total_acc5 / + total_sample)) + + +if __name__ == "__main__": + args = parse_args() + logger.info(args) + test(args) diff --git a/dygraph/tsm/model.py b/dygraph/tsm/model.py new file mode 100644 index 0000000000000000000000000000000000000000..3fb64164523fe2a1ace88bafd28de0ce763c1445 --- /dev/null +++ b/dygraph/tsm/model.py @@ -0,0 +1,171 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
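+
+# model.py defines TSM_ResNet: a ResNet-50 style backbone whose bottleneck
+# blocks first apply fluid.layers.temporal_shift across the seg_num segments
+# (shift ratio 1/8) before the 1x1-3x3-1x1 convolutions; segment features are
+# averaged over the segment axis before the final softmax Linear layer.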
+ +import os +import time +import sys +import paddle.fluid as fluid +from paddle.fluid.layer_helper import LayerHelper +from paddle.fluid.dygraph.nn import Conv2D, Pool2D, BatchNorm, Linear +import math + + +class ConvBNLayer(fluid.dygraph.Layer): + def __init__(self, + num_channels, + num_filters, + filter_size, + stride=1, + groups=1, + act=None): + super(ConvBNLayer, self).__init__() + + self._conv = Conv2D( + num_channels=num_channels, + num_filters=num_filters, + filter_size=filter_size, + stride=stride, + padding=(filter_size - 1) // 2, + groups=None, + act=None, + param_attr=fluid.param_attr.ParamAttr(), + bias_attr=False) + + self._batch_norm = BatchNorm( + num_filters, + act=act, + param_attr=fluid.param_attr.ParamAttr(), + bias_attr=fluid.param_attr.ParamAttr()) + + def forward(self, inputs): + y = self._conv(inputs) + y = self._batch_norm(y) + + return y + + +class BottleneckBlock(fluid.dygraph.Layer): + def __init__(self, + num_channels, + num_filters, + stride, + shortcut=True, + seg_num=8): + super(BottleneckBlock, self).__init__() + + self.conv0 = ConvBNLayer( + num_channels=num_channels, + num_filters=num_filters, + filter_size=1, + act='relu') + self.conv1 = ConvBNLayer( + num_channels=num_filters, + num_filters=num_filters, + filter_size=3, + stride=stride, + act='relu') + self.conv2 = ConvBNLayer( + num_channels=num_filters, + num_filters=num_filters * 4, + filter_size=1, + act=None) + + if not shortcut: + self.short = ConvBNLayer( + num_channels=num_channels, + num_filters=num_filters * 4, + filter_size=1, + stride=stride) + self.shortcut = shortcut + self.seg_num = seg_num + self._num_channels_out = int(num_filters * 4) + + def forward(self, inputs): + shifts = fluid.layers.temporal_shift(inputs, self.seg_num, 1.0 / 8) + y = self.conv0(shifts) + conv1 = self.conv1(y) + conv2 = self.conv2(conv1) + if self.shortcut: + short = inputs + else: + short = self.short(inputs) + y = fluid.layers.elementwise_add(x=short, y=conv2, act="relu") + return y + + +class TSM_ResNet(fluid.dygraph.Layer): + def __init__(self, name_scope, config): + super(TSM_ResNet, self).__init__(name_scope) + + self.layers = config.MODEL.num_layers + self.seg_num = config.MODEL.seg_num + self.class_dim = config.MODEL.num_classes + + if self.layers == 50: + depth = [3, 4, 6, 3] + else: + raise NotImplementedError + num_filters = [64, 128, 256, 512] + + self.conv = ConvBNLayer( + num_channels=3, num_filters=64, filter_size=7, stride=2, act='relu') + self.pool2d_max = Pool2D( + pool_size=3, pool_stride=2, pool_padding=1, pool_type='max') + + self.bottleneck_block_list = [] + num_channels = 64 + + for block in range(len(depth)): + shortcut = False + for i in range(depth[block]): + bottleneck_block = self.add_sublayer( + 'bb_%d_%d' % (block, i), + BottleneckBlock( + num_channels=num_channels, + num_filters=num_filters[block], + stride=2 if i == 0 and block != 0 else 1, + shortcut=shortcut, + seg_num=self.seg_num)) + num_channels = int(bottleneck_block._num_channels_out) + self.bottleneck_block_list.append(bottleneck_block) + shortcut = True + self.pool2d_avg = Pool2D( + pool_size=7, pool_type='avg', global_pooling=True) + + import math + stdv = 1.0 / math.sqrt(2048 * 1.0) + + self.out = Linear( + 2048, + self.class_dim, + act="softmax", + param_attr=fluid.param_attr.ParamAttr( + initializer=fluid.initializer.Uniform(-stdv, stdv)), + bias_attr=fluid.param_attr.ParamAttr( + learning_rate=2.0, regularizer=fluid.regularizer.L2Decay(0.))) + + def forward(self, inputs): + y = fluid.layers.reshape( + inputs, [-1, 
inputs.shape[2], inputs.shape[3], inputs.shape[4]]) + y = self.conv(y) + y = self.pool2d_max(y) + for bottleneck_block in self.bottleneck_block_list: + y = bottleneck_block(y) + y = self.pool2d_avg(y) + y = fluid.layers.dropout(y, dropout_prob=0.5) + y = fluid.layers.reshape(y, [-1, self.seg_num, y.shape[1]]) + y = fluid.layers.reduce_mean(y, dim=1) + y = fluid.layers.reshape(y, shape=[-1, 2048]) + y = self.out(y) + return y diff --git a/dygraph/tsm/reader.py b/dygraph/tsm/reader.py new file mode 100644 index 0000000000000000000000000000000000000000..8634b5fef282fd04681229d026581f3f7677148d --- /dev/null +++ b/dygraph/tsm/reader.py @@ -0,0 +1,465 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +import os +import sys +import cv2 +import math +import random +import functools +try: + import cPickle as pickle + from cStringIO import StringIO +except ImportError: + import pickle + from io import BytesIO +import numpy as np +import paddle +from PIL import Image, ImageEnhance +import logging + +logger = logging.getLogger(__name__) +python_ver = sys.version_info + + +class KineticsReader(): + """ + Data reader for kinetics dataset of two format mp4 and pkl. + 1. mp4, the original format of kinetics400 + 2. pkl, the mp4 was decoded previously and stored as pkl + In both case, load the data, and then get the frame data in the form of numpy and label as an integer. 
+ dataset cfg: format + num_classes + seg_num + short_size + target_size + num_reader_threads + buf_size + image_mean + image_std + batch_size + list + """ + + def __init__(self, mode, cfg): + + self.mode = mode + self.format = cfg.MODEL.format + self.num_classes = cfg.MODEL.num_classes + self.seg_num = cfg.MODEL.seg_num + self.seglen = cfg.MODEL.seglen + + self.short_size = cfg[mode.upper()]['short_size'] + self.target_size = cfg[mode.upper()]['target_size'] + self.num_reader_threads = cfg[mode.upper()]['num_reader_threads'] + self.buf_size = cfg[mode.upper()]['buf_size'] + + self.img_mean = np.array(cfg.MODEL.image_mean).reshape( + [3, 1, 1]).astype(np.float32) + self.img_std = np.array(cfg.MODEL.image_std).reshape( + [3, 1, 1]).astype(np.float32) + # set batch size and file list + self.batch_size = cfg[mode.upper()]['batch_size'] + self.filelist = cfg[mode.upper()]['filelist'] + if self.mode == 'infer': + self.video_path = cfg[mode.upper()]['video_path'] + else: + self.video_path = '' + + def create_reader(self): + # if set video_path for inference mode, just load this single video + if (self.mode == 'infer') and (self.video_path != ''): + # load video from file stored at video_path + _reader = self._inference_reader_creator( + self.video_path, + self.mode, + seg_num=self.seg_num, + seglen=self.seglen, + short_size=self.short_size, + target_size=self.target_size, + img_mean=self.img_mean, + img_std=self.img_std) + else: + assert os.path.exists(self.filelist), \ + '{} not exist, please check the data list'.format(self.filelist) + _reader = self._reader_creator(self.filelist, self.mode, seg_num=self.seg_num, seglen = self.seglen, \ + short_size = self.short_size, target_size = self.target_size, \ + img_mean = self.img_mean, img_std = self.img_std, \ + shuffle = (self.mode == 'train'), \ + num_threads = self.num_reader_threads, \ + buf_size = self.buf_size, format = self.format) + + def _batch_reader(): + batch_out = [] + for imgs, label in _reader(): + if imgs is None: + continue + batch_out.append((imgs, label)) + if len(batch_out) == self.batch_size: + yield batch_out + batch_out = [] + + return _batch_reader + + def _inference_reader_creator(self, video_path, mode, seg_num, seglen, + short_size, target_size, img_mean, img_std): + def reader(): + try: + imgs = mp4_loader(video_path, seg_num, seglen, mode) + if len(imgs) < 1: + logger.error('{} frame length {} less than 1.'.format( + video_path, len(imgs))) + yield None, None + except: + logger.error('Error when loading {}'.format(mp4_path)) + yield None, None + + imgs_ret = imgs_transform(imgs, mode, seg_num, seglen, short_size, + target_size, img_mean, img_std) + label_ret = video_path + + yield imgs_ret, label_ret + + return reader + + def _reader_creator(self, + pickle_list, + mode, + seg_num, + seglen, + short_size, + target_size, + img_mean, + img_std, + shuffle=False, + num_threads=1, + buf_size=1024, + format='pkl'): + def decode_mp4(sample, mode, seg_num, seglen, short_size, target_size, + img_mean, img_std): + sample = sample[0].split(' ') + mp4_path = sample[0] + # when infer, we store vid as label + label = int(sample[1]) + try: + imgs = mp4_loader(mp4_path, seg_num, seglen, mode) + if len(imgs) < 1: + logger.error('{} frame length {} less than 1.'.format( + mp4_path, len(imgs))) + return None, None + except: + logger.error('Error when loading {}'.format(mp4_path)) + return None, None + + return imgs_transform(imgs, mode, seg_num, seglen, \ + short_size, target_size, img_mean, img_std ), label + + def decode_pickle(sample, 
mode, seg_num, seglen, short_size, + target_size, img_mean, img_std): + pickle_path = sample[0] + try: + if python_ver < (3, 0): + data_loaded = pickle.load(open(pickle_path, 'rb')) + else: + data_loaded = pickle.load( + open(pickle_path, 'rb'), encoding='bytes') + + vid, label, frames = data_loaded + if len(frames) < 1: + logger.error('{} frame length {} less than 1.'.format( + pickle_path, len(frames))) + return None, None + except: + logger.info('Error when loading {}'.format(pickle_path)) + return None, None + + if mode == 'train' or mode == 'valid' or mode == 'test': + ret_label = label + elif mode == 'infer': + ret_label = vid + + imgs = video_loader(frames, seg_num, seglen, mode) + return imgs_transform(imgs, mode, seg_num, seglen, \ + short_size, target_size, img_mean, img_std), ret_label + + def reader(): + with open(pickle_list) as flist: + lines = [line.strip() for line in flist] + if shuffle: + random.shuffle(lines) + for line in lines: + pickle_path = line.strip() + yield [pickle_path] + + if format == 'pkl': + decode_func = decode_pickle + elif format == 'mp4': + decode_func = decode_mp4 + else: + raise "Not implemented format {}".format(format) + + mapper = functools.partial( + decode_func, + mode=mode, + seg_num=seg_num, + seglen=seglen, + short_size=short_size, + target_size=target_size, + img_mean=img_mean, + img_std=img_std) + + return paddle.reader.xmap_readers(mapper, reader, num_threads, buf_size) + + +def imgs_transform(imgs, mode, seg_num, seglen, short_size, target_size, + img_mean, img_std): + imgs = group_scale(imgs, short_size) + + if mode == 'train': + #if name == "TSM": + imgs = group_multi_scale_crop(imgs, short_size) + imgs = group_random_crop(imgs, target_size) + imgs = group_random_flip(imgs) + else: + imgs = group_center_crop(imgs, target_size) + + np_imgs = (np.array(imgs[0]).astype('float32').transpose( + (2, 0, 1))).reshape(1, 3, target_size, target_size) / 255 + for i in range(len(imgs) - 1): + img = (np.array(imgs[i + 1]).astype('float32').transpose( + (2, 0, 1))).reshape(1, 3, target_size, target_size) / 255 + np_imgs = np.concatenate((np_imgs, img)) + imgs = np_imgs + imgs -= img_mean + imgs /= img_std + imgs = np.reshape(imgs, (seg_num, seglen * 3, target_size, target_size)) + return imgs + +def group_multi_scale_crop(img_group, target_size, scales=None, \ + max_distort=1, fix_crop=True, more_fix_crop=True): + scales = scales if scales is not None else [1, .875, .75, .66] + input_size = [target_size, target_size] + + im_size = img_group[0].size + + # get random crop offset + def _sample_crop_size(im_size): + image_w, image_h = im_size[0], im_size[1] + + base_size = min(image_w, image_h) + crop_sizes = [int(base_size * x) for x in scales] + crop_h = [ + input_size[1] if abs(x - input_size[1]) < 3 else x + for x in crop_sizes + ] + crop_w = [ + input_size[0] if abs(x - input_size[0]) < 3 else x + for x in crop_sizes + ] + + pairs = [] + for i, h in enumerate(crop_h): + for j, w in enumerate(crop_w): + if abs(i - j) <= max_distort: + pairs.append((w, h)) + crop_pair = random.choice(pairs) + if not fix_crop: + w_offset = random.randint(0, image_w - crop_pair[0]) + h_offset = random.randint(0, image_h - crop_pair[1]) + else: + w_step = (image_w - crop_pair[0]) / 4 + h_step = (image_h - crop_pair[1]) / 4 + + ret = list() + ret.append((0, 0)) # upper left + if w_step != 0: + ret.append((4 * w_step, 0)) # upper right + if h_step != 0: + ret.append((0, 4 * h_step)) # lower left + if h_step != 0 and w_step != 0: + ret.append((4 * w_step, 4 * h_step)) # 
lower right + if h_step != 0 or w_step != 0: + ret.append((2 * w_step, 2 * h_step)) # center + + if more_fix_crop: + ret.append((0, 2 * h_step)) # center left + ret.append((4 * w_step, 2 * h_step)) # center right + ret.append((2 * w_step, 4 * h_step)) # lower center + ret.append((2 * w_step, 0 * h_step)) # upper center + + ret.append((1 * w_step, 1 * h_step)) # upper left quarter + ret.append((3 * w_step, 1 * h_step)) # upper right quarter + ret.append((1 * w_step, 3 * h_step)) # lower left quarter + ret.append((3 * w_step, 3 * h_step)) # lower righ quarter + + w_offset, h_offset = random.choice(ret) + + return crop_pair[0], crop_pair[1], w_offset, h_offset + + crop_w, crop_h, offset_w, offset_h = _sample_crop_size(im_size) + crop_img_group = [ + img.crop((offset_w, offset_h, offset_w + crop_w, offset_h + crop_h)) + for img in img_group + ] + ret_img_group = [ + img.resize((input_size[0], input_size[1]), Image.BILINEAR) + for img in crop_img_group + ] + + return ret_img_group + + +def group_random_crop(img_group, target_size): + w, h = img_group[0].size + th, tw = target_size, target_size + + assert (w >= target_size) and (h >= target_size), \ + "image width({}) and height({}) should be larger than crop size".format(w, h, target_size) + + out_images = [] + x1 = random.randint(0, w - tw) + y1 = random.randint(0, h - th) + + for img in img_group: + if w == tw and h == th: + out_images.append(img) + else: + out_images.append(img.crop((x1, y1, x1 + tw, y1 + th))) + + return out_images + + +def group_random_flip(img_group): + v = random.random() + if v < 0.5: + ret = [img.transpose(Image.FLIP_LEFT_RIGHT) for img in img_group] + return ret + else: + return img_group + + +def group_center_crop(img_group, target_size): + img_crop = [] + for img in img_group: + w, h = img.size + th, tw = target_size, target_size + assert (w >= target_size) and (h >= target_size), \ + "image width({}) and height({}) should be larger than crop size".format(w, h, target_size) + x1 = int(round((w - tw) / 2.)) + y1 = int(round((h - th) / 2.)) + img_crop.append(img.crop((x1, y1, x1 + tw, y1 + th))) + + return img_crop + + +def group_scale(imgs, target_size): + resized_imgs = [] + for i in range(len(imgs)): + img = imgs[i] + w, h = img.size + if (w <= h and w == target_size) or (h <= w and h == target_size): + resized_imgs.append(img) + continue + + if w < h: + ow = target_size + oh = int(target_size * 4.0 / 3.0) + resized_imgs.append(img.resize((ow, oh), Image.BILINEAR)) + else: + oh = target_size + ow = int(target_size * 4.0 / 3.0) + resized_imgs.append(img.resize((ow, oh), Image.BILINEAR)) + + return resized_imgs + + +def imageloader(buf): + if isinstance(buf, str): + img = Image.open(StringIO(buf)) + else: + img = Image.open(BytesIO(buf)) + + return img.convert('RGB') + + +def video_loader(frames, nsample, seglen, mode): + videolen = len(frames) + average_dur = int(videolen / nsample) + + imgs = [] + for i in range(nsample): + idx = 0 + if mode == 'train': + if average_dur >= seglen: + idx = random.randint(0, average_dur - seglen) + idx += i * average_dur + elif average_dur >= 1: + idx += i * average_dur + else: + idx = i + else: + if average_dur >= seglen: + idx = (average_dur - seglen) // 2 + idx += i * average_dur + elif average_dur >= 1: + idx += i * average_dur + else: + idx = i + + for jj in range(idx, idx + seglen): + imgbuf = frames[int(jj % videolen)] + img = imageloader(imgbuf) + imgs.append(img) + + return imgs + + +def mp4_loader(filepath, nsample, seglen, mode): + cap = cv2.VideoCapture(filepath) + 
videolen = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) + sampledFrames = [] + for i in range(videolen): + ret, frame = cap.read() + # maybe first frame is empty + if ret == False: + continue + img = frame[:, :, ::-1] + sampledFrames.append(img) + average_dur = int(len(sampledFrames) / nsample) + imgs = [] + for i in range(nsample): + idx = 0 + if mode == 'train': + if average_dur >= seglen: + idx = random.randint(0, average_dur - seglen) + idx += i * average_dur + elif average_dur >= 1: + idx += i * average_dur + else: + idx = i + else: + if average_dur >= seglen: + idx = (average_dur - 1) // 2 + idx += i * average_dur + elif average_dur >= 1: + idx += i * average_dur + else: + idx = i + + for jj in range(idx, idx + seglen): + imgbuf = sampledFrames[int(jj % len(sampledFrames))] + img = Image.fromarray(imgbuf, mode='RGB') + imgs.append(img) + + return imgs diff --git a/dygraph/tsm/run.sh b/dygraph/tsm/run.sh new file mode 100644 index 0000000000000000000000000000000000000000..933ed69c0066321f5e661ddff3e88eddfa78d825 --- /dev/null +++ b/dygraph/tsm/run.sh @@ -0,0 +1,50 @@ +# examples of running programs: +# bash ./run.sh train CTCN ./configs/ctcn.yaml +# bash ./run.sh eval NEXTVLAD ./configs/nextvlad.yaml +# bash ./run.sh predict NONLOCAL ./cofings/nonlocal.yaml + +# mode should be one of [train, eval, predict, inference] +# name should be one of [AttentionCluster, AttentionLSTM, NEXTVLAD, NONLOCAL, TSN, TSM, STNET, CTCN] +# configs should be ./configs/xxx.yaml + +mode=$1 +configs="./tsm.yaml" +pretrain="" # set pretrain model path if needed +resume="" # set pretrain model path if needed +save_dir="./data/checkpoints" +use_gpu=True + +weights="" #set the path of weights to enable eval and predicut, just ignore this when training + +export CUDA_VISIBLE_DEVICES=0 +export FLAGS_fast_eager_deletion_mode=1 +export FLAGS_eager_delete_tensor_gb=0.0 +export FLAGS_fraction_of_gpu_memory_to_use=0.98 + +if [ "$mode"x == "train"x ]; then + echo $mode "TSM" $configs $resume $pretrain + if [ "$resume"x != ""x ]; then + python train.py --config=$configs \ + --resume=$resume \ + --use_gpu=$use_gpu + elif [ "$pretrain"x != ""x ]; then + python train.py --config=$configs \ + --pretrain=$pretrain \ + --use_gpu=$use_gpu + else + python train.py --config=$configs \ + --use_gpu=$use_gpu + fi +elif [ "$mode"x == "eval"x ]; then + echo $mode $name $configs $weights + if [ "$weights"x != ""x ]; then + python eval.py --config=$configs \ + --weights=$weights \ + --use_gpu=$use_gpu + else + python eval.py --config=$configs \ + --use_gpu=$use_gpu + fi +else + echo "Not implemented mode " $mode +fi diff --git a/dygraph/tsm/train.py b/dygraph/tsm/train.py new file mode 100644 index 0000000000000000000000000000000000000000..780706ce2168d09cce9a836f3484e905aa8eb569 --- /dev/null +++ b/dygraph/tsm/train.py @@ -0,0 +1,235 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
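+
+# Training entry point for the dygraph TSM model: it builds TSM_ResNet, creates
+# a Momentum optimizer with piecewise learning-rate decay (see create_optimizer
+# below), and runs the train/validation loops over KineticsReader batches.
+# A typical single-GPU invocation, assuming the Kinetics file lists configured
+# in tsm.yaml exist (see also run.sh):
+#
+#     python train.py --config=tsm.yaml --use_gpu=True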
+ +import os +import sys +import time +import argparse +import ast +import logging +import numpy as np +import paddle.fluid as fluid +from paddle.fluid.dygraph.base import to_variable +from model import TSM_ResNet +from config_utils import * +from reader import KineticsReader + +logging.root.handlers = [] +FORMAT = '[%(levelname)s: %(filename)s: %(lineno)4d]: %(message)s' +logging.basicConfig(level=logging.INFO, format=FORMAT, stream=sys.stdout) +logger = logging.getLogger(__name__) + + +def parse_args(): + parser = argparse.ArgumentParser("Paddle Video train script") + parser.add_argument( + '--config', + type=str, + default='tsm.yaml', + help='path to config file of model') + parser.add_argument( + '--batch_size', + type=int, + default=None, + help='training batch size. None to use config file setting.') + parser.add_argument( + '--pretrain', + type=str, + default=None, + help='path to pretrain weights. None to use default weights path in ~/.paddle/weights.' + ) + parser.add_argument( + '--resume', + type=str, + default=None, + help='path to resume training based on previous checkpoints. ' + 'None for not resuming any checkpoints.') + parser.add_argument( + '--use_gpu', + type=ast.literal_eval, + default=True, + help='default use gpu.') + parser.add_argument( + '--epoch', + type=int, + default=None, + help='epoch number, 0 for read from config file') + args = parser.parse_args() + return args + + +def val(epoch, model, cfg, args): + reader = KineticsReader(mode="valid", cfg=cfg) + reader = reader.create_reader() + total_loss = 0.0 + total_acc1 = 0.0 + total_acc5 = 0.0 + total_sample = 0 + + for batch_id, data in enumerate(reader()): + x_data = np.array([item[0] for item in data]) + y_data = np.array([item[1] for item in data]).reshape([-1, 1]) + imgs = to_variable(x_data) + labels = to_variable(y_data) + labels.stop_gradient = True + + outputs = model(imgs) + + loss = fluid.layers.cross_entropy( + input=outputs, label=labels, ignore_index=-1) + avg_loss = fluid.layers.mean(loss) + acc_top1 = fluid.layers.accuracy(input=outputs, label=labels, k=1) + acc_top5 = fluid.layers.accuracy(input=outputs, label=labels, k=5) + + total_loss += avg_loss.numpy()[0] + total_acc1 += acc_top1.numpy()[0] + total_acc5 += acc_top5.numpy()[0] + total_sample += 1 + + print('TEST Epoch {}, iter {}, loss = {}, acc1 {}, acc5 {}'.format( + epoch, batch_id, + avg_loss.numpy()[0], acc_top1.numpy()[0], acc_top5.numpy()[0])) + + print('Finish loss {} , acc1 {} , acc5 {}'.format( + total_loss / total_sample, total_acc1 / total_sample, total_acc5 / + total_sample)) + + +def create_optimizer(cfg, params): + total_videos = cfg.total_videos + step = int(total_videos / cfg.batch_size + 1) + bd = [e * step for e in cfg.decay_epochs] + base_lr = cfg.learning_rate + lr_decay = cfg.learning_rate_decay + lr = [base_lr, base_lr * lr_decay, base_lr * lr_decay * lr_decay] + l2_weight_decay = cfg.l2_weight_decay + momentum = cfg.momentum + + optimizer = fluid.optimizer.Momentum( + learning_rate=fluid.layers.piecewise_decay( + boundaries=bd, values=lr), + momentum=momentum, + regularization=fluid.regularizer.L2Decay(l2_weight_decay), + parameter_list=params) + + return optimizer + + +def train(args): + config = parse_config(args.config) + train_config = merge_configs(config, 'train', vars(args)) + valid_config = merge_configs(config, 'valid', vars(args)) + print_configs(train_config, 'Train') + + use_data_parallel = False + trainer_count = fluid.dygraph.parallel.Env().nranks + place = 
fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id) \ + if use_data_parallel else fluid.CUDAPlace(0) + + with fluid.dygraph.guard(place): + if use_data_parallel: + strategy = fluid.dygraph.parallel.prepare_context() + + video_model = TSM_ResNet("TSM", train_config) + + optimizer = create_optimizer(train_config.TRAIN, + video_model.parameters()) + if use_data_parallel: + video_model = fluid.dygraph.parallel.DataParallel(video_model, + strategy) + + bs_denominator = 1 + if args.use_gpu: + # check number of GPUs + gpus = os.getenv("CUDA_VISIBLE_DEVICES", "") + if gpus == "": + pass + else: + gpus = gpus.split(",") + num_gpus = len(gpus) + assert num_gpus == train_config.TRAIN.num_gpus, \ + "num_gpus({}) set by CUDA_VISIBLE_DEVICES" \ + "shoud be the same as that" \ + "set in {}({})".format( + num_gpus, args.config, train_config.TRAIN.num_gpus) + bs_denominator = train_config.TRAIN.num_gpus + + train_config.TRAIN.batch_size = int(train_config.TRAIN.batch_size / + bs_denominator) + + train_reader = KineticsReader(mode="train", cfg=train_config) + + train_reader = train_reader.create_reader() + if use_data_parallel: + train_reader = fluid.contrib.reader.distributed_batch_reader( + train_reader) + + for epoch in range(train_config.TRAIN.epoch): + video_model.train() + total_loss = 0.0 + total_acc1 = 0.0 + total_acc5 = 0.0 + total_sample = 0 + for batch_id, data in enumerate(train_reader()): + x_data = np.array([item[0] for item in data]) + y_data = np.array([item[1] for item in data]).reshape([-1, 1]) + + imgs = to_variable(x_data) + labels = to_variable(y_data) + labels.stop_gradient = True + outputs = video_model(imgs) + loss = fluid.layers.cross_entropy( + input=outputs, label=labels, ignore_index=-1) + avg_loss = fluid.layers.mean(loss) + + acc_top1 = fluid.layers.accuracy( + input=outputs, label=labels, k=1) + acc_top5 = fluid.layers.accuracy( + input=outputs, label=labels, k=5) + + if use_data_parallel: + avg_loss = video_model.scale_loss(avg_loss) + avg_loss.backward() + video_model.apply_collective_grads() + else: + avg_loss.backward() + optimizer.minimize(avg_loss) + video_model.clear_gradients() + + total_loss += avg_loss.numpy()[0] + total_acc1 += acc_top1.numpy()[0] + total_acc5 += acc_top5.numpy()[0] + total_sample += 1 + + print('TRAIN Epoch {}, iter {}, loss = {}, acc1 {}, acc5 {}'. + format(epoch, batch_id, + avg_loss.numpy()[0], + acc_top1.numpy()[0], acc_top5.numpy()[0])) + + print( + 'TRAIN End, Epoch {}, avg_loss= {}, avg_acc1= {}, avg_acc5= {}'. 
+ format(epoch, total_loss / total_sample, total_acc1 / + total_sample, total_acc5 / total_sample)) + video_model.eval() + val(epoch, video_model, valid_config, args) + + if fluid.dygraph.parallel.Env().local_rank == 0: + fluid.dygraph.save_dygraph(video_model.state_dict(), "final") + logger.info('[TRAIN] training finished') + + +if __name__ == "__main__": + args = parse_args() + logger.info(args) + train(args) diff --git a/dygraph/tsm/tsm.yaml b/dygraph/tsm/tsm.yaml new file mode 100644 index 0000000000000000000000000000000000000000..0dbafd542db35307751639f47804ea20b96b065b --- /dev/null +++ b/dygraph/tsm/tsm.yaml @@ -0,0 +1,43 @@ +MODEL: + name: "TSM" + format: "pkl" + num_classes: 400 + seg_num: 8 + seglen: 1 + image_mean: [0.485, 0.456, 0.406] + image_std: [0.229, 0.224, 0.225] + num_layers: 50 + topk: 5 + +TRAIN: + epoch: 65 + short_size: 256 + target_size: 224 + num_reader_threads: 12 + buf_size: 1024 + batch_size: 16 #128 + use_gpu: True + num_gpus: 1 #8 + filelist: "./data/dataset/kinetics/train.list" + learning_rate: 0.01 + learning_rate_decay: 0.1 + decay_epochs: [40, 60] + l2_weight_decay: 1e-4 + momentum: 0.9 + total_videos: 8000 #239781 + +VALID: + short_size: 256 + target_size: 224 + num_reader_threads: 12 + buf_size: 1024 + batch_size: 32 #128 + filelist: "./data/dataset/kinetics/val.list" + +TEST: + short_size: 256 + target_size: 224 + num_reader_threads: 12 + buf_size: 1024 + batch_size: 64 + filelist: "./data/dataset/kinetics/test.list" diff --git a/dygraph/yolov3/.gitignore b/dygraph/yolov3/.gitignore new file mode 100644 index 0000000000000000000000000000000000000000..011d9771cdb71009709e4adedff2472e31f4a1b7 --- /dev/null +++ b/dygraph/yolov3/.gitignore @@ -0,0 +1,12 @@ +*.log +*.json +*.jpg +*.png +output/ +checkpoints/ +weights/ +!weights/*.sh +dataset/coco/ +!dataset/coco/*.py +log* +output* diff --git a/dygraph/yolov3/README.md b/dygraph/yolov3/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7bd4e92d45c73d530ecc33832a1b7797c7f8dca1 --- /dev/null +++ b/dygraph/yolov3/README.md @@ -0,0 +1,216 @@ +# YOLOv3 目标检测 +--- + +本模型是[paddle_yolov3](https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/yolov3)的动态图版本 + +## 内容 + +- [简介](#简介) +- [快速开始](#快速开始) +- [进阶使用](#进阶使用) +- [FAQ](#faq) +- [参考文献](#参考文献) +- [版本更新](#版本更新) +- [如何贡献代码](#如何贡献代码) +- [作者](#作者) + +## 简介 + +[YOLOv3](https://arxiv.org/abs/1804.02767) 是由 [Joseph Redmon](https://arxiv.org/search/cs?searchtype=author&query=Redmon%2C+J) 和 [Ali Farhadi](https://arxiv.org/search/cs?searchtype=author&query=Farhadi%2C+A) 提出的单阶段检测器, 该检测器与达到同样精度的传统目标检测方法相比,推断速度能达到接近两倍. + +在我们的实现版本中使用了 [Bag of Freebies for Training Object Detection Neural Networks](https://arxiv.org/abs/1902.04103v3) 中提出的图像增强和label smooth等优化方法,精度优于darknet框架的实现版本,在COCO-2017数据集上,达到`mAP(0.50:0.95)= 38.9`的精度,比darknet实现版本的精度(33.0)要高5.9. + +同时,在推断速度方面,基于Paddle预测库的加速方法,推断速度比darknet高30%. 
+ +## 快速开始 + +### 安装 + +**安装[COCO-API](https://github.com/cocodataset/cocoapi):** + +训练前需要首先下载[COCO-API](https://github.com/cocodataset/cocoapi): + + git clone https://github.com/cocodataset/cocoapi.git + cd cocoapi/PythonAPI + # if cython is not installed + pip install Cython + # Install into global site-packages + make install + # Alternatively, if you do not have permissions or prefer + # not to install the COCO API into global site-packages + python setup.py install --user + +**安装[PaddlePaddle](https://github.com/PaddlePaddle/Paddle):** + +在当前目录下运行样例代码需要PadddlePaddle Fluid的v.1.7或以上的版本。如果你的运行环境中的PaddlePaddle低于此版本,请根据[安装文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/install/index_cn.html)中的说明来更新PaddlePaddle。 + +### 数据准备 + +**COCO数据集:** + +在[MS-COCO数据集](http://cocodataset.org/#download)上进行训练,通过如下方式下载数据集。 + +```bash +python dataset/coco/download.py +``` + +数据目录结构如下: + +``` +dataset/coco/ +├── annotations +│   ├── instances_train2014.json +│   ├── instances_train2017.json +│   ├── instances_val2014.json +│   ├── instances_val2017.json +| ... +├── train2017 +│   ├── 000000000009.jpg +│   ├── 000000580008.jpg +| ... +├── val2017 +│   ├── 000000000139.jpg +│   ├── 000000000285.jpg +| ... + +``` + +**自定义数据集:** + +用户可使用自定义的数据集,我们推荐自定义数据集使用COCO数据集格式的标注,并可通过设置`--data_dir`或修改[reader.py](./reader.py#L39)指定数据集路径。使用COCO数据集格式标注时,目录结构可参考上述COCO数据集目录结构。 + +### 模型训练 + +**下载预训练模型:** 本示例提供DarkNet-53预训练[模型](https://paddlemodels.bj.bcebos.com/yolo/darknet53.pdparams ),该模型转换自作者提供的预训练权重[pjreddie/darknet](https://pjreddie.com/media/files/darknet53.conv.74),采用如下命令下载预训练模型: + + sh ./weights/download.sh + +**注意:** Windows用户可通过`./weights/download.sh`中的链接直接下载和解压。 + +通过设置`--pretrain` 加载预训练模型。同时在fine-tune时也采用该设置加载已训练模型。 +请在训练前确认预训练模型下载与加载正确,否则训练过程中损失可能会出现NAN。 + +**开始训练:** 数据准备完毕后,可以通过如下的方式启动训练: + + python train.py \ + --model_save_dir=output/ \ + --pretrain=${path_to_pretrain_model} \ + --data_dir=${path_to_data} \ + --class_num=${category_num} + +**多卡训练:** +动态图支持多进程多卡进行模型训练,启动方式: + +首先通过设置`export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`指定8卡GPU训练。 + +`python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --started_port=9999 train.py --batch_size=16 --use_data_parallel=1` + +您也可以直接运行快速开始脚本`start_parall.sh`进行训练,默认使用4卡进行训练,每张卡的batch size为16 + +执行训练开始时,会得到类似如下输出,每次迭代打印的log数与指定卡数一致: + +``` +Iter 2, loss 9056.620443, time 3.21156 +Iter 3, loss 7720.641968, time 1.63363 +Iter 4, loss 6736.150391, time 2.70573 + +``` + +**注意:** YOLOv3模型总batch size为64,这里使用4 GPUs每GPU上batch size为16来训练 + +**模型设置:** + +* 模型使用了基于COCO数据集生成的9个先验框:10x13,16x30,33x23,30x61,62x45,59x119,116x90,156x198,373x326 +* YOLOv3模型中,若预测框不是该点最佳匹配框但是和任一ground truth框的重叠大于`ignore_thresh=0.7`,则忽略该预测框的目标性损失 + +**训练策略:** + +* 采用momentum优化算法训练YOLOv3,momentum=0.9。 +* 学习率采用warmup算法,前4000个Iter学习率从0.0线性增加至0.001。在400000,450000个Iter时使用0.1,0.01乘子进行学习率衰减,最大训练500000个Iter。 + + +下图为模型训练结果: +

+<p align="center">
+<img src="image/train_loss.png"/><br />
+Train Loss
+</p>
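+
+上述训练策略中的学习率调度可以用下面的示意代码理解(仅为示意:`lr_at` 是本说明新增的假设函数,数值与上文训练策略及 config.py 中 `learning_rate=0.001`、warmup 4000 个Iter、`lr_steps=[400000, 450000]`、`lr_gamma=0.1` 的默认值一致,并非训练脚本的实际实现):
+
+```python
+def lr_at(it, base_lr=0.001, warm_up_iter=4000,
+          lr_steps=(400000, 450000), gamma=0.1):
+    """示意:返回第 it 个Iter使用的学习率(线性warmup + 分段衰减)"""
+    if it < warm_up_iter:              # 前4000个Iter从0.0线性增加至base_lr
+        return base_lr * it / warm_up_iter
+    lr = base_lr
+    for step in lr_steps:              # 在400000、450000个Iter处乘以0.1
+        if it >= step:
+            lr *= gamma
+    return lr
+
+print(lr_at(2000), lr_at(100000), lr_at(420000))  # ≈ 0.0005, 0.001, 0.0001
+```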

+ +### 模型评估 + +模型评估是指对训练完毕的模型评估各类性能指标。本示例采用[COCO官方评估](http://cocodataset.org/#detections-eval) + + sh ./weights/download.sh + +`eval.py`是评估模块的主要执行程序,调用示例如下: + + python eval.py \ + --dataset=coco2017 \ + --weights=${path_to_weights} \ + --class_num=${category_num} + +- 通过设置`export CUDA_VISIBLE_DEVICES=0`指定单卡GPU评估。 + + +## 进阶使用 + +### 背景介绍 + +传统目标检测方法通过两阶段检测,第一阶段生成预选框,第二阶段对预选框进行分类和位置坐标的调整,而YOLO将目标检测看做是对框位置和类别概率的一个单阶段回归问题,使得YOLO能达到近两倍的检测速度。而YOLOv3在YOLO的基础上引入的多尺度预测,使得YOLOv3网络对于小物体的检测精度大幅提高。 + +### 模型概览 + +[YOLOv3](https://arxiv.org/abs/1804.02767) 是一阶段End2End的目标检测器。其目标检测原理如下图所示: +

+<p align="center">
+<img src="image/YOLOv3.jpg"/><br />
+YOLOv3检测原理
+</p>

+ +### 模型结构 + +YOLOv3将输入图像分成S\*S个格子,每个格子预测B个bounding box,每个bounding box预测内容包括: Location(x, y, w, h)、Confidence Score和C个类别的概率,因此YOLOv3输出层的channel数为B\*(5 + C)。YOLOv3的loss函数也有三部分组成:Location误差,Confidence误差和分类误差。 + +YOLOv3的网络结构如下图所示: +

+<p align="center">
+<img src="image/YOLOv3_structure.jpg"/><br />
+YOLOv3网络结构
+</p>

+ +YOLOv3 的网络结构由基础特征提取网络、multi-scale特征融合层和输出层组成。 + +1. 特征提取网络。YOLOv3使用 [DarkNet53](https://arxiv.org/abs/1612.08242)作为特征提取网络:DarkNet53 基本采用了全卷积网络,用步长为2的卷积操作替代了池化层,同时添加了 Residual 单元,避免在网络层数过深时发生梯度弥散。 + +2. 特征融合层。为了解决之前YOLO版本对小目标不敏感的问题,YOLOv3采用了3个不同尺度的特征图来进行目标检测,分别为13\*13,26\*26,52\*52,用来检测大、中、小三种目标。特征融合层选取 DarkNet 产出的三种尺度特征图作为输入,借鉴了FPN(feature pyramid networks)的思想,通过一系列的卷积层和上采样对各尺度的特征图进行融合。 + +3. 输出层。同样使用了全卷积结构,其中最后一个卷积层的卷积核个数是255:3\*(80+4+1)=255,3表示一个grid cell包含3个bounding box,4表示框的4个坐标信息,1表示Confidence Score,80表示COCO数据集中80个类别的概率。 + + +## FAQ + +**Q:** 我使用单GPU训练,训练过程中`loss=nan`,这是为什么? +**A:** YOLOv3中`learning_rate=0.001`的设置是针对总batch size为64的情况,若用户的batch size小于该值,建议调小学习率。 + +**Q:** 我训练YOLOv3速度比较慢,要怎么提速? +**A:** YOLOv3的数据增强比较复杂,速度比较慢,可通过在[reader.py](./reader.py#L284)中增加数据读取的进程数来提速。若用户是进行fine-tune,也可将`--no_mixup_iter`设置大于`--max_iter`的值来禁用mixup提升速度。 + +**Q:** 我使用YOLOv3训练两个类别的数据集,训练`loss=nan`或推断结果不符合预期,这是为什么? +**A:** `--label_smooth`参数会把所有正例的目标值设置为`1-1/class_num`,负例的目标值设为`1/class_num`,当`class_num`较小时,这个操作影响过大,可能会出现`loss=nan`或者训练结果错误,类别数较小时建议设置`--label_smooth=False`。若使用Paddle Fluid v1.5及以上版本,我们在C++代码中对这种情况作了保护,设置`--label_smooth=True`也不会出现这些问题。 + +## 参考文献 + +- [You Only Look Once: Unified, Real-Time Object Detection](https://arxiv.org/abs/1506.02640v5), Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi. +- [YOLOv3: An Incremental Improvement](https://arxiv.org/abs/1804.02767v1), Joseph Redmon, Ali Farhadi. +- [Bag of Freebies for Training Object Detection Neural Networks](https://arxiv.org/abs/1902.04103v3), Zhi Zhang, Tong He, Hang Zhang, Zhongyue Zhang, Junyuan Xie, Mu Li. + +## 版本更新 + +- 12/2019, 新增YOLOv3动态图模型 + + +## 如何贡献代码 + +如果你可以修复某个issue或者增加一个新功能,欢迎给我们提交PR。如果对应的PR被接受了,我们将根据贡献的质量和难度进行打分(0-5分,越高越好)。如果你累计获得了10分,可以联系我们获得面试机会或者为你写推荐信。 + +## 作者 + +- [heavengate](https://github.com/heavengate) +- [tink2123](https://github.com/tink2123) diff --git a/dygraph/yolov3/box_utils.py b/dygraph/yolov3/box_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..50e6bcbcc2f50aebe2cef393108c850d29116ecc --- /dev/null +++ b/dygraph/yolov3/box_utils.py @@ -0,0 +1,205 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
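+
+# Box utilities used by the data pipeline in this directory (see image_utils.py):
+# conversion between center-form [x, y, w, h] and corner-form [x1, y1, x2, y2]
+# boxes, element-wise IoU in both formats (intersection area / union area),
+# box cropping used by the random-crop augmentation, and drawing detection
+# results to ./output with matplotlib.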
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals + +import numpy as np + +import matplotlib +matplotlib.use('Agg') +from matplotlib import pyplot as plt +from PIL import Image + + +def coco_anno_box_to_center_relative(box, img_height, img_width): + """ + Convert COCO annotations box with format [x1, y1, w, h] to + center mode [center_x, center_y, w, h] and divide image width + and height to get relative value in range[0, 1] + """ + assert len(box) == 4, "box should be a len(4) list or tuple" + x, y, w, h = box + + x1 = max(x, 0) + x2 = min(x + w - 1, img_width - 1) + y1 = max(y, 0) + y2 = min(y + h - 1, img_height - 1) + + x = (x1 + x2) / 2 / img_width + y = (y1 + y2) / 2 / img_height + w = (x2 - x1) / img_width + h = (y2 - y1) / img_height + + return np.array([x, y, w, h]) + + +def clip_relative_box_in_image(x, y, w, h): + """Clip relative box coordinates x, y, w, h to [0, 1]""" + x1 = max(x - w / 2, 0.) + x2 = min(x + w / 2, 1.) + y1 = min(y - h / 2, 0.) + y2 = max(y + h / 2, 1.) + x = (x1 + x2) / 2 + y = (y1 + y2) / 2 + w = x2 - x1 + h = y2 - y1 + + +def box_xywh_to_xyxy(box): + shape = box.shape + assert shape[-1] == 4, "Box shape[-1] should be 4." + + box = box.reshape((-1, 4)) + box[:, 0], box[:, 2] = box[:, 0] - box[:, 2] / 2, box[:, 0] + box[:, 2] / 2 + box[:, 1], box[:, 3] = box[:, 1] - box[:, 3] / 2, box[:, 1] + box[:, 3] / 2 + box = box.reshape(shape) + return box + + +def box_iou_xywh(box1, box2): + assert box1.shape[-1] == 4, "Box1 shape[-1] should be 4." + assert box2.shape[-1] == 4, "Box2 shape[-1] should be 4." + + b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2 + b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2 + b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2 + b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2 + + inter_x1 = np.maximum(b1_x1, b2_x1) + inter_x2 = np.minimum(b1_x2, b2_x2) + inter_y1 = np.maximum(b1_y1, b2_y1) + inter_y2 = np.minimum(b1_y2, b2_y2) + inter_w = inter_x2 - inter_x1 + inter_h = inter_y2 - inter_y1 + inter_w[inter_w < 0] = 0 + inter_h[inter_h < 0] = 0 + + inter_area = inter_w * inter_h + b1_area = (b1_x2 - b1_x1) * (b1_y2 - b1_y1) + b2_area = (b2_x2 - b2_x1) * (b2_y2 - b2_y1) + + return inter_area / (b1_area + b2_area - inter_area) + + +def box_iou_xyxy(box1, box2): + assert box1.shape[-1] == 4, "Box1 shape[-1] should be 4." + assert box2.shape[-1] == 4, "Box2 shape[-1] should be 4." 
+ + b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3] + b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3] + + inter_x1 = np.maximum(b1_x1, b2_x1) + inter_x2 = np.minimum(b1_x2, b2_x2) + inter_y1 = np.maximum(b1_y1, b2_y1) + inter_y2 = np.minimum(b1_y2, b2_y2) + inter_w = inter_x2 - inter_x1 + inter_h = inter_y2 - inter_y1 + inter_w[inter_w < 0] = 0 + inter_h[inter_h < 0] = 0 + + inter_area = inter_w * inter_h + b1_area = (b1_x2 - b1_x1) * (b1_y2 - b1_y1) + b2_area = (b2_x2 - b2_x1) * (b2_y2 - b2_y1) + + return inter_area / (b1_area + b2_area - inter_area) + + +def box_crop(boxes, labels, scores, crop, img_shape): + x, y, w, h = map(float, crop) + im_w, im_h = map(float, img_shape) + + boxes = boxes.copy() + boxes[:, 0], boxes[:, 2] = (boxes[:, 0] - boxes[:, 2] / 2) * im_w, ( + boxes[:, 0] + boxes[:, 2] / 2) * im_w + boxes[:, 1], boxes[:, 3] = (boxes[:, 1] - boxes[:, 3] / 2) * im_h, ( + boxes[:, 1] + boxes[:, 3] / 2) * im_h + + crop_box = np.array([x, y, x + w, y + h]) + centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0 + mask = np.logical_and(crop_box[:2] <= centers, centers <= crop_box[2:]).all( + axis=1) + + boxes[:, :2] = np.maximum(boxes[:, :2], crop_box[:2]) + boxes[:, 2:] = np.minimum(boxes[:, 2:], crop_box[2:]) + boxes[:, :2] -= crop_box[:2] + boxes[:, 2:] -= crop_box[:2] + + mask = np.logical_and(mask, (boxes[:, :2] < boxes[:, 2:]).all(axis=1)) + boxes = boxes * np.expand_dims(mask.astype('float32'), axis=1) + labels = labels * mask.astype('float32') + scores = scores * mask.astype('float32') + boxes[:, 0], boxes[:, 2] = (boxes[:, 0] + boxes[:, 2]) / 2 / w, ( + boxes[:, 2] - boxes[:, 0]) / w + boxes[:, 1], boxes[:, 3] = (boxes[:, 1] + boxes[:, 3]) / 2 / h, ( + boxes[:, 3] - boxes[:, 1]) / h + + return boxes, labels, scores, mask.sum() + + +def draw_boxes_on_image(image_path, + boxes, + scores, + labels, + label_names, + score_thresh=0.5): + image = np.array(Image.open(image_path)) + plt.figure() + _, ax = plt.subplots(1) + ax.imshow(image) + + image_name = image_path.split('/')[-1] + print("Image {} detect: ".format(image_name)) + colors = {} + for box, score, label in zip(boxes, scores, labels): + if score < score_thresh: + continue + if box[2] <= box[0] or box[3] <= box[1]: + continue + label = int(label) + if label not in colors: + colors[label] = plt.get_cmap('hsv')(label / len(label_names)) + x1, y1, x2, y2 = box[0], box[1], box[2], box[3] + rect = plt.Rectangle( + (x1, y1), + x2 - x1, + y2 - y1, + fill=False, + linewidth=2.0, + edgecolor=colors[label]) + ax.add_patch(rect) + ax.text( + x1, + y1, + '{} {:.4f}'.format(label_names[label], score), + verticalalignment='bottom', + horizontalalignment='left', + bbox={'facecolor': colors[label], + 'alpha': 0.5, + 'pad': 0}, + fontsize=8, + color='white') + print("\t {:15s} at {:25} score: {:.5f}".format(label_names[int( + label)], str(list(map(int, list(box)))), score)) + image_name = image_name.replace('jpg', 'png') + plt.axis('off') + plt.gca().xaxis.set_major_locator(plt.NullLocator()) + plt.gca().yaxis.set_major_locator(plt.NullLocator()) + plt.savefig( + "./output/{}".format(image_name), bbox_inches='tight', pad_inches=0.0) + print("Detect result save at ./output/{}\n".format(image_name)) + plt.cla() + plt.close('all') diff --git a/dygraph/yolov3/config.py b/dygraph/yolov3/config.py new file mode 100644 index 0000000000000000000000000000000000000000..784cffed0f50a978881ede200cd11edc51689cce --- /dev/null +++ b/dygraph/yolov3/config.py @@ -0,0 +1,127 @@ +# Copyright (c) 2019 
PaddlePaddle Authors. All Rights Reserved. +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals +from edict import AttrDict +import six +import numpy as np + +_C = AttrDict() +cfg = _C + +# +# Training options +# + +# Snapshot period +_C.snapshot_iter = 2000 + +# min valid area for gt boxes +_C.gt_min_area = -1 + +# max target box number in an image +_C.max_box_num = 50 + +# +# Training options +# + +# valid score threshold to include boxes +_C.valid_thresh = 0.005 + +# threshold vale for box non-max suppression +_C.nms_thresh = 0.45 + +# the number of top k boxes to perform nms +_C.nms_topk = 400 + +# the number of output boxes after nms +_C.nms_posk = 100 + +# score threshold for draw box in debug mode +_C.draw_thresh = 0.5 + +# +# Model options +# + +# pixel mean values +_C.pixel_means = [0.485, 0.456, 0.406] + +# pixel std values +_C.pixel_stds = [0.229, 0.224, 0.225] + +# anchors box weight and height +_C.anchors = [ + 10, 13, 16, 30, 33, 23, 30, 61, 62, 45, 59, 119, 116, 90, 156, 198, 373, 326 +] + +# anchor mask of each yolo layer +_C.anchor_masks = [[6, 7, 8], [3, 4, 5], [0, 1, 2]] + +# IoU threshold to ignore objectness loss of pred box +_C.ignore_thresh = .7 + +# +# SOLVER options +# + +# batch size +_C.batch_size = 8 + +# derived learning rate the to get the final learning rate. +_C.learning_rate = 0.001 + +# maximum number of iterations +_C.max_iter = 500200 + +# warm up to learning rate +_C.warm_up_iter = 4000 +_C.warm_up_factor = 0. 
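+
+# With the warm-up settings above and the lr_steps / lr_gamma defaults below,
+# the learning rate is intended to ramp linearly from 0.0 to learning_rate
+# (0.001) over the first 4000 iterations and then be multiplied by 0.1 at
+# iterations 400000 and 450000, as described in the training strategy section
+# of README.md.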
+ +# lr steps_with_decay +_C.lr_steps = [400000, 450000] +_C.lr_gamma = 0.1 + +# L2 regularization hyperparameter +_C.weight_decay = 0.0005 + +# momentum with SGD +_C.momentum = 0.9 + +# +# ENV options +# + +# support both CPU and GPU +_C.use_gpu = True + +# Class number +_C.class_num = 80 + +# dataset path +_C.train_file_list = 'annotations/instances_train2017.json' +_C.train_data_dir = 'train2017' +_C.val_file_list = 'annotations/instances_val2017.json' +_C.val_data_dir = 'val2017' + + +def merge_cfg_from_args(args): + """Merge config keys, values in args into the global config.""" + for k, v in sorted(six.iteritems(vars(args))): + try: + value = eval(v) + except: + value = v + _C[k] = value diff --git a/dygraph/yolov3/data_utils.py b/dygraph/yolov3/data_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..f5c5246346a7ef4b568bbb3f3681793d36c22749 --- /dev/null +++ b/dygraph/yolov3/data_utils.py @@ -0,0 +1,168 @@ +""" +This code is based on https://github.com/fchollet/keras/blob/master/keras/utils/data_utils.py +""" + +import os +import sys +import signal +import time +import numpy as np +import threading +import multiprocessing +try: + import queue +except ImportError: + import Queue as queue + + +# handle terminate reader process, do not print stack frame +def _reader_quit(signum, frame): + print("Reader process exit.") + sys.exit() + +def _term_group(sig_num, frame): + print('pid {} terminated, terminate group ' + '{}...'.format(os.getpid(), os.getpgrp())) + os.killpg(os.getpgid(os.getpid()), signal.SIGKILL) + +signal.signal(signal.SIGTERM, _reader_quit) +signal.signal(signal.SIGINT, _term_group) + + +class GeneratorEnqueuer(object): + """ + Builds a queue out of a data generator. + + Args: + generator: a generator function which endlessly yields data + use_multiprocessing (bool): use multiprocessing if True, + otherwise use threading. + wait_time (float): time to sleep in-between calls to `put()`. + random_seed (int): Initial seed for workers, + will be incremented by one for each workers. + """ + + def __init__(self, + generator, + use_multiprocessing=False, + wait_time=0.05, + random_seed=None): + self.wait_time = wait_time + self._generator = generator + self._use_multiprocessing = use_multiprocessing + self._threads = [] + self._stop_event = None + self.queue = None + self._manager = None + self.seed = random_seed + + def start(self, workers=1, max_queue_size=10): + """ + Start worker threads which add data from the generator into the queue. + + Args: + workers (int): number of worker threads + max_queue_size (int): queue size + (when full, threads could block on `put()`) + """ + + def data_generator_task(): + """ + Data generator task. 
+ """ + + def task(): + if (self.queue is not None and + self.queue.qsize() < max_queue_size): + generator_output = next(self._generator) + self.queue.put((generator_output)) + else: + time.sleep(self.wait_time) + + if not self._use_multiprocessing: + while not self._stop_event.is_set(): + with self.genlock: + try: + task() + except Exception: + self._stop_event.set() + break + else: + while not self._stop_event.is_set(): + try: + task() + except Exception: + self._stop_event.set() + break + + try: + if self._use_multiprocessing: + self._manager = multiprocessing.Manager() + self.queue = self._manager.Queue(maxsize=max_queue_size) + self._stop_event = multiprocessing.Event() + else: + self.genlock = threading.Lock() + self.queue = queue.Queue() + self._stop_event = threading.Event() + for _ in range(workers): + if self._use_multiprocessing: + # Reset random seed else all children processes + # share the same seed + np.random.seed(self.seed) + thread = multiprocessing.Process(target=data_generator_task) + thread.daemon = True + if self.seed is not None: + self.seed += 1 + else: + thread = threading.Thread(target=data_generator_task) + self._threads.append(thread) + thread.start() + except: + self.stop() + raise + + def is_running(self): + """ + Returns: + bool: Whether the worker theads are running. + """ + return self._stop_event is not None and not self._stop_event.is_set() + + def stop(self, timeout=None): + """ + Stops running threads and wait for them to exit, if necessary. + Should be called by the same thread which called `start()`. + + Args: + timeout(int|None): maximum time to wait on `thread.join()`. + """ + if self.is_running(): + self._stop_event.set() + for thread in self._threads: + if self._use_multiprocessing: + if thread.is_alive(): + thread.join(timeout) + else: + thread.join(timeout) + if self._manager: + self._manager.shutdown() + + self._threads = [] + self._stop_event = None + self.queue = None + + def get(self): + """ + Creates a generator to extract data from the queue. + Skip the data if it is `None`. + + # Yields + tuple of data in the queue. + """ + while self.is_running(): + if not self.queue.empty(): + inputs = self.queue.get() + if inputs is not None: + yield inputs + else: + time.sleep(self.wait_time) diff --git a/dygraph/yolov3/dataset/coco/download.py b/dygraph/yolov3/dataset/coco/download.py new file mode 100644 index 0000000000000000000000000000000000000000..9df49bef6eab9d615e61e3cd429dcfdbeb5708ce --- /dev/null +++ b/dygraph/yolov3/dataset/coco/download.py @@ -0,0 +1,61 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
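+
+# Downloads the COCO 2014/2017 image archives and annotation zips listed in
+# DATASETS below, checks them against the given md5 sums via
+# paddle.dataset.common.download, and extracts them into this directory
+# (dataset/coco/). Usage, as in README.md:
+#
+#     python dataset/coco/download.py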
+ +import os +import os.path as osp +import sys +import zipfile +import logging + +from paddle.dataset.common import download + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + + +DATASETS = { + 'coco': [ + # coco2017 + ('http://images.cocodataset.org/zips/train2017.zip', + 'cced6f7f71b7629ddf16f17bbcfab6b2', ), + ('http://images.cocodataset.org/zips/val2017.zip', + '442b8da7639aecaf257c1dceb8ba8c80', ), + ('http://images.cocodataset.org/annotations/annotations_trainval2017.zip', + 'f4bbac642086de4f52a3fdda2de5fa2c', ), + # coco2014 + ('http://images.cocodataset.org/zips/train2014.zip', + '0da8c0bd3d6becc4dcb32757491aca88', ), + ('http://images.cocodataset.org/zips/val2014.zip', + 'a3d79f5ed8d289b7a7554ce06a5782b3', ), + ('http://images.cocodataset.org/annotations/annotations_trainval2014.zip', + '0a379cfc70b0e71301e0f377548639bd', ), + ], +} + + +def download_decompress_file(data_dir, url, md5): + logger.info("Downloading from {}".format(url)) + zip_file = download(url, data_dir, md5) + logger.info("Decompressing {}".format(zip_file)) + with zipfile.ZipFile(zip_file) as zf: + zf.extractall(path=data_dir) + os.remove(zip_file) + + +if __name__ == "__main__": + data_dir = osp.split(osp.realpath(sys.argv[0]))[0] + for name, infos in DATASETS.items(): + for info in infos: + download_decompress_file(data_dir, info[0], info[1]) + logger.info("Download dataset {} finished.".format(name)) diff --git a/dygraph/yolov3/dist_utils.py b/dygraph/yolov3/dist_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..9fb03f1bd351a87d758eb84133ab25b25530e864 --- /dev/null +++ b/dygraph/yolov3/dist_utils.py @@ -0,0 +1,45 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import os +import paddle.fluid as fluid + +def nccl2_prepare(trainer_id, startup_prog, main_prog): + config = fluid.DistributeTranspilerConfig() + config.mode = "nccl2" + t = fluid.DistributeTranspiler(config=config) + t.transpile(trainer_id, + trainers=os.environ.get('PADDLE_TRAINER_ENDPOINTS'), + current_endpoint=os.environ.get('PADDLE_CURRENT_ENDPOINT'), + startup_program=startup_prog, + program=main_prog) + +def prepare_for_multi_process(exe, build_strategy, train_prog): + # prepare for multi-process + trainer_id = int(os.environ.get('PADDLE_TRAINER_ID', 0)) + num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) + if num_trainers < 2: return + print("PADDLE_TRAINERS_NUM", num_trainers) + print("PADDLE_TRAINER_ID", trainer_id) + build_strategy.num_trainers = num_trainers + build_strategy.trainer_id = trainer_id + # NOTE(zcd): use multi processes to train the model, + # and each process use one GPU card. + startup_prog = fluid.Program() + nccl2_prepare(trainer_id, startup_prog, train_prog) + # the startup_prog are run two times, but it doesn't matter. 
+ exe.run(startup_prog) diff --git a/dygraph/yolov3/edict.py b/dygraph/yolov3/edict.py new file mode 100644 index 0000000000000000000000000000000000000000..552ede8e4006b5d4e90dd85d566749fd624c26d1 --- /dev/null +++ b/dygraph/yolov3/edict.py @@ -0,0 +1,37 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals + + +class AttrDict(dict): + def __init__(self, *args, **kwargs): + super(AttrDict, self).__init__(*args, **kwargs) + + def __getattr__(self, name): + if name in self.__dict__: + return self.__dict__[name] + elif name in self: + return self[name] + else: + raise AttributeError(name) + + def __setattr__(self, name, value): + if name in self.__dict__: + self.__dict__[name] = value + else: + self[name] = value diff --git a/dygraph/yolov3/eval.py b/dygraph/yolov3/eval.py new file mode 100755 index 0000000000000000000000000000000000000000..20c337af7794814c369f4aa12952c283de0a4d48 --- /dev/null +++ b/dygraph/yolov3/eval.py @@ -0,0 +1,127 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
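+
+# COCO mAP evaluation for the dygraph YOLOv3 model: runs the network over the
+# val2014/val2017 images (batch size 1), writes the detections to
+# yolov3_result.json, and scores them with pycocotools' COCOeval in 'bbox'
+# mode. Typical invocation, as in README.md:
+#
+#     python eval.py --dataset=coco2017 --weights=${path_to_weights}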
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import os +import time +import json +import numpy as np +import paddle +import paddle.fluid as fluid +from paddle.fluid.dygraph.base import to_variable +import reader +from models.yolov3 import YOLOv3 +from utility import print_arguments, parse_args, check_gpu +from pycocotools.coco import COCO +from pycocotools.cocoeval import COCOeval, Params +from config import cfg + + +def eval(): + # check if set use_gpu=True in paddlepaddle cpu version + check_gpu(cfg.use_gpu) + + if '2014' in cfg.dataset: + test_list = 'annotations/instances_val2014.json' + elif '2017' in cfg.dataset: + test_list = 'annotations/instances_val2017.json' + + if cfg.debug: + if not os.path.exists('output'): + os.mkdir('output') + + place = fluid.CUDAPlace(0) if cfg.use_gpu else fluid.CPUPlace() + with fluid.dygraph.guard(place): + model = YOLOv3(3,is_train=False) + # yapf: disable + if cfg.weights: + restore, _ = fluid.load_dygraph(cfg.weights) + model.set_dict(restore) + model.eval() + + input_size = cfg.input_size + # batch_size for test must be 1 + test_reader = reader.test(input_size, 1) + label_names, label_ids = reader.get_label_infos() + if cfg.debug: + print("Load in labels {} with ids {}".format(label_names, label_ids)) + + def get_pred_result(boxes, scores, labels, im_id): + result = [] + for box, score, label in zip(boxes, scores, labels): + x1, y1, x2, y2 = box + w = x2 - x1 + 1 + h = y2 - y1 + 1 + bbox = [x1, y1, w, h] + + res = { + 'image_id': int(im_id), + 'category_id': label_ids[int(label)], + 'bbox': list(map(float, bbox)), + 'score': float(score) + } + result.append(res) + return result + + dts_res = [] + total_time = 0 + for iter_id, data in enumerate(test_reader()): + start_time = time.time() + + img_data = np.array([x[0] for x in data]).astype('float32') + img = to_variable(img_data) + + im_id_data = np.array([x[1] for x in data]).astype('int32') + im_id = to_variable(im_id_data) + + im_shape_data = np.array([x[2] for x in data]).astype('int32') + im_shape = to_variable(im_shape_data) + + batch_outputs = model(img, None, None, None, im_id, im_shape) + nmsed_boxes = batch_outputs.numpy() + if nmsed_boxes.shape[1] != 6: + continue + + im_id = data[0][1] + nmsed_box=nmsed_boxes + labels = nmsed_box[:, 0] + scores = nmsed_box[:, 1] + boxes = nmsed_box[:, 2:6] + dts_res += get_pred_result(boxes, scores, labels, im_id) + + end_time = time.time() + print("batch id: {}, time: {}".format(iter_id, end_time - start_time)) + total_time += end_time - start_time + + with open("yolov3_result.json", 'w') as outfile: + json.dump(dts_res, outfile) + print("start evaluate detection result with coco api") + coco = COCO(os.path.join(cfg.data_dir, test_list)) + cocoDt = coco.loadRes("yolov3_result.json") + cocoEval = COCOeval(coco, cocoDt, 'bbox') + cocoEval.evaluate() + cocoEval.accumulate() + cocoEval.summarize() + print("evaluate done.") + + print("Time per batch: {}".format(total_time / iter_id)) + + +if __name__ == '__main__': + args = parse_args() + print_arguments(args) + eval() + diff --git a/dygraph/yolov3/image/000000000139.png b/dygraph/yolov3/image/000000000139.png new file mode 100644 index 0000000000000000000000000000000000000000..a2e3d5d0cd9f6c05ecef83794486410949b53762 Binary files /dev/null and b/dygraph/yolov3/image/000000000139.png differ diff --git a/dygraph/yolov3/image/000000127517.png b/dygraph/yolov3/image/000000127517.png new file mode 100644 index 
0000000000000000000000000000000000000000..ef04630142bccf1fe8be78f73c4000c02209f3e4 Binary files /dev/null and b/dygraph/yolov3/image/000000127517.png differ diff --git a/dygraph/yolov3/image/000000203864.png b/dygraph/yolov3/image/000000203864.png new file mode 100644 index 0000000000000000000000000000000000000000..8067fd8065c272f86952cd289418b4d3d1d44643 Binary files /dev/null and b/dygraph/yolov3/image/000000203864.png differ diff --git a/dygraph/yolov3/image/000000515077.png b/dygraph/yolov3/image/000000515077.png new file mode 100644 index 0000000000000000000000000000000000000000..70bbbe6f640fad5394da02e217f52f6912ee3dd3 Binary files /dev/null and b/dygraph/yolov3/image/000000515077.png differ diff --git a/dygraph/yolov3/image/YOLOv3.jpg b/dygraph/yolov3/image/YOLOv3.jpg new file mode 100644 index 0000000000000000000000000000000000000000..06b81f545247c1d542fd661f947eb0cf3edc480e Binary files /dev/null and b/dygraph/yolov3/image/YOLOv3.jpg differ diff --git a/dygraph/yolov3/image/YOLOv3_structure.jpg b/dygraph/yolov3/image/YOLOv3_structure.jpg new file mode 100644 index 0000000000000000000000000000000000000000..51bd2d1733e2f78945d3e871cb5b649aad95d633 Binary files /dev/null and b/dygraph/yolov3/image/YOLOv3_structure.jpg differ diff --git a/dygraph/yolov3/image/dog.jpg b/dygraph/yolov3/image/dog.jpg new file mode 100644 index 0000000000000000000000000000000000000000..77b0381222eaed50867643f4166092c781e56d5b Binary files /dev/null and b/dygraph/yolov3/image/dog.jpg differ diff --git a/dygraph/yolov3/image/eagle.jpg b/dygraph/yolov3/image/eagle.jpg new file mode 100644 index 0000000000000000000000000000000000000000..8b7509505b01a766bbf637dcbb1e2c5f24903ac5 Binary files /dev/null and b/dygraph/yolov3/image/eagle.jpg differ diff --git a/dygraph/yolov3/image/giraffe.jpg b/dygraph/yolov3/image/giraffe.jpg new file mode 100644 index 0000000000000000000000000000000000000000..a93e8b88398d94a7454f201372317a9414344c7c Binary files /dev/null and b/dygraph/yolov3/image/giraffe.jpg differ diff --git a/dygraph/yolov3/image/horses.jpg b/dygraph/yolov3/image/horses.jpg new file mode 100644 index 0000000000000000000000000000000000000000..3a761f46ba08ed459af026b59f6b91b6fa597dd1 Binary files /dev/null and b/dygraph/yolov3/image/horses.jpg differ diff --git a/dygraph/yolov3/image/kite.jpg b/dygraph/yolov3/image/kite.jpg new file mode 100644 index 0000000000000000000000000000000000000000..9eb325ac5fc375cb2513380087dd713be9be19d8 Binary files /dev/null and b/dygraph/yolov3/image/kite.jpg differ diff --git a/dygraph/yolov3/image/person.jpg b/dygraph/yolov3/image/person.jpg new file mode 100644 index 0000000000000000000000000000000000000000..61d377fff94d48c365b0cf18edcd4de38b229465 Binary files /dev/null and b/dygraph/yolov3/image/person.jpg differ diff --git a/dygraph/yolov3/image/scream.jpg b/dygraph/yolov3/image/scream.jpg new file mode 100644 index 0000000000000000000000000000000000000000..43f2c36a8d4df72c4f8621b377944e05f6c1fa08 Binary files /dev/null and b/dygraph/yolov3/image/scream.jpg differ diff --git a/dygraph/yolov3/image/train_loss.png b/dygraph/yolov3/image/train_loss.png new file mode 100644 index 0000000000000000000000000000000000000000..f16728e95d781d996639a35b54a944e91af6b640 Binary files /dev/null and b/dygraph/yolov3/image/train_loss.png differ diff --git a/dygraph/yolov3/image_utils.py b/dygraph/yolov3/image_utils.py new file mode 100644 index 0000000000000000000000000000000000000000..16edd255c395fa814a7cf7041be0175d1bee8bb2 --- /dev/null +++ b/dygraph/yolov3/image_utils.py @@ -0,0 +1,233 
@@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals + +import numpy as np +import cv2 +from PIL import Image, ImageEnhance +import random + +import box_utils + + +def random_distort(img): + def random_brightness(img, lower=0.5, upper=1.5): + e = np.random.uniform(lower, upper) + return ImageEnhance.Brightness(img).enhance(e) + + def random_contrast(img, lower=0.5, upper=1.5): + e = np.random.uniform(lower, upper) + return ImageEnhance.Contrast(img).enhance(e) + + def random_color(img, lower=0.5, upper=1.5): + e = np.random.uniform(lower, upper) + return ImageEnhance.Color(img).enhance(e) + + ops = [random_brightness, random_contrast, random_color] + np.random.shuffle(ops) + + img = Image.fromarray(img) + img = ops[0](img) + img = ops[1](img) + img = ops[2](img) + img = np.asarray(img) + + return img + + +def random_crop(img, + boxes, + labels, + scores, + scales=[0.3, 1.0], + max_ratio=2.0, + constraints=None, + max_trial=50): + if len(boxes) == 0: + return img, boxes + + if not constraints: + constraints = [(0.1, 1.0), (0.3, 1.0), (0.5, 1.0), (0.7, 1.0), + (0.9, 1.0), (0.0, 1.0)] + + img = Image.fromarray(img) + w, h = img.size + crops = [(0, 0, w, h)] + for min_iou, max_iou in constraints: + for _ in range(max_trial): + scale = random.uniform(scales[0], scales[1]) + aspect_ratio = random.uniform(max(1 / max_ratio, scale * scale), \ + min(max_ratio, 1 / scale / scale)) + crop_h = int(h * scale / np.sqrt(aspect_ratio)) + crop_w = int(w * scale * np.sqrt(aspect_ratio)) + crop_x = random.randrange(w - crop_w) + crop_y = random.randrange(h - crop_h) + crop_box = np.array([[(crop_x + crop_w / 2.0) / w, + (crop_y + crop_h / 2.0) / h, + crop_w / float(w), crop_h / float(h)]]) + + iou = box_utils.box_iou_xywh(crop_box, boxes) + if min_iou <= iou.min() and max_iou >= iou.max(): + crops.append((crop_x, crop_y, crop_w, crop_h)) + break + + while crops: + crop = crops.pop(np.random.randint(0, len(crops))) + crop_boxes, crop_labels, crop_scores, box_num = \ + box_utils.box_crop(boxes, labels, scores, crop, (w, h)) + if box_num < 1: + continue + img = img.crop((crop[0], crop[1], crop[0] + crop[2], + crop[1] + crop[3])).resize(img.size, Image.LANCZOS) + img = np.asarray(img) + return img, crop_boxes, crop_labels, crop_scores + img = np.asarray(img) + return img, boxes, labels, scores + + +def random_flip(img, gtboxes, thresh=0.5): + if random.random() > thresh: + img = img[:, ::-1, :] + gtboxes[:, 0] = 1.0 - gtboxes[:, 0] + return img, gtboxes + + +def random_interp(img, size, interp=None): + interp_method = [ + cv2.INTER_NEAREST, + cv2.INTER_LINEAR, + cv2.INTER_AREA, + cv2.INTER_CUBIC, + cv2.INTER_LANCZOS4, + ] + if not interp or interp not in interp_method: + interp = interp_method[random.randint(0, len(interp_method) - 1)] + h, w, _ = img.shape + im_scale_x = size / float(w) + im_scale_y = 
size / float(h) + img = cv2.resize( + img, None, None, fx=im_scale_x, fy=im_scale_y, interpolation=interp) + return img + + +def random_expand(img, + gtboxes, + max_ratio=4., + fill=None, + keep_ratio=True, + thresh=0.5): + if random.random() > thresh: + return img, gtboxes + + if max_ratio < 1.0: + return img, gtboxes + + h, w, c = img.shape + ratio_x = random.uniform(1, max_ratio) + if keep_ratio: + ratio_y = ratio_x + else: + ratio_y = random.uniform(1, max_ratio) + oh = int(h * ratio_y) + ow = int(w * ratio_x) + off_x = random.randint(0, ow - w) + off_y = random.randint(0, oh - h) + + out_img = np.zeros((oh, ow, c)) + if fill and len(fill) == c: + for i in range(c): + out_img[:, :, i] = fill[i] * 255.0 + + out_img[off_y:off_y + h, off_x:off_x + w, :] = img + gtboxes[:, 0] = ((gtboxes[:, 0] * w) + off_x) / float(ow) + gtboxes[:, 1] = ((gtboxes[:, 1] * h) + off_y) / float(oh) + gtboxes[:, 2] = gtboxes[:, 2] / ratio_x + gtboxes[:, 3] = gtboxes[:, 3] / ratio_y + + return out_img.astype('uint8'), gtboxes + + +def shuffle_gtbox(gtbox, gtlabel, gtscore): + gt = np.concatenate( + [gtbox, gtlabel[:, np.newaxis], gtscore[:, np.newaxis]], axis=1) + idx = np.arange(gt.shape[0]) + np.random.shuffle(idx) + gt = gt[idx, :] + return gt[:, :4], gt[:, 4], gt[:, 5] + + +def image_mixup(img1, gtboxes1, gtlabels1, gtscores1, img2, gtboxes2, gtlabels2, + gtscores2): + factor = np.random.beta(1.5, 1.5) + factor = max(0.0, min(1.0, factor)) + if factor >= 1.0: + return img1, gtboxes1, gtlabels1 + if factor <= 0.0: + return img2, gtboxes2, gtlabels2 + gtscores1 = gtscores1 * factor + gtscores2 = gtscores2 * (1.0 - factor) + + h = max(img1.shape[0], img2.shape[0]) + w = max(img1.shape[1], img2.shape[1]) + img = np.zeros((h, w, img1.shape[2]), 'float32') + img[:img1.shape[0], :img1.shape[1], :] = img1.astype('float32') * factor + img[:img2.shape[0], :img2.shape[1], :] += \ + img2.astype('float32') * (1.0 - factor) + gtboxes = np.zeros_like(gtboxes1) + gtlabels = np.zeros_like(gtlabels1) + gtscores = np.zeros_like(gtscores1) + + gt_valid_mask1 = np.logical_and(gtboxes1[:, 2] > 0, gtboxes1[:, 3] > 0) + gtboxes1 = gtboxes1[gt_valid_mask1] + gtlabels1 = gtlabels1[gt_valid_mask1] + gtscores1 = gtscores1[gt_valid_mask1] + gtboxes1[:, 0] = gtboxes1[:, 0] * img1.shape[1] / w + gtboxes1[:, 1] = gtboxes1[:, 1] * img1.shape[0] / h + gtboxes1[:, 2] = gtboxes1[:, 2] * img1.shape[1] / w + gtboxes1[:, 3] = gtboxes1[:, 3] * img1.shape[0] / h + + gt_valid_mask2 = np.logical_and(gtboxes2[:, 2] > 0, gtboxes2[:, 3] > 0) + gtboxes2 = gtboxes2[gt_valid_mask2] + gtlabels2 = gtlabels2[gt_valid_mask2] + gtscores2 = gtscores2[gt_valid_mask2] + gtboxes2[:, 0] = gtboxes2[:, 0] * img2.shape[1] / w + gtboxes2[:, 1] = gtboxes2[:, 1] * img2.shape[0] / h + gtboxes2[:, 2] = gtboxes2[:, 2] * img2.shape[1] / w + gtboxes2[:, 3] = gtboxes2[:, 3] * img2.shape[0] / h + + gtboxes_all = np.concatenate((gtboxes1, gtboxes2), axis=0) + gtlabels_all = np.concatenate((gtlabels1, gtlabels2), axis=0) + gtscores_all = np.concatenate((gtscores1, gtscores2), axis=0) + gt_num = min(len(gtboxes), len(gtboxes_all)) + gtboxes[:gt_num] = gtboxes_all[:gt_num] + gtlabels[:gt_num] = gtlabels_all[:gt_num] + gtscores[:gt_num] = gtscores_all[:gt_num] + return img.astype('uint8'), gtboxes, gtlabels, gtscores + + +def image_augment(img, gtboxes, gtlabels, gtscores, size, means=None): + img = random_distort(img) + img, gtboxes = random_expand(img, gtboxes, fill=means) + img, gtboxes, gtlabels, gtscores = \ + random_crop(img, gtboxes, gtlabels, gtscores) + img = 
random_interp(img, size) + img, gtboxes = random_flip(img, gtboxes) + gtboxes, gtlabels, gtscores = shuffle_gtbox(gtboxes, gtlabels, gtscores) + + return img.astype('float32'), gtboxes.astype('float32'), \ + gtlabels.astype('int32'), gtscores.astype('float32') diff --git a/dygraph/yolov3/infer.py b/dygraph/yolov3/infer.py new file mode 100644 index 0000000000000000000000000000000000000000..3a3f44cef3a8b1bade5a238e121fbf3bf5dcb120 --- /dev/null +++ b/dygraph/yolov3/infer.py @@ -0,0 +1,91 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import time +import numpy as np +import paddle +import paddle.fluid as fluid +import box_utils +import reader +from utility import print_arguments, parse_args, check_gpu +from models.yolov3 import YOLOv3 +from paddle.fluid.dygraph.base import to_variable +from pycocotools.coco import COCO +from pycocotools.cocoeval import COCOeval, Params +from config import cfg + + +def infer(): + + # check if set use_gpu=True in paddlepaddle cpu version + check_gpu(cfg.use_gpu) + + if not os.path.exists('output'): + os.mkdir('output') + place = fluid.CUDAPlace(0) if cfg.use_gpu else fluid.CPUPlace() + with fluid.dygraph.guard(place): + + model = YOLOv3(3, is_train=False) + input_size = cfg.input_size + # yapf: disable + if cfg.weights: + restore, _ = fluid.load_dygraph(cfg.weights) + model.set_dict(restore) + # yapf: enable + + # you can save inference model by following code + # fluid.io.save_inference_model("./output/yolov3", + # feeded_var_names=['image', 'im_shape'], + # target_vars=outputs, + # executor=exe) + + image_names = [] + if cfg.image_name is not None: + image_names.append(cfg.image_name) + else: + for image_name in os.listdir(cfg.image_path): + if image_name.split('.')[-1] in ['jpg', 'png']: + image_names.append(image_name) + for image_name in image_names: + infer_reader = reader.infer(input_size, + os.path.join(cfg.image_path, image_name)) + label_names, _ = reader.get_label_infos() + data = next(infer_reader()) + + img_data = np.array([x[0] for x in data]).astype('float32') + img = to_variable(img_data) + + im_shape_data = np.array([x[2] for x in data]).astype('int32') + im_shape = to_variable(im_shape_data) + + outputs = model(img, None, None, None, None, im_shape) + + bboxes = outputs.numpy() + if bboxes.shape[1] != 6: + print("No object found in {}".format(image_name)) + continue + labels = bboxes[:, 0].astype('int32') + scores = bboxes[:, 1].astype('float32') + boxes = bboxes[:, 2:].astype('float32') + + path = os.path.join(cfg.image_path, image_name) + box_utils.draw_boxes_on_image(path, boxes, scores, labels, label_names, + cfg.draw_thresh) + + +if __name__ == '__main__': + args = parse_args() + print_arguments(args) + infer() diff --git a/dygraph/yolov3/models/__init__.py b/dygraph/yolov3/models/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/dygraph/yolov3/models/darknet.py 
b/dygraph/yolov3/models/darknet.py new file mode 100755 index 0000000000000000000000000000000000000000..bddae779ab74bd6104e20d19f07a9eee0b08e407 --- /dev/null +++ b/dygraph/yolov3/models/darknet.py @@ -0,0 +1,190 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. + +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr +from paddle.fluid.regularizer import L2Decay + +from paddle.fluid.dygraph.nn import Conv2D, BatchNorm +from paddle.fluid.dygraph.base import to_variable + +class ConvBNLayer(fluid.dygraph.Layer): + def __init__(self, + ch_in, + ch_out, + filter_size=3, + stride=1, + groups=1, + padding=0, + act="leaky", + is_test=True): + super(ConvBNLayer, self).__init__() + + self.conv = Conv2D( + num_channels=ch_in, + num_filters=ch_out, + filter_size=filter_size, + stride=stride, + padding=padding, + groups=groups, + param_attr=ParamAttr( + initializer=fluid.initializer.Normal(0., 0.02)), + bias_attr=False, + act=None) + self.batch_norm = BatchNorm( + num_channels=ch_out, + is_test=is_test, + param_attr=ParamAttr( + initializer=fluid.initializer.Normal(0., 0.02), + regularizer=L2Decay(0.)), + bias_attr=ParamAttr( + initializer=fluid.initializer.Constant(0.0), + regularizer=L2Decay(0.))) + + self.act = act + + def forward(self, inputs): + out = self.conv(inputs) + out = self.batch_norm(out) + if self.act == 'leaky': + out = fluid.layers.leaky_relu(x=out, alpha=0.1) + return out + +class DownSample(fluid.dygraph.Layer): + def __init__(self, + ch_in, + ch_out, + filter_size=3, + stride=2, + padding=1, + is_test=True): + + super(DownSample, self).__init__() + + self.conv_bn_layer = ConvBNLayer( + ch_in=ch_in, + ch_out=ch_out, + filter_size=filter_size, + stride=stride, + padding=padding, + is_test=is_test) + self.ch_out = ch_out + def forward(self, inputs): + out = self.conv_bn_layer(inputs) + return out + +class BasicBlock(fluid.dygraph.Layer): + def __init__(self, ch_in, ch_out, is_test=True): + super(BasicBlock, self).__init__() + + self.conv1 = ConvBNLayer( + ch_in=ch_in, + ch_out=ch_out, + filter_size=1, + stride=1, + padding=0, + is_test=is_test + ) + self.conv2 = ConvBNLayer( + ch_in=ch_out, + ch_out=ch_out*2, + filter_size=3, + stride=1, + padding=1, + is_test=is_test + ) + def forward(self, inputs): + conv1 = self.conv1(inputs) + conv2 = self.conv2(conv1) + out = fluid.layers.elementwise_add(x=inputs, y=conv2, act=None) + return out + +class LayerWarp(fluid.dygraph.Layer): + def __init__(self, ch_in, ch_out, count, is_test=True): + super(LayerWarp,self).__init__() + + self.basicblock0 = BasicBlock(ch_in, + ch_out, + is_test=is_test) + self.res_out_list = [] + for i in range(1,count): + res_out = self.add_sublayer("basic_block_%d" % (i), + BasicBlock( + ch_out*2, + ch_out, + is_test=is_test)) + self.res_out_list.append(res_out) + self.ch_out = ch_out + def forward(self,inputs): + y = self.basicblock0(inputs) + for basic_block_i in self.res_out_list: + y = basic_block_i(y) + return y + + +DarkNet_cfg = {53: 
([1, 2, 8, 8, 4])} + + +class DarkNet53_conv_body(fluid.dygraph.Layer): + def __init__(self, + ch_in=3, + is_test=True): + super(DarkNet53_conv_body, self).__init__() + self.stages = DarkNet_cfg[53] + self.stages = self.stages[0:5] + + self.conv0 = ConvBNLayer( + ch_in=ch_in, + ch_out=32, + filter_size=3, + stride=1, + padding=1, + is_test=is_test) + + self.downsample0 = DownSample( + ch_in=32, + ch_out=32 * 2, + is_test=is_test) + self.darknet53_conv_block_list = [] + self.downsample_list = [] + ch_in = [64,128,256,512,1024] + for i, stage in enumerate(self.stages): + conv_block = self.add_sublayer( + "stage_%d" % (i), + LayerWarp( + int(ch_in[i]), + 32*(2**i), + stage, + is_test=is_test)) + self.darknet53_conv_block_list.append(conv_block) + for i in range(len(self.stages) - 1): + downsample = self.add_sublayer( + "stage_%d_downsample" % i, + DownSample( + ch_in = 32*(2**(i+1)), + ch_out = 32*(2**(i+2)), + is_test=is_test)) + self.downsample_list.append(downsample) + def forward(self,inputs): + + out = self.conv0(inputs) + out = self.downsample0(out) + blocks = [] + for i, conv_block_i in enumerate(self.darknet53_conv_block_list): + out = conv_block_i(out) + blocks.append(out) + if i < len(self.stages) - 1: + out = self.downsample_list[i](out) + return blocks[-1:-4:-1] + diff --git a/dygraph/yolov3/models/yolov3.py b/dygraph/yolov3/models/yolov3.py new file mode 100755 index 0000000000000000000000000000000000000000..b49c9f63a75da3f0d24ce12e2f06993457bc74ad --- /dev/null +++ b/dygraph/yolov3/models/yolov3.py @@ -0,0 +1,241 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. 
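A minimal sketch, assuming the paddle.fluid 1.x dygraph API this code targets and that it is run from the `dygraph/yolov3/` directory, of how the DarkNet-53 backbone defined in `models/darknet.py` above can be probed on its own (the dummy input below is illustrative only):

```
# Minimal sketch: probe the DarkNet-53 backbone on a dummy input.
# Assumes the paddle.fluid 1.x dygraph API used by this code and that the
# script is run from dygraph/yolov3/ so that models/ is importable.
import numpy as np
import paddle.fluid as fluid
from paddle.fluid.dygraph.base import to_variable

from models.darknet import DarkNet53_conv_body

with fluid.dygraph.guard(fluid.CPUPlace()):
    backbone = DarkNet53_conv_body(ch_in=3, is_test=True)
    # Dummy 608x608 RGB batch (608 is the default --input_size).
    x = to_variable(np.random.rand(1, 3, 608, 608).astype('float32'))
    # forward() returns blocks[-1:-4:-1]: deepest feature map first, i.e.
    # strides 32, 16 and 8 with 1024, 512 and 256 channels respectively.
    c5, c4, c3 = backbone(x)
    print(c5.shape, c4.shape, c3.shape)
    # expected: [1, 1024, 19, 19] [1, 512, 38, 38] [1, 256, 76, 76]
```

These three feature maps are what the YOLOv3 detection heads below consume, with the two coarser maps upsampled and concatenated into the finer ones.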
+ +from __future__ import division +from __future__ import print_function + +import paddle.fluid as fluid +from paddle.fluid.param_attr import ParamAttr +from paddle.fluid.initializer import Constant +from paddle.fluid.initializer import Normal +from paddle.fluid.regularizer import L2Decay + +from config import cfg + +from paddle.fluid.dygraph.nn import Conv2D, BatchNorm +from .darknet import DarkNet53_conv_body +from .darknet import ConvBNLayer + +from paddle.fluid.dygraph.base import to_variable + +class YoloDetectionBlock(fluid.dygraph.Layer): + def __init__(self,ch_in,channel,is_test=True): + super(YoloDetectionBlock, self).__init__() + + assert channel % 2 == 0, \ + "channel {} cannot be divided by 2".format(channel) + + self.conv0 = ConvBNLayer( + ch_in=ch_in, + ch_out=channel, + filter_size=1, + stride=1, + padding=0, + is_test=is_test + ) + self.conv1 = ConvBNLayer( + ch_in=channel, + ch_out=channel*2, + filter_size=3, + stride=1, + padding=1, + is_test=is_test + ) + self.conv2 = ConvBNLayer( + ch_in=channel*2, + ch_out=channel, + filter_size=1, + stride=1, + padding=0, + is_test=is_test + ) + self.conv3 = ConvBNLayer( + ch_in=channel, + ch_out=channel*2, + filter_size=3, + stride=1, + padding=1, + is_test=is_test + ) + self.route = ConvBNLayer( + ch_in=channel*2, + ch_out=channel, + filter_size=1, + stride=1, + padding=0, + is_test=is_test + ) + self.tip = ConvBNLayer( + ch_in=channel, + ch_out=channel*2, + filter_size=3, + stride=1, + padding=1, + is_test=is_test + ) + def forward(self, inputs): + out = self.conv0(inputs) + out = self.conv1(out) + out = self.conv2(out) + out = self.conv3(out) + route = self.route(out) + tip = self.tip(route) + return route, tip + + +class Upsample(fluid.dygraph.Layer): + def __init__(self,scale=2): + super(Upsample,self).__init__() + self.scale = scale + + def forward(self, inputs): + # get dynamic upsample output shape + shape_nchw = fluid.layers.shape(inputs) + shape_hw = fluid.layers.slice(shape_nchw, axes=[0], starts=[2], ends=[4]) + shape_hw.stop_gradient = True + in_shape = fluid.layers.cast(shape_hw, dtype='int32') + out_shape = in_shape * self.scale + out_shape.stop_gradient = True + + # reisze by actual_shape + out = fluid.layers.resize_nearest( + input=inputs, scale=self.scale, actual_shape=out_shape) + return out + +class YOLOv3(fluid.dygraph.Layer): + def __init__(self,ch_in,is_train=True, use_random=False): + super(YOLOv3,self).__init__() + + self.is_train = is_train + self.use_random = use_random + + self.block = DarkNet53_conv_body(ch_in=ch_in, + is_test = not self.is_train) + self.block_outputs = [] + self.yolo_blocks = [] + self.route_blocks_2 = [] + ch_in_list = [1024,768,384] + for i in range(3): + yolo_block = self.add_sublayer( + "yolo_detecton_block_%d" % (i), + YoloDetectionBlock(ch_in_list[i], + channel = 512//(2**i), + is_test = not self.is_train)) + self.yolo_blocks.append(yolo_block) + + num_filters = len(cfg.anchor_masks[i]) * (cfg.class_num + 5) + + block_out = self.add_sublayer( + "block_out_%d" % (i), + Conv2D(num_channels=1024//(2**i), + num_filters=num_filters, + filter_size=1, + stride=1, + padding=0, + act=None, + param_attr=ParamAttr( + initializer=fluid.initializer.Normal(0., 0.02)), + bias_attr=ParamAttr( + initializer=fluid.initializer.Constant(0.0), + regularizer=L2Decay(0.)))) + self.block_outputs.append(block_out) + if i < 2: + route = self.add_sublayer("route2_%d"%i, + ConvBNLayer(ch_in=512//(2**i), + ch_out=256//(2**i), + filter_size=1, + stride=1, + padding=0, + is_test=(not self.is_train))) + 
self.route_blocks_2.append(route) + self.upsample = Upsample() + + def forward(self, inputs, gtbox=None, gtlabel=None, gtscore=None, im_id=None, im_shape=None ): + self.outputs = [] + self.boxes = [] + self.scores = [] + self.losses = [] + self.downsample = 32 + blocks = self.block(inputs) + for i, block in enumerate(blocks): + if i > 0: + block = fluid.layers.concat(input=[route, block], axis=1) + route, tip = self.yolo_blocks[i](block) + block_out = self.block_outputs[i](tip) + self.outputs.append(block_out) + + if i < 2: + route = self.route_blocks_2[i](route) + route = self.upsample(route) + self.gtbox = gtbox + self.gtlabel = gtlabel + self.gtscore = gtscore + self.im_id = im_id + self.im_shape = im_shape + + # cal loss + for i,out in enumerate(self.outputs): + anchor_mask = cfg.anchor_masks[i] + if self.is_train: + loss = fluid.layers.yolov3_loss( + x=out, + gt_box=self.gtbox, + gt_label=self.gtlabel, + gt_score=self.gtscore, + anchors=cfg.anchors, + anchor_mask=anchor_mask, + class_num=cfg.class_num, + ignore_thresh=cfg.ignore_thresh, + downsample_ratio=self.downsample, + use_label_smooth=cfg.label_smooth) + self.losses.append(fluid.layers.reduce_mean(loss)) + + else: + mask_anchors = [] + for m in anchor_mask: + mask_anchors.append(cfg.anchors[2 * m]) + mask_anchors.append(cfg.anchors[2 * m + 1]) + boxes, scores = fluid.layers.yolo_box( + x=out, + img_size=self.im_shape, + anchors=mask_anchors, + class_num=cfg.class_num, + conf_thresh=cfg.valid_thresh, + downsample_ratio=self.downsample, + name="yolo_box" + str(i)) + self.boxes.append(boxes) + self.scores.append( + fluid.layers.transpose( + scores, perm=[0, 2, 1])) + self.downsample //= 2 + + + + if not self.is_train: + # get pred + yolo_boxes = fluid.layers.concat(self.boxes, axis=1) + yolo_scores = fluid.layers.concat(self.scores, axis=2) + + pred = fluid.layers.multiclass_nms( + bboxes=yolo_boxes, + scores=yolo_scores, + score_threshold=cfg.valid_thresh, + nms_top_k=cfg.nms_topk, + keep_top_k=cfg.nms_posk, + nms_threshold=cfg.nms_thresh, + background_label=-1) + return pred + else: + return sum(self.losses) + diff --git a/dygraph/yolov3/reader.py b/dygraph/yolov3/reader.py new file mode 100644 index 0000000000000000000000000000000000000000..92a7ac1a59b457076e0c165fb25ca2f30195e092 --- /dev/null +++ b/dygraph/yolov3/reader.py @@ -0,0 +1,356 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
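A minimal sketch, assuming the same fluid 1.x dygraph API and the default `config.py` settings (class_num=80, input_size=608), of the two ways the `YOLOv3` layer defined in `models/yolov3.py` above is called: with ground truth during training, where it returns the summed loss, and with only the original image shape during inference, where it returns the `multiclass_nms` output with one `[label, score, x1, y1, x2, y2]` row per kept box:

```
# Minimal sketch of the two call modes of the YOLOv3 dygraph layer.
# Assumes fluid 1.x, the default config.py values, and a run from
# dygraph/yolov3/; the box padding size of 50 is an arbitrary toy value.
import numpy as np
import paddle.fluid as fluid
from paddle.fluid.dygraph.base import to_variable

from models.yolov3 import YOLOv3

with fluid.dygraph.guard():
    img = to_variable(np.random.rand(1, 3, 608, 608).astype('float32'))

    # Training mode: gt boxes are relative [x, y, w, h], padded to a fixed slot count.
    train_model = YOLOv3(3, is_train=True)
    gt_box = to_variable(np.zeros((1, 50, 4), dtype='float32'))
    gt_label = to_variable(np.zeros((1, 50), dtype='int32'))
    gt_score = to_variable(np.ones((1, 50), dtype='float32'))
    loss = train_model(img, gt_box, gt_label, gt_score, None, None)
    print(float(loss.numpy()))

    # Inference mode: only the original image shape is needed; the output of
    # multiclass_nms has shape [M, 6] when any box survives the thresholds.
    eval_model = YOLOv3(3, is_train=False)
    im_shape = to_variable(np.array([[608, 608]], dtype='int32'))
    pred = eval_model(img, None, None, None, None, im_shape)
    print(pred.numpy().shape)
```

This mirrors how `train.py` and `infer.py` below drive the model: the training loop feeds batches from the reader, while inference feeds a single preprocessed image plus its original height and width.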
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +from __future__ import unicode_literals + +import numpy as np +import os +import sys +import random +import time +import copy +import cv2 +import box_utils +import image_utils +from pycocotools.coco import COCO +from data_utils import GeneratorEnqueuer +from config import cfg +import paddle.fluid as fluid + +class DataSetReader(object): + """A class for parsing and read COCO dataset""" + + def __init__(self): + self.has_parsed_categpry = False + + def _parse_dataset_dir(self, mode): + if 'coco2014' in cfg.dataset: + cfg.train_file_list = 'annotations/instances_train2014.json' + cfg.train_data_dir = 'train2014' + cfg.val_file_list = 'annotations/instances_val2014.json' + cfg.val_data_dir = 'val2014' + elif 'coco2017' in cfg.dataset: + cfg.train_file_list = 'annotations/instances_train2017.json' + cfg.train_data_dir = 'train2017' + cfg.val_file_list = 'annotations/instances_val2017.json' + cfg.val_data_dir = 'val2017' + else: + raise NotImplementedError('Dataset {} not supported'.format( + cfg.dataset)) + + if mode == 'train': + cfg.train_file_list = os.path.join(cfg.data_dir, + cfg.train_file_list) + cfg.train_data_dir = os.path.join(cfg.data_dir, cfg.train_data_dir) + self.COCO = COCO(cfg.train_file_list) + self.img_dir = cfg.train_data_dir + elif mode == 'test' or mode == 'infer': + cfg.val_file_list = os.path.join(cfg.data_dir, cfg.val_file_list) + cfg.val_data_dir = os.path.join(cfg.data_dir, cfg.val_data_dir) + self.COCO = COCO(cfg.val_file_list) + self.img_dir = cfg.val_data_dir + + def _parse_dataset_catagory(self): + self.categories = self.COCO.loadCats(self.COCO.getCatIds()) + self.num_category = len(self.categories) + self.label_names = [] + self.label_ids = [] + for category in self.categories: + self.label_names.append(category['name']) + self.label_ids.append(int(category['id'])) + self.category_to_id_map = {v: i for i, v in enumerate(self.label_ids)} + print("Load in {} categories.".format(self.num_category)) + if self.num_category != cfg.class_num: + raise ValueError("category number({}) in your dataset is not equal " + "to --class_num={} settting, which may incur errors in " + "eval/infer or cause precision loss.".format( + self.num_category, cfg.class_num)) + self.has_parsed_categpry = True + + def get_label_infos(self): + if not self.has_parsed_categpry: + self._parse_dataset_dir("test") + self._parse_dataset_catagory() + return (self.label_names, self.label_ids) + + def _parse_gt_annotations(self, img): + img_height = img['height'] + img_width = img['width'] + anno = self.COCO.loadAnns( + self.COCO.getAnnIds( + imgIds=img['id'], iscrowd=None)) + gt_index = 0 + for target in anno: + if target['area'] < cfg.gt_min_area: + continue + if 'ignore' in target and target['ignore']: + continue + + box = box_utils.coco_anno_box_to_center_relative( + target['bbox'], img_height, img_width) + if box[2] <= 0 and box[3] <= 0: + continue + + img['gt_boxes'][gt_index] = box + img['gt_labels'][gt_index] = \ + self.category_to_id_map[target['category_id']] + gt_index += 1 + if gt_index >= cfg.max_box_num: + break + + def _parse_images(self, is_train): + image_ids = self.COCO.getImgIds() + image_ids.sort() + imgs = copy.deepcopy(self.COCO.loadImgs(image_ids)) + for img in imgs: + img['image'] = os.path.join(self.img_dir, img['file_name']) + assert os.path.exists(img['image']), \ + "image {} not found.".format(img['image']) + box_num = cfg.max_box_num + img['gt_boxes'] = 
np.zeros((cfg.max_box_num, 4), dtype=np.float32) + img['gt_labels'] = np.zeros((cfg.max_box_num), dtype=np.int32) + for k in ['date_captured', 'url', 'license', 'file_name']: + if k in img: + del img[k] + + if is_train: + self._parse_gt_annotations(img) + + print("Loaded {0} images from {1}.".format(len(imgs), cfg.dataset)) + + return imgs + + def _parse_images_by_mode(self, mode): + if mode == 'infer': + return [] + else: + return self._parse_images(is_train=(mode == 'train')) + + def get_reader(self, + mode, + size=416, + batch_size=None, + shuffle=False, + shuffle_seed=None, + mixup_iter=0, + random_sizes=[], + image=None): + assert mode in ['train', 'test', 'infer'], "Unknow mode type!" + if mode != 'infer': + assert batch_size is not None, \ + "batch size connot be None in mode {}".format(mode) + self._parse_dataset_dir(mode) + self._parse_dataset_catagory() + + def img_reader(img, size, mean, std): + im_path = img['image'] + im = cv2.imread(im_path).astype('float32') + im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB) + + h, w, _ = im.shape + im_scale_x = size / float(w) + im_scale_y = size / float(h) + out_img = cv2.resize( + im, + None, + None, + fx=im_scale_x, + fy=im_scale_y, + interpolation=cv2.INTER_CUBIC) + mean = np.array(mean).reshape((1, 1, -1)) + std = np.array(std).reshape((1, 1, -1)) + out_img = (out_img / 255.0 - mean) / std + out_img = out_img.transpose((2, 0, 1)) + + return (out_img, int(img['id']), (h, w)) + + def img_reader_with_augment(img, size, mean, std, mixup_img): + im_path = img['image'] + im = cv2.imread(im_path) + im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB) + gt_boxes = img['gt_boxes'].copy() + gt_labels = img['gt_labels'].copy() + gt_scores = np.ones_like(gt_labels) + + if mixup_img: + mixup_im = cv2.imread(mixup_img['image']) + mixup_im = cv2.cvtColor(mixup_im, cv2.COLOR_BGR2RGB) + mixup_gt_boxes = np.array(mixup_img['gt_boxes']).copy() + mixup_gt_labels = np.array(mixup_img['gt_labels']).copy() + mixup_gt_scores = np.ones_like(mixup_gt_labels) + im, gt_boxes, gt_labels, gt_scores = \ + image_utils.image_mixup(im, gt_boxes, gt_labels, + gt_scores, mixup_im, mixup_gt_boxes, + mixup_gt_labels, mixup_gt_scores) + + im, gt_boxes, gt_labels, gt_scores = \ + image_utils.image_augment(im, gt_boxes, gt_labels, + gt_scores, size, mean) + + mean = np.array(mean).reshape((1, 1, -1)) + std = np.array(std).reshape((1, 1, -1)) + out_img = (im / 255.0 - mean) / std + out_img = out_img.astype('float32').transpose((2, 0, 1)) + + return (out_img, gt_boxes, gt_labels, gt_scores) + + def get_img_size(size, random_sizes=[]): + if len(random_sizes): + return np.random.choice(random_sizes) + return size + + def get_mixup_img(imgs, mixup_iter, total_iter, read_cnt): + if total_iter >= mixup_iter: + return None + + mixup_idx = np.random.randint(1, len(imgs)) + mixup_img = imgs[(read_cnt + mixup_idx) % len(imgs)] + return mixup_img + + def reader(): + if mode == 'train': + imgs = self._parse_images_by_mode(mode) + if shuffle: + if shuffle_seed is not None: + np.random.seed(shuffle_seed) + np.random.shuffle(imgs) + read_cnt = 0 + total_iter = 0 + batch_out = [] + img_size = get_img_size(size, random_sizes) + while True: + img = imgs[read_cnt % len(imgs)] + mixup_img = get_mixup_img(imgs, mixup_iter, total_iter, + read_cnt) + read_cnt += 1 + if read_cnt % len(imgs) == 0 and shuffle: + np.random.shuffle(imgs) + im, gt_boxes, gt_labels, gt_scores = \ + img_reader_with_augment(img, img_size, cfg.pixel_means, + cfg.pixel_stds, mixup_img) + batch_out.append([im, gt_boxes, gt_labels, gt_scores]) + + 
if len(batch_out) == batch_size: + yield batch_out + batch_out = [] + total_iter += 1 + img_size = get_img_size(size, random_sizes) + + elif mode == 'test': + imgs = self._parse_images_by_mode(mode) + batch_out = [] + for img in imgs: + im, im_id, im_shape = img_reader(img, size, cfg.pixel_means, + cfg.pixel_stds) + batch_out.append((im, im_id, im_shape)) + if len(batch_out) == batch_size: + yield batch_out + batch_out = [] + if len(batch_out) != 0: + yield batch_out + else: + img = {} + img['image'] = image + img['id'] = 0 + im, im_id, im_shape = img_reader(img, size, cfg.pixel_means, + cfg.pixel_stds) + batch_out = [(im, im_id, im_shape)] + yield batch_out + + # NOTE: yolov3 is a special model, if num_trainers > 1, each process + # trian the completed dataset. + # num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) + # if mode == 'train' and num_trainers > 1: + # assert shuffle_seed is not None, \ + # "If num_trainers > 1, the shuffle_seed must be set, because " \ + # "the order of batch data generated by reader " \ + # "must be the same in the respective processes." + # reader = fluid.contrib.reader.distributed_batch_reader(reader) + + return reader + + +dsr = DataSetReader() + + +def train(size=416, + batch_size=64, + shuffle=True, + shuffle_seed=None, + total_iter=0, + mixup_iter=0, + random_sizes=[], + num_workers=8, + max_queue=32, + use_multiprocess_reader=True): + generator = dsr.get_reader('train', size, batch_size, shuffle, shuffle_seed, + int(mixup_iter / num_workers), random_sizes) + + if not use_multiprocess_reader: + return generator + else: + if sys.platform == "win32": + print("multiprocess is not fully compatible with Windows, " + "you can set --use_multiprocess_reader=False if you " + "are training on Windows and there are errors incured " + "by multiprocess.") + print("multiprocess reader starting up, it takes a while...") + + def infinite_reader(): + while True: + for data in generator(): + yield data + + def reader(): + cnt = 0 + try: + enqueuer = GeneratorEnqueuer( + infinite_reader(), use_multiprocessing=use_multiprocess_reader) + enqueuer.start(max_queue_size=max_queue, workers=num_workers) + generator_out = None + while True: + while enqueuer.is_running(): + if not enqueuer.queue.empty(): + generator_out = enqueuer.queue.get() + break + else: + time.sleep(0.02) + yield generator_out + cnt += 1 + if cnt >= total_iter: + enqueuer.stop() + return + generator_out = None + except Exception as e: + print("Exception occured in reader: {}".format(str(e))) + finally: + if enqueuer: + enqueuer.stop() + + return reader + + +def test(size=416, batch_size=1): + return dsr.get_reader('test', size, batch_size) + + +def infer(size=416, image=None): + return dsr.get_reader('infer', size, image=image) + + +def get_label_infos(): + return dsr.get_label_infos() diff --git a/dygraph/yolov3/start_paral.sh b/dygraph/yolov3/start_paral.sh new file mode 100644 index 0000000000000000000000000000000000000000..8e9c1f6a995b4246d2ee83366c4470ac0e008b6f --- /dev/null +++ b/dygraph/yolov3/start_paral.sh @@ -0,0 +1 @@ +python -m paddle.distributed.launch --selected_gpu=0,1,2,3 --started_port=9999 train.py --batch_size=16 --use_data_parallel=1 diff --git a/dygraph/yolov3/train.py b/dygraph/yolov3/train.py new file mode 100755 index 0000000000000000000000000000000000000000..7c1548e6faea7ac143e8a2586248b207df065723 --- /dev/null +++ b/dygraph/yolov3/train.py @@ -0,0 +1,196 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import os + + +def set_paddle_flags(flags): + for key, value in flags.items(): + if os.environ.get(key, None) is None: + os.environ[key] = str(value) + + +set_paddle_flags({ + 'FLAGS_eager_delete_tensor_gb': 0, # enable gc + 'FLAGS_memory_fraction_of_eager_deletion': 1, + 'FLAGS_fraction_of_gpu_memory_to_use': 0.98 +}) + +import sys +import numpy as np +import random +import time +import shutil +import subprocess +from utility import (parse_args, print_arguments, + SmoothedValue, check_gpu) + +import paddle +import paddle.fluid as fluid +import reader +from models.yolov3 import YOLOv3 +from config import cfg +import dist_utils +from paddle.fluid.dygraph.base import to_variable + +num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1)) + + +def get_device_num(): + # NOTE(zcd): for multi-processe training, each process use one GPU card. + if num_trainers > 1: + return 1 + return fluid.core.get_cuda_device_count() + + +def train(): + # check if set use_gpu=True in paddlepaddle cpu version + check_gpu(cfg.use_gpu) + + devices_num = get_device_num() if cfg.use_gpu else 1 + print("Found {} CUDA/CPU devices.".format(devices_num)) + + if cfg.debug or args.enable_ce: + fluid.default_startup_program().random_seed = 1000 + fluid.default_main_program().random_seed = 1000 + random.seed(0) + np.random.seed(0) + + if not os.path.exists(cfg.model_save_dir): + os.makedirs(cfg.model_save_dir) + + gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0)) + place = fluid.CUDAPlace(fluid.dygraph.parallel.Env().dev_id) if cfg.use_data_parallel else fluid.CUDAPlace(0) + + with fluid.dygraph.guard(place): + if args.use_data_parallel: + strategy = fluid.dygraph.parallel.prepare_context() + model = YOLOv3(3, is_train=True) + + if cfg.pretrain: + restore, _ = fluid.load_dygraph(cfg.pretrain) + model.block.set_dict(restore) + + if cfg.finetune: + restore, _ = fluid.load_dygraph(cfg.finetune) + model.set_dict(restore, use_structured_name=True) + + if args.use_data_parallel: + model = fluid.dygraph.parallel.DataParallel(model, strategy) + + boundaries = cfg.lr_steps + gamma = cfg.lr_gamma + step_num = len(cfg.lr_steps) + learning_rate = cfg.learning_rate + values = [learning_rate * (gamma ** i) for i in range(step_num + 1)] + + lr = fluid.dygraph.PiecewiseDecay( + boundaries=boundaries, + values=values, + begin=args.start_iter) + + lr = fluid.layers.linear_lr_warmup( + learning_rate=lr, + warmup_steps=cfg.warm_up_iter, + start_lr=0.0, + end_lr=cfg.learning_rate, + ) + + optimizer = fluid.optimizer.Momentum( + learning_rate=lr, + regularization=fluid.regularizer.L2Decay(cfg.weight_decay), + momentum=cfg.momentum, + parameter_list=model.parameters() + ) + + start_time = time.time() + snapshot_loss = 0 + snapshot_time = 0 + total_sample = 0 + + input_size = cfg.input_size + shuffle = True + shuffle_seed = None + total_iter = cfg.max_iter - cfg.start_iter + 
mixup_iter = total_iter - cfg.no_mixup_iter + + random_sizes = [cfg.input_size] + if cfg.random_shape: + random_sizes = [32 * i for i in range(10,20)] + + train_reader = reader.train( + input_size, + batch_size=cfg.batch_size, + shuffle=shuffle, + shuffle_seed=shuffle_seed, + total_iter=total_iter * devices_num, + mixup_iter=mixup_iter * devices_num, + random_sizes=random_sizes, + use_multiprocess_reader=cfg.use_multiprocess_reader, + num_workers=cfg.worker_num) + + if args.use_data_parallel: + train_reader = fluid.contrib.reader.distributed_batch_reader(train_reader) + smoothed_loss = SmoothedValue() + + for iter_id, data in enumerate(train_reader()): + prev_start_time = start_time + start_time = time.time() + + img = np.array([x[0] for x in data]).astype('float32') + img = to_variable(img) + + gt_box = np.array([x[1] for x in data]).astype('float32') + gt_box = to_variable(gt_box) + + gt_label = np.array([x[2] for x in data]).astype('int32') + gt_label = to_variable(gt_label) + + gt_score = np.array([x[3] for x in data]).astype('float32') + gt_score = to_variable(gt_score) + + loss = model(img, gt_box, gt_label, gt_score, None, None) + smoothed_loss.add_value(np.mean(loss.numpy())) + snapshot_loss += loss.numpy() + snapshot_time += start_time - prev_start_time + total_sample += 1 + + print("Iter {:d}, loss {:.6f}, time {:.5f}".format( + iter_id, + smoothed_loss.get_mean_value(), + start_time-prev_start_time)) + + if args.use_data_parallel: + loss = model.scale_loss(loss) + loss.backward() + model.apply_collective_grads() + loss.backward() + + optimizer.minimize(loss) + model.clear_gradients() + + save_parameters = (not args.use_data_parallel) or ( + args.use_data_parallel and + fluid.dygraph.parallel.Env().local_rank == 0) + if save_parameters and iter_id > 1 and iter_id % cfg.snapshot_iter == 0: + fluid.save_dygraph(model.state_dict(), args.model_save_dir + "/yolov3_{}".format(iter_id)) + + +if __name__ == '__main__': + args = parse_args() + print_arguments(args) + train() diff --git a/dygraph/yolov3/utility.py b/dygraph/yolov3/utility.py new file mode 100644 index 0000000000000000000000000000000000000000..3b0b4a59efc99e020148c9049d58e447f271ed49 --- /dev/null +++ b/dygraph/yolov3/utility.py @@ -0,0 +1,156 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve. +# +#Licensed under the Apache License, Version 2.0 (the "License"); +#you may not use this file except in compliance with the License. +#You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +#Unless required by applicable law or agreed to in writing, software +#distributed under the License is distributed on an "AS IS" BASIS, +#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +#See the License for the specific language governing permissions and +#limitations under the License. +""" +Contains common utility functions. +""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import sys +import distutils.util +import numpy as np +import six +import ast +from collections import deque +import paddle.fluid as fluid +import argparse +import functools +from config import * + + +def print_arguments(args): + """Print argparse's arguments. + + Usage: + + .. code-block:: python + + parser = argparse.ArgumentParser() + parser.add_argument("name", default="Jonh", type=str, help="User name.") + args = parser.parse_args() + print_arguments(args) + + :param args: Input argparse.Namespace for printing. 
+ :type args: argparse.Namespace + """ + print("----------- Configuration Arguments -----------") + for arg, value in sorted(six.iteritems(vars(args))): + print("%s: %s" % (arg, value)) + print("------------------------------------------------") + + +def add_arguments(argname, type, default, help, argparser, **kwargs): + """Add argparse's argument. + + Usage: + + .. code-block:: python + + parser = argparse.ArgumentParser() + add_argument("name", str, "Jonh", "User name.", parser) + args = parser.parse_args() + """ + type = distutils.util.strtobool if type == bool else type + argparser.add_argument( + "--" + argname, + default=default, + type=type, + help=help + ' Default: %(default)s.', + **kwargs) + + +class SmoothedValue(object): + """Track a series of values and provide access to smoothed values over a + window or the global series average. + """ + + def __init__(self): + self.loss_sum = 0.0 + self.iter_cnt = 0 + + def add_value(self, value): + self.loss_sum += np.mean(value) + self.iter_cnt += 1 + + def get_mean_value(self): + return self.loss_sum / self.iter_cnt + + +def check_gpu(use_gpu): + """ + Log error and exit when set use_gpu=True in paddlepaddle + cpu version. + """ + err = "Config use_gpu cannot be set as True while you are " \ + "using paddlepaddle cpu version ! \nPlease try: \n" \ + "\t1. Install paddlepaddle-gpu to run model on GPU \n" \ + "\t2. Set --use_gpu=False to run model on CPU" + + try: + if use_gpu and not fluid.is_compiled_with_cuda(): + print(err) + sys.exit(1) + except Exception as e: + pass + + +def parse_args(): + """return all args + """ + parser = argparse.ArgumentParser(description=__doc__) + add_arg = functools.partial(add_arguments, argparser=parser) + # yapf: disable + # ENV + add_arg('use_gpu', bool, True, "Whether use GPU.") + add_arg('model_save_dir', str, 'checkpoints', "The path to save model.") + add_arg('pretrain', str, 'weights/darknet53', "The pretrain model path.") + add_arg('finetune', str, False, "The finetune model path.") + add_arg('weights', str, 'weights/yolov3', "The weights path.") + add_arg('dataset', str, 'coco2017', "Dataset: coco2014, coco2017.") + add_arg('class_num', int, 80, "Class number.") + add_arg('data_dir', str, 'dataset/coco', "The data root path.") + add_arg('start_iter', int, 0, "Start iteration.") + add_arg('use_multiprocess_reader', bool, True, "whether use multiprocess reader.") + add_arg('worker_num', int, 8, "worker number for multiprocess reader.") + add_arg('use_data_parallel', ast.literal_eval, False, "the flag indicating whether to use data parallel model to train the model") + #SOLVER + add_arg('batch_size', int, 8, "Mini-batch size per device.") + add_arg('learning_rate', float, 0.001, "Learning rate.") + add_arg('max_iter', int, 500200, "Iter number.") + add_arg('snapshot_iter', int, 2000, "Save model every snapshot stride.") + add_arg('label_smooth', bool, True, "Use label smooth in class label.") + add_arg('no_mixup_iter', int, 40000, "Disable mixup in last N iter.") + # TRAIN TEST INFER + add_arg('input_size', int, 608, "Image input size of YOLOv3.") + add_arg('syncbn', bool, True, "Whether to use synchronized batch normalization.") + add_arg('random_shape', bool, True, "Resize to random shape for train reader.") + add_arg('valid_thresh', float, 0.005, "Valid confidence score for NMS.") + add_arg('nms_thresh', float, 0.45, "NMS threshold.") + add_arg('nms_topk', int, 400, "The number of boxes to perform NMS.") + add_arg('nms_posk', int, 100, "The number of boxes of NMS output.") + add_arg('debug', bool, 
False, "Debug mode") + # SINGLE EVAL AND DRAW + add_arg('image_path', str, 'image', + "The image path used to inference and visualize.") + add_arg('image_name', str, None, + "The single image used to inference and visualize. None to inference all images in image_path") + add_arg('draw_thresh', float, 0.5, + "Confidence score threshold to draw prediction box in image in debug mode") + add_arg('enable_ce', bool, False, "If set True, enable continuous evaluation job.") + # yapf: enable + args = parser.parse_args() + file_name = sys.argv[0] + merge_cfg_from_args(args) + return args diff --git a/dygraph/yolov3/weights/download.sh b/dygraph/yolov3/weights/download.sh new file mode 100644 index 0000000000000000000000000000000000000000..6a3dfefe5093cc462429be946f6f5bdda737bdc3 --- /dev/null +++ b/dygraph/yolov3/weights/download.sh @@ -0,0 +1,6 @@ +DIR="$( cd "$(dirname "$0")" ; pwd -P )" +cd "$DIR" + +# Download the pretrain weights. +echo "Downloading..." +wget https://paddlemodels.bj.bcebos.com/yolo/darknet53.pdparams \ No newline at end of file diff --git a/fluid/AutoDL/LRC/README.md b/fluid/AutoDL/LRC/README.md index 546cb19169b965af5a3d0d41c903e318d4dfc64a..5a9b443ae4e6d32ee5cc16e7ce14ed22148510d1 100644 --- a/fluid/AutoDL/LRC/README.md +++ b/fluid/AutoDL/LRC/README.md @@ -3,4 +3,4 @@ Hi! This directory has been deprecated. -Please visit the project at [AutoDL/LRC](../../../AutoDL/LRC). +Please visit the project at [AutoDL/LRC](https://github.com/PaddlePaddle/AutoDL/tree/master/LRC).