Unverified commit 2b8f904e authored by chengjuntao, committed by GitHub

Add RRPN models for PaddleCV (#4148)

* add rrpn for models

Parent 1853d687
# RRPN Rotated Object Detection
---
## Contents
- [Installation](#installation)
- [Introduction](#introduction)
- [Data Preparation](#data-preparation)
- [Training](#training)
- [Evaluation](#evaluation)
- [Inference and Visualization](#inference-and-visualization)
## Installation
Running the example code in this directory requires the develop branch (or a later release) of PaddlePaddle Fluid. If the PaddlePaddle in your environment is older than that, please update it following the [installation guide](http://www.paddlepaddle.org/).
## Introduction
RRPN is a two-stage detector extended from Faster R-CNN that can be used for text detection and rotated object detection. It generates candidate regions from the image, extracts their features, classifies them, and refines the positions of the candidate boxes.
The [RRPN](https://arxiv.org/abs/1703.01086) network consists of four main parts:
1. Backbone convolution layers. As a convolutional object detector, RRPN first extracts feature maps from the image with a set of base convolution layers. These feature maps are shared by the subsequent RPN and fully connected layers. This example uses [ResNet-50](https://arxiv.org/abs/1512.03385) as the backbone.
2. Region Proposal Network (RPN). The RPN generates candidate regions (proposals). It lays out a set of oriented anchors with fixed sizes, aspect ratios and angles, classifies each rotated anchor as foreground or background with softmax, and refines the anchors with box regression to obtain accurate proposals.
3. Rotated RoI Align. This layer takes the feature maps and the oriented proposals, maps each oriented proposal onto the feature map and pools it into a fixed-size region feature, which is then fed to the fully connected layers for classification.
4. Detection head. It predicts the class of each proposal from the region features and applies box regression once more to obtain the final, precise box positions. A minimal sketch of how these four stages fit together is given below.
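The sketch below wires the four stages together with the custom ops added in this PR (`rotated_anchor_generator`, `rotated_generate_proposals`, `rotated_roi_align`). It is only an illustration under assumptions: the import path `models.ext_op.rrpn_lib` and the simple conv/fc head are placeholders; the real network is built in `models/model_builder.py` and `models/resnet.py`.
```
import paddle.fluid as fluid
# Assumed import path for the Python wrappers around the compiled rrpn_lib.so.
from models.ext_op.rrpn_lib import (rotated_anchor_generator,
                                    rotated_generate_proposals,
                                    rotated_roi_align)

def rrpn_head_sketch(feature, im_info):
    """Illustrative wiring of stages 2-4 on top of a backbone feature map."""
    num_anchors = 3 * 3 * 6  # sizes x ratios x angles, as in config.py
    # 2. RPN: per-anchor objectness scores and 5-parameter (x, y, w, h, angle) deltas
    rpn_conv = fluid.layers.conv2d(feature, num_filters=1024, filter_size=3,
                                   padding=1, act='relu')
    scores = fluid.layers.conv2d(rpn_conv, num_filters=num_anchors, filter_size=1)
    deltas = fluid.layers.conv2d(rpn_conv, num_filters=5 * num_anchors, filter_size=1)
    anchors, variances = rotated_anchor_generator(
        input=feature, anchor_sizes=[128, 256, 512],
        aspect_ratios=[0.2, 0.5, 1.0],
        angles=[-30.0, 0.0, 30.0, 60.0, 90.0, 120.0],
        variance=[1.0, 1.0, 1.0, 1.0, 1.0], stride=[16.0, 16.0])
    rois, roi_probs = rotated_generate_proposals(
        fluid.layers.sigmoid(scores), deltas, im_info, anchors, variances,
        pre_nms_top_n=6000, post_nms_top_n=1000, nms_thresh=0.7)
    # 3. Rotated RoI Align pools every oriented proposal into a fixed-size feature
    roi_feat = rotated_roi_align(feature, rois, pooled_height=14,
                                 pooled_width=14, spatial_scale=1. / 16.)
    # 4. Detection head: classification plus a second box regression
    fc = fluid.layers.fc(roi_feat, size=1024, act='relu')
    cls_score = fluid.layers.fc(fc, size=2)      # text / background for ICDAR2015
    bbox_pred = fluid.layers.fc(fc, size=2 * 5)  # 5 box parameters per class
    return rois, cls_score, bbox_pred
```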
### Building the custom OPs
Build the custom OPs as follows: enter the `ext_op/src` directory and run the build script
```
cd ext_op/src
sh make.sh ${cuda_path} ${cudnn_path} ${nccl_path}
```
where ${cuda_path}, ${cudnn_path} and ${nccl_path} are the installation paths of CUDA, cuDNN and NCCL respectively, and must be passed on the command line.
After a successful build, `rrpn_lib.so` is generated under `ext_op/src`.
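To check that the library was built and can be found at runtime, it can be loaded the same way the model code does (a quick sketch; adjust the path to where you run Python, e.g. the model code in this PR loads `models/ext_op/src/rrpn_lib.so`):
```
import paddle.fluid as fluid
# Path is relative to the working directory; the model code uses
# 'models/ext_op/src/rrpn_lib.so'.
fluid.load_op_library('ext_op/src/rrpn_lib.so')
```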
## Data Preparation
### Public dataset
Training uses the [ICDAR2015 dataset](https://rrc.cvc.uab.es/?ch=4&com=downloads); you need to register on the official website before downloading it.
The data directory layout is as follows:
```
dataset/icdar2015/
├── ch4_training_images
│ ├── img_143.jpg
│ ├── img_144.jpg
| ...
├── ch4_training_localization_transcription_gt
│ ├── gt_img_143.txt
│ ├── gt_img_144.txt
| ...
├── ch4_test_images
│ ├── img_111.jpg
│ ├── img_112.jpg
| ...
├── ch4_test_localization_transcription_gt
│ ├── gt_img_111.txt
│ ├── gt_img_112.txt
| ...
```
### Custom data
The original RRPN only supports binary classification. To train on your own data with multiple classes, change dataset to icdar2017 in utility.py and set class_num to the number of classes you need, where class 0 is the background.
When training on custom data, the directory layout is the same as for ICDAR2015, and each annotation line has the following format (a small parsing sketch follows the block below):
```
x1, y1, x2, y2, x3, y3, x4, y4, class_name
x1, y1, x2, y2, x3, y3, x4, y4, class_name
```
## Training
**Download the pretrained model:** this example provides a ResNet-50 pretrained model, which can be downloaded with the command
`sh ./pretrained/download.sh`
The pretrained model is loaded by setting `pretrained_model`; the same setting is also used to load an already trained model when fine-tuning.
Please make sure the pretrained model is downloaded and loaded correctly before training, otherwise the loss may become NaN during training.
- RRPN
```
python train.py \
   --model_save_dir=output/ \
   --pretrained_model=${path_to_pretrain_model} \
   --data_dir=${path_to_data} \
```
- Set `export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7` to train on 8 GPUs.
- Run `python train.py --help` to see all optional arguments.
**Data reader:** the data reader is defined in reader.py. Every image is resized so that its short side equals `scales`; if the long side then exceeds `max_size`, the image is rescaled again so that its long side equals `max_size`. During training, images are additionally rotated by a random angle. A standalone sketch of the scaling rule is given below.
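The following is a minimal, standalone sketch of that scaling rule, mirroring `_resize` in data_utils.py with the training defaults target_size=800 and max_size=1333 (the function name is illustrative):
```
import numpy as np

def compute_scale(im_h, im_w, target_size=800, max_size=1333):
    """Scale so the short side becomes target_size, capped so the long side <= max_size."""
    im_size_min = min(im_h, im_w)
    im_size_max = max(im_h, im_w)
    im_scale = float(target_size) / float(im_size_min)
    if np.round(im_scale * im_size_max) > max_size:
        im_scale = float(max_size) / float(im_size_max)
    return im_scale

# A 720x1280 ICDAR2015 image: 800/720 would push the long side to ~1422 > 1333,
# so the scale falls back to 1333/1280 ~= 1.041.
print(compute_scale(720, 1280))
```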
**Model settings** (the corresponding config.py defaults are excerpted below this list):
* Region features are pooled with Rotated RoI Align.
* During training `pre_nms=12000` and `post_nms=2000`; during testing `pre_nms=6000` and `post_nms=1000`. The NMS threshold is 0.7.
* When generating proposal labels, `fg_fraction=0.25`, `fg_thresh=0.5`, `bg_thresh_hi=0.5`, `bg_thresh_lo=0.0`.
* When sampling anchors in the RPN, `rpn_fg_fraction=0.5`, `rpn_positive_overlap=0.7`, `rpn_negative_overlap=0.3`.
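For reference, these settings correspond to the following defaults in this PR's config.py (excerpt; note the `fg_fractrion` key is spelled this way in the file):
```
_C.TRAIN.rpn_pre_nms_top_n = 12000
_C.TRAIN.rpn_post_nms_top_n = 2000
_C.TEST.rpn_pre_nms_top_n = 6000
_C.TEST.rpn_post_nms_top_n = 1000
_C.TRAIN.rpn_nms_thresh = 0.7
_C.TRAIN.fg_fractrion = 0.25
_C.TRAIN.fg_thresh = 0.5
_C.TRAIN.bg_thresh_hi = 0.5
_C.TRAIN.bg_thresh_lo = 0.0
_C.TRAIN.rpn_fg_fraction = 0.5
_C.TRAIN.rpn_positive_overlap = 0.7
_C.TRAIN.rpn_negative_overlap = 0.3
```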
**Training strategy** (a sketch of the resulting learning-rate schedule follows this list):
* The default configuration uses 8 GPUs with a batch size of 1 per GPU.
* Training uses the momentum optimizer with momentum=0.9.
* The weight decay coefficient is 0.02. During the first 500 iterations the learning rate increases linearly from 0.00333 to 0.01; at iterations 6250 and 12500 it is decayed by factors of 0.1 and 0.01, and training runs for at most 17500 iterations. The maximum number of iterations and the learning-rate schedule can be changed through max_iter and lr_steps in config.py.
* The learning rate of the convolution biases outside the backbone is twice the global learning rate.
* In the backbone, the affine channel (affine_layers) parameters are not updated, and the res2 stage parameters are not updated.
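A pure-Python sketch of that schedule, using the values described above (warm_up_iter=500, start_factor=1/3, lr_steps=[6250, 12500], lr_gamma=0.1 in config.py); the helper name is illustrative:
```
def learning_rate_at(it, base_lr=0.01, warmup_iter=500, start_factor=1. / 3,
                     steps=(6250, 12500), gamma=0.1):
    """Linear warmup from base_lr * start_factor to base_lr, then step decay."""
    if it < warmup_iter:
        start_lr = base_lr * start_factor
        return start_lr + (base_lr - start_lr) * it / warmup_iter
    num_decays = sum(1 for s in steps if it >= s)
    return base_lr * (gamma ** num_decays)

# iteration 0 -> 0.00333..., 500 -> 0.01, 6250 -> 0.001, 12500 -> 0.0001
```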
## Evaluation
Evaluation measures the performance of a trained model. This example uses the [official ICDAR2015 evaluation](https://rrc.cvc.uab.es/?com=contestant).
`eval.py` is the main entry point for evaluation and is invoked as follows:
- RRPN
```
python eval.py \
    --dataset=icdar2015 \
    --pretrained_model=${path_to_trained_model}
```
- Set `--pretrained_model=${path_to_trained_model}` to the trained model, not the initial pretrained model.
- Set `export CUDA_VISIBLE_DEVICES=0` to evaluate on a single GPU.
The evaluation results are shown in the table below:
RRPN
| Model | Batch size | Iterations | F1 |
| :--------------- | :------------: | :------------------: |------: |
| [RRPN](https://paddleseg.bj.bcebos.com/deploy/temp/model_final.tar) | 8 | 17500 | 0.8048 |
## Inference and Visualization
Inference returns the objects in an image together with their classes. `infer.py` is the main entry point and is invoked as follows:
```
python infer.py \
   --pretrained_model=${path_to_trained_model} \
   --image_path=dataset/icdar2015 \
   --draw_threshold=0.6
```
Note: make sure the model path `${path_to_trained_model}` and the image path are set correctly. The GPU is used by default; set `--use_gpu=False` to run on the CPU. The score threshold `draw_threshold` controls how many detection boxes are drawn.
Visualized predictions are shown below:
<p align="center">
<img src="image/img_120.jpg" height=576 width=1024 hspace='10'/>
<img src="image/img_119.jpg" height=576 width=1024 hspace='10'/> <br />
RRPN visualized predictions
</p>
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import errno
import os
import shutil
import time
import numpy as np
import re
import paddle.fluid as fluid
import logging
logger = logging.getLogger(__name__)
def load_params(exe, prog, path):
"""
Load model from the given path.
Args:
exe (fluid.Executor): The fluid.Executor object.
prog (fluid.Program): load weight to which Program object.
        path (string): URL string or local model path.
"""
if not os.path.exists(path):
raise ValueError("Model pretrain path {} does not "
"exists.".format(path))
logger.info('Loading parameters from {}...'.format(path))
def _if_exist(var):
param_exist = os.path.exists(os.path.join(path, var.name))
do_load = param_exist
if do_load:
logger.debug('load weight {}'.format(var.name))
return do_load
fluid.io.load_vars(exe, path, prog, predicate=_if_exist)
def save(exe, prog, path):
"""
    Save model to the given path.
Args:
exe (fluid.Executor): The fluid.Executor object.
prog (fluid.Program): save weight from which Program object.
path (string): the path to save model.
"""
if os.path.isdir(path):
shutil.rmtree(path)
logger.info('Save model to {}.'.format(path))
fluid.io.save_persistables(exe, path, prog)
def load_and_fusebn(exe, prog, path):
"""
Fuse params of batch norm to scale and bias.
Args:
exe (fluid.Executor): The fluid.Executor object.
        prog (fluid.Program): the Program object to load weights into.
        path (string): the path to load the model from.
"""
    logger.info('Loading model and fusing batch norm (if present) from {}...'.format(
        path))
if not os.path.exists(path):
raise ValueError("Model path {} does not exists.".format(path))
def _if_exist(var):
b = os.path.exists(os.path.join(path, var.name))
if b:
logger.debug('load weight {}'.format(var.name))
return b
all_vars = list(filter(_if_exist, prog.list_vars()))
# Since the program uses affine-channel, there is no running mean and var
# in the program, here append running mean and var.
# NOTE, the params of batch norm should be like:
# x_scale
# x_offset
# x_mean
# x_variance
# x is any prefix
mean_variances = set()
bn_vars = []
bn_in_path = True
inner_prog = fluid.Program()
inner_start_prog = fluid.Program()
inner_block = inner_prog.global_block()
with fluid.program_guard(inner_prog, inner_start_prog):
for block in prog.blocks:
ops = list(block.ops)
if not bn_in_path:
break
for op in ops:
if op.type == 'affine_channel':
# remove 'scale' as prefix
scale_name = op.input('Scale')[0] # _scale
bias_name = op.input('Bias')[0] # _offset
prefix = scale_name[:-5]
mean_name = prefix + 'mean'
variance_name = prefix + 'variance'
if not os.path.exists(os.path.join(path, mean_name)):
bn_in_path = False
break
if not os.path.exists(os.path.join(path, variance_name)):
bn_in_path = False
break
bias = block.var(bias_name)
mean_vb = inner_block.create_var(
name=mean_name,
type=bias.type,
shape=bias.shape,
dtype=bias.dtype,
persistable=True)
variance_vb = inner_block.create_var(
name=variance_name,
type=bias.type,
shape=bias.shape,
dtype=bias.dtype,
persistable=True)
mean_variances.add(mean_vb)
mean_variances.add(variance_vb)
bn_vars.append(
[scale_name, bias_name, mean_name, variance_name])
if not bn_in_path:
fluid.io.load_vars(exe, path, prog, vars=all_vars)
        logger.warning(
            "There are no batch norm parameters in model {}. "
            "Skipping batch norm fusion; parameters loaded.".format(path))
return
# load running mean and running variance on cpu place into global scope.
place = fluid.CPUPlace()
exe_cpu = fluid.Executor(place)
fluid.io.load_vars(exe_cpu, path, vars=[v for v in mean_variances])
# load params on real place into global scope.
fluid.io.load_vars(exe, path, prog, vars=all_vars)
eps = 1e-5
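    # Fold the loaded batch-norm statistics into the affine_channel parameters:
    #   y = scale * (x - mean) / sqrt(var + eps) + bias
    # becomes y = new_scale * x + new_bias, with
    #   new_scale = scale / sqrt(var + eps)
    #   new_bias  = bias - mean * new_scale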
for names in bn_vars:
scale_name, bias_name, mean_name, var_name = names
scale = fluid.global_scope().find_var(scale_name).get_tensor()
bias = fluid.global_scope().find_var(bias_name).get_tensor()
mean = fluid.global_scope().find_var(mean_name).get_tensor()
var = fluid.global_scope().find_var(var_name).get_tensor()
scale_arr = np.array(scale)
bias_arr = np.array(bias)
mean_arr = np.array(mean)
var_arr = np.array(var)
bn_std = np.sqrt(np.add(var_arr, eps))
new_scale = np.float32(np.divide(scale_arr, bn_std))
new_bias = bias_arr - mean_arr * new_scale
# fuse to scale and bias in affine_channel
scale.set(new_scale, exe.place)
bias.set(new_bias, exe.place)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from edict import AttrDict
import six
import numpy as np
_C = AttrDict()
cfg = _C
#
# Training options
#
_C.TRAIN = AttrDict()
# scales an image's shortest side
_C.TRAIN.scales = [800]
# max size of longest side
_C.TRAIN.max_size = 1333
# images per GPU in minibatch
_C.TRAIN.im_per_batch = 1
# roi minibatch size per image
_C.TRAIN.batch_size_per_im = 256
# target fraction of foreground roi minibatch
_C.TRAIN.fg_fractrion = 0.25
# overlap threshold for a foreground roi
_C.TRAIN.fg_thresh = 0.5
# overlap threshold for a background roi
_C.TRAIN.bg_thresh_hi = 0.5
_C.TRAIN.bg_thresh_lo = 0.0
# If False, only resize image and not pad, image shape is different between
# GPUs in one mini-batch. If True, image shape is the same in one mini-batch.
_C.TRAIN.padding_minibatch = False
# Snapshot period
_C.TRAIN.snapshot_iter = 1000
# number of RPN proposals to keep before NMS
_C.TRAIN.rpn_pre_nms_top_n = 12000
# number of RPN proposals to keep after NMS
_C.TRAIN.rpn_post_nms_top_n = 2000
# NMS threshold used on RPN proposals
_C.TRAIN.rpn_nms_thresh = 0.7
# min size in RPN proposals
_C.TRAIN.rpn_min_size = 0.0
# eta for adaptive NMS in RPN
_C.TRAIN.rpn_eta = 1.0
# number of RPN examples per image
_C.TRAIN.rpn_batch_size_per_im = 256
# remove anchors out of the image
_C.TRAIN.rpn_straddle_thresh = 0.
# target fraction of foreground examples per RPN minibatch
_C.TRAIN.rpn_fg_fraction = 0.5
# min overlap between anchor and gt box to be a positive example
_C.TRAIN.rpn_positive_overlap = 0.7
# max overlap between anchor and gt box to be a negative example
_C.TRAIN.rpn_negative_overlap = 0.3
# stopgrad at a specified stage
_C.TRAIN.freeze_at = 2
# min area of ground truth box
_C.TRAIN.gt_min_area = -1
#
# Inference options
#
_C.TEST = AttrDict()
# scales an image's shortest side
_C.TEST.scales = [800]
# max size of longest side
_C.TEST.max_size = 1333
# eta for adaptive NMS in RPN
_C.TEST.rpn_eta = 1.0
# min score threshold to infer
_C.TEST.score_thresh = 0.01
# overlap threshold used for NMS
_C.TEST.nms_thresh = 0.3
# number of RPN proposals to keep before NMS
_C.TEST.rpn_pre_nms_top_n = 6000
# number of RPN proposals to keep after NMS
_C.TEST.rpn_post_nms_top_n = 1000
# min size in RPN proposals
_C.TEST.rpn_min_size = 0.0
# max number of detections
_C.TEST.detections_per_im = 300
# NMS threshold used on RPN proposals
_C.TEST.rpn_nms_thresh = 0.7
#
# Model options
#
# Whether use mask rcnn head
_C.MASK_ON = True
# weight for bbox regression targets
_C.bbox_reg_weights = [10.0, 10.0, 5.0, 5.0, 1.0]
# RPN anchor sizes
_C.anchor_sizes = [128, 256, 512]
# RPN anchor ratio
_C.aspect_ratio = [0.2, 0.5, 1.0]
# RPN anchor angle
_C.anchor_angle = [-30.0, 0.0, 30.0, 60.0, 90.0, 120.0]
# variance of anchors
_C.variances = [1., 1., 1., 1., 1.]
# stride of feature map
_C.rpn_stride = [16.0, 16.0]
# pooled width and pooled height
_C.roi_resolution = 14
# spatial scale
_C.spatial_scale = 1. / 16.
# resolution to represent rotated roi align
_C.resolution = 14
#
# SOLVER options
#
# base learning rate used to derive the final learning rate
_C.learning_rate = 0.01
# maximum number of iterations
_C.max_iter = 140000
# warm up to learning rate
_C.warm_up_iter = 500
_C.start_factor = 1. / 3
# lr steps_with_decay
_C.lr_steps = [6250, 12500]
_C.lr_gamma = 0.1
# L2 regularization hyperparameter
_C.weight_decay = 0.0001
# momentum with SGD
_C.momentum = 0.9
#
# ENV options
#
# support both CPU and GPU
_C.use_gpu = True
# Whether use parallel
_C.parallel = True
# Class number
_C.class_num = 81
# support pyreader
_C.use_pyreader = True
_C.TRAIN.min_size = 800
_C.TRAIN.max_size = 1333
_C.TEST.min_size = 1000
# pixel mean values
_C.pixel_means = [0.485, 0.456, 0.406]
_C.pixel_std = [0.229, 0.224, 0.225]
# clip box to prevent overflowing
_C.bbox_clip = np.log(1000. / 16.)
def merge_cfg_from_args(args, mode):
"""Merge config keys, values in args into the global config."""
if mode == 'train':
sub_d = _C.TRAIN
else:
sub_d = _C.TEST
for k, v in sorted(six.iteritems(vars(args))):
d = _C
try:
value = eval(v)
except:
value = v
if k in sub_d:
sub_d[k] = value
else:
d[k] = value
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Based on:
# --------------------------------------------------------
# Detectron
# Copyright (c) 2017-present, Facebook, Inc.
# Licensed under the Apache License, Version 2.0;
# Written by Ross Girshick
# --------------------------------------------------------
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import cv2
import numpy as np
from config import cfg
import os
from PIL import Image
class DatasetPath(object):
def __init__(self, mode, dataset_name):
self.mode = mode
self.data_dir = dataset_name
def get_data_dir(self):
if self.mode == 'train':
return os.path.join(self.data_dir, 'ch4_training_images')
elif self.mode == 'val':
return os.path.join(self.data_dir, 'ch4_test_images')
def get_file_list(self):
if self.mode == 'train':
return os.path.join(self.data_dir,
'ch4_training_localization_transcription_gt')
elif self.mode == 'val':
return os.path.join(self.data_dir,
'ch4_test_localization_transcription_gt')
def get_image_blob(roidb, mode):
"""Builds an input blob from the images in the roidb at the specified
scales.
"""
if mode == 'train' or mode == 'val':
with open(roidb['image'], 'rb') as f:
data = f.read()
data = np.frombuffer(data, dtype='uint8')
img = cv2.imdecode(data, 1)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
gt_boxes = roidb['boxes']
gt_label = roidb['gt_classes']
# resize
if mode == 'train':
img, im_scale = _resize(img, target_size=800, max_size=1333)
need_gt_boxes = gt_boxes.copy()
need_gt_boxes[:, :4] *= im_scale
img, need_gt_boxes, need_gt_label = _rotation(
img, need_gt_boxes, gt_label, prob=1.0, gt_margin=1.4)
else:
img, im_scale = _resize(img, target_size=1000, max_size=1778)
need_gt_boxes = gt_boxes
need_gt_label = gt_label
img = img.astype(np.float32, copy=False)
img = img / 255.0
mean = np.array(cfg.pixel_means)[np.newaxis, np.newaxis, :]
std = np.array(cfg.pixel_std)[np.newaxis, np.newaxis, :]
img -= mean
img /= std
img = img.transpose((2, 0, 1))
return img, im_scale, need_gt_boxes, need_gt_label
def _get_size_scale(w, h, min_size, max_size=None):
size = min_size
scale = 1.0
if max_size is not None:
min_original_size = float(min((w, h)))
max_original_size = float(max((w, h)))
if max_original_size / min_original_size * size > max_size:
size = int(round(max_size * min_original_size / max_original_size))
if (w <= h and w == size) or (h <= w and h == size):
return (h, w), scale
if w < h:
ow = size
oh = int(size * h / w)
scale = size / w
else:
oh = size
ow = int(size * w / h)
scale = size / h
scale = ow / w
return (oh, ow), scale
def _resize(im, target_size=800, max_size=1333):
    if not isinstance(im, np.ndarray):
        raise TypeError("image type is not numpy.ndarray.")
    if len(im.shape) != 3:
        raise ValueError("image is not 3-dimensional.")
im_shape = im.shape
im_size_min = np.min(im_shape[0:2])
im_size_max = np.max(im_shape[0:2])
selected_size = target_size
if float(im_size_min) == 0:
raise ZeroDivisionError('min size of image is 0')
if max_size != 0:
im_scale = float(selected_size) / float(im_size_min)
# Prevent the biggest axis from being more than max_size
if np.round(im_scale * im_size_max) > max_size:
im_scale = float(max_size) / float(im_size_max)
im_scale_x = im_scale
im_scale_y = im_scale
resize_w = np.round(im_scale_x * float(im_shape[1]))
resize_h = np.round(im_scale_y * float(im_shape[0]))
im_info = [resize_h, resize_w, im_scale]
else:
im_scale_x = float(selected_size) / float(im_shape[1])
im_scale_y = float(selected_size) / float(im_shape[0])
resize_w = selected_size
resize_h = selected_size
im = Image.fromarray(im)
im = im.resize((int(resize_w), int(resize_h)), 2)
im = np.array(im)
return im, im_scale_x
def _rotation(image,
gt_boxes,
gt_label,
prob,
fixed_angle=-1,
r_range=(360, 0),
gt_margin=1.4):
rotate_range = r_range[0]
shift = r_range[1]
angle = np.array([np.max([0, fixed_angle])])
if np.random.rand() <= prob:
angle = np.array(
np.random.rand(1) * rotate_range - shift, dtype=np.int16)
'''
rotate image
'''
image = np.array(image)
(h, w) = image.shape[:2]
scale = 1.0
# set the rotation center
center = (w / 2, h / 2)
# anti-clockwise angle in the function
M = cv2.getRotationMatrix2D(center, angle, scale)
image = cv2.warpAffine(image, M, (w, h))
# back to PIL image
im_width, im_height = w, h
'''
rotate boxes
'''
need_gt_boxes = gt_boxes.copy()
origin_gt_boxes = need_gt_boxes
rotated_gt_boxes = np.empty((len(need_gt_boxes), 5), dtype=np.float32)
# anti-clockwise to clockwise arc
cos_cita = np.cos(np.pi / 180 * angle)
sin_cita = np.sin(np.pi / 180 * angle)
# clockwise matrix
rotation_matrix = np.array([[cos_cita, sin_cita], [-sin_cita, cos_cita]])
pts_ctr = origin_gt_boxes[:, 0:2]
pts_ctr = pts_ctr - np.tile((im_width / 2, im_height / 2),
(gt_boxes.shape[0], 1))
pts_ctr = np.array(np.dot(pts_ctr, rotation_matrix), dtype=np.int16)
pts_ctr = np.squeeze(
pts_ctr, axis=-1) + np.tile((im_width / 2, im_height / 2),
(gt_boxes.shape[0], 1))
origin_gt_boxes[:, 0:2] = pts_ctr
len_of_gt = len(origin_gt_boxes)
# rectificate the angle in the range of [-45, 45]
for idx in range(len_of_gt):
ori_angle = origin_gt_boxes[idx, 4]
height = origin_gt_boxes[idx, 3]
width = origin_gt_boxes[idx, 2]
# step 1: normalize gt (-45,135)
if width < height:
ori_angle += 90
width, height = height, width
# step 2: rotate (-45,495)
rotated_angle = ori_angle + angle
# step 3: normalize rotated_angle (-45,135)
while rotated_angle > 135:
rotated_angle = rotated_angle - 180
rotated_gt_boxes[idx, 0] = origin_gt_boxes[idx, 0]
rotated_gt_boxes[idx, 1] = origin_gt_boxes[idx, 1]
rotated_gt_boxes[idx, 3] = height * gt_margin
rotated_gt_boxes[idx, 2] = width * gt_margin
rotated_gt_boxes[idx, 4] = rotated_angle
x_inbound = np.logical_and(rotated_gt_boxes[:, 0] >= 0,
rotated_gt_boxes[:, 0] < im_width)
y_inbound = np.logical_and(rotated_gt_boxes[:, 1] >= 0,
rotated_gt_boxes[:, 1] < im_height)
inbound = np.logical_and(x_inbound, y_inbound)
need_gt_boxes = rotated_gt_boxes[inbound]
need_gt_label = gt_label.copy()
need_gt_label = need_gt_label[inbound]
return image, need_gt_boxes, need_gt_label
def prep_im_for_blob(im, pixel_means, target_size, max_size):
"""Prepare an image for use as a network input blob. Specially:
- Subtract per-channel pixel mean
- Convert to float32
- Rescale to each of the specified target size (capped at max_size)
Returns a list of transformed images, one for each target size. Also returns
the scale factors that were used to compute each returned image.
"""
im = im.astype(np.float32, copy=False)
im -= pixel_means
im_shape = im.shape
im_size_min = np.min(im_shape[0:2])
im_size_max = np.max(im_shape[0:2])
im_scale = float(target_size) / float(im_size_min)
# Prevent the biggest axis from being more than max_size
if np.round(im_scale * im_size_max) > max_size:
im_scale = float(max_size) / float(im_size_max)
im = cv2.resize(
im,
None,
None,
fx=im_scale,
fy=im_scale,
interpolation=cv2.INTER_LINEAR)
im_height, im_width, channel = im.shape
channel_swap = (2, 0, 1) #(batch, channel, height, width)
im = im.transpose(channel_swap)
return im, im_scale
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
class AttrDict(dict):
def __init__(self, *args, **kwargs):
super(AttrDict, self).__init__(*args, **kwargs)
def __getattr__(self, name):
if name in self.__dict__:
return self.__dict__[name]
elif name in self:
return self[name]
else:
raise AttributeError(name)
def __setattr__(self, name, value):
if name in self.__dict__:
self.__dict__[name] = value
else:
self[name] = value
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import cv2
import time
import numpy as np
import pickle
import paddle
import paddle.fluid as fluid
import reader
import models.model_builder as model_builder
import models.resnet as resnet
import checkpoint as checkpoint
from config import cfg
from utility import print_arguments, parse_args, check_gpu
from data_utils import DatasetPath
from eval_helper import *
import logging
FORMAT = '%(asctime)s-%(levelname)s: %(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)
def eval():
place = fluid.CUDAPlace(0) if cfg.use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
image_shape = [3, cfg.TEST.max_size, cfg.TEST.max_size]
class_nums = cfg.class_num
model = model_builder.RRPN(
add_conv_body_func=resnet.ResNet(),
add_roi_box_head_func=resnet.ResNetC5(),
use_pyreader=False,
mode='val')
startup_prog = fluid.Program()
infer_prog = fluid.Program()
with fluid.program_guard(infer_prog, startup_prog):
with fluid.unique_name.guard():
model.build_model(image_shape)
pred_boxes = model.eval_bbox_out()
infer_prog = infer_prog.clone(True)
exe.run(startup_prog)
# yapf: disable
def if_exist(var):
return os.path.exists(os.path.join(cfg.pretrained_model, var.name))
if cfg.pretrained_model:
checkpoint.load_params(exe, infer_prog, cfg.pretrained_model)
# yapf: enable
test_reader = reader.test(1)
feeder = fluid.DataFeeder(place=place, feed_list=model.feeds())
fetch_list = [pred_boxes]
res_list = []
keys = [
'bbox', 'gt_box', 'gt_class', 'is_crowed', 'im_info', 'im_id',
'is_difficult'
]
for i, data in enumerate(test_reader()):
im_info = [data[0][1]]
result = exe.run(infer_prog,
fetch_list=[v.name for v in fetch_list],
feed=feeder.feed(data),
return_numpy=False)
pred_boxes_v = result[0]
nmsed_out = pred_boxes_v
outs = np.array(nmsed_out)
res = get_key_dict(outs, data[0], keys)
res_list.append(res)
if i % 50 == 0:
logger.info('test_iter {}'.format(i))
icdar_eval(res_list)
if __name__ == '__main__':
args = parse_args()
print_arguments(args)
check_gpu(args.use_gpu)
eval()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
import os
import math
import logging
import six
import numpy as np
import cv2
import paddle.fluid as fluid
import Polygon as plg
from PIL import Image
from PIL import ImageDraw
from PIL import ImageFont
from config import cfg
logger = logging.getLogger(__name__)
def get_key_dict(out, data, key):
res = {}
for i in range(len(key)):
if i == 0:
res[key[i]] = out
else:
res[key[i]] = data[i]
return res
def get_labels_maps():
default_labels_maps = {1: 'text'}
if cfg.dataset == 'icdar2015':
return default_labels_maps
labels_map = {}
with open(os.path.join(cfg.data_dir, 'label_list')) as f:
lines = f.readlines()
for idx, line in enumerate(lines):
labels_map[idx + 1] = line.strip()
return labels_map
def draw_bounding_box_on_image(image_path,
image_name,
nms_out,
im_scale,
draw_threshold=0.8):
#if image is None:
image = Image.open(os.path.join(image_path, image_name))
draw = ImageDraw.Draw(image)
im_width, im_height = image.size
labels_map = get_labels_maps()
for dt in np.array(nms_out):
num_id, score = dt.tolist()[:2]
x1, y1, x2, y2, x3, y3, x4, y4 = dt.tolist()[2:] / im_scale
if score < draw_threshold:
continue
draw.line(
[(x1, y1), (x2, y2), (x3, y3), (x4, y4), (x1, y1)],
width=2,
fill='red')
if image.mode == 'RGB':
draw.text((x1, y1), labels_map[num_id], (255, 255, 0))
print("image with bbox drawed saved as {}".format(image_name))
image.save(image_name)
def polygon_from_points(points):
"""
Returns a Polygon object to use with the Polygon2 class from a list of 8 points: x1,y1,x2,y2,x3,y3,x4,y4
"""
res_boxes = np.empty([1, 8], dtype='int32')
res_boxes[0, 0] = int(points[0])
res_boxes[0, 4] = int(points[1])
res_boxes[0, 1] = int(points[2])
res_boxes[0, 5] = int(points[3])
res_boxes[0, 2] = int(points[4])
res_boxes[0, 6] = int(points[5])
res_boxes[0, 3] = int(points[6])
res_boxes[0, 7] = int(points[7])
point_mat = res_boxes[0].reshape([2, 4]).T
return plg.Polygon(point_mat)
def clip_box(bbox, im_info):
h = im_info[0]
w = im_info[1]
res = []
for b in bbox:
pts = b.reshape(4, 2)
pts[np.where(pts < 0)] = 1
pts[np.where(pts[:, 0] > w), 0] = w - 1
pts[np.where(pts[:, 1] > h), 1] = h - 1
pts = pts.reshape(-1)
pts /= im_info[2]
res.append(pts)
return np.array(res)
def get_union(det, gt):
area_det = det.area()
area_gt = gt.area()
return area_det + area_gt - get_intersection(det, gt)
def get_intersection_over_union(det, gt):
try:
return get_intersection(det, gt) / get_union(det, gt)
except:
return 0
def get_intersection(det, gt):
inter = det & gt
if len(inter) == 0:
return 0
return inter.area()
def parse_gt(result, im_id):
for res in result:
if res['im_id'] == im_id:
gt_boxes = list(res['gt_box'])
gt_class = res['gt_class']
is_difficult = res['is_difficult'].reshape(-1)
objects = []
for i in range(len(gt_boxes)):
object_struct = {}
object_struct['bbox'] = gt_boxes[i]
object_struct['class'] = gt_class[i]
if is_difficult[i] == 1:
object_struct['difficult'] = 1
else:
object_struct['difficult'] = 0
object_struct['im_id'] = im_id
objects.append(object_struct)
return objects
def calculate_ap(rec, prec):
# 11 point metric
ap = 0.
for t in np.arange(0., 1.1, 0.1):
if np.sum(rec >= t) == 0:
p = 0
else:
p = np.max(prec[rec >= t])
ap = ap + p / 11.
return ap
def icdar_map(result, class_name, ovthresh):
im_ids = []
for res in result:
im_ids.append(res['im_id'])
recs = {}
for i, im_id in enumerate(im_ids):
recs[str(im_id)] = parse_gt(result, im_id)
class_recs = {}
npos = 0
for k in im_ids:
res = [obj for obj in recs[str(k)] if obj['class'] == class_name]
bbox = np.array([x['bbox'] for x in res])
difficult = np.array([x['difficult'] for x in res]).astype(np.bool)
det = [False] * len(res)
npos = npos + sum(~difficult)
class_recs[k] = {'bbox': bbox, 'difficult': difficult, 'det': det}
image_ids = []
confidence = []
bbox = []
for res in result:
im_info = res['im_info']
pred_boxes = res['bbox']
for box in pred_boxes:
if box[0] == class_name:
image_ids.append(res['im_id'])
confidence.append(box[1])
clipd_box = clip_box(box[2:].reshape(-1, 8), im_info)
bbox.append(clipd_box[0])
confidence = np.array(confidence)
sorted_ind = np.argsort(-confidence)
sorted_scores = np.sort(-confidence)
bbox = np.array(bbox)
bbox = bbox[sorted_ind, :]
image_ids = [image_ids[x] for x in sorted_ind]
nd = len(image_ids)
tp = np.zeros(nd)
fp = np.zeros(nd)
for d in range(nd):
res = class_recs[image_ids[d]]
bb = bbox[d, :].astype(float)
ovmax = -np.inf
gt_bbox = res['bbox'].astype(float)
if gt_bbox.size > 0:
# compute overlaps
gt_bbox_xmin = np.min(gt_bbox[:, 0::2], axis=1)
gt_bbox_ymin = np.min(gt_bbox[:, 1::2], axis=1)
gt_bbox_xmax = np.max(gt_bbox[:, 0::2], axis=1)
gt_bbox_ymax = np.max(gt_bbox[:, 1::2], axis=1)
bb_xmin = np.min(bb[0::2])
bb_ymin = np.min(bb[1::2])
bb_xmax = np.max(bb[0::2])
bb_ymax = np.max(bb[1::2])
ixmin = np.maximum(gt_bbox_xmin, bb_xmin)
iymin = np.maximum(gt_bbox_ymin, bb_ymin)
ixmax = np.minimum(gt_bbox_xmax, bb_xmax)
iymax = np.minimum(gt_bbox_ymax, bb_ymax)
iw = np.maximum(ixmax - ixmin + 1., 0.)
ih = np.maximum(iymax - iymin + 1., 0.)
inters = iw * ih
# union
uni = ((bb_xmax - bb_xmin + 1.) * (bb_ymax - bb_ymin + 1.) +
(gt_bbox_xmax - gt_bbox_xmin + 1.) *
(gt_bbox_ymax - gt_bbox_ymin + 1.) - inters)
overlaps = inters / uni
gt_bbox_keep_mask = overlaps > 0
gt_bbox_keep = gt_bbox[gt_bbox_keep_mask, :]
gt_bbox_keep_index = np.where(overlaps > 0)[0]
def calcoverlaps(gt_bbox_keep, bb):
overlaps = []
for index, _ in enumerate(gt_bbox_keep):
p_g = polygon_from_points(gt_bbox_keep[index])
p_d = polygon_from_points(bb)
overlap = get_intersection_over_union(p_d, p_g)
overlaps.append(overlap)
return overlaps
if len(gt_bbox_keep) > 0:
overlaps = calcoverlaps(gt_bbox_keep, bb)
ovmax = np.max(overlaps)
jmax = np.argmax(overlaps)
jmax = gt_bbox_keep_index[jmax]
if ovmax > ovthresh:
if not res['difficult'][jmax]:
if not res['det'][jmax]:
tp[d] = 1.
res['det'][jmax] = 1
else:
fp[d] = 1.
else:
fp[d] = 1.
# compute precision recall
fp = np.cumsum(fp)
tp = np.cumsum(tp)
rec = tp / float(npos)
prec = tp / np.maximum(tp + fp, np.finfo(np.float64).eps)
ap = calculate_ap(rec, prec)
return rec, prec, ap
def icdar_map_eval(result, num_class):
map = 0
for i in range(num_class - 1):
rec, prec, ap = icdar_map(result, i + 1, ovthresh=0.5)
map = map + ap
map = map / (num_class - 1)
logger.info('mAP {}'.format(map))
def icdar_box_eval(result, thresh):
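    """Evaluate with the ICDAR 2015 detection protocol: predictions scoring below
    `thresh` are dropped, a detection polygon matches a ground-truth polygon when
    their IoU exceeds 0.5, "don't care" (difficult) regions are excluded, and
    recall, precision and F1 are accumulated over the whole result list."""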
matched_sum = 0
num_global_care_gt = 0
num_global_care_det = 0
for res in result:
im_info = res['im_info']
h = im_info[1]
w = im_info[2]
gt_boxes = res['gt_box']
pred_boxes = res['bbox']
pred_boxes = pred_boxes[np.where(pred_boxes[:, 1] > thresh)]
pred_boxes = pred_boxes[:, 2:]
pred_boxes = clip_box(pred_boxes, im_info)
is_difficult = res['is_difficult']
det_matched = 0
iou_mat = np.empty([1, 1])
gt_pols = []
det_pols = []
gt_pol_points = []
det_pol_points = []
gt_dont_care_pols_num = []
det_dont_care_pols_num = []
det_matched_nums = []
points_list = list(gt_boxes)
dony_care = is_difficult.reshape(-1)
for i, points in enumerate(points_list):
gt_pol = polygon_from_points(list(points))
gt_pols.append(gt_pol)
gt_pol_points.append(list(points))
if dony_care[i] == 1:
gt_dont_care_pols_num.append(len(gt_pols) - 1)
for i, points in enumerate(pred_boxes):
points = list(points.reshape(8).astype(np.int32))
det_pol = polygon_from_points(points)
det_pols.append(det_pol)
det_pol_points.append(points)
if len(gt_dont_care_pols_num) > 0:
for dont_care_pol in gt_dont_care_pols_num:
dont_care_pol = gt_pols[dont_care_pol]
intersected_area = get_intersection(dont_care_pol, det_pol)
pd_dimensions = det_pol.area()
precision = 0 if pd_dimensions == 0 else intersected_area / pd_dimensions
if (precision > 0.5):
det_dont_care_pols_num.append(len(det_pols) - 1)
break
if len(gt_pols) > 0 and len(det_pols) > 0:
# Calculate IoU and precision matrixs
output_shape = [len(gt_pols), len(det_pols)]
iou_mat = np.empty(output_shape)
gt_rect_mat = np.zeros(len(gt_pols), np.int8)
det_rect_mat = np.zeros(len(det_pols), np.int8)
for gt_num in range(len(gt_pols)):
for det_num in range(len(det_pols)):
p_d = gt_pols[gt_num]
p_g = det_pols[det_num]
iou_mat[gt_num, det_num] = get_intersection_over_union(p_d,
p_g)
for gt_num in range(len(gt_pols)):
for det_num in range(len(det_pols)):
if gt_rect_mat[gt_num] == 0 and det_rect_mat[
det_num] == 0 and gt_num not in gt_dont_care_pols_num and det_num not in det_dont_care_pols_num:
if iou_mat[gt_num, det_num] > 0.5:
gt_rect_mat[gt_num] = 1
det_rect_mat[det_num] = 1
det_matched += 1
det_matched_nums.append(det_num)
num_gt_care = (len(gt_pols) - len(gt_dont_care_pols_num))
num_det_care = (len(det_pols) - len(det_dont_care_pols_num))
matched_sum += det_matched
num_global_care_gt += num_gt_care
num_global_care_det += num_det_care
method_recall = 0 if num_global_care_gt == 0 else float(
matched_sum) / num_global_care_gt
method_precision = 0 if num_global_care_det == 0 else float(
matched_sum) / num_global_care_det
method_hmean = 0 if method_recall + method_precision == 0 else 2 * method_recall * method_precision / (
method_recall + method_precision)
logger.info('Recall {}'.format(method_recall))
logger.info('Precision {}'.format(method_precision))
logger.info('F1 {}'.format(method_hmean))
def icdar_eval(result):
if cfg.dataset == 'icdar2015':
icdar_box_eval(result, 0.8)
else:
icdar_map_eval(result, cfg.class_num)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import cv2
import time
import numpy as np
import pickle
import paddle
import paddle.fluid as fluid
import reader
import models.model_builder as model_builder
import models.resnet as resnet
import checkpoint as checkpoint
from config import cfg
from data_utils import DatasetPath
from eval_helper import *
from utility import print_arguments, parse_args, check_gpu
def infer():
place = fluid.CUDAPlace(0) if cfg.use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
image_shape = [3, cfg.TEST.max_size, cfg.TEST.max_size]
class_nums = cfg.class_num
model = model_builder.RRPN(
add_conv_body_func=resnet.ResNet(),
add_roi_box_head_func=resnet.ResNetC5(),
use_pyreader=False,
mode='infer')
startup_prog = fluid.Program()
infer_prog = fluid.Program()
with fluid.program_guard(infer_prog, startup_prog):
with fluid.unique_name.guard():
model.build_model(image_shape)
pred_boxes = model.eval_bbox_out()
infer_prog = infer_prog.clone(True)
exe.run(startup_prog)
# yapf: disable
def if_exist(var):
return os.path.exists(os.path.join(cfg.pretrained_model, var.name))
if cfg.pretrained_model:
checkpoint.load_params(exe, infer_prog, cfg.pretrained_model)
# yapf: enable
infer_reader = reader.infer(cfg.image_path)
feeder = fluid.DataFeeder(place=place, feed_list=model.feeds())
fetch_list = [pred_boxes]
imgs = os.listdir(cfg.image_path)
imgs.sort()
for i, data in enumerate(infer_reader()):
result = exe.run(infer_prog,
fetch_list=[v.name for v in fetch_list],
feed=feeder.feed(data),
return_numpy=False)
nmsed_out = result[0]
im_info = data[0][1]
im_scale = im_info[2]
outs = np.array(nmsed_out)
draw_bounding_box_on_image(cfg.image_path, imgs[i], outs, im_scale,
cfg.draw_threshold)
if __name__ == '__main__':
args = parse_args()
print_arguments(args)
check_gpu(args.use_gpu)
infer()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
import paddle.fluid as fluid
from paddle.fluid.layer_helper import LayerHelper
from paddle.fluid.framework import Variable
fluid.load_op_library('models/ext_op/src/rrpn_lib.so')
def rrpn_target_assign(bbox_pred,
cls_logits,
anchor_box,
gt_boxes,
im_info,
rpn_batch_size_per_im=256,
rpn_straddle_thresh=0.0,
rpn_fg_fraction=0.5,
rpn_positive_overlap=0.7,
rpn_negative_overlap=0.3,
use_random=True):
"""
**Target Assign Layer for rotated region proposal network (RRPN).**
    Given the Intersection-over-Union (IoU) overlap between anchors and
    ground truth boxes, this layer assigns classification and regression
    targets to each anchor; these targets are used to train the RPN. The
    classification target is a binary class label (being an object or not).
    Following the RRPN paper, two kinds of anchors are labeled positive:
    (i) the anchor/anchors with the highest IoU overlap with a ground-truth
    box, and (ii) anchors that have an IoU overlap higher than
    rpn_positive_overlap (0.7) with any ground-truth box. Note that a single
    ground-truth box may assign positive labels to multiple anchors. An
    anchor is labeled negative when its IoU ratio is lower than
    rpn_negative_overlap (0.3) for all ground-truth boxes. Anchors that are
neither positive nor negative do not contribute to the training objective.
The regression targets are the encoded ground-truth boxes associated with
the positive anchors.
Args:
bbox_pred(Variable): A 3-D Tensor with shape [N, M, 5] represents the
predicted locations of M bounding bboxes. N is the batch size,
and each bounding box has five coordinate values and the layout
is [x, y, w, h, angle]. The data type can be float32 or float64.
cls_logits(Variable): A 3-D Tensor with shape [N, M, 1] represents the
predicted confidence predictions. N is the batch size, 1 is the
            foreground and background sigmoid, M is the number of bounding boxes.
The data type can be float32 or float64.
anchor_box(Variable): A 2-D Tensor with shape [M, 5] holds M boxes,
each box is represented as [x, y, w, h, angle],
[x, y] is the left top coordinate of the anchor box,
if the input is image feature map, they are close to the origin
of the coordinate system. [w, h] is the right bottom
coordinate of the anchor box, angle is the rotation angle of box.
The data type can be float32 or float64.
gt_boxes (Variable): The ground-truth bounding boxes (bboxes) are a 2D
LoDTensor with shape [Ng, 5], Ng is the total number of ground-truth
bboxes of mini-batch input. The data type can be float32 or float64.
im_info (Variable): A 2-D LoDTensor with shape [N, 3]. N is the batch size,
3 is the height, width and scale.
rpn_batch_size_per_im(int): Total number of RPN examples per image.
The data type must be int32.
rpn_straddle_thresh(float): Remove RPN anchors that go outside the image
by straddle_thresh pixels. The data type must be float32.
rpn_fg_fraction(float): Target fraction of RoI minibatch that is labeled
foreground (i.e. class > 0), 0-th class is background. The data type must be float32.
rpn_positive_overlap(float): Minimum overlap required between an anchor
and ground-truth box for the (anchor, gt box) pair to be a positive
example. The data type must be float32.
rpn_negative_overlap(float): Maximum overlap allowed between an anchor
and ground-truth box for the (anchor, gt box) pair to be a negative
examples. The data type must be float32.
use_random(bool): Whether to sample randomly when sampling.
Returns:
tuple:
A tuple(predicted_scores, predicted_location, target_label,
target_bbox) is returned. The predicted_scores
and predicted_location is the predicted result of the RPN.
The target_label and target_bbox is the ground truth,
respectively. The predicted_location is a 2D Tensor with shape
[F, 5], and the shape of target_bbox is same as the shape of
the predicted_location, F is the number of the foreground
anchors. The predicted_scores is a 2D Tensor with shape
[F + B, 1], and the shape of target_label is same as the shape
of the predicted_scores, B is the number of the background
anchors, the F and B is depends on the input of this operator.
Bbox_inside_weight represents whether the predicted loc is fake_fg
or not and the shape is [F, 5].
Examples:
.. code-block:: python
import paddle.fluid as fluid
bbox_pred = fluid.data(name='bbox_pred', shape=[None, 5], dtype='float32')
cls_logits = fluid.data(name='cls_logits', shape=[None, 1], dtype='float32')
anchor_box = fluid.data(name='anchor_box', shape=[None, 5], dtype='float32')
gt_boxes = fluid.data(name='gt_boxes', shape=[None, 5], dtype='float32')
im_info = fluid.data(name='im_infoss', shape=[None, 3], dtype='float32')
loc, score, loc_target, score_target = rrpn_target_assign(
bbox_pred, cls_logits, anchor_box, gt_boxes, im_info)
"""
helper = LayerHelper('rrpn_target_assign', **locals())
# Assign target label to anchors
loc_index = helper.create_variable_for_type_inference(dtype='int32')
score_index = helper.create_variable_for_type_inference(dtype='int32')
target_label = helper.create_variable_for_type_inference(dtype='int32')
target_bbox = helper.create_variable_for_type_inference(
dtype=anchor_box.dtype)
helper.append_op(
type="rrpn_target_assign",
inputs={'Anchor': anchor_box,
'GtBoxes': gt_boxes,
'ImInfo': im_info},
outputs={
'LocationIndex': loc_index,
'ScoreIndex': score_index,
'TargetLabel': target_label,
'TargetBBox': target_bbox
},
attrs={
'rpn_batch_size_per_im': rpn_batch_size_per_im,
'rpn_straddle_thresh': rpn_straddle_thresh,
'rpn_positive_overlap': rpn_positive_overlap,
'rpn_negative_overlap': rpn_negative_overlap,
'rpn_fg_fraction': rpn_fg_fraction,
'use_random': use_random
})
loc_index.stop_gradient = True
score_index.stop_gradient = True
target_label.stop_gradient = True
target_bbox.stop_gradient = True
cls_logits = fluid.layers.reshape(x=cls_logits, shape=(-1, 1))
bbox_pred = fluid.layers.reshape(x=bbox_pred, shape=(-1, 5))
predicted_cls_logits = fluid.layers.gather(cls_logits, score_index)
predicted_bbox_pred = fluid.layers.gather(bbox_pred, loc_index)
return predicted_cls_logits, predicted_bbox_pred, target_label, target_bbox
def rotated_anchor_generator(input,
anchor_sizes=None,
aspect_ratios=None,
angles=None,
variance=[1.0, 1.0, 1.0, 1.0, 1.0],
stride=None,
offset=0.5,
name=None):
"""
**Rotated Anchor generator operator**
Generate anchors for RRPN algorithm.
Each position of the input produce N anchors, N =
size(anchor_sizes) * size(aspect_ratios) * size(angles).
The order of generated anchors is firstly aspect_ratios
loop then anchor_sizes loop.
Args:
input(Variable): 4-D Tensor with shape [N,C,H,W]. The input feature map.
anchor_sizes(float32|list|tuple): The anchor sizes of generated
anchors, given in absolute pixels e.g. [64., 128., 256., 512.].
For instance, the anchor size of 64 means the area of this anchor
equals to 64**2. None by default.
aspect_ratios(float32|list|tuple): The height / width ratios
of generated anchors, e.g. [0.5, 1.0, 2.0]. None by default.
angle(list|tuple): Rotated angle of prior boxes. The data type is float32.
variance(list|tuple): The variances to be used in box
regression deltas. The data type is float32, [1.0, 1.0, 1.0, 1.0, 1.0] by
default.
stride(list|tuple): The anchors stride across width and height.
The data type is float32. e.g. [16.0, 16.0]. None by default.
offset(float32): Prior boxes center offset. 0.5 by default.
name(str): Name of this layer. None by default.
Returns:
Anchors(Variable): The output anchors with a layout of [H, W, num_anchors, 5].
H is the height of input, W is the width of input,
num_anchors is the box count of each position. Each anchor is
in (x, y, w, h, angle) format.
Variances(Variable): The expanded variances of anchors with a layout of
[H, W, num_priors, 5]. H is the height of input,
W is the width of input num_anchors is the box count
of each position. Each variance is in (x, y, w, h, angle) format.
Examples:
.. code-block:: python
import paddle.fluid as fluid
conv1 = fluid.data(name='conv1', shape=[None, 48, 16, 16], dtype='float32')
anchor, var = rotated_anchor_generator(
input=conv1,
anchor_sizes=[128, 256, 512],
aspect_ratios=[0.2, 0.5, 1.0],
variance=[1.0, 1.0, 1.0, 1.0, 1.0],
stride=[16.0, 16.0],
offset=0.5)
"""
helper = LayerHelper("rotated_anchor_generator", **locals())
dtype = helper.input_dtype()
def _is_list_or_tuple_(data):
return (isinstance(data, list) or isinstance(data, tuple))
if not _is_list_or_tuple_(anchor_sizes):
anchor_sizes = [anchor_sizes]
if not _is_list_or_tuple_(aspect_ratios):
aspect_ratios = [aspect_ratios]
if not _is_list_or_tuple_(angles):
angles = [angles]
if not (_is_list_or_tuple_(stride) and len(stride) == 2):
raise ValueError('stride should be a list or tuple ',
'with length 2, (stride_width, stride_height).')
anchor_sizes = list(map(float, anchor_sizes))
aspect_ratios = list(map(float, aspect_ratios))
angles = list(map(float, angles))
stride = list(map(float, stride))
attrs = {
'anchor_sizes': anchor_sizes,
'aspect_ratios': aspect_ratios,
'angles': angles,
'variances': variance,
'stride': stride,
'offset': offset
}
anchor = helper.create_variable_for_type_inference(dtype)
var = helper.create_variable_for_type_inference(dtype)
helper.append_op(
type="rotated_anchor_generator",
inputs={"Input": input},
outputs={"Anchors": anchor,
"Variances": var},
attrs=attrs, )
anchor.stop_gradient = True
var.stop_gradient = True
return anchor, var
def rrpn_box_coder(prior_box, prior_box_var, target_box, name=None):
"""
Args:
prior_box(Variable): Box list prior_box is a 2-D Tensor with shape
[M, 5] holds M boxes and data type is float32 or float64. Each box
is represented as [x, y, w, h, angle], [x, y] is the
center coordinate of the anchor box, [w, h] is the width and height
of the anchor box, angle is rotated angle of prior_box.
prior_box_var(List|Variable|None): "prior_box_var is a 2-D Tensor with
shape [M, 5] holds M group of variance."
target_box(Variable): This input can be a 2-D LoDTensor with shape
[M, 5]. Each box is represented as [x, y, w, h, angle]. The data
type is float32 or float64.
name(str): Name of this layer. None by default.
Returns:
Variable:
output_box(Variable): The output tensor of rrpn_box_coder_op with shape [N, 5] representing the
result of N target boxes encoded with N Prior boxes and variances.
N represents the number of box and 5 represents [x, y, w, h ,angle].
Examples:
.. code-block:: python
import paddle.fluid as fluid
prior_box_decode = fluid.data(name='prior_box_decode',
shape=[512, 5],
dtype='float32')
target_box_decode = fluid.data(name='target_box_decode',
shape=[512, 5],
dtype='float32')
output_decode = rrpn_box_coder(prior_box=prior_box_decode,
prior_box_var=[10, 10, 5, 5, 1],
target_box=target_box_decode)
"""
helper = LayerHelper("rrpn_box_coder", **locals())
if name is None:
output_box = helper.create_variable_for_type_inference(
dtype=prior_box.dtype)
else:
output_box = helper.create_variable(
name=name, dtype=prior_box.dtype, persistable=False)
inputs = {"PriorBox": prior_box, "TargetBox": target_box}
attrs = {}
if isinstance(prior_box_var, Variable):
inputs['PriorBoxVar'] = prior_box_var
elif isinstance(prior_box_var, list):
attrs['variance'] = prior_box_var
else:
raise TypeError(
"Input variance of rrpn_box_coder must be Variable or list")
helper.append_op(
type="rrpn_box_coder",
inputs=inputs,
attrs=attrs,
outputs={"OutputBox": output_box})
return output_box
def rotated_roi_align(input,
rois,
pooled_height=1,
pooled_width=1,
spatial_scale=1.0,
name=None):
"""
**RotatedRoIAlign Operator**
Rotated Region of interest align (also known as Rotated RoI align) is to perform
bilinear interpolation on inputs of nonuniform sizes to obtain
fixed-size feature maps (e.g. 7*7)
Dividing each region proposal into equal-sized sections with
the pooled_width and pooled_height. Location remains the origin
result.
Each ROI bin are transformed to become horizontal by perspective transformation and
values in each ROI bin are computed directly through bilinear interpolation. The output is
the mean of all values.
Thus avoid the misaligned problem.
"""
helper = LayerHelper('rrpn_rotated_roi_align', **locals())
dtype = helper.input_dtype()
align_out = helper.create_variable_for_type_inference(dtype)
cx = helper.create_variable_for_type_inference('float32')
cy = helper.create_variable_for_type_inference('float32')
helper.append_op(
type="rrpn_rotated_roi_align",
inputs={"X": input,
"ROIs": rois},
outputs={"Out": align_out,
"ConIdX": cx,
"ConIdY": cy},
attrs={
"pooled_height": pooled_height,
"pooled_width": pooled_width,
"spatial_scale": spatial_scale,
})
return align_out
def rotated_generate_proposal_labels(rpn_rois,
gt_classes,
is_crowd,
gt_boxes,
im_info,
batch_size_per_im=256,
fg_fraction=0.25,
fg_thresh=0.25,
bg_thresh_hi=0.5,
bg_thresh_lo=0.0,
bbox_reg_weights=[0.1, 0.1, 0.2, 0.2],
class_nums=None,
use_random=True,
is_cls_agnostic=False):
"""
**Rotated Generate Proposal Labels**
    Given the bounding boxes produced by RotatedGenerateProposalOp and the groundtruth,
    this operator samples foreground and background boxes and computes the loss targets.
    RpnRois are the output boxes of the RPN, processed by rotated_generate_proposal_op; these boxes
    are combined with the groundtruth boxes and sampled according to batch_size_per_im and fg_fraction.
    A box whose overlap with the groundtruth is greater than fg_thresh is considered a foreground sample.
    A box whose overlap with the groundtruth is greater than bg_thresh_lo and lower than bg_thresh_hi
    is considered a background sample.
After all foreground and background boxes are chosen (so called Rois),
then we apply random sampling to make sure
the number of foreground boxes is no more than batch_size_per_im * fg_fraction.
For each box in Rois, we assign the classification (class label) and regression targets (box label) to it.
Finally BboxInsideWeights and BboxOutsideWeights are used to specify whether it would contribute to training loss.
Args:
rpn_rois(Variable): A 2-D LoDTensor with shape [N, 5]. N is the number of the RotatedGenerateProposalOp's output, each element is a bounding box with [x, y, w, h, angle] format. The data type can be float32 or float64.
gt_classes(Variable): A 2-D LoDTensor with shape [M, 1]. M is the number of groundtruth, each element is a class label of groundtruth. The data type must be int32.
is_crowd(Variable): A 2-D LoDTensor with shape [M, 1]. M is the number of groundtruth, each element is a flag indicates whether a groundtruth is crowd. The data type must be int32.
gt_boxes(Variable): A 2-D LoDTensor with shape [M, 5]. M is the number of groundtruth, each element is a bounding box with [x, y, w, h, angle] format.
im_info(Variable): A 2-D LoDTensor with shape [B, 3]. B is the number of input images, each element consists of im_height, im_width, im_scale.
batch_size_per_im(int): Batch size of rois per images. The data type must be int32.
fg_fraction(float): Foreground fraction in total batch_size_per_im. The data type must be float32.
fg_thresh(float): Overlap threshold which is used to chose foreground sample. The data type must be float32.
bg_thresh_hi(float): Overlap threshold upper bound which is used to chose background sample. The data type must be float32.
bg_thresh_lo(float): Overlap threshold lower bound which is used to chose background sample. The data type must be float32.
bbox_reg_weights(list|tuple): Box regression weights. The data type must be float32.
class_nums(int): Class number. The data type must be int32.
use_random(bool): Use random sampling to choose foreground and background boxes.
is_cls_agnostic(bool): bbox regression use class agnostic simply which only represent fg and bg boxes.
Returns:
tuple:
A tuple with format``(rois, labels_int32, bbox_targets, bbox_inside_weights, bbox_outside_weights)``.
- **rois**: 2-D LoDTensor with shape ``[batch_size_per_im * batch_size, 5]``. The data type is the same as ``rpn_rois``.
- **labels_int32**: 2-D LoDTensor with shape ``[batch_size_per_im * batch_size, 1]``. The data type must be int32.
- **bbox_targets**: 2-D LoDTensor with shape ``[batch_size_per_im * batch_size, 5 * class_num]``. The regression targets of all RoIs. The data type is the same as ``rpn_rois``.
- **bbox_inside_weights**: 2-D LoDTensor with shape ``[batch_size_per_im * batch_size, 5 * class_num]``. The weights of foreground boxes' regression loss. The data type is the same as ``rpn_rois``.
- **bbox_outside_weights**: 2-D LoDTensor with shape ``[batch_size_per_im * batch_size, 5 * class_num]``. The weights of regression loss. The data type is the same as ``rpn_rois``.
Examples:
.. code-block:: python
import paddle.fluid as fluid
rpn_rois = fluid.data(name='rpn_rois', shape=[None, 5], dtype='float32')
gt_classes = fluid.data(name='gt_classes', shape=[None, 1], dtype='float32')
is_crowd = fluid.data(name='is_crowd', shape=[None, 1], dtype='float32')
gt_boxes = fluid.data(name='gt_boxes', shape=[None, 5], dtype='float32')
im_info = fluid.data(name='im_info', shape=[None, 3], dtype='float32')
rois, labels, bbox, inside_weights, outside_weights = rotated_generate_proposal_labels(
rpn_rois, gt_classes, is_crowd, gt_boxes, im_info,
class_nums=10)
"""
helper = LayerHelper('rrpn_generate_proposal_labels', **locals())
rois = helper.create_variable_for_type_inference(dtype=rpn_rois.dtype)
labels_int32 = helper.create_variable_for_type_inference(
dtype=gt_classes.dtype)
bbox_targets = helper.create_variable_for_type_inference(
dtype=rpn_rois.dtype)
bbox_inside_weights = helper.create_variable_for_type_inference(
dtype=rpn_rois.dtype)
bbox_outside_weights = helper.create_variable_for_type_inference(
dtype=rpn_rois.dtype)
helper.append_op(
type="rrpn_generate_proposal_labels",
inputs={
'RpnRois': rpn_rois,
'GtClasses': gt_classes,
'IsCrowd': is_crowd,
'GtBoxes': gt_boxes,
'ImInfo': im_info
},
outputs={
'Rois': rois,
'LabelsInt32': labels_int32,
'BboxTargets': bbox_targets,
'BboxInsideWeights': bbox_inside_weights,
'BboxOutsideWeights': bbox_outside_weights
},
attrs={
'batch_size_per_im': batch_size_per_im,
'fg_fraction': fg_fraction,
'fg_thresh': fg_thresh,
'bg_thresh_hi': bg_thresh_hi,
'bg_thresh_lo': bg_thresh_lo,
'bbox_reg_weights': bbox_reg_weights,
'class_nums': class_nums,
'use_random': use_random,
'is_cls_agnostic': is_cls_agnostic
})
rois.stop_gradient = True
labels_int32.stop_gradient = True
bbox_targets.stop_gradient = True
bbox_inside_weights.stop_gradient = True
bbox_outside_weights.stop_gradient = True
return rois, labels_int32, bbox_targets, bbox_inside_weights, bbox_outside_weights
def rotated_generate_proposals(scores,
bbox_deltas,
im_info,
anchors,
variances,
pre_nms_top_n=6000,
post_nms_top_n=1000,
nms_thresh=0.5,
min_size=0.1,
name=None):
"""
**Rotated Generate proposal**
This operation proposes rotated RoIs according to each box's probability of being
a foreground object, where the boxes are calculated from anchors. bbox_deltas and
scores are the outputs of the RPN. The final proposals can be used to train the
detection net. To generate proposals, this operation performs the following steps:
1. Transpose and reshape scores and bbox_deltas to shapes
(H*W*A, 1) and (H*W*A, 5)
2. Calculate box locations as proposal candidates.
3. Remove predicted boxes with small area.
4. Apply NMS to get the final proposals as output.
Args:
scores(Variable): A 4-D Tensor with shape [N, A, H, W] represents
the probability for each box to be an object.
N is batch size, A is number of anchors, H and W are height and
width of the feature map. The data type must be float32.
bbox_deltas(Variable): A 4-D Tensor with shape [N, 5*A, H, W]
represents the difference between the predicted box location and the
anchor location. The data type must be float32.
im_info(Variable): A 2-D Tensor with shape [N, 3] represents origin
image information for N batch. Info contains height, width and scale
between origin image size and the size of feature map.
The data type must be float32.
anchors(Variable): A 4-D Tensor represents the anchors with a layout
of [H, W, A, 5]. H and W are height and width of the feature map,
A is the box count of each position. Each anchor is
in (x, y, w, h, angle) format. The data type must be float32.
variances(Variable): A 4-D Tensor. The expanded variances of anchors with a layout of
[H, W, num_priors, 5]. Each variance is in
(xcenter, ycenter, w, h, angle) format. The data type must be float32.
pre_nms_top_n(float): Number of total bboxes to be kept per
image before NMS. The data type must be float32. `6000` by default.
post_nms_top_n(float): Number of total bboxes to be kept per
image after NMS. The data type must be float32. `1000` by default.
nms_thresh(float): Threshold in NMS. The data type must be float32. `0.5` by default.
min_size(float): Remove predicted boxes with either height or
width < min_size. The data type must be float32. `0.1` by default.
Returns:
tuple:
A tuple with format ``(rpn_rois, rpn_roi_probs)``.
- **rpn_rois**: The generated RoIs. 2-D Tensor with shape ``[N, 5]`` where ``N`` is the number of RoIs. The data type is the same as ``scores``.
- **rpn_roi_probs**: The scores of the generated RoIs. 2-D Tensor with shape ``[N, 1]`` where ``N`` is the number of RoIs. The data type is the same as ``scores``.
Examples:
.. code-block:: python
import paddle.fluid as fluid
scores = fluid.data(name='scores', shape=[None, 4, 5, 5], dtype='float32')
bbox_deltas = fluid.data(name='bbox_deltas', shape=[None, 20, 5, 5], dtype='float32')
im_info = fluid.data(name='im_info', shape=[None, 3], dtype='float32')
anchors = fluid.data(name='anchors', shape=[None, 5, 4, 5], dtype='float32')
variances = fluid.data(name='variances', shape=[None, 5, 10, 5], dtype='float32')
rrois, rroi_probs = rotated_generate_proposals(scores, bbox_deltas,
im_info, anchors, variances)
"""
helper = LayerHelper('rrpn_generate_proposals', **locals())
rpn_rois = helper.create_variable_for_type_inference(
dtype=bbox_deltas.dtype)
rpn_roi_probs = helper.create_variable_for_type_inference(
dtype=scores.dtype)
helper.append_op(
type="rrpn_generate_proposals",
inputs={
'Scores': scores,
'BboxDeltas': bbox_deltas,
'ImInfo': im_info,
'Anchors': anchors,
'Variances': variances
},
attrs={
'pre_nms_topN': pre_nms_top_n,
'post_nms_topN': post_nms_top_n,
'nms_thresh': nms_thresh,
'min_size': min_size
},
outputs={'RpnRois': rpn_rois,
'RpnRoiProbs': rpn_roi_probs})
rpn_rois.stop_gradient = True
rpn_roi_probs.stop_gradient = True
return rpn_rois, rpn_roi_probs
# Building the custom OPs
## Code structure
- src: C++/CUDA source of the extension OPs
- rrpn_lib.py: Python wrappers
## Installing PaddlePaddle
Install PaddlePaddle in one of the following ways:
- Build and install from source on the [Paddle develop branch](https://github.com/PaddlePaddle/Paddle/tree/develop); build instructions:
1. [Ubuntu](https://www.paddlepaddle.org.cn/install/doc/source/ubuntu)
1. [CentOS](https://www.paddlepaddle.org.cn/install/doc/source/centos)
1. [macOS](https://www.paddlepaddle.org.cn/install/doc/source/macos)
1. [Windows](https://www.paddlepaddle.org.cn/install/doc/source/windows)
**Note:** building inside docker is recommended.
- Install the Paddle develop [nightly whl package](https://www.paddlepaddle.org.cn/install/doc/tables#多版本whl包列表-dev-11)
**Note:** the gcc version used to build the custom OPs must match the gcc version Paddle was built with. The Paddle develop nightly packages are currently built with **gcc 4.8.2**; if you use a nightly package, build the custom OPs with **gcc 4.8.2** as well, otherwise compatibility problems may occur.
## Building the custom OPs
The C++/CUDA implementation of the custom OPs must be compiled into a shared library. make.sh builds it with g++/nvcc; you can of course write a Makefile or use CMake instead.
The build needs to include PaddlePaddle's header files and link against PaddlePaddle's libraries. The header and library paths can be obtained with:
```
# python
>>> import paddle
>>> print(paddle.sysconfig.get_include())
/paddle/pyenv/local/lib/python2.7/site-packages/paddle/include
>>> print(paddle.sysconfig.get_lib())
/paddle/pyenv/local/lib/python2.7/site-packages/paddle/libs
```
A build script for the shared library is provided:
```
cd src
sh make.sh
```
The build produces `rrpn_lib.so`.
**Note:** if PaddlePaddle was built from source without setting `WITH_MKLDNN` in `cmake`, compiling the custom OPs will fail with errors such as `mkldnn.h` not found; in that case remove the `-DPADDLE_WITH_MKLDNN` option from the compile commands in `make.sh`.
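Once `rrpn_lib.so` is built, it has to be loaded into the Python process before the `rrpn_*` operators can be used. The snippet below is a minimal sketch; it assumes a PaddlePaddle version that exposes `fluid.load_op_library`, and the library path is only an example (the wrappers in `rrpn_lib.py` are expected to handle this step for you):
```
import paddle.fluid as fluid

# Load the compiled custom OP library so that the custom operators
# (e.g. rrpn_generate_proposals) are registered in this process.
# Adjust the path to wherever the build placed rrpn_lib.so.
fluid.load_op_library('ext_op/src/rrpn_lib.so')
```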
## Setting environment variables
Paddle's core libraries must be added to `LD_LIBRARY_PATH`. First run the following to get the library path:
```
import paddle
print(paddle.sysconfig.get_lib())
```
The path can then be added as follows:
```
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:`python -c 'import paddle; print(paddle.sysconfig.get_lib())'`
```
For more details on defining C++ OPs outside the framework, see the [official documentation](https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_usage/index_cn.html).
/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Based on
--------------------------------------------------------
@misc{ma2019rrpn,
author = {Jianqi Ma},
title = {{RRPN in pytorch}},
year = {2019},
howpublished = {\url{https://github.com/mjq11302010044/RRPN_pytorch}},
}
@article{Jianqi17RRPN,
Author = {Jianqi Ma and Weiyuan Shao and Hao Ye and Li Wang and Hong Wang
and Yingbin Zheng and Xiangyang Xue},
Title = {Arbitrary-Oriented Scene Text Detection via Rotation Proposals},
journal = {IEEE Transactions on Multimedia},
volume={20},
number={11},
pages={3111-3122},
year={2018}
}
--------------------------------------------------------
*/
#pragma once
#include <algorithm>
#include "paddle/fluid/framework/eigen.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/framework/tensor.h"
namespace paddle {
namespace operators {
#define PI 3.141592654
struct RangeInitFunctor {
int start;
int delta;
int* out;
HOSTDEVICE void operator()(size_t i) { out[i] = start + i * delta; }
};
// get triangle area after decomposing intersecting polygons into triangles
template <typename T>
inline T trangle_area(T* a, T* b, T* c) {
return ((a[0] - c[0]) * (b[1] - c[1]) - (a[1] - c[1]) * (b[0] - c[0])) / 2.0;
}
// get the area of the intersection polygon
template <typename T>
inline T get_area(T* int_pts, int num_of_inter) {
T area = 0.0;
for (int i = 0; i < num_of_inter - 2; i++) {
area += fabs(
trangle_area<T>(int_pts, int_pts + 2 * i + 2, int_pts + 2 * i + 4));
}
return area;
}
// sort points to decompose intersecting polygons into triangles
template <typename T>
inline void reorder_pts(T* int_pts, int num_of_inter) {
if (num_of_inter > 0) {
T center[2] = {0.0, 0.0};
for (int i = 0; i < num_of_inter; i++) {
center[0] += int_pts[2 * i];
center[1] += int_pts[2 * i + 1];
}
center[0] /= num_of_inter;
center[1] /= num_of_inter;
T vs[16];
T v[2];
T d;
for (int i = 0; i < num_of_inter; i++) {
v[0] = int_pts[2 * i] - center[0];
v[1] = int_pts[2 * i + 1] - center[1];
d = sqrt(v[0] * v[0] + v[1] * v[1]);
v[0] = v[0] / d;
v[1] = v[1] / d;
if (v[1] < 0) {
v[0] = -2 - v[0];
}
vs[i] = v[0];
}
float temp, tx, ty;
int j;
for (int i = 1; i < num_of_inter; ++i) {
if (vs[i - 1] > vs[i]) {
temp = vs[i];
tx = int_pts[2 * i];
ty = int_pts[2 * i + 1];
j = i;
while (j > 0 && vs[j - 1] > temp) {
vs[j] = vs[j - 1];
int_pts[j * 2] = int_pts[j * 2 - 2];
int_pts[j * 2 + 1] = int_pts[j * 2 - 1];
j--;
}
vs[j] = temp;
int_pts[j * 2] = tx;
int_pts[j * 2 + 1] = ty;
}
}
}
}
// determine whether edge i of pts1 intersects edge j of pts2; if so, store the intersection point in temp_pts
template <typename T>
inline bool inter2line(T* pts1, T* pts2, int i, int j, T* temp_pts) {
T a[2] = {pts1[2 * i], pts1[2 * i + 1]};
T b[2] = {pts1[2 * ((i + 1) % 4)], pts1[2 * ((i + 1) % 4) + 1]};
T c[2] = {pts2[2 * j], pts2[2 * j + 1]};
T d[2] = {pts2[2 * ((j + 1) % 4)], pts2[2 * ((j + 1) % 4) + 1]};
T area_abc, area_abd, area_cda, area_cdb;
area_abc = trangle_area<T>(a, b, c);
area_abd = trangle_area<T>(a, b, d);
if (area_abc * area_abd >= -1e-5) {
return false;
}
area_cda = trangle_area<T>(c, d, a);
area_cdb = area_cda + area_abc - area_abd;
if (area_cda * area_cdb >= -1e-5) {
return false;
}
T t = area_cda / (area_abd - area_abc);
T dx = t * (b[0] - a[0]);
T dy = t * (b[1] - a[1]);
temp_pts[0] = a[0] + dx;
temp_pts[1] = a[1] + dy;
return true;
}
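// check whether point (pt_x, pt_y) lies inside the quadrilateral given by pts (x1,y1,...,x4,y4)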
template <typename T>
inline bool inrect(T pt_x, T pt_y, T* pts) {
T ab[2] = {pts[2] - pts[0], pts[3] - pts[1]};
T ad[2] = {pts[6] - pts[0], pts[7] - pts[1]};
T ap[2] = {pt_x - pts[0], pt_y - pts[1]};
T abab = ab[0] * ab[0] + ab[1] * ab[1];
T abap = ab[0] * ap[0] + ab[1] * ap[1];
T adad = ad[0] * ad[0] + ad[1] * ad[1];
T adap = ad[0] * ap[0] + ad[1] * ap[1];
bool result = (abab - abap >= -1) and (abap >= -1) and (adad - adap >= -1) and
(adap >= -1);
return result;
}
// collect the intersection points of two rotated boxes and return their count
template <typename T>
inline int inter_pts(T* pts1, T* pts2, T* int_pts) {
int num_of_inter = 0;
for (int i = 0; i < 4; i++) {
if (inrect<T>(pts1[2 * i], pts1[2 * i + 1], pts2)) {
int_pts[num_of_inter * 2] = pts1[2 * i];
int_pts[num_of_inter * 2 + 1] = pts1[2 * i + 1];
num_of_inter++;
}
if (inrect<T>(pts2[2 * i], pts2[2 * i + 1], pts1)) {
int_pts[num_of_inter * 2] = pts2[2 * i];
int_pts[num_of_inter * 2 + 1] = pts2[2 * i + 1];
num_of_inter++;
}
}
T out_pts[2];
for (int i = 0; i < 4; i++) {
for (int j = 0; j < 4; j++) {
bool has_pts = inter2line<T>(pts1, pts2, i, j, out_pts);
if (has_pts) {
int_pts[num_of_inter * 2] = out_pts[0];
int_pts[num_of_inter * 2 + 1] = out_pts[1];
num_of_inter++;
}
}
}
return num_of_inter;
}
// convert x,y,w,h,angle to x1,y1,x2,y2,x3,y3,x4,y4
template <typename T>
inline void convert_region(T* pts,
const framework::Tensor& _region,
int index) {
auto region = framework::EigenTensor<T, 2>::From(_region);
T angle = region(index, 4);
T a_cos = cos(angle / 180.0 * PI);
T a_sin = -sin(angle / 180.0 * PI); // anti clock-wise
T ctr_x = region(index, 0);
T ctr_y = region(index, 1);
T h = region(index, 3);
T w = region(index, 2);
T pts_x[4] = {-w / 2, -w / 2, w / 2, w / 2};
T pts_y[4] = {-h / 2, h / 2, h / 2, -h / 2};
for (int i = 0; i < 4; i++) {
pts[2 * i] = a_cos * pts_x[i] - a_sin * pts_y[i] + ctr_x;
pts[2 * i + 1] = a_sin * pts_x[i] + a_cos * pts_y[i] + ctr_y;
}
}
// Calculate the area of intersection
template <typename T>
inline float inter(const framework::Tensor& _region1,
const framework::Tensor& _region2,
const int& r,
const int& c) {
T pts1[8];
T pts2[8];
T int_pts[16];
int num_of_inter;
convert_region<T>(pts1, _region1, r);
convert_region<T>(pts2, _region2, c);
num_of_inter = inter_pts<T>(pts1, pts2, int_pts);
reorder_pts<T>(int_pts, num_of_inter);
return get_area<T>(int_pts, num_of_inter);
}
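// rotated IoU of box r in _region1 and box c in _region2; boxes are in (x_ctr, y_ctr, w, h, angle) format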
template <typename T>
inline float devRotateIoU(const framework::Tensor& _region1,
const framework::Tensor& _region2,
const int r,
const int c) {
auto __region1 = framework::EigenTensor<T, 2>::From(_region1);
auto __region2 = framework::EigenTensor<T, 2>::From(_region2);
if ((fabs(__region1(r, 0) - __region2(c, 0)) < 1e-5) &&
(fabs(__region1(r, 1) - __region2(c, 1)) < 1e-5) &&
(fabs(__region1(r, 2) - __region2(c, 2)) < 1e-5) &&
(fabs(__region1(r, 3) - __region2(c, 3)) < 1e-5) &&
(fabs(__region1(r, 4) - __region2(c, 4)) < 1e-5)) {
return 1.0;
}
T area1, area2, area_inter;
area1 = __region1(r, 2) * __region1(r, 3);
area2 = __region2(c, 2) * __region2(c, 3);
area_inter = inter<T>(_region1, _region2, r, c);
auto result = area_inter / (area1 + area2 - area_inter);
if (result < 0) {
result = 0.0;
}
// may have bugs which cause overlap > 1
if (result > 1.00000001) {
result = 0.0;
}
return result;
}
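// compute regression targets (dx, dy, dw, dh, dangle) from proposal boxes to matched ground-truth boxes, optionally scaled by weights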
template <typename T>
inline void BoxToDelta2(const int box_num,
const framework::Tensor& ex_boxes,
const framework::Tensor& gt_boxes,
const float* weights,
framework::Tensor* box_delta) {
auto ex_boxes_et = framework::EigenTensor<T, 2>::From(ex_boxes);
auto gt_boxes_et = framework::EigenTensor<T, 2>::From(gt_boxes);
auto trg = framework::EigenTensor<T, 2>::From(*box_delta);
T ex_w, ex_h, ex_ctr_x, ex_ctr_y, ex_angle, gt_w, gt_h, gt_ctr_x, gt_ctr_y,
gt_angle;
for (int64_t i = 0; i < box_num; ++i) {
ex_w = ex_boxes_et(i, 2);
ex_h = ex_boxes_et(i, 3);
ex_ctr_x = ex_boxes_et(i, 0);
ex_ctr_y = ex_boxes_et(i, 1);
ex_angle = ex_boxes_et(i, 4);
gt_w = gt_boxes_et(i, 2);
gt_h = gt_boxes_et(i, 3);
gt_ctr_x = gt_boxes_et(i, 0);
gt_ctr_y = gt_boxes_et(i, 1);
gt_angle = gt_boxes_et(i, 4);
trg(i, 0) = (gt_ctr_x - ex_ctr_x) / ex_w;
trg(i, 1) = (gt_ctr_y - ex_ctr_y) / ex_h;
trg(i, 2) = std::log(gt_w / ex_w);
trg(i, 3) = std::log(gt_h / ex_h);
trg(i, 4) = gt_angle - ex_angle;
if (weights) {
trg(i, 0) = trg(i, 0) * weights[0];
trg(i, 1) = trg(i, 1) * weights[1];
trg(i, 2) = trg(i, 2) * weights[2];
trg(i, 3) = trg(i, 3) * weights[3];
trg(i, 4) = trg(i, 4) * weights[4];
}
if (gt_angle <= -30 && ex_angle >= 120) {
trg(i, 4) = trg(i, 4) + 180.0;
}
if (gt_angle >= 120 && ex_angle <= -30) {
trg(i, 4) = trg(i, 4) - 180.0;
}
trg(i, 4) = (PI / 180) * trg(i, 4);
}
}
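// gather: copy the rows of `in` selected by `index` into `out`; each row holds `in_stride` elements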
template <typename T>
void Gather(
const T* in, const int in_stride, const int* index, const int num, T* out) {
const int stride_bytes = in_stride * sizeof(T);
for (int i = 0; i < num; ++i) {
int id = index[i];
memcpy(out + i * in_stride, in + id * in_stride, stride_bytes);
}
}
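// fill `overlaps` with the pairwise rotated IoU between every box in r_boxes and every box in c_boxes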
template <typename T>
void BboxOverlaps2(const framework::Tensor& r_boxes,
const framework::Tensor& c_boxes,
framework::Tensor* overlaps) {
auto overlaps_et = framework::EigenTensor<T, 2>::From(*overlaps);
int r_num = r_boxes.dims()[0];
int c_num = c_boxes.dims()[0];
for (int i = 0; i < r_num; ++i) {
for (int j = 0; j < c_num; ++j) {
overlaps_et(i, j) = devRotateIoU<T>(r_boxes, c_boxes, i, j);
}
}
}
} // namespace operators
} // namespace paddle
// Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#pragma once
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/tensor.h"
#ifdef PADDLE_WITH_MKLML
#include "paddle/fluid/platform/dynload/mklml.h"
#endif
#ifdef PADDLE_WITH_LIBXSMM
#include <libxsmm.h>
#endif
#ifdef PADDLE_USE_OPENBLAS
#include <cblas.h>
#endif
namespace paddle {
namespace operators {
namespace math {
/**
* Matrix Descriptor of a memory buffer.
*
* It is used for Blas::MatMul. The MatMul operator can be batched.
* If Mat A is [BatchSize, H, W] and Mat B is [BatchSize, H, W], it will be a
* `batch_size` times of GEMM. The batched GEMM could be faster based on the
* implementation of the blas library. The batch size could be zero. If any
* matrix of `matmul` has a batch size, there will be a batched GEMM, too. e.g.,
* if Mat A is [BatchSize, H1, W1] and Mat B is [W1, W2], the result matrix will
* be [BatchSize, H1, W2].
*
* The boolean flag, `trans`, describes whether the memory holds the transpose of
* the matrix. If trans is true, the last two dims of the matrix are transposed.
* The memory layout of the matrix is [Width, Height] or [BatchSize, Width, Height].
*
* The MatDescriptor is not only the dimension or shape of a matrix, it also
* contains the layout, stride of matrix. It is clearer to have a structure than
* reuse `DDim`.
*/
struct MatDescriptor {
int64_t height_;
int64_t width_;
int64_t stride_{0};
int64_t batch_size_{0};
bool trans_;
};
/**
* Create Matrix Descriptor from a tensor dim, num_flatten_cols, and transpose
* flag
*
* @param tensor_dim: The dimension of the tensor. The rank of this dimension
* must be larger than 1.
*
* @param num_flatten_cols: Reshape a tensor to a matrix. The matrix's first
* dimension (column length) will be the product of the tensor's first
* `num_flatten_cols` dimensions. If num_flatten_cols is zero, the first N-2
* dimensions will be the batch_size of the descriptor.
*
* @param trans: True if the matrix is transposed.
*/
extern MatDescriptor CreateMatrixDescriptor(const framework::DDim& tensor_dim,
int num_flatten_cols,
bool trans);
template <typename DeviceContext>
class Blas {
public:
explicit Blas(const DeviceContext& context) : context_(context) {}
template <typename T>
void GEMM(CBLAS_TRANSPOSE transA,
CBLAS_TRANSPOSE transB,
int M,
int N,
int K,
T alpha,
const T* A,
const T* B,
T beta,
T* C) const;
template <typename T>
void GEMM(bool transA,
bool transB,
int M,
int N,
int K,
T alpha,
const T* A,
int lda,
const T* B,
int ldb,
T beta,
T* C,
int ldc) const;
template <typename T>
void GEMM(CBLAS_TRANSPOSE transA,
CBLAS_TRANSPOSE transB,
int M,
int N,
int K,
T alpha,
const T* A,
int lda,
const T* B,
int ldb,
T beta,
T* C,
int ldc) const;
#ifdef PADDLE_WITH_MKLML
template <typename T>
T* GEMM_ALLOC(const CBLAS_IDENTIFIER id,
const int M,
const int N,
const int K) const;
template <typename T>
void GEMM_PACK(const CBLAS_IDENTIFIER id,
const CBLAS_TRANSPOSE trans,
int M,
int N,
int K,
const T alpha,
const T* src,
const int ld,
T* dst) const;
template <typename T>
void GEMM_COMPUTE(int transA,
int transB,
int M,
int N,
int K,
const T* A,
const int lda,
const T* B,
const int ldb,
T beta,
T* C,
const int ldc) const;
template <typename T>
void GEMM_FREE(T* data) const;
template <typename T>
void CSRMM(const char* transa,
const int* m,
const int* n,
const int* k,
const T* alpha,
const char* matdescra,
const T* val,
const int* indx,
const int* pntrb,
const int* pntre,
const T* b,
const int* ldb,
const T* beta,
T* c,
const int* ldc) const;
#if !defined(PADDLE_WITH_CUDA)
template <typename T>
void MatMulWithHead(const framework::Tensor& mat_a,
const MatDescriptor& dim_a,
const framework::Tensor& mat_b,
const MatDescriptor& dim_b,
T alpha,
int head_number,
framework::Tensor* mat_out,
T beta,
bool mat_y_split_vertical) const;
#endif
#endif
template <typename T>
void MatMul(const int M,
const int N,
const int K,
const T* A,
const T* B,
T* C) const;
template <typename T>
void MatMul(const framework::Tensor& mat_a,
bool trans_a,
const framework::Tensor& mat_b,
bool trans_b,
T alpha,
framework::Tensor* mat_out,
T beta) const;
template <typename T>
void MatMul(const framework::Tensor& mat_a,
bool trans_a,
const framework::Tensor& mat_b,
bool trans_b,
framework::Tensor* mat_out) const {
MatMul(mat_a,
trans_a,
mat_b,
trans_b,
static_cast<T>(1.0),
mat_out,
static_cast<T>(0.0));
}
template <typename T>
void MatMul(const framework::Tensor& mat_a,
const framework::Tensor& mat_b,
framework::Tensor* mat_out) const {
this->template MatMul<T>(mat_a, false, mat_b, false, mat_out);
}
template <typename T>
void AXPY(int n, T alpha, const T* x, T* y) const;
template <typename T>
void VADD(int n, const T* x, const T* y, T* z) const;
template <typename T>
void VSUB(int n, const T* x, const T* y, T* z) const;
template <typename T>
void VMUL(int n, const T* x, const T* y, T* z) const;
template <typename T>
void VDIV(int n, const T* x, const T* y, T* z) const;
template <typename T>
void VCOPY(int n, const T* x, T* y) const;
template <typename T>
void VEXP(int n, const T* x, T* y) const;
template <typename T>
void VSQUARE(int n, const T* x, T* y) const;
template <typename T>
void VPOW(int n, const T* x, T alpha, T* y) const;
template <typename T>
void GEMV(bool trans_a,
int M,
int N,
T alpha,
const T* A,
const T* B,
T beta,
T* C) const;
template <typename T>
T DOT(int n, const T* x, const T* y) const;
template <typename T>
void SCAL(int n, const T a, T* x) const;
template <typename T>
T ASUM(int n, T* x, int inc) const;
template <typename T>
void BatchedGEMM(CBLAS_TRANSPOSE transA,
CBLAS_TRANSPOSE transB,
int M,
int N,
int K,
T alpha,
const T* A,
const T* B,
T beta,
T* C,
int batchCount,
int64_t strideA,
int64_t strideB) const;
#if defined(PADDLE_WITH_MKLML) && !defined(PADDLE_WITH_CUDA)
template <typename T>
void BatchedGEMMWithHead(CBLAS_TRANSPOSE transA,
CBLAS_TRANSPOSE transB,
int W1,
int H1,
int W2,
int H2,
T alpha,
const T* A,
const T* B,
T beta,
T* C,
int batchCount,
int64_t strideA,
int64_t strideB,
int64_t head_number,
bool split_b_vertical) const;
#endif
template <typename T>
void MatMul(const framework::Tensor& mat_a,
const MatDescriptor& dim_a,
const framework::Tensor& mat_b,
const MatDescriptor& dim_b,
T alpha,
framework::Tensor* mat_out,
T beta) const;
template <typename T>
void VINV(int n, const T* a, T* y) const;
template <typename T>
void VMERF(int n, const T* a, T* y, int64_t mode) const;
private:
const DeviceContext& context_;
};
template <typename DeviceContext, typename T>
class BlasT : private Blas<DeviceContext> {
public:
using Blas<DeviceContext>::Blas;
template <typename... ARGS>
void GEMM(ARGS... args) const {
Base()->template GEMM<T>(args...);
}
#ifdef PADDLE_WITH_MKLML
template <typename... ARGS>
T* GEMM_ALLOC(ARGS... args) const {
return Base()->template GEMM_ALLOC<T>(args...);
}
template <typename... ARGS>
void GEMM_PACK(ARGS... args) const {
Base()->template GEMM_PACK<T>(args...);
}
template <typename... ARGS>
void GEMM_COMPUTE(ARGS... args) const {
Base()->template GEMM_COMPUTE<T>(args...);
}
template <typename... ARGS>
void GEMM_FREE(ARGS... args) const {
Base()->template GEMM_FREE<T>(args...);
}
template <typename... ARGS>
void CSRMM(ARGS... args) const {
Base()->template CSRMM<T>(args...);
}
#if !defined(PADDLE_WITH_CUDA)
template <typename... ARGS>
void MatMulWithHead(ARGS... args) const {
Base()->template MatMulWithHead<T>(args...);
}
#endif
#endif
template <typename... ARGS>
void MatMul(ARGS... args) const {
Base()->template MatMul<T>(args...);
}
template <typename... ARGS>
void AXPY(ARGS... args) const {
Base()->template AXPY<T>(args...);
}
template <typename... ARGS>
void VADD(ARGS... args) const {
Base()->template VADD<T>(args...);
}
template <typename... ARGS>
void VSUB(ARGS... args) const {
Base()->template VSUB<T>(args...);
}
template <typename... ARGS>
void VMUL(ARGS... args) const {
Base()->template VMUL<T>(args...);
}
template <typename... ARGS>
void VDIV(ARGS... args) const {
Base()->template VDIV<T>(args...);
}
template <typename... ARGS>
void VCOPY(ARGS... args) const {
Base()->template VCOPY<T>(args...);
}
template <typename... ARGS>
void VEXP(ARGS... args) const {
Base()->template VEXP<T>(args...);
}
template <typename... ARGS>
void VSQUARE(ARGS... args) const {
Base()->template VSQUARE<T>(args...);
}
template <typename... ARGS>
void VPOW(ARGS... args) const {
Base()->template VPOW<T>(args...);
}
template <typename... ARGS>
void GEMV(ARGS... args) const {
Base()->template GEMV<T>(args...);
}
template <typename... ARGS>
T DOT(ARGS... args) const {
return Base()->template DOT<T>(args...);
}
template <typename... ARGS>
void SCAL(ARGS... args) const {
Base()->template SCAL<T>(args...);
}
template <typename... ARGS>
T ASUM(ARGS... args) const {
return Base()->template ASUM<T>(args...);
}
template <typename... ARGS>
void BatchedGEMM(ARGS... args) const {
Base()->template BatchedGEMM<T>(args...);
}
template <typename... ARGS>
void VINV(ARGS... args) const {
Base()->template VINV<T>(args...);
}
template <typename... ARGS>
void VMERF(ARGS... args) const {
Base()->template VMERF<T>(args...);
}
private:
const Blas<DeviceContext>* Base() const {
return static_cast<const Blas<DeviceContext>*>(this);
}
};
template <typename DeviceContext, typename T>
inline BlasT<DeviceContext, T> GetBlas(
const framework::ExecutionContext& exe_ctx) {
return BlasT<DeviceContext, T>(
exe_ctx.template device_context<DeviceContext>());
}
template <typename DeviceContext, typename T>
inline BlasT<DeviceContext, T> GetBlas(const DeviceContext& dev_ctx) {
return BlasT<DeviceContext, T>(dev_ctx);
}
} // namespace math
} // namespace operators
} // namespace paddle
#include "paddle/fluid/operators/math/blas_impl.h"
#ifdef PADDLE_WITH_CUDA
#include "paddle/fluid/operators/math/blas_impl.cu.h"
#endif
/* Copyright (c) 2019 paddlepaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "concat_and_split.h"
#include <vector>
namespace paddle {
namespace operators {
namespace math {
/*
* All tensors' dimension should be the same and the values of
* each dimension must be the same, except the axis dimension.
*/
template <typename T>
class ConcatFunctor<platform::CPUDeviceContext, T> {
public:
void operator()(const platform::CPUDeviceContext& context,
const std::vector<framework::Tensor>& input,
int axis,
framework::Tensor* output) {
// TODO(zcd): Add input data validity checking
int num = input.size();
int rows = 1;
auto dim_0 = input[0].dims();
for (int i = 0; i < axis; ++i) {
rows *= dim_0[i];
}
int out_rows = rows, out_cols = 0;
std::vector<int64_t> input_cols(input.size());
for (int i = 0; i < num; ++i) {
int t_cols = input[i].numel() / rows;
out_cols += t_cols;
input_cols[i] = t_cols;
}
auto cpu_place = boost::get<platform::CPUPlace>(context.GetPlace());
// computation
auto output_data = output->data<T>();
int col_idx = 0;
for (int j = 0; j < num; ++j) {
int col_len = input_cols[j];
auto input_data = input[j].data<T>();
for (int k = 0; k < out_rows; ++k) {
memory::Copy(cpu_place,
output_data + k * out_cols + col_idx,
cpu_place,
input_data + k * col_len,
sizeof(T) * col_len);
}
col_idx += col_len;
}
}
};
#define DEFINE_FUNCTOR(type) \
template class ConcatFunctor<platform::CPUDeviceContext, type>;
FOR_ALL_TYPES(DEFINE_FUNCTOR);
} // namespace math
} // namespace operators
} // namespace paddle
/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <vector>
#include "paddle/fluid/framework/data_type.h"
#include "paddle/fluid/framework/lod_tensor.h"
namespace paddle {
namespace operators {
namespace math {
/*
* \brief Concatenate the input tensors along the dimension axis.
* TODO(zcd): maybe it needs to be more detailed.
* Examples:
* Input[0] = [[1,2],[3,4]]
* Input[1] = [[5,6]]
* axis = 0
*
* Output = [[1,2],
* [3,4],
* [5,6]]
*/
template <typename DeviceContext, typename T>
class ConcatFunctor {
public:
void operator()(const DeviceContext& context,
const std::vector<framework::Tensor>& input,
int axis,
framework::Tensor* output);
};
} // namespace math
} // namespace operators
} // namespace paddle
#define FOR_ALL_TYPES(macro) \
macro(int); \
macro(float); \
macro(double); \
macro(bool); \
macro(int64_t); \
macro(int16_t); \
macro(uint8_t); \
macro(int8_t); \
macro(::paddle::platform::float16)
/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <vector>
#include "paddle/fluid/framework/dim.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/tensor.h"
#include "paddle/fluid/memory/malloc.h"
#include "paddle/fluid/platform/cuda_primitives.h"
#include "paddle/fluid/platform/place.h"
namespace paddle {
namespace operators {
using framework::Tensor;
using platform::DeviceContext;
#define CUDA_1D_KERNEL_LOOP(i, n) \
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (n); \
i += blockDim.x * gridDim.x)
template <typename T, typename IndexT = int>
__global__ void GatherCUDAKernel(const T* params,
const IndexT* indices,
T* output,
size_t index_size,
size_t slice_size) {
CUDA_1D_KERNEL_LOOP(i, index_size * slice_size) {
int indices_i = i / slice_size;
int slice_i = i - indices_i * slice_size; // offset inside the slice
IndexT gather_i = indices[indices_i];
IndexT params_i = gather_i * slice_size + slice_i;
*(output + i) = *(params + params_i);
}
}
template <typename T, typename IndexT = int>
__global__ void GatherNdCUDAKernel(const T* input,
const int* input_dims,
const IndexT* indices,
T* output,
size_t remain_size,
size_t slice_size,
size_t end_size) {
CUDA_1D_KERNEL_LOOP(i, remain_size * slice_size) {
int indices_i = i / slice_size;
int slice_i = i - indices_i * slice_size; // offset inside the slice
IndexT gather_i = 0;
int64_t temp = slice_size;
for (int64_t j = end_size - 1; j >= 0; --j) {
auto index_value = indices[indices_i * end_size + j];
assert(index_value >= 0 && index_value < input_dims[j]);
gather_i += (index_value * temp);
temp *= input_dims[j];
}
IndexT input_i = gather_i + slice_i;
*(output + i) = *(input + input_i);
}
}
/**
* A thin wrapper on gpu tensor
* Return a new tensor from source tensor, gathered according to index
* input[src]: type-T source Tensor
* input[index]: type-IndexT index Tensor (1-D)
* return: output tensor
*/
template <typename T, typename IndexT = int>
void GPUGather(const platform::DeviceContext& ctx,
const Tensor& src,
const Tensor& index,
Tensor* output) {
// check index of shape 1-D
if (index.dims().size() == 1) {
PADDLE_ENFORCE_GT(index.dims()[0],
0,
"The index of gather_op should not be empty when the "
"index's rank is 1.");
} else if (index.dims().size() == 2) {
PADDLE_ENFORCE_EQ(index.dims()[1],
1,
" If the index's rank of gather_op is 2, the second "
"dimension should be 1.");
}
int index_size = index.dims()[0];
auto src_dims = src.dims();
framework::DDim output_dims(src_dims);
output_dims[0] = index_size;
// slice size
int slice_size = 1;
for (int i = 1; i < src_dims.size(); ++i) slice_size *= src_dims[i];
const T* p_src = src.data<T>();
const IndexT* p_index = index.data<IndexT>();
T* p_output = output->data<T>();
int block = 512;
int n = slice_size * index_size;
int grid = (n + block - 1) / block;
GatherCUDAKernel<T, IndexT><<<
grid,
block,
0,
reinterpret_cast<const platform::CUDADeviceContext&>(ctx).stream()>>>(
p_src, p_index, p_output, index_size, slice_size);
}
} // namespace operators
} // namespace paddle
/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <memory.h>
#include <cstring>
#include "paddle/fluid/framework/ddim.h"
#include "paddle/fluid/framework/eigen.h"
#include "paddle/fluid/framework/tensor.h"
#include "paddle/fluid/platform/place.h"
namespace paddle {
namespace operators {
using framework::Tensor;
/**
* A thin wrapper for gathering on cpu tensor
* Return a new tensor from source tensor, gathered according to index
* input[src]: type-T source Tensor
* input[index]: type-IndexT index Tensor (1-D)
* return: output tensor
*/
template <typename T, typename IndexT = int>
void CPUGather(const platform::DeviceContext& ctx,
const Tensor& src,
const Tensor& index,
Tensor* output) {
PADDLE_ENFORCE_EQ(platform::is_cpu_place(ctx.GetPlace()), true);
// check index of shape 1-D
if (index.dims().size() == 2) {
PADDLE_ENFORCE_EQ(index.dims()[1],
1,
"index.dims()[1] should be 1 when index.dims().size() == "
"2 in gather_op.");
} else {
PADDLE_ENFORCE_EQ(index.dims().size(),
1,
"index.dims().size() should be 1 or 2 in gather_op.");
}
int64_t index_size = index.dims()[0];
auto src_dims = src.dims();
const T* p_src = src.data<T>();
const IndexT* p_index = index.data<IndexT>();
T* p_output = output->data<T>();
// slice size
int slice_size = 1;
for (int i = 1; i < src_dims.size(); ++i) slice_size *= src_dims[i];
const size_t slice_bytes = slice_size * sizeof(T);
for (int64_t i = 0; i < index_size; ++i) {
IndexT index_ = p_index[i];
memcpy(p_output + i * slice_size, p_src + index_ * slice_size, slice_bytes);
}
}
} // namespace operators
} // namespace paddle
include_dir=$( python -c 'import paddle; print(paddle.sysconfig.get_include())' )
lib_dir=$( python -c 'import paddle; print(paddle.sysconfig.get_lib())' )
echo $include_dir
echo $lib_dir
CUDA=$1
CUDNN=$2
NCCL=$3
if [ ! -d "$CUDA" ]; then
echo "Usage: sh make.sh \$CUDA_PATH \$CUDNN_PATH \$NCCL_PATH"
exit
fi
if [ ! -d "$CUDNN" ]; then
echo "Usage: sh make.sh \${CUDA_PATH} \${CUDNN_PATH} \${NCCL_PATH}"
exit
fi
if [ ! -d "$NCCL" ]; then
echo "Usage: sh make.sh \${CUDA_PATH} \${CUDNN_PATH} \${NCCL_PATH}"
exit
fi
git clone https://github.com/NVlabs/cub.git
nvcc rrpn_generate_proposals_op.cu -c -o rrpn_generate_proposals_op.cu.o -ccbin cc -DPADDLE_WITH_MKLDNN -DPADDLE_WITH_CUDA -DEIGEN_USE_GPU -DPADDLE_USE_DSO -Xcompiler -fPIC -std=c++11 -Xcompiler -fPIC -w --expt-relaxed-constexpr -O3 -DNVCC \
-I ${include_dir} \
-I ${include_dir}/third_party \
-I ${CUDA}/include \
-I ${CUDNN}/include \
-I ${NCCL}/include \
-L ${lib_dir} -lpaddle_framework \
-L ${CUDA}/lib64 -lcudart
nvcc rotated_anchor_generator_op.cu -c -o rotated_anchor_generator_op.cu.o -ccbin cc -DPADDLE_WITH_MKLDNN -DPADDLE_WITH_CUDA -DEIGEN_USE_GPU -DPADDLE_USE_DSO -Xcompiler -fPIC -std=c++11 -Xcompiler -fPIC -w --expt-relaxed-constexpr -O3 -DNVCC \
-I ${include_dir} \
-I ${include_dir}/third_party \
-I ${CUDA}/include \
-I ${CUDNN}/include \
-I ${NCCL}/include \
-L ${lib_dir} -lpaddle_framework \
-L ${CUDA}/lib64 -lcudart
nvcc rrpn_box_coder_op.cu -c -o rrpn_box_coder_op.cu.o -ccbin cc -DPADDLE_WITH_MKLDNN -DPADDLE_WITH_CUDA -DEIGEN_USE_GPU -DPADDLE_USE_DSO -Xcompiler -fPIC -std=c++11 -Xcompiler -fPIC -w --expt-relaxed-constexpr -O3 -DNVCC \
-I ${include_dir} \
-I ${include_dir}/third_party \
-I ${CUDA}/include \
-I ${CUDNN}/include \
-I ${NCCL}/include \
-L ${lib_dir} -lpaddle_framework \
-L ${CUDA}/lib64 -lcudart
nvcc rrpn_rotated_roi_align_op.cu -c -o rrpn_rotated_roi_align_op.cu.o -ccbin cc -DPADDLE_WITH_MKLDNN -DPADDLE_WITH_CUDA -DEIGEN_USE_GPU -DPADDLE_USE_DSO -Xcompiler -fPIC -std=c++11 -Xcompiler -fPIC -w --expt-relaxed-constexpr -O3 -DNVCC \
-I ${include_dir} \
-I ${include_dir}/third_party \
-I ${CUDA}/include \
-I ${CUDNN}/include \
-I ${NCCL}/include \
-L ${lib_dir} -lpaddle_framework \
-L ${CUDA}/lib64 -lcudart
g++ rotated_anchor_generator_op.cc concat_and_split.cc rrpn_generate_proposal_labels_op.cc rrpn_generate_proposals_op.cc rrpn_target_assign_op.cc rrpn_box_coder_op.cc rrpn_rotated_roi_align_op.cc rrpn_rotated_roi_align_op.cu.o rrpn_box_coder_op.cu.o rotated_anchor_generator_op.cu.o rrpn_generate_proposals_op.cu.o -o rrpn_lib.so -shared -fPIC -std=c++11 -O3 -DPADDLE_WITH_MKLDNN -DPADDLE_WITH_CUDA -DEIGEN_USE_GPU -DPADDLE_USE_DSO \
-I ${include_dir} \
-I ${include_dir}/third_party \
-I ${CUDA}/include \
-I ${CUDNN}/include \
-I ${NCCL}/include \
-L ${lib_dir} -lpaddle_framework \
-L ${CUDA}/lib64 -lcudart
/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "math_function.h"
#ifdef PADDLE_WITH_MKLML
#include "paddle/fluid/platform/dynload/mklml.h"
#endif
#ifdef PADDLE_USE_OPENBLAS
#include <cblas.h>
#endif
#include <vector>
#include "math_function_impl.h"
#include "paddle/fluid/framework/data_type.h"
#include "paddle/fluid/platform/float16.h"
namespace paddle {
namespace operators {
namespace math {
#define DEFINE_CPU_TRANS(RANK) \
template struct Transpose<platform::CPUDeviceContext, \
platform::float16, \
RANK>; \
template struct Transpose<platform::CPUDeviceContext, float, RANK>; \
template struct Transpose<platform::CPUDeviceContext, double, RANK>; \
template struct Transpose<platform::CPUDeviceContext, int, RANK>; \
template struct Transpose<platform::CPUDeviceContext, int64_t, RANK>; \
template struct Transpose<platform::CPUDeviceContext, bool, RANK>; \
template struct Transpose<platform::CPUDeviceContext, int16_t, RANK>; \
template struct Transpose<platform::CPUDeviceContext, uint8_t, RANK>; \
template struct Transpose<platform::CPUDeviceContext, int8_t, RANK>;
DEFINE_CPU_TRANS(1);
DEFINE_CPU_TRANS(2);
DEFINE_CPU_TRANS(3);
DEFINE_CPU_TRANS(4);
DEFINE_CPU_TRANS(5);
DEFINE_CPU_TRANS(6);
template <typename DeviceContext, typename T, int Rank>
void Transpose<DeviceContext, T, Rank>::operator()(
const DeviceContext& context,
const framework::Tensor& in,
framework::Tensor* out,
const std::vector<int>& axis) {
Eigen::array<int, Rank> permute;
for (int i = 0; i < Rank; i++) {
permute[i] = axis[i];
}
auto eigen_in = framework::EigenTensor<T, Rank>::From(in);
auto eigen_out = framework::EigenTensor<T, Rank>::From(*out);
auto* dev = context.eigen_device();
eigen_out.device(*dev) = eigen_in.shuffle(permute);
}
} // namespace math
} // namespace operators
} // namespace paddle
/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <cmath>
#include <vector>
#include "paddle/fluid/framework/eigen.h"
#include "paddle/fluid/framework/operator.h"
#include "paddle/fluid/framework/tensor.h"
#include "paddle/fluid/framework/tensor_util.h"
#include "paddle/fluid/platform/device_context.h"
#include "paddle/fluid/platform/enforce.h"
namespace paddle {
namespace operators {
namespace math {
template <typename DeviceContext, typename T, int Rank>
struct Transpose {
void operator()(const DeviceContext& context,
const framework::Tensor& in,
framework::Tensor* out,
const std::vector<int>& axis);
};
void set_constant(const platform::DeviceContext& context,
framework::Tensor* tensor,
float value);
} // namespace math
} // namespace operators
} // namespace paddle
/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "rotated_anchor_generator_op.h"
namespace paddle {
namespace operators {
class RotatedAnchorGeneratorOp : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
void InferShape(framework::InferShapeContext* ctx) const override {
PADDLE_ENFORCE(
ctx->HasInput("Input"),
"Input(Input) of RotatedAnchorGeneratorOp should not be null.");
PADDLE_ENFORCE(
ctx->HasOutput("Anchors"),
"Output(Anchors) of RotatedAnchorGeneratorOp should not be null.");
PADDLE_ENFORCE(
ctx->HasOutput("Variances"),
"Output(Variances) of RotatedAnchorGeneratorOp should not be null.");
auto input_dims = ctx->GetInputDim("Input");
PADDLE_ENFORCE(input_dims.size() == 4, "The layout of input is NCHW.");
auto anchor_sizes = ctx->Attrs().Get<std::vector<float>>("anchor_sizes");
auto aspect_ratios = ctx->Attrs().Get<std::vector<float>>("aspect_ratios");
auto angles = ctx->Attrs().Get<std::vector<float>>("angles");
auto stride = ctx->Attrs().Get<std::vector<float>>("stride");
auto variances = ctx->Attrs().Get<std::vector<float>>("variances");
size_t num_anchors =
aspect_ratios.size() * anchor_sizes.size() * angles.size();
std::vector<int64_t> dim_vec(4);
dim_vec[0] = input_dims[2];
dim_vec[1] = input_dims[3];
dim_vec[2] = num_anchors;
dim_vec[3] = 5;
ctx->SetOutputDim("Anchors", framework::make_ddim(dim_vec));
ctx->SetOutputDim("Variances", framework::make_ddim(dim_vec));
}
protected:
framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext& ctx) const override {
return framework::OpKernelType(
ctx.Input<framework::Tensor>("Input")->type(), ctx.device_context());
}
};
class RotatedAnchorGeneratorOpMaker : public framework::OpProtoAndCheckerMaker {
public:
void Make() override {
AddInput("Input",
"(Tensor, default Tensor<float>), "
"the input feature is a tensor with a rank of 4. "
"The layout is NCHW.");
AddOutput("Anchors",
"(Tensor, default Tensor<float>), the output is a "
"tensor with a rank of 4. The layout is [H, W, num_anchors, 5]. "
"H is the height of input, W is the width of input, num_anchors "
"is the box count of each position. "
"Each anchor is in (xctr, yctr, w, h, thelta) format");
AddOutput("Variances",
"(Tensor, default Tensor<float>), the expanded variances for "
"normalizing bbox regression targets. The layout is [H, W, "
"num_anchors, 5]. "
"H is the height of input, W is the width of input, num_anchors "
"is the box count of each position. "
"Each variance is in (xctr, yctr, w, h, thelta) format");
AddAttr<std::vector<float>>(
"anchor_sizes",
"(vector<float>) List of Rotated Region Proposal Network(RRPN) anchor "
"sizes "
" given in absolute pixels e.g. (64, 128, 256, 512)."
" For instance, the anchor size of 64 means the area of this anchor "
"equals to 64**2.")
.AddCustomChecker([](const std::vector<float>& anchor_sizes) {
PADDLE_ENFORCE_GT(anchor_sizes.size(),
0UL,
"Size of anchor_sizes must be at least 1.");
for (size_t i = 0; i < anchor_sizes.size(); ++i) {
PADDLE_ENFORCE_GT(
anchor_sizes[i], 0.0, "anchor_sizes[%d] must be positive.", i);
}
});
AddAttr<std::vector<float>>(
"aspect_ratios",
"(vector<float>) List of Rotated Region Proposal Network(RRPN) anchor "
"aspect "
"ratios, e.g. (0.5, 1, 2)."
"For instacne, the aspect ratio of 0.5 means the height / width of "
"this anchor equals 0.5.");
AddAttr<std::vector<float>>(
"angles",
"(vector<float>) List of Rotated Region Proposal Network(RRPN) anchor "
"angles, "
"e.g. (-30.0, 0.0, 30.0, 60.0, 90.0, 120.0)."
"For instacne, the aspect ratio of 0.5 means the height / width of "
"this anchor equals 0.5.");
AddAttr<std::vector<float>>("variances",
"(vector<float>) List of variances to be used "
"in box regression deltas")
.AddCustomChecker([](const std::vector<float>& variances) {
PADDLE_ENFORCE_EQ(
variances.size(), 5UL, "Must and only provide 5 variances.");
for (size_t i = 0; i < variances.size(); ++i) {
PADDLE_ENFORCE_GT(
variances[i], 0.0, "variance[%d] must be greater than 0.", i);
}
});
AddAttr<std::vector<float>>("stride",
"Anchors stride across width and height, "
"with a default of (16, 16)")
.SetDefault(std::vector<float>(2, 16.0))
.AddCustomChecker([](const std::vector<float>& stride) {
PADDLE_ENFORCE_EQ(
stride.size(),
2UL,
"Must and only provide 2 stride for width and height.");
for (size_t i = 0; i < stride.size(); ++i) {
PADDLE_ENFORCE_GT(
stride[i], 0.0, "stride[%d] should be larger than 0.", i);
}
});
AddAttr<float>("offset",
"(float) "
"Anchor center offset, with a default of 0.5")
.SetDefault(0.5);
AddComment(R"DOC(
RotatedAnchorGenerator operator
Generates anchors for RRPN. algorithm.
Each position of the input produce N anchors, N =
size(anchor_sizes) * size(aspect_ratios) * size(angles).
Please get more information from the following papers:
https://arxiv.org/abs/1703.01086.
)DOC");
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OPERATOR(
rotated_anchor_generator,
ops::RotatedAnchorGeneratorOp,
ops::RotatedAnchorGeneratorOpMaker,
paddle::framework::EmptyGradOpMaker<paddle::framework::OpDesc>,
paddle::framework::EmptyGradOpMaker<paddle::imperative::OpBase>);
REGISTER_OP_CPU_KERNEL(rotated_anchor_generator,
ops::RotatedAnchorGeneratorOpKernel<float>,
ops::RotatedAnchorGeneratorOpKernel<double>);
/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include "rotated_anchor_generator_op.h"
namespace paddle {
namespace operators {
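// CUDA kernel: each thread writes one rotated anchor (x_ctr, y_ctr, w, h, angle)
// for one (position, aspect_ratio, anchor_size, angle) combination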
template <typename T>
__global__ void GenRAnchors(T* out,
const T* aspect_ratios,
const int ar_num,
const T* anchor_sizes,
const int as_num,
const T* angles,
const int aa_num,
const T* stride,
const int sd_num,
const int height,
const int width,
const T offset) {
int num_anchors = as_num * ar_num * aa_num;
int box_num = height * width * num_anchors;
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < box_num;
i += blockDim.x * gridDim.x) {
int h_idx = i / (num_anchors * width);
int w_idx = (i / num_anchors) % width;
T stride_width = stride[0];
T stride_height = stride[1];
T x_ctr = (w_idx * stride_width) + offset * stride_width - 1;
T y_ctr = (h_idx * stride_height) + offset * stride_height - 1;
T area, area_ratios;
T base_w, base_h;
T scale_w, scale_h;
T anchor_width, anchor_height;
int anch_idx = i % num_anchors;
int ar_idx = anch_idx / (as_num * aa_num);
int as_idx = anch_idx / aa_num % as_num;
int aa_idx = anch_idx % aa_num;
T aspect_ratio = aspect_ratios[ar_idx];
T anchor_size = anchor_sizes[as_idx];
T angle = angles[aa_idx];
area = stride_width * stride_height;
area_ratios = area / aspect_ratio;
base_w = round(sqrt(area_ratios));
base_h = round(base_w * aspect_ratio);
scale_w = anchor_size / stride_width;
scale_h = anchor_size / stride_height;
anchor_width = scale_w * base_w;
anchor_height = scale_h * base_h;
out[i * 5] = x_ctr;
out[i * 5 + 1] = y_ctr;
out[i * 5 + 2] = anchor_width;
out[i * 5 + 3] = anchor_height;
out[i * 5 + 4] = angle;
}
}
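// CUDA kernel: broadcast the `vnum` variance values across all `num` output elements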
template <typename T>
__global__ void SetVariance(T* out,
const T* var,
const int vnum,
const int num) {
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num;
i += blockDim.x * gridDim.x) {
out[i] = var[i % vnum];
}
}
template <typename T>
class RotatedAnchorGeneratorOpCUDAKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* input = ctx.Input<paddle::framework::Tensor>("Input");
auto* anchors = ctx.Output<paddle::framework::Tensor>("Anchors");
auto* vars = ctx.Output<paddle::framework::Tensor>("Variances");
auto anchor_sizes = ctx.Attr<std::vector<float>>("anchor_sizes");
auto aspect_ratios = ctx.Attr<std::vector<float>>("aspect_ratios");
auto angles = ctx.Attr<std::vector<float>>("angles");
auto stride = ctx.Attr<std::vector<float>>("stride");
auto variances = ctx.Attr<std::vector<float>>("variances");
T offset = static_cast<T>(ctx.Attr<float>("offset"));
auto width = input->dims()[3];
auto height = input->dims()[2];
int num_anchors =
aspect_ratios.size() * anchor_sizes.size() * angles.size();
int box_num = width * height * num_anchors;
int block = 512;
int grid = (box_num + block - 1) / block;
auto stream =
ctx.template device_context<platform::CUDADeviceContext>().stream();
anchors->mutable_data<T>(ctx.GetPlace());
vars->mutable_data<T>(ctx.GetPlace());
framework::Tensor ar;
framework::TensorFromVector(aspect_ratios, ctx.device_context(), &ar);
framework::Tensor as;
framework::TensorFromVector(anchor_sizes, ctx.device_context(), &as);
framework::Tensor aa;
framework::TensorFromVector(angles, ctx.device_context(), &aa);
framework::Tensor sd;
framework::TensorFromVector(stride, ctx.device_context(), &sd);
GenRAnchors<T><<<grid, block, 0, stream>>>(anchors->data<T>(),
ar.data<T>(),
aspect_ratios.size(),
as.data<T>(),
anchor_sizes.size(),
aa.data<T>(),
angles.size(),
sd.data<T>(),
stride.size(),
height,
width,
offset);
framework::Tensor v;
framework::TensorFromVector(variances, ctx.device_context(), &v);
grid = (box_num * 5 + block - 1) / block;
SetVariance<T><<<grid, block, 0, stream>>>(
vars->data<T>(), v.data<T>(), variances.size(), box_num * 5);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_CUDA_KERNEL(rotated_anchor_generator,
ops::RotatedAnchorGeneratorOpCUDAKernel<float>,
ops::RotatedAnchorGeneratorOpCUDAKernel<double>);
/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <algorithm>
#include <vector>
#include "paddle/fluid/framework/op_registry.h"
//#include "paddle/fluid/operators/math/math_function.h"
#include "paddle/fluid/platform/transform.h"
namespace paddle {
namespace operators {
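// CPU kernel: loop over every feature-map position and every (aspect_ratio, anchor_size, angle)
// combination to fill Anchors, then broadcast the variances into Variances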
template <typename T>
class RotatedAnchorGeneratorOpKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* input = ctx.Input<paddle::framework::Tensor>("Input");
auto* anchors = ctx.Output<paddle::framework::Tensor>("Anchors");
auto* vars = ctx.Output<paddle::framework::Tensor>("Variances");
auto anchor_sizes = ctx.Attr<std::vector<float>>("anchor_sizes");
auto aspect_ratios = ctx.Attr<std::vector<float>>("aspect_ratios");
auto angles = ctx.Attr<std::vector<float>>("angles");
auto stride = ctx.Attr<std::vector<float>>("stride");
auto variances = ctx.Attr<std::vector<float>>("variances");
T offset = static_cast<T>(ctx.Attr<float>("offset"));
auto feature_width = input->dims()[3];
auto feature_height = input->dims()[2];
T stride_width, stride_height;
stride_width = stride[0];
stride_height = stride[1];
int num_anchors =
aspect_ratios.size() * anchor_sizes.size() * angles.size();
anchors->mutable_data<T>(ctx.GetPlace());
vars->mutable_data<T>(ctx.GetPlace());
auto e_anchors = framework::EigenTensor<T, 4>::From(*anchors);
for (int h_idx = 0; h_idx < feature_height; ++h_idx) {
for (int w_idx = 0; w_idx < feature_width; ++w_idx) {
T x_ctr = (w_idx * stride_width) + offset * stride_width - 1;
T y_ctr = (h_idx * stride_height) + offset * stride_height - 1;
T area, area_ratios;
T base_w, base_h;
T scale_w, scale_h;
T anchor_width, anchor_height;
int idx = 0;
for (size_t r = 0; r < aspect_ratios.size(); ++r) {
auto ar = aspect_ratios[r];
for (size_t s = 0; s < anchor_sizes.size(); ++s) {
auto anchor_size = anchor_sizes[s];
area = stride_width * stride_height;
area_ratios = area / ar;
base_w = round(sqrt(area_ratios));
base_h = round(base_w * ar);
scale_w = anchor_size / stride_width;
scale_h = anchor_size / stride_height;
anchor_width = scale_w * base_w;
anchor_height = scale_h * base_h;
for (size_t a = 0; a < angles.size(); ++a) {
auto angle = angles[a];
e_anchors(h_idx, w_idx, idx, 0) = x_ctr;
e_anchors(h_idx, w_idx, idx, 1) = y_ctr;
e_anchors(h_idx, w_idx, idx, 2) = anchor_width;
e_anchors(h_idx, w_idx, idx, 3) = anchor_height;
e_anchors(h_idx, w_idx, idx, 4) = angle;
idx++;
}
}
}
}
}
framework::Tensor var_t;
var_t.mutable_data<T>(
framework::make_ddim({1, static_cast<int>(variances.size())}),
ctx.GetPlace());
auto var_et = framework::EigenTensor<T, 2>::From(var_t);
for (size_t i = 0; i < variances.size(); ++i) {
var_et(0, i) = variances[i];
}
int anchor_num = feature_height * feature_width * num_anchors;
auto var_dim = vars->dims();
vars->Resize({anchor_num, static_cast<int>(variances.size())});
auto e_vars = framework::EigenMatrix<T, Eigen::RowMajor>::From(*vars);
e_vars = var_et.broadcast(Eigen::DSizes<int, 2>(anchor_num, 1));
vars->Resize(var_dim);
}
};
} // namespace operators
} // namespace paddle
/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
//#include "rrpn_box_coder_op.h"
#include <string>
#include <vector>
#include "paddle/fluid/framework/op_registry.h"
namespace paddle {
namespace operators {
class RRPNBoxCoderOp : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
protected:
void InferShape(framework::InferShapeContext *ctx) const override {
PADDLE_ENFORCE(ctx->HasInput("PriorBox"),
"Input(PriorBox) of BoxCoderOp should not be null.");
PADDLE_ENFORCE(ctx->HasInput("TargetBox"),
"Input(TargetBox) of BoxCoderOp should not be null.");
PADDLE_ENFORCE(ctx->HasOutput("OutputBox"),
"Output(OutputBox) of BoxCoderOp should not be null.");
auto prior_box_dims = ctx->GetInputDim("PriorBox");
// auto target_box_dims = ctx->GetInputDim("TargetBox");
if (ctx->IsRuntime()) {
PADDLE_ENFORCE_EQ(
prior_box_dims.size(), 2, "The rank of Input PriorBox must be 2");
PADDLE_ENFORCE_EQ(
prior_box_dims[1], 5, "The shape of PriorBox is [N, 5]");
if (ctx->HasInput("PriorBoxVar")) {
auto prior_box_var_dims = ctx->GetInputDim("PriorBoxVar");
PADDLE_ENFORCE(prior_box_var_dims.size() == 2,
"Input(PriorBoxVar) of BoxCoderOp should be 2.");
PADDLE_ENFORCE_EQ(
prior_box_dims,
prior_box_var_dims,
"The dimension of Input(PriorBoxVar) should be equal to"
"the dimension of Input(PriorBox) when the rank is 2.");
}
}
}
};
class RRPNBoxCoderOpMaker : public framework::OpProtoAndCheckerMaker {
public:
void Make() override {
AddInput(
"PriorBox",
"(Tensor, default Tensor<float>) "
"Box list PriorBox is a 2-D Tensor with shape [M, 5] holds M boxes, "
"each box is represented as [x, y, w, h, angle], "
"[x, y] is the center coordinate of the anchor box, "
"if the input is image feature map, they are close to the origin "
"of the coordinate system. [w, h] is the width and height "
"of the anchor box, angle is angle of rotation.");
AddInput("PriorBoxVar",
"(Tensor, default Tensor<float>, optional) "
"PriorBoxVar is a 2-D Tensor with shape [M, 5] holds M group "
"of variance. PriorBoxVar will set all elements to 1 by "
"default.")
.AsDispensable();
AddInput(
"TargetBox",
"(LoDTensor or Tensor) This input can be a 2-D LoDTensor with shape "
"[N, 5], each box is represented as [x, y, w, h, angle],"
"[x, y] is the center coordinate of the box, [w, h] is width and "
"height of the box,"
"angle is angle of rotation around the center of box.");
AddAttr<std::vector<float>>(
"variance",
"(vector<float>, default {}),"
"variance of prior box with shape [5]. PriorBoxVar and variance can"
"not be provided at the same time.")
.SetDefault(std::vector<float>{});
AddOutput("OutputBox",
"(Tensor) "
"2-D Tensor with shape [M, 5] which M represents the number of "
"deocded boxes"
"and 5 represents [x, y, w, h, angle]");
AddComment(R"DOC(
Rotated Bounding Box Coder.
Decode the target bounding box with the priorbox information.
The Decoding schema described below:
ox = pw * tx / pxv + cx
oy = ph * ty / pyv + cy
ow = exp(tw / pwv) * pw
oh = exp(th / phv) * ph
oa = ta / pav * 1.0 / 3.141592653 * 180 + pa
where `tx`, `ty`, `tw`, `th`, `ta` denote the target box's center coordinates, width,
height and angle respectively. Similarly, `px`, `py`, `pw`, `ph`, `pa` denote the
prior box's (anchor) center coordinates, width, height and angle. `pxv`, `pyv`, `pwv`,
`phv`, `pav` denote the variances of the prior box, and `ox`, `oy`, `ow`, `oh`, `oa`
denote the decoded center coordinates, width, height and angle.
)DOC");
}
};
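// Illustrative sketch (not part of the op definition): assuming variances of 1
// and a prior box [cx=10, cy=20, pw=4, ph=8, pa=0] with a target delta
// [tx=0.5, ty=0.25, tw=0, th=0, ta=0], the decoding schema documented above gives
//   ox = 4 * 0.5 + 10 = 12,  oy = 8 * 0.25 + 20 = 22,
//   ow = exp(0) * 4 = 4,     oh = exp(0) * 8 = 8,   oa = 0 degrees.
// Note that the GPU kernel DecodeCenterSizeKernel below additionally divides
// the decoded width/height by 1.4 and emits the four rotated corner points
// (8 values per box) instead of a [x, y, w, h, angle] tuple.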
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OPERATOR(
rrpn_box_coder,
ops::RRPNBoxCoderOp,
ops::RRPNBoxCoderOpMaker,
paddle::framework::EmptyGradOpMaker<paddle::framework::OpDesc>,
paddle::framework::EmptyGradOpMaker<paddle::imperative::OpBase>);
/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <string>
#include <vector>
#include "paddle/fluid/memory/memory.h"
//#include "rrpn_box_coder_op.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/platform/cuda_primitives.h"
namespace paddle {
namespace operators {
#define PI 3.141592654
template <typename T>
__global__ void DecodeCenterSizeKernel(const T* prior_box_data,
const T* prior_box_var_data,
const T* target_box_data,
const int row,
const int len,
const T prior_box_var_size,
const float* variance,
const int var_size,
T* output) {
const int idx = threadIdx.x + blockIdx.x * blockDim.x;
int prior_box_offset = 0;
if (idx < row) {
const int row_idx = idx;
prior_box_offset = row_idx * len;
T prior_box_width = prior_box_data[prior_box_offset + 2];
T prior_box_height = prior_box_data[prior_box_offset + 3];
T prior_box_center_x = prior_box_data[prior_box_offset];
T prior_box_center_y = prior_box_data[prior_box_offset + 1];
T prior_box_angle = prior_box_data[prior_box_offset + 4];
T target_box_width, target_box_height, target_box_angle;
T target_box_center_x, target_box_center_y;
T box_var_x = T(1), box_var_y = T(1);
T box_var_w = T(1), box_var_h = T(1), box_var_angle = T(1);
if (prior_box_var_data) {
int prior_var_offset = row_idx * len;
box_var_x = prior_box_var_data[prior_var_offset];
box_var_y = prior_box_var_data[prior_var_offset + 1];
box_var_w = prior_box_var_data[prior_var_offset + 2];
box_var_h = prior_box_var_data[prior_var_offset + 3];
box_var_angle = prior_box_var_data[prior_var_offset + 4];
} else if (var_size == 5) {
box_var_x = static_cast<T>(variance[0]);
box_var_y = static_cast<T>(variance[1]);
box_var_w = static_cast<T>(variance[2]);
box_var_h = static_cast<T>(variance[3]);
box_var_angle = static_cast<T>(variance[4]);
}
target_box_width =
exp(target_box_data[idx * len + 2] / box_var_w) * prior_box_width / 1.4;
target_box_height = exp(target_box_data[idx * len + 3] / box_var_h) *
prior_box_height / 1.4;
target_box_center_x =
target_box_data[idx * len] / box_var_x * prior_box_width +
prior_box_center_x;
target_box_center_y =
target_box_data[idx * len + 1] / box_var_y * prior_box_height +
prior_box_center_y;
target_box_angle =
(target_box_data[idx * len + 4] / box_var_angle) * 1.0 / PI * 180 +
prior_box_angle;
T a_cos = cos(PI / 180 * target_box_angle);
T a_sin = -sin(PI / 180 * target_box_angle);
T rotation_matrix[3][3];
rotation_matrix[0][0] = a_cos;
rotation_matrix[0][1] = a_sin;
rotation_matrix[0][2] = 0;
rotation_matrix[1][0] = -a_sin;
rotation_matrix[1][1] = a_cos;
rotation_matrix[1][2] = 0;
rotation_matrix[2][0] = -target_box_center_x * a_cos +
target_box_center_y * a_sin + target_box_center_x;
rotation_matrix[2][1] = -target_box_center_x * a_sin -
target_box_center_y * a_cos + target_box_center_y;
rotation_matrix[2][2] = 1;
T pt_x0 = target_box_center_x - target_box_width / 2;
T pt_x1 = target_box_center_x + target_box_width / 2;
T pt_x2 = target_box_center_x + target_box_width / 2;
T pt_x3 = target_box_center_x - target_box_width / 2;
T pt_y0 = target_box_center_y - target_box_height / 2;
T pt_y1 = target_box_center_y - target_box_height / 2;
T pt_y2 = target_box_center_y + target_box_height / 2;
T pt_y3 = target_box_center_y + target_box_height / 2;
output[idx * 8] = pt_x0 * rotation_matrix[0][0] +
pt_y0 * rotation_matrix[1][0] + rotation_matrix[2][0];
output[idx * 8 + 1] = pt_x0 * rotation_matrix[0][1] +
pt_y0 * rotation_matrix[1][1] + rotation_matrix[2][1];
output[idx * 8 + 2] = pt_x1 * rotation_matrix[0][0] +
pt_y1 * rotation_matrix[1][0] + rotation_matrix[2][0];
output[idx * 8 + 3] = pt_x1 * rotation_matrix[0][1] +
pt_y1 * rotation_matrix[1][1] + rotation_matrix[2][1];
output[idx * 8 + 4] = pt_x2 * rotation_matrix[0][0] +
pt_y2 * rotation_matrix[1][0] + rotation_matrix[2][0];
output[idx * 8 + 5] = pt_x2 * rotation_matrix[0][1] +
pt_y2 * rotation_matrix[1][1] + rotation_matrix[2][1];
output[idx * 8 + 6] = pt_x3 * rotation_matrix[0][0] +
pt_y3 * rotation_matrix[1][0] + rotation_matrix[2][0];
output[idx * 8 + 7] = pt_x3 * rotation_matrix[0][1] +
pt_y3 * rotation_matrix[1][1] + rotation_matrix[2][1];
}
}
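// Note: the kernel above converts each decoded [cx, cy, w, h, angle] box into
// its four corner points via a rotation about the box center, so every output
// row stores 8 values (x0, y0, ..., x3, y3) rather than a 5-tuple.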
template <typename DeviceContext, typename T>
class RRPNBoxCoderCUDAKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
PADDLE_ENFORCE(platform::is_gpu_place(context.GetPlace()),
"This kernel only runs on GPU device.");
auto* prior_box = context.Input<framework::Tensor>("PriorBox");
auto* prior_box_var = context.Input<framework::Tensor>("PriorBoxVar");
auto* target_box = context.Input<framework::LoDTensor>("TargetBox");
auto* output_box = context.Output<framework::Tensor>("OutputBox");
std::vector<float> variance = context.Attr<std::vector<float>>("variance");
const T* prior_box_data = prior_box->data<T>();
const T* target_box_data = target_box->data<T>();
const T* prior_box_var_data = nullptr;
auto prior_box_var_size = 0;
if (prior_box_var) {
PADDLE_ENFORCE(variance.empty(),
"Input 'PriorBoxVar' and attribute 'variance' should not"
"be used at the same time.");
prior_box_var_data = prior_box_var->data<T>();
prior_box_var_size = prior_box_var->dims().size();
}
if (!(variance.empty())) {
PADDLE_ENFORCE(static_cast<int>(variance.size()) == 5,
"Size of attribute 'variance' should be 4");
}
if (target_box->lod().size()) {
PADDLE_ENFORCE_EQ(
target_box->lod().size(), 1, "Only support 1 level of LoD.");
}
const int var_size = static_cast<int>(variance.size());
auto row = target_box->dims()[0];
auto len = 5;
int block = 512;
int grid = (row + block - 1) / block;
auto& device_ctx = context.cuda_device_context();
int bytes = var_size * sizeof(float);
auto dev_var = memory::Alloc(device_ctx, bytes);
float* dev_var_data = reinterpret_cast<float*>(dev_var->ptr());
auto cplace = platform::CPUPlace();
const auto gplace = boost::get<platform::CUDAPlace>(context.GetPlace());
memory::Copy(
gplace, dev_var_data, cplace, &variance[0], bytes, device_ctx.stream());
output_box->mutable_data<T>({row, 8}, context.GetPlace());
T* output = output_box->data<T>();
DecodeCenterSizeKernel<T><<<grid, block, 0, device_ctx.stream()>>>(
prior_box_data,
prior_box_var_data,
target_box_data,
row,
len,
prior_box_var_size,
dev_var_data,
var_size,
output);
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_CUDA_KERNEL(
rrpn_box_coder,
ops::RRPNBoxCoderCUDAKernel<paddle::platform::CUDADeviceContext, float>,
ops::RRPNBoxCoderCUDAKernel<paddle::platform::CUDADeviceContext, double>);
/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <math.h>
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>
#include "bbox_util.h"
#include "concat_and_split.h"
#include "gather.h"
#include "math_function.h"
#include "paddle/fluid/framework/op_registry.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
using LoDTensor = framework::LoDTensor;
const int kBoxDim = 5;
template <typename T>
void AppendRois(LoDTensor* out, int64_t offset, Tensor* to_add) {
auto* out_data = out->data<T>();
auto* to_add_data = to_add->data<T>();
memcpy(out_data + offset, to_add_data, to_add->numel() * sizeof(T));
}
class RRPNGenerateProposalLabelsOp : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
void InferShape(framework::InferShapeContext* ctx) const override {
PADDLE_ENFORCE(ctx->HasInput("RpnRois"),
"Input(RpnRois) shouldn't be null.");
PADDLE_ENFORCE(ctx->HasInput("GtClasses"),
"Input(GtClasses) shouldn't be null.");
PADDLE_ENFORCE(ctx->HasInput("IsCrowd"),
"Input(IsCrowd) shouldn't be null.");
PADDLE_ENFORCE(ctx->HasInput("GtBoxes"),
"Input(GtBoxes) shouldn't be null.");
PADDLE_ENFORCE(ctx->HasInput("ImInfo"), "Input(ImInfo) shouldn't be null.");
PADDLE_ENFORCE(
ctx->HasOutput("Rois"),
"Output(Rois) of RRPNGenerateProposalLabelsOp should not be null");
PADDLE_ENFORCE(ctx->HasOutput("LabelsInt32"),
"Output(LabelsInt32) of RRPNGenerateProposalLabelsOp should "
"not be null");
PADDLE_ENFORCE(ctx->HasOutput("BboxTargets"),
"Output(BboxTargets) of RRPNGenerateProposalLabelsOp should "
"not be null");
PADDLE_ENFORCE(ctx->HasOutput("BboxInsideWeights"),
"Output(BboxInsideWeights) of RRPNGenerateProposalLabelsOp "
"should not be null");
PADDLE_ENFORCE(ctx->HasOutput("BboxOutsideWeights"),
"Output(BboxOutsideWeights) of RRPNGenerateProposalLabelsOp "
"should not be null");
auto rpn_rois_dims = ctx->GetInputDim("RpnRois");
auto gt_boxes_dims = ctx->GetInputDim("GtBoxes");
auto im_info_dims = ctx->GetInputDim("ImInfo");
PADDLE_ENFORCE_EQ(
rpn_rois_dims.size(), 2, "The rank of Input(RpnRois) must be 2.");
PADDLE_ENFORCE_EQ(
gt_boxes_dims.size(), 2, "The rank of Input(GtBoxes) must be 2.");
PADDLE_ENFORCE_EQ(
im_info_dims.size(), 2, "The rank of Input(ImInfo) must be 2.");
int class_nums = ctx->Attrs().Get<int>("class_nums");
ctx->SetOutputDim("Rois", {-1, 5});
ctx->SetOutputDim("LabelsInt32", {-1, 1});
ctx->SetOutputDim("BboxTargets", {-1, 5 * class_nums});
ctx->SetOutputDim("BboxInsideWeights", {-1, 5 * class_nums});
ctx->SetOutputDim("BboxOutsideWeights", {-1, 5 * class_nums});
}
protected:
framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext& ctx) const override {
return framework::OpKernelType(
ctx.Input<framework::LoDTensor>("RpnRois")->type(),
platform::CPUPlace());
}
};
template <typename T>
void Concat(const platform::CPUDeviceContext& context,
const Tensor& in_tensor_a,
const Tensor& in_tensor_b,
Tensor* out_tensor) {
int axis = 0;
std::vector<Tensor> inputs;
inputs.emplace_back(in_tensor_a);
inputs.emplace_back(in_tensor_b);
math::ConcatFunctor<platform::CPUDeviceContext, T> concat_functor;
concat_functor(context, inputs, axis, out_tensor);
}
template <typename T>
std::vector<std::vector<int>> SampleFgBgGt(
const platform::CPUDeviceContext& context,
Tensor* iou,
const Tensor& is_crowd,
const int batch_size_per_im,
const float fg_fraction,
const float fg_thresh,
const float bg_thresh_hi,
const float bg_thresh_lo,
std::minstd_rand engine,
const bool use_random,
const Tensor& rpn_rois) {
std::vector<int> fg_inds;
std::vector<int> bg_inds;
std::vector<int> mapped_gt_inds;
int64_t gt_num = is_crowd.numel();
const int* crowd_data = is_crowd.data<int>();
T* proposal_to_gt_overlaps = iou->data<T>();
int64_t row = iou->dims()[0];
int64_t col = iou->dims()[1];
float epsilon = 0.00001;
const T* rpn_rois_dt = rpn_rois.data<T>();
// Follow the Faster RCNN's implementation
for (int64_t i = 0; i < row; ++i) {
const T* v = proposal_to_gt_overlaps + i * col;
T max_overlap = *std::max_element(v, v + col);
if ((i < gt_num) && (crowd_data[i])) {
max_overlap = -1.0;
}
if (max_overlap >= fg_thresh) {
// fg mapped gt label index
for (int64_t j = 0; j < col; ++j) {
T val = proposal_to_gt_overlaps[i * col + j];
auto diff = std::abs(max_overlap - val);
if (diff < epsilon) {
fg_inds.emplace_back(i);
mapped_gt_inds.emplace_back(j);
break;
}
}
} else if ((max_overlap >= bg_thresh_lo) && (max_overlap < bg_thresh_hi)) {
bg_inds.emplace_back(i);
} else {
continue;
}
}
std::vector<std::vector<int>> res;
// sampling fg
std::uniform_real_distribution<float> uniform(0, 1);
int fg_rois_per_im = std::floor(batch_size_per_im * fg_fraction);
int fg_rois_this_image = fg_inds.size();
int fg_rois_per_this_image = std::min(fg_rois_per_im, fg_rois_this_image);
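  // The loops below perform a reservoir-style random subsample via
  // Fisher-Yates-like swaps: for every index i beyond the number of samples to
  // keep, a random position in [0, i) is drawn and, if it falls inside the
  // kept prefix, swapped with element i. The first fg_rois_per_this_image
  // (resp. bg_rois_per_this_image) entries then form an approximately uniform
  // random sample.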
if (use_random) {
const int64_t fg_size = static_cast<int64_t>(fg_inds.size());
if (fg_size > fg_rois_per_this_image) {
for (int64_t i = fg_rois_per_this_image; i < fg_size; ++i) {
int rng_ind = std::floor(uniform(engine) * i);
if (rng_ind < fg_rois_per_this_image) {
std::iter_swap(fg_inds.begin() + rng_ind, fg_inds.begin() + i);
std::iter_swap(mapped_gt_inds.begin() + rng_ind,
mapped_gt_inds.begin() + i);
}
}
}
}
std::vector<int> new_fg_inds(fg_inds.begin(),
fg_inds.begin() + fg_rois_per_this_image);
std::vector<int> new_gt_inds(mapped_gt_inds.begin(),
mapped_gt_inds.begin() + fg_rois_per_this_image);
// sampling bg
int bg_rois_per_image = batch_size_per_im - fg_rois_per_this_image;
int bg_rois_this_image = bg_inds.size();
int bg_rois_per_this_image = std::min(bg_rois_per_image, bg_rois_this_image);
if (use_random) {
const int64_t bg_size = static_cast<int64_t>(bg_inds.size());
if (bg_size > bg_rois_per_this_image) {
for (int64_t i = bg_rois_per_this_image; i < bg_size; ++i) {
int rng_ind = std::floor(uniform(engine) * i);
        if (rng_ind < bg_rois_per_this_image)
std::iter_swap(bg_inds.begin() + rng_ind, bg_inds.begin() + i);
}
}
}
std::vector<int> new_bg_inds(bg_inds.begin(),
bg_inds.begin() + bg_rois_per_this_image);
res.emplace_back(new_fg_inds);
res.emplace_back(new_bg_inds);
res.emplace_back(new_gt_inds);
return res;
}
template <typename T>
void GatherBoxesLabels(const platform::CPUDeviceContext& context,
const Tensor& boxes,
const Tensor& gt_boxes,
const Tensor& gt_classes,
const std::vector<int>& fg_inds,
const std::vector<int>& bg_inds,
const std::vector<int>& gt_inds,
Tensor* sampled_boxes,
Tensor* sampled_labels,
Tensor* sampled_gts) {
int fg_num = fg_inds.size();
int bg_num = bg_inds.size();
Tensor fg_inds_t, bg_inds_t, gt_box_inds_t, gt_label_inds_t;
int* fg_inds_data = fg_inds_t.mutable_data<int>({fg_num}, context.GetPlace());
int* bg_inds_data = bg_inds_t.mutable_data<int>({bg_num}, context.GetPlace());
int* gt_box_inds_data =
gt_box_inds_t.mutable_data<int>({fg_num}, context.GetPlace());
int* gt_label_inds_data =
gt_label_inds_t.mutable_data<int>({fg_num}, context.GetPlace());
std::copy(fg_inds.begin(), fg_inds.end(), fg_inds_data);
std::copy(bg_inds.begin(), bg_inds.end(), bg_inds_data);
std::copy(gt_inds.begin(), gt_inds.end(), gt_box_inds_data);
std::copy(gt_inds.begin(), gt_inds.end(), gt_label_inds_data);
Tensor fg_boxes, bg_boxes, fg_labels, bg_labels;
fg_boxes.mutable_data<T>({fg_num, kBoxDim}, context.GetPlace());
CPUGather<T>(context, boxes, fg_inds_t, &fg_boxes);
bg_boxes.mutable_data<T>({bg_num, kBoxDim}, context.GetPlace());
CPUGather<T>(context, boxes, bg_inds_t, &bg_boxes);
Concat<T>(context, fg_boxes, bg_boxes, sampled_boxes);
CPUGather<T>(context, gt_boxes, gt_box_inds_t, sampled_gts);
fg_labels.mutable_data<int>({fg_num}, context.GetPlace());
CPUGather<int>(context, gt_classes, gt_label_inds_t, &fg_labels);
bg_labels.mutable_data<int>({bg_num}, context.GetPlace());
math::set_constant(context, &bg_labels, 0);
Concat<int>(context, fg_labels, bg_labels, sampled_labels);
}
template <typename T>
std::vector<Tensor> SampleRoisForOneImage(
const platform::CPUDeviceContext& context,
const Tensor& rpn_rois_in,
const Tensor& gt_classes,
const Tensor& is_crowd,
const Tensor& gt_boxes,
const Tensor& im_info,
const int batch_size_per_im,
const float fg_fraction,
const float fg_thresh,
const float bg_thresh_hi,
const float bg_thresh_lo,
const std::vector<float>& bbox_reg_weights,
const int class_nums,
std::minstd_rand engine,
bool use_random,
bool is_cls_agnostic) {
// 1.1 map to original image
auto im_scale = im_info.data<T>()[2];
Tensor rpn_rois_slice;
Tensor rpn_rois;
rpn_rois.mutable_data<T>(rpn_rois_in.dims(), context.GetPlace());
const T* rpn_rois_in_dt = rpn_rois_in.data<T>();
T* rpn_rois_dt = rpn_rois.data<T>();
for (int i = 0; i < rpn_rois.numel(); ++i) {
rpn_rois_dt[i] = rpn_rois_in_dt[i];
}
// 1.2 compute overlaps
int proposals_num = gt_boxes.dims()[0] + rpn_rois.dims()[0];
Tensor boxes;
boxes.mutable_data<T>({proposals_num, kBoxDim}, context.GetPlace());
Concat<T>(context, gt_boxes, rpn_rois, &boxes);
Tensor proposal_to_gt_overlaps;
proposal_to_gt_overlaps.mutable_data<T>({proposals_num, gt_boxes.dims()[0]},
context.GetPlace());
BboxOverlaps2<T>(boxes, gt_boxes, &proposal_to_gt_overlaps);
std::vector<std::vector<int>> fg_bg_gt =
SampleFgBgGt<T>(context,
&proposal_to_gt_overlaps,
is_crowd,
batch_size_per_im,
fg_fraction,
fg_thresh,
bg_thresh_hi,
bg_thresh_lo,
engine,
use_random,
boxes);
std::vector<int> fg_inds = fg_bg_gt[0];
std::vector<int> bg_inds = fg_bg_gt[1];
std::vector<int> mapped_gt_inds = fg_bg_gt[2]; // mapped_gt_labels
Tensor sampled_boxes, sampled_labels, sampled_gts;
int fg_num = fg_inds.size();
int bg_num = bg_inds.size();
int boxes_num = fg_num + bg_num;
framework::DDim bbox_dim({boxes_num, kBoxDim});
sampled_boxes.mutable_data<T>(bbox_dim, context.GetPlace());
sampled_labels.mutable_data<int>({boxes_num}, context.GetPlace());
sampled_gts.mutable_data<T>({fg_num, kBoxDim}, context.GetPlace());
GatherBoxesLabels<T>(context,
boxes,
gt_boxes,
gt_classes,
fg_inds,
bg_inds,
mapped_gt_inds,
&sampled_boxes,
&sampled_labels,
&sampled_gts);
// Compute targets
Tensor bbox_targets_single;
bbox_targets_single.mutable_data<T>(bbox_dim, context.GetPlace());
BoxToDelta2<T>(fg_num,
sampled_boxes,
sampled_gts,
bbox_reg_weights.data(),
&bbox_targets_single);
// Scale rois
Tensor sampled_rois;
sampled_rois.mutable_data<T>(sampled_boxes.dims(), context.GetPlace());
auto sampled_rois_et = framework::EigenTensor<T, 2>::From(sampled_rois);
auto sampled_boxes_et = framework::EigenTensor<T, 2>::From(sampled_boxes);
sampled_rois_et = sampled_boxes_et;
// Expand box targets
Tensor bbox_targets, bbox_inside_weights, bbox_outside_weights;
framework::DDim bbox_expand_dim({boxes_num, kBoxDim * class_nums});
bbox_targets.mutable_data<T>(bbox_expand_dim, context.GetPlace());
bbox_inside_weights.mutable_data<T>(bbox_expand_dim, context.GetPlace());
bbox_outside_weights.mutable_data<T>(bbox_expand_dim, context.GetPlace());
math::set_constant(context, &bbox_targets, 0.0);
math::set_constant(context, &bbox_inside_weights, 0.0);
math::set_constant(context, &bbox_outside_weights, 0.0);
auto* bbox_targets_single_data = bbox_targets_single.data<T>();
auto* sampled_labels_data = sampled_labels.data<int>();
auto* bbox_targets_data = bbox_targets.data<T>();
auto* bbox_inside_weights_data = bbox_inside_weights.data<T>();
auto* bbox_outside_weights_data = bbox_outside_weights.data<T>();
int width = kBoxDim * class_nums;
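  // Per-class target layout: each row holds class_nums blocks of kBoxDim (5)
  // values; only the block belonging to the sampled label is filled with the
  // regression target, and the matching inside/outside weights are set to 1.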
for (int64_t i = 0; i < boxes_num; ++i) {
int label = sampled_labels_data[i];
if (label > 0) {
if (is_cls_agnostic) {
label = 1;
}
int dst_idx = i * width + kBoxDim * label;
int src_idx = kBoxDim * i;
bbox_targets_data[dst_idx] = bbox_targets_single_data[src_idx];
bbox_targets_data[dst_idx + 1] = bbox_targets_single_data[src_idx + 1];
bbox_targets_data[dst_idx + 2] = bbox_targets_single_data[src_idx + 2];
bbox_targets_data[dst_idx + 3] = bbox_targets_single_data[src_idx + 3];
bbox_targets_data[dst_idx + 4] = bbox_targets_single_data[src_idx + 4];
bbox_inside_weights_data[dst_idx] = 1;
bbox_inside_weights_data[dst_idx + 1] = 1;
bbox_inside_weights_data[dst_idx + 2] = 1;
bbox_inside_weights_data[dst_idx + 3] = 1;
bbox_inside_weights_data[dst_idx + 4] = 1;
bbox_outside_weights_data[dst_idx] = 1;
bbox_outside_weights_data[dst_idx + 1] = 1;
bbox_outside_weights_data[dst_idx + 2] = 1;
bbox_outside_weights_data[dst_idx + 3] = 1;
bbox_outside_weights_data[dst_idx + 4] = 1;
}
}
std::vector<Tensor> res;
res.emplace_back(sampled_rois);
res.emplace_back(sampled_labels);
res.emplace_back(bbox_targets);
res.emplace_back(bbox_inside_weights);
res.emplace_back(bbox_outside_weights);
return res;
}
template <typename T>
class RRPNGenerateProposalLabelsKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
auto* rpn_rois = context.Input<LoDTensor>("RpnRois");
auto* gt_classes = context.Input<LoDTensor>("GtClasses");
auto* is_crowd = context.Input<LoDTensor>("IsCrowd");
auto* gt_boxes = context.Input<LoDTensor>("GtBoxes");
auto* im_info = context.Input<LoDTensor>("ImInfo");
auto* rois = context.Output<LoDTensor>("Rois");
auto* labels_int32 = context.Output<LoDTensor>("LabelsInt32");
auto* bbox_targets = context.Output<LoDTensor>("BboxTargets");
auto* bbox_inside_weights = context.Output<LoDTensor>("BboxInsideWeights");
auto* bbox_outside_weights =
context.Output<LoDTensor>("BboxOutsideWeights");
int batch_size_per_im = context.Attr<int>("batch_size_per_im");
float fg_fraction = context.Attr<float>("fg_fraction");
float fg_thresh = context.Attr<float>("fg_thresh");
float bg_thresh_hi = context.Attr<float>("bg_thresh_hi");
float bg_thresh_lo = context.Attr<float>("bg_thresh_lo");
std::vector<float> bbox_reg_weights =
context.Attr<std::vector<float>>("bbox_reg_weights");
int class_nums = context.Attr<int>("class_nums");
bool use_random = context.Attr<bool>("use_random");
bool is_cls_agnostic = context.Attr<bool>("is_cls_agnostic");
PADDLE_ENFORCE_EQ(
rpn_rois->lod().size(),
1UL,
"RRPNGenerateProposalLabelsOp rpn_rois needs 1 level of LoD");
PADDLE_ENFORCE_EQ(
gt_classes->lod().size(),
1UL,
"RRPNGenerateProposalLabelsOp gt_classes needs 1 level of LoD");
PADDLE_ENFORCE_EQ(
is_crowd->lod().size(),
1UL,
"RRPNGenerateProposalLabelsOp is_crowd needs 1 level of LoD");
PADDLE_ENFORCE_EQ(
gt_boxes->lod().size(),
1UL,
"RRPNGenerateProposalLabelsOp gt_boxes needs 1 level of LoD");
int64_t n = static_cast<int64_t>(rpn_rois->lod().back().size() - 1);
rois->mutable_data<T>({n * batch_size_per_im, kBoxDim}, context.GetPlace());
labels_int32->mutable_data<int>({n * batch_size_per_im, 1},
context.GetPlace());
bbox_targets->mutable_data<T>({n * batch_size_per_im, kBoxDim * class_nums},
context.GetPlace());
bbox_inside_weights->mutable_data<T>(
{n * batch_size_per_im, kBoxDim * class_nums}, context.GetPlace());
bbox_outside_weights->mutable_data<T>(
{n * batch_size_per_im, kBoxDim * class_nums}, context.GetPlace());
std::random_device rnd;
std::minstd_rand engine;
int seed = rnd();
engine.seed(seed);
framework::LoD lod;
std::vector<size_t> lod0(1, 0);
int64_t num_rois = 0;
auto& dev_ctx = context.device_context<platform::CPUDeviceContext>();
auto rpn_rois_lod = rpn_rois->lod().back();
auto gt_classes_lod = gt_classes->lod().back();
auto is_crowd_lod = is_crowd->lod().back();
auto gt_boxes_lod = gt_boxes->lod().back();
for (int i = 0; i < n; ++i) {
if (rpn_rois_lod[i] == rpn_rois_lod[i + 1]) {
lod0.emplace_back(num_rois);
continue;
}
Tensor rpn_rois_slice =
rpn_rois->Slice(rpn_rois_lod[i], rpn_rois_lod[i + 1]);
Tensor gt_classes_slice =
gt_classes->Slice(gt_classes_lod[i], gt_classes_lod[i + 1]);
Tensor is_crowd_slice =
is_crowd->Slice(is_crowd_lod[i], is_crowd_lod[i + 1]);
Tensor gt_boxes_slice =
gt_boxes->Slice(gt_boxes_lod[i], gt_boxes_lod[i + 1]);
Tensor im_info_slice = im_info->Slice(i, i + 1);
std::vector<Tensor> tensor_output =
SampleRoisForOneImage<T>(dev_ctx,
rpn_rois_slice,
gt_classes_slice,
is_crowd_slice,
gt_boxes_slice,
im_info_slice,
batch_size_per_im,
fg_fraction,
fg_thresh,
bg_thresh_hi,
bg_thresh_lo,
bbox_reg_weights,
class_nums,
engine,
use_random,
is_cls_agnostic);
Tensor sampled_rois = tensor_output[0];
Tensor sampled_labels_int32 = tensor_output[1];
Tensor sampled_bbox_targets = tensor_output[2];
Tensor sampled_bbox_inside_weights = tensor_output[3];
Tensor sampled_bbox_outside_weights = tensor_output[4];
AppendRois<T>(rois, kBoxDim * num_rois, &sampled_rois);
AppendRois<int>(labels_int32, num_rois, &sampled_labels_int32);
AppendRois<T>(
bbox_targets, kBoxDim * num_rois * class_nums, &sampled_bbox_targets);
AppendRois<T>(bbox_inside_weights,
kBoxDim * num_rois * class_nums,
&sampled_bbox_inside_weights);
AppendRois<T>(bbox_outside_weights,
kBoxDim * num_rois * class_nums,
&sampled_bbox_outside_weights);
num_rois += sampled_rois.dims()[0];
lod0.emplace_back(num_rois);
}
lod.emplace_back(lod0);
rois->set_lod(lod);
labels_int32->set_lod(lod);
bbox_targets->set_lod(lod);
bbox_inside_weights->set_lod(lod);
bbox_outside_weights->set_lod(lod);
rois->Resize({num_rois, kBoxDim});
labels_int32->Resize({num_rois, 1});
bbox_targets->Resize({num_rois, kBoxDim * class_nums});
bbox_inside_weights->Resize({num_rois, kBoxDim * class_nums});
bbox_outside_weights->Resize({num_rois, kBoxDim * class_nums});
}
};
class RRPNGenerateProposalLabelsOpMaker
: public framework::OpProtoAndCheckerMaker {
public:
void Make() override {
AddInput("RpnRois",
"(LoDTensor), This input is a 2D LoDTensor with shape [N, 5]. "
"N is the number of the GenerateProposalOp's output, "
"each element is a bounding box with [x, y, w, h, angle] format.");
AddInput("GtClasses",
"(LoDTensor), This input is a 2D LoDTensor with shape [M, 1]. "
"M is the number of groundtruth, "
"each element is a class label of groundtruth.");
AddInput(
"IsCrowd",
"(LoDTensor), This input is a 2D LoDTensor with shape [M, 1]. "
"M is the number of groundtruth, "
"each element is a flag indicates whether a groundtruth is crowd.");
AddInput("GtBoxes",
"(LoDTensor), This input is a 2D LoDTensor with shape [M, 5. "
"M is the number of groundtruth, "
"each element is a bounding box with [x, y, w, h, angle] format.");
AddInput("ImInfo",
"(Tensor), This input is a 2D Tensor with shape [B, 3]. "
"B is the number of input images, "
"each element consists of im_height, im_width, im_scale.");
AddOutput(
"Rois",
"(LoDTensor), This output is a 2D LoDTensor with shape [P, 5]. "
"P usuall equal to batch_size_per_im * batch_size, "
"each element is a bounding box with [x, y, w, h ,angle] format.");
AddOutput("LabelsInt32",
"(LoDTensor), This output is a 2D LoDTensor with shape [P, 1], "
"each element repersents a class label of a roi");
AddOutput("BboxTargets",
"(LoDTensor), This output is a 2D LoDTensor with shape [P, 5 * "
"class_nums], "
"each element repersents a box label of a roi");
AddOutput(
"BboxInsideWeights",
"(LoDTensor), This output is a 2D LoDTensor with shape [P, 5 * "
"class_nums], "
"each element indicates whether a box should contribute to loss.");
AddOutput(
"BboxOutsideWeights",
"(LoDTensor), This output is a 2D LoDTensor with shape [P, 5 * "
"class_nums], "
"each element indicates whether a box should contribute to loss.");
AddAttr<int>("batch_size_per_im", "Batch size of rois per images.");
AddAttr<float>("fg_fraction",
"Foreground fraction in total batch_size_per_im.");
    AddAttr<float>(
        "fg_thresh",
        "Overlap threshold which is used to choose foreground samples.");
    AddAttr<float>("bg_thresh_hi",
                   "Overlap threshold upper bound which is used to choose "
                   "background samples.");
    AddAttr<float>("bg_thresh_lo",
                   "Overlap threshold lower bound which is used to choose "
                   "background samples.");
AddAttr<std::vector<float>>("bbox_reg_weights", "Box regression weights.");
AddAttr<int>("class_nums", "Class number.");
AddAttr<bool>(
"use_random",
"Use random sampling to choose foreground and background boxes.")
.SetDefault(true);
    AddAttr<bool>(
        "is_cls_agnostic",
        "If true, box regression is class-agnostic and only distinguishes "
        "foreground from background locations.")
        .SetDefault(false);
AddComment(R"DOC(
Given the rotated bounding boxes produced by RotatedGenerateProposalOp and the groundtruth,
this operator samples foreground and background boxes and computes their loss targets.
RpnRois are the output boxes of the RPN, processed by rotated_generate_proposal_op; these boxes
are combined with the groundtruth boxes and sampled according to batch_size_per_im and fg_fraction.
An instance whose overlap with the groundtruth is greater than fg_thresh is treated as a foreground sample,
while an instance whose overlap lies in [bg_thresh_lo, bg_thresh_hi) is treated as a background sample.
After all foreground and background boxes are chosen (the so-called RoIs),
random sampling ensures that the number of foreground boxes is no more than batch_size_per_im * fg_fraction.
For each box in the RoIs, the classification target (class label) and regression target (box label) are assigned.
Finally, BboxInsideWeights and BboxOutsideWeights specify whether a box contributes to the training loss.
)DOC");
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OPERATOR(
rrpn_generate_proposal_labels,
ops::RRPNGenerateProposalLabelsOp,
ops::RRPNGenerateProposalLabelsOpMaker,
paddle::framework::EmptyGradOpMaker<paddle::framework::OpDesc>,
paddle::framework::EmptyGradOpMaker<paddle::imperative::OpBase>);
REGISTER_OP_CPU_KERNEL(rrpn_generate_proposal_labels,
ops::RRPNGenerateProposalLabelsKernel<float>,
ops::RRPNGenerateProposalLabelsKernel<double>);
/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <cmath>
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include "gather.h"
#include "math_function.h"
#include "paddle/fluid/framework/op_registry.h"
#include "safe_ref.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
using LoDTensor = framework::LoDTensor;
static const double kBBoxClipDefault = std::log(1000.0 / 16.0);
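// kBBoxClipDefault (log(1000 / 16)) caps the width/height deltas fed to exp()
// so that extreme regression outputs cannot overflow the decoded box size.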
#define PI 3.141592654
static void RRPNAppendProposals(Tensor *dst,
int64_t offset,
const Tensor &src) {
auto *out_data = dst->data<void>();
auto *to_add_data = src.data<void>();
size_t size_of_t = framework::SizeOfType(src.type());
offset *= size_of_t;
std::memcpy(
reinterpret_cast<void *>(reinterpret_cast<uintptr_t>(out_data) + offset),
to_add_data,
src.numel() * size_of_t);
}
template <class T>
inline T axr(T x, T r) {
return 0.5 * PI * r * r - x * sqrt(r * r - x * x) - r * r * std::asin(x / r);
}
class RRPNGenerateProposalsOp : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
void InferShape(framework::InferShapeContext *ctx) const override {
PADDLE_ENFORCE(ctx->HasInput("Scores"), "Input(Scores) shouldn't be null.");
PADDLE_ENFORCE(ctx->HasInput("BboxDeltas"),
"Input(BboxDeltas) shouldn't be null.");
PADDLE_ENFORCE(ctx->HasInput("ImInfo"), "Input(ImInfo) shouldn't be null.");
PADDLE_ENFORCE(ctx->HasInput("Anchors"),
"Input(Anchors) shouldn't be null.");
PADDLE_ENFORCE(ctx->HasInput("Variances"),
"Input(Variances) shouldn't be null.");
ctx->SetOutputDim("RpnRois", {-1, 5});
ctx->SetOutputDim("RpnRoiProbs", {-1, 1});
}
protected:
framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext &ctx) const override {
return framework::OpKernelType(ctx.Input<Tensor>("Anchors")->type(),
ctx.device_context());
}
};
template <class T>
static inline void RBoxCoder(const platform::DeviceContext &ctx,
Tensor *all_anchors,
Tensor *bbox_deltas,
Tensor *variances,
Tensor *proposals) {
T *proposals_data = proposals->mutable_data<T>(ctx.GetPlace());
int64_t row = all_anchors->dims()[0];
int64_t len = all_anchors->dims()[1];
auto *bbox_deltas_data = bbox_deltas->data<T>();
auto *anchor_data = all_anchors->data<T>();
const T *variances_data = nullptr;
if (variances) {
variances_data = variances->data<T>();
}
for (int64_t i = 0; i < row; ++i) {
T anchor_width = anchor_data[i * len + 2];
T anchor_height = anchor_data[i * len + 3];
T anchor_angle = anchor_data[i * len + 4];
T anchor_center_x = anchor_data[i * len];
T anchor_center_y = anchor_data[i * len + 1];
T bbox_center_x = 0, bbox_center_y = 0;
T bbox_width = 0, bbox_height = 0, bbox_angle = 0;
if (variances) {
bbox_center_x =
bbox_deltas_data[i * len] / variances_data[i * len] * anchor_width +
anchor_center_x;
bbox_center_y = bbox_deltas_data[i * len + 1] /
variances_data[i * len + 1] * anchor_height +
anchor_center_y;
bbox_width = std::exp(std::min<T>(bbox_deltas_data[i * len + 2] /
variances_data[i * len + 2],
kBBoxClipDefault)) *
anchor_width;
bbox_height = std::exp(std::min<T>(bbox_deltas_data[i * len + 3] /
variances_data[i * len + 3],
kBBoxClipDefault)) *
anchor_height;
bbox_angle =
(bbox_deltas_data[i * len + 4] / variances_data[i * len + 4]) * 1.0 /
PI * 180 +
anchor_angle;
} else {
bbox_center_x =
bbox_deltas_data[i * len] * anchor_width + anchor_center_x;
bbox_center_y =
bbox_deltas_data[i * len + 1] * anchor_height + anchor_center_y;
bbox_width = std::exp(std::min<T>(bbox_deltas_data[i * len + 2],
kBBoxClipDefault)) *
anchor_width;
bbox_height = std::exp(std::min<T>(bbox_deltas_data[i * len + 3],
kBBoxClipDefault)) *
anchor_height;
bbox_angle =
bbox_deltas_data[i * len + 4] * 1.0 / PI * 180 + anchor_angle;
}
proposals_data[i * len] = bbox_center_x;
proposals_data[i * len + 1] = bbox_center_y;
proposals_data[i * len + 2] = bbox_width;
proposals_data[i * len + 3] = bbox_height;
proposals_data[i * len + 4] = bbox_angle;
}
// return proposals;
}
template <class T>
static inline void RFilterBoxes(const platform::DeviceContext &ctx,
Tensor *boxes,
float min_size,
const Tensor &im_info,
Tensor *keep) {
T *boxes_data = boxes->mutable_data<T>(ctx.GetPlace());
keep->Resize({boxes->dims()[0]});
min_size = std::max(min_size, 0.0f);
int *keep_data = keep->mutable_data<int>(ctx.GetPlace());
int keep_len = 0;
for (int i = 0; i < boxes->dims()[0]; ++i) {
T ws = boxes_data[5 * i + 2];
T hs = boxes_data[5 * i + 3];
if (ws >= min_size && hs >= min_size) {
keep_data[keep_len++] = i;
}
}
keep->Resize({keep_len});
}
template <class T>
static inline std::vector<std::pair<T, int>> GetSortedScoreIndex(
const std::vector<T> &scores) {
std::vector<std::pair<T, int>> sorted_indices;
sorted_indices.reserve(scores.size());
for (size_t i = 0; i < scores.size(); ++i) {
sorted_indices.emplace_back(scores[i], i);
}
  // Sort the score pairs by score in ascending order; the highest score ends
  // up at the back, which is where RNMS pops candidates from.
std::stable_sort(sorted_indices.begin(),
sorted_indices.end(),
[](const std::pair<T, int> &a, const std::pair<T, int> &b) {
return a.first < b.first;
});
return sorted_indices;
}
template <typename T>
static inline Tensor VectorToTensor(const std::vector<T> &selected_indices,
int selected_num) {
Tensor keep_nms;
keep_nms.Resize({selected_num});
auto *keep_data = keep_nms.mutable_data<T>(platform::CPUPlace());
for (int i = 0; i < selected_num; ++i) {
keep_data[i] = selected_indices[i];
}
return keep_nms;
}
template <typename T>
inline T trangle_area(T *a, T *b, T *c) {
return ((a[0] - c[0]) * (b[1] - c[1]) - (a[1] - c[1]) * (b[0] - c[0])) / 2.0;
}
template <typename T>
inline T area(T *int_pts, int num_of_inter) {
float area = 0.0;
for (int i = 0; i < num_of_inter - 2; i++) {
area +=
fabs(trangle_area(int_pts, int_pts + 2 * i + 2, int_pts + 2 * i + 4));
}
return area;
}
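// reorder_pts orders the intersection points angularly around their centroid
// using a pseudo-angle key (simple insertion sort), so that the polygon area
// can then be computed with a triangle fan in area().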
template <typename T>
inline void reorder_pts(T *int_pts, int num_of_inter) {
if (num_of_inter > 0) {
float center[2];
center[0] = 0.0;
center[1] = 0.0;
for (int i = 0; i < num_of_inter; i++) {
center[0] += int_pts[2 * i];
center[1] += int_pts[2 * i + 1];
}
center[0] /= num_of_inter;
center[1] /= num_of_inter;
float vs[16];
float v[2];
float d;
for (int i = 0; i < num_of_inter; i++) {
v[0] = int_pts[2 * i] - center[0];
v[1] = int_pts[2 * i + 1] - center[1];
d = sqrt(v[0] * v[0] + v[1] * v[1]);
v[0] = v[0] / d;
v[1] = v[1] / d;
if (v[1] < 0) {
v[0] = -2 - v[0];
}
vs[i] = v[0];
}
float temp, tx, ty;
int j;
for (int i = 1; i < num_of_inter; ++i) {
if (vs[i - 1] > vs[i]) {
temp = vs[i];
tx = int_pts[2 * i];
ty = int_pts[2 * i + 1];
j = i;
while (j > 0 && vs[j - 1] > temp) {
vs[j] = vs[j - 1];
int_pts[j * 2] = int_pts[j * 2 - 2];
int_pts[j * 2 + 1] = int_pts[j * 2 - 1];
j--;
}
vs[j] = temp;
int_pts[j * 2] = tx;
int_pts[j * 2 + 1] = ty;
}
}
}
}
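// inter2line tests whether edge i of pts1 intersects edge j of pts2 using
// signed triangle areas; if they do, the intersection point is written to
// temp_pts.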
template <typename T>
inline bool inter2line(T *pts1, T *pts2, int i, int j, T *temp_pts) {
T a[2];
T b[2];
T c[2];
T d[2];
T area_abc, area_abd, area_cda, area_cdb;
a[0] = pts1[2 * i];
a[1] = pts1[2 * i + 1];
b[0] = pts1[2 * ((i + 1) % 4)];
b[1] = pts1[2 * ((i + 1) % 4) + 1];
c[0] = pts2[2 * j];
c[1] = pts2[2 * j + 1];
d[0] = pts2[2 * ((j + 1) % 4)];
d[1] = pts2[2 * ((j + 1) % 4) + 1];
area_abc = trangle_area(a, b, c);
area_abd = trangle_area(a, b, d);
if (area_abc * area_abd >= 0) {
return false;
}
area_cda = trangle_area(c, d, a);
area_cdb = area_cda + area_abc - area_abd;
if (area_cda * area_cdb >= 0) {
return false;
}
float t = area_cda / (area_abd - area_abc);
float dx = t * (b[0] - a[0]);
float dy = t * (b[1] - a[1]);
temp_pts[0] = a[0] + dx;
temp_pts[1] = a[1] + dy;
return true;
}
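// in_rect checks whether point (pt_x, pt_y) lies inside the quadrilateral pts
// by projecting the point onto the two edge vectors AB and AD and requiring
// both projections to fall within the corresponding edge lengths.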
template <typename T>
inline bool in_rect(T pt_x, T pt_y, T *pts) {
float ab[2];
float ad[2];
float ap[2];
float abab;
float abap;
float adad;
float adap;
ab[0] = pts[2] - pts[0];
ab[1] = pts[3] - pts[1];
ad[0] = pts[6] - pts[0];
ad[1] = pts[7] - pts[1];
ap[0] = pt_x - pts[0];
ap[1] = pt_y - pts[1];
abab = ab[0] * ab[0] + ab[1] * ab[1];
abap = ab[0] * ap[0] + ab[1] * ap[1];
adad = ad[0] * ad[0] + ad[1] * ad[1];
adap = ad[0] * ap[0] + ad[1] * ap[1];
return abab >= abap and abap >= 0 and adad >= adap and adap >= 0;
}
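// inter_pts collects the vertices of the intersection polygon of two rotated
// rectangles: corners of either rectangle that lie inside the other, plus all
// pairwise edge-edge intersection points (at most 8 points / 16 coordinates).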
template <typename T>
inline int inter_pts(T *pts1, T *pts2, T *int_pts) {
int num_of_inter = 0;
for (int i = 0; i < 4; i++) {
if (in_rect(pts1[2 * i], pts1[2 * i + 1], pts2)) {
int_pts[num_of_inter * 2] = pts1[2 * i];
int_pts[num_of_inter * 2 + 1] = pts1[2 * i + 1];
num_of_inter++;
}
if (in_rect(pts2[2 * i], pts2[2 * i + 1], pts1)) {
int_pts[num_of_inter * 2] = pts2[2 * i];
int_pts[num_of_inter * 2 + 1] = pts2[2 * i + 1];
num_of_inter++;
}
}
T temp_pts[2];
for (int i = 0; i < 4; i++) {
for (int j = 0; j < 4; j++) {
bool has_pts = inter2line(pts1, pts2, i, j, temp_pts);
if (has_pts) {
int_pts[num_of_inter * 2] = temp_pts[0];
int_pts[num_of_inter * 2 + 1] = temp_pts[1];
num_of_inter++;
}
}
}
return num_of_inter;
}
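// convert_region expands a rotated box [x_ctr, y_ctr, w, h, angle(deg)] into
// its four corner points by rotating the axis-aligned corners about the box
// center.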
template <typename T>
inline void convert_region(T *pts, const T *region) {
float angle = region[4];
float a_cos = cos(angle / 180.0 * PI);
float a_sin = -sin(angle / 180.0 * PI); // anti clock-wise
float ctr_x = region[0];
float ctr_y = region[1];
float h = region[3];
float w = region[2];
float pts_x[4];
float pts_y[4];
pts_x[0] = -w / 2;
pts_x[1] = -w / 2;
pts_x[2] = w / 2;
pts_x[3] = w / 2;
pts_y[0] = -h / 2;
pts_y[1] = h / 2;
pts_y[2] = h / 2;
pts_y[3] = -h / 2;
for (int i = 0; i < 4; i++) {
pts[2 * i] = a_cos * pts_x[i] - a_sin * pts_y[i] + ctr_x;
pts[2 * i + 1] = a_sin * pts_x[i] + a_cos * pts_y[i] + ctr_y;
}
}
template <typename T>
inline float inter(const T *region1, const T *region2) {
T pts1[8];
T pts2[8];
T int_pts[16];
int num_of_inter;
convert_region<T>(pts1, region1);
convert_region<T>(pts2, region2);
num_of_inter = inter_pts<T>(pts1, pts2, int_pts);
reorder_pts<T>(int_pts, num_of_inter);
return area<T>(int_pts, num_of_inter);
}
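// DevRotateIoU computes the IoU of two rotated boxes: the intersection area
// comes from the clipped polygon built above, and the union is
// area1 + area2 - intersection.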
template <typename T>
inline float DevRotateIoU(const T *region1, const T *region2) {
T area1 = region1[2] * region1[3];
T area2 = region2[2] * region2[3];
T area_inter = inter<T>(region1, region2);
return area_inter / (area1 + area2 - area_inter);
}
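// RNMS performs greedy non-maximum suppression over rotated boxes: candidates
// are visited from highest to lowest score, and a candidate is kept only if
// its rotated IoU with every previously kept box does not exceed
// nms_threshold.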
template <class T>
static inline Tensor RNMS(const platform::DeviceContext &ctx,
Tensor *bbox,
Tensor *scores,
T nms_threshold) {
PADDLE_ENFORCE_NOT_NULL(bbox);
int64_t num_boxes = bbox->dims()[0];
  // 5: [x_ctr, y_ctr, w, h, angle]
int64_t box_size = bbox->dims()[1];
std::vector<T> scores_data(num_boxes);
std::copy_n(scores->data<T>(), num_boxes, scores_data.begin());
std::vector<std::pair<T, int>> sorted_indices =
GetSortedScoreIndex<T>(scores_data);
std::vector<int> selected_indices;
int selected_num = 0;
T adaptive_threshold = nms_threshold;
const T *bbox_data = bbox->data<T>();
while (sorted_indices.size() != 0) {
int idx = sorted_indices.back().second;
bool flag = true;
for (int kept_idx : selected_indices) {
if (flag) {
T overlap = DevRotateIoU<T>(bbox_data + idx * box_size,
bbox_data + kept_idx * box_size);
flag = (overlap <= adaptive_threshold);
} else {
break;
}
}
if (flag) {
selected_indices.push_back(idx);
++selected_num;
}
sorted_indices.erase(sorted_indices.end() - 1);
}
return VectorToTensor(selected_indices, selected_num);
}
template <typename T>
class RRPNGenerateProposalsKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext &context) const override {
auto *scores = context.Input<Tensor>("Scores");
auto *bbox_deltas = context.Input<Tensor>("BboxDeltas");
auto *im_info = context.Input<Tensor>("ImInfo");
auto anchors = detail::Ref(context.Input<Tensor>("Anchors"),
"Cannot find input Anchors(%s) in scope",
context.InputNames("Anchors")[0]);
auto variances = detail::Ref(context.Input<Tensor>("Variances"),
"Cannot find input Variances(%s) in scope",
context.InputNames("Variances")[0]);
auto *rpn_rois = context.Output<LoDTensor>("RpnRois");
auto *rpn_roi_probs = context.Output<LoDTensor>("RpnRoiProbs");
int pre_nms_top_n = context.Attr<int>("pre_nms_topN");
int post_nms_top_n = context.Attr<int>("post_nms_topN");
float nms_thresh = context.Attr<float>("nms_thresh");
float min_size = context.Attr<float>("min_size");
auto &dev_ctx =
context.template device_context<platform::CPUDeviceContext>();
auto &scores_dim = scores->dims();
int64_t num = scores_dim[0];
int64_t c_score = scores_dim[1];
int64_t h_score = scores_dim[2];
int64_t w_score = scores_dim[3];
auto &bbox_dim = bbox_deltas->dims();
int64_t c_bbox = bbox_dim[1];
int64_t h_bbox = bbox_dim[2];
int64_t w_bbox = bbox_dim[3];
rpn_rois->mutable_data<T>({bbox_deltas->numel() / 5, 5},
context.GetPlace());
rpn_roi_probs->mutable_data<T>({scores->numel(), 1}, context.GetPlace());
Tensor bbox_deltas_swap, scores_swap;
bbox_deltas_swap.mutable_data<T>({num, h_bbox, w_bbox, c_bbox},
dev_ctx.GetPlace());
scores_swap.mutable_data<T>({num, h_score, w_score, c_score},
dev_ctx.GetPlace());
math::Transpose<platform::CPUDeviceContext, T, 4> trans;
std::vector<int> axis = {0, 2, 3, 1};
trans(dev_ctx, *bbox_deltas, &bbox_deltas_swap, axis);
trans(dev_ctx, *scores, &scores_swap, axis);
framework::LoD lod;
lod.resize(1);
auto &lod0 = lod[0];
lod0.push_back(0);
anchors.Resize({anchors.numel() / 5, 5});
variances.Resize({variances.numel() / 5, 5});
int64_t num_proposals = 0;
for (int64_t i = 0; i < num; ++i) {
Tensor im_info_slice = im_info->Slice(i, i + 1);
Tensor bbox_deltas_slice = bbox_deltas_swap.Slice(i, i + 1);
Tensor scores_slice = scores_swap.Slice(i, i + 1);
bbox_deltas_slice.Resize({h_bbox * w_bbox * c_bbox / 5, 5});
scores_slice.Resize({h_score * w_score * c_score, 1});
std::pair<Tensor, Tensor> tensor_pair =
ProposalForOneImage(dev_ctx,
im_info_slice,
anchors,
variances,
bbox_deltas_slice,
scores_slice,
pre_nms_top_n,
post_nms_top_n,
nms_thresh,
min_size);
Tensor &proposals = tensor_pair.first;
Tensor &scores = tensor_pair.second;
RRPNAppendProposals(rpn_rois, 5 * num_proposals, proposals);
RRPNAppendProposals(rpn_roi_probs, num_proposals, scores);
num_proposals += proposals.dims()[0];
lod0.push_back(num_proposals);
}
rpn_rois->set_lod(lod);
rpn_roi_probs->set_lod(lod);
rpn_rois->Resize({num_proposals, 5});
rpn_roi_probs->Resize({num_proposals, 1});
}
std::pair<Tensor, Tensor> ProposalForOneImage(
const platform::CPUDeviceContext &ctx,
const Tensor &im_info_slice,
const Tensor &anchors,
const Tensor &variances,
const Tensor &bbox_deltas_slice, // [M, 5]
const Tensor &scores_slice, // [N, 1]
int pre_nms_top_n,
int post_nms_top_n,
float nms_thresh,
float min_size) const {
auto *scores_data = scores_slice.data<T>();
// Sort index
Tensor index_t;
index_t.Resize({scores_slice.numel()});
int *index = index_t.mutable_data<int>(ctx.GetPlace());
for (int i = 0; i < scores_slice.numel(); ++i) {
index[i] = i;
}
auto compare = [scores_data](const int64_t &i, const int64_t &j) {
return scores_data[i] > scores_data[j];
};
if (pre_nms_top_n <= 0 || pre_nms_top_n >= scores_slice.numel()) {
std::sort(index, index + scores_slice.numel(), compare);
} else {
std::nth_element(
index, index + pre_nms_top_n, index + scores_slice.numel(), compare);
index_t.Resize({pre_nms_top_n});
}
Tensor scores_sel, bbox_sel, anchor_sel, var_sel;
scores_sel.mutable_data<T>({index_t.numel(), 1}, ctx.GetPlace());
bbox_sel.mutable_data<T>({index_t.numel(), 5}, ctx.GetPlace());
anchor_sel.mutable_data<T>({index_t.numel(), 5}, ctx.GetPlace());
var_sel.mutable_data<T>({index_t.numel(), 5}, ctx.GetPlace());
CPUGather<T>(ctx, scores_slice, index_t, &scores_sel);
CPUGather<T>(ctx, bbox_deltas_slice, index_t, &bbox_sel);
CPUGather<T>(ctx, anchors, index_t, &anchor_sel);
CPUGather<T>(ctx, variances, index_t, &var_sel);
auto *scores_ = scores_sel.data<T>();
Tensor proposals;
proposals.mutable_data<T>({index_t.numel(), 5}, ctx.GetPlace());
RBoxCoder<T>(ctx, &anchor_sel, &bbox_sel, &var_sel, &proposals);
Tensor keep;
RFilterBoxes<T>(ctx, &proposals, min_size, im_info_slice, &keep);
Tensor scores_filter;
bbox_sel.mutable_data<T>({keep.numel(), 5}, ctx.GetPlace());
scores_filter.mutable_data<T>({keep.numel(), 1}, ctx.GetPlace());
CPUGather<T>(ctx, proposals, keep, &bbox_sel);
CPUGather<T>(ctx, scores_sel, keep, &scores_filter);
if (nms_thresh <= 0) {
return std::make_pair(bbox_sel, scores_filter);
}
Tensor keep_nms = RNMS<T>(ctx, &bbox_sel, &scores_filter, nms_thresh);
if (post_nms_top_n > 0 && post_nms_top_n < keep_nms.numel()) {
keep_nms.Resize({post_nms_top_n});
}
proposals.mutable_data<T>({keep_nms.numel(), 5}, ctx.GetPlace());
scores_sel.mutable_data<T>({keep_nms.numel(), 1}, ctx.GetPlace());
CPUGather<T>(ctx, bbox_sel, keep_nms, &proposals);
CPUGather<T>(ctx, scores_filter, keep_nms, &scores_sel);
return std::make_pair(proposals, scores_sel);
}
};
class RRPNGenerateProposalsOpMaker : public framework::OpProtoAndCheckerMaker {
public:
void Make() override {
AddInput("Scores",
"(Tensor) The scores from conv is in shape (N, A, H, W), "
"N is batch size, A is number of anchors, "
"H and W are height and width of the feature map");
AddInput("BboxDeltas",
"(Tensor) Bounding box deltas from conv is in "
"shape (N, 5*A, H, W).");
AddInput("ImInfo",
"(Tensor) Information for image reshape is in shape (N, 3), "
"in format (height, width, scale)");
AddInput("Anchors",
"(Tensor) Bounding box anchors from anchor_generator_op "
"is in shape (A, H, W, 5).");
AddInput("Variances",
"(Tensor) Bounding box variances with same shape as `Anchors`.");
AddOutput("RpnRois",
"(LoDTensor), Output proposals with shape (rois_num, 5).");
AddOutput("RpnRoiProbs",
"(LoDTensor) Scores of proposals with shape (rois_num, 1).");
AddAttr<int>("pre_nms_topN",
"Number of top scoring RPN proposals to keep before "
"applying NMS.");
AddAttr<int>("post_nms_topN",
"Number of top scoring RPN proposals to keep after "
"applying NMS");
AddAttr<float>("nms_thresh", "NMS threshold used on RPN proposals.");
AddAttr<float>("min_size",
"Proposal height and width both need to be greater "
"than this min_size.");
AddComment(R"DOC(
This operator generates rotated bounding box proposals for Faster RCNN.
The proposals are generated for a list of images based on the objectness
score 'Scores', the bounding box regression result 'BboxDeltas' and the
predefined rotated anchor boxes 'Anchors'. Greedy non-maximum suppression
is applied to produce the final bounding boxes.
)DOC");
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OPERATOR(
rrpn_generate_proposals,
ops::RRPNGenerateProposalsOp,
ops::RRPNGenerateProposalsOpMaker,
paddle::framework::EmptyGradOpMaker<paddle::framework::OpDesc>,
paddle::framework::EmptyGradOpMaker<paddle::imperative::OpBase>);
REGISTER_OP_CPU_KERNEL(rrpn_generate_proposals,
ops::RRPNGenerateProposalsKernel<float>,
ops::RRPNGenerateProposalsKernel<double>);
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Based on
--------------------------------------------------------
@misc{ma2019rrpn,
author = {Jianqi Ma},
title = {{RRPN in pytorch}},
year = {2019},
howpublished = {\url{https://github.com/mjq11302010044/RRPN_pytorch}},
}
@article{Jianqi17RRPN,
Author = {Jianqi Ma and Weiyuan Shao and Hao Ye and Li Wang and Hong Wang
and Yingbin Zheng and Xiangyang Xue},
Title = {Arbitrary-Oriented Scene Text Detection via Rotation Proposals},
journal = {IEEE Transactions on Multimedia},
volume={20},
number={11},
pages={3111-3122},
year={2018}
}
--------------------------------------------------------
*/
#include <paddle/fluid/memory/allocation/allocator.h>
#include <stdio.h>
#include <string>
#include <vector>
#include "cub/cub/cub.cuh"
#include "gather.cu.h"
#include "math_function.h"
#include "paddle/fluid/framework/mixed_vector.h"
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/memory/memory.h"
#include "paddle/fluid/platform/for_range.h"
#include "safe_ref.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
using LoDTensor = framework::LoDTensor;
#define PI 3.141592654
namespace {
#define DIVUP(m, n) ((m) / (n) + ((m) % (n) > 0))
#define CUDA_1D_KERNEL_LOOP(i, n) \
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (n); \
i += blockDim.x * gridDim.x)
int const kThreadsPerBlock = sizeof(uint64_t) * 8;
static const double kBBoxClipDefault = std::log(1000.0 / 16.0);
struct RangeInitFunctor {
int start_;
int delta_;
int *out_;
__device__ void operator()(size_t i) { out_[i] = start_ + i * delta_; }
};
template <typename T>
static void RSortDescending(const platform::CUDADeviceContext &ctx,
const Tensor &value,
Tensor *value_out,
Tensor *index_out) {
int num = static_cast<int>(value.numel());
Tensor index_in_t;
int *idx_in = index_in_t.mutable_data<int>({num}, ctx.GetPlace());
platform::ForRange<platform::CUDADeviceContext> for_range(ctx, num);
for_range(RangeInitFunctor{0, 1, idx_in});
int *idx_out = index_out->mutable_data<int>({num}, ctx.GetPlace());
const T *keys_in = value.data<T>();
T *keys_out = value_out->mutable_data<T>({num}, ctx.GetPlace());
// Determine temporary device storage requirements
size_t temp_storage_bytes = 0;
cub::DeviceRadixSort::SortPairsDescending<T, int>(
nullptr, temp_storage_bytes, keys_in, keys_out, idx_in, idx_out, num);
// Allocate temporary storage
auto place = boost::get<platform::CUDAPlace>(ctx.GetPlace());
auto d_temp_storage = memory::Alloc(place, temp_storage_bytes);
// Run sorting operation
cub::DeviceRadixSort::SortPairsDescending<T, int>(d_temp_storage->ptr(),
temp_storage_bytes,
keys_in,
keys_out,
idx_in,
idx_out,
num);
}
template <typename T>
struct RBoxDecodeAndClipFunctor {
const T *anchor;
const T *deltas;
const T *var;
const int *index;
const T *im_info;
T *proposals;
RBoxDecodeAndClipFunctor(const T *anchor,
const T *deltas,
const T *var,
const int *index,
const T *im_info,
T *proposals)
: anchor(anchor),
deltas(deltas),
var(var),
index(index),
im_info(im_info),
proposals(proposals) {}
T bbox_clip_default{static_cast<T>(kBBoxClipDefault)};
__device__ void operator()(size_t i) {
int k = index[i] * 5;
T w = anchor[k + 2];
T h = anchor[k + 3];
T cx = anchor[k];
T cy = anchor[k + 1];
T angle = anchor[k + 4];
T de_cx = deltas[k];
T de_cy = deltas[k + 1];
T de_w = deltas[k + 2];
T de_h = deltas[k + 3];
T de_g = deltas[k + 4];
T d_cx, d_cy, d_w, d_h, d_g;
if (var) {
d_cx = cx + de_cx * w / var[k];
d_cy = cy + de_cy * h / var[k + 1];
d_w = exp(Min(de_w / var[k + 2], bbox_clip_default)) * w;
d_h = exp(Min(de_h / var[k + 3], bbox_clip_default)) * h;
d_g = de_g / var[k + 4] * 1.0 / PI * 180 + angle;
} else {
d_cx = cx + de_cx * w;
d_cy = cy + de_cy * h;
d_w = exp(Min(de_w, bbox_clip_default)) * w;
d_h = exp(Min(de_h, bbox_clip_default)) * h;
d_g = de_g * 1.0 / PI * 180 + angle;
}
proposals[i * 5] = d_cx;
proposals[i * 5 + 1] = d_cy;
proposals[i * 5 + 2] = d_w;
proposals[i * 5 + 3] = d_h;
proposals[i * 5 + 4] = d_g;
}
__device__ __forceinline__ T Min(T a, T b) const { return a > b ? b : a; }
__device__ __forceinline__ T Max(T a, T b) const { return a > b ? a : b; }
};
template <typename T, int BlockSize>
static __global__ void RFilterBBoxes(const T *bboxes,
const T *im_info,
const T min_size,
const int num,
int *keep_num,
int *keep) {
T im_h = im_info[0];
T im_w = im_info[1];
T im_scale = im_info[2];
int cnt = 0;
__shared__ int keep_index[BlockSize];
CUDA_1D_KERNEL_LOOP(i, num) {
keep_index[threadIdx.x] = -1;
__syncthreads();
int k = i * 5;
T cx = bboxes[k];
T cy = bboxes[k + 1];
T w_s = bboxes[k + 2];
T h_s = bboxes[k + 3];
if (w_s >= min_size && h_s >= min_size) {
keep_index[threadIdx.x] = i;
}
__syncthreads();
if (threadIdx.x == 0) {
int size = (num - i) < BlockSize ? num - i : BlockSize;
for (int j = 0; j < size; ++j) {
if (keep_index[j] > -1) {
keep[cnt++] = keep_index[j];
}
}
}
__syncthreads();
}
if (threadIdx.x == 0) {
keep_num[0] = cnt;
}
}
__device__ inline float trangle_area(float *a, float *b, float *c) {
return ((a[0] - c[0]) * (b[1] - c[1]) - (a[1] - c[1]) * (b[0] - c[0])) / 2.0;
}
__device__ inline float area(float *int_pts, int num_of_inter) {
float area = 0.0;
for (int i = 0; i < num_of_inter - 2; i++) {
area +=
fabs(trangle_area(int_pts, int_pts + 2 * i + 2, int_pts + 2 * i + 4));
}
return area;
}
__device__ inline void reorder_pts(float *int_pts, int num_of_inter) {
if (num_of_inter > 0) {
float center[2] = {0.0, 0.0};
for (int i = 0; i < num_of_inter; i++) {
center[0] += int_pts[2 * i];
center[1] += int_pts[2 * i + 1];
}
center[0] /= num_of_inter;
center[1] /= num_of_inter;
float vs[16];
float v[2];
float d;
for (int i = 0; i < num_of_inter; i++) {
v[0] = int_pts[2 * i] - center[0];
v[1] = int_pts[2 * i + 1] - center[1];
d = sqrt(v[0] * v[0] + v[1] * v[1]);
v[0] = v[0] / d;
v[1] = v[1] / d;
if (v[1] < 0) {
v[0] = -2 - v[0];
}
vs[i] = v[0];
}
float temp, tx, ty;
int j;
for (int i = 1; i < num_of_inter; ++i) {
if (vs[i - 1] > vs[i]) {
temp = vs[i];
tx = int_pts[2 * i];
ty = int_pts[2 * i + 1];
j = i;
while (j > 0 && vs[j - 1] > temp) {
vs[j] = vs[j - 1];
int_pts[j * 2] = int_pts[j * 2 - 2];
int_pts[j * 2 + 1] = int_pts[j * 2 - 1];
j--;
}
vs[j] = temp;
int_pts[j * 2] = tx;
int_pts[j * 2 + 1] = ty;
}
}
}
}
__device__ inline bool inter2line(
float *pts1, float *pts2, int i, int j, float *temp_pts) {
float a[2] = {pts1[2 * i], pts1[2 * i + 1]};
float b[2] = {pts1[2 * ((i + 1) % 4)], pts1[2 * ((i + 1) % 4) + 1]};
float c[2] = {pts2[2 * j], pts2[2 * j + 1]};
float d[2] = {pts2[2 * ((j + 1) % 4)], pts2[2 * ((j + 1) % 4) + 1]};
float area_abc = trangle_area(a, b, c);
float area_abd = trangle_area(a, b, d);
if (area_abc * area_abd >= 0) {
return false;
}
float area_cda = trangle_area(c, d, a);
float area_cdb = area_cda + area_abc - area_abd;
if (area_cda * area_cdb >= 0) {
return false;
}
float t = area_cda / (area_abd - area_abc);
float dx = t * (b[0] - a[0]);
float dy = t * (b[1] - a[1]);
temp_pts[0] = a[0] + dx;
temp_pts[1] = a[1] + dy;
return true;
}
__device__ inline bool in_rect(float pt_x, float pt_y, float *pts) {
float ab[2] = {pts[2] - pts[0], pts[3] - pts[1]};
float ad[2] = {pts[6] - pts[0], pts[7] - pts[1]};
float ap[2] = {pt_x - pts[0], pt_y - pts[1]};
float abab = ab[0] * ab[0] + ab[1] * ab[1];
float abap = ab[0] * ap[0] + ab[1] * ap[1];
float adad = ad[0] * ad[0] + ad[1] * ad[1];
float adap = ad[0] * ap[0] + ad[1] * ap[1];
  return abab >= abap && abap >= 0 && adad >= adap && adap >= 0;
}
__device__ inline int inter_pts(float *pts1, float *pts2, float *int_pts) {
int num_of_inter = 0;
for (int i = 0; i < 4; i++) {
if (in_rect(pts1[2 * i], pts1[2 * i + 1], pts2)) {
int_pts[num_of_inter * 2] = pts1[2 * i];
int_pts[num_of_inter * 2 + 1] = pts1[2 * i + 1];
num_of_inter++;
}
if (in_rect(pts2[2 * i], pts2[2 * i + 1], pts1)) {
int_pts[num_of_inter * 2] = pts2[2 * i];
int_pts[num_of_inter * 2 + 1] = pts2[2 * i + 1];
num_of_inter++;
}
}
float temp_pts[2];
for (int i = 0; i < 4; i++) {
for (int j = 0; j < 4; j++) {
bool has_pts = inter2line(pts1, pts2, i, j, temp_pts);
if (has_pts) {
int_pts[num_of_inter * 2] = temp_pts[0];
int_pts[num_of_inter * 2 + 1] = temp_pts[1];
num_of_inter++;
}
}
}
return num_of_inter;
}
__device__ inline void convert_region(float *pts, const float *region) {
float angle = region[4];
float a_cos = cos(angle / 180.0 * PI);
  float a_sin = -sin(angle / 180.0 * PI);  // anti-clockwise
float ctr_x = region[0];
float ctr_y = region[1];
float h = region[3];
float w = region[2];
float pts_x[4] = {-w / 2, -w / 2, w / 2, w / 2};
float pts_y[4] = {-h / 2, h / 2, h / 2, -h / 2};
for (int i = 0; i < 4; i++) {
pts[2 * i] = a_cos * pts_x[i] - a_sin * pts_y[i] + ctr_x;
pts[2 * i + 1] = a_sin * pts_x[i] + a_cos * pts_y[i] + ctr_y;
}
}
__device__ inline float inter(const float *region1, const float *region2) {
float pts1[8];
float pts2[8];
float int_pts[16];
int num_of_inter;
convert_region(pts1, region1);
convert_region(pts2, region2);
num_of_inter = inter_pts(pts1, pts2, int_pts);
reorder_pts(int_pts, num_of_inter);
return area(int_pts, num_of_inter);
}
__device__ inline float IoU(const float *region1, const float *region2) {
float area1 = region1[2] * region1[3];
float area2 = region2[2] * region2[3];
float area_inter = inter(region1, region2);
return area_inter / (area1 + area2 - area_inter);
}
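// Rotated NMS: each (row, col) thread block loads up to kThreadsPerBlock
// column boxes into shared memory, compares them against its row boxes, and
// records every pair with IoU > nms_overlap_thresh as a bit in dev_mask.
// RNMS() then walks the boxes in descending score order on the CPU and keeps a
// box only if it is not suppressed by any previously kept box.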
static __global__ void RNMSKernel(const int n_boxes,
const float nms_overlap_thresh,
const float *dev_boxes,
uint64_t *dev_mask) {
const int row_start = blockIdx.y;
const int col_start = blockIdx.x;
const int row_size =
min(n_boxes - row_start * kThreadsPerBlock, kThreadsPerBlock);
const int col_size =
min(n_boxes - col_start * kThreadsPerBlock, kThreadsPerBlock);
__shared__ float block_boxes[kThreadsPerBlock * 5];
if (threadIdx.x < col_size) {
block_boxes[threadIdx.x * 5 + 0] =
dev_boxes[(kThreadsPerBlock * col_start + threadIdx.x) * 5 + 0];
block_boxes[threadIdx.x * 5 + 1] =
dev_boxes[(kThreadsPerBlock * col_start + threadIdx.x) * 5 + 1];
block_boxes[threadIdx.x * 5 + 2] =
dev_boxes[(kThreadsPerBlock * col_start + threadIdx.x) * 5 + 2];
block_boxes[threadIdx.x * 5 + 3] =
dev_boxes[(kThreadsPerBlock * col_start + threadIdx.x) * 5 + 3];
block_boxes[threadIdx.x * 5 + 4] =
dev_boxes[(kThreadsPerBlock * col_start + threadIdx.x) * 5 + 4];
}
__syncthreads();
if (threadIdx.x < row_size) {
const int cur_box_idx = kThreadsPerBlock * row_start + threadIdx.x;
const float *cur_box = dev_boxes + cur_box_idx * 5;
int i = 0;
uint64_t t = 0;
int start = 0;
if (row_start == col_start) {
start = threadIdx.x + 1;
}
for (i = start; i < col_size; i++) {
if (IoU(cur_box, block_boxes + i * 5) > nms_overlap_thresh) {
t |= 1ULL << i;
}
}
const int col_blocks = DIVUP(n_boxes, kThreadsPerBlock);
dev_mask[cur_box_idx * col_blocks + col_start] = t;
}
}
template <typename T>
static void RNMS(const platform::CUDADeviceContext &ctx,
const Tensor &proposals,
const Tensor &sorted_indices,
const T nms_threshold,
Tensor *keep_out) {
int boxes_num = proposals.dims()[0];
PADDLE_ENFORCE_EQ(boxes_num, sorted_indices.dims()[0]);
const int col_blocks = DIVUP(boxes_num, kThreadsPerBlock);
dim3 blocks(DIVUP(boxes_num, kThreadsPerBlock),
DIVUP(boxes_num, kThreadsPerBlock));
dim3 threads(kThreadsPerBlock);
const T *boxes = proposals.data<T>();
auto place = boost::get<platform::CUDAPlace>(ctx.GetPlace());
framework::Vector<uint64_t> mask(boxes_num * col_blocks);
RNMSKernel<<<blocks, threads>>>(
boxes_num,
nms_threshold,
boxes,
mask.CUDAMutableData(boost::get<platform::CUDAPlace>(ctx.GetPlace())));
std::vector<uint64_t> remv(col_blocks);
memset(&remv[0], 0, sizeof(uint64_t) * col_blocks);
std::vector<int> keep_vec;
int num_to_keep = 0;
for (int i = 0; i < boxes_num; i++) {
int nblock = i / kThreadsPerBlock;
int inblock = i % kThreadsPerBlock;
if (!(remv[nblock] & (1ULL << inblock))) {
++num_to_keep;
keep_vec.push_back(i);
uint64_t *p = &mask[0] + i * col_blocks;
for (int j = nblock; j < col_blocks; j++) {
remv[j] |= p[j];
}
}
}
int *keep = keep_out->mutable_data<int>({num_to_keep}, ctx.GetPlace());
memory::Copy(place,
keep,
platform::CPUPlace(),
keep_vec.data(),
sizeof(int) * num_to_keep,
ctx.stream());
ctx.Wait();
}
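// Proposal generation for a single image:
//   1. sort anchor scores in descending order and keep the top pre_nms_top_n;
//   2. decode bbox deltas into rotated proposals (RBoxDecodeAndClipFunctor);
//   3. filter out proposals smaller than min_size (RFilterBBoxes);
//   4. run rotated NMS and keep at most post_nms_top_n proposals.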
template <typename T>
static std::pair<Tensor, Tensor> RRPNProposalForOneImage(
const platform::CUDADeviceContext &ctx,
const Tensor &im_info,
const Tensor &anchors,
const Tensor &variances,
const Tensor &bbox_deltas, // [M, 5]
const Tensor &scores, // [N, 1]
int pre_nms_top_n,
int post_nms_top_n,
float nms_thresh,
float min_size) {
// 1. pre nms
Tensor scores_sort, index_sort;
RSortDescending<T>(ctx, scores, &scores_sort, &index_sort);
int num = scores.numel();
int pre_nms_num = (pre_nms_top_n <= 0 || pre_nms_top_n > num) ? scores.numel()
: pre_nms_top_n;
scores_sort.Resize({pre_nms_num, 1});
index_sort.Resize({pre_nms_num, 1});
// 2. box decode and clipping
Tensor proposals;
proposals.mutable_data<T>({pre_nms_num, 5}, ctx.GetPlace());
{
platform::ForRange<platform::CUDADeviceContext> for_range(ctx, pre_nms_num);
for_range(RBoxDecodeAndClipFunctor<T>{anchors.data<T>(),
bbox_deltas.data<T>(),
variances.data<T>(),
index_sort.data<int>(),
im_info.data<T>(),
proposals.data<T>()});
}
// 3. filter
Tensor keep_index, keep_num_t;
keep_index.mutable_data<int>({pre_nms_num}, ctx.GetPlace());
keep_num_t.mutable_data<int>({1}, ctx.GetPlace());
min_size = std::max(min_size, 0.0f);
auto stream = ctx.stream();
RFilterBBoxes<T, 256><<<1, 256, 0, stream>>>(proposals.data<T>(),
im_info.data<T>(),
min_size,
pre_nms_num,
keep_num_t.data<int>(),
keep_index.data<int>());
int keep_num;
const auto gpu_place = boost::get<platform::CUDAPlace>(ctx.GetPlace());
memory::Copy(platform::CPUPlace(),
&keep_num,
gpu_place,
keep_num_t.data<int>(),
sizeof(int),
ctx.stream());
ctx.Wait();
keep_index.Resize({keep_num});
Tensor scores_filter, proposals_filter;
proposals_filter.mutable_data<T>({keep_num, 5}, ctx.GetPlace());
scores_filter.mutable_data<T>({keep_num, 1}, ctx.GetPlace());
GPUGather<T>(ctx, proposals, keep_index, &proposals_filter);
GPUGather<T>(ctx, scores_sort, keep_index, &scores_filter);
if (nms_thresh <= 0) {
return std::make_pair(proposals_filter, scores_filter);
}
// 4. nms
Tensor keep_nms;
RNMS<T>(ctx, proposals_filter, keep_index, nms_thresh, &keep_nms);
if (post_nms_top_n > 0 && post_nms_top_n < keep_nms.numel()) {
keep_nms.Resize({post_nms_top_n});
}
Tensor scores_nms, proposals_nms;
proposals_nms.mutable_data<T>({keep_nms.numel(), 5}, ctx.GetPlace());
scores_nms.mutable_data<T>({keep_nms.numel(), 1}, ctx.GetPlace());
GPUGather<T>(ctx, proposals_filter, keep_nms, &proposals_nms);
GPUGather<T>(ctx, scores_filter, keep_nms, &scores_nms);
return std::make_pair(proposals_nms, scores_nms);
}
} // namespace
template <typename DeviceContext, typename T>
class CUDARRPNGenerateProposalsKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext &context) const override {
auto *scores = context.Input<Tensor>("Scores");
auto *bbox_deltas = context.Input<Tensor>("BboxDeltas");
auto *im_info = context.Input<Tensor>("ImInfo");
auto anchors = detail::Ref(context.Input<Tensor>("Anchors"),
"Cannot find input Anchors(%s) in scope",
context.InputNames("Anchors")[0]);
auto variances = detail::Ref(context.Input<Tensor>("Variances"),
"Cannot find input Variances(%s) in scope",
context.InputNames("Variances")[0]);
auto *rpn_rois = context.Output<LoDTensor>("RpnRois");
auto *rpn_roi_probs = context.Output<LoDTensor>("RpnRoiProbs");
int pre_nms_top_n = context.Attr<int>("pre_nms_topN");
int post_nms_top_n = context.Attr<int>("post_nms_topN");
float nms_thresh = context.Attr<float>("nms_thresh");
float min_size = context.Attr<float>("min_size");
auto &dev_ctx = context.template device_context<DeviceContext>();
auto scores_dim = scores->dims();
int64_t num = scores_dim[0];
int64_t c_score = scores_dim[1];
int64_t h_score = scores_dim[2];
int64_t w_score = scores_dim[3];
auto bbox_dim = bbox_deltas->dims();
int64_t c_bbox = bbox_dim[1];
int64_t h_bbox = bbox_dim[2];
int64_t w_bbox = bbox_dim[3];
Tensor bbox_deltas_swap, scores_swap;
bbox_deltas_swap.mutable_data<T>({num, h_bbox, w_bbox, c_bbox},
dev_ctx.GetPlace());
scores_swap.mutable_data<T>({num, h_score, w_score, c_score},
dev_ctx.GetPlace());
math::Transpose<DeviceContext, T, 4> trans;
std::vector<int> axis = {0, 2, 3, 1};
trans(dev_ctx, *bbox_deltas, &bbox_deltas_swap, axis);
trans(dev_ctx, *scores, &scores_swap, axis);
anchors.Resize({anchors.numel() / 5, 5});
variances.Resize({variances.numel() / 5, 5});
rpn_rois->mutable_data<T>({bbox_deltas->numel() / 5, 5},
context.GetPlace());
rpn_roi_probs->mutable_data<T>({scores->numel(), 1}, context.GetPlace());
T *rpn_rois_data = rpn_rois->data<T>();
T *rpn_roi_probs_data = rpn_roi_probs->data<T>();
auto place = boost::get<platform::CUDAPlace>(dev_ctx.GetPlace());
int64_t num_proposals = 0;
std::vector<size_t> offset(1, 0);
for (int64_t i = 0; i < num; ++i) {
Tensor im_info_slice = im_info->Slice(i, i + 1);
Tensor bbox_deltas_slice = bbox_deltas_swap.Slice(i, i + 1);
Tensor scores_slice = scores_swap.Slice(i, i + 1);
bbox_deltas_slice.Resize({h_bbox * w_bbox * c_bbox / 5, 5});
scores_slice.Resize({h_score * w_score * c_score, 1});
std::pair<Tensor, Tensor> box_score_pair =
RRPNProposalForOneImage<T>(dev_ctx,
im_info_slice,
anchors,
variances,
bbox_deltas_slice,
scores_slice,
pre_nms_top_n,
post_nms_top_n,
nms_thresh,
min_size);
Tensor &proposals = box_score_pair.first;
Tensor &scores = box_score_pair.second;
memory::Copy(place,
rpn_rois_data + num_proposals * 5,
place,
proposals.data<T>(),
sizeof(T) * proposals.numel(),
dev_ctx.stream());
memory::Copy(place,
rpn_roi_probs_data + num_proposals,
place,
scores.data<T>(),
sizeof(T) * scores.numel(),
dev_ctx.stream());
dev_ctx.Wait();
num_proposals += proposals.dims()[0];
offset.emplace_back(num_proposals);
}
framework::LoD lod;
lod.emplace_back(offset);
rpn_rois->set_lod(lod);
rpn_roi_probs->set_lod(lod);
rpn_rois->Resize({num_proposals, 5});
rpn_roi_probs->Resize({num_proposals, 1});
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_CUDA_KERNEL(
rrpn_generate_proposals,
ops::CUDARRPNGenerateProposalsKernel<paddle::platform::CUDADeviceContext,
float>);
/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <algorithm>
#include <limits>
#include <memory>
#include "math_function.h"
#include "paddle/fluid/framework/op_registry.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
using LoDTensor = framework::LoDTensor;
class RRPNRotatedROIAlignOp : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
void InferShape(framework::InferShapeContext* ctx) const override {
PADDLE_ENFORCE(ctx->HasInput("X"),
"Input(X) of Rotated ROIAlignOp should not be null.");
PADDLE_ENFORCE(ctx->HasInput("ROIs"),
"Input(ROIs) of Rotated ROIAlignOp should not be null.");
PADDLE_ENFORCE(ctx->HasOutput("Out"),
"Output(Out) of Rotated ROIAlignOp should not be null.");
auto input_dims = ctx->GetInputDim("X");
auto rois_dims = ctx->GetInputDim("ROIs");
PADDLE_ENFORCE(input_dims.size() == 4,
"The format of input tensor is NCHW.");
PADDLE_ENFORCE(rois_dims.size() == 2,
               "ROIs should be a 2-D LoDTensor of shape (num_rois, 5) "
               "given as [[x, y, w, h, theta], ...].");
if (ctx->IsRuntime()) {
  PADDLE_ENFORCE(rois_dims[1] == 5,
                 "ROIs should be a 2-D LoDTensor of shape (num_rois, 5) "
                 "given as [[x, y, w, h, theta], ...].");
}
int pooled_height = ctx->Attrs().Get<int>("pooled_height");
int pooled_width = ctx->Attrs().Get<int>("pooled_width");
float spatial_scale = ctx->Attrs().Get<float>("spatial_scale");
PADDLE_ENFORCE_GT(
    pooled_height, 0, "The pooled output height must be greater than 0");
PADDLE_ENFORCE_GT(
    pooled_width, 0, "The pooled output width must be greater than 0");
PADDLE_ENFORCE_GT(
    spatial_scale, 0.0f, "The spatial scale must be greater than 0");
auto out_dims = input_dims;
out_dims[0] = rois_dims[0];
out_dims[1] = input_dims[1];
out_dims[2] = pooled_height;
out_dims[3] = pooled_width;
ctx->SetOutputDim("Out", out_dims);
ctx->SetOutputDim("ConIdX", out_dims);
ctx->SetOutputDim("ConIdY", out_dims);
}
protected:
framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext& ctx) const override {
return framework::OpKernelType(ctx.Input<framework::Tensor>("X")->type(),
ctx.device_context());
}
};
class RRPNRotatedROIAlignGradOp : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
void InferShape(framework::InferShapeContext* ctx) const override {
PADDLE_ENFORCE(ctx->HasInput(framework::GradVarName("Out")),
"The GRAD@Out of RotatedROIAlignGradOp should not be null.");
PADDLE_ENFORCE(ctx->HasOutputs(framework::GradVarName("X")),
"The GRAD@X of RotatedROIAlignGradOp should not be null.");
ctx->SetOutputsDim(framework::GradVarName("X"), ctx->GetInputsDim("X"));
}
protected:
framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext& ctx) const override {
return framework::OpKernelType(ctx.Input<framework::Tensor>("ROIs")->type(),
ctx.device_context());
}
};
class RRPNRotatedROIAlignOpMaker : public framework::OpProtoAndCheckerMaker {
public:
void Make() override {
AddInput("X",
"(Tensor), "
"The input of RRPNRotatedROIAlignOp. The data type is float32 or "
"float64."
"The format of input tensor is NCHW. Where N is batch size, "
"C is the number of input channels, "
"H is the height of the feature, and "
"W is the width of the feature.");
AddInput("ROIs",
"(LoDTensor), "
"ROIs (Regions of Interest) to pool over. "
"should be a 2-D LoDTensor of shape (num_rois, 5)"
"given as [[x, y, w, h, theta], ...]. "
"(x, y) is the center coordinates, and "
"(w, h) is the bottom right coordinates, theta is rotation angle"
"of ROI.");
AddOutput("Out",
"(Tensor), "
"The output of ROIAlignOp is a 4-D tensor with shape "
"(num_rois, channels, pooled_h, pooled_w). The data type is "
"float32 or float64.");
AddOutput("ConIdX",
"(Tensor), "
"index x of affine transform");
AddOutput("ConIdY",
"(Tensor), "
"index y of affine transform");
AddAttr<float>("spatial_scale",
"(float, default 1.0), "
"Multiplicative spatial scale factor "
"to translate ROI coords from their input scale "
"to the scale used when pooling.")
.SetDefault(1.0);
AddAttr<int>("pooled_height",
"(int, default 1), "
"The pooled output height.")
.SetDefault(1);
AddAttr<int>("pooled_width",
"(int, default 1), "
"The pooled output width.")
.SetDefault(1);
AddComment(R"DOC(
**RotatedRoIAlign Operator**
Rotated Region of Interest Align (Rotated RoI Align) performs bilinear
interpolation on inputs of nonuniform sizes to obtain fixed-size feature maps
(e.g. 7*7), while keeping the orientation of each region proposal.
Each rotated proposal is divided into equal-sized bins according to
pooled_width and pooled_height. In each RoI bin, the values of the four
regularly sampled locations are computed directly through bilinear
interpolation, and the output is the mean of the four locations, which avoids
the misalignment problem.
)DOC");
}
};
template <typename T>
class RRPNRotatedROIAlignGradMaker : public framework::SingleGradOpMaker<T> {
public:
using framework::SingleGradOpMaker<T>::SingleGradOpMaker;
protected:
std::unique_ptr<T> Apply() const override {
std::unique_ptr<T> op(new T);
op->SetType("rrpn_rotated_roi_align_grad");
op->SetInput("X", this->Input("X"));
op->SetInput("ROIs", this->Input("ROIs"));
op->SetInput("ConIdX", this->Output("ConIdX"));
op->SetInput("ConIdY", this->Output("ConIdY"));
op->SetInput(framework::GradVarName("Out"), this->OutputGrad("Out"));
op->SetOutput(framework::GradVarName("X"), this->InputGrad("X"));
op->SetAttrMap(this->Attrs());
return op;
}
};
DECLARE_NO_NEED_BUFFER_VARS_INFERENCE(
RRPNRotatedRoiAlignGradNoNeedBufVarsInferer, "X");
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OPERATOR(
rrpn_rotated_roi_align,
ops::RRPNRotatedROIAlignOp,
ops::RRPNRotatedROIAlignOpMaker,
ops::RRPNRotatedROIAlignGradMaker<paddle::framework::OpDesc>,
ops::RRPNRotatedROIAlignGradMaker<paddle::imperative::OpBase>);
REGISTER_OPERATOR(rrpn_rotated_roi_align_grad,
ops::RRPNRotatedROIAlignGradOp,
ops::RRPNRotatedRoiAlignGradNoNeedBufVarsInferer);
/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Based on
@misc{ma2019rrpn,
author = {Jianqi Ma},
title = {{RRPN in pytorch}},
year = {2019},
howpublished = {\url{https://github.com/mjq11302010044/RRPN_pytorch}},
}
@article{Jianqi17RRPN,
Author = {Jianqi Ma and Weiyuan Shao and Hao Ye and Li Wang and Hong Wang
and Yingbin Zheng and Xiangyang Xue},
Title = {Arbitrary-Oriented Scene Text Detection via Rotation Proposals},
journal = {IEEE Transactions on Multimedia},
volume={20},
number={11},
pages={3111-3122},
year={2018}
}*/
#include <algorithm>
#include <limits>
#include "paddle/fluid/framework/op_registry.h"
#include "paddle/fluid/memory/memory.h"
#include "paddle/fluid/platform/cuda_primitives.h"
#define CUDA_1D_KERNEL_LOOP(i, n) \
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; \
i += blockDim.x * gridDim.x)
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
using LoDTensor = framework::LoDTensor;
static constexpr int kNumCUDAThreads = 512;
static constexpr int kNumMaxinumNumBlocks = 4096;
#define PI 3.141592654
static inline int NumBlocks(const int N) {
return std::min((N + kNumCUDAThreads - 1) / kNumCUDAThreads,
kNumMaxinumNumBlocks);
}
template <typename T>
__global__ void Zero(T* x, int num) {
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num;
i += blockDim.x * gridDim.x) {
x[i] = static_cast<T>(0);
}
}
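// Forward pass of Rotated RoI Align: for every output element (n, c, ph, pw)
// an affine transform M (built from the ROI center, size, angle and
// spatial_scale) maps the pooled bin onto the feature map, the bin center is
// bilinearly interpolated from its 4 neighbouring pixels, and the bin-center
// coordinates are also accumulated into con_idx_x / con_idx_y for the
// backward pass.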
template <typename T>
__global__ void RROIAlignForward(const int nthreads,
const T* bottom_data,
const T spatial_scale,
int height,
int width,
int channels,
const int pooled_height,
const int pooled_width,
const T* bottom_rois,
int* roi_batch_id_data,
T* top_data,
T* con_idx_x,
T* con_idx_y) {
CUDA_1D_KERNEL_LOOP(index, nthreads) {
int imageWidth = width;
int imageHeight = height;
// (n, c, ph, pw) is an element in the pooled output
int n = index;
int pw = n % pooled_width;
n /= pooled_width;
int ph = n % pooled_height;
n /= pooled_height;
int c = n % channels;
n /= channels;
const T* offset_bottom_rois = bottom_rois + n * 5;
int roi_batch_ind = roi_batch_id_data[n];
T cx = offset_bottom_rois[0];
T cy = offset_bottom_rois[1];
T h = offset_bottom_rois[3];
T w = offset_bottom_rois[2];
T angle = offset_bottom_rois[4] / 180.0 * PI;
// TransformPrepare
T dx = -pooled_width / 2.0;
T dy = -pooled_height / 2.0;
T Sx = w * spatial_scale / pooled_width;
T Sy = h * spatial_scale / pooled_height;
T Alpha = cos(angle);
T Beta = sin(angle);
T Dx = cx * spatial_scale;
T Dy = cy * spatial_scale;
T M[2][3];
M[0][0] = Alpha * Sx;
M[0][1] = Beta * Sy;
M[0][2] = Alpha * Sx * dx + Beta * Sy * dy + Dx;
M[1][0] = -Beta * Sx;
M[1][1] = Alpha * Sy;
M[1][2] = -Beta * Sx * dx + Alpha * Sy * dy + Dy;
T P[8];
P[0] = M[0][0] * pw + M[0][1] * ph + M[0][2];
P[1] = M[1][0] * pw + M[1][1] * ph + M[1][2];
P[2] = M[0][0] * pw + M[0][1] * (ph + 1) + M[0][2];
P[3] = M[1][0] * pw + M[1][1] * (ph + 1) + M[1][2];
P[4] = M[0][0] * (pw + 1) + M[0][1] * ph + M[0][2];
P[5] = M[1][0] * (pw + 1) + M[1][1] * ph + M[1][2];
P[6] = M[0][0] * (pw + 1) + M[0][1] * (ph + 1) + M[0][2];
P[7] = M[1][0] * (pw + 1) + M[1][1] * (ph + 1) + M[1][2];
T leftMost = (max(round(min(min(P[0], P[2]), min(P[4], P[6]))), 0.0));
T rightMost =
(min(round(max(max(P[0], P[2]), max(P[4], P[6]))), imageWidth - 1.0));
T topMost = (max(round(min(min(P[1], P[3]), min(P[5], P[7]))), 0.0));
T bottomMost =
(min(round(max(max(P[1], P[3]), max(P[5], P[7]))), imageHeight - 1.0));
const T* offset_bottom_data =
bottom_data + (roi_batch_ind * channels + c) * height * width;
float bin_cx = (leftMost + rightMost) / 2.0; // shift
float bin_cy = (topMost + bottomMost) / 2.0;
int bin_l = (int)floor(bin_cx);
int bin_r = (int)ceil(bin_cx);
int bin_t = (int)floor(bin_cy);
int bin_b = (int)ceil(bin_cy);
T lt_value = 0.0;
if (bin_t > 0 && bin_l > 0 && bin_t < height && bin_l < width)
lt_value = offset_bottom_data[bin_t * width + bin_l];
T rt_value = 0.0;
if (bin_t > 0 && bin_r > 0 && bin_t < height && bin_r < width)
rt_value = offset_bottom_data[bin_t * width + bin_r];
T lb_value = 0.0;
if (bin_b > 0 && bin_l > 0 && bin_b < height && bin_l < width)
lb_value = offset_bottom_data[bin_b * width + bin_l];
T rb_value = 0.0;
if (bin_b > 0 && bin_r > 0 && bin_b < height && bin_r < width)
rb_value = offset_bottom_data[bin_b * width + bin_r];
T rx = bin_cx - floor(bin_cx);
T ry = bin_cy - floor(bin_cy);
T wlt = (1.0 - rx) * (1.0 - ry);
T wrt = rx * (1.0 - ry);
T wrb = rx * ry;
T wlb = (1.0 - rx) * ry;
T inter_val = 0.0;
inter_val += lt_value * wlt;
inter_val += rt_value * wrt;
inter_val += rb_value * wrb;
inter_val += lb_value * wlb;
platform::CudaAtomicAdd(top_data + index, static_cast<T>(inter_val));
platform::CudaAtomicAdd(con_idx_x + index, static_cast<T>(bin_cx));
platform::CudaAtomicAdd(con_idx_y + index, static_cast<T>(bin_cy));
}
}
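// Backward pass: read back the bin centers recorded in con_idx_x / con_idx_y
// and scatter each output gradient to the 4 neighbouring input pixels with the
// same bilinear weights used in the forward pass (via atomic adds).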
template <typename T>
__global__ void RROIAlignBackward(const int nthreads,
const T* top_diff,
const float* con_idx_x,
const float* con_idx_y,
const int num_rois,
const float spatial_scale,
const int height,
const int width,
const int channels,
const int pooled_height,
const int pooled_width,
T* bottom_diff,
const T* bottom_rois,
int* roi_batch_id_data) {
CUDA_1D_KERNEL_LOOP(index, nthreads) {
// (n, c, ph, pw) is an element in the pooled output
int n = index;
n /= pooled_width;
n /= pooled_height;
int c = n % channels;
n /= channels;
const T* offset_bottom_rois = bottom_rois + n * 5;
int roi_batch_ind = roi_batch_id_data[n];
T* offset_bottom_diff =
bottom_diff + (roi_batch_ind * channels + c) * height * width;
float bw = con_idx_x[index];
float bh = con_idx_y[index];
int bin_xs = int(floor(bw));
int bin_ys = int(floor(bh));
float rx = bw - float(bin_xs);
float ry = bh - float(bin_ys);
T wlt = (1.0 - rx) * (1.0 - ry);
T wrt = rx * (1.0 - ry);
T wrb = rx * ry;
T wlb = (1.0 - rx) * ry;
int min_x = (int)floor(bw);
int max_x = (int)ceil(bw);
int min_y = (int)floor(bh);
int max_y = (int)ceil(bh);
T top_diff_of_bin = top_diff[index];
T v1 = wlt * top_diff_of_bin;
T v2 = wrt * top_diff_of_bin;
T v3 = wrb * top_diff_of_bin;
T v4 = wlb * top_diff_of_bin;
if (min_y > 0 && min_x > 0 && min_y < height - 1 && min_x < width - 1)
platform::CudaAtomicAdd(offset_bottom_diff + min_y * width + min_x,
static_cast<T>(v1));
if (min_y > 0 && max_x < width - 1 && min_y < height - 1 && max_x > 0)
platform::CudaAtomicAdd(offset_bottom_diff + min_y * width + max_x,
static_cast<T>(v2));
if (max_y < height - 1 && max_x < width - 1 && max_y > 0 && max_x > 0)
platform::CudaAtomicAdd(offset_bottom_diff + max_y * width + max_x,
static_cast<T>(v3));
if (max_y < height - 1 && min_x > 0 && max_y > 0 && min_x < width - 1)
platform::CudaAtomicAdd(offset_bottom_diff + max_y * width + min_x,
static_cast<T>(v4));
}
}
template <typename Place, typename T>
class RRPNROIAlignRotatedCUDAKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* input = ctx.Input<Tensor>("X");
auto* rois = ctx.Input<LoDTensor>("ROIs");
auto* out = ctx.Output<Tensor>("Out");
auto* con_idx_x = ctx.Output<Tensor>("ConIdX");
auto* con_idx_y = ctx.Output<Tensor>("ConIdY");
auto pooled_height = ctx.Attr<int>("pooled_height");
auto pooled_width = ctx.Attr<int>("pooled_width");
auto spatial_scale = ctx.Attr<float>("spatial_scale");
auto in_dims = input->dims();
int batch_size = in_dims[0];
int channels = in_dims[1];
int height = in_dims[2];
int width = in_dims[3];
int rois_num = rois->dims()[0];
if (rois_num == 0) return;
int output_size = out->numel();
int blocks = NumBlocks(output_size);
int threads = kNumCUDAThreads;
Tensor roi_batch_id_list;
roi_batch_id_list.Resize({rois_num});
auto cplace = platform::CPUPlace();
int* roi_batch_id_data = roi_batch_id_list.mutable_data<int>(cplace);
auto lod = rois->lod();
PADDLE_ENFORCE_EQ(
lod.empty(),
false,
"Input(ROIs) Tensor of ROIAlignOp does not contain LoD information.");
auto rois_lod = lod.back();
int rois_batch_size = rois_lod.size() - 1;
PADDLE_ENFORCE_EQ(
rois_batch_size,
batch_size,
"The rois_batch_size and imgs batch_size must be the same.");
int rois_num_with_lod = rois_lod[rois_batch_size];
PADDLE_ENFORCE_EQ(rois_num,
rois_num_with_lod,
"The rois_num from input and lod must be the same.");
for (int n = 0; n < rois_batch_size; ++n) {
for (size_t i = rois_lod[n]; i < rois_lod[n + 1]; ++i) {
roi_batch_id_data[i] = n;
}
}
auto& dev_ctx = ctx.cuda_device_context();
int bytes = roi_batch_id_list.numel() * sizeof(int);
auto roi_ptr = memory::Alloc(dev_ctx, bytes);
int* roi_id_data = reinterpret_cast<int*>(roi_ptr->ptr());
const auto gplace = boost::get<platform::CUDAPlace>(ctx.GetPlace());
memory::Copy(gplace,
roi_id_data,
cplace,
roi_batch_id_data,
bytes,
dev_ctx.stream());
T* out_ = out->mutable_data<T>(ctx.GetPlace());
T* con_idx_x_ = con_idx_x->mutable_data<T>(ctx.GetPlace());
T* con_idx_y_ = con_idx_y->mutable_data<T>(ctx.GetPlace());
int idx_x_num = con_idx_x->numel();
int idx_y_num = con_idx_y->numel();
int out_num = out->numel();
Zero<<<(idx_x_num + 512 - 1) / 512, 512, 0, dev_ctx.stream()>>>(con_idx_x_,
idx_x_num);
Zero<<<(idx_y_num + 512 - 1) / 512, 512, 0, dev_ctx.stream()>>>(con_idx_y_,
idx_y_num);
Zero<<<(out_num + 512 - 1) / 512, 512, 0, dev_ctx.stream()>>>(out_,
out_num);
RROIAlignForward<T><<<blocks, threads, 0, dev_ctx.stream()>>>(
output_size,
input->data<T>(),
spatial_scale,
height,
width,
channels,
pooled_height,
pooled_width,
rois->data<T>(),
roi_id_data,
out_,
con_idx_x_,
con_idx_y_);
}
};
template <typename Place, typename T>
class RRPNROIAlignRotatedGradCUDAKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
auto* input = ctx.Input<Tensor>("X");
auto* rois = ctx.Input<LoDTensor>("ROIs");
auto* out_grad = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* in_grad = ctx.Output<Tensor>(framework::GradVarName("X"));
auto* con_idx_x = ctx.Input<Tensor>("ConIdX");
auto* con_idx_y = ctx.Input<Tensor>("ConIdY");
auto pooled_height = ctx.Attr<int>("pooled_height");
auto pooled_width = ctx.Attr<int>("pooled_width");
auto spatial_scale = ctx.Attr<float>("spatial_scale");
int rois_num = rois->dims()[0];
int channels = input->dims()[1];
int height = input->dims()[2];
int width = input->dims()[3];
if (!in_grad) {
return;
}
Tensor roi_batch_id_list;
roi_batch_id_list.Resize({rois_num});
auto cplace = platform::CPUPlace();
int* roi_batch_id_data = roi_batch_id_list.mutable_data<int>(cplace);
auto rois_lod = rois->lod().back();
int rois_batch_size = rois_lod.size() - 1;
for (int n = 0; n < rois_batch_size; ++n) {
for (size_t i = rois_lod[n]; i < rois_lod[n + 1]; ++i) {
roi_batch_id_data[i] = n;
}
}
auto& dev_ctx = ctx.cuda_device_context();
auto roi_ptr =
memory::Alloc(dev_ctx, roi_batch_id_list.numel() * sizeof(int));
int* roi_id_data = reinterpret_cast<int*>(roi_ptr->ptr());
int bytes = roi_batch_id_list.numel() * sizeof(int);
const auto gplace = boost::get<platform::CUDAPlace>(ctx.GetPlace());
memory::Copy(gplace,
roi_id_data,
cplace,
roi_batch_id_data,
bytes,
dev_ctx.stream());
T* in_grad_ = in_grad->mutable_data<T>(ctx.GetPlace());
int in_grad_num = in_grad->numel();
Zero<<<(in_grad_num + 512 - 1) / 512, 512, 0, dev_ctx.stream()>>>(
in_grad_, in_grad_num);
int output_grad_size = out_grad->numel();
int blocks = NumBlocks(output_grad_size);
int threads = kNumCUDAThreads;
con_idx_x->data<float>();
con_idx_y->data<float>();
out_grad->data<T>();
rois->data<T>();
if (output_grad_size > 0) {
RROIAlignBackward<T><<<blocks, threads, 0, dev_ctx.stream()>>>(
output_grad_size,
out_grad->data<T>(),
con_idx_x->data<float>(),
con_idx_y->data<float>(),
rois_num,
spatial_scale,
height,
width,
channels,
pooled_height,
pooled_width,
in_grad_,
// in_grad->mutable_data<T>(ctx.GetPlace()),
rois->data<T>(),
roi_id_data);
}
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OP_CUDA_KERNEL(
rrpn_rotated_roi_align,
ops::RRPNROIAlignRotatedCUDAKernel<paddle::platform::CUDADeviceContext,
float>,
ops::RRPNROIAlignRotatedCUDAKernel<paddle::platform::CUDADeviceContext,
double>);
REGISTER_OP_CUDA_KERNEL(
rrpn_rotated_roi_align_grad,
ops::RRPNROIAlignRotatedGradCUDAKernel<paddle::platform::CUDADeviceContext,
float>,
ops::RRPNROIAlignRotatedGradCUDAKernel<paddle::platform::CUDADeviceContext,
double>);
/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#include <fstream>
#include <iostream>
#include <random>
#include "bbox_util.h"
#include "paddle/fluid/framework/op_registry.h"
namespace paddle {
namespace operators {
using Tensor = framework::Tensor;
using LoDTensor = framework::LoDTensor;
template <typename T,
int MajorType = Eigen::RowMajor,
typename IndexType = Eigen::DenseIndex>
using EigenMatrix = framework::EigenMatrix<T, MajorType, IndexType>;
class RRpnTargetAssignOp : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
void InferShape(framework::InferShapeContext* ctx) const override {
PADDLE_ENFORCE(ctx->HasInput("Anchor"),
"Input(Anchor) of RRpnTargetAssignOp should not be null");
PADDLE_ENFORCE(ctx->HasInput("GtBoxes"),
"Input(GtBoxes) of RRpnTargetAssignOp should not be null");
PADDLE_ENFORCE(ctx->HasInput("ImInfo"),
"Input(ImInfo) of RRpnTargetAssignOp should not be null");
PADDLE_ENFORCE(
ctx->HasOutput("LocationIndex"),
"Output(LocationIndex) of RRpnTargetAssignOp should not be null");
PADDLE_ENFORCE(
ctx->HasOutput("ScoreIndex"),
"Output(ScoreIndex) of RRpnTargetAssignOp should not be null");
PADDLE_ENFORCE(
ctx->HasOutput("TargetLabel"),
"Output(TargetLabel) of RRpnTargetAssignOp should not be null");
PADDLE_ENFORCE(
ctx->HasOutput("TargetBBox"),
"Output(TargetBBox) of RRpnTargetAssignOp should not be null");
auto anchor_dims = ctx->GetInputDim("Anchor");
auto gt_boxes_dims = ctx->GetInputDim("GtBoxes");
auto im_info_dims = ctx->GetInputDim("ImInfo");
PADDLE_ENFORCE_EQ(
anchor_dims.size(), 2, "The rank of Input(Anchor) must be 2.");
PADDLE_ENFORCE_EQ(
gt_boxes_dims.size(), 2, "The rank of Input(GtBoxes) must be 2.");
PADDLE_ENFORCE_EQ(
im_info_dims.size(), 2, "The rank of Input(ImInfo) must be 2.");
ctx->SetOutputDim("LocationIndex", {-1});
ctx->SetOutputDim("ScoreIndex", {-1});
ctx->SetOutputDim("TargetLabel", {-1, 1});
ctx->SetOutputDim("TargetBBox", {-1, 5});
}
protected:
framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext& ctx) const override {
return framework::OpKernelType(
ctx.Input<framework::LoDTensor>("Anchor")->type(),
platform::CPUPlace());
}
};
template <typename T>
void AppendRpns(LoDTensor* out, int64_t offset, Tensor* to_add) {
auto* out_data = out->data<T>();
auto* to_add_data = to_add->data<T>();
memcpy(out_data + offset, to_add_data, to_add->numel() * sizeof(T));
}
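// Keep only anchors that lie inside the image (within rpn_straddle_thresh
// pixels of the border); a negative threshold disables the filtering. Returns
// the kept indices and the corresponding anchor rows.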
template <typename T>
std::vector<Tensor> FilterStraddleAnchor(
const platform::CPUDeviceContext& context,
const Tensor* anchor,
const float rpn_straddle_thresh,
T im_height,
T im_width,
int64_t offset) {
std::vector<int> inds_inside;
int anchor_num = anchor->dims()[0];
auto* anchor_data = anchor->data<T>();
if (rpn_straddle_thresh >= 0) {
int index;
for (int i = 0; i < anchor_num; ++i) {
index = i * offset;
if ((anchor_data[index + 0] >= -rpn_straddle_thresh) &&
(anchor_data[index + 1] >= -rpn_straddle_thresh) &&
(anchor_data[index + 2] < im_width + rpn_straddle_thresh) &&
(anchor_data[index + 3] < im_height + rpn_straddle_thresh)) {
inds_inside.emplace_back(i);
}
}
} else {
for (int i = 0; i < anchor_num; ++i) {
inds_inside.emplace_back(i);
}
}
int inside_num = inds_inside.size();
Tensor inds_inside_t;
int* inds_inside_data =
inds_inside_t.mutable_data<int>({inside_num}, context.GetPlace());
std::copy(inds_inside.begin(), inds_inside.end(), inds_inside_data);
Tensor inside_anchor_t;
T* inside_anchor_data =
inside_anchor_t.mutable_data<T>({inside_num, offset}, context.GetPlace());
Gather<T>(anchor->data<T>(),
offset,
inds_inside_data,
inside_num,
inside_anchor_data);
std::vector<Tensor> res;
res.emplace_back(inds_inside_t);
res.emplace_back(inside_anchor_t);
return res;
}
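// Reservoir sampling: keep at most `num` indices from `inds`, chosen uniformly
// at random when use_random is true, otherwise simply truncate.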
void ReservoirSampling(const int num,
std::vector<int>* inds,
std::minstd_rand engine,
bool use_random) {
std::uniform_real_distribution<float> uniform(0, 1);
size_t len = inds->size();
if (len > static_cast<size_t>(num)) {
if (use_random) {
for (size_t i = num; i < len; ++i) {
int rng_ind = std::floor(uniform(engine) * i);
if (rng_ind < num)
std::iter_swap(inds->begin() + rng_ind, inds->begin() + i);
}
}
inds->resize(num);
}
}
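// Label assignment: an anchor is foreground if it has the highest overlap with
// some gt box or its max IoU >= rpn_positive_overlap, and background if its
// max IoU < rpn_negative_overlap. Both sets are then subsampled with reservoir
// sampling so that at most rpn_fg_fraction * rpn_batch_size_per_im anchors are
// foreground and the rest of the batch is background.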
template <typename T>
void RRpnScoreAssign(const T* anchor_by_gt_overlap_data,
const Tensor& anchor_to_gt_max,
const Tensor& gt_to_anchor_max,
const int rpn_batch_size_per_im,
const float rpn_fg_fraction,
const float rpn_positive_overlap,
const float rpn_negative_overlap,
std::vector<int>* fg_inds,
std::vector<int>* bg_inds,
std::vector<int>* tgt_lbl,
std::minstd_rand engine,
bool use_random) {
float epsilon = 0.00000001;
int anchor_num = anchor_to_gt_max.dims()[0];
int gt_num = gt_to_anchor_max.dims()[0];
std::vector<int> target_label(anchor_num, -1);
const T* anchor_to_gt_max_data = anchor_to_gt_max.data<T>();
const T* gt_to_anchor_max_data = gt_to_anchor_max.data<T>();
for (int64_t i = 0; i < anchor_num; ++i) {
bool is_anchors_with_max_overlap = false;
int64_t j = 0;
for (; j < gt_num; ++j) {
T value = anchor_by_gt_overlap_data[i * gt_num + j];
T diff = std::abs(value - gt_to_anchor_max_data[j]);
if (diff < epsilon) {
is_anchors_with_max_overlap = true;
break;
}
}
bool is_anchor_great_than_thresh =
(anchor_to_gt_max_data[i] >= rpn_positive_overlap);
if (is_anchors_with_max_overlap || is_anchor_great_than_thresh) {
fg_inds->emplace_back(i);
target_label[i] = 1;
}
}
// Reservoir Sampling
int fg_num = 0;
if (rpn_fg_fraction > 0 && rpn_batch_size_per_im > 0) {
fg_num = static_cast<int>(rpn_fg_fraction * rpn_batch_size_per_im);
ReservoirSampling(fg_num, fg_inds, engine, use_random);
}
fg_num = static_cast<int>(fg_inds->size());
for (int64_t i = 0; i < anchor_num; ++i) {
if (anchor_to_gt_max_data[i] < rpn_negative_overlap &&
target_label[i] != 1) {
bg_inds->emplace_back(i);
target_label[i] = 0;
}
}
int bg_num = 0;
if (rpn_fg_fraction > 0 && rpn_batch_size_per_im > 0) {
bg_num = rpn_batch_size_per_im - fg_num;
ReservoirSampling(bg_num, bg_inds, engine, use_random);
}
bg_num = static_cast<int>(bg_inds->size());
tgt_lbl->resize(fg_num + bg_num, 0);
std::vector<int> fg_lbl(fg_num, 1);
std::vector<int> bg_lbl(bg_num, 0);
std::copy(fg_lbl.begin(), fg_lbl.end(), tgt_lbl->data());
std::copy(bg_lbl.begin(), bg_lbl.end(), tgt_lbl->data() + fg_num);
}
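// Sample foreground/background anchors for one image and return four tensors:
// foreground anchor indices (for bbox regression), foreground + background
// indices (for classification), their target labels, and the matched gt box
// index of each foreground anchor.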
template <typename T>
std::vector<Tensor> SampleRRpnFgBgGt(const platform::CPUDeviceContext& ctx,
const Tensor& anchor_by_gt_overlap,
const int rpn_batch_size_per_im,
const float rpn_positive_overlap,
const float rpn_negative_overlap,
const float rpn_fg_fraction,
std::minstd_rand engine,
bool use_random) {
auto* anchor_by_gt_overlap_data = anchor_by_gt_overlap.data<T>();
int anchor_num = anchor_by_gt_overlap.dims()[0];
int gt_num = anchor_by_gt_overlap.dims()[1];
std::vector<int> fg_inds;
std::vector<int> bg_inds;
std::vector<int> gt_inds;
std::vector<int> tgt_lbl;
// Calculate the max IoU between anchors and gt boxes
// Map from anchor to gt box that has highest overlap
auto place = ctx.GetPlace();
Tensor anchor_to_gt_max, anchor_to_gt_argmax, gt_to_anchor_max;
anchor_to_gt_max.mutable_data<T>({anchor_num}, place);
int* argmax = anchor_to_gt_argmax.mutable_data<int>({anchor_num}, place);
gt_to_anchor_max.mutable_data<T>({gt_num}, place);
auto anchor_by_gt_overlap_et =
framework::EigenMatrix<T>::From(anchor_by_gt_overlap);
auto anchor_to_gt_max_et =
framework::EigenVector<T>::Flatten(anchor_to_gt_max);
auto gt_to_anchor_max_et =
framework::EigenVector<T>::Flatten(gt_to_anchor_max);
auto anchor_to_gt_argmax_et =
framework::EigenVector<int>::Flatten(anchor_to_gt_argmax);
anchor_to_gt_max_et =
anchor_by_gt_overlap_et.maximum(Eigen::DSizes<int, 1>(1));
anchor_to_gt_argmax_et =
anchor_by_gt_overlap_et.argmax(1).template cast<int>();
gt_to_anchor_max_et =
anchor_by_gt_overlap_et.maximum(Eigen::DSizes<int, 1>(0));
// Follow the Faster RCNN's implementation
RRpnScoreAssign(anchor_by_gt_overlap_data,
anchor_to_gt_max,
gt_to_anchor_max,
rpn_batch_size_per_im,
rpn_fg_fraction,
rpn_positive_overlap,
rpn_negative_overlap,
&fg_inds,
&bg_inds,
&tgt_lbl,
engine,
use_random);
int fg_num = fg_inds.size();
int bg_num = bg_inds.size();
gt_inds.reserve(fg_num);
for (int i = 0; i < fg_num; ++i) {
gt_inds.emplace_back(argmax[fg_inds[i]]);
}
Tensor loc_index_t, score_index_t, tgt_lbl_t, gt_inds_t;
int* loc_index_data = loc_index_t.mutable_data<int>({fg_num}, place);
int* score_index_data =
score_index_t.mutable_data<int>({fg_num + bg_num}, place);
int* tgt_lbl_data = tgt_lbl_t.mutable_data<int>({fg_num + bg_num}, place);
int* gt_inds_data = gt_inds_t.mutable_data<int>({fg_num}, place);
std::copy(fg_inds.begin(), fg_inds.end(), loc_index_data);
std::copy(fg_inds.begin(), fg_inds.end(), score_index_data);
std::copy(bg_inds.begin(), bg_inds.end(), score_index_data + fg_num);
std::copy(tgt_lbl.begin(), tgt_lbl.end(), tgt_lbl_data);
std::copy(gt_inds.begin(), gt_inds.end(), gt_inds_data);
std::vector<Tensor> loc_score_tgtlbl_gt;
loc_score_tgtlbl_gt.emplace_back(loc_index_t);
loc_score_tgtlbl_gt.emplace_back(score_index_t);
loc_score_tgtlbl_gt.emplace_back(tgt_lbl_t);
loc_score_tgtlbl_gt.emplace_back(gt_inds_t);
return loc_score_tgtlbl_gt;
}
template <typename T>
class RRpnTargetAssignKernel : public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
auto* anchor = context.Input<Tensor>("Anchor"); // (H*W*A) * 5
auto* gt_boxes = context.Input<LoDTensor>("GtBoxes");
auto* im_info = context.Input<LoDTensor>("ImInfo");
auto* loc_index = context.Output<LoDTensor>("LocationIndex");
auto* score_index = context.Output<LoDTensor>("ScoreIndex");
auto* tgt_bbox = context.Output<LoDTensor>("TargetBBox");
auto* tgt_lbl = context.Output<LoDTensor>("TargetLabel");
PADDLE_ENFORCE_EQ(gt_boxes->lod().size(),
1UL,
"RRpnTargetAssignOp gt_boxes needs 1 level of LoD");
int64_t anchor_num = static_cast<int64_t>(anchor->dims()[0]);
int64_t batch_num = static_cast<int64_t>(gt_boxes->lod().back().size() - 1);
int rpn_batch_size_per_im = context.Attr<int>("rpn_batch_size_per_im");
float rpn_straddle_thresh = context.Attr<float>("rpn_straddle_thresh");
float rpn_positive_overlap = context.Attr<float>("rpn_positive_overlap");
float rpn_negative_overlap = context.Attr<float>("rpn_negative_overlap");
float rpn_fg_fraction = context.Attr<float>("rpn_fg_fraction");
bool use_random = context.Attr<bool>("use_random");
int64_t max_num = batch_num * rpn_batch_size_per_im;
auto place = context.GetPlace();
loc_index->mutable_data<int>({max_num}, place);
score_index->mutable_data<int>({max_num}, place);
tgt_bbox->mutable_data<T>({max_num, 5}, place);
tgt_lbl->mutable_data<int>({max_num, 1}, place);
auto& dev_ctx = context.device_context<platform::CPUDeviceContext>();
std::random_device rnd;
std::minstd_rand engine;
int seed = rnd();
engine.seed(seed);
framework::LoD lod_loc, loc_score;
std::vector<size_t> lod0_loc(1, 0);
std::vector<size_t> lod0_score(1, 0);
int total_loc_num = 0;
int total_score_num = 0;
auto gt_boxes_lod = gt_boxes->lod().back();
for (int i = 0; i < batch_num; ++i) {
Tensor gt_boxes_slice =
gt_boxes->Slice(gt_boxes_lod[i], gt_boxes_lod[i + 1]);
Tensor im_info_slice = im_info->Slice(i, i + 1);
auto* im_info_data = im_info_slice.data<T>();
auto im_height = im_info_data[0];
auto im_width = im_info_data[1];
// auto im_scale = im_info_data[2];
// Filter straddle anchor
std::vector<Tensor> filter_output = FilterStraddleAnchor<T>(
dev_ctx, anchor, rpn_straddle_thresh, im_height, im_width, 5);
Tensor inds_inside = filter_output[0];
Tensor inside_anchor = filter_output[1];
Tensor anchor_by_gt_overlap;
anchor_by_gt_overlap.mutable_data<T>(
{inside_anchor.dims()[0], gt_boxes_slice.dims()[0]}, place);
BboxOverlaps2<T>(inside_anchor, gt_boxes_slice, &anchor_by_gt_overlap);
auto loc_score_tgtlbl_gt = SampleRRpnFgBgGt<T>(dev_ctx,
anchor_by_gt_overlap,
rpn_batch_size_per_im,
rpn_positive_overlap,
rpn_negative_overlap,
rpn_fg_fraction,
engine,
use_random);
Tensor sampled_loc_index = loc_score_tgtlbl_gt[0];
Tensor sampled_score_index = loc_score_tgtlbl_gt[1];
Tensor sampled_tgtlbl = loc_score_tgtlbl_gt[2];
Tensor sampled_gt_index = loc_score_tgtlbl_gt[3];
int loc_num = sampled_loc_index.dims()[0];
int score_num = sampled_score_index.dims()[0];
// unmap to all anchor
Tensor sampled_loc_index_unmap, sampled_score_index_unmap;
sampled_loc_index_unmap.mutable_data<int>({loc_num}, place);
sampled_score_index_unmap.mutable_data<int>({score_num}, place);
Gather<int>(inds_inside.data<int>(),
1,
sampled_loc_index.data<int>(),
loc_num,
sampled_loc_index_unmap.data<int>());
Gather<int>(inds_inside.data<int>(),
1,
sampled_score_index.data<int>(),
score_num,
sampled_score_index_unmap.data<int>());
// get target bbox deltas
Tensor sampled_anchor, sampled_gt, sampled_tgt_bbox;
auto* sampled_anchor_data =
sampled_anchor.mutable_data<T>({loc_num, 5}, place);
auto* sampled_gt_data = sampled_gt.mutable_data<T>({loc_num, 5}, place);
Gather<T>(anchor->data<T>(),
5,
sampled_loc_index_unmap.data<int>(),
loc_num,
sampled_anchor_data);
Gather<T>(gt_boxes_slice.data<T>(),
5,
sampled_gt_index.data<int>(),
loc_num,
sampled_gt_data);
sampled_tgt_bbox.mutable_data<T>({loc_num, 5}, place);
BoxToDelta2<T>(
loc_num, sampled_anchor, sampled_gt, nullptr, &sampled_tgt_bbox);
// Add anchor offset
int anchor_offset = i * anchor_num;
auto sampled_loc_index_unmap_et =
framework::EigenTensor<int, 1>::From(sampled_loc_index_unmap);
sampled_loc_index_unmap_et = sampled_loc_index_unmap_et + anchor_offset;
auto sampled_score_index_unmap_et =
framework::EigenTensor<int, 1>::From(sampled_score_index_unmap);
sampled_score_index_unmap_et =
sampled_score_index_unmap_et + anchor_offset;
AppendRpns<int>(loc_index, total_loc_num, &sampled_loc_index_unmap);
AppendRpns<int>(score_index, total_score_num, &sampled_score_index_unmap);
AppendRpns<T>(tgt_bbox, total_loc_num * 5, &sampled_tgt_bbox);
AppendRpns<int>(tgt_lbl, total_score_num, &sampled_tgtlbl);
total_loc_num += loc_num;
total_score_num += score_num;
lod0_loc.emplace_back(total_loc_num);
lod0_score.emplace_back(total_score_num);
}
PADDLE_ENFORCE_LE(total_loc_num, max_num);
PADDLE_ENFORCE_LE(total_score_num, max_num);
lod_loc.emplace_back(lod0_loc);
loc_score.emplace_back(lod0_score);
loc_index->set_lod(lod_loc);
score_index->set_lod(loc_score);
tgt_bbox->set_lod(lod_loc);
tgt_lbl->set_lod(loc_score);
loc_index->Resize({total_loc_num});
score_index->Resize({total_score_num});
tgt_bbox->Resize({total_loc_num, 5});
tgt_lbl->Resize({total_score_num, 1});
}
};
class RRpnTargetAssignOpMaker : public framework::OpProtoAndCheckerMaker {
public:
void Make() override {
AddInput("Anchor",
"(Tensor) input anchor is a 2-D Tensor with shape [H*W*A, 5].");
AddInput("GtBoxes",
"(LoDTensor) input ground-truth bbox with shape [K, 5].");
AddInput("ImInfo",
"(LoDTensor) input image information with shape [N, 3]. "
"N is the batch size, each image information includes height, "
"width and scale.");
AddAttr<int>("rpn_batch_size_per_im",
"Total number of RPN examples per image.")
.SetDefault(256);
AddAttr<float>(
"rpn_straddle_thresh",
"Remove RPN anchors that go outside the image by straddle_thresh "
"pixels, "
"Set to -1 or a large value, e.g. 100000, to disable pruning anchors.");
AddAttr<float>(
"rpn_positive_overlap",
"Minimum overlap required between an anchor and ground-truth "
"box for the (anchor, gt box) pair to be a positive example.")
.SetDefault(0.7);
AddAttr<float>(
"rpn_negative_overlap",
"Maximum overlap allowed between an anchor and ground-truth "
"box for the (anchor, gt box) pair to be a negative examples.")
.SetDefault(0.3);
AddAttr<float>(
"rpn_fg_fraction",
"Target fraction of RoI minibatch that "
"is labeled foreground (i.e. class > 0), 0-th class is background.")
.SetDefault(0.25);
AddAttr<bool>("use_random",
"A flag indicating whether to use a ReservoirSampling. "
"NOTE: DO NOT set this flag to false in training. "
"Setting this flag to false is only useful in unittest.")
.SetDefault(true);
AddOutput(
"LocationIndex",
"(Tensor), The indexes of foreground anchors in all RPN anchors, the "
"shape of the LocationIndex is [F], F depends on the value of input "
"tensor and attributes.");
AddOutput(
"ScoreIndex",
"(Tensor), The indexes of foreground and background anchors in all "
"RPN anchors(The rest anchors are ignored). The shape of the "
"ScoreIndex is [F + B], F and B are sampled foreground and background "
" number.");
AddOutput("TargetBBox",
"(Tensor), The target bbox deltas with shape "
"[F, 5], F is the sampled foreground number.");
AddOutput(
"TargetLabel",
"(Tensor<int>), The target labels of each anchor with shape "
"[F + B, 1], F and B are sampled foreground and background number.");
AddComment(R"DOC(
Given a set of ground-truth bboxes and anchors, this operator assigns
classification and regression targets to each anchor.
The ScoreIndex and LocationIndex are generated according to the
anchor-to-ground-truth IoU; the remaining anchors do not contribute to the RPN
training loss.
ScoreIndex is composed of foreground anchor indexes (positive labels) and
background anchor indexes (negative labels). LocationIndex is exactly the same
as the foreground anchor indexes, since no regression target can be assigned
to background anchors.
The classification target (TargetLabel) is a binary class label (of being an
object or not). Following the Faster R-CNN paper, positive labels are assigned
to two kinds of anchors: (i) the anchor/anchors with the highest IoU overlap
with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than
rpn_positive_overlap (0.7) with any ground-truth box. Note that a single
ground-truth box may assign positive labels to multiple anchors. An anchor is
labeled negative when its IoU ratio is lower than rpn_negative_overlap (0.3)
for all ground-truth boxes. Anchors that are neither positive nor negative do
not contribute to the training objective.
)DOC");
}
};
} // namespace operators
} // namespace paddle
namespace ops = paddle::operators;
REGISTER_OPERATOR(
rrpn_target_assign,
ops::RRpnTargetAssignOp,
ops::RRpnTargetAssignOpMaker,
paddle::framework::EmptyGradOpMaker<paddle::framework::OpDesc>,
paddle::framework::EmptyGradOpMaker<paddle::imperative::OpBase>);
REGISTER_OP_CPU_KERNEL(rrpn_target_assign,
ops::RRpnTargetAssignKernel<float>,
ops::RRpnTargetAssignKernel<double>);
/* Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
#pragma once
#include <vector>
#include "paddle/fluid/platform/enforce.h"
namespace paddle {
namespace operators {
namespace detail {
/**
 * Get a reference from a pointer, with a null check. The error message is in
 * printf format and is passed via `args`.
 */
template <typename T, typename... ARGS>
inline T& Ref(T* ptr, ARGS&&... args) {
PADDLE_ENFORCE_NOT_NULL(ptr, ::paddle::string::Sprintf(args...));
return *ptr;
}
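// For example (matching how the proposal op above calls it):
//   auto anchors = detail::Ref(ctx.Input<Tensor>("Anchors"),
//                              "Cannot find input Anchors(%s) in scope", name);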
} // namespace detail
} // namespace operators
} // namespace paddle
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
import paddle.fluid as fluid
from paddle.fluid.param_attr import ParamAttr
from paddle.fluid.initializer import Constant
from paddle.fluid.initializer import Normal
from paddle.fluid.initializer import MSRA
from paddle.fluid.regularizer import L2Decay
from config import cfg
from models.ext_op.rrpn_lib import *
class RRPN(object):
def __init__(self,
add_conv_body_func=None,
add_roi_box_head_func=None,
mode='train',
use_pyreader=True,
use_random=True):
self.add_conv_body_func = add_conv_body_func
self.add_roi_box_head_func = add_roi_box_head_func
self.mode = mode
self.use_pyreader = use_pyreader
self.use_random = use_random
def build_model(self, image_shape):
self.build_input(image_shape)
body_conv = self.add_conv_body_func(self.image)
# RPN
self.rpn_heads(body_conv)
# Fast RCNN
self.fast_rcnn_heads(body_conv)
if self.mode != 'train':
self.eval_bbox()
def loss(self):
losses = []
# Fast RCNN loss
loss_cls, loss_bbox = self.fast_rcnn_loss()
# RPN loss
rpn_cls_loss, rpn_reg_loss = self.rpn_loss()
losses = [loss_cls, loss_bbox, rpn_cls_loss, rpn_reg_loss]
rkeys = ['loss', 'loss_cls', 'loss_bbox', \
'loss_rpn_cls', 'loss_rpn_bbox',]
loss = fluid.layers.sum(losses)
rloss = [loss] + losses
return rloss, rkeys, self.rpn_rois
def eval_bbox_out(self):
return self.pred_result
def build_input(self, image_shape):
if self.use_pyreader:
in_shapes = [[-1] + image_shape, [-1, 5], [-1, 1], [-1, 1],
[-1, 3], [-1, 1]]
lod_levels = [0, 1, 1, 1, 0, 0]
dtypes = [
'float32', 'float32', 'int32', 'int32', 'float32', 'int64'
]
self.py_reader = fluid.layers.py_reader(
capacity=64,
shapes=in_shapes,
lod_levels=lod_levels,
dtypes=dtypes,
use_double_buffer=True)
ins = fluid.layers.read_file(self.py_reader)
self.image = ins[0]
self.gt_box = ins[1]
self.gt_label = ins[2]
self.is_crowd = ins[3]
self.im_info = ins[4]
self.im_id = ins[5]
else:
self.image = fluid.layers.data(
name='image', shape=image_shape, dtype='float32')
self.gt_box = fluid.layers.data(
    name='gt_box', shape=[5], dtype='float32', lod_level=1)
self.gt_label = fluid.layers.data(
name='gt_label', shape=[1], dtype='int32', lod_level=1)
self.is_crowd = fluid.layers.data(
name='is_crowd', shape=[1], dtype='int32', lod_level=1)
self.im_info = fluid.layers.data(
name='im_info', shape=[3], dtype='float32')
self.im_id = fluid.layers.data(
name='im_id', shape=[1], dtype='int64')
self.difficult = fluid.layers.data(
name='difficult', shape=[1], dtype='float32', lod_level=1)
def feeds(self):
if self.mode == 'infer':
return [self.image, self.im_info]
if self.mode == 'val':
return [
self.image, self.gt_box, self.gt_label, self.is_crowd,
self.im_info, self.im_id, self.difficult
]
return [
self.image, self.gt_box, self.gt_label, self.is_crowd, self.im_info,
self.im_id
]
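# Inference post-processing: for each foreground class, decode the per-class
# bbox deltas against the RPN proposals with rrpn_box_coder, run
# multiclass_nms on the decoded rotated boxes, offset the predicted label by
# the class index so each detection carries its class id, concatenate the
# per-class results, and finally gather only the valid rows.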
def eval_bbox(self):
self.im_scale = fluid.layers.slice(
self.im_info, [1], starts=[2], ends=[3])
im_scale_lod = fluid.layers.sequence_expand(self.im_scale,
self.rpn_rois)
results = []
boxes = self.rpn_rois
cls_prob = fluid.layers.softmax(self.cls_score, use_cudnn=False)
bbox_pred = fluid.layers.reshape(self.bbox_pred, (-1, cfg.class_num, 5))
for i in range(cfg.class_num - 1):
bbox_pred_slice = fluid.layers.slice(
bbox_pred, axes=[1], starts=[i + 1], ends=[i + 2])
bbox_pred_reshape = fluid.layers.reshape(bbox_pred_slice, (-1, 5))
decoded_box = rrpn_box_coder(prior_box=boxes, \
target_box=bbox_pred_reshape, \
prior_box_var=cfg.bbox_reg_weights)
score_slice = fluid.layers.slice(
cls_prob, axes=[1], starts=[i + 1], ends=[i + 2])
score_slice = fluid.layers.reshape(score_slice, shape=[-1, 1])
box_positive = fluid.layers.reshape(decoded_box, shape=[-1, 8])
box_reshape = fluid.layers.reshape(x=box_positive, shape=[1, -1, 8])
score_reshape = fluid.layers.reshape(
x=score_slice, shape=[1, 1, -1])
pred_result = fluid.layers.multiclass_nms(
bboxes=box_reshape,
scores=score_reshape,
score_threshold=cfg.TEST.score_thresh,
nms_top_k=-1,
nms_threshold=cfg.TEST.nms_thresh,
keep_top_k=cfg.TEST.detections_per_im,
normalized=False,
background_label=-1)
result_shape = fluid.layers.shape(pred_result)
res_dimension = fluid.layers.slice(
result_shape, axes=[0], starts=[1], ends=[2])
res_dimension = fluid.layers.reshape(res_dimension, shape=[1, 1])
dimension = fluid.layers.fill_constant(
shape=[1, 1], value=2, dtype='int32')
cond = fluid.layers.less_than(dimension, res_dimension)
res = fluid.layers.create_global_var(
shape=[1, 10], value=0.0, dtype='float32', persistable=False)
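            # If real detections exist, add (i + 1) to the label column so the
            # single-class NMS output carries the true class id; otherwise
            # store a row of -1 placeholders to be filtered out afterwards.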
with fluid.layers.control_flow.Switch() as switch:
with switch.case(cond):
coordinate = fluid.layers.fill_constant(
shape=[9], value=0.0, dtype='float32')
pred_class = fluid.layers.fill_constant(
shape=[1], value=i + 1, dtype='float32')
add_class = fluid.layers.concat(
[pred_class, coordinate], axis=0)
normal_result = fluid.layers.elementwise_add(pred_result,
add_class)
fluid.layers.assign(normal_result, res)
with switch.default():
normal_result = fluid.layers.fill_constant(
shape=[1, 10], value=-1.0, dtype='float32')
fluid.layers.assign(normal_result, res)
results.append(res)
if len(results) == 1:
self.pred_result = results[0]
return
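        # Concatenate the per-class results and keep only rows whose label
        # column is positive, dropping the -1 placeholder rows.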
outs = []
out = fluid.layers.concat(results)
zero = fluid.layers.fill_constant(
shape=[1, 1], value=0.0, dtype='float32')
out_split, _ = fluid.layers.split(out, dim=1, num_or_sections=[1, 9])
out_bool = fluid.layers.greater_than(out_split, zero)
idx = fluid.layers.where(out_bool)
idx_split, _ = fluid.layers.split(idx, dim=1, num_or_sections=[1, 1])
idx = fluid.layers.reshape(idx_split, [-1, 1])
self.pred_result = fluid.layers.gather(input=out, index=idx)
def rpn_heads(self, rpn_input):
# RPN hidden representation
dim_out = rpn_input.shape[1]
rpn_conv = fluid.layers.conv2d(
input=rpn_input,
num_filters=dim_out,
filter_size=3,
stride=1,
padding=1,
act='relu',
name='conv_rpn',
param_attr=ParamAttr(
name="conv_rpn_w", initializer=Normal(
loc=0., scale=0.01)),
bias_attr=ParamAttr(
name="conv_rpn_b", learning_rate=2., regularizer=L2Decay(0.)))
self.anchor, self.var = rotated_anchor_generator(
input=rpn_conv,
anchor_sizes=cfg.anchor_sizes,
aspect_ratios=cfg.aspect_ratios,
angles=cfg.anchor_angle,
variance=cfg.variance,
stride=cfg.rpn_stride,
offset=0.5)
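        # One anchor is generated per (size, aspect ratio, angle) combination
        # at every feature-map location.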
num_anchor = self.anchor.shape[2]
# Proposal classification scores
self.rpn_cls_score = fluid.layers.conv2d(
rpn_conv,
num_filters=num_anchor,
filter_size=1,
stride=1,
padding=0,
act=None,
name='rpn_cls_score',
param_attr=ParamAttr(
name="rpn_cls_logits_w", initializer=Normal(
loc=0., scale=0.01)),
bias_attr=ParamAttr(
name="rpn_cls_logits_b",
learning_rate=2.,
regularizer=L2Decay(0.)))
# Proposal bbox regression deltas
self.rpn_bbox_pred = fluid.layers.conv2d(
rpn_conv,
num_filters=5 * num_anchor,
filter_size=1,
stride=1,
padding=0,
act=None,
name='rpn_bbox_pred',
param_attr=ParamAttr(
name="rpn_bbox_pred_w", initializer=Normal(
loc=0., scale=0.01)),
bias_attr=ParamAttr(
name="rpn_bbox_pred_b",
learning_rate=2.,
regularizer=L2Decay(0.)))
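        # Objectness probabilities come from a sigmoid over the per-anchor
        # logits; rotated_generate_proposals then decodes the deltas and
        # applies rotated NMS to produce the RPN proposals.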
rpn_cls_score_prob = fluid.layers.sigmoid(
self.rpn_cls_score, name='rpn_cls_score_prob')
param_obj = cfg.TRAIN if self.mode == 'train' else cfg.TEST
pre_nms_top_n = param_obj.rpn_pre_nms_top_n
post_nms_top_n = param_obj.rpn_post_nms_top_n
nms_thresh = param_obj.rpn_nms_thresh
min_size = param_obj.rpn_min_size
self.rpn_rois, self.rpn_roi_probs = rotated_generate_proposals(
scores=rpn_cls_score_prob,
bbox_deltas=self.rpn_bbox_pred,
im_info=self.im_info,
anchors=self.anchor,
variances=self.var,
pre_nms_top_n=pre_nms_top_n,
post_nms_top_n=post_nms_top_n,
nms_thresh=param_obj.rpn_nms_thresh,
min_size=param_obj.rpn_min_size)
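        # During training, sample foreground/background RoIs from the
        # proposals and compute their classification and regression targets.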
if self.mode == 'train':
outs = rotated_generate_proposal_labels(
rpn_rois=self.rpn_rois,
gt_classes=self.gt_label,
is_crowd=self.is_crowd,
gt_boxes=self.gt_box,
im_info=self.im_info,
batch_size_per_im=cfg.TRAIN.batch_size_per_im,
fg_fraction=cfg.TRAIN.fg_fractrion,
fg_thresh=cfg.TRAIN.fg_thresh,
bg_thresh_hi=cfg.TRAIN.bg_thresh_hi,
bg_thresh_lo=cfg.TRAIN.bg_thresh_lo,
bbox_reg_weights=cfg.bbox_reg_weights,
class_nums=cfg.class_num,
use_random=self.use_random)
self.rois = outs[0]
self.labels_int32 = outs[1]
self.bbox_targets = outs[2]
self.bbox_inside_weights = outs[3]
self.bbox_outside_weights = outs[4]
def fast_rcnn_heads(self, roi_input):
if self.mode == 'train':
pool_rois = self.rois
else:
pool_rois = self.rpn_rois
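        # Rotated RoI Align pools every (possibly rotated) RoI into a fixed
        # roi_resolution x roi_resolution feature map for the detection head.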
pool = rotated_roi_align(
input=roi_input,
rois=pool_rois,
pooled_height=cfg.roi_resolution,
pooled_width=cfg.roi_resolution,
spatial_scale=cfg.spatial_scale)
self.res5_2_sum = self.add_roi_box_head_func(pool)
rcnn_out = fluid.layers.pool2d(
self.res5_2_sum, pool_type='avg', pool_size=7, name='res5_pool')
self.cls_score = fluid.layers.fc(input=rcnn_out,
size=cfg.class_num,
act=None,
name='cls_score',
param_attr=ParamAttr(
name='cls_score_w',
initializer=Normal(
loc=0.0, scale=0.001)),
bias_attr=ParamAttr(
name='cls_score_b',
learning_rate=2.,
regularizer=L2Decay(0.)))
self.bbox_pred = fluid.layers.fc(input=rcnn_out,
size=5 * cfg.class_num,
act=None,
name='bbox_pred',
param_attr=ParamAttr(
name='bbox_pred_w',
initializer=Normal(
loc=0.0, scale=0.01)),
bias_attr=ParamAttr(
name='bbox_pred_b',
learning_rate=2.,
regularizer=L2Decay(0.)))
def fast_rcnn_loss(self):
labels_int64 = fluid.layers.cast(x=self.labels_int32, dtype='int64')
labels_int64.stop_gradient = True
loss_cls = fluid.layers.softmax_with_cross_entropy(
logits=self.cls_score,
label=labels_int64,
numeric_stable_mode=True, )
loss_cls = fluid.layers.reduce_mean(loss_cls)
loss_bbox = fluid.layers.smooth_l1(
x=self.bbox_pred,
y=self.bbox_targets,
inside_weight=self.bbox_inside_weights,
outside_weight=self.bbox_outside_weights,
sigma=1.0)
loss_bbox = fluid.layers.reduce_mean(loss_bbox)
return loss_cls, loss_bbox
def rpn_loss(self):
rpn_cls_score_reshape = fluid.layers.transpose(
self.rpn_cls_score, perm=[0, 2, 3, 1])
rpn_bbox_pred_reshape = fluid.layers.transpose(
self.rpn_bbox_pred, perm=[0, 2, 3, 1])
anchor_reshape = fluid.layers.reshape(self.anchor, shape=(-1, 5))
var_reshape = fluid.layers.reshape(self.var, shape=(-1, 5))
rpn_cls_score_reshape = fluid.layers.reshape(
x=rpn_cls_score_reshape, shape=(0, -1, 1))
rpn_bbox_pred_reshape = fluid.layers.reshape(
x=rpn_bbox_pred_reshape, shape=(0, -1, 5))
score_pred, loc_pred, score_tgt, loc_tgt = \
rrpn_target_assign(
bbox_pred=rpn_bbox_pred_reshape,
cls_logits=rpn_cls_score_reshape,
anchor_box=anchor_reshape,
gt_boxes=self.gt_box,
im_info=self.im_info,
rpn_batch_size_per_im=cfg.TRAIN.rpn_batch_size_per_im,
rpn_straddle_thresh=-1,
rpn_fg_fraction=cfg.TRAIN.rpn_fg_fraction,
rpn_positive_overlap=cfg.TRAIN.rpn_positive_overlap,
rpn_negative_overlap=cfg.TRAIN.rpn_negative_overlap,
use_random=self.use_random)
score_tgt = fluid.layers.cast(x=score_tgt, dtype='float32')
rpn_cls_loss = fluid.layers.sigmoid_cross_entropy_with_logits(
x=score_pred, label=score_tgt)
rpn_cls_loss = fluid.layers.reduce_mean(
rpn_cls_loss, name='loss_rpn_cls')
rpn_reg_loss = fluid.layers.smooth_l1(x=loc_pred, y=loc_tgt, sigma=3.0)
rpn_reg_loss = fluid.layers.reduce_sum(
rpn_reg_loss, name='loss_rpn_bbox')
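        # Normalize the regression loss by the number of sampled anchors,
        # i.e. the number of elements in the score target tensor.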
score_shape = fluid.layers.shape(score_tgt)
score_shape = fluid.layers.cast(x=score_shape, dtype='float32')
norm = fluid.layers.reduce_prod(score_shape)
norm.stop_gradient = True
rpn_reg_loss = rpn_reg_loss / norm
return rpn_cls_loss, rpn_reg_loss
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
class NameAdapter(object):
"""Fix the backbones variable names for pretrained weight"""
def __init__(self, model):
super(NameAdapter, self).__init__()
self.model = model
@property
def model_type(self):
return getattr(self.model, '_model_type', '')
@property
def variant(self):
return getattr(self.model, 'variant', '')
def fix_conv_norm_name(self, name):
if name == "conv1":
bn_name = "bn_" + name
else:
bn_name = "bn" + name[3:]
        # the naming rule is the same as in the pretrained weights
if self.model_type == 'SEResNeXt':
bn_name = name + "_bn"
return bn_name
def fix_shortcut_name(self, name):
if self.model_type == 'SEResNeXt':
name = 'conv' + name + '_prj'
return name
def fix_bottleneck_name(self, name):
if self.model_type == 'SEResNeXt':
conv_name1 = 'conv' + name + '_x1'
conv_name2 = 'conv' + name + '_x2'
conv_name3 = 'conv' + name + '_x3'
shortcut_name = name
else:
conv_name1 = name + "_branch2a"
conv_name2 = name + "_branch2b"
conv_name3 = name + "_branch2c"
shortcut_name = name + "_branch1"
return conv_name1, conv_name2, conv_name3, shortcut_name
def fix_layer_warp_name(self, stage_num, count, i):
name = 'res' + str(stage_num)
if count > 10 and stage_num == 4:
if i == 0:
conv_name = name + "a"
else:
conv_name = name + "b" + str(i)
else:
conv_name = name + chr(ord("a") + i)
return conv_name
def fix_c1_stage_name(self):
return "conv1"
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from collections import OrderedDict
from paddle import fluid
from paddle.fluid.param_attr import ParamAttr
from paddle.fluid.framework import Variable
from paddle.fluid.regularizer import L2Decay
from paddle.fluid.initializer import Constant
from numbers import Integral
from .name_adapter import NameAdapter
class ResNet(object):
"""
Residual Network, see https://arxiv.org/abs/1512.03385
Args:
depth (int): ResNet depth, should be 18, 34, 50, 101, 152.
freeze_at (int): freeze the backbone at which stage
norm_type (str): normalization type, 'bn'/'sync_bn'/'affine_channel'
freeze_norm (bool): freeze normalization layers
norm_decay (float): weight decay for normalization layer weights
variant (str): ResNet variant, supports 'a', 'b', 'c', 'd' currently
feature_maps (list): index of stages whose feature maps are returned
"""
__shared__ = ['norm_type', 'freeze_norm', 'weight_prefix_name']
def __init__(self,
depth=50,
freeze_at=2,
norm_type='affine_channel',
freeze_norm=True,
norm_decay=0.,
variant='b',
feature_maps=4,
weight_prefix_name=''):
super(ResNet, self).__init__()
if isinstance(feature_maps, Integral):
feature_maps = [feature_maps]
assert depth in [18, 34, 50, 101, 152], \
"depth {} not in [18, 34, 50, 101, 152]"
assert variant in ['a', 'b', 'c', 'd'], "invalid ResNet variant"
assert 0 <= freeze_at <= 4, "freeze_at should be 0, 1, 2, 3 or 4"
assert len(feature_maps) > 0, "need one or more feature maps"
assert norm_type in ['bn', 'sync_bn', 'affine_channel']
self.depth = depth
self.freeze_at = freeze_at
self.norm_type = norm_type
self.norm_decay = norm_decay
self.freeze_norm = freeze_norm
self.variant = variant
self._model_type = 'ResNet'
self.feature_maps = feature_maps
self.depth_cfg = {
18: ([2, 2, 2, 2], self.basicblock),
34: ([3, 4, 6, 3], self.basicblock),
50: ([3, 4, 6, 3], self.bottleneck),
101: ([3, 4, 23, 3], self.bottleneck),
152: ([3, 8, 36, 3], self.bottleneck)
}
self.stage_filters = [64, 128, 256, 512]
self._c1_out_chan_num = 64
self.na = NameAdapter(self)
self.prefix_name = weight_prefix_name
def _conv_offset(self,
input,
filter_size,
stride,
padding,
act=None,
name=None):
out_channel = filter_size * filter_size * 3
out = fluid.layers.conv2d(
input,
num_filters=out_channel,
filter_size=filter_size,
stride=stride,
padding=padding,
param_attr=ParamAttr(
initializer=Constant(0.0), name=name + ".w_0"),
bias_attr=ParamAttr(
initializer=Constant(0.0), name=name + ".b_0"),
act=act,
name=name)
return out
def _conv_norm(self,
input,
num_filters,
filter_size,
stride=1,
groups=1,
act=None,
name=None):
_name = self.prefix_name + name if self.prefix_name != '' else name
conv = fluid.layers.conv2d(
input=input,
num_filters=num_filters,
filter_size=filter_size,
stride=stride,
padding=(filter_size - 1) // 2,
groups=groups,
act=None,
param_attr=ParamAttr(name=_name + "_weights"),
bias_attr=False,
name=_name + '.conv2d.output.1')
bn_name = self.na.fix_conv_norm_name(name)
bn_name = self.prefix_name + bn_name if self.prefix_name != '' else bn_name
norm_lr = 0. if self.freeze_norm else 1.
norm_decay = self.norm_decay
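        # With freeze_norm the scale/offset get a zero learning rate (and
        # stop_gradient below), keeping the pretrained statistics fixed.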
pattr = ParamAttr(
name=bn_name + '_scale',
learning_rate=norm_lr,
regularizer=L2Decay(norm_decay))
battr = ParamAttr(
name=bn_name + '_offset',
learning_rate=norm_lr,
regularizer=L2Decay(norm_decay))
if self.norm_type in ['bn', 'sync_bn']:
global_stats = True if self.freeze_norm else False
out = fluid.layers.batch_norm(
input=conv,
act=act,
name=bn_name + '.output.1',
param_attr=pattr,
bias_attr=battr,
moving_mean_name=bn_name + '_mean',
moving_variance_name=bn_name + '_variance',
use_global_stats=global_stats)
scale = fluid.framework._get_var(pattr.name)
bias = fluid.framework._get_var(battr.name)
elif self.norm_type == 'affine_channel':
scale = fluid.layers.create_parameter(
shape=[conv.shape[1]],
dtype=conv.dtype,
attr=pattr,
default_initializer=fluid.initializer.Constant(1.))
bias = fluid.layers.create_parameter(
shape=[conv.shape[1]],
dtype=conv.dtype,
attr=battr,
default_initializer=fluid.initializer.Constant(0.))
out = fluid.layers.affine_channel(
x=conv, scale=scale, bias=bias, act=act)
if self.freeze_norm:
scale.stop_gradient = True
bias.stop_gradient = True
return out
def _shortcut(self, input, ch_out, stride, is_first, name):
max_pooling_in_short_cut = self.variant == 'd'
ch_in = input.shape[1]
        # the naming rule is the same as in the pretrained weights
name = self.na.fix_shortcut_name(name)
std_senet = getattr(self, 'std_senet', False)
if ch_in != ch_out or stride != 1 or (self.depth < 50 and is_first):
if std_senet:
if is_first:
return self._conv_norm(input, ch_out, 1, stride, name=name)
else:
return self._conv_norm(input, ch_out, 3, stride, name=name)
if max_pooling_in_short_cut and not is_first:
input = fluid.layers.pool2d(
input=input,
pool_size=2,
pool_stride=2,
pool_padding=0,
ceil_mode=True,
pool_type='avg')
return self._conv_norm(input, ch_out, 1, 1, name=name)
return self._conv_norm(input, ch_out, 1, stride, name=name)
else:
return input
def bottleneck(self, input, num_filters, stride, is_first, name):
if self.variant == 'a':
stride1, stride2 = stride, 1
else:
stride1, stride2 = 1, stride
# ResNeXt
groups = getattr(self, 'groups', 1)
group_width = getattr(self, 'group_width', -1)
if groups == 1:
expand = 4
elif (groups * group_width) == 256:
expand = 1
else: # FIXME hard code for now, handles 32x4d, 64x4d and 32x8d
num_filters = num_filters // 2
expand = 2
conv_name1, conv_name2, conv_name3, \
shortcut_name = self.na.fix_bottleneck_name(name)
std_senet = getattr(self, 'std_senet', False)
conv_def = [[num_filters, 1, stride1, 'relu', 1, conv_name1],
[num_filters, 3, stride2, 'relu', groups, conv_name2],
[num_filters * expand, 1, 1, None, 1, conv_name3]]
residual = input
for i, (c, k, s, act, g, _name) in enumerate(conv_def):
residual = self._conv_norm(
input=residual,
num_filters=c,
filter_size=k,
stride=s,
act=act,
groups=g,
name=_name)
short = self._shortcut(
input,
num_filters * expand,
stride,
is_first=is_first,
name=shortcut_name)
return fluid.layers.elementwise_add(
x=short, y=residual, act='relu', name=name + ".add.output.5")
    def basicblock(self, input, num_filters, stride, is_first, name):
conv0 = self._conv_norm(
input=input,
num_filters=num_filters,
filter_size=3,
act='relu',
stride=stride,
name=name + "_branch2a")
conv1 = self._conv_norm(
input=conv0,
num_filters=num_filters,
filter_size=3,
act=None,
name=name + "_branch2b")
short = self._shortcut(
input, num_filters, stride, is_first, name=name + "_branch1")
return fluid.layers.elementwise_add(x=short, y=conv1, act='relu')
def layer_warp(self, input, stage_num):
"""
Args:
input (Variable): input variable.
stage_num (int): the stage number, should be 2, 3, 4, 5
Returns:
            The output variable of the last block in the requested stage.
"""
assert stage_num in [2, 3, 4, 5]
stages, block_func = self.depth_cfg[self.depth]
count = stages[stage_num - 2]
ch_out = self.stage_filters[stage_num - 2]
is_first = False if stage_num != 2 else True
# Make the layer name and parameter name consistent
# with ImageNet pre-trained model
conv = input
for i in range(count):
conv_name = self.na.fix_layer_warp_name(stage_num, count, i)
if self.depth < 50:
is_first = True if i == 0 and stage_num == 2 else False
conv = block_func(
input=conv,
num_filters=ch_out,
stride=2 if i == 0 and stage_num != 2 else 1,
is_first=is_first,
name=conv_name)
return conv
def c1_stage(self, input):
out_chan = self._c1_out_chan_num
conv1_name = self.na.fix_c1_stage_name()
if self.variant in ['c', 'd']:
conv_def = [
[out_chan // 2, 3, 2, "conv1_1"],
[out_chan // 2, 3, 1, "conv1_2"],
[out_chan, 3, 1, "conv1_3"],
]
else:
conv_def = [[out_chan, 7, 2, conv1_name]]
for (c, k, s, _name) in conv_def:
input = self._conv_norm(
input=input,
num_filters=c,
filter_size=k,
stride=s,
act='relu',
name=_name)
output = fluid.layers.pool2d(
input=input,
pool_size=3,
pool_stride=2,
pool_padding=1,
pool_type='max')
return output
def __call__(self, input):
assert isinstance(input, Variable)
assert not (set(self.feature_maps) - set([2, 3, 4, 5])), \
"feature maps {} not in [2, 3, 4, 5]".format(self.feature_maps)
res_endpoints = []
res = input
feature_maps = self.feature_maps
severed_head = getattr(self, 'severed_head', False)
if not severed_head:
res = self.c1_stage(res)
feature_maps = range(2, max(self.feature_maps) + 1)
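        # Outputs of stages up to freeze_at are marked stop_gradient, so the
        # frozen backbone stages are not updated during detection training.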
for i in feature_maps:
res = self.layer_warp(res, i)
if i in self.feature_maps:
res_endpoints.append(res)
if self.freeze_at >= i:
res.stop_gradient = True
return res
class ResNetC5(ResNet):
__doc__ = ResNet.__doc__
def __init__(self,
depth=50,
freeze_at=2,
norm_type='affine_channel',
freeze_norm=True,
norm_decay=0.,
variant='b',
feature_maps=[5],
weight_prefix_name=''):
super(ResNetC5, self).__init__(depth, freeze_at, norm_type, freeze_norm,
norm_decay, variant, feature_maps)
self.severed_head = True
# Download the ResNet-50 pretrained weights.
echo "Downloading..."
wget https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_cos_pretrained.tar --no-check-certificate
echo "Extracting..."
tar -xf ResNet50_cos_pretrained.tar
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import numpy as np
import xml.etree.ElementTree
import os
import time
import copy
import six
import cv2
import math
import paddle
from collections import deque
import data_utils
from roidbs import ICDAR2015Dataset, ICDAR2017Dataset
from config import cfg
from PIL import Image
from data_utils import _resize
num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1))
np.random.seed(10)
def roidb_reader(roidb, mode):
im, im_scales, gt_boxes, gt_classes = data_utils.get_image_blob(roidb, mode)
im_id = roidb['im_id']
is_crowd = roidb['is_crowd']
im_height = np.round(roidb['height'] * im_scales)
im_width = np.round(roidb['width'] * im_scales)
is_difficult = roidb['is_difficult']
im_info = np.array([im_height, im_width, im_scales], dtype=np.float32)
if mode == 'val':
return im, gt_boxes, gt_classes, is_crowd, im_info, im_id, is_difficult
outs = (im, gt_boxes, gt_classes, is_crowd, im_info, im_id)
return outs
def RRPNData(mode,
batch_size=None,
total_batch_size=None,
padding_total=False,
shuffle=False,
             shuffle_seed=None):
total_batch_size = total_batch_size if total_batch_size else batch_size
assert total_batch_size % batch_size == 0
if cfg.dataset == "icdar2015":
icdar2015_dataset = ICDAR2015Dataset(mode)
roidbs = icdar2015_dataset.get_roidb()
else:
icdar2017_dataset = ICDAR2017Dataset(mode)
roidbs = icdar2017_dataset.get_roidb()
print("{} on {} with {} roidbs".format(mode, cfg.dataset, len(roidbs)))
def reader():
if mode == "train":
if shuffle:
if shuffle_seed is not None:
np.random.seed(shuffle_seed)
roidb_perm = deque(np.random.permutation(roidbs))
else:
roidb_perm = deque(roidbs)
roidb_cur = 0
count = 0
batch_out = []
            device_num = total_batch_size // batch_size
while True:
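                # Cycle through a (shuffled) permutation of the roidbs
                # indefinitely, re-shuffling after each full pass; stop after
                # enough batches for cfg.max_iter iterations have been yielded.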
start = time.time()
roidb = roidb_perm[0]
roidb_cur += 1
roidb_perm.rotate(-1)
if roidb_cur >= len(roidbs):
if shuffle:
roidb_perm = deque(np.random.permutation(roidbs))
else:
roidb_perm = deque(roidbs)
roidb_cur = 0
                # im, gt_boxes, gt_classes, is_crowd, im_info, im_id
datas = roidb_reader(roidb, mode)
if datas[1].shape[0] == 0:
continue
batch_out.append(datas)
end = time.time()
#print('reader time:', end - start)
if len(batch_out) == batch_size:
yield batch_out
count += 1
batch_out = []
iter_id = count // device_num
if iter_id >= cfg.max_iter * num_trainers:
return
elif mode == "val":
batch_out = []
for roidb in roidbs:
im, gt_boxes, gt_classes, is_crowd, im_info, im_id, is_difficult = roidb_reader(
roidb, mode)
batch_out.append((im, gt_boxes, gt_classes, is_crowd, im_info,
im_id, is_difficult))
if len(batch_out) == batch_size:
yield batch_out
batch_out = []
if len(batch_out) != 0:
yield batch_out
return reader
def train(batch_size,
total_batch_size=None,
padding_total=False,
num_workers=20,
shuffle=True,
shuffle_seed=None):
return RRPNData(
'train',
batch_size,
total_batch_size,
padding_total,
shuffle=shuffle,
shuffle_seed=shuffle_seed)
def test(batch_size, total_batch_size=None, padding_total=False):
return RRPNData('val', batch_size, total_batch_size, shuffle=False)
def infer(file_path):
def reader():
imgs = os.listdir(file_path)
imgs.sort()
for image in imgs:
            if not os.path.exists(os.path.join(file_path, image)):
                raise ValueError("Image file [%s] does not exist." %
                                 os.path.join(file_path, image))
with open(os.path.join(file_path, image), 'rb') as f:
data = f.read()
data = np.frombuffer(data, dtype='uint8')
img = cv2.imdecode(data, 1)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img, im_scale = _resize(img, target_size=1000, max_size=1778)
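            # Scale to [0, 1], normalize with the configured pixel mean/std
            # and convert HWC to CHW, matching the training preprocessing.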
img = img.astype(np.float32, copy=False)
img = img / 255.0
mean = np.array(cfg.pixel_means)[np.newaxis, np.newaxis, :]
std = np.array(cfg.pixel_std)[np.newaxis, np.newaxis, :]
img -= mean
img /= std
img = img.transpose((2, 0, 1))
h = img.shape[1]
w = img.shape[2]
im_info = np.array([h, w, im_scale], dtype=np.float32)
yield [(img, im_info)]
return reader
if __name__ == '__main__':
from utility import parse_args
args = parse_args()
train_reader = train(1, shuffle=True)
import time
time0 = time.time()
for iter_id, data in enumerate(train_reader()):
print('iter:', iter_id)
print('cost:', time.time() - time0)
time0 = time.time()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Based on:
# --------------------------------------------------------
# Detectron
# Copyright (c) 2017-present, Facebook, Inc.
# Licensed under the Apache License, Version 2.0;
# Written by Ross Girshick
# --------------------------------------------------------
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import copy
import logging
import numpy as np
import os
import scipy.sparse
import random
import time
import matplotlib
import cv2
#import segm_utils
from config import cfg
from data_utils import DatasetPath
logger = logging.getLogger(__name__)
class ICDAR2015Dataset(object):
"""A class representing a ICDAR2015 dataset."""
def __init__(self, mode):
print('Creating: {}'.format(cfg.dataset))
self.name = cfg.data_dir
self.mode = mode
data_path = DatasetPath(mode, self.name)
data_dir = data_path.get_data_dir()
file_list = data_path.get_file_list()
self.image_dir = data_dir
self.gt_dir = file_list
def get_roidb(self):
"""Return an roidb corresponding to the txt dataset. Optionally:
- include ground truth boxes in the roidb
"""
image_list = os.listdir(self.image_dir)
image_list.sort()
im_infos = []
count = 0
for image in image_list:
prefix = image[:-4]
if image.split('.')[-1] != 'jpg':
continue
img_name = os.path.join(self.image_dir, image)
gt_name = os.path.join(self.gt_dir, 'gt_' + prefix + '.txt')
easy_boxes = []
hard_boxes = []
boxes = []
gt_obj = open(gt_name, 'r', encoding='UTF-8-sig')
gt_txt = gt_obj.read()
gt_split = gt_txt.split('\n')
img = cv2.imread(img_name)
f = False
for gt_line in gt_split:
gt_ind = gt_line.split(',')
                # readable text region (transcription is not '###')
if len(gt_ind) > 3 and '###' not in gt_ind[8]:
pt1 = (int(gt_ind[0]), int(gt_ind[1]))
pt2 = (int(gt_ind[2]), int(gt_ind[3]))
pt3 = (int(gt_ind[4]), int(gt_ind[5]))
pt4 = (int(gt_ind[6]), int(gt_ind[7]))
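                    # Convert the 4 corner points into a rotated box
                    # (x_ctr, y_ctr, width, height, angle): the center is the
                    # midpoint of the pt1-pt3 diagonal, the longer edge becomes
                    # the width, and the angle (in degrees) is that edge's
                    # orientation relative to the horizontal axis.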
edge1 = np.sqrt((pt1[0] - pt2[0]) * (pt1[0] - pt2[0]) + (
pt1[1] - pt2[1]) * (pt1[1] - pt2[1]))
edge2 = np.sqrt((pt2[0] - pt3[0]) * (pt2[0] - pt3[0]) + (
pt2[1] - pt3[1]) * (pt2[1] - pt3[1]))
angle = 0
if edge1 > edge2:
width = edge1
height = edge2
if pt1[0] - pt2[0] != 0:
angle = -np.arctan(
float(pt1[1] - pt2[1]) /
float(pt1[0] - pt2[0])) / np.pi * 180
else:
angle = 90.0
elif edge2 >= edge1:
width = edge2
height = edge1
# print pt2[0], pt3[0]
if pt2[0] - pt3[0] != 0:
angle = -np.arctan(
float(pt2[1] - pt3[1]) /
float(pt2[0] - pt3[0])) / np.pi * 180
else:
angle = 90.0
if angle < -45.0:
angle = angle + 180
x_ctr = float(pt1[0] + pt3[
0]) / 2 # pt1[0] + np.abs(float(pt1[0] - pt3[0])) / 2
y_ctr = float(pt1[1] + pt3[
1]) / 2 # pt1[1] + np.abs(float(pt1[1] - pt3[1])) / 2
if self.mode == 'val':
easy_boxes.append(
list(np.array([pt1, pt2, pt3, pt4]).reshape(8)))
else:
easy_boxes.append([x_ctr, y_ctr, width, height, angle])
                # unreadable text region (transcription marked as '###')
if len(gt_ind) > 3 and '###' in gt_ind[8]:
pt1 = (int(gt_ind[0]), int(gt_ind[1]))
pt2 = (int(gt_ind[2]), int(gt_ind[3]))
pt3 = (int(gt_ind[4]), int(gt_ind[5]))
pt4 = (int(gt_ind[6]), int(gt_ind[7]))
edge1 = np.sqrt((pt1[0] - pt2[0]) * (pt1[0] - pt2[0]) + (
pt1[1] - pt2[1]) * (pt1[1] - pt2[1]))
edge2 = np.sqrt((pt2[0] - pt3[0]) * (pt2[0] - pt3[0]) + (
pt2[1] - pt3[1]) * (pt2[1] - pt3[1]))
angle = 0
if edge1 > edge2:
width = edge1
height = edge2
if pt1[0] - pt2[0] != 0:
angle = -np.arctan(
float(pt1[1] - pt2[1]) /
float(pt1[0] - pt2[0])) / np.pi * 180
else:
angle = 90.0
elif edge2 >= edge1:
width = edge2
height = edge1
if pt2[0] - pt3[0] != 0:
angle = -np.arctan(
float(pt2[1] - pt3[1]) /
float(pt2[0] - pt3[0])) / np.pi * 180
else:
angle = 90.0
if angle < -45.0:
angle = angle + 180
x_ctr = float(pt1[0] + pt3[
0]) / 2 # pt1[0] + np.abs(float(pt1[0] - pt3[0])) / 2
y_ctr = float(pt1[1] + pt3[
1]) / 2 # pt1[1] + np.abs(float(pt1[1] - pt3[1])) / 2
if self.mode == 'val':
hard_boxes.append(
list(np.array([pt1, pt2, pt3, pt4]).reshape(8)))
else:
hard_boxes.append([x_ctr, y_ctr, width, height, angle])
#print(easy_boxes)
if self.mode == 'train':
boxes.extend(easy_boxes)
                # keep only one third of the hard boxes for training
boxes.extend(hard_boxes[0:int(len(hard_boxes) / 3)])
is_difficult = [0] * len(easy_boxes)
is_difficult.extend([1] * int(len(hard_boxes) / 3))
else:
boxes.extend(easy_boxes)
boxes.extend(hard_boxes)
is_difficult = [0] * len(easy_boxes)
is_difficult.extend([1] * int(len(hard_boxes)))
len_of_bboxes = len(boxes)
#is_difficult = [0] * len(easy_boxes)
#is_difficult.extend([1] * int(len(hard_boxes)))
is_difficult = np.array(is_difficult).reshape(
1, len_of_bboxes).astype(np.int32)
if self.mode == 'train':
gt_boxes = np.zeros((len_of_bboxes, 5), dtype=np.int32)
else:
gt_boxes = np.zeros((len_of_bboxes, 8), dtype=np.int32)
gt_classes = np.zeros((len_of_bboxes), dtype=np.int32)
is_crowd = np.zeros((len_of_bboxes), dtype=np.int32)
for idx in range(len(boxes)):
if self.mode == 'train':
gt_boxes[idx, :] = [
boxes[idx][0], boxes[idx][1], boxes[idx][2],
boxes[idx][3], boxes[idx][4]
]
else:
gt_boxes[idx, :] = [
boxes[idx][0], boxes[idx][1], boxes[idx][2],
boxes[idx][3], boxes[idx][4], boxes[idx][5],
boxes[idx][6], boxes[idx][7]
]
gt_classes[idx] = 1
if gt_boxes.shape[0] <= 0:
continue
gt_boxes = gt_boxes.astype(np.float64)
im_info = {
'im_id': count,
'gt_classes': gt_classes,
'image': img_name,
'boxes': gt_boxes,
'height': img.shape[0],
'width': img.shape[1],
'is_crowd': is_crowd,
'is_difficult': is_difficult
}
im_infos.append(im_info)
count += 1
return im_infos
class ICDAR2017Dataset(object):
"""A class representing a ICDAR2017 dataset."""
def __init__(self, mode):
print('Creating: {}'.format(cfg.dataset))
self.name = cfg.data_dir
#print('**************', self.name)
self.mode = mode
data_path = DatasetPath(mode, self.name)
data_dir = data_path.get_data_dir()
#print("&**************", data_dir)
file_list = data_path.get_file_list()
self.image_dir = data_dir
self.gt_dir = file_list
def get_roidb(self):
"""Return an roidb corresponding to the json dataset. Optionally:
- include ground truth boxes in the roidb
"""
image_list = os.listdir(self.image_dir)
image_list.sort()
im_infos = []
count = 0
class_idx = 1
class_name = {}
post_fix = ['jpg', 'bmp', 'png']
if self.mode == 'val':
labels_map = get_labels_maps()
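        # During training, class ids are assigned in the order class names are
        # first seen and written to label_list; at eval time that saved
        # mapping is reloaded via get_labels_maps().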
for image in image_list:
prefix = image[:-4]
#print(image)
if image.split('.')[-1] not in post_fix:
continue
img_name = os.path.join(self.image_dir, image)
gt_name = os.path.join(self.gt_dir, 'gt_' + prefix + '.txt')
gt_classes = []
#boxes = []
#hard_boxes = []
boxes = []
gt_obj = open(gt_name, 'r', encoding='UTF-8-sig')
gt_txt = gt_obj.read()
gt_split = gt_txt.split('\n')
img = cv2.imread(img_name)
f = False
for gt_line in gt_split:
gt_ind = gt_line.split(',')
# can get the text information
if len(gt_ind) > 3:
if self.mode == 'val':
gt_classes.append(labels_map[gt_ind[-1]])
else:
if gt_ind[-1] not in class_name:
class_name[gt_ind[-1]] = class_idx
#gt_classes.append(class_idx)
class_idx += 1
gt_classes.append(class_name[gt_ind[-1]])
pt1 = (int(gt_ind[0]), int(gt_ind[1]))
pt2 = (int(gt_ind[2]), int(gt_ind[3]))
pt3 = (int(gt_ind[4]), int(gt_ind[5]))
pt4 = (int(gt_ind[6]), int(gt_ind[7]))
edge1 = np.sqrt((pt1[0] - pt2[0]) * (pt1[0] - pt2[0]) + (
pt1[1] - pt2[1]) * (pt1[1] - pt2[1]))
edge2 = np.sqrt((pt2[0] - pt3[0]) * (pt2[0] - pt3[0]) + (
pt2[1] - pt3[1]) * (pt2[1] - pt3[1]))
angle = 0
if edge1 > edge2:
width = edge1
height = edge2
if pt1[0] - pt2[0] != 0:
angle = -np.arctan(
float(pt1[1] - pt2[1]) /
float(pt1[0] - pt2[0])) / np.pi * 180
else:
angle = 90.0
elif edge2 >= edge1:
width = edge2
height = edge1
# print pt2[0], pt3[0]
if pt2[0] - pt3[0] != 0:
angle = -np.arctan(
float(pt2[1] - pt3[1]) /
float(pt2[0] - pt3[0])) / np.pi * 180
else:
angle = 90.0
if angle < -45.0:
angle = angle + 180
x_ctr = float(pt1[0] + pt3[
0]) / 2 # pt1[0] + np.abs(float(pt1[0] - pt3[0])) / 2
y_ctr = float(pt1[1] + pt3[
1]) / 2 # pt1[1] + np.abs(float(pt1[1] - pt3[1])) / 2
if self.mode == 'val':
boxes.append(
list(np.array([pt1, pt2, pt3, pt4]).reshape(8)))
else:
boxes.append([x_ctr, y_ctr, width, height, angle])
len_of_bboxes = len(boxes)
#print(len_of_bboxes)
is_difficult = np.zeros((len_of_bboxes, 1), dtype=np.int32)
if self.mode == 'train':
gt_boxes = np.zeros((len_of_bboxes, 5), dtype=np.int32)
else:
gt_boxes = np.zeros((len_of_bboxes, 8), dtype=np.int32)
gt_classes = np.array(gt_classes).reshape(len_of_bboxes, 1)
is_crowd = np.zeros((len_of_bboxes), dtype=np.int32)
for idx in range(len(boxes)):
if self.mode == 'train':
gt_boxes[idx, :] = [
boxes[idx][0], boxes[idx][1], boxes[idx][2],
boxes[idx][3], boxes[idx][4]
]
else:
gt_boxes[idx, :] = [
boxes[idx][0], boxes[idx][1], boxes[idx][2],
boxes[idx][3], boxes[idx][4], boxes[idx][5],
boxes[idx][6], boxes[idx][7]
]
#gt_classes[idx] = 1
if gt_boxes.shape[0] <= 0:
continue
gt_boxes = gt_boxes.astype(np.float64)
im_info = {
'im_id': count,
'gt_classes': gt_classes,
'image': img_name,
'boxes': gt_boxes,
'height': img.shape[0],
'width': img.shape[1],
'is_crowd': is_crowd,
'is_difficult': is_difficult
}
im_infos.append(im_info)
count += 1
if self.mode == 'train':
with open(os.path.join(cfg.data_dir, 'label_list'), 'w') as g:
for k in class_name:
g.write(k + "\n")
return im_infos
def get_labels_maps():
labels_map = {}
with open(os.path.join(cfg.data_dir, 'label_list')) as f:
lines = f.readlines()
for idx, line in enumerate(lines):
labels_map[line.strip()] = idx + 1
return labels_map
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
def set_paddle_flags(flags):
for key, value in flags.items():
if os.environ.get(key, None) is None:
os.environ[key] = str(value)
set_paddle_flags({
'FLAGS_conv_workspace_size_limit': 500,
'FLAGS_eager_delete_tensor_gb': 0, # enable gc
'FLAGS_memory_fraction_of_eager_deletion': 1,
'FLAGS_fraction_of_gpu_memory_to_use': 0.98
})
import sys
import numpy as np
import time
import shutil
import collections
import paddle
import paddle.fluid as fluid
import reader
import models.model_builder as model_builder
import models.resnet as resnet
import checkpoint as checkpoint
from config import cfg
from utility import parse_args, print_arguments, SmoothedValue, TrainingStats, now_time, check_gpu
num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1))
def get_device_num():
    # NOTE(zcd): for multi-process training, each process uses one GPU card.
if num_trainers > 1:
return 1
return fluid.core.get_cuda_device_count()
def train():
learning_rate = cfg.learning_rate
image_shape = [3, cfg.TRAIN.max_size, cfg.TRAIN.max_size]
devices_num = get_device_num()
total_batch_size = devices_num * cfg.TRAIN.im_per_batch
use_random = True
startup_prog = fluid.Program()
train_prog = fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
with fluid.unique_name.guard():
model = model_builder.RRPN(
add_conv_body_func=resnet.ResNet(),
add_roi_box_head_func=resnet.ResNetC5(),
use_pyreader=cfg.use_pyreader,
use_random=use_random)
model.build_model(image_shape)
losses, keys, rpn_rois = model.loss()
loss = losses[0]
fetch_list = losses
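            # Piecewise learning-rate decay at cfg.lr_steps combined with a
            # linear warm-up from start_factor * base_lr over the first
            # warm_up_iter iterations (see config.py).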
boundaries = cfg.lr_steps
gamma = cfg.lr_gamma
step_num = len(cfg.lr_steps)
values = [learning_rate * (gamma**i) for i in range(step_num + 1)]
start_lr = learning_rate * cfg.start_factor
lr = fluid.layers.piecewise_decay(boundaries, values)
lr = fluid.layers.linear_lr_warmup(lr, cfg.warm_up_iter, start_lr,
learning_rate)
optimizer = fluid.optimizer.Momentum(
learning_rate=lr,
regularization=fluid.regularizer.L2Decay(cfg.weight_decay),
momentum=cfg.momentum)
optimizer.minimize(loss)
fetch_list = fetch_list + [lr]
for var in fetch_list:
var.persistable = True
gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0))
place = fluid.CUDAPlace(gpu_id) if cfg.use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
build_strategy = fluid.BuildStrategy()
build_strategy.fuse_all_optimizer_ops = False
build_strategy.fuse_elewise_add_act_ops = True
exec_strategy = fluid.ExecutionStrategy()
exec_strategy.num_iteration_per_drop_scope = 1
exe.run(startup_prog)
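    # Load the pretrained backbone weights; load_and_fusebn folds batch-norm
    # statistics into the affine_channel scale/bias used by the backbone.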
if cfg.pretrained_model:
checkpoint.load_and_fusebn(exe, train_prog, cfg.pretrained_model)
compiled_train_prog = fluid.CompiledProgram(train_prog).with_data_parallel(
loss_name=loss.name,
build_strategy=build_strategy,
exec_strategy=exec_strategy)
shuffle = True
shuffle_seed = None
if num_trainers > 1:
shuffle_seed = 1
if cfg.use_pyreader:
train_reader = reader.train(
batch_size=cfg.TRAIN.im_per_batch,
total_batch_size=total_batch_size,
padding_total=cfg.TRAIN.padding_minibatch,
shuffle=shuffle,
shuffle_seed=shuffle_seed)
if num_trainers > 1:
assert shuffle_seed is not None, \
"If num_trainers > 1, the shuffle_seed must be set, because " \
"the order of batch data generated by reader " \
"must be the same in the respective processes."
# NOTE: the order of batch data generated by batch_reader
# must be the same in the respective processes.
if num_trainers > 1:
train_reader = fluid.contrib.reader.distributed_batch_reader(
train_reader)
py_reader = model.py_reader
py_reader.decorate_paddle_reader(train_reader)
else:
if num_trainers > 1: shuffle = False
train_reader = reader.train(
batch_size=total_batch_size, shuffle=shuffle)
feeder = fluid.DataFeeder(place=place, feed_list=model.feeds())
def train_loop_pyreader():
py_reader.start()
train_stats = TrainingStats(cfg.log_window, keys)
try:
start_time = time.time()
prev_start_time = start_time
for iter_id in range(cfg.max_iter):
prev_start_time = start_time
start_time = time.time()
outs = exe.run(compiled_train_prog,
fetch_list=[v.name for v in fetch_list])
stats = {k: np.array(v).mean() for k, v in zip(keys, outs[:-1])}
train_stats.update(stats)
logs = train_stats.log()
if iter_id % 10 == 0:
strs = '{}, iter: {}, lr: {:.5f}, {}, time: {:.3f}'.format(
now_time(), iter_id,
np.mean(outs[-1]), logs, start_time - prev_start_time)
print(strs)
sys.stdout.flush()
if (iter_id) % cfg.TRAIN.snapshot_iter == 0 and iter_id != 0:
save_name = "{}".format(iter_id)
checkpoint.save(exe, train_prog,
os.path.join(cfg.model_save_dir, save_name))
if (iter_id) == cfg.max_iter:
checkpoint.save(
exe, train_prog,
os.path.join(cfg.model_save_dir, "model_final"))
break
end_time = time.time()
total_time = end_time - start_time
last_loss = np.array(outs[0]).mean()
except (StopIteration, fluid.core.EOFException):
py_reader.reset()
def train_loop():
start_time = time.time()
prev_start_time = start_time
start = start_time
train_stats = TrainingStats(cfg.log_window, keys)
for iter_id, data in enumerate(train_reader()):
prev_start_time = start_time
start_time = time.time()
if data[0][1].shape[0] == 0:
continue
outs = exe.run(compiled_train_prog,
fetch_list=[v.name for v in fetch_list],
feed=feeder.feed(data))
stats = {k: np.array(v).mean() for k, v in zip(keys, outs[:-1])}
train_stats.update(stats)
logs = train_stats.log()
if iter_id % 10 == 0:
strs = '{}, iter: {}, lr: {:.5f}, {}, time: {:.3f}'.format(
now_time(), iter_id,
np.mean(outs[-1]), logs, start_time - prev_start_time)
print(strs)
sys.stdout.flush()
if (iter_id + 1) % cfg.TRAIN.snapshot_iter == 0 and iter_id != 0:
save_name = "{}".format(iter_id + 1)
checkpoint.save(exe, train_prog,
os.path.join(cfg.model_save_dir, save_name))
if (iter_id + 1) == cfg.max_iter:
checkpoint.save(exe, train_prog,
os.path.join(cfg.model_save_dir, "model_final"))
break
end_time = time.time()
total_time = end_time - start_time
last_loss = np.array(outs[0]).mean()
if cfg.use_pyreader:
train_loop_pyreader()
else:
train_loop()
if __name__ == '__main__':
args = parse_args()
print_arguments(args)
check_gpu(args.use_gpu)
train()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
"""
Contains common utility functions.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import paddle.fluid as fluid
import distutils.util
import numpy as np
import six
import argparse
import functools
import collections
import datetime
from collections import deque
from paddle.fluid import core
from collections import deque
from config import *
def print_arguments(args):
"""Print argparse's arguments.
Usage:
.. code-block:: python
parser = argparse.ArgumentParser()
parser.add_argument("name", default="Jonh", type=str, help="User name.")
args = parser.parse_args()
print_arguments(args)
:param args: Input argparse.Namespace for printing.
:type args: argparse.Namespace
"""
print("----------- Configuration Arguments -----------")
for arg, value in sorted(six.iteritems(vars(args))):
print("%s: %s" % (arg, value))
print("------------------------------------------------")
def add_arguments(argname, type, default, help, argparser, **kwargs):
"""Add argparse's argument.
Usage:
.. code-block:: python
parser = argparse.ArgumentParser()
add_argument("name", str, "Jonh", "User name.", parser)
args = parser.parse_args()
"""
type = distutils.util.strtobool if type == bool else type
argparser.add_argument(
"--" + argname,
default=default,
type=type,
help=help + ' Default: %(default)s.',
**kwargs)
class SmoothedValue(object):
"""Track a series of values and provide access to smoothed values over a
window or the global series average.
"""
def __init__(self, window_size):
self.deque = deque(maxlen=window_size)
def add_value(self, value):
self.deque.append(value)
def get_median_value(self):
return np.median(self.deque)
def now_time():
return datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')
class TrainingStats(object):
def __init__(self, window_size, stats_keys):
self.smoothed_losses_and_metrics = {
key: SmoothedValue(window_size)
for key in stats_keys
}
def update(self, stats):
for k, v in self.smoothed_losses_and_metrics.items():
v.add_value(stats[k])
def get(self, extras=None):
stats = collections.OrderedDict()
if extras:
for k, v in extras.items():
stats[k] = v
for k, v in self.smoothed_losses_and_metrics.items():
stats[k] = round(v.get_median_value(), 3)
return stats
def log(self, extras=None):
d = self.get(extras)
strs = ', '.join(str(dict({x: y})).strip('{}') for x, y in d.items())
return strs
def parse_args():
"""return all args
"""
parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
# ENV
add_arg('use_gpu', bool, True, "Whether use GPU.")
add_arg('model_save_dir', str, 'output', "The path to save model.")
add_arg('pretrained_model', str, 'ResNet50_cos_pretrained', "The init model path.")
add_arg('dataset', str, 'icdar2015', "icdar2015, icdar2017.")
add_arg('class_num', int, 2, "Class number.")
add_arg('data_dir', str, 'dataset/icdar2015', "The data root path.")
add_arg('use_pyreader', bool, False, "Use pyreader.")
add_arg('use_profile', bool, False, "Whether use profiler.")
add_arg('padding_minibatch',bool, False,
"If False, only resize image and not pad, image shape is different between"
" GPUs in one mini-batch. If True, image shape is the same in one mini-batch.")
#SOLVER
add_arg('learning_rate', float, 0.02, "Learning rate.")
add_arg('max_iter', int, 17500, "Iter number.")
add_arg('log_window', int, 20, "Log smooth window, set 1 for debug, set 20 for train.")
# RCNN
# RPN
add_arg('anchor_sizes', int, [128, 256, 512], "The size of anchors.")
add_arg('aspect_ratios', float, [0.2, 0.5,1.0], "The ratio of anchors.")
add_arg('anchor_angle', float, [-30.0, 0.0, 30.0, 60.0, 90.0, 120.0], "The angles of anchors.")
add_arg('variance', float, [1.0, 1.0, 1.0, 1.0, 1.0], "The variance of anchors.")
add_arg('rpn_stride', float, [16.,16.], "Stride of the feature map that RPN is attached.")
add_arg('rpn_nms_thresh', float, 0.7, "NMS threshold used on RPN proposals")
# TRAIN VAL INFER
add_arg('im_per_batch', int, 1, "Minibatch size.")
add_arg('pixel_means', float, [0.485, 0.456, 0.406], "pixel mean")
add_arg('nms_thresh', float, 0.3, "NMS threshold.")
add_arg('score_thresh', float, 0.01, "score threshold for NMS.")
add_arg('snapshot_stride', int, 1000, "save model every snapshot stride.")
# SINGLE EVAL AND DRAW
add_arg('draw_threshold', float, 0.8, "Confidence threshold to draw bbox.")
add_arg('image_path', str, 'ICDAR2015/tmp/', "The image path used to inference and visualize.")
# yapf: enable
args = parser.parse_args()
file_name = sys.argv[0]
if 'train' in file_name or 'profile' in file_name:
merge_cfg_from_args(args, 'train')
else:
merge_cfg_from_args(args, 'val')
return args
def check_gpu(use_gpu):
"""
    Log an error and exit when use_gpu is set to true but the CPU-only
    version of paddlepaddle is installed.
"""
err = "Config use_gpu cannot be set as true while you are " \
"using paddlepaddle cpu version ! \nPlease try: \n" \
"\t1. Install paddlepaddle-gpu to run model on GPU \n" \
"\t2. Set use_gpu as false in config file to run " \
"model on CPU"
try:
if use_gpu and not fluid.is_compiled_with_cuda():
logger.error(err)
sys.exit(1)
except Exception as e:
pass