Update yolov3 tiny (#1033)

* add PPYOLO

Co-authored-by: N longxiang <longxiang@baidu.com>

defcfec1 · Kaipeng Deng · GitHub · 38d420bb
29 changed files
--- a/README.md
+++ b/README.md
@@ -83,7 +83,7 @@
 The figure below plots COCO mAP against inference speed (FPS) on a single Tesla V100 for representative models of each architecture and backbone.

 <div align="center">
-  <img src="docs/images/map_fps.png" />
+  <img src="docs/images/map_fps.png" width=800 />
 </div>

 **Notes:**
@@ -92,6 +92,12 @@
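 - PaddleDetection's enhanced `YOLOv3-ResNet50vd-DCN` scores 10.6 absolute percentage points higher COCO mAP than the original and runs at 61.3 FPS, about 70% faster than the original
 - All models in the figure are available in the [Model Zoo](#模型库)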

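+The figure below plots COCO mAP against inference speed (FPS) on a single Tesla V100 for state-of-the-art object detectors and for PPYOLO, which PaddleDetection released with better accuracy and speed than YOLOv4. PPYOLO reaches 45.2% mAP on the [COCO](http://cocodataset.org) test2019 dataset and 72.9 FPS for FP32 inference on a single V100. See [PPYOLO](configs/ppyolo/README.md) for details.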
+
+<div align="center">
+  <img src="docs/images/ppyolo_map_fps.png" width=600 />
+</div>
+
 ## Tutorials

 ### Getting Started
@@ -129,6 +135,7 @@
 - [Anchor-free models](configs/anchor_free/README.md)
 - [Face detection models](docs/featured_model/FACE_DETECTION.md)
 - [YOLOv3 enhanced model](docs/featured_model/YOLOv3_ENHANCEMENT.md): COCO mAP of 43.6%, compared with 33.0% in the original paper
+- [PPYOLO](configs/ppyolo/README.md): COCO mAP of 45.2% and 72.9 FPS on a single Tesla V100
 - [Pretrained models for pedestrian detection](docs/featured_model/CONTRIB_cn.md)
 - [Pretrained models for vehicle detection](docs/featured_model/CONTRIB_cn.md)
 - [Objects365 2019 Challenge champion model](docs/featured_model/champion_model/CACascadeRCNN.md)

--- a/README_en.md
+++ b/README_en.md
@@ -96,7 +96,7 @@ Advanced Features:
 The following figure shows the relationship between COCO mAP and FPS on a Tesla V100 for representative models of each architecture and backbone.

 <div align="center">
-  <img src="docs/images/map_fps.png" />
+  <img src="docs/images/map_fps.png" width=800 />
 </div>

 **NOTE:**
@@ -105,6 +105,12 @@ The following is the relationship between COCO mAP and FPS on Tesla V100 of repr
 - The enhanced `YOLOv3-ResNet50vd-DCN` scores 10.6 absolute percentage points higher COCO mAP than the paper, and its inference speed is nearly 70% faster than the darknet framework
 - All these models are available in the [Model Zoo](#Model-Zoo)

+The following figure shows COCO mAP versus FPS on a Tesla V100 for state-of-the-art object detectors and PPYOLO. PPYOLO is faster and more accurate than YOLOv4, reaching 45.2% mAP(0.5:0.95) on the COCO test2019 dataset and 72.9 FPS on a single Tesla V100. Please refer to [PPYOLO](configs/ppyolo/README.md) for details.
+
+<div align="center">
+  <img src="docs/images/ppyolo_map_fps.png" width=600 />
+</div>
+
 ## Tutorials


@@ -146,6 +152,7 @@ The following is the relationship between COCO mAP and FPS on Tesla V100 of repr
 - [Pretrained models for pedestrian detection](docs/featured_model/CONTRIB.md)
 - [Pretrained models for vehicle detection](docs/featured_model/CONTRIB.md)
 - [YOLOv3 enhanced model](docs/featured_model/YOLOv3_ENHANCEMENT.md): Compared with the 33.0% mAP reported in the paper, the enhanced YOLOv3 reaches 43.6% mAP, and inference speed is improved as well
+- [PPYOLO](configs/ppyolo/README.md): PPYOLO reaches 45.2% mAP on the COCO dataset and 72.9 FPS on a single Tesla V100
 - [Objects365 2019 Challenge champion model](docs/featured_model/champion_model/CACascadeRCNN.md)
 - [Best single model of Open Images 2019-Object Detection](docs/featured_model/champion_model/OIDV5_BASELINE_MODEL.md)
 - [Practical Server-side detection method](configs/rcnn_enhance/README_en.md): Inference speed on single V100 GPU can reach 20FPS when COCO mAP is 47.8%.

--- a/configs/ppyolo/README.md
+++ b/configs/ppyolo/README.md
+# PPYOLO Model
+
+## Contents
+- [Introduction](#introduction)
+- [Model Zoo](#model-zoo)
+- [Usage](#usage)
+- [Future Work](#future-work)
+- [Appendix](#appendix)
+
+## Introduction
+
+[PPYOLO](https://arxiv.org/abs/2007.12099) is a YOLOv3-based model optimized and improved by PaddleDetection. Its accuracy (COCO mAP) and inference speed both exceed those of [YOLOv4](https://arxiv.org/abs/2004.10934). It requires PaddlePaddle 1.8.4 (released in mid-August 2020) or a suitable [develop version](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/Tables.html#whl-dev).
+
+PPYOLO reaches 45.2% mAP on the [COCO](http://cocodataset.org) test2019 dataset. On a single V100, FP32 inference runs at 72.9 FPS, and FP16 inference with TensorRT enabled runs at 155.6 FPS.
+
+<div align="center">
+  <img src="../../docs/images/ppyolo_map_fps.png" width=500 />
+</div>
+
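+PPYOLO improves the accuracy and speed of YOLOv3 in the following ways:
+
+- A better backbone: ResNet50vd-DCN
+- A larger training batch size: 8 GPUs with batch_size=24 per GPU, with the learning rate and number of iterations adjusted accordingly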
+- [Drop Block](https://arxiv.org/abs/1810.12890)
+- [Exponential Moving Average](https://www.investopedia.com/terms/e/ema.asp)
+- [IoU Loss](https://arxiv.org/pdf/1902.09630.pdf)
+- [Grid Sensitive](https://arxiv.org/abs/2004.10934)
+- [Matrix NMS](https://arxiv.org/pdf/2003.10152.pdf)
+- [CoordConv](https://arxiv.org/abs/1807.03247)
+- [Spatial Pyramid Pooling](https://arxiv.org/abs/1406.4729)
+- A better pretrained model
+
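As a rough illustration of the Exponential Moving Average item in the list above, the sketch below keeps a smoothed copy of the weights and blends it with the current weights after each training step. It is a generic, framework-free sketch and not PaddleDetection's implementation: the class name `WeightEMA` and the dictionary-of-floats weight format are assumptions made for illustration. The default decay mirrors the `ema_decay: 0.9998` value in `configs/ppyolo/ppyolo.yml`.

```python
class WeightEMA(object):
    """Hypothetical helper: keeps an exponential moving average of weights."""

    def __init__(self, weights, decay=0.9998):
        self.decay = decay
        # Start the averaged copy from the initial weights.
        self.shadow = dict(weights)

    def update(self, weights):
        # shadow <- decay * shadow + (1 - decay) * current
        for name, value in weights.items():
            self.shadow[name] = (self.decay * self.shadow[name] +
                                 (1.0 - self.decay) * value)
        return self.shadow


weights = {"w": 1.0}
ema = WeightEMA(weights, decay=0.9)
weights["w"] = 2.0  # pretend an optimizer step changed the weight
averaged = ema.update(weights)
```

With a decay close to 1, the averaged weights change slowly, which is why they are usually the ones used for evaluation.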
+## Model Zoo
+
+|          Model           | GPUs | Images per GPU |  Backbone  | Input size | Box AP | V100 FP32 (FPS) | V100 TensorRT FP16 (FPS) | Download | Config |
+|:------------------------:|:----:|:--------------:|:----------:|:----------:|:------:|:---------------:|:------------------------:|:--------:|:------:|
+| YOLOv4(AlexeyAB)         |  -   |       -        | CSPDarknet |    608     |  43.5  |       62        |          105.5           | [download](https://paddlemodels.bj.bcebos.com/object_detection/yolov4_cspdarknet.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/yolov4/yolov4_csdarknet.yml) |
+| YOLOv4(AlexeyAB)         |  -   |       -        | CSPDarknet |    512     |  43.0  |       83        |          138.4           | [download](https://paddlemodels.bj.bcebos.com/object_detection/yolov4_cspdarknet.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/yolov4/yolov4_csdarknet.yml) |
+| YOLOv4(AlexeyAB)         |  -   |       -        | CSPDarknet |    416     |  41.2  |       96        |          164.0           | [download](https://paddlemodels.bj.bcebos.com/object_detection/yolov4_cspdarknet.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/yolov4/yolov4_csdarknet.yml) |
+| YOLOv4(AlexeyAB)         |  -   |       -        | CSPDarknet |    320     |  38.0  |       123       |          199.0           | [download](https://paddlemodels.bj.bcebos.com/object_detection/yolov4_cspdarknet.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/yolov4/yolov4_csdarknet.yml) |
+| PPYOLO                   |  8   |       24       | ResNet50vd |    608     |  45.2  |      72.9       |          155.6           | [download](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo.yml) |
+| PPYOLO                   |  8   |       24       | ResNet50vd |    512     |  44.4  |      89.9       |          188.4           | [download](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo.yml) |
+| PPYOLO                   |  8   |       24       | ResNet50vd |    416     |  42.5  |      109.1      |          215.4           | [download](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo.yml) |
+| PPYOLO                   |  8   |       24       | ResNet50vd |    320     |  39.3  |      132.2      |          242.2           | [download](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo.yml) |
+
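+**Notes:**
+
+- PPYOLO is trained on COCO train2017 and evaluated on test2019.
+- PPYOLO is trained on 8 GPUs with a batch size of 24 per GPU. If you use a different number of GPUs or a different batch size, adjust the learning rate and number of iterations as described in the [FAQ](../../docs/FAQ.md).
+- Inference speed is measured on a single V100 with batch size 1, using CUDA 10.2 and CUDNN 7.5.1. TensorRT speed is measured with TensorRT 5.1.2.2.
+- Inference speed figures are obtained by exporting the model with `tools/export_model.py` and then benchmarking it with the Paddle inference library through the `--run_benchmark` option of `deploy/python/infer.py`. All figures exclude data preprocessing and output post-processing (NMS), consistent with the method used by [YOLOv4(AlexeyAB)](https://github.com/AlexeyAB/darknet).
+- Compared with FP32, the TensorRT FP16 speed test also excludes the time spent in `yolo_box` (bbox decoding). That is, it excludes data preprocessing, bbox decoding and NMS, consistent with the method used by [YOLOv4(AlexeyAB)](https://github.com/AlexeyAB/darknet).
+- For YOLOv4(AlexeyAB), accuracy and V100 FP32 speed are the single-V100 figures published in the [YOLOv4 GitHub repository](https://github.com/AlexeyAB/darknet). The V100 TensorRT FP16 speed was measured on a single V100 using the tkDNN configuration from the AlexeyAB/darknet repository.
+- The download and config links in the YOLOv4(AlexeyAB) rows point to PaddleDetection's reimplementation of YOLOv4. Its evaluation accuracy has been aligned and fine-tuning is supported; training accuracy is still being aligned. See [PaddleDetection YOLOv4](../yolov4/README.md).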
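The note about changing the GPU count or batch size can be illustrated with the widely used linear scaling heuristic: scale the learning rate in proportion to the total batch size and scale the iteration counts inversely. Whether this matches the referenced FAQ exactly is an assumption, and the function `rescale_schedule` is hypothetical. The reference values (8 GPUs x 24 images, `base_lr` 0.00333, 500000 iterations, milestones 400000 and 450000) come from `configs/ppyolo/ppyolo.yml`.

```python
def rescale_schedule(base_lr, max_iters, milestones, ref_batch, new_batch):
    """Hypothetical helper applying the linear scaling heuristic."""
    ratio = float(new_batch) / ref_batch
    new_lr = base_lr * ratio                     # smaller batch -> smaller step
    new_iters = int(round(max_iters / ratio))    # ...but more iterations
    new_milestones = [int(round(m / ratio)) for m in milestones]
    return new_lr, new_iters, new_milestones


# Example: training on 4 GPUs instead of 8, still 24 images per GPU.
lr, iters, milestones = rescale_schedule(
    0.00333, 500000, [400000, 450000], ref_batch=8 * 24, new_batch=4 * 24)
```

Treat the result as a starting point; the FAQ linked in the note remains the reference for the exact procedure.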
+
+
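+## Usage
+
+### 1. Training
+
+Use the following command to start training on 8 GPUs (all commands below are assumed to run from the PaddleDetection root directory). The `--eval` option enables alternating evaluation during training.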
+
+```bash
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python tools/train.py -c configs/ppyolo/ppyolo.yml --eval
+```
+
+### 2. Evaluation
+
+Use the following commands to evaluate the model on a single GPU.
+
+```bash
+# Use the weights released by PaddleDetection
+CUDA_VISIBLE_DEVICES=0 python tools/eval.py -c configs/ppyolo/ppyolo.yml -o weights=https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams
+
+# Use a checkpoint saved during training
+CUDA_VISIBLE_DEVICES=0 python tools/eval.py -c configs/ppyolo/ppyolo.yml -o weights=output/ppyolo/best_model
+```
+
+### 3. Inference
+
+Use the following commands to run inference on a single GPU. Specify an image path with `--infer_img`, or specify a directory with `--infer_dir` to run inference on every image in it.
+
+```bash
+# Run inference on a single image
+CUDA_VISIBLE_DEVICES=0 python tools/infer.py -c configs/ppyolo/ppyolo.yml -o weights=https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams --infer_img=demo/000000014439_640x640.jpg
+
+# Run inference on all images in a directory
+CUDA_VISIBLE_DEVICES=0 python tools/infer.py -c configs/ppyolo/ppyolo.yml -o weights=https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams --infer_dir=demo
+```
+
+### 4. Deployment and benchmark
+
+To deploy PPYOLO or benchmark its inference speed, first export the model with `tools/export_model.py`, then run it with the Paddle inference library, using the following commands.
+
+```bash
+# Export the model; by default it is saved to output/ppyolo
+python tools/export_model.py -c configs/ppyolo/ppyolo.yml -o weights=https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams
+
+# Run inference with the Paddle inference library
+CUDA_VISIBLE_DEVICES=0 python deploy/python/infer.py --model_dir=output/ppyolo --image_file=demo/000000014439_640x640.jpg --use_gpu=True
+```
+
+The PPYOLO benchmark measures only the network itself, excluding data preprocessing and output post-processing (NMS). When exporting the model, pass `--exclude_nms` to strip the NMS post-processing from the model. Export and benchmark the model with the following commands.
+
+```bash
+# Export the model, using --exclude_nms to strip the NMS part; by default it is saved to output/ppyolo
+python tools/export_model.py -c configs/ppyolo/ppyolo.yml -o weights=https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams --exclude_nms
+
+# FP32 benchmark
+CUDA_VISIBLE_DEVICES=0 python deploy/python/infer.py --model_dir=output/ppyolo --image_file=demo/000000014439_640x640.jpg --use_gpu=True --run_benchmark=True
+
+# TensorRT FP16 benchmark
+CUDA_VISIBLE_DEVICES=0 python deploy/python/infer.py --model_dir=output/ppyolo --image_file=demo/000000014439_640x640.jpg --use_gpu=True --run_benchmark=True --run_mode=trt_fp16
+```
+
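+## Future Work
+
+1. Release the PPYOLO-tiny model
+2. Release PPYOLO and PPYOLO-tiny models with more backbones
+
+## Appendix
+
+The table below lists ablation results for the optimizations PPYOLO applies on top of YOLOv3.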
+
+| No.  |        Model                 | Box AP | Params (M) | FLOPs (G) | V100 FP32 FPS |
+| :--: | :--------------------------- | :----: | :-------: | :------: | :-----------: |
+|  A   | YOLOv3-DarkNet53             |  38.9  |   59.13   |  65.52   |      58.2     |
+|  B   | YOLOv3-ResNet50vd-DCN        |  39.1  |   43.89   |  44.71   |      79.2     |
+|  C   | B + LB + EMA + DropBlock     |  41.4  |   43.89   |  44.71   |      79.2     |
+|  D   | C + IoU Loss                 |  41.9  |   43.89   |  44.71   |      79.2     |
+|  E   | D + IoU Aware                |  42.5  |   43.90   |  44.71   |      74.9     |
+|  F   | E + Grid Sensitive           |  42.8  |   43.90   |  44.71   |      74.8     |
+|  G   | F + Matrix NMS               |  43.5  |   43.90   |  44.71   |      74.8     |
+|  H   | G + CoordConv                |  44.0  |   43.93   |  44.76   |      74.1     |
+|  I   | H + SPP                      |  44.3  |   44.93   |  45.12   |      72.9     |
+|  J   | I + Better ImageNet Pretrain |  44.6  |   44.93   |  45.12   |      72.9     |
+
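+**Notes:**
+
+- Accuracy and inference speed are measured with an input image size of 608
+- Box AP is obtained by training on COCO train2017 and evaluating on val2017
+- Inference speed is measured on a single V100 with batch size 1 using the benchmark method described above, with CUDA 10.2 and CUDNN 7.5.1
+- The 38.9 accuracy of [YOLOv3-DarkNet53](../yolov3_darknet.yml) is for PaddleDetection's optimized YOLOv3 model; see the [Model Zoo](../../docs/MODEL_ZOO_cn.md)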
--- a/configs/ppyolo/ppyolo.yml
+++ b/configs/ppyolo/ppyolo.yml
+architecture: YOLOv3
+use_gpu: true
+max_iters: 500000
+log_smooth_window: 100
+log_iter: 100
+save_dir: output
+snapshot_iter: 10000
+metric: COCO
+pretrain_weights: https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_ssld_pretrained.tar
+weights: output/ppyolo/model_final
+num_classes: 80
+use_fine_grained_loss: true
+use_ema: true
+ema_decay: 0.9998
+
+YOLOv3:
+  backbone: ResNet
+  yolo_head: YOLOv3Head
+  use_fine_grained_loss: true
+
+ResNet:
+  norm_type: sync_bn
+  freeze_at: 0
+  freeze_norm: false
+  norm_decay: 0.
+  depth: 50
+  feature_maps: [3, 4, 5]
+  variant: d
+  dcn_v2_stages: [5]
+
+YOLOv3Head:
+  anchor_masks: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
+  anchors: [[10, 13], [16, 30], [33, 23],
+            [30, 61], [62, 45], [59, 119],
+            [116, 90], [156, 198], [373, 326]]
+  norm_decay: 0.
+  coord_conv: true
+  iou_aware: true
+  iou_aware_factor: 0.4
+  scale_x_y: 1.05
+  spp: true
+  yolo_loss: YOLOv3Loss
+  nms: MatrixNMS
+  drop_block: true
+
+YOLOv3Loss:
+  batch_size: 24
+  ignore_thresh: 0.7
+  scale_x_y: 1.05
+  label_smooth: false
+  use_fine_grained_loss: true
+  iou_loss: IouLoss
+  iou_aware_loss: IouAwareLoss
+
+IouLoss:
+  loss_weight: 2.5
+  max_height: 608
+  max_width: 608
+
+IouAwareLoss:
+  loss_weight: 1.0
+  max_height: 608
+  max_width: 608
+
+MatrixNMS:
+    background_label: -1
+    keep_top_k: 100
+    normalized: false
+    score_threshold: 0.01
+    post_threshold: 0.01
+
+LearningRate:
+  base_lr: 0.00333
+  schedulers:
+  - !PiecewiseDecay
+    gamma: 0.1
+    milestones:
+    - 400000
+    - 450000
+  - !LinearWarmup
+    start_factor: 0.
+    steps: 4000
+
+OptimizerBuilder:
+  optimizer:
+    momentum: 0.9
+    type: Momentum
+  regularizer:
+    factor: 0.0005
+    type: L2
+
+_READER_: 'ppyolo_reader.yml'
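The `LearningRate` block above combines a linear warmup with a piecewise decay. The sketch below shows one plausible reading of how the two schedulers combine into a per-step learning rate. The function `learning_rate_at` is hypothetical and the exact boundary handling inside PaddleDetection is an assumption; the default values are copied from `ppyolo.yml`.

```python
def learning_rate_at(step,
                     base_lr=0.00333,
                     milestones=(400000, 450000),
                     gamma=0.1,
                     warmup_steps=4000,
                     start_factor=0.0):
    """Hypothetical helper: learning rate at a given training step."""
    if step < warmup_steps:
        # Linear warmup from base_lr * start_factor up to base_lr.
        start_lr = base_lr * start_factor
        return start_lr + (base_lr - start_lr) * step / float(warmup_steps)
    # Piecewise decay: multiply by gamma once per milestone already reached.
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * gamma ** passed


schedule = [learning_rate_at(s) for s in (0, 2000, 4000, 400000, 450000)]
```

Under these assumptions the rate ramps from 0 to 0.00333 over the first 4000 steps, then drops by a factor of 10 at each milestone.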
--- a/configs/ppyolo/ppyolo_lb.yml
+++ b/configs/ppyolo/ppyolo_lb.yml
+architecture: YOLOv3
+use_gpu: true
+max_iters: 250000
+log_smooth_window: 100
+log_iter: 100
+save_dir: output
+snapshot_iter: 10000
+metric: COCO
+pretrain_weights: https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_ssld_pretrained.tar
+weights: output/ppyolo_lb/model_final
+num_classes: 80
+use_fine_grained_loss: true
+use_ema: true
+ema_decay: 0.9998
+
+YOLOv3:
+  backbone: ResNet
+  yolo_head: YOLOv3Head
+  use_fine_grained_loss: true
+
+ResNet:
+  norm_type: sync_bn
+  freeze_at: 0
+  freeze_norm: false
+  norm_decay: 0.
+  depth: 50
+  feature_maps: [3, 4, 5]
+  variant: d
+  dcn_v2_stages: [5]
+
+YOLOv3Head:
+  anchor_masks: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
+  anchors: [[10, 13], [16, 30], [33, 23],
+            [30, 61], [62, 45], [59, 119],
+            [116, 90], [156, 198], [373, 326]]
+  norm_decay: 0.
+  coord_conv: true
+  iou_aware: true
+  iou_aware_factor: 0.4
+  scale_x_y: 1.05
+  spp: true
+  yolo_loss: YOLOv3Loss
+  nms: MatrixNMS
+  drop_block: true
+
+YOLOv3Loss:
+  batch_size: 24
+  ignore_thresh: 0.7
+  scale_x_y: 1.05
+  label_smooth: false
+  use_fine_grained_loss: true
+  iou_loss: IouLoss
+  iou_aware_loss: IouAwareLoss
+
+IouLoss:
+  loss_weight: 2.5
+  max_height: 608
+  max_width: 608
+
+IouAwareLoss:
+  loss_weight: 1.0
+  max_height: 608
+  max_width: 608
+
+MatrixNMS:
+    background_label: -1
+    keep_top_k: 100
+    normalized: false
+    score_threshold: 0.01
+    post_threshold: 0.01
+
+LearningRate:
+  base_lr: 0.01
+  schedulers:
+  - !PiecewiseDecay
+    gamma: 0.1
+    milestones:
+    - 150000
+    - 200000
+  - !LinearWarmup
+    start_factor: 0.
+    steps: 4000
+
+OptimizerBuilder:
+  optimizer:
+    momentum: 0.9
+    type: Momentum
+  regularizer:
+    factor: 0.0005
+    type: L2
+
+_READER_: 'ppyolo_reader_lb.yml'
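Both PPYOLO configs select `MatrixNMS` with `score_threshold`, `post_threshold` and `keep_top_k`. The sketch below is a minimal single-class version of Matrix NMS using the linear decay variant from the Matrix NMS paper, written in plain Python for readability. It is not the Paddle operator: the function names, the `eps` guard against division by zero, and the strict `>` threshold comparisons are assumptions made for illustration.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def matrix_nms(boxes, scores, score_threshold=0.01, post_threshold=0.01,
               keep_top_k=100, eps=1e-9):
    """Single-class Matrix NMS sketch; returns [(box_index, decayed_score)]."""
    # Drop low-score candidates, then sort the rest by descending score.
    order = sorted((i for i, s in enumerate(scores) if s > score_threshold),
                   key=lambda i: scores[i], reverse=True)
    n = len(order)
    # iou[i][j] for i < j: overlap of higher-scoring box i with lower-scoring j.
    iou = [[box_iou(boxes[order[i]], boxes[order[j]]) if i < j else 0.0
            for j in range(n)] for i in range(n)]
    # compensate[i]: largest overlap of box i with any box that outranks it.
    compensate = [max([iou[k][i] for k in range(i)] or [0.0])
                  for i in range(n)]
    kept = []
    for j in range(n):
        # Strongest suppression that any higher-scoring box applies to box j.
        factors = [(1.0 - iou[i][j]) / max(1.0 - compensate[i], eps)
                   for i in range(j)]
        decay = min(factors) if factors else 1.0
        new_score = scores[order[j]] * decay
        if new_score > post_threshold:
            kept.append((order[j], new_score))
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:keep_top_k]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
result = matrix_nms(boxes, scores)
```

Unlike greedy NMS, no box is removed outright: overlapping boxes have their scores decayed, and `post_threshold` then decides which ones survive.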
--- a/configs/ppyolo/ppyolo_reader.yml
+++ b/configs/ppyolo/ppyolo_reader.yml
+TrainReader:
+  inputs_def:
+    fields: ['image', 'gt_bbox', 'gt_class', 'gt_score']
+    num_max_boxes: 50
+  dataset:
+    !COCODataSet
+      image_dir: train2017
+      anno_path: annotations/instances_train2017.json
+      dataset_dir: dataset/coco
+      with_background: false
+  sample_transforms:
+    - !DecodeImage
+      to_rgb: True
+      with_mixup: True
+    - !MixupImage
+      alpha: 1.5
+      beta: 1.5
+    - !ColorDistort {}
+    - !RandomExpand
+      fill_value: [123.675, 116.28, 103.53]
+    - !RandomCrop {}
+    - !RandomFlipImage
+      is_normalized: false
+    - !NormalizeBox {}
+    - !PadBox
+      num_max_boxes: 50
+    - !BboxXYXY2XYWH {}
+  batch_transforms:
+  - !RandomShape
+    sizes: [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
+    random_inter: True
+  - !NormalizeImage
+    mean: [0.485, 0.456, 0.406]
+    std: [0.229, 0.224, 0.225]
+    is_scale: True
+    is_channel_first: false
+  - !Permute
+    to_bgr: false
+    channel_first: True
+  # Gt2YoloTarget is only used when use_fine_grained_loss is set to true;
+  # this operator is removed automatically if use_fine_grained_loss
+  # is set to false
+  - !Gt2YoloTarget
+    anchor_masks: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
+    anchors: [[10, 13], [16, 30], [33, 23],
+              [30, 61], [62, 45], [59, 119],
+              [116, 90], [156, 198], [373, 326]]
+    downsample_ratios: [32, 16, 8]
+  batch_size: 24
+  shuffle: true
+  mixup_epoch: 25000
+  drop_last: true
+  worker_num: 8
+  bufsize: 4
+  use_process: true
+
+EvalReader:
+  inputs_def:
+    fields: ['image', 'im_size', 'im_id']
+    num_max_boxes: 50
+  dataset:
+    !COCODataSet
+      image_dir: val2017
+      anno_path: annotations/instances_val2017.json
+      dataset_dir: dataset/coco
+      with_background: false
+  sample_transforms:
+    - !DecodeImage
+      to_rgb: True
+    - !ResizeImage
+      target_size: 608
+      interp: 2
+    - !NormalizeImage
+      mean: [0.485, 0.456, 0.406]
+      std: [0.229, 0.224, 0.225]
+      is_scale: True
+      is_channel_first: false
+    - !PadBox
+      num_max_boxes: 50
+    - !Permute
+      to_bgr: false
+      channel_first: True
+  batch_size: 8
+  drop_empty: false
+  worker_num: 8
+  bufsize: 4
+
+TestReader:
+  inputs_def:
+    image_shape: [3, 608, 608]
+    fields: ['image', 'im_size', 'im_id']
+  dataset:
+    !ImageFolder
+      anno_path: annotations/instances_val2017.json
+      with_background: false
+  sample_transforms:
+    - !DecodeImage
+      to_rgb: True
+    - !ResizeImage
+      target_size: 608
+      interp: 2
+    - !NormalizeImage
+      mean: [0.485, 0.456, 0.406]
+      std: [0.229, 0.224, 0.225]
+      is_scale: True
+      is_channel_first: false
+    - !Permute
+      to_bgr: false
+      channel_first: True
+  batch_size: 1
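The training reader above enables `MixupImage` with `alpha: 1.5` and `beta: 1.5` and carries a `gt_score` field. A minimal sketch of the idea follows: a blend factor is drawn from a Beta distribution, the two images are blended with it, and each image's box scores are weighted by its share. The function `mixup` and the flat-list image format are assumptions for illustration; the actual operator also has to handle images of different sizes and merge the boxes of both images.

```python
import random


def mixup(image_a, image_b, scores_a, scores_b, alpha=1.5, beta=1.5,
          rng=random):
    """Hypothetical helper: blend two equally sized images (flat pixel lists)."""
    factor = rng.betavariate(alpha, beta)  # blend weight in [0, 1]
    mixed = [factor * pa + (1.0 - factor) * pb
             for pa, pb in zip(image_a, image_b)]
    # Boxes from both images are kept; their scores reflect the blend weights.
    gt_score = ([s * factor for s in scores_a] +
                [s * (1.0 - factor) for s in scores_b])
    return mixed, gt_score, factor


rng = random.Random(0)  # seeded only to make the example repeatable
mixed, gt_score, factor = mixup([0.0, 100.0], [200.0, 100.0],
                                [1.0], [1.0, 1.0], rng=rng)
```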
--- a/configs/ppyolo/ppyolo_reader_lb.yml
+++ b/configs/ppyolo/ppyolo_reader_lb.yml
+TrainReader:
+  inputs_def:
+    fields: ['image', 'gt_bbox', 'gt_class', 'gt_score']
+    num_max_boxes: 50
+  dataset:
+    !COCODataSet
+      image_dir: train2017
+      anno_path: annotations/instances_train2017.json
+      dataset_dir: dataset/coco
+      with_background: false
+  sample_transforms:
+    - !DecodeImage
+      to_rgb: True
+      with_mixup: True
+    - !MixupImage
+      alpha: 1.5
+      beta: 1.5
+    - !ColorDistort {}
+    - !RandomExpand
+      fill_value: [123.675, 116.28, 103.53]
+    - !RandomCrop {}
+    - !RandomFlipImage
+      is_normalized: false
+    - !NormalizeBox {}
+    - !PadBox
+      num_max_boxes: 50
+    - !BboxXYXY2XYWH {}
+  batch_transforms:
+  - !RandomShape
+    sizes: [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
+    random_inter: True
+  - !NormalizeImage
+    mean: [0.485, 0.456, 0.406]
+    std: [0.229, 0.224, 0.225]
+    is_scale: True
+    is_channel_first: false
+  - !Permute
+    to_bgr: false
+    channel_first: True
+  # Gt2YoloTarget is only used when use_fine_grained_loss is set to true;
+  # this operator is removed automatically if use_fine_grained_loss
+  # is set to false
+  - !Gt2YoloTarget
+    anchor_masks: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
+    anchors: [[10, 13], [16, 30], [33, 23],
+              [30, 61], [62, 45], [59, 119],
+              [116, 90], [156, 198], [373, 326]]
+    downsample_ratios: [32, 16, 8]
+  batch_size: 24
+  shuffle: true
+  mixup_epoch: 25000
+  drop_last: true
+  worker_num: 8
+  bufsize: 4
+  use_process: true
+
+EvalReader:
+  inputs_def:
+    fields: ['image', 'im_size', 'im_id']
+    num_max_boxes: 50
+  dataset:
+    !COCODataSet
+      image_dir: val2017
+      anno_path: annotations/instances_val2017.json
+      dataset_dir: dataset/coco
+      with_background: false
+  sample_transforms:
+    - !DecodeImage
+      to_rgb: True
+    - !ResizeImage
+      target_size: 608
+      interp: 2
+    - !NormalizeImage
+      mean: [0.485, 0.456, 0.406]
+      std: [0.229, 0.224, 0.225]
+      is_scale: True
+      is_channel_first: false
+    - !PadBox
+      num_max_boxes: 50
+    - !Permute
+      to_bgr: false
+      channel_first: True
+  batch_size: 8
+  drop_empty: false
+  worker_num: 8
+  bufsize: 4
+
+TestReader:
+  inputs_def:
+    image_shape: [3, 608, 608]
+    fields: ['image', 'im_size', 'im_id']
+  dataset:
+    !ImageFolder
+      anno_path: annotations/instances_val2017.json
+      with_background: false
+  sample_transforms:
+    - !DecodeImage
+      to_rgb: True
+    - !ResizeImage
+      target_size: 608
+      interp: 2
+    - !NormalizeImage
+      mean: [0.485, 0.456, 0.406]
+      std: [0.229, 0.224, 0.225]
+      is_scale: True
+      is_channel_first: false
+    - !Permute
+      to_bgr: false
+      channel_first: True
+  batch_size: 1
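Every reader in these configs applies `NormalizeImage` (with `is_scale: True`) followed by `Permute` (with `channel_first: True`). The sketch below shows what that pair of steps amounts to for a small image stored as nested lists: scale pixels to [0, 1], subtract the per-channel mean, divide by the per-channel standard deviation, and reorder from height x width x channel to channel x height x width. The function name is hypothetical and the nested-list format is chosen only to keep the example dependency-free.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_and_permute(image_hwc, mean=MEAN, std=STD):
    """Hypothetical helper: HWC uint8-style pixels -> normalized CHW floats."""
    height, width = len(image_hwc), len(image_hwc[0])
    return [[[(image_hwc[y][x][c] / 255.0 - mean[c]) / std[c]
              for x in range(width)]
             for y in range(height)]
            for c in range(len(mean))]


image = [[[255, 0, 128], [0, 255, 64]]]  # 1 x 2 pixels, 3 channels
chw = normalize_and_permute(image)
```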
--- a/configs/ppyolo/ppyolo_tiny.yml
+++ b/configs/ppyolo/ppyolo_tiny.yml
+architecture: YOLOv3
+use_gpu: true
+max_iters: 250000
+log_smooth_window: 20
+log_iter: 20
+save_dir: output
+snapshot_iter: 10000
+metric: COCO
+pretrain_weights: https://paddle-imagenet-models-name.bj.bcebos.com/ResNet18_vd_pretrained.tar
+weights: output/ppyolo_tiny/model_final
+num_classes: 80
+use_fine_grained_loss: true
+use_ema: true
+ema_decay: 0.9998
+
+YOLOv3:
+  backbone: ResNet
+  yolo_head: YOLOv3Head
+  use_fine_grained_loss: true
+
+ResNet:
+  norm_type: sync_bn
+  freeze_at: 0
+  freeze_norm: false
+  norm_decay: 0.
+  depth: 18
+  feature_maps: [4, 5]
+  variant: d
+
+YOLOv3Head:
+  anchor_masks: [[3, 4, 5], [0, 1, 2]]
+  anchors: [[10, 14], [23, 27], [37, 58],
+            [81, 82], [135, 169], [344, 319]]
+  norm_decay: 0.
+  conv_block_num: 0
+  iou_aware: true
+  iou_aware_factor: 0.4
+  scale_x_y: 1.05
+  yolo_loss: YOLOv3Loss
+  nms: MatrixNMS
+  drop_block: true
+
+YOLOv3Loss:
+  batch_size: 32
+  ignore_thresh: 0.7
+  scale_x_y: 1.05
+  label_smooth: false
+  use_fine_grained_loss: true
+  iou_loss: IouLoss
+  iou_aware_loss: IouAwareLoss
+
+IouLoss:
+  loss_weight: 2.5
+  max_height: 608
+  max_width: 608
+
+IouAwareLoss:
+  loss_weight: 1.0
+  max_height: 608
+  max_width: 608
+
+MatrixNMS:
+    background_label: -1
+    keep_top_k: 100
+    normalized: false
+    score_threshold: 0.01
+    post_threshold: 0.01
+
+LearningRate:
+  base_lr: 0.004
+  schedulers:
+  - !PiecewiseDecay
+    gamma: 0.1
+    milestones:
+    - 150000
+    - 200000
+  - !LinearWarmup
+    start_factor: 0.
+    steps: 4000
+
+OptimizerBuilder:
+  optimizer:
+    momentum: 0.9
+    type: Momentum
+  regularizer:
+    factor: 0.0005
+    type: L2
+
+_READER_: 'ppyolo_reader.yml'
+TrainReader:
+  inputs_def:
+    fields: ['image', 'gt_bbox', 'gt_class', 'gt_score']
+    num_max_boxes: 50
+  dataset:
+    !COCODataSet
+      image_dir: train2017
+      anno_path: annotations/instances_train2017.json
+      dataset_dir: train_data/dataset/coco
+      with_background: false
+  sample_transforms:
+    - !DecodeImage
+      to_rgb: True
+      with_mixup: True
+    - !MixupImage
+      alpha: 1.5
+      beta: 1.5
+    - !ColorDistort {}
+    - !RandomExpand
+      fill_value: [123.675, 116.28, 103.53]
+    - !RandomCrop {}
+    - !RandomFlipImage
+      is_normalized: false
+    - !NormalizeBox {}
+    - !PadBox
+      num_max_boxes: 50
+    - !BboxXYXY2XYWH {}
+  batch_transforms:
+  - !RandomShape
+    sizes: [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
+    random_inter: True
+  - !NormalizeImage
+    mean: [0.485, 0.456, 0.406]
+    std: [0.229, 0.224, 0.225]
+    is_scale: True
+    is_channel_first: false
+  - !Permute
+    to_bgr: false
+    channel_first: True
+  # Gt2YoloTarget is only used when use_fine_grained_loss is set to true;
+  # this operator is removed automatically if use_fine_grained_loss
+  # is set to false
+  - !Gt2YoloTarget
+    anchor_masks: [[3, 4, 5], [0, 1, 2]]
+    anchors: [[10, 14], [23, 27], [37, 58],
+              [81, 82], [135, 169], [344, 319]]
+    downsample_ratios: [32, 16]
+  batch_size: 32
+  shuffle: true
+  mixup_epoch: 500
+  drop_last: true
+  worker_num: 16
+  bufsize: 8
+  use_process: true
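The tiny config uses two detection heads instead of three. The sketch below works out, for a given input size, which anchors and which grid resolution each head gets from `anchor_masks`, `anchors` and `downsample_ratios`. The function `head_layout` is a hypothetical helper, and pairing masks with ratios by position follows the usual YOLOv3 convention, which is assumed here.

```python
def head_layout(anchors, anchor_masks, downsample_ratios, input_size):
    """Hypothetical helper: per-head stride, grid size and anchor boxes."""
    layout = []
    for mask, stride in zip(anchor_masks, downsample_ratios):
        layout.append({
            "stride": stride,
            "grid": input_size // stride,          # cells per side
            "anchors": [anchors[i] for i in mask],  # anchors used by this head
        })
    return layout


tiny_anchors = [[10, 14], [23, 27], [37, 58],
                [81, 82], [135, 169], [344, 319]]
layout = head_layout(tiny_anchors, [[3, 4, 5], [0, 1, 2]], [32, 16], 608)
```

For a 608 input, the stride-32 head predicts on a 19 x 19 grid with the three largest anchors, and the stride-16 head on a 38 x 38 grid with the three smallest.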
--- a/deploy/python/infer.py
+++ b/deploy/python/infer.py
@@ -466,7 +466,12 @@ class Detector():
            results['masks'] = np_masks
        return results

-    def predict(self, image, threshold=0.5, warmup=0, repeats=1):
+    def predict(self,
+                image,
+                threshold=0.5,
+                warmup=0,
+                repeats=1,
+                run_benchmark=False):
        '''
        Args:
            image (str/np.ndarray): path of image/ np.ndarray read by cv2
@@ -500,7 +505,7 @@ class Detector():
                np_masks = np.array(outs[1])
        else:
            input_names = self.predictor.get_input_names()
-            for i in range(len(inputs)):
+            for i in range(len(input_names)):
                input_tensor = self.predictor.get_input_tensor(input_names[i])
                input_tensor.copy_from_cpu(inputs[input_names[i]])

@@ -528,12 +533,15 @@ class Detector():
            ms = (t2 - t1) * 1000.0 / repeats
            print("Inference: {} ms per batch image".format(ms))

-        if reduce(lambda x, y: x * y, np_boxes.shape) < 6:
-            print('[WARNNING] No object detected.')
-            results = {'boxes': np.array([])}
-        else:
-            results = self.postprocess(
-                np_boxes, np_masks, im_info, threshold=threshold)
+        # do not perform postprocess in benchmark mode
+        results = []
+        if not run_benchmark:
+            if reduce(lambda x, y: x * y, np_boxes.shape) < 6:
+                print('[WARNING] No object detected.')
+                results = {'boxes': np.array([])}
+            else:
+                results = self.postprocess(
+                    np_boxes, np_masks, im_info, threshold=threshold)

        return results

@@ -543,7 +551,11 @@ def predict_image():
        FLAGS.model_dir, use_gpu=FLAGS.use_gpu, run_mode=FLAGS.run_mode)
    if FLAGS.run_benchmark:
        detector.predict(
-            FLAGS.image_file, FLAGS.threshold, warmup=100, repeats=100)
+            FLAGS.image_file,
+            FLAGS.threshold,
+            warmup=100,
+            repeats=100,
+            run_benchmark=True)
    else:
        results = detector.predict(FLAGS.image_file, FLAGS.threshold)
        visualize(
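The benchmark path in this diff calls `predict` with `warmup=100` and `repeats=100` and reports `(t2 - t1) * 1000.0 / repeats`. A minimal sketch of that timing pattern, detached from the detector, is shown below. The function `benchmark` and its `run_once` callback are hypothetical names, and `time.perf_counter` is used here as a convenient timer rather than as a claim about which timer `infer.py` uses.

```python
import time


def benchmark(run_once, warmup=100, repeats=100):
    """Hypothetical helper: average latency of run_once in milliseconds."""
    # Warmup runs are executed but not timed, so one-off setup costs
    # do not distort the measurement.
    for _ in range(warmup):
        run_once()
    t1 = time.perf_counter()
    for _ in range(repeats):
        run_once()
    t2 = time.perf_counter()
    return (t2 - t1) * 1000.0 / repeats


calls = []
ms = benchmark(lambda: calls.append(sum(range(1000))), warmup=10, repeats=20)
```

Skipping post-processing in benchmark mode, as the diff does, keeps the timed region limited to the network forward pass.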

--- a/docs/images/ppyolo_map_fps.png
+++ b/docs/images/ppyolo_map_fps.png
--- a/ppdet/modeling/anchor_heads/yolo_head.py
+++ b/ppdet/modeling/anchor_heads/yolo_head.py
--- a/ppdet/modeling/architectures/blazeface.py
+++ b/ppdet/modeling/architectures/blazeface.py
@@ -251,7 +251,9 @@ class BlazeFace(object):
    def eval(self, feed_vars):
        return self.build(feed_vars, 'eval')

-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')

    def is_bbox_normalized(self):

--- a/ppdet/modeling/architectures/cascade_mask_rcnn.py
+++ b/ppdet/modeling/architectures/cascade_mask_rcnn.py
@@ -434,5 +434,7 @@ class CascadeMaskRCNN(object):
            return self.build_multi_scale(feed_vars, mask_branch)
        return self.build(feed_vars, 'test')

-    def test(self, feed_vars):
-        return self.build(feed_vars, 'test')
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
+        return self.build(feed_vars, 'test', exclude_nms=exclude_nms)
--- a/ppdet/modeling/architectures/cascade_rcnn.py
+++ b/ppdet/modeling/architectures/cascade_rcnn.py
@@ -331,5 +331,7 @@ class CascadeRCNN(object):
            return self.build_multi_scale(feed_vars)
        return self.build(feed_vars, 'test')

-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
--- a/ppdet/modeling/architectures/cascade_rcnn_cls_aware.py
+++ b/ppdet/modeling/architectures/cascade_rcnn_cls_aware.py
@@ -320,4 +320,6 @@ class CascadeRCNNClsAware(object):
        return self.build(feed_vars, 'test')

-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
--- a/ppdet/modeling/architectures/cornernet_squeeze.py
+++ b/ppdet/modeling/architectures/cornernet_squeeze.py
@@ -136,5 +136,7 @@ class CornerNetSqueeze(object):
    def eval(self, feed_vars):
        return self.build(feed_vars, mode='test')

-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, mode='test')
--- a/ppdet/modeling/architectures/efficientdet.py
+++ b/ppdet/modeling/architectures/efficientdet.py
@@ -146,5 +146,7 @@ class EfficientDet(object):
    def eval(self, feed_vars):
        return self.build(feed_vars, 'test')

-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
--- a/ppdet/modeling/architectures/faceboxes.py
+++ b/ppdet/modeling/architectures/faceboxes.py
@@ -183,7 +183,9 @@ class FaceBoxes(object):
    def eval(self, feed_vars):
        return self.build(feed_vars, 'eval')

-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')

    def is_bbox_normalized(self):

--- a/ppdet/modeling/architectures/faster_rcnn.py
+++ b/ppdet/modeling/architectures/faster_rcnn.py
@@ -244,5 +244,7 @@ class FasterRCNN(object):
            return self.build_multi_scale(feed_vars)
        return self.build(feed_vars, 'test')

-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
--- a/ppdet/modeling/architectures/fcos.py
+++ b/ppdet/modeling/architectures/fcos.py
@@ -179,5 +179,7 @@ class FCOS(object):
    def eval(self, feed_vars):
        return self.build(feed_vars, 'test')

-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
--- a/ppdet/modeling/architectures/mask_rcnn.py
+++ b/ppdet/modeling/architectures/mask_rcnn.py
@@ -337,5 +337,7 @@ class MaskRCNN(object):
            return self.build_multi_scale(feed_vars, mask_branch)
        return self.build(feed_vars, 'test')

-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
--- a/ppdet/modeling/architectures/retinanet.py
+++ b/ppdet/modeling/architectures/retinanet.py
@@ -125,5 +125,7 @@ class RetinaNet(object):
    def eval(self, feed_vars):
        return self.build(feed_vars, 'test')

-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
--- a/ppdet/modeling/architectures/ssd.py
+++ b/ppdet/modeling/architectures/ssd.py
@@ -134,7 +134,9 @@ class SSD(object):
    def eval(self, feed_vars):
        return self.build(feed_vars, 'eval')

-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')

    def is_bbox_normalized(self):

--- a/ppdet/modeling/architectures/yolo.py
+++ b/ppdet/modeling/architectures/yolo.py
@@ -49,7 +49,7 @@ class YOLOv3(object):
        self.yolo_head = yolo_head
        self.use_fine_grained_loss = use_fine_grained_loss

-    def build(self, feed_vars, mode='train'):
+    def build(self, feed_vars, mode='train', exclude_nms=False):
        im = feed_vars['image']

        mixed_precision_enabled = mixed_precision_global_state() is not None
@@ -74,9 +74,9 @@ class YOLOv3(object):
            gt_score = feed_vars['gt_score']

            # Get targets for split yolo loss calculation
-            # YOLOv3 supports up to 3 output layers currently
+            num_output_layer = len(self.yolo_head.anchor_masks)
            targets = []
-            for i in range(3):
+            for i in range(num_output_layer):
                k = 'target{}'.format(i)
                if k in feed_vars:
                    targets.append(feed_vars[k])
@@ -88,7 +88,9 @@ class YOLOv3(object):
            return loss
        else:
            im_size = feed_vars['im_size']
-            return self.yolo_head.get_prediction(body_feats, im_size)
+            # exclude_nms is for benchmarking only: postprocessing (NMS) is skipped
+            return self.yolo_head.get_prediction(
+                body_feats, im_size, exclude_nms=exclude_nms)

    def _inputs_def(self, image_shape, num_max_boxes):
        im_shape = [None] + image_shape
@@ -106,11 +108,10 @@ class YOLOv3(object):

        if self.use_fine_grained_loss:
            # yapf: disable
-            targets_def = {
-                'target0':  {'shape': [None, 3, 86, 19, 19],  'dtype': 'float32',   'lod_level': 0},
-                'target1':  {'shape': [None, 3, 86, 38, 38],  'dtype': 'float32',   'lod_level': 0},
-                'target2':  {'shape': [None, 3, 86, 76, 76],  'dtype': 'float32',   'lod_level': 0},
-            }
+            num_output_layer = len(self.yolo_head.anchor_masks)
+            targets_def = {}
+            for i in range(num_output_layer):
+                targets_def['target{}'.format(i)] = {'shape': [None, 3, None, None, None],  'dtype': 'float32',   'lod_level': 0}
            # yapf: enable

            downsample = 32
@@ -139,7 +140,9 @@ class YOLOv3(object):
        # will be disabled because the YOLOv3 architecture does not calculate
        # loss in eval/infer mode.
        if 'im_size' not in fields and self.use_fine_grained_loss:
-            fields.extend(['target0', 'target1', 'target2'])
+            num_output_layer = len(self.yolo_head.anchor_masks)
+            fields.extend(
+                ['target{}'.format(i) for i in range(num_output_layer)])
        feed_vars = OrderedDict([(key, fluid.data(
            name=key,
            shape=inputs_def[key]['shape'],
@@ -158,8 +161,8 @@ class YOLOv3(object):
    def eval(self, feed_vars):
        return self.build(feed_vars, mode='test')

-    def test(self, feed_vars):
-        return self.build(feed_vars, mode='test')
+    def test(self, feed_vars, exclude_nms=False):
+        return self.build(feed_vars, mode='test', exclude_nms=exclude_nms)


 @register

--- a/ppdet/modeling/losses/iou_aware_loss.py
+++ b/ppdet/modeling/losses/iou_aware_loss.py
@@ -54,6 +54,7 @@ class IouAwareLoss(IouLoss):
                 anchors,
                 downsample_ratio,
                 batch_size,
+                 scale_x_y,
                 eps=1.e-10):
        '''
        Args:
@@ -67,9 +68,9 @@ class IouAwareLoss(IouLoss):
        '''

        pred = self._bbox_transform(x, y, w, h, anchors, downsample_ratio,
-                                    batch_size, False)
+                                    batch_size, False, scale_x_y, eps)
        gt = self._bbox_transform(tx, ty, tw, th, anchors, downsample_ratio,
-                                  batch_size, True)
+                                  batch_size, True, scale_x_y, eps)
        iouk = self._iou(pred, gt, ioup, eps)
        iouk.stop_gradient = True


--- a/ppdet/modeling/losses/iou_loss.py
+++ b/ppdet/modeling/losses/iou_loss.py
@@ -63,6 +63,7 @@ class IouLoss(object):
                 anchors,
                 downsample_ratio,
                 batch_size,
+                 scale_x_y=1.,
                 ioup=None,
                 eps=1.e-10):
        '''
@@ -75,9 +76,9 @@ class IouLoss(object):
            eps (float): a small constant that prevents the denominator from being zero
        '''
        pred = self._bbox_transform(x, y, w, h, anchors, downsample_ratio,
-                                    batch_size, False)
+                                    batch_size, False, scale_x_y, eps)
        gt = self._bbox_transform(tx, ty, tw, th, anchors, downsample_ratio,
-                                  batch_size, True)
+                                  batch_size, True, scale_x_y, eps)
        iouk = self._iou(pred, gt, ioup, eps)
        if self.loss_square:
            loss_iou = 1. - iouk * iouk
@@ -145,7 +146,7 @@ class IouLoss(object):
        return diou_term + ciou_term

    def _bbox_transform(self, dcx, dcy, dw, dh, anchors, downsample_ratio,
-                        batch_size, is_gt):
+                        batch_size, is_gt, scale_x_y, eps):
        grid_x = int(self._MAX_WI / downsample_ratio)
        grid_y = int(self._MAX_HI / downsample_ratio)
        an_num = len(anchors) // 2
@@ -179,8 +180,11 @@ class IouLoss(object):
            cy.gradient = True
        else:
            dcx_sig = fluid.layers.sigmoid(dcx)
-            cx = fluid.layers.elementwise_add(dcx_sig, gi) / grid_x_act
            dcy_sig = fluid.layers.sigmoid(dcy)
+            if (abs(scale_x_y - 1.0) > eps):
+                dcx_sig = scale_x_y * dcx_sig  - 0.5 * (scale_x_y - 1)
+                dcy_sig = scale_x_y * dcy_sig  - 0.5 * (scale_x_y - 1)
+            cx = fluid.layers.elementwise_add(dcx_sig, gi) / grid_x_act
            cy = fluid.layers.elementwise_add(dcy_sig, gj) / grid_y_act

        anchor_w_ = [anchors[i] for i in range(0, len(anchors)) if i % 2 == 0]
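
The `scale_x_y` branch added to `_bbox_transform` above stretches the sigmoid output symmetrically around 0.5, so a predicted centre can slightly overshoot its grid cell. The following is a minimal pure-Python sketch of that arithmetic only, not the Paddle implementation: the names `sigmoid` and `decode_center_offset` are hypothetical, and plain floats stand in for `fluid` tensors.

```python
import math


def sigmoid(v):
    """Plain logistic function; stands in for fluid.layers.sigmoid."""
    return 1.0 / (1.0 + math.exp(-v))


def decode_center_offset(logit, scale_x_y=1.0, eps=1e-10):
    """Offset of a box centre inside its grid cell (hypothetical helper).

    Mirrors the branch in the diff: when scale_x_y differs from 1 by more
    than eps, the sigmoid output is rescaled around 0.5.
    """
    offset = sigmoid(logit)
    if abs(scale_x_y - 1.0) > eps:
        offset = scale_x_y * offset - 0.5 * (scale_x_y - 1.0)
    return offset


# With scale_x_y = 1.05 the reachable range widens from (0, 1)
# to roughly (-0.025, 1.025), while a logit of 0 still maps to 0.5.
low = decode_center_offset(-50.0, scale_x_y=1.05)
mid = decode_center_offset(0.0, scale_x_y=1.05)
high = decode_center_offset(50.0, scale_x_y=1.05)
```

The same rescaling appears in `yolo_loss.py`, where the x/y loss switches from sigmoid cross-entropy to an L1 loss on the rescaled offsets whenever `scale_x_y` is not 1.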

--- a/ppdet/modeling/losses/yolo_loss.py
+++ b/ppdet/modeling/losses/yolo_loss.py
@@ -92,7 +92,7 @@ class YOLOv3Loss(object):
            return {'loss': sum(losses)}

    def _get_fine_grained_loss(self, outputs, targets, gt_box, batch_size,
-                               num_classes, mask_anchors, ignore_thresh):
+                               num_classes, mask_anchors, ignore_thresh, eps=1.e-10):
        """
        Calculate fine grained YOLOv3 loss

@@ -136,12 +136,25 @@ class YOLOv3Loss(object):
            tx, ty, tw, th, tscale, tobj, tcls = self._split_target(target)

            tscale_tobj = tscale * tobj
-            loss_x = fluid.layers.sigmoid_cross_entropy_with_logits(
-                x, tx) * tscale_tobj
-            loss_x = fluid.layers.reduce_sum(loss_x, dim=[1, 2, 3])
-            loss_y = fluid.layers.sigmoid_cross_entropy_with_logits(
-                y, ty) * tscale_tobj
-            loss_y = fluid.layers.reduce_sum(loss_y, dim=[1, 2, 3])
+
+            scale_x_y = self.scale_x_y if not isinstance(
+                self.scale_x_y, Sequence) else self.scale_x_y[i]
+
+            if (abs(scale_x_y - 1.0) < eps):
+                loss_x = fluid.layers.sigmoid_cross_entropy_with_logits(
+                    x, tx) * tscale_tobj
+                loss_x = fluid.layers.reduce_sum(loss_x, dim=[1, 2, 3])
+                loss_y = fluid.layers.sigmoid_cross_entropy_with_logits(
+                    y, ty) * tscale_tobj
+                loss_y = fluid.layers.reduce_sum(loss_y, dim=[1, 2, 3])
+            else:
+                dx = scale_x_y * fluid.layers.sigmoid(x)  - 0.5 * (scale_x_y - 1.0)
+                dy = scale_x_y * fluid.layers.sigmoid(y)  - 0.5 * (scale_x_y - 1.0)
+                loss_x = fluid.layers.abs(dx - tx) * tscale_tobj
+                loss_x = fluid.layers.reduce_sum(loss_x, dim=[1, 2, 3])
+                loss_y = fluid.layers.abs(dy - ty) * tscale_tobj
+                loss_y = fluid.layers.reduce_sum(loss_y, dim=[1, 2, 3])
+
            # NOTE: we refined loss function of (w, h) as L1Loss
            loss_w = fluid.layers.abs(w - tw) * tscale_tobj
            loss_w = fluid.layers.reduce_sum(loss_w, dim=[1, 2, 3])
@@ -149,7 +162,7 @@ class YOLOv3Loss(object):
            loss_h = fluid.layers.reduce_sum(loss_h, dim=[1, 2, 3])
            if self._iou_loss is not None:
                loss_iou = self._iou_loss(x, y, w, h, tx, ty, tw, th, anchors,
-                                          downsample, self._batch_size)
+                                          downsample, self._batch_size, scale_x_y)
                loss_iou = loss_iou * tscale_tobj
                loss_iou = fluid.layers.reduce_sum(loss_iou, dim=[1, 2, 3])
                loss_ious.append(fluid.layers.reduce_mean(loss_iou))
@@ -157,14 +170,12 @@ class YOLOv3Loss(object):
            if self._iou_aware_loss is not None:
                loss_iou_aware = self._iou_aware_loss(
                    ioup, x, y, w, h, tx, ty, tw, th, anchors, downsample,
-                    self._batch_size)
+                    self._batch_size, scale_x_y)
                loss_iou_aware = loss_iou_aware * tobj
                loss_iou_aware = fluid.layers.reduce_sum(
                    loss_iou_aware, dim=[1, 2, 3])
                loss_iou_awares.append(fluid.layers.reduce_mean(loss_iou_aware))

-            scale_x_y = self.scale_x_y if not isinstance(
-                self.scale_x_y, Sequence) else self.scale_x_y[i]
            loss_obj_pos, loss_obj_neg = self._calc_obj_loss(
                output, obj, tobj, gt_box, self._batch_size, anchors,
                num_classes, downsample, self._ignore_thresh, scale_x_y)
@@ -293,7 +304,7 @@ class YOLOv3Loss(object):
            downsample_ratio=downsample,
            clip_bbox=False,
            scale_x_y=scale_x_y)
-
+   
        # 2. split pred bbox and gt bbox by sample, calculate IoU between pred bbox
        #    and gt bbox in each sample
        if batch_size > 1:
@@ -322,17 +333,17 @@ class YOLOv3Loss(object):
            pred = fluid.layers.squeeze(pred, axes=[0])
            gt = box_xywh2xyxy(fluid.layers.squeeze(gt, axes=[0]))
            ious.append(fluid.layers.iou_similarity(pred, gt))
-
+      
        iou = fluid.layers.stack(ious, axis=0)
        # 3. Get iou_mask by IoU between gt bbox and prediction bbox,
        #    Get obj_mask by tobj(holds gt_score), calculate objectness loss
-
+        
        max_iou = fluid.layers.reduce_max(iou, dim=-1)
        iou_mask = fluid.layers.cast(max_iou <= ignore_thresh, dtype="float32")
        if self.match_score:
            max_prob = fluid.layers.reduce_max(prob, dim=-1)
            iou_mask = iou_mask * fluid.layers.cast(
-                max_prob <= 0.25, dtype="float32")
+                max_prob <= 0.25, dtype="float32")        
        output_shape = fluid.layers.shape(output)
        an_num = len(anchors) // 2
        iou_mask = fluid.layers.reshape(iou_mask, (-1, an_num, output_shape[2],

--- a/ppdet/modeling/ops.py
+++ b/ppdet/modeling/ops.py
@@ -526,7 +526,7 @@ class MatrixNMS(object):
                 gaussian_sigma=2.,
                 normalized=False,
                 background_label=0):
-        super(MultiClassNMS, self).__init__()
+        super(MatrixNMS, self).__init__()
        self.score_threshold = score_threshold
        self.post_threshold = post_threshold
        self.nms_top_k = nms_top_k
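
The one-line `super()` fix above matters because the two-argument form of `super()` requires `self` to be an instance of the class named in the first argument. A minimal sketch of the failure mode, using stand-in classes (`BrokenMatrixNMS` and `FixedMatrixNMS` are hypothetical names, and the real classes take more constructor arguments):

```python
class MultiClassNMS(object):
    def __init__(self):
        super(MultiClassNMS, self).__init__()


class BrokenMatrixNMS(object):
    def __init__(self):
        # Copy-paste bug: self is not a MultiClassNMS instance,
        # so this super() call raises TypeError.
        super(MultiClassNMS, self).__init__()


class FixedMatrixNMS(object):
    def __init__(self):
        # Correct: names the class that is actually being defined.
        super(FixedMatrixNMS, self).__init__()


def constructs_ok(cls):
    """Return True if cls() can be instantiated without a TypeError."""
    try:
        cls()
        return True
    except TypeError:
        return False
```

Under these assumptions, `constructs_ok(BrokenMatrixNMS)` is false and `constructs_ok(FixedMatrixNMS)` is true.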

--- a/tools/export_model.py
+++ b/tools/export_model.py
@@ -196,7 +196,8 @@ def main():
            inputs_def = cfg['TestReader']['inputs_def']
            inputs_def['use_dataloader'] = False
            feed_vars, _ = model.build_inputs(**inputs_def)
-            test_fetches = model.test(feed_vars)
+            # with exclude_nms set, postprocessing (NMS) is left out of the exported model
+            test_fetches = model.test(feed_vars, exclude_nms=FLAGS.exclude_nms)
    infer_prog = infer_prog.clone(True)
    check_py_func(infer_prog)

@@ -214,6 +215,11 @@ if __name__ == '__main__':
        type=str,
        default="output",
        help="Directory for storing the output model files.")
+    parser.add_argument(
+        "--exclude_nms",
+        action='store_true',
+        default=False,
+        help="Whether to prune NMS for benchmarking")

    FLAGS = parser.parse_args()
    main()
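
The new `--exclude_nms` option is a boolean switch: it is `False` unless the flag is present on the command line. A minimal standalone sketch of that behaviour follows; `build_parser` is a hypothetical helper, and the `--output_dir` option name is an assumption, since only its `default` and `help` are visible in the hunk above.

```python
import argparse


def build_parser():
    """Hypothetical helper reproducing only the options seen in the diff."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--output_dir",  # assumed name; not shown in the hunk
        type=str,
        default="output",
        help="Directory for storing the output model files.")
    parser.add_argument(
        "--exclude_nms",
        action="store_true",
        default=False,
        help="Whether to prune NMS for benchmarking")
    return parser


# Flag absent: NMS stays in the exported model.
default_flags = build_parser().parse_args([])
# Flag present: model.test(...) would receive exclude_nms=True.
benchmark_flags = build_parser().parse_args(["--exclude_nms"])
```

Because every architecture other than `YOLOv3` asserts `not exclude_nms`, passing the flag for those models would stop the export with an assertion error.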